
Project1 - Speeding up Vector Space Calculations



Hi,
 Some of you pointed out that the lack of a method to directly get all terms
and their frequencies in a document makes the weight calculation with
normalization a time-consuming task.
 The way around this is to calculate the normalization weights
for each document once and store them.
Here is the pseudo-code for that:
	
-----------------------------------------------------
	IndexReader reader = IndexReader.open("index");
	TermEnum termenum = reader.terms();
	// docname and docwt are pre-allocated arrays of size reader.maxDoc()

	while (termenum.next())
	{
	    Term termval = termenum.term();
	    TermDocs termdocs = reader.termDocs(termval);
	    // idf is the same for every document containing this term, so
	    // compute it once per term; cast to avoid integer division
	    float idf = (float) Math.log((double) reader.numDocs() / termenum.docFreq());
	    while (termdocs.next())
	    {
	        float wt1 = termdocs.freq() * idf;
	        docname[termdocs.doc()] = reader.document(termdocs.doc()).get("url");
	        // docwt[i] accumulates the squared weights of the ith doc,
	        // i.e. |doc[i]|^2
	        docwt[termdocs.doc()] += wt1 * wt1;
	    }
	    termdocs.close();
	}
	Store the values as <Document.URL, Value> pairs (take the square
root of the accumulated sum to get |doc[i]|) and retrieve them
for the similarity calculations.
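One simple way to persist the <Document.URL, Value> pairs is a java.util.Properties file. This is only a sketch of that store/retrieve step; the class name NormStore, the file name, and the array names are my own assumptions, not part of Lucene:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class NormStore
{
    // Save each document's normalization factor |doc[i]|, keyed by its URL.
    // docwt[i] is assumed to hold the accumulated sum of squared weights,
    // so the square root is taken here before storing.
    public static void save(String[] docname, float[] docwt, String file)
        throws Exception
    {
        Properties props = new Properties();
        for (int i = 0; i < docname.length; i++)
            if (docname[i] != null)
                props.setProperty(docname[i],
                    Float.toString((float) Math.sqrt(docwt[i])));
        FileOutputStream out = new FileOutputStream(file);
        props.store(out, "document normalization factors");
        out.close();
    }

    // Retrieve the stored factor for one URL while scoring.
    public static float load(String url, String file) throws Exception
    {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(file);
        props.load(in);
        in.close();
        return Float.parseFloat(props.getProperty(url));
    }
}
```

A Properties file gives you the <URL, Value> lookup directly instead of scanning the file line by line for each document.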
              		 
_________________________________________________________________________________

Once the values are stored, you can read them back while calculating
weights, matching each document with its normalization value as follows:

	for a_document in set_of_docs_matching_query
	    for a_term in set_of_terms
	        if a_term = query_term
	            call TermDocs(a_term) and determine
	                tfik = frequency(a_document, a_term)

	            // look up the stored normalization value for this document
	            while not EOF
	                string = readFile
	                if (string.indexOf(a_document.getURL) != -1)
	                    normalization = extract Value from string

	            weight_of_term_in_document =
	                tfik * log(N / size(set_of_docs_matching_query)) / normalization
	    end for
	    Calculate similarity
	end for

-----------------------------------------------------------------------------------
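The loop above boils down to a tf*idf weight divided by the stored norm, followed by a cosine similarity between the query and document vectors. Here is a minimal self-contained sketch of those two pieces; the class name Cosine and the map-based vector representation are my own, not part of Lucene:

```java
import java.util.HashMap;
import java.util.Map;

public class Cosine
{
    // weight of a term in a document: tf * idf / |doc|
    public static float weight(int tf, int numDocs, int docFreq, float norm)
    {
        return (float) (tf * Math.log((double) numDocs / docFreq) / norm);
    }

    // cosine similarity between two sparse term-weight vectors
    public static float similarity(Map<String, Float> q, Map<String, Float> d)
    {
        float dot = 0f, qlen = 0f, dlen = 0f;
        for (Map.Entry<String, Float> e : q.entrySet())
        {
            qlen += e.getValue() * e.getValue();
            Float dw = d.get(e.getKey());
            if (dw != null)
                dot += e.getValue() * dw;   // term appears in both vectors
        }
        for (float w : d.values())
            dlen += w * w;
        if (qlen == 0f || dlen == 0f)
            return 0f;
        return (float) (dot / (Math.sqrt(qlen) * Math.sqrt(dlen)));
    }
}
```

Note that if the document weights were already divided by the stored norm, the document side is pre-normalized and you can skip the dlen factor; the full cosine is shown here for generality.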

This should bring down the time for the vector space model considerably.

Ullas

"Well Begun is Half Done"