
Project1 - Speeding up Vector Space Calculations

 Some of you pointed out that the lack of a method to get all the terms and
their frequencies in a document directly makes the weight calculation with
normalization a time-consuming task.
 The way around this is to calculate the normalization weight for each
document once and store it.
Here is the pseudo-code for that:
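	IndexReader reader = IndexReader.open("index");
	float[] docwt = new float[reader.maxDoc()];      // sum of squared weights per doc
	String[] docname = new String[reader.maxDoc()];
	TermEnum termenum = reader.terms();
	while (termenum.next()) {                         // every term in the index
	    Term termval = termenum.term();
	    float idf = ...                               // inverse document frequency of termval
	    TermDocs termdocs = reader.termDocs(termval);
	    while (termdocs.next()) {                     // every doc containing termval
	        float wt1 = (float) (termdocs.freq() * idf);
	        docwt[termdocs.doc()] += wt1 * wt1;
	        docname[termdocs.doc()] = reader.document(termdocs.doc()).get("url");
	    }
	}
	for (int i = 0; i < docwt.length; i++)
	    docwt[i] = (float) Math.sqrt(docwt[i]);
	// docwt[i] gives the normalisation factor of the ith doc, i.e. |doc[i]|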
	Store the values as <Document.URL, Value> pairs and retrieve them
for the similarity calculations.
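
To make the precomputation step concrete, here is a minimal sketch in plain Java that does not depend on Lucene. Everything named in it is an assumption for illustration: the class NormTable, the in-memory postings map (standing in for what TermEnum and TermDocs enumerate), the choice idf = log(N / df), and the use of a java.util.Properties file to hold the <URL, Value> pairs. Adapt it to whatever idf and storage format you are using.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class NormTable {
    /**
     * postings: term -> (document URL -> term frequency). This is a
     * hypothetical in-memory stand-in for the index enumeration.
     * numDocs: total number of documents N.
     * Assumes idf = log(N / df), where df is the number of docs containing the term.
     */
    static Map<String, Double> computeNorms(
            Map<String, Map<String, Integer>> postings, int numDocs) {
        Map<String, Double> sumSquares = new LinkedHashMap<>();
        for (Map<String, Integer> docs : postings.values()) {
            double idf = Math.log((double) numDocs / docs.size());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                double w = e.getValue() * idf;          // tf * idf
                sumSquares.merge(e.getKey(), w * w, Double::sum);
            }
        }
        // |doc| is the square root of the accumulated squared weights.
        Map<String, Double> norms = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sumSquares.entrySet()) {
            norms.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return norms;
    }

    /** Writes the <URL, norm> pairs to a properties file. */
    static void save(Map<String, Double> norms, Path file) {
        Properties props = new Properties();
        for (Map.Entry<String, Double> e : norms.entrySet()) {
            props.setProperty(e.getKey(), Double.toString(e.getValue()));
        }
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "document normalization factors");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would call computeNorms once after indexing, then save the result, so the per-query code never has to walk the whole term list again.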

Once the values are stored, you can read them back while calculating
weights, matching each document with its normalization value as follows:

	for a_document in set_of_docs_matching_query
	        for a_term in set_of_terms
	                if a_term == query_term
	                        call TermDocs(a_term) and determine
	                        tfik = frequency(a_document, a_term)
	                        string = readFile
	                        if (string.indexOf(a_document.getURL) != -1)
	                                normalization = extract Value from string
	                        weight_of_term_in_document = tfik * log(N / size(set_of_docs_matching_query)) / normalization
	        end for
	        Calculate similarity
	end for
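
The lookup step above can also be sketched in plain Java. This is only one possible arrangement, under stated assumptions: the class NormLookup and its method names are made up for illustration, the stored pairs are assumed to sit in a java.util.Properties file, and the document frequency df passed to normalizedWeight is assumed to be the number of documents containing the term. Loading the file once into a map avoids re-reading and searching the file for every document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class NormLookup {
    /** Reads the <URL, norm> pairs into a map; intended to be called once per run. */
    static Map<String, Double> load(Path file) {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Double> norms = new HashMap<>();
        for (String url : props.stringPropertyNames()) {
            norms.put(url, Double.parseDouble(props.getProperty(url)));
        }
        return norms;
    }

    /**
     * Normalized weight of a term in a document: tf * log(N / df) / |doc|.
     * Assumes df counts the documents that contain the term.
     */
    static double normalizedWeight(int tf, int numDocs, int docFreq, double norm) {
        return tf * Math.log((double) numDocs / docFreq) / norm;
    }
}
```

Inside the query loop you would then look up norms.get(url) for each matching document and pass it to normalizedWeight together with the term frequency from TermDocs.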


This should bring down the time for the vector space model considerably.


"Well Begun is Half Done"