Project1 - Speeding up Vector Space Calculations
Hi,
Some of you pointed out that the lack of a method to directly retrieve all the
terms and their frequencies in a document makes the weight calculation with
normalization a time-consuming task.
The way around this is to calculate the normalization weight for each
document once and store it.
Here is the pseudo-code for that:
-----------------------------------------------------
IndexReader reader = IndexReader.open("index");
TermEnum termenum = reader.terms();
while (termenum.next())
{
    Term termval = termenum.term();
    // idf is constant for a given term, so compute it once per term;
    // cast before dividing so integer division does not truncate the ratio
    float idf = (float) Math.log((float) reader.numDocs() / termenum.docFreq());
    TermDocs termdocs = reader.termDocs(termval);
    while (termdocs.next())
    {
        float wt1 = termdocs.freq() * idf;
        docname[termdocs.doc()] = reader.document(termdocs.doc()).get("url");
        // docwt[i] accumulates the squared weights of the ith doc;
        // take Math.sqrt(docwt[i]) at the end to get |doc[i]|
        docwt[termdocs.doc()] += wt1 * wt1;
    }
    termdocs.close();
}
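The precomputation loop above can be sketched without Lucene as plain Java,
to show just the arithmetic: the postings map and document names below are
made-up example data, not anything produced by the index.

```java
import java.util.HashMap;
import java.util.Map;

public class NormFactors {
    // postings maps term -> (doc URL -> term frequency); numDocs is N.
    // Returns each document's normalization factor |doc| =
    // sqrt(sum over terms of (tf * idf)^2), with idf = log(N / docFreq).
    public static Map<String, Double> compute(Map<String, Map<String, Integer>> postings,
                                              int numDocs) {
        Map<String, Double> sumSq = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            Map<String, Integer> docs = e.getValue();
            // idf is the same for every posting of this term
            double idf = Math.log((double) numDocs / docs.size());
            for (Map.Entry<String, Integer> d : docs.entrySet()) {
                double wt = d.getValue() * idf;          // tf * idf
                sumSq.merge(d.getKey(), wt * wt, Double::sum);
            }
        }
        // |doc| is the square root of the accumulated sum of squares
        Map<String, Double> norm = new HashMap<>();
        sumSq.forEach((url, s) -> norm.put(url, Math.sqrt(s)));
        return norm;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> postings = new HashMap<>();
        postings.put("apple", Map.of("doc1", 2, "doc2", 1));
        postings.put("pear",  Map.of("doc1", 1, "doc3", 2));
        System.out.println(compute(postings, 3));
    }
}
```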
Take the square root of each docwt[i] to obtain |doc[i]|, then store the
values as <Document.URL, Value> pairs and retrieve them for the similarity
calculations.
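One simple way to persist the <Document.URL, Value> pairs is a
java.util.Properties file; this is only a sketch of the idea, and the file
name passed in is whatever you choose, nothing prescribed by the index.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class NormStore {
    // Writes each non-null URL and its normalization factor as a key/value pair.
    public static void save(String[] urls, float[] norms, String file) throws IOException {
        Properties p = new Properties();
        for (int i = 0; i < urls.length; i++) {
            if (urls[i] != null) {
                p.setProperty(urls[i], Float.toString(norms[i]));
            }
        }
        try (FileOutputStream out = new FileOutputStream(file)) {
            p.store(out, "document normalization factors");
        }
    }

    // Reads the stored factor back for one document URL.
    public static float load(String url, String file) throws IOException {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream(file)) {
            p.load(in);
        }
        return Float.parseFloat(p.getProperty(url));
    }
}
```

Properties escapes special characters in keys on store, so URLs containing
colons or spaces round-trip correctly.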
_________________________________________________________________________________
Once the values are stored, you can read them back while calculating the
weights, matching each document with its normalization value as follows:
for a_document in set_of_docs_matching_query
    for a_term in set_of_terms
        if a_term == query_term
            call TermDocs(a_term) and determine
                tfik = frequency(a_document, a_term)
            while (not EOF)
                string = readFile
                if (string.indexOf(a_document.getURL) != -1)
                    normalization = extract Value from string
            weight_of_term_in_document =
                tfik * log(N / size(set_of_docs_matching_query)) / normalization
    end for
    Calculate similarity
end for
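The per-document scoring step above can be sketched in plain Java, assuming
the term frequencies, idf values, and the stored normalization factor have
already been loaded into memory; the method and parameter names here are
illustrative, not Lucene API.

```java
import java.util.Map;

public class Similarity {
    // score(d) = (sum over query terms of tf * idf) / |d|,
    // i.e. each term weight is divided by the document's stored
    // normalization factor rather than recomputing |d| per query.
    public static double score(Map<String, Integer> termFreqs,  // tf of each query term in doc d
                               Map<String, Double> idf,         // idf of each query term
                               double normalization) {          // stored |d|
        double s = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            s += e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
        }
        return s / normalization;
    }

    public static void main(String[] args) {
        // hypothetical numbers for a one-term query
        System.out.println(score(Map.of("apple", 2), Map.of("apple", 0.405), 0.906));
    }
}
```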
-----------------------------------------------------------------------------------
This should bring down the time for the vector space model considerably.
Ullas
"Well Begun is Half Done"