Vector space model task
The TASK 1 description asks you to use the vector space formula from the
Homework or the one in the textbook (Modern Information Retrieval by Ricardo
Baeza-Yates and Berthier Ribeiro-Neto).
Many students have asked me what the textbook formula looks like and
whether it is easier to implement than the one in the Homework.
I am including the textbook formula below and it is up to you to decide which
one is easier.
The textbook vector space formula goes like this:
t = number of terms in the index
N = number of documents in the collection
tfij = term frequency of term i in document j
ni = number of documents term i occurs in
Weight of term i in document j: wij = tfij * log(N/ni)
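Similarity of document j to a query q: Sim(Dj,q) = (Dj . q) / (|Dj| * |q|)
where Dj . q = w1j*w1q + w2j*w2q + ... + wtj*wtq (wiq is the weight of
term i in the query)
|Dj| = weight of document j = SQRT(w1j^2 + w2j^2 + ... + wtj^2)
In other words, the weight of a document is the square root of the sum of
squares of the weights of all terms contained in the document. |q| is
computed the same way from the query's term weights.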
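
For anyone who wants to check their numbers, here is a minimal sketch of the
textbook formula in Python. It is not tied to the project API: the function
names and the dictionary layout are made up for illustration, and it assumes
the natural log (the base is not fixed above, and any base gives the same
similarity because the constant factor cancels in the division).

```python
import math

def tfidf_weights(term_freqs, doc_freqs, num_docs):
    """Return {term: wij} with wij = tfij * log(N / ni).

    term_freqs: {term: tf} for one document (or the query)
    doc_freqs:  {term: ni}, the number of documents each term occurs in
    num_docs:   N, the number of documents in the collection
    """
    return {term: tf * math.log(num_docs / doc_freqs[term])
            for term, tf in term_freqs.items()}

def vector_length(weights):
    """|D| = SQRT(w1^2 + w2^2 + ... + wt^2)."""
    return math.sqrt(sum(w * w for w in weights.values()))

def similarity(doc_weights, query_weights):
    """Sim(Dj, q) = (Dj . q) / (|Dj| * |q|); 0.0 if either length is zero."""
    dot = sum(w * query_weights.get(term, 0.0)
              for term, w in doc_weights.items())
    denom = vector_length(doc_weights) * vector_length(query_weights)
    return dot / denom if denom else 0.0

# Toy collection (hypothetical numbers): N = 4 documents.
doc_freqs = {"vector": 2, "space": 1, "model": 4}
doc = tfidf_weights({"vector": 3, "space": 1}, doc_freqs, 4)
query = tfidf_weights({"vector": 1}, doc_freqs, 4)
score = similarity(doc, query)
```

Note that a term occurring in every document (like "model" here) gets
weight tf * log(4/4) = 0, so it contributes nothing to the ranking.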
In the API (see doc/api/index-all.html), the following classes
will be useful for implementing vector space ranking:
Termenum is the class representing all the terms in the index (an
enumeration of terms)
Termval is the class that holds a term or keyword
Termdocs is a representation for documents
Sree