
Vector space model task



The TASK 1 description asks you to use the vector space formula from the Homework or from the textbook (Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto).
Many students have asked me what the textbook formula looks like and whether it is easier to implement than the one in the Homework.
I am including the textbook formula below; it is up to you to decide which one is easier.

The textbook vector space formula goes like this:

t = number of terms in the index
N = number of documents in the collection
tfij = term frequency of term i in document j
ni = number of documents term i occurs in

Weight of term i in document j (wij) = tfij*log(N/ni)
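
As a rough illustration of the weight formula, here is a minimal Python sketch. It is not tied to the course API: the function name tfidf_weights and the toy documents are made up for the example, and it assumes the natural logarithm, since the formula above does not state a base.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with wij = tfij * log(N / ni)."""
    N = len(docs)  # N: number of documents in the collection
    # ni: number of documents in which term i occurs (each document counted once)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tfij: raw count of term i in document j
        # Assumption: natural log; another base only rescales every weight.
        weights.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weights

# Hypothetical toy collection: each document is a list of already-tokenized terms.
docs = [
    ["vector", "space", "model"],
    ["vector", "vector", "ranking"],
    ["space", "ranking", "model", "model"],
]
weights = tfidf_weights(docs)
```

Note that a term occurring in every document gets log(N/N) = 0, so it contributes nothing to the ranking.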

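Similarity of document j to a query q: Sim(Dj,q) = (Dj . q) / (|Dj| * |q|)
where Dj . q = w1j*w1q + w2j*w2q + ........+ wtj*wtq is the dot product of the document and query weight vectors (wiq = weight of term i in the query).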

|Dj| = weight of document j = SQRT(w1j^2 + w2j^2 + ........+ wtj^2)
In other words, the weight of a document is the square root of the sum of the squares of the weights of all terms contained in the document. |q| is computed the same way from the query's term weights.
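
The similarity computation can be sketched in Python as follows. This is only an illustration under stated assumptions: vectors are stored as {term: weight} dictionaries, the names norm and cosine_similarity are hypothetical, the example weights are made up, and returning 0.0 for an empty vector is a choice of this sketch rather than part of the textbook formula.

```python
import math

def norm(vec):
    """|D| = square root of the sum of squared term weights."""
    return math.sqrt(sum(w * w for w in vec.values()))

def cosine_similarity(doc_vec, query_vec):
    """Sim(Dj, q) = (Dj . q) / (|Dj| * |q|) over sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
    denom = norm(doc_vec) * norm(query_vec)
    # Assumption: treat an all-zero vector as having similarity 0.
    return dot / denom if denom else 0.0

# Made-up weights for illustration: |doc| = SQRT(3^2 + 4^2) = 5, |query| = 1
doc = {"vector": 3.0, "space": 4.0}
query = {"vector": 1.0}
score = cosine_similarity(doc, query)  # dot = 3, so score = 3 / (5 * 1) = 0.6
```

Since |q| is the same for every document, it does not change the order of the ranking for a given query; it only scales the scores.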

In the API documentation (see doc/api/index-all.html), the following classes will be useful in implementing Vector Space ranking:

Termenum is the class representing all the terms in the index (an enumeration of terms)
Termval is the class that holds a term or keyword
Termdocs is a representation of documents

Sree