
Re: project1



In Task 1, you have to use:

Wik = [TFik * log(N/nk)] / [sum over k of (TFik)^2 * (log(N/nk))^2]^(1/2)
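For example, with made-up numbers (a sketch only - the tf and idf values below are invented for illustration, not taken from the project data):

public class WeightExample {
    public static void main(String[] args) {
        // Made-up numbers for one document i over four terms k = 0..3:
        // tf[k] = TFik (raw count of term k in doc i), idf[k] = log(N/nk).
        double[] tf  = {3.0, 1.0, 0.0, 2.0};
        double[] idf = {Math.log(100.0), Math.log(5.0),
                        Math.log(20.0),  Math.log(50.0)};

        // Denominator: [sum over k of (TFik)^2 * (log(N/nk))^2]^(1/2)
        double norm = 0.0;
        for (int k = 0; k < tf.length; k++) {
            norm += (tf[k] * tf[k]) * (idf[k] * idf[k]);
        }
        norm = Math.sqrt(norm);

        // Wik = [TFik * log(N/nk)] / norm
        for (int k = 0; k < tf.length; k++) {
            System.out.println("W(i," + k + ") = " + (tf[k] * idf[k] / norm));
        }
    }
}

Note that the denominator is computed once per document and reused for every term in that document.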

You can get the frequency of each term in each document, the number of unique
terms, and so on using the classes provided in the API, such as IndexReader,
TermEnum, and TermDocs. You can get a quick view of how to use them by looking
at the VectorViewer.java code.
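Here is a rough sketch of how those classes fit together, assuming a Lucene index sitting in a directory named "index" (the directory name and the class name TermStats are placeholders of mine - VectorViewer.java shows the actual setup used in the project):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TermStats {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index"); // index directory assumed

        int N = reader.numDocs();                  // total number of docs

        TermEnum terms = reader.terms();           // all unique terms in the index
        while (terms.next()) {
            Term t = terms.term();
            int nk = reader.docFreq(t);            // number of docs containing term k
            double idf = Math.log((double) N / nk);

            TermDocs docs = reader.termDocs(t);    // (doc, freq) postings for t
            while (docs.next()) {
                int i  = docs.doc();               // document number i
                int tf = docs.freq();              // TFik
                System.out.println(t.text() + " doc=" + i
                        + " tf=" + tf + " tf*idf=" + (tf * idf));
            }
            docs.close();
        }
        terms.close();
        reader.close();
    }
}

tf * idf here is the unnormalized weight; divide by the per-document norm from the formula above to get Wik.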

A Hashtable would be the best way to store the weights and related values.
However, you can use any data structure - a simple array, a linked list,
whatever you are comfortable with.
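For instance, one possible layout (an illustration, not a requirement) is a Hashtable mapping each term to an array of per-document weights:

import java.util.Hashtable;

public class WeightTable {
    private final Hashtable<String, double[]> weights =
        new Hashtable<String, double[]>();
    private final int numDocs;

    public WeightTable(int numDocs) {
        this.numDocs = numDocs;
    }

    // Record the weight Wik of term k in document i.
    public void put(String term, int doc, double wik) {
        double[] w = weights.get(term);
        if (w == null) {
            w = new double[numDocs];   // one slot per document
            weights.put(term, w);
        }
        w[doc] = wik;
    }

    // Look up Wik; terms absent from a document have weight 0.
    public double get(String term, int doc) {
        double[] w = weights.get(term);
        return (w == null) ? 0.0 : w[doc];
    }
}

With this layout, computing the similarity later just means walking the query's terms and reading each document's slot.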

Sree
----- Original Message -----
From: <Timothy.Calhoun@asu.edu>
To: "Sree" <slakshmi@asu.edu>
Sent: Tuesday, October 08, 2002 1:05 PM
Subject: project1


> I have a few questions about project1. In task 1 he says you can use the
> weighting method from homework 2.1. Does this mean Wik = TFik * IDFk, or
> the long one that is normalized:
> Wik = [TFik * log(N/nk)] / [sum over k of (TFik)^2 * (log(N/nk))^2]^(1/2)
>
> Also, where do we get the values, such as the frequency of each term in
> each document (TFik), the total number of docs (N), and the number of
> unique terms (M)? Are these values stored in a field in the VectorViewer
> class, and how do we get them out?
>
> Also, once we determine the weights, are we to store them in an array, a
> linked list, etc.? Which is the best way to store them for computing the
> similarity?
>
> Thank you,
> Shane Calhoun