CSE 494 - Information Retrieval - Project Phase 3

If you have not fully completed Phases 1 and 2, complete them first.

Attempt Task 1, any one problem from Task 2, and optionally, any one extra credit task.

Task 1: K-means

Given a query, obtain the top-N documents using TF/IDF. Cluster the results using the simple K-means algorithm, which picks the initial k centroids at random.
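As a rough illustration of the clustering step only, the sketch below runs simple K-means over documents that are already represented as sparse TF/IDF vectors. It is not the required implementation: the function names, the dict-based vector format, the use of cosine similarity as the closeness measure, and the `max_iters` and `seed` parameters are all assumptions made for the example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, k, max_iters=100, seed=None):
    """Cluster docs (hypothetical format: doc id -> TF/IDF vector) into k groups.

    Initial centroids are k documents chosen at random. Returns a list of
    k lists of doc ids.
    """
    rng = random.Random(seed)
    doc_ids = list(docs)
    centroids = [dict(docs[d]) for d in rng.sample(doc_ids, k)]
    assignment = {}
    for _ in range(max_iters):
        # Assign every document to its most similar centroid.
        new_assignment = {
            d: max(range(k), key=lambda c: cosine(docs[d], centroids[c]))
            for d in doc_ids
        }
        if new_assignment == assignment:
            break  # no document changed cluster: converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [docs[d] for d in doc_ids if assignment[d] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    clusters = [[] for _ in range(k)]
    for d in doc_ids:
        clusters[assignment[d]].append(d)
    return clusters


# Toy input: four made-up documents with made-up weights.
sample_docs = {
    "a1": {"cat": 1.0, "pet": 0.5},
    "a2": {"cat": 0.8, "pet": 0.7},
    "b1": {"stock": 1.0, "market": 0.6},
    "b2": {"stock": 0.7, "market": 0.9},
}
sample_clusters = kmeans(sample_docs, k=2, seed=0)
```

Because the starting centroids are random, different runs can produce different clusterings on real data; passing a fixed `seed` makes a run repeatable, which may help when comparing results in your analysis.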

Sample queries

Task 2: Choose one problem statement

20 marks: Choose one of the following tasks (note that the last one is a "catch-all" for any extension you may want to do that is not covered in our options). Points will be awarded for implementation, code, performance and a thorough analysis.

Extra Credit

10 marks: Choose any one of the items marked as extra credit, or choose another one of the topics from Task 2, and implement it.

Download and setup

You should not need any additional files for this phase. Your project as it stands after Phase 2 should be a sufficient base to build on. A reminder: the indexed HTML files are available in Projectclass.jar. Note that although the filename says "jar", it is actually a zip file which you can extract. The HTML files are in the "result3" folder.
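If you prefer to extract the archive from a script rather than with an unzip tool, a minimal sketch using Python's standard `zipfile` module follows. The function name and the destination directory are assumptions for illustration; only the archive name and the "result3" folder come from the text above.

```python
import zipfile


def extract_archive(archive_path, dest_dir):
    """Extract a zip-format archive (such as a .jar) and return its member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, `extract_archive("Projectclass.jar", "project_files")` would place the HTML files under `project_files/result3`, assuming the archive is in the current directory.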

Demo (The demos will be held starting December 1st. You will be asked to sign up for slots)

Demonstrate all parts of the project to the TA. You will be awarded points based on the completion of tasks from all the phases of the project - including Vector Similarity, Authorities/Hubs, PageRank and K-means. During this 10-15 minute demo you will be expected to run some queries, not limited to the ones mentioned in this or previous phases.