CSE 494 - Information Retrieval - Project Phase 3
If you have not yet completed Phases 1 and 2, finish them first.
Attempt Task 1, any one problem from Task 2, and optionally, any one extra credit task.
Task 1: K-means
Given a query, obtain the top-N documents using TF/IDF.
Cluster the results using the simple K-means algorithm which randomly picks
the initial k centroids.
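The K-means loop described above can be sketched as follows. This is a minimal illustration over sparse TF-IDF vectors, assuming each document is represented as a term-to-weight dict; the function names and parameters are our own and not part of the project skeleton.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors (term -> weight dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a group of sparse vectors."""
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}

def kmeans(docs, k, iters=20, seed=None):
    """Simple K-means: random initial centroids, reassign until stable."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(docs, k)]  # random initial centroids
    assignment = None
    for _ in range(iters):
        # assign each document to its most similar centroid
        new_assignment = [max(range(k), key=lambda j: cosine(d, centroids[j]))
                          for d in docs]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # recompute each centroid from its members (skip empty clusters)
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = centroid(members)
    return assignment, centroids
```

In the actual task, `docs` would be the TF-IDF vectors of the top-N retrieved documents, with N = 50 and k = 3.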
- 5 marks: Cluster the results of the queries given below, with
- Number of documents clustered, N = 50
- Number of clusters, k = 3
- Similarity algorithm = Vector similarity (TF-IDF) without PageRank
and submit a printout of the URLs of the top-3 documents in each cluster.
- 3 marks: For each cluster obtained above, produce a short "summary" using the keywords
that most distinguish that cluster from the others. Explain how you obtained these summaries.
- 8 marks: Pick any two queries from the set given below. Vary the value of k between 3 and 10.
What do you observe? Why?
- How does execution time change?
- How does the similarity of the document to the centroid of the cluster change?
- How did the value of k affect the clustering? Justify with a couple of examples.
- Do the clusters seem to roughly correspond to the natural category of the pages? Did
the value of k affect this? Mention any other observations you have.
- 4 marks: Submit your code with comments.
- Extra credit (better centroids): Extend the K-means algorithm implemented in Task 1 by (1) recomputing
the centroid after every few reassignments, and
(2) using a heuristic to pick the initial centroids (e.g., run K-means multiple times and pick the centroids from the best clustering).
Analyze the results obtained. Did these changes improve the quality of the clusters?
- Extra credit (bisecting K-means): Cluster the results using the Bisecting K-means algorithm. Compare and contrast the resulting clusters with those given
by K-means and Buckshot. Specifically, analyze which algorithm provides "better" clusters, K-means or Bisecting K-means.
(Hint: Use cluster quality measures defined in class slides).
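For the cluster-summary subtask, one possible heuristic (our own choice, not prescribed by the assignment) is to rank each centroid's terms by how much their weight exceeds the average weight of the same term in the other centroids, so shared terms score near zero and cluster-specific terms rise to the top. A rough sketch:

```python
def cluster_summaries(centroids, top_m=5):
    """For each centroid (term -> weight dict), pick the terms whose weight
    most exceeds the average weight of the same term in the other centroids.
    This is one heuristic for 'most distinguishing keywords'."""
    summaries = []
    for i, c in enumerate(centroids):
        others = [o for j, o in enumerate(centroids) if j != i]

        def score(term):
            # weight here minus average weight elsewhere
            avg_other = (sum(o.get(term, 0.0) for o in others) / len(others)
                         if others else 0.0)
            return c[term] - avg_other

        summaries.append(sorted(c, key=score, reverse=True)[:top_m])
    return summaries
```

Terms that appear with similar weight in every cluster (e.g., the query terms themselves) get a score near zero and are filtered out naturally.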
Sample queries
- medic care
- employee benefits
- parking decal
- admissions
- languages
Task 2: Choose one problem statement
20 marks: Choose one of the following tasks (note
that the last one is a "catch-all" for any extension you may
want to do that is not covered in our options). Points will be awarded for implementation, code, performance and a thorough analysis.
- Snippet generation: for every result document retrieved by TF/IDF, generate a snippet of text. The snippet must be a block of text from the document that helps the reader understand why the document is relevant to the query. Analyze the time taken to generate the snippet, describe the data structures and algorithm you used and judge the relevance of the snippet.
- Buckshot: Repeat all the steps of the K-Means clustering question, this time, using Buckshot instead
of K-means as the clustering algorithm. You must show the clusters both immediately
after HAC and after the final K-means clustering. Compare the Buckshot clustering algorithm to K-means
clustering. Which is better? Why? Support your answer with a few examples that demonstrate your point.
- (Extra credit) Alternate Buckshot: Implement an alternate merging function for the HAC algorithm used in the Buckshot clustering algorithm. Analyze the effect of this new merging method on the clusters.
- "Similar pages" feature, as seen on Google.
- Implement scalar cluster analysis to suggest ways of elaborating the query terms.
- Implement a feature which adds relevant terms to the keyword query via some form of user relevance feedback.
- Improved GUI: Build a significantly improved GUI. You are encouraged to add features such as AJAX-based query completion, a clean interface, clickable links that open the source documents, or anything else you fancy. You will be evaluated on anything you do over and above the average GUI submitted for the Phase 2 project.
- Propose any interesting extensions you want to do, and implement them.
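For the snippet-generation option, one common baseline is a fixed-size sliding window over the document's tokens, scored by how many query terms it covers. A sketch under that assumption (the window size, normalization, and tie-breaking are illustrative choices, not requirements):

```python
def best_snippet(text, query_terms, window=30):
    """Return the window of `window` consecutive tokens that covers the most
    distinct query terms (ties broken by total matches, then earliest start)."""
    def norm(w):
        # crude normalization: lowercase and strip surrounding punctuation
        return w.lower().strip(".,;:!?\"'()")

    tokens = text.split()
    qset = {t.lower() for t in query_terms}
    best_score, best_start = (-1, -1), 0
    for start in range(max(1, len(tokens) - window + 1)):
        hits = [norm(w) for w in tokens[start:start + window] if norm(w) in qset]
        score = (len(set(hits)), len(hits))  # (distinct terms, total matches)
        if score > best_score:
            best_score, best_start = score, start
    return " ".join(tokens[best_start:best_start + window])
```

This naive scan is O(document length x window size); for the time analysis the assignment asks for, a positional index over the document would let you jump directly to regions containing query terms.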
Extra Credit
10 marks: Choose any one of the items marked as extra credit, or another of the topics from Task 2, and implement it.
Download and setup
You should not need any additional files for this part. Your project as it stands after Phase 2 should be a sufficient base to build upon. A reminder: the indexed HTML files are available from Projectclass.jar. Note that although the filename says "jar", it is actually a zip file which you can extract. The HTML files are in the "result3" folder.
Demo (The demos will be held starting December 1st. You
will be asked to sign up for slots)
Demonstrate to the TA all parts of the Project. You will be awarded points based on the completion of tasks from all
the parts of the project - including Vector Similarity, Authorities/Hubs, PageRank and K-means.
During this 10-15 minute demo you will be expected to run some
queries, not limited to the ones that were mentioned on this or previous projects.