Information Retrieval Project

PART C

Project Description

This is Part C of the project for CSE 494/598. Due date for Part C: 12/4/2008.

If you have not finished Parts A and B, complete them first.

TASK 5: Given a query, obtain the top-N documents (N > 50 preferred) using the vector space model (or the vector space with page-rank model). Cluster the results using the simple K-means algorithm, which "randomly" picks the initial k centroids.

You can experiment with different k values between 3 and 10. For each cluster, show only the top-3 representative documents of the cluster. Use the *vector similarity metric* that you have already implemented (Jaccard is not necessary) to measure the similarity between the documents and the centroids.
Provide short summaries of the clusters (using keywords that most distinguish those clusters from other clusters).
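
As a rough illustration of the K-means step in Task 5, here is a minimal sketch. It assumes each of the top-N documents is already available as a dense term-weight vector (double[]) and that cosine similarity is the vector similarity metric; the class name KMeansSketch, the method names, and the seed and maxIter parameters are hypothetical and not part of the project code.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal K-means sketch over dense term-weight vectors (assumed input format). */
final class KMeansSketch {

    /** Cosine similarity; returns 0 when either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Returns assign[i] = index of the cluster that docs[i] ends up in. Requires k <= docs.length. */
    static int[] cluster(double[][] docs, int k, long seed, int maxIter) {
        int n = docs.length, dim = docs[0].length;
        Random rnd = new Random(seed);

        // Pick k distinct documents at random as initial centroids (partial shuffle).
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rnd.nextInt(n - c);
            int t = idx[c]; idx[c] = idx[j]; idx[j] = t;
            centroids[c] = docs[idx[c]].clone();
        }

        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: move each document to its most similar centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestSim = cosine(docs[i], centroids[0]);
                for (int c = 1; c < k; c++) {
                    double s = cosine(docs[i], centroids[c]);
                    if (s > bestSim) { bestSim = s; best = c; }
                }
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no document switched clusters

            // Update step: each centroid becomes the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) sums[assign[i]][d] += docs[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster: keep the old centroid
                for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
                centroids[c] = sums[c];
            }
        }
        return assign;
    }
}
```

A call such as KMeansSketch.cluster(docs, 5, 42L, 100) would return one cluster index per document. Because the seeds are random, different seed values can give different clusters, which is worth keeping in mind when analyzing the effect of k.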

SUBMIT:

Hardcopy showing the clusters for the sample queries below.

An analysis of the effect of the number of clusters, together with your own evaluation of whether the clusters seem to correspond to any natural categories.

Hardcopy of your code with comments.

TASK 6: Cluster the answers using the Buckshot algorithm. Use the top-N documents (N > 50 preferred) returned using the vector space model (or the vector space with page-rank model). You can experiment with different k values between 3 and 10. For each cluster, show only the top-3 representative documents of the cluster. Use the *vector similarity metric* to measure the similarity between the documents and the centroids.
SUBMIT:
- Hardcopy showing the clusters for the sample queries. Show both the k-seed clusters derived after HAC and the final clusters after doing K-means.
- Analyze whether the clusters seem to correspond to any natural categories.
- Compare the clusters with those given by the algorithms used in Task 5. Which is better? Why?
- Hardcopy of your code with comments.
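
The Buckshot procedure in Task 6 can be sketched roughly as follows: draw a random sample of about sqrt(k*n) documents, run HAC on the sample until k clusters remain, and use the centroids of those clusters as seeds for K-means over all n documents. This is only a sketch under assumptions: documents are dense double[] term-weight vectors, similarity is cosine, and HAC merges the pair of clusters whose centroids are most similar. The class name BuckshotSketch and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Buckshot sketch: HAC on a small random sample supplies the seeds for K-means. */
final class BuckshotSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Mean vector of a non-empty list of documents. */
    static double[] centroid(List<double[]> members) {
        double[] c = new double[members.get(0).length];
        for (double[] m : members)
            for (int d = 0; d < c.length; d++) c[d] += m[d] / members.size();
        return c;
    }

    /** HAC: repeatedly merge the two clusters with the most similar centroids until k remain. */
    static List<List<double[]>> hac(List<double[]> sample, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] doc : sample) {
            List<double[]> single = new ArrayList<>();
            single.add(doc);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = cosine(centroid(clusters.get(i)), centroid(clusters.get(j)));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            // bj > bi, so removing cluster bj does not shift cluster bi.
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    /** Returns assign[i] = final cluster index of docs[i]. Requires k <= docs.length. */
    static int[] cluster(double[][] docs, int k, long seed, int maxIter) {
        int n = docs.length;
        List<double[]> shuffled = new ArrayList<>(Arrays.asList(docs));
        Collections.shuffle(shuffled, new Random(seed));
        int sampleSize = Math.min(n, Math.max(k, (int) Math.ceil(Math.sqrt((double) k * n))));

        // Phase 1: the k seed clusters come from HAC over the random sample.
        List<List<double[]>> seedClusters = hac(shuffled.subList(0, sampleSize), k);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = centroid(seedClusters.get(c));

        // Phase 2: ordinary K-means over all documents, starting from those seeds.
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestSim = cosine(docs[i], centroids[0]);
                for (int c = 1; c < k; c++) {
                    double s = cosine(docs[i], centroids[c]);
                    if (s > bestSim) { bestSim = s; best = c; }
                }
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            if (!changed) break;
            for (int c = 0; c < k; c++) {
                List<double[]> members = new ArrayList<>();
                for (int i = 0; i < n; i++) if (assign[i] == c) members.add(docs[i]);
                if (!members.isEmpty()) centroids[c] = centroid(members);
            }
        }
        return assign;
    }
}
```

Since the hardcopy must show both the k seed clusters and the final clusters, the list returned by hac() could be printed before the K-means phase starts. The naive HAC loop recomputes centroids for every pair, which is acceptable for a small sample but would be slow on the full result set.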

Note that for both tasks you are expected to provide a thorough analysis of your results (i.e., simply showing the results obtained is not enough). You should also include a comparative analysis of the clustering algorithms you have implemented in the two tasks (as well as any you implemented for extra credit). Your analysis should explain whether and why you believe the clusters are correct. Analysis will be an important part of the project grade.

TASK 7: Demonstration of Tasks 1 through 6. The demo will last around 10-15 minutes. You will be asked to run the code on some sample queries (not necessarily those listed below, so make sure you do not hard-code the queries in your source). (The demo is expected to take place between 12/4 and 12/9.)

Extra Credit:

*Propose* any other extensions you want to make, and implement them.
Implement a "similar pages" feature.
Implement scalar cluster analysis to suggest ways of elaborating the query terms.
Run the queries on your own search engine, Google.com (using the special operator "site:asu.edu" after your query term), and ASU's Google Search Appliance. Note any differences in the rankings (obviously there will be some because the project crawl is not complete). If possible, try to speculate as to what may be causing the differences. Moreover, why do you think the results from Google.com and the Google Search Appliance are different? Which do you find to be better in terms of relevance to the query terms?
Implement a feature which adds relevant terms to the keyword query via some sort of user relevance feedback.
Extend the K-means algorithm implemented in Task 5:

Recomputing the centroid after every few changes

Using a heuristic to pick the centroids, e.g., run K-means multiple times and pick the centroids from the best run

Analyze the results obtained. Did these changes improve the quality of the clusters?

Cluster the results using the Bisecting K-means algorithm. Compare and contrast the resulting clusters with those given by K-means and Buckshot. Specifically, analyze which algorithm provides "better" clusters, K-means or Bisecting K-means. (Hint: Use the cluster quality measures defined in the class slides.)

Implement an alternate merging function for the HAC algorithm used in Task 6. Analyze the effect of this new merging method on the clusters.

Implement a GUI for your search engine. The GUI should accept a query and be able to generate answers for any of the tasks you implemented.

The GUI should preferably be a standalone application or applet. The GUI can be servlet-based only if you have access to a personal Web server that can be accessed by the TA.

You are encouraged to add features such as AJAX-based query completion, a simple interface, links that open the source documents, or anything else you fancy.
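
For the extra-credit item on an alternate merging function for HAC, the sketch below shows three common ways of scoring a candidate merge between two clusters: single link, complete link, and average link. It is only an illustration under assumptions: clusters are lists of dense double[] term-weight vectors, similarity is cosine, and the class name LinkageSketch and its method names are hypothetical. Which merging function your Task 6 code uses by default depends on your own implementation.

```java
import java.util.List;

/** Three alternative cluster-to-cluster similarity functions for HAC merging. */
final class LinkageSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Single link: similarity of the most similar cross-cluster pair. */
    static double singleLink(List<double[]> a, List<double[]> b) {
        double best = Double.NEGATIVE_INFINITY;
        for (double[] x : a)
            for (double[] y : b) best = Math.max(best, cosine(x, y));
        return best;
    }

    /** Complete link: similarity of the least similar cross-cluster pair. */
    static double completeLink(List<double[]> a, List<double[]> b) {
        double worst = Double.POSITIVE_INFINITY;
        for (double[] x : a)
            for (double[] y : b) worst = Math.min(worst, cosine(x, y));
        return worst;
    }

    /** Average link: mean similarity over all cross-cluster pairs. */
    static double averageLink(List<double[]> a, List<double[]> b) {
        double sum = 0;
        for (double[] x : a)
            for (double[] y : b) sum += cosine(x, y);
        return sum / (a.size() * b.size());
    }
}
```

At each HAC step, the pair of clusters with the highest score under the chosen function would be merged. Single link tends to produce long, chained clusters, while complete link favors compact ones, which gives a concrete starting point for the requested analysis.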

Example Queries

Medical Care
Employee Benefits
Parking Decal
Admissions
Languages

Resource Description & Documentation

Knowledge of Java is a prerequisite for the course. If you need to brush up on your knowledge, read the Java Tutorial. The Java API documentation, which describes the class structure of the various standard packages, is available here.

Download & Setup

You must have the JDK installed on your system before you can install the downloaded programs. If, after installing the JDK, you get an unknown-command error for java, javac, jar, javadoc, etc., check your path and classpath settings. For Windows-based systems, check your Autoexec.bat (95, 98) or Environment Variable settings (Win NT/2000/XP). For Unix and Linux systems, set path and classpath in your .cshrc.

Due date: 05/01/2007.

Last Modified: 04.06.2007