PART C


Project Description

You can experiment with different values of k between 3 and 10. For each cluster, show only the top 3 representative documents of the cluster. Use the *vector similarity metric* that you have already implemented (Jaccard is not necessary) to measure the similarity between the documents and the centroids.
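As a rough illustration of the clustering step described above, the following is a minimal sketch of K-means over sparse document vectors, using cosine similarity as the vector similarity metric. It is not the required implementation: the function names (`cosine`, `centroid`, `kmeans`, `top_representatives`), the `{term: weight}` dictionary representation, the random choice of initial centroids, and the stopping rule are all assumptions made for the example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, k, max_iters=20, seed=0):
    """docs: {doc_id: sparse vector}. Returns (assignment, centroids)."""
    rng = random.Random(seed)
    ids = sorted(docs)
    # Assumption: initial centroids are k randomly chosen documents.
    centroids = [dict(docs[d]) for d in rng.sample(ids, k)]
    assignment = {}
    for _ in range(max_iters):
        # Assign every document to its most similar centroid.
        new_assignment = {
            d: max(range(k), key=lambda c: cosine(docs[d], centroids[c]))
            for d in ids
        }
        if new_assignment == assignment:
            break  # no document changed cluster
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [docs[d] for d in ids if assignment[d] == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return assignment, centroids


def top_representatives(docs, assignment, centroids, n=3):
    """For each cluster, the n member documents most similar to its centroid."""
    result = {}
    for c, cen in enumerate(centroids):
        members = [d for d, a in assignment.items() if a == c]
        members.sort(key=lambda d: cosine(docs[d], cen), reverse=True)
        result[c] = members[:n]
    return result
```

Calling `kmeans(docs, k)` for each k from 3 to 10 and then `top_representatives(docs, assignment, centroids)` yields, per cluster, the three documents closest to the centroid. In the project, `docs` would presumably hold the vectors of the result documents being clustered.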

Provide short summaries of the clusters (using keywords that most distinguish those clusters from other clusters).
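One possible way to pick such keywords, shown here only as a sketch, is to rank the terms of each centroid by how much their weight exceeds the average weight of the same term in the other centroids. The function name `distinguishing_keywords`, the `{term: weight}` centroid representation, and the scoring rule itself are assumptions for illustration; other measures of distinctiveness are equally valid.

```python
def distinguishing_keywords(centroids, n=5):
    """For each centroid ({term: weight}), return the n terms whose weight
    most exceeds the average weight of that term in the other centroids."""
    summaries = []
    for i, cen in enumerate(centroids):
        others = [c for j, c in enumerate(centroids) if j != i]

        def score(term):
            if not others:
                return cen[term]
            rest = sum(o.get(term, 0.0) for o in others) / len(others)
            return cen[term] - rest

        # Highest score first; ties are broken alphabetically.
        ranked = sorted(cen, key=lambda t: (-score(t), t))
        summaries.append(ranked[:n])
    return summaries
```

A term that is heavy in every cluster (for example, a word that appears on almost all crawled pages) scores near zero under this rule, so it does not show up in any summary.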

SUBMIT:

 

Note that for both tasks you are expected to provide a thorough analysis of your results (i.e. simply showing the results obtained is not enough). You should also include a comparative analysis of the clustering algorithms you have implemented in the 2 tasks (as well as those that you would have implemented as extra credit). Your analysis should explain whether and why you believe the clusters are correct. Analysis will be an important part of the project grade.

 

Extra Credit: 

  1. *Propose* any other extensions you want to make, and implement them.
  2. Implement a "similar pages" feature.
  3. Implement scalar cluster analysis to suggest ways of elaborating the query terms. 
  4. Run the queries on your own search engine, Google.com (using the special operator "site:asu.edu" after your query term), and ASU's Google Search Appliance. Note any differences in the rankings (obviously there will be some because the project crawl is not complete). If possible, try to speculate as to what may be causing the differences. Moreover, why do you think the results from Google.com and the Google Search Appliance are different? Which do you find to be better in terms of relevance to the query terms?
  5. Implement a feature which adds relevant terms to the keyword query via some form of user relevance feedback.
  6. Extend the K-means algorithm implemented in Task 5 by:
    • Recomputing the centroid after every few changes
    • Using a heuristic to pick the centroids, e.g., run K-means multiple times and pick the centroids from the best run

    Analyze the results obtained. Did these changes improve the quality of the clusters?

  7. Cluster the results using the Bisecting K-means algorithm. Compare and contrast the resulting clusters with those given by K-means and Buckshot. Specifically, analyze which algorithm provides "better" clusters, K-means or Bisecting K-means. (Hint: Use the cluster quality measures defined in the class slides.)
  8. Implement an alternate merging function for the HAC algorithm used in Task 6. Analyze the effect of this new merging method on the clusters.
  9. Implement a GUI for your search engine.  The GUI should accept a query and be able to generate answers for any of the tasks you implemented.
    •  The GUI should preferably be a standalone application or applet.  The GUI can be servlet-based only if you have access to a personal Web server that can be accessed by the TA.
    •  You are encouraged to add whatever features you can, e.g., AJAX-based query completion, a simple interface, links that open the source documents, or anything else you fancy.
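
For extra-credit item 3 above, the following is a minimal sketch of scalar cluster analysis under some assumptions. It first builds a term-term association matrix from raw term frequencies, then compares two terms by the cosine of their rows in that matrix, so that terms with similar co-occurrence neighborhoods are suggested together. The function names (`association_matrix`, `scalar_neighbors`), the use of unnormalized frequencies, and the choice of document set (for example, the top-ranked results for the query) are illustrative choices, not requirements.

```python
import math


def association_matrix(docs):
    """docs: list of {term: frequency}. Returns (terms, C) where
    C[u][v] is the sum over documents of f(u, d) * f(v, d)."""
    terms = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(terms)}
    n = len(terms)
    C = [[0.0] * n for _ in range(n)]
    for d in docs:
        items = list(d.items())
        for tu, fu in items:
            for tv, fv in items:
                C[index[tu]][index[tv]] += fu * fv
    return terms, C


def _cosine(x, y):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def scalar_neighbors(docs, query_term, n=3):
    """Terms whose association-matrix rows are most similar to query_term's row."""
    terms, C = association_matrix(docs)
    if query_term not in terms:
        return []
    u = terms.index(query_term)
    scored = [(_cosine(C[u], C[v]), terms[v])
              for v in range(len(terms)) if v != u]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [t for s, t in scored if s > 0][:n]
```

The terms returned by `scalar_neighbors` for each query term are candidates to offer the user as query elaborations. The association matrix here is quadratic in the vocabulary size, so this sketch is only practical on a small document set.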

Example Queries

Resource Description & Documentation

Download & Setup

 

Due date: 05/01/2007.


Last Modified: 04.06.2007