This is Part C of the project for CSE 494/598. Due date for Part C: 12/4/2008.
If you haven't completed parts A and B, make sure that you complete them first.
You can experiment with different k values between 3 and 10. For each cluster, show only the top-3 representative documents of the cluster. Use the *vector similarity metric* that you have already implemented (Jaccard is not necessary) to measure the similarity between the documents and the centroids.
Provide short summaries of the clusters (using keywords that most distinguish those clusters from other clusters).
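As a starting point, the clustering step described above could be sketched as follows. This is a minimal illustration rather than a required design. It assumes each result document is already a sparse tf-idf vector stored as a dict (term -> weight) and that the vector similarity metric from the earlier parts is cosine similarity. The function names (`kmeans`, `top_documents`, `distinguishing_terms`) and the `init_ids` parameter are hypothetical and not part of the project skeleton.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, k, max_iter=20, seed=0, init_ids=None):
    """Cluster docs (dict: doc_id -> sparse vector) into k groups.

    init_ids (assumed helper parameter) fixes the seed documents;
    otherwise k documents are drawn at random.
    Returns (assignment, centroids), where assignment maps doc_id -> cluster index.
    """
    rng = random.Random(seed)
    ids = sorted(docs)
    seeds = list(init_ids) if init_ids else rng.sample(ids, k)
    centroids = [dict(docs[d]) for d in seeds]
    assignment = {}
    for _ in range(max_iter):
        # Assign each document to its most similar centroid.
        new_assignment = {
            d: max(range(k), key=lambda c: cosine(docs[d], centroids[c]))
            for d in ids
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [docs[d] for d in ids if assignment[d] == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = centroid(members)
    return assignment, centroids


def top_documents(docs, assignment, centroids, c, n=3):
    """The n members of cluster c that are most similar to its centroid."""
    members = [d for d, a in assignment.items() if a == c]
    return sorted(members, key=lambda d: cosine(docs[d], centroids[c]),
                  reverse=True)[:n]


def distinguishing_terms(centroids, c, n=5):
    """Terms whose weight in centroid c most exceeds the mean weight elsewhere."""
    others = [x for i, x in enumerate(centroids) if i != c]

    def margin(t):
        rest = sum(o.get(t, 0.0) for o in others) / len(others) if others else 0.0
        return centroids[c][t] - rest

    return sorted(centroids[c], key=margin, reverse=True)[:n]
```

Calling `kmeans(docs, k)` for each k from 3 to 10, then `top_documents` and `distinguishing_terms` for every cluster index, yields the per-cluster listings and keyword summaries. The margin-based keyword ranking is only one possible heuristic for picking distinguishing terms.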
SUBMIT:
- Hardcopy showing the clusters for the sample queries below.
- An analysis of the effect of the number of clusters, along with your own evaluation of whether the clusters seem to correspond to any natural categories.
- Hardcopy of your code with comments.
Extra Credit:
(Hint: Use cluster quality measures defined in class slides).
- *Propose* any other extensions you want to make, and implement them.
- Implement a "similar pages" feature.
- Implement scalar cluster analysis to suggest ways of elaborating the query terms.
- Run the queries on your own search engine, Google.com (using the special operator "site:asu.edu" after your query term), and ASU's Google Search Appliance. Note any differences in the rankings (obviously there will be some because the project crawl is not complete). If possible, try to speculate as to what may be causing the differences. Moreover, why do you think the results from Google.com and the Google Search Appliance are different? Which do you find to be better in terms of relevance to the query terms?
- Implement a feature that adds relevant terms to the keyword query via some form of user relevance feedback.
- Extend the K-means algorithm implemented in Task 5 by:
  - Recomputing the centroid after every few changes.
  - Using a heuristic to pick the centroids, e.g., run K-means multiple times and pick the centroids from the best cluster.
  Analyze the results obtained. Did these changes improve the quality of the clusters?
- Cluster the results using the Bisecting K-means algorithm. Compare and contrast the resulting clusters with those given by K-means and Buckshot. Specifically, analyze which algorithm provides "better" clusters, K-means or Bisecting K-means.
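
For the Bisecting K-means item, one possible shape of the algorithm is sketched below. It is an illustration under stated assumptions, not a reference solution: documents are sparse tf-idf dicts (term -> weight), similarity is cosine, the cluster chosen for splitting is always the largest one, and each split keeps the best of several trial 2-means runs as judged by summed document-to-centroid similarity. All function names and the `trials` parameter are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def two_means(docs, ids, rng, max_iter=20):
    """Split ids into two groups with basic 2-means; returns (left, right)."""
    cents = [dict(docs[s]) for s in rng.sample(ids, 2)]
    groups = None
    for _ in range(max_iter):
        new = ([], [])
        for d in ids:
            closer_to_left = cosine(docs[d], cents[0]) >= cosine(docs[d], cents[1])
            new[0 if closer_to_left else 1].append(d)
        if new == groups:
            break  # converged
        groups = new
        for i in (0, 1):
            if groups[i]:
                cents[i] = centroid([docs[d] for d in groups[i]])
    return groups


def cohesion(docs, ids):
    """Sum of cosine similarities between each document and its cluster centroid."""
    if not ids:
        return 0.0
    c = centroid([docs[d] for d in ids])
    return sum(cosine(docs[d], c) for d in ids)


def bisecting_kmeans(docs, k, trials=5, seed=0):
    """Return a list of clusters (lists of doc ids), at most k of them."""
    rng = random.Random(seed)
    clusters = [sorted(docs)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        target = max(splittable, key=len)  # assumption: split the largest cluster
        best = None
        for _ in range(trials):
            left, right = two_means(docs, target, rng)
            if not left or not right:
                continue  # degenerate split, try again
            score = cohesion(docs, left) + cohesion(docs, right)
            if best is None or score > best[0]:
                best = (score, left, right)
        if best is None:
            break  # no usable split was found
        clusters.remove(target)
        clusters.extend([best[1], best[2]])
    return clusters
```

Splitting the cluster with the lowest cohesion instead of the largest one is an equally reasonable choice, and comparing the two is a natural part of the analysis. The same `cohesion` score can be summed over the clusters produced by plain K-means to put both algorithms on a common footing.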
Last Modified: 04.06.2007