CSE 494 - Information Retrieval - Project Phase 3 (Due 4/24)
If you have not finished phases 1 and 2, complete them first.
Task 1: K-means
Given a query, obtain the top-N documents using TF/IDF.
Cluster the results using the simple K-means algorithm, which picks
the initial k centroids at random.
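The procedure above can be sketched as follows. This is only a minimal illustration, not a required design: it assumes each of the top-N documents is already available as a sparse TF-IDF vector stored as a {term: weight} dictionary, and it uses cosine similarity to assign documents to centroids. The names cosine, mean_vector, kmeans and top_docs are hypothetical helpers, not part of the provided project code.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(docs, k, max_iters=100, seed=None):
    """docs: {doc_id: sparse vector}. Returns (assignment, centroids)."""
    rng = random.Random(seed)
    ids = sorted(docs)
    # Simple K-means: the initial centroids are k randomly chosen documents.
    centroids = [dict(docs[d]) for d in rng.sample(ids, k)]
    assignment = {}
    for _ in range(max_iters):
        # Assign every document to its most similar centroid.
        new_assignment = {
            d: max(range(k), key=lambda c: cosine(docs[d], centroids[c]))
            for d in ids
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [docs[d] for d in ids if assignment[d] == c]
            if members:  # an emptied cluster keeps its old centroid
                centroids[c] = mean_vector(members)
    return assignment, centroids

def top_docs(assignment, centroids, docs, cluster, n=3):
    """The n members of a cluster that are most similar to its centroid."""
    members = [d for d, c in assignment.items() if c == cluster]
    members.sort(key=lambda d: cosine(docs[d], centroids[cluster]), reverse=True)
    return members[:n]
```

Because the starting centroids are random, different runs may produce different clusters; passing a fixed seed makes a run repeatable when you compare values of k.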
- 5 marks: Cluster the results of the queries given below, with
- Number of documents clustered, N = 50
- Number of clusters, k = 3
- Similarity algorithm = Vector similarity (TF-IDF) without PageRank
and submit a printout of the document numbers of the top-3 documents in each cluster.
- 3 marks: For each cluster you obtained above, produce a short "summary" of the cluster
using the keywords that best distinguish it from the other clusters. Explain how you obtained these summaries.
There is no set algorithm that you have to use; you are free to come up with your own strategy.
- 8 marks: Pick any two queries from the set given below and vary the value of k from 3 to 10.
What do you observe? Why?
- How does execution time change?
- How does the similarity of each document to the centroid of its cluster change?
- How did the value of k affect the clustering? Justify with a couple of examples.
- Do the clusters seem to roughly correspond to the natural categories of the pages? Did
the value of k affect this? Mention any other observations you have.
- 4 marks: Submit your code with comments.
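For the cluster summaries, one possible strategy (among many, since no algorithm is prescribed) is to score each term by how much heavier it is in one cluster's centroid than in the other centroids on average, and to keep the highest-scoring terms. The sketch below assumes the centroids are {term: weight} dictionaries; summarize_clusters is a hypothetical name and the example terms are made up for illustration.

```python
def summarize_clusters(centroids, num_keywords=5):
    """Return, for each centroid, the terms that most distinguish it.

    A term's score is its weight in this centroid minus its mean weight
    in all the other centroids, so terms common to every cluster score low.
    """
    summaries = []
    for i, centroid in enumerate(centroids):
        others = [c for j, c in enumerate(centroids) if j != i]
        scores = {}
        for term, weight in centroid.items():
            if others:
                other_mean = sum(c.get(term, 0.0) for c in others) / len(others)
            else:
                other_mean = 0.0
            scores[term] = weight - other_mean
        # Highest score first; ties are broken alphabetically for stability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        summaries.append(ranked[:num_keywords])
    return summaries

# Illustrative centroids only; real ones come from your clustering step.
example = [
    {"parking": 0.8, "decal": 0.6, "asu": 0.5},
    {"tuition": 0.9, "asu": 0.5},
]
print(summarize_clusters(example, num_keywords=2))
```

Here "asu" has the same weight in both centroids, so it scores zero and does not crowd out the distinguishing terms.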
Sample queries
- medic care
- employee benefits
- parking decal
- admissions
- languages
Task 2: Integrated Search Engine Interface
10 marks: Points will be awarded for implementation, code, performance and a thorough analysis.
- Snippet generation: For every result document retrieved by TF/IDF, generate a snippet of text. The snippet must be a block of text from the document that helps the reader understand why the document is relevant to the query. Analyze the time taken to generate the snippets, describe the data structures and algorithm you used, and judge the relevance of the snippets.
- GUI: Implement a graphical user interface for your search engine. The GUI should show the top 10 results for the user query along with their snippets (basically like a typical search engine).
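One simple way to approach snippet generation is a sliding window: scan the document's words and keep the fixed-size window that contains the most query-term occurrences. The sketch below is an illustration under stated assumptions, not the expected solution: it assumes the HTML has already been reduced to plain text, matches terms case-insensitively without stemming, and uses the hypothetical name make_snippet.

```python
import re

def make_snippet(text, query, window=30):
    """Return the window of `window` words with the most query-term hits."""
    words = text.split()
    if not words:
        return ""
    terms = {t.lower() for t in query.split()}
    # Strip punctuation so that "decal." still matches the term "decal".
    normalized = [re.sub(r"\W+", "", w).lower() for w in words]
    hits = [1 if w in terms else 0 for w in normalized]

    window = min(window, len(words))
    score = sum(hits[:window])
    best_score, best_start = score, 0
    # Slide the window one word at a time, updating the score in O(1).
    for start in range(1, len(words) - window + 1):
        score += hits[start + window - 1] - hits[start - 1]
        if score > best_score:
            best_score, best_start = score, start

    snippet = " ".join(words[best_start:best_start + window])
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + window < len(words) else ""
    return prefix + snippet + suffix
```

The scan is linear in the document length, which gives a baseline for the timing analysis. A document with no query terms simply yields its first `window` words.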
Submit your code and a soft copy of the report to cse494s15@gmail.com. Submit a hard copy of the
report in class. All submissions are due at the beginning of class (9 AM)
on the due date.
Download and setup
You should not need any additional files for this phase. Your project as it stands after phase 2 should be a sufficient starting point. A reminder: the indexed HTML files are available from: Projectclass.jar. Note that although the filename says "jar", it is actually a zip file that you can extract. The HTML files are in the "result3" folder.
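If you prefer to unpack the archive from a script rather than with a desktop tool, a minimal sketch using Python's standard zipfile module is shown below. The function name extract_html_files and the destination directory are hypothetical, and the check for "result3/" assumes that folder name appears in the archive's entry paths as described above.

```python
import zipfile

def extract_html_files(archive_path, dest_dir):
    """Extract the archive and return the entries under the result3 folder."""
    # zipfile reads the file by content, so the ".jar" extension is not a problem.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return [
            name for name in archive.namelist()
            if "result3/" in name and not name.endswith("/")
        ]
```

For example, extract_html_files("Projectclass.jar", "data") would unpack everything into a local "data" directory and list the extracted HTML entries.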