CSE 494 - Information Retrieval - Project Phase 3 (Due 4/24)

If you have not finished Phases 1 and 2, complete them first.

Task 1: K-means

Given a query, obtain the top-N documents using TF/IDF. Cluster these results using the simple K-means algorithm, which picks the initial k centroids at random.
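
As a rough illustration of the clustering step, here is a minimal Python sketch of simple K-means over sparse TF/IDF vectors. It is not a reference solution. It assumes each of the top-N documents is already available as a {term: weight} dictionary, uses cosine similarity to assign documents to centroids, and stops when assignments no longer change. The names kmeans, cosine_similarity and mean_vector, along with the max_iters and seed parameters, are hypothetical and not part of the project code.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in total.items()}


def kmeans(doc_vectors, k, max_iters=100, seed=None):
    """Simple K-means: random initial centroids, then assign/update until stable.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of doc_vectors[i].
    """
    rng = random.Random(seed)
    # Randomly pick k distinct documents as the initial centroids.
    centroids = [dict(v) for v in rng.sample(doc_vectors, k)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each document goes to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in doc_vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(doc_vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignment, centroids
```

Because the initial centroids are random, different runs (or different seed values) can produce different clusterings for the same query. This is worth keeping in mind when analyzing your results.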

Sample queries

Task 2: Integrated Search Engine Interface

10 marks: Points will be awarded for implementation, code, performance and a thorough analysis.

  1. Snippet generation: For every result document retrieved by TF/IDF, generate a snippet of text. The snippet must be a block of text from the document that helps the reader understand why the document is relevant to the query. Analyze the time taken to generate the snippets, describe the data structures and algorithm you used, and judge the relevance of the snippets.
  2. GUI: Implement a graphical user interface for your search engine. The GUI should show the top 10 results for the user query along with their snippets (basically like a typical search engine).
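
One possible approach to snippet generation, shown as a minimal Python sketch rather than a required method, is to slide a fixed-size window over the document's words and keep the window containing the most query-term occurrences. The sketch assumes the document text has already been stripped of HTML markup. The names tokenize and generate_snippet and the window parameter are hypothetical.

```python
import re


def tokenize(text):
    """Split text into alphanumeric word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def generate_snippet(text, query, window=30):
    """Return the `window`-word span of `text` with the most query-term hits.

    Runs in time linear in the number of words: the hit count of each
    window is derived from the previous one instead of being recounted.
    """
    words = tokenize(text)
    if not words:
        return ""
    query_terms = {t.lower() for t in tokenize(query)}
    # hits[i] is 1 when word i is a query term (case-insensitive), else 0.
    hits = [1 if w.lower() in query_terms else 0 for w in words]
    window = min(window, len(words))
    score = sum(hits[:window])
    best_score, best_start = score, 0
    for start in range(1, len(words) - window + 1):
        # Slide right by one word: add the entering word, drop the leaving one.
        score += hits[start + window - 1] - hits[start - 1]
        if score > best_score:
            best_score, best_start = score, start
    return " ".join(words[best_start:best_start + window])
```

This version counts every occurrence of a query term equally and returns the earliest best window. Weighting terms by IDF or preferring windows that cover more distinct query terms are natural variations to compare in your analysis.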

Submit your code and a soft copy of the report to cse494s15@gmail.com. Submit a hard copy of the report in class. All submissions are due at the beginning of class (9 AM) on the deadline day.

Download and setup

You should not need any additional files for this phase. Your project as it stands after Phase 2 should be a sufficient base to build on. As a reminder, the indexed HTML files are available from: Projectclass.jar. Note that although the file extension is "jar", it is actually a zip file that you can extract. The HTML files are in the "result3" folder.
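
Since the archive is an ordinary zip file, any zip tool can unpack it. As a small sketch, Python's standard zipfile module can do the same. The function name extract_archive and the destination folder name used below are hypothetical.

```python
import zipfile


def extract_archive(archive_path, dest_dir):
    """Extract a zip-format archive (whatever its extension) into dest_dir.

    Returns the list of member names contained in the archive.
    """
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, extract_archive("Projectclass.jar", "project_files") would unpack the archive into a project_files folder, assuming Projectclass.jar is in the current directory.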