CSE 494 - Information Retrieval - Project Part 1

Due February 13th, 2015

Project Description

This is the part 1 of the project for CSE 494/598. You are provided with a system that can extract web pages and index them. Using this system you will experiment with various Ranking Algorithms.

You should consider finishing the coding for this project at least 2 days before the actual deadline, so you have enough time for the analysis. Note that most of the points in this assignment are awarded for the analysis

Task 0: Read the background description, download the code, and try the boolean ranking code in SearchFiles.java made available as part of the code. You may try example queries given below.
Task 1: Implement the Vector Space Model (using just TF weights) to rank the documents. Hint: See class notes "Retrieval using inverted files".
NOTE: For a query q and document di, Similarity (q,di) = q.di / |q||di|. Computation of |di|, the 2-norm of the term weights of all terms contained in the document takes more time as you need to scan through every hit document and then all terms it contains to get the normalization factor. So better precompute the 2-norms for all documents, rather than doing it every time you compute similarity of a document.
Task 2: Redo task 1, but now with TF/IDF weighting of documents. (notice that while TF is already given, you do need to compute IDF yourself. Since the document corpus is not going to change much in this project, you can precompute IDF). Evaluate the TF vs. TF/IDF ranking to draw any empirical conclusions you can about which is more effective. If you want to see the actual pages that this index is based on, look at the folder called "result3" in the file Projectclass.jar you can download from the end of this page.
IMPORTANT: Lucene has its own functions to do advanced TF/IDF based querying. Please do not use those functions. Use only the lucene functions that are shown in the sample code. Your results may not match the results of the lucene in-built search engine, because they use a complex algorithm, and approximate many quantities in order to optimize lucene for speed.

SUBMIT:

4 points: Email just your code (not the supporting libraries or data) to cse494s15@gmail.com with the subject line: [494 code].
4 points: Submit a print out of the top 10 documents ranked by Vector Space model (once for task 1, and once for task 2) for the 5 example queries below. Print ONLY the document numbers.
2 points: A few sentences explaining your algorithm.
Analyze the speed/efficiency of your algorithm.
- 5 points: How long did it take to compute document norms? How long did it take to get the results? To sort them?
- 5 points: Plot a bar chart showing the time taken to perform a search of the five queries above for TF and for TF/IDF. Is the difference significant? Is it expected?
- 5 points: Plot a chart showing the time taken to perform a search vs the number of words in the query. Explain the result.
- (Extra credit 3 points): Plot a curve showing the time taken to do a query vs the number of results in the query. To do this question, you will need to figure out query keywords of your own that have various number of results, so that you can plot a meaningful curve.
- (Extra credit 3 points): Artificially restrict the size of the document set by ignoring any documents with id above a certain number 'n'. Plot the time taken to do the query against n. Explain the result.
- (Extra credit 3 points): Provide a theoretical complexity analysis of the algorithm.
Analyze the correctness of the results.
- 5 points: Are the TF results relevant, as judged by a human? Are the TF/IDF results relevant? Which is better?
- 5 points: Compare TF vs TF/IDF: Is the order of the results different with a single keyword? two keywords?
- 5 points: Which term(s) have the lowest IDF in the corpus? Is this expected?
(Extra Credit): Any interesting analysis or insight you have can be submitted for extra credit.

Example Queries

grades
newsletter
carl hayden
fall semester
stimulant web

Download and set up

The code made available as part of the project is written in Java, in an Eclipse project. You can download the latest version of Eclipse here. [Use the third link, the one that says Eclipse IDE for Java Developers (150 MB).] If you choose not to use Eclipse, add the lucene-core-3.6.2.jar file in the lib folder to your classpath, and you should be ready to compile your code using any compiler of your choice.

Knowledge of Java is a pre-requisite for this course. If you need to brush up your Java skills, you should read The Java Tutorial. A detailed, class-by-class reference for the Java language and libraries can be found here. You could also simply search for the name of the class on your favorite internet search engine.

Getting Started:

Download the following file to your desktop irs13.zip.
Extract the contents of the zip file to a folder of your choice.
Start Eclipse. Click on File > New > Java Project.
In the dialog box that appears, enter the path to the folder in the Location text box.
Click Finish. A new project should now appear in your Package Explorer panel.
Expand the src folder in that tree, and double click on SearchFiles.java inside edu.asu.cse494.
Click on Run > Run As > Java Application.
You will now be able to type queries in the "Console" area at the bottom of the screen.

Description and documentation

The file SearchFiles.java gives a query interface that accepts the queries and retrieves the matching documents. The interface accepts Boolean Queries. The program assumes "index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile. The file also has commented code that shows you how to use the various permitted APIs of the lucene project. Please use only the functions you find in this file, and do not use any other APIs. This restriction will be lifted in the future parts of the project.
The file CallIndexFiles.java creates a lucene index from a set of crawled html files. If you want to test your code on a different dataset, you should store all the html files in a folder, and then run this code to create the index.
The documentation of the lucene API is attached in the javadoc in the lib/ folder, and you should also be able to find it online.

Additional Files:

Projectclass.jar. This is a zip file containing a full crawl of the asu website that the index is based on. You should get this file if you are planning to recreate the index, or want to look into the HTML files for further processing. Please ignore all the other files in this jar file except for the result3 folder - they are older versions of the code.