CSE 494 - Information Retrieval - Project Part 1 (Due Feb 23rd, 2010)

Project Description

This is Part A of the project for CSE 494/598. You are provided with a system that can extract web pages and index them. Using this system, you will experiment with various ranking algorithms.

Task 0: Read the background description, download the code, and try the boolean ranking code in SearchFiles.java made available as part of the code. You may try example queries given below. You can also try negated keywords (e.g. -Decal).
Task 1: Implement the Vector Space Model (using *just* tf weights) to rank the documents. Hint: See class notes: Retrieval using inverted files.
NOTE: For a query q and document di, Similarity(q, di) = (q . di) / (|q| |di|). Computing |di|, the 2-norm of the weights of all terms contained in the document, is the expensive step: done at query time, it requires scanning every hit document and then every term that document contains. It is better to precompute the 2-norms for all documents once, rather than recomputing them every time you compute the similarity of a document. You can start from the VectorViewer.java code for vector space ranking.
Task 2: Redo Task 1, but now with tf/idf weighting of documents. (Note that while tf is already given, you need to compute idf yourself. Since the document corpus is not going to change much in this project, you can precompute idf.)
Task 3 (extra credit): Evaluate the tf and tf/idf rankings and draw whatever empirical conclusions you can about which is more effective.
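
The precomputation suggested in Tasks 1 and 2 can be sketched in plain Java, independent of the supplied code. This is only an illustration under stated assumptions: the class VectorSpaceSketch, its in-memory map of term -> (document id -> tf), and the demoIndex() data are all hypothetical and are not part of the project code; idf is taken to be log(N / df), which is one common choice and not one the assignment prescribes; and the query is weighted the same way as the documents.

```java
import java.util.*;

class VectorSpaceSketch {
    // Hypothetical in-memory inverted index: term -> (docId -> term frequency).
    private final Map<String, Map<Integer, Integer>> postings;
    private final int numDocs;
    private final boolean useIdf;
    // |d_i| for every document, computed once in the constructor.
    private final Map<Integer, Double> docNorms = new HashMap<>();

    VectorSpaceSketch(Map<String, Map<Integer, Integer>> postings, int numDocs, boolean useIdf) {
        this.postings = postings;
        this.numDocs = numDocs;
        this.useIdf = useIdf;
        // One pass over all postings accumulates the squared weights per document.
        Map<Integer, Double> sumSquares = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            double idf = idf(e.getKey());
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                double w = p.getValue() * idf;
                sumSquares.merge(p.getKey(), w * w, Double::sum);
            }
        }
        for (Map.Entry<Integer, Double> e : sumSquares.entrySet()) {
            docNorms.put(e.getKey(), Math.sqrt(e.getValue()));
        }
    }

    // Assumed idf form: log(N / df). Returns 1 in tf-only mode, 0 for unknown terms.
    double idf(String term) {
        if (!useIdf) return 1.0;
        Map<Integer, Integer> p = postings.get(term);
        if (p == null || p.isEmpty()) return 0.0;
        return Math.log((double) numDocs / p.size());
    }

    // Returns up to k (docId, cosine similarity) pairs, best first.
    List<Map.Entry<Integer, Double>> rank(List<String> queryTerms, int k) {
        Map<String, Integer> queryTf = new HashMap<>();
        for (String t : queryTerms) queryTf.merge(t, 1, Integer::sum);

        Map<Integer, Double> dot = new HashMap<>();
        double querySumSquares = 0.0;
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            double idf = idf(q.getKey());
            double queryWeight = q.getValue() * idf;
            querySumSquares += queryWeight * queryWeight;
            Map<Integer, Integer> p = postings.get(q.getKey());
            if (p == null) continue;
            // Only documents containing a query term are touched.
            for (Map.Entry<Integer, Integer> d : p.entrySet()) {
                dot.merge(d.getKey(), queryWeight * d.getValue() * idf, Double::sum);
            }
        }
        double queryNorm = Math.sqrt(querySumSquares);

        List<Map.Entry<Integer, Double>> results = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            double denom = queryNorm * docNorms.get(e.getKey());
            if (denom > 0) {
                results.add(new AbstractMap.SimpleEntry<Integer, Double>(e.getKey(), e.getValue() / denom));
            }
        }
        results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return results.subList(0, Math.min(k, results.size()));
    }

    // Tiny made-up index with three documents, for illustration only.
    static Map<String, Map<Integer, Integer>> demoIndex() {
        Map<String, Map<Integer, Integer>> idx = new HashMap<>();
        idx.computeIfAbsent("parking", t -> new HashMap<>()).put(0, 3);
        idx.computeIfAbsent("parking", t -> new HashMap<>()).put(1, 1);
        idx.computeIfAbsent("decal", t -> new HashMap<>()).put(0, 1);
        idx.computeIfAbsent("decal", t -> new HashMap<>()).put(2, 2);
        idx.computeIfAbsent("library", t -> new HashMap<>()).put(1, 4);
        return idx;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("parking", "decal");
        System.out.println("tf:     " + new VectorSpaceSketch(demoIndex(), 3, false).rank(query, 10));
        System.out.println("tf/idf: " + new VectorSpaceSketch(demoIndex(), 3, true).rank(query, 10));
    }
}
```

Because the norms are stored when the object is built, each query only walks the postings of its own terms. In the actual project the same quantities would be filled from the Lucene index rather than from a hand-built map.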

SUBMIT:

A writeup explaining your algorithm and a brief evaluation of its performance.
Hardcopy showing the top 10 documents ranked by Vector Space model (once for task 1, and once for task 2) for the 10 example queries below.
Details of any evaluation you did for the optional task 3.
Hardcopy of your code with comments.

Example Queries

Fall semester
Grades
SRC
newsletter
Decal
Parking
Hayden Library
Transcripts
Scholarship
Admissions

Download and set up

The code made available as part of the project is written in Java, in an Eclipse project. You can download the latest version of Eclipse here. [Use the second link, the one that says Eclipse IDE for Java Developers (92 MB).] If you choose not to use Eclipse, add the lucene.jar file in the lib folder to your classpath, and you should be ready to compile your code using any compiler of your choice.

Knowledge of Java is a prerequisite for this course. If you need to brush up on your Java skills, you should read The Java Tutorial. A detailed, class-by-class reference for the Java language and libraries can be found here. You could also simply search for the name of the class on your favorite internet search engine.

Getting Started:

Download the following file to your desktop: cse494-v1.zip.
Extract the contents of the zip file to a folder of your choice.
Start Eclipse. Click on File > New > Java Project.
In the dialog box that appears, click Create project from existing source. Click Browse... and point to the folder you extracted your files in.
Click Finish. A new project should now appear in your Package Explorer panel.
Expand the src folder in that tree, and double click on SearchFiles.java inside edu.asu.cse494.
Click on Run > Run As > Java Application.
You will now be able to type queries in the "Console" area at the bottom of the screen.

Description and documentation

The file SearchFiles.java provides a query interface that accepts Boolean queries and retrieves the matching documents. The program assumes "result3index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
The program VectorViewer.java is a demo program that lists the total number of documents in the index and then lists all the terms in the index with their frequencies. This program demonstrates how you can retrieve the underlying document vector model for use in the various ranking algorithms that you develop. Read the commented section inside the code to see how to extract the (document, freq) pairs for each term in the index and generate the document vector model.
The package you downloaded uses the Lucene API for indexing and searching. The full documentation for the supplied classes as well as the Lucene API can be found here.
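
The step from per-term (document, freq) pairs to a document vector model, as described for VectorViewer.java above, can be sketched without any Lucene calls. This is a hypothetical illustration only: the Posting type and the buildDocVectors method are invented here and do not appear in the supplied code. The sketch assumes the index scan yields one (term, document id, frequency) triple at a time.

```java
import java.util.*;

class DocumentVectorSketch {
    // Hypothetical triple, standing in for what a term-by-term index scan yields.
    static final class Posting {
        final String term;
        final int docId;
        final int freq;

        Posting(String term, int docId, int freq) {
            this.term = term;
            this.docId = docId;
            this.freq = freq;
        }
    }

    // Regroups term-ordered postings by document: docId -> (term -> tf).
    static Map<Integer, Map<String, Integer>> buildDocVectors(Iterable<Posting> postings) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        for (Posting p : postings) {
            vectors.computeIfAbsent(p.docId, id -> new TreeMap<>())
                   .merge(p.term, p.freq, Integer::sum);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up postings, listed term by term as an inverted index would store them.
        List<Posting> postings = Arrays.asList(
            new Posting("decal", 0, 1),
            new Posting("decal", 2, 2),
            new Posting("parking", 0, 3),
            new Posting("parking", 1, 1));
        System.out.println(buildDocVectors(postings));
    }
}
```

Each inner map is one document's sparse term vector, so a document's 2-norm is the square root of the sum of its squared weights.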

Additional Files:

Additional Code. This is a zip file containing some additional classes:
- Webcrawl.java: A crawler that crawls the web starting from the seed URL provided on invocation. This crawler stores the URLs that it successfully crawls. The result of one such crawl is made available as part of the class files below.
- IndexHTML.java: Contains code to create the index from the crawled URLs. The index generated is made available as part of the Projectclass.jar. The index is generated using the Lucene API. The result of this program is made available as part of the cse494-v1.zip above. You can recreate your own index by running this program on the crawled HTML files.
- Other classes: Some other classes that you may use to index HTML pages.
Projectclass.jar. This is a zip file containing compiled versions of the code above, as well as a full crawl of the ASU website. You should get this file if you are planning to recreate the index.