CSE 494 - Information Retrieval - Project Part 1 (Due Date Here)

Project Description

This is the part A of the project for CSE 494/598. You are provided with a system that can extract web pages and index them. Using this system you will experiment with various Ranking Algorithms.

You should consider finishing the coding for this project at least 2 days before the actual deadline, so you have enough time for the analysis. Note that most of the points in this assignment are awarded for the analysis

Task 0: Read the background description, download the code, and try the boolean ranking code in SearchFiles.java made available as part of the code. You may try example queries given below. You can also try negated keywords (e.g. -Decal).
Task 1: Implement the Vector Space Model (using *just* tf weights) to rank the documents. Hint: See class notes: Retrieval using inverted files.
NOTE: For a query q and document di, Similarity (q,di) = q.di / |q||di|. Computation of |di|, the 2-norm of the term weights of all terms contained in the document takes more time as you need to scan through every hit document and then all terms it contains to get the normalization factor. So better precompute the 2-norms for all documents, rather than doing it every time you compute similarity of a document. You can start with VectorViewer.java code for Vector space ranking.
Task 2: Redo task 1, but now with tf/idf weighting of documents. (notice that while tf is already given, you do need to compute idf yourself. Since the document corpus is not going to change much in this project, you can precompute idf).
Task 3 (extra credit):Evaluate the tf vs. tf/idf ranking to draw any empirical conclusions you can about which is more effective.

SUBMIT:

5 points: Email just your code (not the supporting libraries or data) to sushovan@asu.edu with a subject line [494 code].
5 points: Hardcopy showing the top 10 documents ranked by Vector Space model (once for task 1, and once for task 2) for the 10 example queries below.
5 points: A writeup explaining your algorithm.
10 points: Analyze the speed/efficiency of your algorithm. You may choose to do various experiments, such as plot a graph of the time taken vs number of results, compare the time taken by various phases of the algorithm, and comparing time taken by TF-only search vs TF-IDF search.
5 points: Analyze the correctness of the results. Do they make sense? Does adding a second word to your search change the results? Does adding IDF change the results?
Details of any evaluation you did for the optional task 3.

Example Queries

Grades
newsletter
Decal
Parking
Transcripts
Scholarship
Admissions
Carl Hayden
Fall semester
stimulant web

Download and set up

The code made available as part of the project is written in Java, in an Eclipse project. You can download the latest version of Eclipse here. [Use the third link, the one that says Eclipse IDE for Java Developers (122 MB).] If you choose not to use Eclipse, add the lucene.jar file in the lib folder to your classpath, and you should be ready to compile your code using any compiler of your choice.

Knowledge of Java is a pre-requisite for this course. If you need to brush up your Java skills, you should read The Java Tutorial. A detailed, class-by-class reference for the Java language and libraries can be found here. You could also simply search for the name of the class on your favorite internet search engine.

Getting Started:

Download the following file to your desktop cse494-v1.zip.
Extract the contents of the zip file to a folder of your choice.
Start Eclipse. Click on File > New > Java Project.
In the dialog box that appears, enter the path to the folder in the Location text box.
Click Finish. A new project should now appear in your Package Explorer panel.
Expand the src folder in that tree, and double click on SearchFiles.java inside edu.asu.cse494.
Click on Run > Run As > Java Application.
You will now be able to type queries in the "Console" area at the bottom of the screen.

Description and documentation

The file SearchFiles.java gives a query interface that accepts the queries and retrieves the matching documents. The interface accepts Boolean Queries. The program assumes "result3index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
The program VectorViewer.java is a demo program that lists the total number of documents in the index and the lists all the terms in the index with their frequencies. This program demonstrates how you can retrieve the underlying Document Vector Model for use in the various ranking algorithms that you develop. Read the commented section inside the code to extract the (document, freq) for each term in the index to generate the document vector model.
The package you downloaded uses the lucene api for indexing and searching. The full documentation for the supplied classes as well as the lucene api can be found here.

Additional Files:

Additional Code. This is a zip file containing some additional classes:
- Webcrawl.java: A crawler that crawls the web starting from the seed URL provided on invocation. This crawler stores the URLs that it successfully crawls. The result of one such crawl is made available as part of the class files below.
- IndexHTML.java: Contains code to create the index from the crawled URLs. The index generated is made available as part of the Projectclass.jar. The index is generated using the lucene API. The result of this program is made available as part of the cse494-v1.zip above. You can recreate your own index by running this program on the crawled html files.
- Other classes: Some other classes that you may use to index html pages.
Projectclass.jar. This is a zip file containing compiled versions of the code above, as well as a full crawl of the asu website. You should get this file if you are planning to recreate the index.