Information Retrieval Project

CSE 494/598 - Project

Project Description

This is Part A of the project for CSE 494/598. You are provided with a system that can extract web pages and index them. Using this system, you will experiment with various ranking algorithms.

TASK 0: Read the background description, download the code, and try the boolean ranking code in SearchFiles.java made available as part of the code. You may try example queries given below. You can also try negated keywords (e.g. -Decal).
TASK 1: Implement the Vector Space Model to rank the documents. Hint: See class notes: Retrieval using inverted files.
- NOTE: For a query q and document di, Similarity(q, di) = (q · di) / (|q| |di|). Computing |di|, the 2-norm of the weights of all terms contained in the document, is expensive at query time: you would need to scan every hit document and then every term it contains to get the normalization factor. It is better to precompute the 2-norms for all documents once than to recompute them every time you compute the similarity of a document. You can start from the VectorViewer.java code for vector space ranking.
- SUBMIT:
  - A writeup explaining your algorithm and a brief evaluation of its performance.
  - Hardcopy showing the top 10 documents ranked by the Vector Space Model for each of the 10 example queries below.
  - Hardcopy of your code with comments.
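
The precomputation suggested in the NOTE above can be sketched as follows. This is only an illustration under stated assumptions, not the provided code: the class VectorSpaceSketch and its methods are hypothetical names, the index is stood in for by an in-memory map (term -> docId -> frequency) instead of the Lucene index, raw term frequency is used as the term weight, and a Java version with generics and lambdas (8 or later) is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: cosine ranking with document norms computed once, up front.
class VectorSpaceSketch {

    /** Records that term occurs freq times in document docId. */
    static void addPosting(Map<String, Map<Integer, Integer>> postings,
                           String term, int docId, int freq) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(docId, freq);
    }

    /** Precomputes |di| for every document in a single pass over the postings. */
    static double[] documentNorms(Map<String, Map<Integer, Integer>> postings, int numDocs) {
        double[] norms = new double[numDocs];
        for (Map<Integer, Integer> docFreqs : postings.values()) {
            for (Map.Entry<Integer, Integer> e : docFreqs.entrySet()) {
                double w = e.getValue();        // assumption: weight = raw term frequency
                norms[e.getKey()] += w * w;     // accumulate squared weights per document
            }
        }
        for (int d = 0; d < numDocs; d++) {
            norms[d] = Math.sqrt(norms[d]);
        }
        return norms;
    }

    /** Cosine similarity between the query and every document containing a query term. */
    static Map<Integer, Double> scores(String[] queryTerms,
                                       Map<String, Map<Integer, Integer>> postings,
                                       double[] norms) {
        Map<String, Integer> queryFreqs = new HashMap<>();
        for (String t : queryTerms) {
            queryFreqs.merge(t, 1, Integer::sum);
        }
        double querySumSquares = 0.0;
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> q : queryFreqs.entrySet()) {
            double qw = q.getValue();
            querySumSquares += qw * qw;
            Map<Integer, Integer> docFreqs = postings.get(q.getKey());
            if (docFreqs == null) continue;     // term does not occur in the index
            for (Map.Entry<Integer, Integer> e : docFreqs.entrySet()) {
                result.merge(e.getKey(), qw * e.getValue(), Double::sum); // dot product
            }
        }
        double queryNorm = Math.sqrt(querySumSquares);
        for (Map.Entry<Integer, Double> e : result.entrySet()) {
            // Only the precomputed norm is looked up here; no document is rescanned.
            e.setValue(e.getValue() / (queryNorm * norms[e.getKey()]));
        }
        return result;
    }

    /** Returns up to k document ids, highest similarity first. */
    static List<Integer> topK(Map<Integer, Double> scores, int k) {
        List<Integer> docIds = new ArrayList<>(scores.keySet());
        docIds.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(docIds.subList(0, Math.min(k, docIds.size())));
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        addPosting(postings, "fall", 0, 2);
        addPosting(postings, "fall", 1, 1);
        addPosting(postings, "semester", 0, 1);
        addPosting(postings, "parking", 1, 3);
        double[] norms = documentNorms(postings, 2);
        Map<Integer, Double> s = scores(new String[] {"fall", "semester"}, postings, norms);
        System.out.println(topK(s, 10));
    }
}
```

The norms are computed once when the index is loaded, so each query only touches the postings of its own terms. A different weighting scheme could be swapped in by changing the two places where the raw frequency is used as the weight.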

Example Queries for TASK 1

Fall semester
Grades
SRC
newsletter
Decal
Parking
Hayden Library
Transcripts
Scholarship
Admissions

Resource Description & Documentation

The code made available as part of the project is written in Java. The latest version of Sun's Java Development Kit (JDK) can be downloaded from this site. If you have an earlier version of the JDK installed on your system, there is no need to upgrade.
Knowledge of Java is a prerequisite for the course. If you need to brush up on it, read the Java Tutorial. The Java API documentation, which describes the class structure of the standard packages, is available here.
The classes provided as part of the project offer the following functionality.
- A crawler implemented by Webcrawl.java that crawls the web starting from the seed URL provided on invocation. This crawler stores the URLs that it successfully crawls. The result of one such crawl is made available as part of the code.
- IndexHTML.java contains code to create the index from the crawled URLs. The generated index is made available as part of Projectclass.jar. The index is generated using the Lucene API. To use this API in your code, you have to import com.lucene.*.
- The program SearchFiles.java provides a query interface that accepts queries and retrieves the matching documents. The interface accepts Boolean queries. The program assumes "index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
- The program VectorViewer.java is a demo program that lists the total number of documents in the index and then lists all the terms in the index with their frequencies. Since the given code does not contain any data structure giving you the Document Vector Model, this program demonstrates how you can retrieve the underlying Document Vector Model for use in the ranking algorithms that you develop. Read the commented section inside the code to see how to extract the (document, freq) pairs for each term in the index and generate the document vector model.
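
As a rough illustration of the last point, the sketch below turns per-term (document, freq) pairs into per-document term vectors. It does not reproduce VectorViewer.java or any Lucene call: the class DocumentVectorSketch is a hypothetical name, and the postings are assumed to have already been copied from the index into an in-memory map (term -> docId -> frequency).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: invert term postings into one term-frequency vector per document.
class DocumentVectorSketch {

    /** Converts term -> (docId -> freq) into docId -> (term -> freq). */
    static Map<Integer, Map<String, Integer>> toDocumentVectors(
            Map<String, Map<Integer, Integer>> postings) {
        Map<Integer, Map<String, Integer>> vectors = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            String term = termEntry.getKey();
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                // Create the document's vector on first use, then record this term's frequency.
                vectors.computeIfAbsent(posting.getKey(), d -> new HashMap<>())
                       .put(term, posting.getValue());
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.computeIfAbsent("decal", t -> new HashMap<>()).put(0, 1);
        postings.computeIfAbsent("decal", t -> new HashMap<>()).put(2, 4);
        postings.computeIfAbsent("parking", t -> new HashMap<>()).put(2, 1);
        System.out.println(toDocumentVectors(postings));
    }
}
```

Documents that contain none of the indexed terms simply have no entry in the result, so callers should treat a missing document id as an empty vector.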
The API documentation for the given classes is available for download and can also be viewed here.

Download & Setup

To download the code, right-click on the links below and select Save As from the popup menu.

Projectclass.jar contains the class files for the above-mentioned interfaces and also the indexed documents from an example crawl of ASU.
Projectdocs.jar gives the API documentation for the given code.
Projectsrc.jar contains the source code which can be extended for the project.
You must have the JDK installed on your system before you can install the downloaded programs. If, after installing the JDK, you get "command unknown" errors for java, javac, jar, javadoc, etc., check your path and classpath settings. For Windows-based systems, check your Autoexec.bat (95, 98) or Environment Variable settings (Win NT/2000/XP). For Unix and Linux systems, set path and classpath in your .cshrc.
To extract the contents of the downloaded files, move to the directory where you want to install them and type
- jar xvf Project1*.jar
Add this directory to your classpath to avoid problems.
Project1class.jar contains compiled code for the interfaces in the source and also documents crawled by the crawler and an index generated from it. If you are interested in checking the query engine you can try
Invoke all the programs from the directory where you executed the "jar" command. This way, all the path settings inside the class files will work properly.

Deadline : Due date for part A : 30th September, 2008

Last Modified: 02.05.2004