This is the first project for CSE 494/598. You are provided with a system
that can extract web pages and index them. Using this system you will
experiment with various Ranking Algorithms.
TASK 0: Read the background description, download the code, and try the Boolean ranking code made available as part of the code. You may try queries made up of keywords related to AI (planning, information retrieval, Bayes network, etc.). You can also try negated keywords (e.g., -network).
TASK 1: Implement the Vector Space Model to rank the documents. You can use the weighting method given in the text or the one given in homework question 1.1.
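As a concrete starting point, here is a minimal sketch of one possible TF-IDF weighting with cosine-similarity ranking. The class and method names (VectorSpaceSketch, rankByCosine) and the input maps are hypothetical; the actual document vectors are built from the Lucene index, as discussed in the VectorViewer section below.

import java.util.*;

// Minimal sketch of vector-space ranking with TF-IDF weights and cosine
// similarity. Assumes document vectors (term -> raw term frequency) have
// already been extracted from the index.
public class VectorSpaceSketch {

    // w(t,d) = tf(t,d) * log(N / df(t)), one common weighting scheme.
    static double tfidf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Rank documents by cosine similarity to the query vector.
    static List<Map.Entry<String, Double>> rankByCosine(
            Map<String, Map<String, Integer>> docVectors,  // docId -> (term -> tf)
            Map<String, Integer> docFreq,                   // term -> document frequency
            Map<String, Integer> queryTf) {                 // query term -> tf in query
        int n = docVectors.size();

        // Precompute the query's weighted vector and its norm.
        Map<String, Double> queryWeights = new HashMap<String, Double>();
        double qNorm = 0.0;
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            Integer df = docFreq.get(q.getKey());
            if (df == null) continue;                       // term not in the index
            double w = tfidf(q.getValue(), df, n);
            queryWeights.put(q.getKey(), w);
            qNorm += w * w;
        }
        qNorm = Math.sqrt(qNorm);

        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Map<String, Integer>> e : docVectors.entrySet()) {
            double dot = 0.0, dNorm = 0.0;
            for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
                double w = tfidf(t.getValue(), docFreq.get(t.getKey()), n);
                dNorm += w * w;
                Double qw = queryWeights.get(t.getKey());
                if (qw != null) dot += w * qw;
            }
            double denom = Math.sqrt(dNorm) * qNorm;
            scores.put(e.getKey(), denom == 0 ? 0.0 : dot / denom);
        }

        // Sort documents by descending similarity score.
        List<Map.Entry<String, Double>> ranked =
            new ArrayList<Map.Entry<String, Double>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());
            }
        });
        return ranked;
    }
}

The weighting used here is tf * log(N/df); if you use the scheme from the homework question instead, only the tfidf method needs to change.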
TASK 2: Implement the Authorities/Hubs computation as discussed in class. Specifically, for a given query Q, you first find the top K most similar pages using the vector space model. Next, you grow these pages into a base set of pages. You make the adjacency matrix for this base set of pages using the link extraction module provided below. You will then do the authority/hub computation on the adjacency matrix, and return the top m authorities and top m hubs. You will compare the results from pure vector space similarity to the results from the authorities/hub computation for a set of queries.
Note that we have provided a new .jar file that contains a crawl of the asu.edu domain, as well as its link matrix. You need to use this data rather than the aaai.org crawl data that was given earlier. You may want to use the same data for Task 1 too.
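Below is a minimal sketch of the authority/hub (HITS) power iteration, assuming A is the adjacency matrix of the base set (A[i][j] = 1 iff page i links to page j) built with the link extraction module; the class and method names are hypothetical.

// Sketch of the HITS authority/hub power iteration over the base-set
// adjacency matrix A, where A[i][j] = 1 iff page i links to page j.
public class HitsSketch {

    // Returns {authorities, hubs} after a fixed number of iterations.
    static double[][] hitsScores(int[][] A, int iterations) {
        int n = A.length;
        double[] auth = new double[n];
        double[] hub = new double[n];
        java.util.Arrays.fill(auth, 1.0);
        java.util.Arrays.fill(hub, 1.0);

        for (int it = 0; it < iterations; it++) {
            // a = A^T h : a page's authority is the sum of the hub scores of pages linking to it
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    newAuth[j] += A[i][j] * hub[i];
            // h = A a : a page's hub score is the sum of the authority scores of pages it links to
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    newHub[i] += A[i][j] * newAuth[j];
            auth = normalize(newAuth);
            hub = normalize(newHub);
        }
        return new double[][] { auth, hub };
    }

    // Normalize to unit length so the scores converge instead of growing without bound.
    static double[] normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm == 0) return v;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
        return v;
    }
}

After a handful of iterations the scores typically stabilize; sorting the pages by the two score vectors gives the top m authorities and top m hubs.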
TASK 3: (Common to the two tasks above.) Evaluate your search engine on a set of at least five representative queries from the ASU context (assuming you are using the ASU crawl). Compare the results of the queries, in terms of precision, with simple vector similarity and with similarity plus authorities/hubs. Comment on whether the additional time taken by the authorities/hubs computation was worth it. You need to use *your* sense of which pages are good in order to compare precision.
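Since the crawl comes with no relevance judgments, precision here is simply the fraction of the top-ranked results that you yourself judge relevant. A minimal sketch, with a hypothetical precisionAtK helper:

import java.util.*;

// Sketch of precision-at-K, assuming you have hand-labelled which of the
// returned document ids you consider relevant for a given query.
public class EvalSketch {
    static double precisionAtK(List<String> rankedDocIds, Set<String> relevant, int k) {
        int limit = Math.min(k, rankedDocIds.size());
        int hits = 0;
        for (int i = 0; i < limit; i++)
            if (relevant.contains(rankedDocIds.get(i))) hits++;
        return limit == 0 ? 0.0 : (double) hits / limit;
    }
}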
TASK 4: We want to suggest some "related" terms to the user to let him/her refine the query. To support this, implement the scalar clusters technique discussed in class. Use it to suggest keywords that the user can use to refine the query. For each of the representative queries from Task 3, evaluate the scalar clustering.
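A minimal sketch of scalar clustering, under the assumption that term frequencies per document are available as a term-by-document count matrix (freq and the class name are hypothetical): first build the term-term correlation matrix, then relate two terms by the cosine similarity of their correlation rows.

// Sketch of scalar clusters for query expansion. Assumes freq[t][d] is the
// frequency of term t in document d, extracted from the index.
public class ScalarClusterSketch {

    // c[u][v] = sum over documents of freq[u][d] * freq[v][d]
    static double[][] correlationMatrix(int[][] freq) {
        int t = freq.length, d = freq[0].length;
        double[][] c = new double[t][t];
        for (int u = 0; u < t; u++)
            for (int v = 0; v < t; v++)
                for (int k = 0; k < d; k++)
                    c[u][v] += freq[u][k] * freq[v][k];
        return c;
    }

    // Scalar (neighborhood) similarity: cosine between rows of the correlation matrix.
    static double scalarSim(double[][] c, int u, int v) {
        double dot = 0, nu = 0, nv = 0;
        for (int k = 0; k < c.length; k++) {
            dot += c[u][k] * c[v][k];
            nu += c[u][k] * c[u][k];
            nv += c[v][k] * c[v][k];
        }
        double denom = Math.sqrt(nu) * Math.sqrt(nv);
        return denom == 0 ? 0.0 : dot / denom;
    }
}

To suggest related terms for a query term u, return the terms v with the highest scalarSim(c, u, v). Note that a full term-by-term matrix over the whole index can be large; in practice you may want to restrict it to terms that co-occur with the query terms.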
Extra Credit Tasks
The prefix stars indicate the degree of difficulty.
(*) Try implementing relevance feedback techniques (the Rocchio method); play with it yourself, by identifying the good vs. bad pages, to see how well it works in practice.
(**) Develop a good interface to display the search results, as well
as the query elaboration results.
(***) Do the PageRank calculation for the entire ASU crawl, and use it as the basis for search. Experiment with different ways of combining PageRank and vector-space similarity (a sketch of the power-iteration computation appears after this list).
(***) Implement and compare latent semantic indexing to vector space ranking. (You will have to find an SVD package for this.)
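For the PageRank extra-credit task, here is a minimal sketch of the power-iteration computation, assuming outLinks[i] lists the indices of pages that page i links to (derivable from the LinkGen/LinkExtract link matrix). The damping factor of 0.85 and the handling of dangling pages are conventional choices, not something fixed by the project.

// Sketch of PageRank by power iteration. outLinks[i] holds the indices of
// pages that page i points to; pages with no out-links distribute their
// rank uniformly (a common way to handle dangling nodes).
public class PageRankSketch {
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    danglingMass += rank[i];
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int j : outLinks[i]) next[j] += share;
                }
            }
            // Combine link contributions, dangling mass, and the random-jump term.
            for (int j = 0; j < n; j++) {
                next[j] = (1 - damping) / n + damping * (next[j] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }
}

A typical invocation would be pageRank(outLinks, 0.85, 50); the ranks can then be combined with vector-space similarity scores in whatever way you choose to experiment with.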
Knowledge of Java is a prerequisite for the course. If you need to brush up your knowledge, read the Java Tutorial. The Java API documentation, which describes the class structure for the various standard packages, is available here.
The following functionality is provided by the classes included as part of the project.
A crawler implemented by Webcrawl.java that crawls the web starting from the seed URL provided on invocation. This crawler stores the first 1000 URLs that it successfully crawls. The result of one such crawl is made available as part of the code. Files from one such run on http://www.aaai.org/ were used to generate the index file given as part of the project.
IndexHTML.java contains the code to create the index from the crawled URLs. The index generated is made available as part of Project1class.jar. The index is generated using the Lucene API; to use this API as part of your code you have to import com.lucene.*.
The program SearchFiles.java gives a query interface that accepts queries and retrieves the matching documents. The interface accepts BooleanQueries. Currently the interface is plain text; an HTML version will be uploaded soon. Also, the program assumes "index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
The program VectorViewer.java is a demo program that lists the total number of documents in the index and lists all the terms in the index with their frequencies. Since the given code does not contain any data structure giving you the document vector model, this program demonstrates how you can retrieve the underlying document vector model for use in the various ranking algorithms that you develop. Read the commented section inside the code to see how to extract the (document, frequency) pairs for each term in the index and generate the document vector model.
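As a rough sketch of what VectorViewer.java demonstrates, the loop below walks the term enumeration and postings of the index to build per-document term-frequency vectors. The package and method names are assumed to follow the standard Lucene IndexReader/TermEnum/TermDocs API; consult the provided source and API documentation for the exact calls available in this version.

import com.lucene.index.IndexReader;
import com.lucene.index.Term;
import com.lucene.index.TermDocs;
import com.lucene.index.TermEnum;
import java.util.*;

// Sketch of building document vectors (docId -> term -> frequency) from the
// index, along the lines of VectorViewer.java.
public class DocVectorSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");    // default index directory
        Map<Integer, Map<String, Integer>> docVectors =
            new HashMap<Integer, Map<String, Integer>>();

        TermEnum terms = reader.terms();                    // enumerate all terms
        while (terms.next()) {
            Term term = terms.term();
            TermDocs docs = reader.termDocs(term);          // postings for this term
            while (docs.next()) {
                Map<String, Integer> vec = docVectors.get(docs.doc());
                if (vec == null) {
                    vec = new HashMap<String, Integer>();
                    docVectors.put(docs.doc(), vec);
                }
                vec.put(term.text(), docs.freq());
            }
            docs.close();
        }
        terms.close();

        System.out.println(reader.numDocs() + " documents, "
                + docVectors.size() + " with at least one indexed term");
        reader.close();
    }
}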
The API documentation for the given classes is available for download and also can be viewed here.
LinkGen.java and LinkExtract.java are provided to help in computing PageRank and Authority/Hub scores for the crawled pages. LinkGen generates the link matrix from the files crawled and saves it to a file; LinkExtract uses the file written by LinkGen. Using LinkExtract you can get information about the pages pointed to by a page and the pages pointing to a page. Pre-compiled classes for both LinkGen and LinkExtract are made available in the jar files below. A directory containing 3000+ web pages crawled from www.asu.edu is made available too, along with the file generated by LinkGen on the new crawl. This enables you to unjar the files and immediately see how LinkExtract works.
Project1class.jar contains the class files for the above mentioned interfaces and also the index file.
Project1docs.jar gives the API documentation for the given code.
Project1src.jar contains the source code, which can be extended for the project.
You must have the JDK installed on your system to proceed with installing the downloaded programs. If, after installing the JDK, you get "command unknown" for java, javac, jar, javadoc, etc., check your path and classpath settings. For Windows-based systems, check your Autoexec.bat (95/98) or Environment Variable settings (Windows NT). For Unix/Linux systems, set the path and classpath in your .cshrc.
To extract the contents of the downloaded files, move to the directory where you want to install them and type
jar xvf Project1*.jar
Make sure to add this directory to your classpath to avoid problems.
Project1class.jar contains compiled code for the interfaces in the source, and also the documents crawled by the crawler and an index generated from them. If you are interested in checking the query engine you can try
java CSE494.SearchFiles
To check the WebCrawler on a site of your choice invoke the WebCrawler as
java CSE494.Webcrawl http://siteURL
Invoke all the programs from the directory where you executed the "jar" command; this way all the path settings inside the class files will work properly. Note that the crawler stores the documents in the same directory from which it was invoked.
Prj1pgrank.jar contains the source and class files for LinkGen and LinkExtract, as well as the documentation for these files.
Prj1pgrankdata.jar contains the files from which the link matrix is generated; these files are used by Task 2. This file also contains "Hashedlinks", the link matrix generated by LinkGen. A repository of web pages crawled from ASU is provided as part of the jar, and the link matrix has been generated from this repository under the directory "result2". For TASK 2 you should use this repository, since the crawl was localized to www.asu.edu to extract a better link graph.