Information Retrieval Project

CSE 494/598 - Project

PART A PART B

Project Description

This is the part A of the project for CSE 494/598. You are provided with a system that can extract web pages and index them. Using this system you will experiment with various Ranking Algorithms.

TASK 0: Read the background description, download the code, and try the boolean ranking code in SearchFiles.java made available as part of the code. You may try queries made up of keywords related to AI (planning, information retrieval, bayes network etc. etc.). You can also try negated keywords (e,g, -network).
TASK 1: Implement the Vector Space Model to rank the documents. You can use the weighting method given in the text or the one given in the homework question 2.1.
- NOTE: For a query q and document di Similarity(q,di)=q.di/|q||di|. Computation of |di|, the 2-norm of the term weights of all terms contained in the document takes more time as you need to scan through every hit document and then all terms it contains to get the normalization factor. So better precompute the 2-norms for all documents, rather than doing it every time you compute similarity of a document.You can start with VectorViewer.java code for Vector space ranking.
- SUBMIT:
  - Hardcopy showing the top 10 documents using Vector Space model for the 10 example queries below
  - Hardcopy of your code
TASK 2: Implement the LSI algorithm discussed in the class to rank the documents. First generate the top 50 documents using the Vector Space Model for a given query. Form a term-document matrix for these 50 documents, use the SVD decomposition method from JAMA package below and then find the cosine similarity. Compare the LSI and Vector Space Ranking results.You will have to submit a hardcopy showing the top 10 documents for any 10 sample queries using LSI.
- SUBMIT:
  
  Hardcopy showing the top 10 documents using LSI for the 10 example queries below
  Comparison of these results to the results in TASK 1 in your own words
  Hardcopy of your code.

Example Queries for TASK 1 and TASK 2

Fall semester
Grades
SRC
Computer Science
Decal
Parking
Want to know where the computer science department is
Transcripts
Timecards
Admissions

Resource Description & Documentation

The code made available as part of the project is written in Java. A latest version of Sun's Java Development Kit can be downloaded from this site. If you have a earlier version of JDK installed on your system there is no need to upgrade.
Knowledge of Java is a pre-requisite for the course. If you need to brush up your knowledge read the Java Tutorial. The Java API documentation which describes the class structure for various standard packages is available here.

The following functionality is provided by the classes provided as part of the project.

A crawler implemented by Webcrawl.java that crawls the web starting from the seed URL provided on invocation. This crawler stores the first 1000 URLs that it successfully crawls. The result of one such crawl is made available as part of the code. Files from one such run on http://www.aaai.org/ were used to generate the index file given as part of the project.
The IndexHTML.java contains code to create the Index from the crawled URLs. The index generated is made available as part of the Project1class.jar The index is generated using the LUCENE API. To use this API as part of your code you have to import com.lucene.*.
The program SearchFiles.java gives a query interface that accepts the queries and retrives the matching documents.The interface accepts BooleanQueries. Currently the interface is plain text, a HTML version will be uploaded soon. Also the program assumes "index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
The program VectorViewer.java is a demo program that lists the total number of documents in the index and the lists all the terms in the index with their frequencies. Since the given code does not contain any data structure giving you the Document Vector Model, this program demonstrates how you can retrieve the underlying Document Vector Model for use in the various ranking alogrithms that you develop. Read the commented section inside the code to extract the (document,freq) for each term in the index to generate the document vector model.

The API documentation for the given classes is available for download and also can be viewed here.

Download & Setup

To download the code, Right Click on the Links below and select Save As from the Popup Menu .

Project1class.jar contains the class files for the above mentioned interfaces and also around 800 indexed documents in AI domain.
Project1docs.jar gives the API documentation for the given code.
Project1src.jar contains the source code which can be extened for the project.
JAMA package : Download the source code from this website and look at the documentation to use the SVD decomposition method for LSI Algorithm
You must have the JDK installed on your system for you to proceed with installing the downloaded programs. If after installing the JDK you are getting command unknown for "java, javac,jar,javadoc etc." check your path and classpath settings. For Windows based systems check your Autoexec.bat (95,98) or Environment Variable Settings(Win NT). For Unix,Linux systems set path and classpath in your .cshrc.
To extract the contents of the downloaded files. Move to the directory where you want to install them and type
- jar xvf Project1*.jar
Project1class.jar contains compiled code for the interfaces in the source and also documents crawled by the crawler and an index generated from it. If you are interested in checking the query engine you can try
To check the WebCrawler on a site of your choice invoke the WebCrawler as
Invoke all the programs from the directory where you executed the "jar" command. This way all the path setting inside the class files will work properly.Note that the Crawler stores the documents in the same directory from where it has been invoked.
Pgrankdata.jar contains the files from which the link matrix is generated. These files are used by Task 3. This file also contains "Hashedlinks", the link matrix generated by LinkGen. A repository of web pages crawled from ASU is provided as part of the jar. The link matrix has been generated on this repository under directory "result3".

Deadline : Due date for part A of the project : 11/04.

PART B

Project Description (Part B)

This is the part B of the project for CSE 494/598.

TASK 3: Implement the Authorities/Hub computation as discussed in the class. Specifically, for a given query Q, you first find top K most similar links using vector space model. Next, you grow these pages into a base set of pages. You make the adjacency matrix for these base set pages using the link extraction module provided below. You will then do the authority/hub computation on the adjancency matrix, and return the top m authorities and top m hubs. You will compare the results from pure vector space to the results from authorities/hub computation for a set of queries.
- NOTE: The new .jar file that contains a crawl of asu.edu domain (around 3000 pages), as well as its link matrix. You need to use this data and not the data from part A for this as well as the later tasks.
- SUBMIT:
  - Hardcopy showing the top 10 authorities and top 10 hubs using Authority Hub computation for the 10 example queries below
  - Comparison of these results to the results in TASK 1 and TASK 2 in your own words
  - Hardcopy of your code
TASK 4: Do the pagerank calculation for the entire ASU crawl. Then given a query, obtain the top 10 documents by combining the pagerank and vector space similarity using w*(pagerank)+(1-w)*(Vector space similarity) where w varies between 0 and 1. You may want to experiment with different weights. (Hint: Make sure to normalize he page rank such that it, like vector similarity, is between 0 and 1.)
- SUBMIT:
  - Hardcopy showing the results for the below queries when combinining pagerank with vector space model
  - A report on your observations as w varies from 0 to 1; and also comparison with A/H method.
  - Hardcopy of your code
TASK 5: Cluster the query results before presenting them to user. Given a query, obtain the top 50 or 100 documents using vector space model (or vector space with page-rank model) and use a clustering algorithm (bisecting k-means is preferred) to cluster the top documents into k clusters. You can experiment with different k values between 3 and 10.
- SUBMIT:
  
  Hardcopy showing the clusters for the sample queries below
  Analyze the effect of number of clusters, and also your own evaluation of whether the clusters seem to correspond to any natural categories.
  Hardcopy of your code
TASK 6: Demonstration of Tasks 1 through 5. The Demo would be for around 10-15 minutes. You will be asked to run the code on some sample queries (in the week of December 9th).
Extra credit possibilities: You can consider the following for extra credit:

A nice GUI for your search engine
Use of query expansion techniques (such as scalar clusters) on the top 50 search results.
Trying out other clustering techniques--such as Agglomerative or scalar clustering to cluster search results.
Doing LSI analysis before doing clustering (in task 5) and seeing if the clustering improves if we retain only top few dimensions in LSI
Anything else that you can think of that makes your search engine cooler..

Example Queries

Fall semester
Grades
SRC
Computer Science
Decal
Parking
Want to know where the computer science department is
Transcripts
Engineering
Admissions

Resource Description & Documentation

The API documentation for the given classes is available for download and also can be viewed here.
LinkGen.java and LinkExtract.java are provided to help in computing PageRank and Authority/Hub for the crawled pages. LinkGen generates the Link Matrix from the files crawled and saves it to a file. LinkExtract uses the file written by LinkGen. Using LinkGen you can get information about pages pointed to by a page and pages pointing to a page. Pre-compiled classes for both LinkGen and LinkExtract are made available in the jar files below. A directory containing 3000+ webpages crawled on www.asu.edu is made available too. Also the file generated by LinkGen on the new crawl is present. This enables you to unjar the files and immediately see how LinkExtract works.
- After unjaring the files type
  to see a demo of the program.

Download & Setup

Prj1pgrank.jar contains the source and class files for the LinkGen and LinkExtract. It also contains the documentation for these files.
Prj1pgrankdata.jar contains the files from which the link matrix is generated. These files are used by Tasks 3 and 5. This file also contains "Hashedlinks", the link matrix generated by LinkGen. A repository of web pages crawled from ASU is provided as part of the jar. The link matrix has been generated on this repository under directory "result2".
Index.jar is the index for the 3000 asu pages and contains around 37854 terms. This is the index you ahve to use when computing vector space similarity to combine with pagerank and A/H computation.

Deadline : 9th December

Sree

Last modified: Mon Feb 5 22:34:10 MST 2001