CSE 494 - Information Retrieval - Project Part 2
Project Description
Task 0: Getting Started
- Download and extract the file f11-part2.zip
- Copy the IntCitations.txt and IntLinks.txt files to your project directory. They should be peers of the src, bin, and result3 folders.
- Copy the LinkAnalysis.java file to your source code folder (src/edu/asu/cse494)
- Run LinkAnalysis.java and verify that it produces the numbers of the documents that point to doc #3 and of those that doc #3 points to.
- Read the main() function of LinkAnalysis.java and familiarize yourself with how the functions are called.
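If it helps, here is a rough sketch of the call pattern you should see in main(). The signatures below are assumptions for illustration only; check LinkAnalysis.java for the real ones.

    // Assumed shape of the LinkAnalysis API -- verify the actual
    // signatures in LinkAnalysis.java before relying on them.
    LinkAnalysis la = new LinkAnalysis();
    int[] outLinks = la.getLinks(3);      // documents that doc #3 points to
    int[] inLinks  = la.getCitations(3);  // documents that point to doc #3
    System.out.println(java.util.Arrays.toString(outLinks));
    System.out.println(java.util.Arrays.toString(inLinks));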
Task 1
Implement the Authorities/Hubs computation as discussed in class. Specifically, for a given query Q,
construct the "root set" by identifying the "top K" answers using the TF-IDF vector space search developed in Part 1. (Experiment with values of K; tip: a larger K means more processing time, so start with K = 10.)
Then grow these pages into a "base set" of pages. To build the adjacency matrix for the base set
of pages, use the LinkAnalysis.java file and the functions getLinks and getCitations.
Next, run the authority/hub computation on the adjacency matrix and return the "top N" authorities and
"top N" hubs (N = 10 would be a good bet). A sketch of this iteration is shown at the end of this task description.
Compare and analyze the results returned by the vector space model for query Q with those obtained from the
authority/hub computation.
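To make the iteration concrete, here is a minimal, self-contained sketch of the authority/hub (HITS) power iteration over a toy base set. The adjacency matrix, the fixed iteration count, and the L2 normalization are illustrative assumptions; in your code the matrix comes from getLinks/getCitations over the base set, and you may prefer to iterate until convergence.

    import java.util.Arrays;

    // Minimal HITS sketch: adj[i][j] == 1 means base-set page i links to
    // base-set page j. Both score vectors are updated from the previous
    // iteration's values and L2-normalized so they stay bounded.
    public class HitsSketch {

        static void normalize(double[] v) {
            double norm = 0;
            for (double x : v) norm += x * x;
            norm = Math.sqrt(norm);
            if (norm > 0) {
                for (int i = 0; i < v.length; i++) v[i] /= norm;
            }
        }

        public static void main(String[] args) {
            int[][] adj = { {0, 1, 1}, {0, 0, 1}, {0, 0, 0} };  // toy 3-page base set
            int n = adj.length;
            double[] auth = new double[n];
            double[] hub = new double[n];
            Arrays.fill(auth, 1.0);
            Arrays.fill(hub, 1.0);

            for (int iter = 0; iter < 25; iter++) {  // or loop until the scores converge
                double[] newAuth = new double[n];
                double[] newHub = new double[n];
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++) {
                        if (adj[i][j] == 1) {
                            newAuth[j] += hub[i];    // authority: sum of hubs pointing in
                            newHub[i] += auth[j];    // hub: sum of authorities pointed to
                        }
                    }
                }
                normalize(newAuth);
                normalize(newHub);
                auth = newAuth;
                hub = newHub;
            }
            System.out.println("authorities: " + Arrays.toString(auth));
            System.out.println("hubs:        " + Arrays.toString(hub));
        }
    }

Sorting the final auth and hub vectors and taking the top N entries gives the two ranked lists to report.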
Submit:
- 5 points - Authorities / Hubs code. (comments are important; submit by email)
- 2 points - Top 10 authorities output, for K=10, N=10, TF-IDF similarity, using the sample queries below. (hard copy; URLs or doc IDs)
- 2 points - Top 10 hubs output.
- 3 points - Are the Authorities / Hubs results more meaningful than Vector Space results? Why?
- 3 points - What effect does changing the root set size have on quality of results? Why?
- 3 points - What effect does changing the root set size have on time taken? Why?
- 3 points - Between Auth and Hubs, which is more relevant? Why do you think so? Support your answer with evidence from your output.
- 2 points - Compare the time taken by various phases of the algorithm.
Task 2
Compute PageRank for all the files from the ASU crawl given as part of the code.
(Note: You are not required to process the crawled files. The LinkAnalysis.java code together with the files
IntLinks and IntCitations can be used to identify the link structure necessary to determine PageRank).
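As a reference point, here is a minimal, self-contained sketch of the PageRank power iteration. The outLinks array, the reset probability c = 0.15, and the convergence threshold are illustrative assumptions; in your code the link structure comes from LinkAnalysis and the IntLinks/IntCitations files.

    import java.util.Arrays;

    // Minimal PageRank sketch: outLinks[i] lists the pages that page i
    // points to. Sink pages (no out-links) spread their rank uniformly,
    // which keeps the scores a proper probability distribution.
    public class PageRankSketch {

        static double[] pageRank(int[][] outLinks, double c, double eps) {
            int n = outLinks.length;
            double[] pr = new double[n];
            Arrays.fill(pr, 1.0 / n);              // start from the uniform distribution

            while (true) {
                double[] next = new double[n];
                Arrays.fill(next, c / n);          // random-reset ("teleport") mass
                for (int i = 0; i < n; i++) {
                    if (outLinks[i].length == 0) { // sink page
                        for (int j = 0; j < n; j++) next[j] += (1 - c) * pr[i] / n;
                    } else {
                        double share = (1 - c) * pr[i] / outLinks[i].length;
                        for (int j : outLinks[i]) next[j] += share;
                    }
                }
                double delta = 0;                  // L1 change between iterations
                for (int i = 0; i < n; i++) delta += Math.abs(next[i] - pr[i]);
                pr = next;
                if (delta < eps) return pr;        // converged
            }
        }

        public static void main(String[] args) {
            int[][] outLinks = { {1, 2}, {2}, {0} };  // toy 3-page graph
            System.out.println(Arrays.toString(pageRank(outLinks, 0.15, 1e-8)));
        }
    }

Counting the iterations until delta falls below the threshold also answers the convergence question in the submission list.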
Given a query Q, return the "top 10" results for the query by combining the PageRank and vector space
similarity values.
To combine the similarities, use the formula:

    w * (PageRank) + (1 - w) * (Vector Space Similarity)

where 0 < w < 1.
Do make provisions for varying the value of w at query time.
To combine with the vector space rank, normalize the PageRank values so that they lie between 0 and 1.
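A hedged sketch of the combination step: dividing by the maximum raw PageRank is one simple way to map the scores into [0, 1] (min-max scaling is another option); the class and method names here are placeholders.

    // Combine a normalized PageRank with the vector space similarity.
    // w is supplied at query time, 0 < w < 1.
    public class CombineSketch {

        static double combined(double pageRank, double maxPageRank,
                               double vsSim, double w) {
            double normPr = (maxPageRank > 0) ? pageRank / maxPageRank : 0.0;
            return w * normPr + (1 - w) * vsSim;
        }

        public static void main(String[] args) {
            // e.g., w = 0.4, as used for the submitted output
            System.out.println(combined(0.002, 0.01, 0.7, 0.4));
        }
    }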
Note: Because of the size of the data we are handling in this part of the project, you might
encounter OutOfMemoryError exceptions. You should increase the amount of memory available to your Java
application by using the -Xmx512m switch. (See the screencast for more information on how to do that;
skip to the 2:00 minute mark.)
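For example, from the command line (the class name below is just a placeholder for your own entry point):

    java -Xmx512m edu.asu.cse494.SearchMain

In Eclipse, the same switch goes in the VM arguments box of your launch configuration (Run > Run Configurations... > Arguments).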
Submit:
- 5 points - PageRank / similarity code. (comments are important)
- 3 points - PageRank / similarity output (use the sample queries below with w = 0.4).
- 3 points - Which results do you think are more relevant - Auth/Hubs or PageRank/Similarity? Why?
- 3 points - What effect does varying w have on the relevance of the results? Why?
- 3 points - What effect does varying c (the reset probability used in the PageRank computation) have on the relevance of the results? On the time taken? Why?
- 3 points - Did your PageRank computation converge? How many iterations did it take?
Extra Credit
10 points: Implement a GUI for your search engine.
Make provisions for selecting the Vector Space, A/H, or PageRank + Vector Space model for ranking your answers.
The GUI should preferably be a standalone application or applet. The GUI can be servlet-based only if
you have access to a personal Web server that the TA can reach.
You are encouraged to make the output as "Googlish" as you can, i.e., a simple interface, links that open
the source documents, and anything else you fancy.
Note: An end-of-semester demo of all tasks will be required, and a GUI could come in handy at that time.
Hence you are highly encouraged to implement one.
Additional extra credit tasks are certainly possible; consult the instructor and/or the TAs.
Sample Queries
These are the only queries you have to submit output for. However, you should consider trying some additional queries to help answer the result relevance questions.
Campus Tour
Transcripts
Admissions
Employee Benefits
Parking Decal
SRC
Optional background
- In this part, the link structure has been precomputed for you. If you are curious how that was accomplished, read this section.
- This part of the project requires access to the entire index of the HTML files crawled from ASU, unlike Part 1, where you could have used just the Lucene index. Do the following to get access to this crawl:
Download this package Projectclass.jar. Although the filename says "jar", it is actually
a zip file.
Open the zip file using your favorite zip extractor. Select the folder result3 and extract it
into the cse494-v1 folder in your Eclipse workspace. After this step, the result3 folder should be a peer of
the bin, lib, and src folders.
There are a lot of files (25,054), so this operation might take up to 15 minutes to complete.
Download the package cse494-v2.zip.
This contains two java files (LinkGen and LinkExtract) and one HashedLinks file.
The Java files should be placed in the folder \cse494-v1\src\edu\asu\cse494 in your Eclipse workspace.
The HashedLinks file should be placed in the folder \cse494-v1 in your Eclipse workspace.
LinkGen.java is used to generate the HashedLinks file from the link matrix of the crawled
webpages. The HashedLinks file is already provided to you; you can look through LinkGen.java if you
are curious about how HashedLinks is generated and how it behaves.
LinkExtract.java provides methods to get the pages pointed to by a given page and the pages pointing to a given page. Run the program for a demo.
Good luck!