Copy the IntCitations.txt and IntLinks.txt files to your project directory; they should be peers of the src, bin, and result3 folders.
Copy the LinkAnalysis.java file to your source code folder (src/edu/asu/irs13).
Run LinkAnalysis.java and verify that it prints the document numbers that doc #3 points to and that point to doc #3.
Read the main() function of LinkAnalysis.java and familiarize yourself with how the functions are called.
Task 1
Implement the Authorities/Hubs computation as discussed in class. Specifically, for a given query Q:
Construct the "root set" by identifying the "top K" answers using the TF/IDF vector space search
developed in Part 1 (experiment with values of K; tip: a larger K means more processing time, so start
with K = 10).
Then grow these pages into a "base set" of pages. The base set comprises all the documents in the root set,
plus all the documents that they link to and all the documents that link to them. To build the adjacency
matrix for the base set of pages, use the LinkAnalysis.java file and its getLinks and getCitations functions.
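As a concrete illustration, the base-set growth step can be sketched as follows. This is only a sketch: the link structure is passed in as plain maps, whereas in the project the outgoing links and citations would come from LinkAnalysis's getLinks and getCitations; the class and method names here (BaseSet, grow) are hypothetical.

```java
import java.util.*;

public class BaseSet {
    // Grow a root set into a base set: keep the root documents, and add
    // every document they link to (outgoing links) plus every document
    // that links to them (citations). The link structure is supplied as
    // plain maps here; in the project it would be read via LinkAnalysis.
    public static Set<Integer> grow(Set<Integer> root,
                                    Map<Integer, int[]> links,
                                    Map<Integer, int[]> citations) {
        Set<Integer> base = new HashSet<>(root);
        for (int doc : root) {
            for (int out : links.getOrDefault(doc, new int[0])) base.add(out);
            for (int in : citations.getOrDefault(doc, new int[0])) base.add(in);
        }
        return base;
    }

    public static void main(String[] args) {
        // Root set {1}; doc 1 links to 2 and 3 and is cited by 4,
        // so the base set should be {1, 2, 3, 4}.
        Map<Integer, int[]> links = Map.of(1, new int[]{2, 3});
        Map<Integer, int[]> citations = Map.of(1, new int[]{4});
        System.out.println(grow(new HashSet<>(List.of(1)), links, citations));
    }
}
```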
Next, perform the authority/hub computation using the adjacency matrix and return the top N
authorities and top N hubs (N = 10 would be a good bet).
Compare and analyze the results given by the vector space model for query Q with those obtained from
the authority/hub computation.
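The iterative authority/hub computation described above can be sketched as a power iteration over the base-set adjacency matrix. This is a minimal sketch, not the provided code: the class name Hits and the dense int[][] matrix representation are illustrative choices.

```java
public class Hits {
    // adj[i][j] == 1 means base-set page i links to base-set page j.
    // Repeatedly apply: auth = A^T * hub, then hub = A * auth,
    // normalizing after each round so scores do not grow without bound.
    public static double[][] computeHits(int[][] adj, int iterations) {
        int n = adj.length;
        double[] auth = new double[n], hub = new double[n];
        java.util.Arrays.fill(auth, 1.0);
        java.util.Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] newAuth = new double[n];
            for (int j = 0; j < n; j++)              // authority of j: sum of
                for (int i = 0; i < n; i++)          // hub scores of pages
                    if (adj[i][j] == 1) newAuth[j] += hub[i];  // linking to j
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++)              // hub of i: sum of
                for (int j = 0; j < n; j++)          // authority scores of
                    if (adj[i][j] == 1) newHub[i] += newAuth[j]; // i's targets
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][]{auth, hub};            // [0] = auth, [1] = hub
    }

    private static void normalize(double[] v) {      // scale to unit length
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }

    public static void main(String[] args) {
        // Tiny example: 0 -> 2, 0 -> 3, 1 -> 2. Page 2 should be the top
        // authority (two in-links) and page 0 the top hub (two out-links).
        int[][] adj = {{0, 0, 1, 1}, {0, 0, 1, 0}, {0, 0, 0, 0}, {0, 0, 0, 0}};
        double[][] scores = computeHits(adj, 20);
        System.out.printf("auth[2]=%.3f auth[3]=%.3f hub[0]=%.3f hub[1]=%.3f%n",
                scores[0][2], scores[0][3], scores[1][0], scores[1][1]);
    }
}
```

After the iteration settles, sorting pages by the two score vectors and taking the first N entries yields the top N authorities and top N hubs.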
Submit:
2 points - Authorities/hubs code (comments are important; submit by email to cse494s15@gmail.com).
2 points - Top 10 authorities output for K = 10, N = 10, with TF/IDF similarity, using the sample
queries below (hard copy; URLs or doc IDs).
2 points - Top 10 hubs output.
3 points - What effect does changing the root set size have on quality (relevance) of results? Why?
3 points - What effect does changing the root set size have on time taken? Why?
3 points - Compare the time taken by various phases of the algorithm.
Task 2
Compute PageRank for all the files from the ASU crawl given as part of the code.
(Note: you are not required to process the crawled files yourself. The LinkAnalysis.java code together
with the IntLinks and IntCitations files can be used to extract the link structure needed to compute PageRank.)
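A common way to compute PageRank is power iteration with a damping factor (referred to as c in the questions below). The sketch here works on an in-memory links array rather than the real IntLinks/IntCitations data; the class and method names are hypothetical, and sink pages (no outgoing links) spread their rank uniformly, which is one standard choice.

```java
public class PageRank {
    // links[i] lists the documents that document i points to.
    // Each round: rank[j] = (1 - c)/n + c * sum over i linking to j of
    // rank[i] / outdegree(i). Sinks spread their rank uniformly.
    public static double[] compute(int[][] links, double c, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);            // uniform start
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - c) / n);    // random-jump term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {              // sink page
                    for (int j = 0; j < n; j++) next[j] += c * rank[i] / n;
                } else {
                    for (int j : links[i])
                        next[j] += c * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // A 3-cycle (0 -> 1 -> 2 -> 0) must converge to equal ranks of 1/3.
        double[] r = compute(new int[][]{{1}, {2}, {0}}, 0.85, 50);
        System.out.printf("%.4f %.4f %.4f%n", r[0], r[1], r[2]);
    }
}
```

In practice you would also test convergence (stop when the rank vector changes by less than some epsilon between iterations) rather than running a fixed number of rounds; that makes the convergence question below easy to answer.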
Given a query Q, return the "top 10" results for the query by combining PageRank and vector
space similarity values.
To combine the similarities, use the formula:
w * (PageRank) + (1 - w) * (Vector Space Similarity)
where 0 < w < 1.
Vector space similarity should be computed with TF/IDF.
Make provisions for varying the value of w at query time.
To combine it with the vector space score, normalize PageRank so that it lies between 0 and 1.
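The normalization and weighted combination can be sketched like this. Min-max scaling is one reasonable way to map raw PageRank values into [0, 1]; the class and method names are illustrative, not part of the provided code.

```java
public class CombinedScore {
    // Map raw PageRank values into [0, 1] with min-max scaling so they
    // are on the same scale as the (already 0..1) TF/IDF similarity.
    public static double[] normalize(double[] pageRanks) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : pageRanks) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[pageRanks.length];
        for (int i = 0; i < pageRanks.length; i++)
            out[i] = (max > min) ? (pageRanks[i] - min) / (max - min) : 0.0;
        return out;
    }

    // w * PageRank + (1 - w) * vector space similarity, with 0 < w < 1
    // supplied at query time.
    public static double combine(double w, double pageRank, double vsSimilarity) {
        return w * pageRank + (1 - w) * vsSimilarity;
    }

    public static void main(String[] args) {
        double[] norm = normalize(new double[]{2.0, 4.0, 3.0});
        System.out.printf("normalized: %.1f %.1f %.1f%n", norm[0], norm[1], norm[2]);
        System.out.printf("combined: %.2f%n", combine(0.4, norm[1], 0.5));
    }
}
```

Ranking then amounts to computing combine(w, normalizedRank[doc], similarity[doc]) for every candidate document and sorting by the result.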
Note: because of the size of the data handled in this part of the project, you
might encounter OutOfMemoryError exceptions. Increase the amount of memory available to your Java
application by using the -Xmx512m switch. (See this screencast for more information on how to do that;
skip to the 2:00 minute mark.)
Submit:
2 points - PageRank/similarity code (comments are important).
2 points - PageRank/similarity output (sample queries below; use w = 0.4).
6 points - Which results do you think are more relevant - TF/IDF, Authorities, Hubs
or PageRank? Why? Support your answer with examples or statistics.
3 points - What effect does varying w have on the relevance of the results? Why?
3 points - What effect does varying the damping factor c have on the relevance of the results? On the time taken? Why?
3 points - Did your PageRank computation converge? How many iterations did it take?
How much memory did it take?
3 points - Which document had the highest PageRank? Does that make sense?
3 points - Is there any correlation between high authority values and high
PageRank values? What about hub values and PageRank values? Does this make sense?
Sample Queries
These queries are the only ones you have to submit output for. However, you should
consider trying a few more queries to help answer the result relevance questions.
Campus Tour
Transcripts
Admissions
Employee Benefits
Parking Decal
SRC
Optional background
In this part, the link structure has been precomputed for you. If you are curious how
that was accomplished, read this section.
This part of the experiment requires access to the entire index of the HTML files
crawled from ASU, unlike Part 1, where you could have used just the Lucene index.
Do the following to get access to this crawl:
Download this package: Projectclass.jar.
Although the filename says "jar", it is actually a zip file.
Open the zip file using your favorite zip extractor, select the result3 folder, and extract it
into the irs13 folder in your Eclipse workspace. After this step, the result3 folder should be a peer of
the bin, lib, and src folders.
There are a lot of files (25,054), so this operation might take up to 15 minutes to complete.
The package also contains two Java files (LinkGen.java and LinkExtract.java) and one HashedLinks file.
The Java files should be pasted into the folder \cse494-v1\src\edu\asu\cse494 in your Eclipse workspace.
The HashedLinks file should be pasted into the folder \cse494-v1 in your Eclipse workspace.
LinkGen.java is used to generate the HashedLinks file from the link matrix of the crawled
webpages. The HashedLinks file is already provided to you; you can look through LinkGen.java if you
are curious about how HashedLinks is generated and how it behaves.
LinkExtract.java provides methods to get information about pages pointed to by a given page and
pages pointing to a given page. Run the program for a demo.