CSE 494 - Information Retrieval - Project Part 2
Project Description
Task 1
Implement the Authorities/Hubs computation as discussed in class. Specifically, for a given query Q,
Construct the "root set" by identifying the "top K" answers using the vector space search developed
in Part 1 (experiment with values of K; note that a larger K means more processing time, so start with K = 10).
Then grow these pages into a "base set" of pages. To build the adjacency matrix for the base set
of pages, use "LinkExtract.java" and the file "HashedLinks" provided as part of the code (see details below).
Next, perform the authority/hub computation using the adjacency matrix and return the "top N" authorities and
"top N" hubs (N = 10 would be a good bet).
Compare and analyze the results given by vector space search for query Q with those obtained from the
authority/hub computation.
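The authority/hub iteration on the base set can be sketched as follows. This is a minimal illustration on a hand-built adjacency matrix; the class and method names (HitsSketch, computeHits) are hypothetical and not part of the provided LinkExtract code, which you would use to build the actual matrix.

```java
import java.util.Arrays;

// Minimal sketch of the mutually reinforcing authority/hub (HITS) update.
// adj[i][j] == true means page i links to page j. Illustrative names only.
public class HitsSketch {

    // Iterates the updates and returns {authority[], hub[]}, each L2-normalized.
    public static double[][] computeHits(boolean[][] adj, int maxIters) {
        int n = adj.length;
        double[] auth = new double[n];
        double[] hub = new double[n];
        Arrays.fill(auth, 1.0);
        Arrays.fill(hub, 1.0);

        for (int iter = 0; iter < maxIters; iter++) {
            // Authority update: a(p) = sum of hub scores of pages linking to p.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (adj[i][j]) newAuth[j] += hub[i];
            // Hub update: h(p) = sum of authority scores of pages p links to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (adj[i][j]) newHub[i] += newAuth[j];
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { auth, hub };
    }

    private static void normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) for (int i = 0; i < v.length; i++) v[i] /= norm;
    }

    public static void main(String[] args) {
        // Tiny base set: pages 0 and 1 both link to page 2, so page 2
        // should emerge as the authority and pages 0, 1 as hubs.
        boolean[][] adj = {
            { false, false, true },
            { false, false, true },
            { false, false, false }
        };
        double[][] scores = computeHits(adj, 50);
        System.out.println("authorities: " + Arrays.toString(scores[0]));
        System.out.println("hubs:        " + Arrays.toString(scores[1]));
    }
}
```

For the "top N" step you would sort the base-set page indices by authority score (and separately by hub score) and report the first N.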
Submit:
Hardcopy showing the Top N authorities and hubs using the A/H computation for the sample queries
given below. For this part, you must set K = 10, N = 10 and use the tf-idf similarity to build your root set.
A report comparing and analyzing the A/H results with those given by pure Vector Space based search.
How does varying the size of the "root set" affect the results of the A/H computation? Which results are more
relevant: authorities or hubs? Comments?
Hardcopy of your code with comments.
Task 2
Compute PageRank for all the files from the ASU crawl given as part of the code.
(Note: You are not required to process the crawled files. The LinkExtract.java code together with the file
HashedLinks can be used to identify the link structure necessary to determine PageRank).
Given a query Q, return the "Top 10" results for the query by combining PageRank and Vector
Space similarity values.
To combine the similarities, use the formula:
w * (PageRank) + (1 - w) * (Vector Space Similarity)
where 0 < w < 1.
Do make provisions for varying the value of 'w' at query time.
To combine with the Vector Space rank, normalize PageRank so that it lies between 0 and 1.
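The PageRank iteration and the weighted combination above can be sketched as follows. This assumes the damping-factor form of the formula from the class notes, PR(p) = c * (link contributions) + (1 - c)/n; the class and method names are illustrative and not part of the provided LinkGen/LinkExtract code. Here PageRank is normalized by its maximum value so it lies in [0, 1] before combining, which is one simple way to satisfy the normalization requirement.

```java
import java.util.Arrays;

// Sketch of PageRank with damping factor c, plus the w-weighted
// combination with vector space similarity. Illustrative names only.
public class PageRankSketch {

    // outLinks[i] lists the pages that page i points to.
    // Implements: PR(p) = c * sum(PR(q)/outdeg(q)) + (1 - c)/n,
    // with sink pages spreading their rank uniformly.
    public static double[] pageRank(int[][] outLinks, double c, int iters) {
        int n = outLinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - c) / n);
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    // Sink page: distribute its rank evenly to all pages.
                    for (int j = 0; j < n; j++) next[j] += c * pr[i] / n;
                } else {
                    for (int j : outLinks[i])
                        next[j] += c * pr[i] / outLinks[i].length;
                }
            }
            pr = next;
        }
        return pr;
    }

    // Combined score: w * normalizedPageRank + (1 - w) * vectorSpaceSim.
    // Dividing by the maximum PageRank keeps the first term in [0, 1].
    public static double combined(double pageRank, double maxPageRank,
                                  double vsSim, double w) {
        return w * (pageRank / maxPageRank) + (1 - w) * vsSim;
    }

    public static void main(String[] args) {
        // Tiny graph: 0 -> 1, 1 -> 0, 2 -> 0 and 2 -> 1.
        int[][] links = { { 1 }, { 0 }, { 0, 1 } };
        double[] pr = pageRank(links, 0.85, 100);
        System.out.println("PageRank: " + Arrays.toString(pr));
    }
}
```

Making w a query-time parameter is then just a matter of passing it into combined() when scoring each result.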
Submit:
The results for sample queries derived using a combination of PageRank and
Vector Space ranking. For this part you must set w to 0.4.
A report comparing and analyzing the A/H results with those given by PageRank+VectorSpace.
Comment on the effects of varying 'w' between 0 and 1.
Comment on the effects of varying the value of "c" (see formula for PageRank computation in class notes).
Does PageRank computation converge?
Hardcopy of your code with comments.
Extra Credit
Implement a GUI for your search engine.
Make provisions for selecting Vector Space, A/H, PageRank + Vector Space model for ranking your answers.
The GUI should preferably be a stand-alone application or applet. The GUI can be servlet-based only if
you have access to a personal Web server that can be accessed by the TA.
You are encouraged to make the output as "Googlish" as you can, i.e., a simple interface, links that open the
source documents, anything else you fancy.
Note: An end-of-semester demo of all tasks will be required, and a GUI could come in handy at that time.
Hence you are highly encouraged to implement one.
Additional extra credit tasks are certainly possible; consult the instructor and/or the TAs.
Sample Queries
Campus Tour
Medical Care
Transcripts
Admissions
Employee Benefits
Languages
Hayden Librari
Parking Decal
SRC
Resources and getting started
This part of the project requires access to the entire index of the HTML files crawled from ASU, unlike
Part 1, where the Lucene index alone was sufficient. Do the following to get access to this crawl:
Download this package Projectclass.jar. Although the filename says "jar", it is actually
a zip file.
Open the zip file using your favorite zip extractor. Select the folder result3 and extract that
into the cse494-v1 folder in your Eclipse workspace. After this step, the result3 folder should be a peer of
the bin, lib, src folders.
There are many files (25,054), so this operation might take up to 15 minutes to complete.
Download the package cse494-v2.zip.
This contains two java files (LinkGen and LinkExtract) and one HashedLinks file.
The Java files should be placed in the folder \cse494-v1\src\edu\asu\cse494 in your Eclipse workspace.
The HashedLinks file should be placed in the folder \cse494-v1 in your Eclipse workspace.
LinkGen.java is used to generate the file HashedLinks from the link matrix of the crawled
webpages. The file HashedLinks is already provided to you; you can look through LinkGen.java if you
are curious how HashedLinks is generated and how it behaves.
LinkExtract.java provides methods to get information about pages pointed to by a given page and
pages pointing to a given page. Run the program for a demo.
Note: Because of the size of the data handled in this part of the project, you
might encounter OutOfMemory errors. You should increase the amount of memory available to your Java
application by using the -Xmx512m switch. (See this for more information on how to do that; skip to the 2:00 minute mark.)
Good luck!