This is the first project for CSE 494/598. You are provided with a system
that can extract web pages and index them. Using this system you will
experiment with various Ranking Algorithms.
TASK 0: Read the background description, download the code, and try the Boolean ranking code made available as part of the code. You may try queries made up of keywords related to AI (planning, information retrieval, Bayes network, etc.). You can also try negated keywords (e.g., -network).
TASK 1: Implement the Vector Space Model to rank the documents. You can use the weighting method given in the text or the one given in homework question 1.1.
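As a concrete starting point, here is a minimal sketch of one possible TF-IDF weighting with cosine-similarity ranking. The class and method names (VectorSpaceSketch, rankByCosine) and the input maps are hypothetical; the actual document vectors are built from the Lucene index, as discussed in the VectorViewer section below.

import java.util.*;

// Minimal sketch of vector-space ranking with TF-IDF weights and cosine
// similarity. Assumes document vectors (term -> raw term frequency) have
// already been extracted from the index.
public class VectorSpaceSketch {

    // w(t,d) = tf(t,d) * log(N / df(t)), one common weighting scheme.
    static double tfidf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Rank documents by cosine similarity to the query vector.
    static List<Map.Entry<String, Double>> rankByCosine(
            Map<String, Map<String, Integer>> docVectors,  // docId -> (term -> tf)
            Map<String, Integer> docFreq,                   // term -> document frequency
            Map<String, Integer> queryTf) {                 // query term -> tf in query
        int n = docVectors.size();

        // Precompute the query's weighted vector and its norm.
        Map<String, Double> queryWeights = new HashMap<String, Double>();
        double qNorm = 0.0;
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            Integer df = docFreq.get(q.getKey());
            if (df == null) continue;                       // term not in the index
            double w = tfidf(q.getValue(), df, n);
            queryWeights.put(q.getKey(), w);
            qNorm += w * w;
        }
        qNorm = Math.sqrt(qNorm);

        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Map<String, Integer>> e : docVectors.entrySet()) {
            double dot = 0.0, dNorm = 0.0;
            for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
                double w = tfidf(t.getValue(), docFreq.get(t.getKey()), n);
                dNorm += w * w;
                Double qw = queryWeights.get(t.getKey());
                if (qw != null) dot += w * qw;
            }
            double denom = Math.sqrt(dNorm) * qNorm;
            scores.put(e.getKey(), denom == 0 ? 0.0 : dot / denom);
        }

        // Sort documents by descending similarity score.
        List<Map.Entry<String, Double>> ranked =
            new ArrayList<Map.Entry<String, Double>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());
            }
        });
        return ranked;
    }
}

The weighting used here is tf * log(N/df); if you use the scheme from the homework question instead, only the tfidf method needs to change.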
TASK 2: Implement the Authorities/Hubs computation as discussed in class. Specifically, for a given query Q, you first find the top K most similar pages using the vector space model. Next, you grow these pages into a base set of pages. You make the adjacency matrix for this base set of pages using the link extraction module provided below. You will then do the authority/hub computation on the adjacency matrix, and return the top m authorities and top m hubs. You will compare the results from pure vector space similarity to the results from the authorities/hub computation for a set of queries.
Note that we have provided a new .jar file that contains a crawl of the asu.edu domain, as well as its link matrix. You need to use this data rather than the aaai.org crawl data that was given earlier. You may want to use the same data for Task 1 too.
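Below is a minimal sketch of the authority/hub (HITS) power iteration, assuming A is the adjacency matrix of the base set (A[i][j] = 1 iff page i links to page j) built with the link extraction module; the class and method names are hypothetical.

// Sketch of the HITS authority/hub power iteration over the base-set
// adjacency matrix A, where A[i][j] = 1 iff page i links to page j.
public class HitsSketch {

    // Returns {authorities, hubs} after a fixed number of iterations.
    static double[][] hitsScores(int[][] A, int iterations) {
        int n = A.length;
        double[] auth = new double[n];
        double[] hub = new double[n];
        java.util.Arrays.fill(auth, 1.0);
        java.util.Arrays.fill(hub, 1.0);

        for (int it = 0; it < iterations; it++) {
            // a = A^T h : a page's authority is the sum of the hub scores of pages linking to it
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    newAuth[j] += A[i][j] * hub[i];
            // h = A a : a page's hub score is the sum of the authority scores of pages it links to
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    newHub[i] += A[i][j] * newAuth[j];
            auth = normalize(newAuth);
            hub = normalize(newHub);
        }
        return new double[][] { auth, hub };
    }

    // Normalize to unit length so the scores converge instead of growing without bound.
    static double[] normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm == 0) return v;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
        return v;
    }
}

After a handful of iterations the scores typically stabilize; sorting the pages by the two score vectors gives the top m authorities and top m hubs.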
TASK 3: (Common to the two tasks above.) Evaluate your search engine on a set of at least five representative queries from the ASU context (assuming you are using the ASU crawl). Compare the results of the queries, in terms of precision, with simple vector similarity and with similarity plus authorities/hubs. Comment on whether the additional time taken by the authorities/hubs computation was worth it. You need to use *your* sense of which pages are good in order to compare precision.
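Since the crawl comes with no relevance judgments, precision here is simply the fraction of the top-ranked results that you yourself judge relevant. A minimal sketch, with a hypothetical precisionAtK helper:

import java.util.*;

// Sketch of precision-at-K, assuming you have hand-labelled which of the
// returned document ids you consider relevant for a given query.
public class EvalSketch {
    static double precisionAtK(List<String> rankedDocIds, Set<String> relevant, int k) {
        int limit = Math.min(k, rankedDocIds.size());
        int hits = 0;
        for (int i = 0; i < limit; i++)
            if (relevant.contains(rankedDocIds.get(i))) hits++;
        return limit == 0 ? 0.0 : (double) hits / limit;
    }
}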
TASK 4: We want to suggest some "related" terms to the user to let him/her refine the query. To support this, implement the scalar clusters technique discussed in class. Use it to suggest keywords that the user can use to refine the query. For each of the representative queries from Task 3, evaluate the scalar clustering.
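A minimal sketch of scalar clustering, under the assumption that term frequencies per document are available as a term-by-document count matrix (freq and the class name are hypothetical): first build the term-term correlation matrix, then relate two terms by the cosine similarity of their correlation rows.

// Sketch of scalar clusters for query expansion. Assumes freq[t][d] is the
// frequency of term t in document d, extracted from the index.
public class ScalarClusterSketch {

    // c[u][v] = sum over documents of freq[u][d] * freq[v][d]
    static double[][] correlationMatrix(int[][] freq) {
        int t = freq.length, d = freq[0].length;
        double[][] c = new double[t][t];
        for (int u = 0; u < t; u++)
            for (int v = 0; v < t; v++)
                for (int k = 0; k < d; k++)
                    c[u][v] += freq[u][k] * freq[v][k];
        return c;
    }

    // Scalar (neighborhood) similarity: cosine between rows of the correlation matrix.
    static double scalarSim(double[][] c, int u, int v) {
        double dot = 0, nu = 0, nv = 0;
        for (int k = 0; k < c.length; k++) {
            dot += c[u][k] * c[v][k];
            nu += c[u][k] * c[u][k];
            nv += c[v][k] * c[v][k];
        }
        double denom = Math.sqrt(nu) * Math.sqrt(nv);
        return denom == 0 ? 0.0 : dot / denom;
    }
}

To suggest related terms for a query term u, return the terms v with the highest scalarSim(c, u, v). Note that a full term-by-term matrix over the whole index can be large; in practice you may want to restrict it to terms that co-occur with the query terms.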
Extra Credit Tasks
The prefix stars indicate the degree of difficulty.
(*) Try implementing relevance feedback techniques (the Rocchio method); play with it yourself, by identifying the good vs. bad pages, to see how well it works in practice.
(**) Develop a good interface to display the search results, as well
as the query elaboration results.
(***) Do the PageRank calculation for the entire ASU crawl, and use it as the basis for search. Experiment with different ways of combining PageRank and vector-space similarity (a sketch of the power-iteration computation appears after this list).
(***) Implement and compare latent semantic indexing to vector space ranking. (You will have to find an SVD package for this.)
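For the PageRank extra-credit task, here is a minimal sketch of the power-iteration computation, assuming outLinks[i] lists the indices of pages that page i links to (derivable from the LinkGen/LinkExtract link matrix). The damping factor of 0.85 and the handling of dangling pages are conventional choices, not something fixed by the project.

// Sketch of PageRank by power iteration. outLinks[i] holds the indices of
// pages that page i points to; pages with no out-links distribute their
// rank uniformly (a common way to handle dangling nodes).
public class PageRankSketch {
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    danglingMass += rank[i];
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int j : outLinks[i]) next[j] += share;
                }
            }
            // Combine link contributions, dangling mass, and the random-jump term.
            for (int j = 0; j < n; j++) {
                next[j] = (1 - damping) / n + damping * (next[j] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }
}

A typical invocation would be pageRank(outLinks, 0.85, 50); the ranks can then be combined with vector-space similarity scores in whatever way you choose to experiment with.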
Knowledge of Java is a prerequisite for the course. If you need to brush up your knowledge, read the Java Tutorial. The Java API documentation, which describes the class structure for the various standard packages, is available here.
The following functionality is provided by the classes included as part of the project.
A crawler implemented by Webcrawl.java that crawls the web starting from the seed URL provided on invocation. This crawler stores the first 1000 URLs that it successfully crawls. The result of one such crawl is made available as part of the code. Files from one such run on http://www.aaai.org/ were used to generate the index file given as part of the project.
IndexHTML.java contains the code to create the index from the crawled URLs. The index generated is made available as part of Project1class.jar. The index is generated using the Lucene API; to use this API as part of your code you have to import com.lucene.*.
The program SearchFiles.java gives a query interface that accepts queries and retrieves the matching documents. The interface accepts BooleanQueries. Currently the interface is plain text; an HTML version will be uploaded soon. Also, the program assumes "index" to be the name of the directory where the index is stored. To use your own index, change the default name to point to "yourindexname" and recompile.
The program VectorViewer.java is a demo program that lists the total number of documents in the index and lists all the terms in the index with their frequencies. Since the given code does not contain any data structure giving you the document vector model, this program demonstrates how you can retrieve the underlying document vector model for use in the various ranking algorithms that you develop. Read the commented section inside the code to see how to extract the (document, frequency) pairs for each term in the index and generate the document vector model.
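As a rough sketch of what VectorViewer.java demonstrates, the loop below walks the term enumeration and postings of the index to build per-document term-frequency vectors. The package and method names are assumed to follow the standard Lucene IndexReader/TermEnum/TermDocs API; consult the provided source and API documentation for the exact calls available in this version.

import com.lucene.index.IndexReader;
import com.lucene.index.Term;
import com.lucene.index.TermDocs;
import com.lucene.index.TermEnum;
import java.util.*;

// Sketch of building document vectors (docId -> term -> frequency) from the
// index, along the lines of VectorViewer.java.
public class DocVectorSketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");    // default index directory
        Map<Integer, Map<String, Integer>> docVectors =
            new HashMap<Integer, Map<String, Integer>>();

        TermEnum terms = reader.terms();                    // enumerate all terms
        while (terms.next()) {
            Term term = terms.term();
            TermDocs docs = reader.termDocs(term);          // postings for this term
            while (docs.next()) {
                Map<String, Integer> vec = docVectors.get(docs.doc());
                if (vec == null) {
                    vec = new HashMap<String, Integer>();
                    docVectors.put(docs.doc(), vec);
                }
                vec.put(term.text(), docs.freq());
            }
            docs.close();
        }
        terms.close();

        System.out.println(reader.numDocs() + " documents, "
                + docVectors.size() + " with at least one indexed term");
        reader.close();
    }
}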
The API documentation for the given classes is available for download and also can be viewed here.
LinkGen.java and LinkExtract.java are provided to help in computing PageRank and Authority/Hub scores for the crawled pages. LinkGen generates the link matrix from the files crawled and saves it to a file; LinkExtract uses the file written by LinkGen. Using LinkExtract you can get information about the pages pointed to by a page and the pages pointing to a page. Pre-compiled classes for both LinkGen and LinkExtract are made available in the jar files below. A directory containing 3000+ web pages crawled from www.asu.edu is made available too, along with the file generated by LinkGen on the new crawl. This enables you to unjar the files and immediately see how LinkExtract works.
Project1class.jar contains the class files for the above mentioned interfaces and also the index file.
Project1docs.jar gives the API documentation for the given code.
Project1src.jar contains the source code, which can be extended for the project.
You must have the JDK installed on your system to proceed with installing the downloaded programs. If, after installing the JDK, you get "command unknown" for java, javac, jar, javadoc, etc., check your path and classpath settings. For Windows-based systems, check your Autoexec.bat (95/98) or Environment Variable settings (Windows NT). For Unix/Linux systems, set the path and classpath in your .cshrc.
To extract the contents of the downloaded files, move to the directory where you want to install them and type
jar xvf Project1*.jar
Make sure to add this directory to your classpath to avoid problems.
Project1class.jar contains compiled code for the interfaces in the source, and also the documents crawled by the crawler and an index generated from them. If you are interested in checking the query engine you can try
java CSE494.SearchFiles
To check the WebCrawler on a site of your choice invoke the WebCrawler as
java CSE494.Webcrawl http://siteURL
Invoke all the programs from the directory where you executed the "jar" command; this way all the path settings inside the class files will work properly. Note that the crawler stores the documents in the same directory from which it was invoked.
Prj1pgrank.jar contains the source and class files for LinkGen and LinkExtract, as well as the documentation for these files.
Prj1pgrankdata.jar contains the files from which the link matrix is generated; these files are used by Task 2. This file also contains "Hashedlinks", the link matrix generated by LinkGen. A repository of web pages crawled from ASU is provided as part of the jar, and the link matrix has been generated from this repository under the directory "result2". For TASK 2 you should use this repository, since the crawl was localized to www.asu.edu to extract a better link graph.