CSE 494/598 - Project
PART A
Project Description
This is Part A of the project for CSE 494/598. You are
provided with a system that can extract web pages and index them. Using this
system, you will experiment with various ranking algorithms.
- TASK 0: Read the background description,
download the code, and try the boolean ranking code in
SearchFiles.java made available as part of the code. You may try
example queries given below. You can also try negated keywords (e.g.
-Decal).
- TASK 1: Implement the Vector Space
Model to rank the documents. Hint: See class notes: Retrieval using inverted
files.
- NOTE: For a query q and document di, Similarity(q,di) = (q . di) / (|q| |di|).
Computing |di|, the 2-norm of the weights of all terms contained in the
document, is expensive: you need to scan through every hit document and then
all the terms it contains to get the normalization factor. It is therefore
better to precompute the 2-norms for all documents once, rather than
recomputing them every time you compute the similarity of a document. You can
start with the VectorViewer.java code for Vector Space ranking.
- SUBMIT:
- A writeup explaining your algorithm and a brief evaluation of its
performance.
- Hardcopy showing the top 10 documents ranked by Vector Space model
for the 10 example queries below.
- Hardcopy of your code with comments.
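The NOTE under TASK 1 can be illustrated with a minimal sketch. It is not the provided project code and does not use the LUCENE API: the class name VectorSpaceSketch, the method rank, and the in-memory maps standing in for the inverted file are all hypothetical. It assumes the term weights and the document 2-norms have already been computed, and only shows how the dot product is accumulated from the postings of the query terms and then normalized.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: cosine ranking over an in-memory inverted file. */
class VectorSpaceSketch {

    /**
     * postings:     term -> (docId -> weight of the term in that document)
     * docNorms:     docId -> precomputed 2-norm |di| (assumed complete)
     * queryWeights: term -> weight of the term in the query
     * Returns (docId, similarity) pairs sorted by descending similarity.
     */
    static List<Map.Entry<Integer, Double>> rank(
            Map<String, Map<Integer, Double>> postings,
            Map<Integer, Double> docNorms,
            Map<String, Double> queryWeights) {

        // Accumulate q . di only for documents sharing at least one term with q.
        Map<Integer, Double> dot = new HashMap<>();
        double queryNormSquared = 0.0;
        for (Map.Entry<String, Double> q : queryWeights.entrySet()) {
            queryNormSquared += q.getValue() * q.getValue();
            Map<Integer, Double> list = postings.get(q.getKey());
            if (list == null) {
                continue; // query term does not occur in the index
            }
            for (Map.Entry<Integer, Double> p : list.entrySet()) {
                dot.merge(p.getKey(), q.getValue() * p.getValue(), Double::sum);
            }
        }
        double queryNorm = Math.sqrt(queryNormSquared);

        // Divide by |q| |di| using the precomputed document norms.
        List<Map.Entry<Integer, Double>> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            double docNorm = docNorms.get(e.getKey());
            double similarity = e.getValue() / (queryNorm * docNorm);
            ranked.add(new AbstractMap.SimpleEntry<Integer, Double>(e.getKey(), similarity));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }

    public static void main(String[] args) {
        // Toy index with made-up weights, for illustration only.
        Map<String, Map<Integer, Double>> postings = new HashMap<>();
        postings.computeIfAbsent("fall", k -> new HashMap<>()).put(1, 1.0);
        postings.computeIfAbsent("fall", k -> new HashMap<>()).put(2, 1.0);
        postings.computeIfAbsent("semester", k -> new HashMap<>()).put(1, 1.0);
        postings.computeIfAbsent("grades", k -> new HashMap<>()).put(2, 1.0);

        Map<Integer, Double> docNorms = new HashMap<>();
        docNorms.put(1, Math.sqrt(2.0));
        docNorms.put(2, Math.sqrt(2.0));

        Map<String, Double> query = new HashMap<>();
        query.put("fall", 1.0);
        query.put("semester", 1.0);

        System.out.println(rank(postings, docNorms, query));
    }
}
```

Because |q| is the same for every document, dividing by it does not change the ranking; it is kept here only so that the scores match the Similarity formula.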
Example Queries for TASK 1
- Fall semester
- Grades
- SRC
- newsletter
- Decal
- Parking
- Hayden Library
- Transcripts
- Scholarship
- Admissions
Resource Description & Documentation
- The code made available as part of the project is written in
Java. The latest version of Sun's Java Development Kit can be downloaded from this site. If you have an
earlier version of the JDK installed on your system, there is no need to upgrade.
- Knowledge of Java is a prerequisite for the course. If you
need to brush up on your knowledge, read the Java Tutorial.
The Java API documentation, which describes the class structure for various
standard packages, is available here.
- The classes supplied as part of the project provide the
following functionality.
- A crawler implemented by Webcrawl.java that crawls the web
starting from the seed URL provided on invocation. This crawler stores the
URLs that it successfully crawls. The result of one such crawl is made
available as part of the code.
- IndexHTML.java contains code to create the index from the
crawled URLs. The generated index is made available as part of
Projectclass.jar. The index is generated using the LUCENE API. To use
this API in your own code, you have to import com.lucene.*.
- The program SearchFiles.java gives a query interface that accepts
the queries and retrieves the matching documents. The interface accepts
Boolean Queries. The program assumes "index" to be the name of the
directory where the index is stored. To use your own index, change the
default name to point to "yourindexname" and recompile.
- The program VectorViewer.java is a demo program that lists the
total number of documents in the index and then lists all the terms in the
index with their frequencies. Since the given code does not contain any data
structure giving you the Document Vector Model, this program demonstrates
how you can retrieve the underlying Document Vector Model for use in the
various ranking algorithms that you develop. Read the commented section
inside the code to see how to extract the (document,freq) pairs for each term
in the index and generate the document vector model.
- The API documentation for the given classes is available for download and
also can be viewed here.
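As a rough illustration of how per-term (document,freq) pairs can be turned into per-document information, the sketch below computes the 2-norm of every document in a single pass over the terms. It is an assumption-laden example, not the VectorViewer.java code: the class name DocNormSketch, the method docNorms, and the nested-map input are hypothetical stand-ins for whatever the index returns, and raw term frequency is used as the term weight for simplicity (another weighting, such as tf-idf, could be substituted).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: document 2-norms from per-term (document, freq) pairs. */
class DocNormSketch {

    /**
     * termPostings: term -> (docId -> raw frequency of the term in that document)
     * Returns docId -> 2-norm of that document's term-frequency vector.
     */
    static Map<Integer, Double> docNorms(Map<String, Map<Integer, Integer>> termPostings) {
        // One pass over all terms, adding each squared weight to its document's total.
        Map<Integer, Double> sumOfSquares = new HashMap<>();
        for (Map<Integer, Integer> postings : termPostings.values()) {
            for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                double weight = p.getValue(); // assumption: weight = raw term frequency
                sumOfSquares.merge(p.getKey(), weight * weight, Double::sum);
            }
        }

        // The 2-norm is the square root of the accumulated sum of squares.
        Map<Integer, Double> norms = new HashMap<>();
        for (Map.Entry<Integer, Double> e : sumOfSquares.entrySet()) {
            norms.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return norms;
    }
}
```

Iterating term by term matches the way the index lists its contents, so no per-document scan is needed; the resulting map can be computed once and reused for every query.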
Download & Setup
- To download the code, right-click on the links below and
select Save As from the popup menu.
- Projectclass.jar
contains the class files for the above mentioned interfaces and also indexed
documents from an example crawl of ASU.
- Projectdocs.jar
gives the API documentation for the given code.
- Projectsrc.jar
contains the source code which can be extended for the project.
- You must have the JDK installed on your system before you can
install the downloaded programs. If, after installing the JDK, you get
"command unknown" for java, javac, jar, javadoc, etc., check your path
and classpath settings. For Windows-based systems, check your Autoexec.bat
(95, 98) or Environment Variable settings (Win NT/2000/XP). For Unix and Linux
systems, set path and classpath in your .cshrc.
- To extract the contents of the downloaded files, move to the directory
where you want to install them and type
- jar xvf Project1*.jar. Add this directory to your classpath
to avoid problems.
- Project1class.jar contains compiled code for the interfaces in the source
and also documents crawled by the crawler and an index generated from it. If
you are interested in checking the query engine you can try
- Invoke all the programs from the directory where you executed the "jar"
command. This way, all the path settings inside the class files will work
properly.
Deadline: Due date for Part A: 30th September, 2008
Last Modified: 02.05.2004