CSE 598/494: Problem with Part B Dataset



Hi,

Some of you noticed that the project 2 data set contains inconsistencies
between the file names in the index and those in the HashedLinks file. These
problems are due to the way some special characters (such as "?" and "<") were
handled. Basically, you can just ignore these inconsistencies: I have tested
with my own implementation, and they do NOT affect the test queries or most of
the other queries I've tested.

Another problem is that the index has the file names
as "folder_name/page_name" but the hashedLinks file only has "page_name". This
is easy to handle, but you have to map these file names correctly to get the
desired results (otherwise no page will have both a vector similarity value
and pageRank/AH values).
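
A minimal sketch of that mapping, in Python purely for illustration. The
function names, the "crawl" folder name, the page names, and the score values
are all made up here and are not part of the project code; the only thing
taken from the data set is the "folder_name/page_name" layout of the index
names.

```python
def page_name(index_name):
    """Drop the leading folder: 'folder_name/page_name' -> 'page_name'."""
    folder, sep, page = index_name.partition("/")
    # If there is no "/", the name is already a bare page name.
    return page if sep else index_name


def combine_scores(similarity, link_scores):
    """Keep only pages that have both a similarity and a link-based score.

    similarity:  {index file name: vector similarity}
    link_scores: {HashedLinks page name: PageRank or authority/hub value}
    """
    combined = {}
    for index_name, sim in similarity.items():
        key = page_name(index_name)
        if key in link_scores:
            combined[index_name] = (sim, link_scores[key])
    return combined


# Made-up names and values, for illustration only.
similarity = {
    "crawl/www.asu.edu%%index.html": 0.42,
    "crawl/lists.asu.edu%%cgi-bin%%wa@@A0=ASULUG": 0.17,
}
link_scores = {
    "www.asu.edu%%index.html": 0.031,
    "lists.asu.edu%%cgi-bin%%wa?A0=ASULUG": 0.008,
}
print(combine_scores(similarity, link_scores))
```

In this sketch only the first page ends up with both values. The second one
is dropped because its two names differ by a special character ("@@" vs. "?"),
which is the inconsistency described above that you can ignore.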

Let me know if you have further questions.

Thanks,

Jianchun

Quoting Nicholas Radtke <radtken@aztecfreenet.org>:

> Jianchun, 
>  
> Further analysis shows that it's not just question marks.  Here are 
> the stats: 
>  
> There are 205 '?' that were converted to something else. 
> There are 15 '<' that were converted to something else. 
> There are 12 '\' that were converted to something else. 
>  
> Then, there are a few oddballs (formatted as Hashedlinks <=> 
> crawledpages): 
>  
> mati.eas.asu.edu:8421 <=> mati.eas.asu.edu\@\@\@8421 
>  
> This last one is three separate documents in Hashedlinks and one 
> in crawledpages (separated apparently on ','). 
> www.asu.edu%%clas%%shs%%pages%%Language 
> Listening 
> Learning Clinic <=>  
> www.asu.edu%%clas%%shs%%pages%%Language, Listening, Learning Clinic 
>  
> Note that modifying the filenames in the crawledpages directory 
> is not an option for some of the characters, as they cannot be  
> part of a filename. 
>  
> Again, I'm looking for instructions on how to handle this situation. 
>  
> Thanks, 
>  
> Nicholas 
>  
>  
> ---------- Original Message ---------------------------------- 
> From: "Nicholas Radtke" <radtken@aztecfreenet.org> 
> Reply-To: <radtken@aztecfreenet.org> 
> Date: Mon, 24 Oct 2005 01:27:29 -0700 
>  
> >Jianchun,  
> >  
> >For part B of the project, there seems to be a major problem between 
> >the web crawl (the crawledpages directory) and the Hashedlinks file. 
> >Specifically, the documents with the character '?' lost this  
> >character in the web crawl but retained it in the Hashedlinks file.  
> >For example, the following two documents appear differently between  
> >the two places:  
> >  
> >Hashedlinks file:  lists.asu.edu%%cgi-bin%%wa?A0=ASULUG  
> >crawledpages dir:  lists.asu.edu%%cgi-bin%%wa@@A0=ASULUG  
> >  
> >The "?" seemed to turn into "@@".  
> >  
> >This obviously causes problems.  For example, the Fileslist() method 
> >in class LinkExtract returns the names without question marks, but  
> >the Links() and Citations() methods in the same class return the  
> >names with question marks.  There is no way to map between the two  
> >since the names are different.  
> >  
> >This affects the Authorities and Hubs algorithm as well as the  
> >PageRank+vector similarity.  Both rely on using the vector similarity 
> >search (which won't include question marks) and the forward and  
> >backward links (which will include question marks).  
> >  
> >What do you suggest I do?  Some possibilities I've considered are:  
> >  
> >1)  Modify the web crawl by adding the question marks back in.  This 
> >    involves modifying the data provided with the project, so I wanted  
> >    to get your permission before doing it.  
> >2)  Treat the two names as separate documents, even though they  
> >    technically are the same.  This'll have the effect of modifying  
> >    things M* in PageRank and missing documents when expanding  
> >    the root set in Authorities and Hubs.  
> >  
> >In short, the data provided is bad so I need to know what to do ASAP. 
> >  
> >Thanks,  
> >  
> >Nicholas  