
Re: Google and weighting query terms



Stan:

That is a good observation. Two comments:

1. The fact that repeating cdrom changes the query results shows
that at least part of Google is based more closely on the vector model
than on the boolean model. (Notice that the boolean model treats a page
as a *set* of keywords, while the vector model treats a page as a *bag*
of keywords and thus takes frequency information into account.)

   In the vector model, since the query is seen as a mini-document, it is
no surprise that mentioning cdrom twice changes the weight of cdrom in
the query.
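
   To make the set-versus-bag distinction concrete, here is a minimal
sketch. It assumes raw term counts and cosine similarity, which is only
one simple instance of the vector model; it is not a claim about how
Google actually ranks. The two toy pages and the function names are
made up for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words vector: maps each term to its raw count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term -> count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy pages (not real search results).
docs = {
    "distro": "linux linux cdrom",
    "driver": "linux cdrom cdrom cdrom",
}

def rank(query):
    """Vector model: the query is a mini-document, so repeats count."""
    q = tf_vector(query)
    return sorted(docs, key=lambda name: cosine(q, tf_vector(docs[name])),
                  reverse=True)

def boolean_match(query):
    """Boolean model: the query is a set, so repeats are invisible."""
    terms = set(query.lower().split())
    return sorted(name for name, text in docs.items()
                  if terms <= set(text.lower().split()))

print(rank("linux cdrom"))
print(rank("linux cdrom cdrom"))
print(boolean_match("linux cdrom") == boolean_match("linux cdrom cdrom"))
```

   On these toy pages the second query moves the cdrom-heavy page to the
top, while the boolean matcher returns the same set for both queries.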

2. It is of course possible to let users assign weights to the
keywords in the query directly and manually. However, the general
assumption/experience in the IR as well as web-search communities is
that the user is a doofus and can hardly use sophisticated strategies
for specifying a query. So, what is done instead is to modify the weights
based on what the user finds relevant (as judged by his/her actions).
This is what relevance feedback is all about... we will talk about it
today or next class.
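
As a rough preview of the idea of adjusting query weights from user
actions, here is a sketch of a Rocchio-style update, one classic
formulation of relevance feedback (not necessarily the one we will
cover in class). The alpha/beta/gamma defaults are commonly quoted
textbook values used here as assumptions, and the "clicked" and
"skipped" pages are invented for illustration.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a re-weighted query vector (term -> weight).

    Moves the query toward the average relevant document and away
    from the average non-relevant one. Terms whose weight ends up
    non-positive are dropped.
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * query.get(t, 0) + beta * pos - gamma * neg
        if w > 0:
            new_query[t] = w
    return new_query

# Hypothetical feedback: the user clicked one page and skipped another.
query = Counter("linux cdrom".split())
clicked = [Counter("linux cdrom cdrom driver".split())]
skipped = [Counter("linux linux distribution".split())]

updated = rocchio(query, clicked, skipped)
print(sorted(updated.items(), key=lambda kv: -kv[1]))
```

Here the user never typed a weight: cdrom ends up weighted above linux,
and driver enters the query, purely from which page was clicked.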

Rao
[Jan 31, 2001]


From: volsung@asu.edu
Subject: CSE494: Google and weighting query terms
Date: Mon, 29 Jan 2001 16:18:23 -0700 (MST)
Message-ID: <Pine.GSO.4.21.0101291512560.7786-100000@general2.asu.edu>

volsung> Today in lecture I wondered if letting the user influence the term weights in
volsung> their query would be worthwhile.  I realized later that a crude way to
volsung> increase the weight of one query term relative to the others is to simply type
volsung> it more times.  I wasn't sure if the search engines would make use of this
volsung> information, but apparently Google does.
volsung> 
volsung> For example (this was the first query I thought of):
volsung> 
volsung> Query: "linux cdrom"
volsung> Top 10 Results:
volsung> * Walnut Creek CDROM - The Walnut Creek CDROM Collection
volsung> * Infomagic - Software, Selection, Value
volsung> * linuxppc :: The Home of the PowerPC Linux Port
volsung> * Yggdrasil Computing Inc.
volsung> * The Slackware Linux Project
volsung> * CheapBytes Home Page
volsung> * Linux Central the /root for Linux resources
volsung> * Trinux: A Linux Security Toolkit
volsung> * Linux Joliet CDROM Support
volsung> * A REQUEST FOR A FREE LINUX CDROM
volsung> 
volsung> Query: "linux cdrom cdrom"
volsung> Top 10 Results:
volsung> * Walnut Creek CDROM - The Walnut Creek CDROM Collection
volsung> * FRANK CDROM Linux Software und Fachbücher
volsung> * Linux for Astronomy
volsung> * Linux Joliet CDROM Support
volsung> * Enhanced Linux IDE/ATAPI multiplatter cdrom project
volsung> * Walnut Creek CDROM - Linux CDROM Titles
volsung> * cdrom-standard.tex - Linux HeadQuarters
volsung> * patch-2.2.16 linux/drivers/cdrom/cdrom.c - Linux HeadQuarters
volsung> * Linux-Kernel Archive: CDROM Oops patch
volsung> * Re: [ale] Need a special Linux CDROM
volsung> 
volsung> The lists are different, so Google is taking the extra "cdrom" into
volsung> consideration when ranking the pages.  It even looks like the first list has
volsung> mostly pages where you can get different versions of Linux on CD-ROM.  The
volsung> second list has a greater number of pages on CD-ROM drivers for Linux, pages
volsung> that would have greater relevance to the term "cdrom".
volsung> 
volsung> Anyway, I thought this was neat because now it gives you a way to see the
volsung> effect of different query weights (roughly).
volsung> 
volsung> ---
volsung> Stan Seibert
volsung> 
volsung>