Re: cse494 stemming and information retrieval




 
m> I was just thinking about the methods that search engines use to parse, index,
m> and process index terms. In the early lectures, you mentioned that some
m> engines stem the terms and assign synonyms. How does this work with
m> pages that are generated on domains other than .us, or simply in different
m> languages, assuming these languages use the English alphabet? And what if they
m> don't use the English alphabet?
m> 


In general, stemming rules are very much language dependent. The best
stemming algorithm for English is what is known as the Porter stemming
algorithm, which is described in the MIR book. I presume there are
such stemming algorithms for other languages; I don't know them myself.
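
To give a flavor of what a rule-based stemmer looks like, here is a
minimal sketch of just the first step (step 1a) of the Porter algorithm,
which normalizes plural endings. The function name is my own, and the
full algorithm has several further steps with conditions on the
remaining stem, so treat this as an illustration and not a usable stemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer (plural endings)."""
    word = word.lower()
    # Rules are tried in order; the longest matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I   (ponies -> poni)
    if word.endswith("ss"):
        return word           # SS   -> SS  (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # S    -> ""  (cats -> cat)
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))
```

Note that the output ("poni") need not be a real word; it only has to
be the same string for all variants of a term. The suffix rules are
plainly English-specific, which is why each language needs its own stemmer.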

Before you can stem other languages, you need to first detect the
language and then apply the appropriate stemmer.

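Language detection, even when the alphabet is not the English (Latin)
alphabet, is possible in principle if all pages are encoded in Unicode
(see http://www.unicode.org ), since the character codes themselves
tell you which script a page is written in. The script narrows down
the language, though it does not always pin it down: many languages
share the Latin alphabet, for instance.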
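
As a rough sketch of the Unicode idea, the snippet below guesses the
dominant script of a piece of text using Python's standard unicodedata
module. The function name guess_script is hypothetical, and the code
relies on an assumption: that the first word of a character's Unicode
name (for example "GREEK" in "GREEK SMALL LETTER ALPHA") is a good
enough proxy for its script. It is not an official script lookup.

```python
import unicodedata
from collections import Counter


def guess_script(text: str) -> str:
    """Guess the dominant script of `text` from Unicode character names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, whitespace, and punctuation
        name = unicodedata.name(ch, "")
        if name:
            # Assumption: the first word of the name identifies the script.
            counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["hello world", "привет", "γεια σου"]:
        print(sample, "->", guess_script(sample))
```

This identifies a script, not a language: English and French text both
come back as "LATIN", so choosing between stemmers for those would need
something more, such as word or character statistics.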

In general, there has been very little work that I know of on
multi-language web search, partly because the "English web" is
overwhelmingly large compared to the web in any other language (I think
something like 90% or more of web pages were still in English as of
last year).

AltaVista, for example, uses a translator, Babelfish (
http://babelfish.altavista.digital.com/r?F09 ), to translate known
non-English European pages into English before indexing them. But no
one, as far as I know, has studied how effective this is (particularly
given that machine translation is not exactly a solved problem).

(For the non-cognoscenti among you: the Babel fish is a small fish that
you put in your ear, and it automatically translates all intergalactic
languages for you, courtesy of The Hitchhiker's Guide to the Galaxy.)

Rao
[Mar 22, 2001]


ps: I am glad to see people asking meaningful questions in the
aftermath of an exam...


pps: Donald Knuth is a big proponent of Unicode. To celebrate the
international flavor of computer science, he wants to have the names
of the CS researchers whose work he discusses in his books
written in Unicode and displayed in each researcher's native
script. (See the end of the page at
http://www-cs-faculty.stanford.edu/~knuth/help.html )