re: linux clusters at google etc talk
Here are some random notes on the talk
--He tries a sort of funny "flow"-based explanation of PageRank
(which doesn't quite work, and he has to revert to the random surfer
model anyway).
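For reference, the random surfer model boils down to a simple power iteration over the link graph. A minimal sketch (the tiny link graph and the 0.85 damping factor here are my illustrative assumptions, not from the talk):

```python
# Minimal PageRank power iteration (random surfer model).
# The three-page link graph below is hypothetical, just for illustration.
damping = 0.85  # probability the surfer follows a link vs. jumping to a random page
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}  # start uniform

for _ in range(50):  # iterate until ranks converge
    new = {p: (1 - damping) / n for p in pages}
    for src, outs in links.items():
        for dst in outs:
            new[dst] += damping * rank[src] / len(outs)
    rank = new

# Ranks always sum to 1; C ends up highest since both A and B link to it.
```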
--The main talk is about how they use cheap PCs in clusters for hardware
support. The technical takeaway is that, because a web search engine's
index is read-only and queries are independent of one another, it is
quite easy to parallelize the heck out of the problem. An incoming
query is routed to one of N machine clusters by a fast
switcher/load balancer, and is then handled entirely by that cluster. There are
nice photos of these machine clusters.
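Since every cluster holds a full replica, any query can go to any cluster; the balancer just spreads load. A toy sketch (cluster names and the round-robin policy are my assumptions; the talk does not specify the balancing algorithm):

```python
# Toy front-end load balancer: route each incoming query to one of N clusters.
from itertools import cycle

clusters = [f"cluster-{i}" for i in range(4)]
rr = cycle(clusters)  # simple round-robin over the replicated clusters

def route(query: str) -> str:
    """Any cluster can serve any query, since each holds a full replica."""
    return next(rr)

assignments = [route(q) for q in ["linux", "pagerank", "mtbf", "shard"]]
# Queries land on cluster-0, cluster-1, cluster-2, cluster-3 in turn.
```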
--They go with cheap PCs as their workhorses, and deal with machine
failure with lots and lots of replication (of both the index and document
servers). Much of the talk is a justification of why this works
out well for Google dollar-wise.
--A couple of cute jargon words: sharding--spanning a large file over multiple
systems.
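The usual way to span a file or index over machines is to hash each key to a shard; the same key then always lands on the same machine. A minimal sketch (the shard count and key names are my illustrative assumptions):

```python
# Toy sharding: span a large index over multiple machines by hashing keys.
import hashlib

NUM_SHARDS = 8  # hypothetical number of machines holding one piece each

def shard_for(key: str) -> int:
    """Stable hash, so the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Every document id deterministically lands on exactly one of the shards.
assert all(0 <= shard_for(f"doc-{i}") < NUM_SHARDS for i in range(100))
```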
--Some interesting observations on scale: if you have disks rated
for a 250,000-hour mean time between failures, and you have 50,000
of them, you expect a disk failure every 5 hours (this sort of scale
argument also comes up in the Haveliwala global clustering
paper).
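The arithmetic behind that observation, assuming independent disk failures at the rated MTBF:

```python
# Back-of-the-envelope from the talk: expected time between disk failures
# across a fleet, assuming each disk fails independently at its rated MTBF.
mtbf_hours = 250_000   # rated mean time between failures for one disk
num_disks = 50_000     # fleet size

hours_per_failure = mtbf_hours / num_disks
# -> 5.0: with 50,000 disks you expect roughly one failure every 5 hours
```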
--He gives some "Google" perspectives on what is good research
and what is bad research on search engines. User modeling and adaptive
search engines are considered good, while the semantic web, deep web, and P2P
are considered bad (the arguments are so-so).
Rao