
Re: Google Linux Cluster talk by Urs Hoelzle--Media player archive...



So, not having much homework or project stuff hanging over my head, I just finished watching the Hoelzle talk.
Shows how you can talk for 50 minutes without giving away any of the company's real secrets ;-)

Here are some random notes:

--He tries a sort of funny "flow"-based explanation of PageRank (which doesn't quite work, and he has to fall back on the random surfer model anyway).
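The random surfer model he falls back on is basically power iteration over the link graph. A rough sketch (the toy graph, damping factor, and iteration count are all made-up illustrative values, not anything from the talk):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the (1 - damping) "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # The surfer follows one of p's outlinks uniformly at random.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: surfer jumps to a uniformly random page.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Toy 3-page web: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Since c is pointed to by both a and b, it ends up with the highest rank in this toy graph.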

--The main talk is about how they use clusters of cheap PCs for their hardware. The technical takeaway is that because web search is read-only and queries are independent, it is quite easy to parallelize the heck out of the problem. An incoming query is routed to one of N machine clusters by a fast switch/load balancer and is then handled entirely within that cluster. There are nice photos of these machine clusters.
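Because queries are independent and read-only, the front-end dispatch can be as dumb as it likes; any of the N clusters can serve any query. A minimal sketch with round-robin standing in for whatever the real switcher does (the cluster names are made up):

```python
import itertools

class LoadBalancer:
    """Send each independent, read-only query to the next cluster in turn."""

    def __init__(self, clusters):
        self._cycle = itertools.cycle(clusters)

    def dispatch(self, query):
        # Any cluster can answer any query, so just rotate through them.
        cluster = next(self._cycle)
        return cluster, query

lb = LoadBalancer(["cluster-0", "cluster-1", "cluster-2"])
assignments = [lb.dispatch(q)[0] for q in ["foo", "bar", "baz", "qux"]]
```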

--They go with cheap PCs as their workhorses and deal with machine failure with lots and lots of replication (of both the index and document servers). Much of the talk is a justification of why this works out well for Google dollar-wise.
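The back-of-the-envelope reason replication masks cheap hardware: a shard is unavailable only if every one of its replicas is down at once. With made-up numbers (99%-available machines, which is my assumption, not a figure from the talk):

```python
def shard_availability(p, r):
    """Probability a shard is reachable, given r independent replicas
    that are each up with probability p."""
    return 1.0 - (1.0 - p) ** r

# Three replicas of a 99%-available machine already get you to
# roughly six nines for the shard.
a = shard_availability(0.99, 3)
```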

--Cute jargon word: sharding--splitting a large file across multiple machines.
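The simplest way to shard is to hash each document id to one of N index servers. A sketch (function name, doc ids, and the use of MD5 are all my own illustration; MD5 is used only so the assignment is deterministic across runs):

```python
import hashlib

def shard_for(doc_id, n_shards):
    """Deterministically map a document id to one of n_shards servers."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# Group some toy documents by the shard they land on.
shards = {}
for doc in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    shards.setdefault(shard_for(doc, 4), []).append(doc)
```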

--Some interesting observations on scale: if you have a disk rated for 250,000 hours mean time between failures and you have 50,000 disks, you can expect a disk failure every 5 hours (this sort of scale argument also comes up in the Haveliwala global clustering paper).
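The arithmetic is just the per-disk MTBF divided by the fleet size:

```python
def fleet_mtbf(per_disk_mtbf_hours, n_disks):
    """Expected hours between failures across a fleet of identical disks."""
    return per_disk_mtbf_hours / n_disks

hours = fleet_mtbf(250_000, 50_000)  # 5.0 hours between disk failures
```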

--He gives some "Google" perspectives on what counts as good and bad research on search engines. User modeling and adaptive search engines are considered good, while the semantic web, deep web, and P2P are considered bad (the arguments are so-so).


G'night
Rao