
*To*: cse494-s01@asu.edu
*Subject*: LSH ideas
*From*: Subbarao Kambhampati <SUBBARAO.KAMBHAMPATI@asu.edu>
*Date*: Wed, 07 Mar 2001 16:42:25 -0700 (MST)
*Reply-to*: rao@asu.edu

I am sorry we didn't get to answer the questions on LSH today. For the benefit of those who have read the paper and will have trouble sleeping tonight (or having any fun at all over the weekend) if this thing is left hanging, here is a short explanation. You are also urged to look at the slides I prepared for today's class.

Point 1: we are trying to find clusters of related pages. Most clustering algorithms start with the assumption that the pairwise distance ("dissimilarity") between pages is known a priori. To compute pairwise similarity of pages, we would need O(n^2) computations, and this is a no-no when your n is 25 million. So we want to see if we can do work that is O(n) instead of O(n^2).

The idea we will try is a hashing scheme: hash every document, as it comes, into a bucket. Docs pushed into the same bucket (i.e., those that "collide") are considered "similar". To do this we need a hash function such that two docs collide w.r.t. that hash function only if they are similar.

Small issue: notice that we are acting as if similarity is a binary concept. We can do this by assuming that any pair of pages whose numerical similarity metric is above a particular threshold is considered similar, and dissimilar otherwise.

It is impossible to guarantee this exactly for any definition of similarity. But we can try to guarantee it in a "probabilistic" sense. Specifically, we want hash functions w.r.t. which the probability of collision between two docs is proportional to their similarity. The "min-wise" hash functions have this property. Specifically, we generate min-wise hash values for a pair of docs as follows:

Step 1. Make a random permutation p of all the terms.

Step 2.
The hash value of a doc d under p is the term t in d that appears first in the permutation p.

Suppose we have the terms cat, dog, mouse and banana in our universe, and two docs d1 = {mouse, dog} and d2 = {cat, mouse}. One random permutation p on the terms is {cat--dog--mouse--banana}; according to p, the hash value of d1 is dog and the hash value of d2 is cat. Another random permutation p2 is {banana--mouse--cat--dog}; w.r.t. p2, the hash value of d1 is mouse and of d2 is mouse.

Suppose we make m different hash values for each of the documents, and let m' be the number of cases in which d1 and d2 have the same value. Now for something **IMPORTANT**: it can be shown that m'/m tends to |d1 .intersection. d2| / |d1 .union. d2| for large values of m. In other words, the probability of collision is equal to the similarity between the docs (under our similarity measure). How convenient, eh? (This explains why we happen to conveniently define similarity in terms of intersection and union.)

==============================

Now, the problem with just using min-wise hash functions is that while they do put similar docs in the same bucket, they may also put dissimilar docs in the same bucket. To reduce the probability that two dissimilar docs will collide, we need to make it harder for docs to pass as similar. So we say that it is not enough for two docs to agree on at least one min-wise hash value; they have to agree on some number k of min-wise hash values. This is formalized by the locality sensitive hashing scheme: suppose we have made m different hash values for each doc. We can make a k-sized LSH signature by first picking k random numbers between 1 and m, and concatenating those k min-wise hash values to get the hash signature.
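The min-wise scheme can be sketched in a few lines of Python (a sketch, not from the message; the function names are mine). It implements Step 1 and Step 2 above and counts how often two docs collide:

```python
import random

def minhash(doc, permutation):
    # Hash value of doc under the permutation: the term of doc
    # that appears first in the permutation.
    for term in permutation:
        if term in doc:
            return term

def estimate_similarity(d1, d2, universe, m, seed=0):
    # Fraction of m random permutations on which d1 and d2 collide;
    # this tends to |d1 & d2| / |d1 | d2| for large m.
    rng = random.Random(seed)
    matches = 0
    for _ in range(m):
        p = list(universe)
        rng.shuffle(p)  # Step 1: a random permutation of all the terms
        if minhash(d1, p) == minhash(d2, p):  # Step 2 for each doc, then compare
            matches += 1
    return matches / m

universe = ["cat", "dog", "mouse", "banana"]
d1 = {"mouse", "dog"}
d2 = {"cat", "mouse"}
# True similarity: |{mouse}| / |{cat, dog, mouse}| = 1/3
print(estimate_similarity(d1, d2, universe, m=10000))
```

With m = 10000 permutations the printed estimate lands close to the true value 1/3, illustrating the m'/m claim above.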
For example, suppose we have doc A: {mouse, dog, horse, ant} and doc B: {cat, ice, shoe, mouse}, and suppose we generated m = 4 hash values for each document (using 4 random permutations over the words mouse, dog, horse, ant, cat, ice, shoe):

for doc A: MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog
for doc B: MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe

To get an LSH signature of length k = 3, we first pick 3 numbers between 1 and 4 (say 1, 3, 4), and get the LSH signatures for docs A and B as

LSH134 = horse-ant-dog (doc A)
LSH134 = cat-ice-shoe (doc B)

Now we want A and B to be considered similar only if they agree on the LSH signature. Since agreement on a k-length LSH signature requires independent agreement on k different min-wise hash values, we are driving down the false positive probability. Specifically, if the probability of agreement on a single hash value is p, then the probability of agreement over a k-sized LSH signature will be p^k.

Okay, we got the false positive probability down. But it also made the false negative probability high--that is, two truly similar docs may be found dissimilar by our hash method. To deal with this we will make l different LSH signatures and say that if a pair of docs agrees on at least one of the l different LSH signatures, then they are similar.

Specifically, suppose we define two pages to be similar if they have a similarity value (according to the intersection/union formula) of at least .2. For the set of pages that have a similarity value >= .2, the probability p that they agree on a single min-hash value is >= .2.
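The signature-building step above is just concatenation of the chosen min-wise hash values. A minimal sketch, hand-coding the hash values from the example (the function name is mine):

```python
def lsh_signature(hash_values, positions):
    # Concatenate the min-wise hash values at the chosen
    # (1-based) positions into a single signature string.
    return "-".join(hash_values[i - 1] for i in positions)

# The m = 4 min-wise hash values from the example above:
mh_A = ["horse", "mouse", "ant", "dog"]  # doc A
mh_B = ["cat", "mouse", "ice", "shoe"]   # doc B

positions = [1, 3, 4]                    # k = 3 randomly chosen positions
sig_A = lsh_signature(mh_A, positions)   # "horse-ant-dog"
sig_B = lsh_signature(mh_B, positions)   # "cat-ice-shoe"
print(sig_A == sig_B)                    # False: A and B do not collide
```

In a real run, docs would be grouped into buckets keyed by the signature string, so agreement on all k positions is required for a collision.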
The probability p' that two similar pages will collide over a k-sized LSH signature is p^k. The probability that two similar pages will collide at least once over l different LSH signatures is 1 - (1 - p')^l, which for small p' is approximately l*p' = l*p^k. We want this probability to be close to 1 (so similar pages will actually collide with near-perfect probability), which suggests taking l on the order of 1/p^k. For p = .2 and k = 3, that gives l = 125 (you have to make 125 different LSH signatures for each document). At the threshold itself this yields a collision probability of 1 - (1 - .008)^125, about .63, and the probability climbs steeply toward 1 as the similarity rises above the threshold.

The *REALLY NEAT* thing in the above is that we can increase k (the length of the LSH signature) to drive the false positive probability down, and simultaneously increase l to drive the false negative probability down.

Maybe this made better sense...?

Rao [Mar 7, 2001]
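These numbers are easy to check (a sketch; collision_prob is my name for the exact at-least-one-collision formula, of which l*p^k is the small-p approximation):

```python
def collision_prob(s, k, l):
    # Probability that two docs whose single-hash agreement
    # probability is s collide on at least one of l signatures
    # of length k: 1 - (1 - s^k)^l.
    return 1 - (1 - s ** k) ** l

k, l = 3, 125
print(round(collision_prob(0.2, k, l), 2))  # at the threshold: ~0.63
print(round(collision_prob(0.5, k, l), 2))  # a clearly similar pair: ~1.0
```

The sharp rise of 1 - (1 - s^k)^l between s = .2 and s = .5 is exactly the threshold behavior that makes the k-then-l construction work.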
