# LSH ideas

```
I am sorry we didn't get to answer the questions on LSH today. For the
benefit of those who have read the paper and will have trouble
sleeping tonight (or having any fun at all over the weekend) if this
thing is left hanging, here is a short explanation. You are also urged
to look at the slides I prepared for today's class.

point 1

--> we are trying to find clusters of related pages. Most clustering
--> distance ("dissimilarity") between pages is known apriori. To
--> compute pairwise similarity of pages, we will need o(n^2)
--> computations and this is a no-no when your n is 25million

So we want to try and see if we can do work that is O(n) instead of
O(n^2)

The idea we will try to use is to do some sort of hashing scheme and
hash every document, as it comes, into a bucket. Docs pushed into the
same buckets (i.e., those that "collide") are considered to be
"similar"

To do this we need to make a hashing function such that
two docs collide w.r.t that hash function only if they are similar.

--Small issue: Notice that we are acting as if similarity is a binary
concept. We can do this by assuming that anything pair of pages whose
numerical similarity metric is above a particular threshold are
considered similar, and dissimilar otherwise.

It is impossible to gurantee this exactly for any definition of
similarity. But, we can try to guarantee this in a "probabilistic"
sense. Specifically, we may want hash functions w.r.t. which the
probability of collision between two docs is proportional to their
similarity.

The "min-wise" hash functions have this property.

Specifically, we generate minwise hash values of a set of two docs as
follows:

Step 1. Make a random permutation p of all the terms
Step 2. The hash value of a doc d under p is the term t in d that
appears first in the permutation p

Suppose we have the terms cat, dog, mouse and banana in our universe

we have two docs d1={mouse,dog} d2={cat,mouse}

A random permutation p on the terms is {cat--dog--mouse--banana}
according to p, hash value of d1 is dog and hash value of d2 is cat.

Another random permutation p2 is {banana--mouse--cat--dog} and hash
values w.r.t. p2 for d1 is mouse and for d2 is mouse.

Suppse we  make m different hash value for each of the documents

Let m' be the number of cases in which d1 and d2 have the same value.

Now for something  **IMPORTANT**
It can be shown that m'/m tends to
|d1 .intersection. d2|/|d1 .union. d2| for large values of m

In other words the probability of collision is equal to the similarity
between the docs (under our similarity measure). How convenient, eh?

(this explains why we happen to conveniently define similarity in
terms of intersection and union).

==============================

Now, the problem with just using min-wise hash functions is that while
they do put the similar docs in the same bucket, they may also put
dissimilar docs in the same bucket.

To reduce the probability that two dissimilar docs will collide, we
need to make it hard for docs to become similar.

So, we say that it is not enough for two docs to agree on at least one
min-wise hash value, but they have to agree on some k number of
min-wise hash values.

This is formalized by consider the Locality based hashing scheme.
Suppose we have made m different hash values for each doc.

We can make a k-sized LSH signature by first picking k random numbers
between 1 and m, and concatenating those k min-wise hash values to get
the hash signature.

For example, suppose we have

doc A:
{mouse, dog, horse, ant}

and doc B:
{cat, ice, shoe, mouse}

suppose we generated m=4 hash values for each document (using 4 random
permutations over the words mouse, dog, horse, ant, cat, ice, shoe

for doc A
MH1 = horse
MH2 = mouse
MH3 = ant
MH4 = dog

for docB
MH1 = cat
MH2 = mouse
MH3 = ice
MH4 = shoe

To get an LSH signature of length k=3, we first pick 3 numbers between
1 and 4 (say, 1,3,4), and get the LSH signatures for docs A and B as

LSH134 = cat-ice-shoe  (doc A)
LSH134 = horse-ant-dog  (doc B)

Now, we want A and B to be considered similar only if they agree on
the LSH signature. Since agreement on a k-length LSH signature
requires independent agreement on k different minwise hash values,
we are driving down the false positive probability. Specifically if
probability of agreement on a single hash value is p, then the
probability of agreement over a k-size LSH signature will be p^k

**Okay, we got the false positive probability down. But it also made
false negative probability--that is two truly similar docs will be
found to be dissimilar by our hash method-- high

To deal with this we will make l different LSH signatures and say that
if a pair of docs agree on at least one of the l different LSH
signatures then they are similar.

Specifically, suppose we are defining to pages to be similar if they
have a similarity value (according to the intersection/union
formula)of at least .2

For the set of pages that have similarity value of >= .2, the
probability p  that they agree on a single min-hash value is >= .2.

The probability p' that two similar pages will collide over a k-size LSH
signature is p^k

The probability that two similar pages will collide at least once over l
different LSH signatures is l*p' = l*p^k

we want this probability to be close to 1 (so similar pages will
actually collide with near perfect probability)

thus l = 1/p^k

for p=.2, an k = 3, l will be 125 (you have to make 125 different LSH
signatures for each document).

The *REALLY NEAT* thing in the above is that we can increase  k (length
of LSH signature) to drive the false positive probability down, and
simultaneously increase l to get the false negative probability down.

May be this made better sense...?

Rao
[Mar  7, 2001]
```