The LSHFunction API

Under construction

This section is currently being developed. If you're interested in helping write this section, feel free to open a pull request; otherwise, please check back later.

LSHFunction

The LSHFunctions module exposes a relatively easy interface for constructing new hash functions. Namely, you call LSHFunction with

  • the similarity statistic you want to hash on;
  • the number of hash functions you want to generate; and
  • keyword parameters specific to the LSH function family that you're sampling from.
LSHFunction(similarity, n_hashes::Integer=1; kws...)

For instance, in the snippet below we create a single hash function corresponding to cosine similarity:

julia> using LSHFunctions

julia> hashfn = LSHFunction(cossim);

julia> typeof(hashfn)
SimHash{Float32}

julia> n_hashes(hashfn)
1

julia> similarity(hashfn)
cossim (generic function with 2 methods)

As another example, following code snippet creates 10 hash functions for inner product similarity. All of the generated hash functions are bundled together into a single SignALSH struct. We specify the following keyword arguments:

  • dtype: the data type to use internally in the SignALSH struct.
  • maxnorm: an upper bound on the norm of the data points we're hashing, and a required parameter for SignALSH.
julia> using LSHFunctions

julia> hashfn = LSHFunction(inner_prod, 10; dtype=Float64, maxnorm=5.0);

julia> n_hashes(hashfn)
10

julia> typeof(hashfn)
SignALSH{Float64}

julia> hashfn.maxnorm
5.0
Creating multiple hash functions

In practice, you usually want to use multiple hash functions at the same time, and combine their hashes together in order to form a key with which to index into the hash table. To create N hash functions simultaneously, run

hashfn = LSHFunction(similarity, N; kws...)

hashfn will automatically generate and compute N different hash functions. It will then return a Vector of those hashes (unless hashtype(hashfn) is Bool, in which case it will return a BitArray).

  • See the FAQ for the reasoning behind using multiple locality-sensitive hash functions simultaneously.

If you want to know what hash function will be created for a given similarity, you can use lsh_family:

julia> lsh_family(jaccard)
MinHash

julia> lsh_family(ℓ1)
L1Hash

Utilities

LSHFunctions.jl provides a few common utility functions that you can use across LSHFunction subtypes:

  • n_hashes: returns the number of hash functions computed by an LSHFunction.

    julia> hashfn = LSHFunction(jaccard);
    
    julia> n_hashes(hashfn)
    1
    
    julia> hashfn = LSHFunction(jaccard, 10);
    
    julia> n_hashes(hashfn)
    10
    
    julia> hashes = hashfn(randn(50));
    
    julia> length(hashes)
    10
  • similarity: returns the similarity statistic on which your hash function is locality-sensitive:

    julia> hashfn = LSHFunction(cossim);
    
    julia> similarity(hashfn)
    cossim (generic function with 2 methods)
  • hashtype: returns the type of hash computed by the input hash function. Note that in practice hashfn(x) (or index_hash(hashfn,x) and query_hash(hashfn,x) for an AsymmetricLSHFunction) will return an array of hashes, one for each hash function you sampled when you called LSHFunction. hashtype is the data type of each element of hashfn(x).

    julia> hashfn = LSHFunction(cossim, 5);
    
    julia> hashtype(hashfn)
    Bool
    
    julia> hashes = hashfn(rand(100));
    
    julia> typeof(hashes)
    BitArray{1}
    
    julia> typeof(hashes[1]) == hashtype(hashfn)
    true
  • collision_probability: returns the probability of collision for two inputs with a given similarity. For instance, the probability that a single MinHash hash function causes a collision between inputs A and B is equal to jaccard(A,B):

    julia> hashfn = MinHash();
    
    julia> A = Set(["a", "b", "c"]);
    
    julia> B = Set(["b", "c", "d"]);
    
    julia> collision_probability(hashfn, A, B) ==
           collision_probability(hashfn, jaccard(A,B)) ==
           jaccard(A,B)
    true

    We often want to compute the probability that not just one hash collides, but that multiple hashes collide simultaneously. You can calculate this using the n_hashes keyword argument. If left unspecified, this parameter will default to n_hashes(hashfn).

    julia> hashfn = MinHash(5);
    
    julia> A = Set(["a", "b", "c"]);
    
    julia> B = Set(["b", "c", "d"]);
    
    julia> collision_probability(hashfn, A, B) ==
           collision_probability(hashfn, A, B; n_hashes=5) ==
           collision_probability(hashfn, A, B; n_hashes=1)^5
    true
    
    julia> sim = jaccard(A,B);
    
    julia> collision_probability(hashfn, sim) ==
           collision_probability(hashfn, sim; n_hashes=5) ==
           collision_probability(hashfn, sim; n_hashes=1)^5
    true