API reference

LSHFunction API

LSHFunctions.LSHFunctionType
LSHFunction(similarity, args...; kws...)

Construct the default LSHFunction subtype that corresponds to the similarity function similarity.

Arguments

  • similarity: the similarity function you want to use. Can be any of the following:

    • cossim
    • inner_prod
    • jaccard
    • ℓ1
    • ℓ2
  • args...: arguments to pass on to the default LSHFunction constructor corresponding to similarity.

  • kws...: keyword parameters to pass on to the default LSHFunction constructor corresponding to similarity.

Returns

Returns a subtype of LSH.LSHFunction that hashes the similarity function similarity.

Examples

In the snippet below, we construct LSHFunctions.SimHash (the default hash function corresponding to cosine similarity) using LSHFunction():

julia> hashfn = LSHFunction(cossim);

julia> typeof(hashfn) <: LSHFunctions.SimHash <: LSHFunction
true

We can provide arguments and keyword parameters corresponding to the hash function that we construct:

julia> hashfn = LSHFunction(inner_prod, 100; dtype=Float64, maxnorm=10);

julia> n_hashes(hashfn) == 100 &&
       typeof(hashfn) <: SignALSH{Float64} &&
       hashfn.maxnorm == 10
true

See also: lsh_family

source
LSHFunctions.lsh_familyFunction
lsh_family(similarity)

Return the default constructor or LSHFunction subtype used to construct a hash function for the similarity function similarity.

The main use of lsh_family is to make it easier to find the documentation for the hash function that's constructed when you call LSHFunction. For instance, if you want to know more about the arguments and keyword parameters that can be given to LSHFunction(inner_prod), you can run

julia> lsh_family(inner_prod)
SignALSH

help?> SignALSH

Examples

julia> lsh_family(cossim)
SimHash

julia> lsh_family(ℓ1)
L1Hash

See also: LSHFunction

source
LSHFunctions.hashtypeFunction
hashtype(hashfn::LSHFunction)

Returns the type of hash generated by a hash function.

Examples

julia> hashfn = LSHFunction(cossim);

julia> hashtype(hashfn)
Bool

julia> hashfn = LSHFunction(ℓ1);

julia> hashtype(hashfn)
Int32
source
LSHFunctions.n_hashesFunction
n_hashes(hashfn::LSHFunction)

Return the number of hashes computed by hashfn.

Examples

julia> hashfn = SimHash();

julia> n_hashes(hashfn)
1

julia> hashfn = SimHash(12);

julia> n_hashes(hashfn)
12

julia> hashes = hashfn(rand(25));

julia> length(hashes)
12
source
LSHFunctions.similarityFunction
similarity(hashfn::LSHFunction)

Returns the similarity function that hashfn hashes on.

Arguments

  • hashfn::AbstractLSHFunction: the hash function whose similarity we would like to retrieve.

Examples

julia> hashfn = LSHFunction(cossim);

julia> similarity(hashfn) == cossim
true

julia> hashfn = LSHFunction(ℓ1);

julia> similarity(hashfn) == ℓ1
true
source
LSHFunctions.collision_probabilityFunction
collision_probability(
    hashfn::H,
    sim;
    n_hashes::Union{Symbol,Integer}=:auto
) where {H <: LSHFunction}

Compute the probability of hash collision between two inputs with similarity sim for an LSHFunction of type H. This function returns the probability that n_hashes hashes simultaneously collide.

Arguments

  • hashfn::LSHFunction: the LSHFunction for which we want to compute the probability of collision.
  • sim: a similarity (or vector of similarities), computed using the similarity function returned by similarity(hashfn).

Keyword arguments

  • n_hashes::Union{Symbol,Integer} (default: :auto): the number of hash functions to use to compute the probability of collision. If the probability that a single hash collides is p, then the probability that n_hashes hashes simultaneously collide is p^n_hashes. As a result,

    collision_probability(hashfn, sim; n_hashes=N)

    is the same as

    collision_probability(hashfn, sim; n_hashes=1).^N

    If n_hashes = :auto then this function will select the number of hashes to be n_hashes(hashfn) (using the n_hashes function from the LSHFunction API).

Examples

The probability that a single MinHash hash function causes a hash collision between inputs A and B is equal to jaccard(A,B):

julia> hashfn = MinHash();

julia> A = Set(["a", "b", "c"]);

julia> B = Set(["b", "c", "d"]);

julia> jaccard(A,B)
0.5

julia> collision_probability(hashfn, jaccard(A,B); n_hashes=1)
0.5

If our MinHash struct keeps track of N hash functions simultaneously, then the probability of collision is jaccard(A,B)^N:

julia> hashfn = MinHash(10);

julia> A = Set(["a", "b", "c"]);

julia> B = Set(["b", "c", "d"]);

julia> collision_probability(hashfn, jaccard(A,B)) ==
       collision_probability(hashfn, jaccard(A,B); n_hashes=10) ==
       collision_probability(hashfn, jaccard(A,B); n_hashes=1)^10
true

See also: n_hashes, similarity

source
collision_probability(hashfn::LSHFunction, x, y;
                      n_hashes::Union{Symbol,Integer} = :auto)

Computes the probability of a hash collision between two inputs x and y for a given hash function hashfn. This is the same as calling

collision_probability(hashfn, similarity(hashfn)(x,y); n_hashes=n_hashes)

Examples

The following snippet computes the probability of collision between two sets A and B for a single MinHash. For MinHash, this probability is just equal to the Jaccard similarity between A and B.

julia> hashfn = MinHash();

julia> A = Set(["a", "b", "c"]);

julia> B = Set(["a", "b", "c"]);

julia> similarity(hashfn) == jaccard
true

julia> collision_probability(hashfn, A, B) ==
       collision_probability(hashfn, jaccard(A,B)) ==
       jaccard(A,B)
true

We can use the n_hashes argument to specify the probability that n_hashes MinHash hash functions simultaneously collide. If left unspecified, then we'll simply use n_hashes(hashfn) as the number of hash functions:

julia> hashfn = MinHash(10);

julia> A = Set(["a", "b", "c"]);

julia> B = Set(["a", "b", "c"]);

julia> collision_probability(hashfn, A, B) ==
       collision_probability(hashfn, A, B; n_hashes=10) ==
       collision_probability(hashfn, A, B; n_hashes=1)^10
true
source
LSHFunctions.SymmetricLSHFunctionType
abstract type SymmetricLSHFunction <: LSHFunction end

A symmetric locality-sensitive hashing function. A SymmetricLSHFunction uses the same hash function to insert items into a hash table as well as query the table for collisions. If hashfn is a SymmetricLSHFunction, you can compute the hash for an input x as hashfn(x).

See also: AsymmetricLSHFunction

source
LSHFunctions.AsymmetricLSHFunctionType
abstract type AsymmetricLSHFunction <: LSHFunction end

An asymmetric locality-sensitive hashing function. An AsymmetricLSHFunction uses one hash function to insert items into a hash table, and a different hash function to query the table for collisions. If hashfn is an AsymmetricLSHFunction, you can compute the indexing hash for an input x with index_hash(hashfn,x), and the querying hash with query_hash(hashfn,x).

See also: SymmetricLSHFunction

source

Hash functions

LSHFunctions.SimHashType
SimHash(n_hashes::Integer = 1;
        dtype::Type = Float32,
        resize_pow2::Bool = false)

Creates a locality-sensitive hash function for cosine similarity.

Arguments

  • n_hashes::Integer (default: 1): the number of hash functions to generate.

Keyword parameters

  • dtype::Type (default: Float32): the data type to use in the LSHFunctions.SimHash internals. For performance reasons you should pick dtype to match the type of the data you're hashing.
  • resize_pow2::Bool (default: false): affects the way in which the returned LSHFunctions.SimHash resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set resize_pow2 = true.

Examples

Construct a hash function by calling SimHash with the number of hash functions you want to generate:

julia> hashfn = SimHash(24);

julia> n_hashes(hashfn) == 24 &&
       similarity(hashfn) == cossim
true

You can then call hashfn(x) in order to compute hashes:

julia> hashfn = SimHash(32);

julia> x = randn(30);

julia> hashes = hashfn(x);

References

  • Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, page 380–388, New York, NY, USA, 2002. Association for Computing Machinery. 10.1145/509907.509965.

See also: cossim

source
LSHFunctions.MinHashType
MinHash(n_hashes::Integer = 1;
        dtype::Type = Any,
        symbols::Union{Vector,Set} = Set())

Construct a locality-sensitive hash function for Jaccard similarity.

Arguments

  • n_hashes::Integer (default: 1): the number of hash functions to generate.

Keyword parameters

  • dtype::Type (default: Any): the type of symbols in the sets you're hashing. This is overriden by the data type contained in symbols when symbols is non-empty.
  • symbols::Union{Vector,Set}: a Vector or Set containing all of the possible elements ("symbols") of the sets that you will be hashing. If left empty, MinHash will instead expand its dictionary when it sees new symbols (at small additional computational expense).

Examples

Construct a hash function to hash sets whose elements are integers between 1 and 50:

julia> hashfn = MinHash(40; symbols = Set(1:50));

julia> n_hashes(hashfn) == 40 && similarity(hashfn) == jaccard
true

julia> hashfn(Set([38, 14, 29, 48, 11]));

julia> hashfn([1, 1, 2, 3, 4]); # You can also hash Vectors

julia> hashfn(Set([100]))
ERROR: Symbol 100 not found

If you aren't sure ahead of time exactly what kinds of elements will be in the sets you're hashing, you can opt not to specify symbols, in which case MinHash will lazily update its hash functions as it encounters new symbols:

julia> hashfn = MinHash();

julia> hashfn(Set([1, 2, 3]));

julia> hashfn(Set(["a", "b", "c"]));

If you don't know what elements you'll encounter, but you know that they'll all be of a specific data type, you can specify the dtype argument for increased efficiency:

julia> hashfn = MinHash(10; dtype = String);

julia> hashfn(Set(["a", "b", "c"]));

References

  • Broder, A. On the resemblance and containment of documents. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.

See also: jaccard

source
LSHFunctions.L1HashType
L1Hash(
    n_hashes::Integer = 1;
    dtype::Type = Float32,
    r::Real = 1.0,
    resize_pow2::Bool = false
)

Constructs a locality-sensitive hash for $\ell^1$ distance ($\|x - y\|_1$), defined as

$\|x - y\|_1 = \sum_i |x_i - y_i|$

Arguments

  • n_hashes::Integer (default: 1): the number of hash functions to generate.

Keyword parameters

  • dtype::Type (default: Float32): the data type to use in the LSHFunctions.L1Hash internals. For performance reasons you should pick dtype to match the type of the data you're hashing.
  • r::Real (default: 1.0): a positive coefficient whose magnitude influences the collision rate. Larger values of r will increase the collision rate, even for distant points. See references for more information.
  • resize_pow2::Bool (default: false): affects the way in which the returned LSHFunctions.L1Hash resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set resize_pow2 = true.

Examples

Construct an L1Hash with the number of hash functions you want to generate:

julia> hashfn = L1Hash(128);

julia> hashfn.power == 1 &&
       n_hashes(hashfn) == 128 &&
       similarity(hashfn) == ℓ1
true

After creating a hash function, you can compute hashes with hashfn(x):

julia> hashfn = L1Hash(20);

julia> x = rand(4);

julia> hashes = hashfn(x);

References

  • Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.

See also: ℓ1

source
LSHFunctions.L2HashType
L2Hash(
    n_hashes::Integer = 1;
    dtype::Type = Float32,
    r::Real = 1.0,
    resize_pow2::Bool = false
)

Constructs a locality-sensitive hash for $\ell^2$ distance ($\|x - y\|_2$), defined as

$\|x - y\|_2 = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$

Arguments

  • n_hashes::Integer (default: 1): the number of hash functions to generate.

Keyword parameters

  • dtype::Type (default: Float32): the data type to use in the LSHFunctions.L2Hash internals. For performance reasons you should pick dtype to match the type of the data you're hashing.
  • r::Real (default: 1.0): a positive coefficient whose magnitude influences the collision rate. Larger values of r will increase the collision rate, even for distant points. See references for more information.
  • resize_pow2::Bool (default: false): affects the way in which the returned LSHFunctions.L2Hash resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set resize_pow2 = true.

Examples

Construct an L2Hash with the number of hash functions you want to generate:

julia> hashfn = L2Hash(128);

julia> hashfn.power == 2 &&
       n_hashes(hashfn) == 128 &&
       similarity(hashfn) == ℓ2
true

After creating a hash function, you can compute hashes with hashfn(x):

julia> hashfn = L2Hash(20);

julia> x = rand(4);

julia> hashes = hashfn(x);

References

  • Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.

See also: ℓ2

source
LSHFunctions.SignALSHType
SignALSH(n_hashes::Integer = 1,
         dtype::Type = Float32,
         maxnorm::Union{Nothing,Real} = nothing,
         m::Integer = 3,
         resize_pow2::Bool = false)

Create a SignALSH hash function for hashing on inner product similarity.

Arguments

  • n_hashes::Integer (default: 1): the number of hash functions to generate.

Keyword parameters

  • dtype::Type (default: Float32): the data type to use in the LSHFunctions.SignALSH internals. For performance reasons you should pick dtype to match the type of the data you're hashing.
  • maxnorm::Union{Nothing,Real} (default: nothing): an upper bound on the $\ell^2$-norm of the data points.
Warning: maxnorm must be set

The maxnorm keyword parameter must be explicitly specified. If it is left unspecified (or set to nothing), SignALSH() will raise an error.

  • m::Integer (default: 3): parameter m that affects the probability of a hash collision.
  • resize_pow2::Bool (default: false): affects the way in which the returned LSHFunctions.SignALSH resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set resize_pow2 = true.

Examples

SignALSH is an AsymmetricLSHFunction, and hence hashes must be computed using index_hash and query_hash.

julia> hashfn = SignALSH(12; maxnorm=10);

julia> x = rand(4);

julia> ih = index_hash(hashfn, x); qh = query_hash(hashfn, x);

julia> length(ih) == length(qh) == 12
true

julia> typeof(ih) == typeof(qh) == BitArray{1}
true

You need to explicitly specify the maxnorm keyword parameter when constructing SignALSH, otherwise you will get an error.

julia> hashfn = SignALSH(12)
ERROR: maxnorm must be specified for SignALSH

You'll also get an error if you try to hash a vector that has norm greater than the maxnorm that you specified.

julia> hashfn = SignALSH(; maxnorm=1);

julia> index_hash(hashfn, ones(4))
ERROR: norm 2.0 exceeds maxnorm (1.0)

References

  • Anshumali Shrivastava and Ping Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, page 812–821, Arlington, Virginia, USA, 2015. AUAI Press. 10.5555/3020847.3020931. arXiv:1405.5869

See also: inner_prod, ℓ2_norm

source

Similarity functions

LSHFunctions.L1Method
Lp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)

Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.

See also: ℓp

Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)

Computes the $L^p$ distance between two functions, given by

$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

Examples

Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral

$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$

for $p = 1$, $p = 2$, and $p = 3$.

julia> f(x) = x^2 + 1; g(x) = 2x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true

julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true

julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true

See also: Lp_norm, ℓp

source
LSHFunctions.L1Method
Lp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)

Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.

See also: ℓp

source
LSHFunctions.L1_normMethod
Lp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.

See also: ℓp_norm

source
LSHFunctions.L1_normMethod
Lp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.

See also: ℓp_norm

Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)

Computes the $L^p$ function-space norm of a function $f$, which is given by the equation

$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).

Examples

julia> f(x) = x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true

julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true

julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
source
LSHFunctions.L2Method
Lp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)

Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.

See also: ℓp

Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)

Computes the $L^p$ distance between two functions, given by

$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

Examples

Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral

$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$

for $p = 1$, $p = 2$, and $p = 3$.

julia> f(x) = x^2 + 1; g(x) = 2x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true

julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true

julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true

See also: Lp_norm, ℓp

source
LSHFunctions.L2Method
Lp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)

Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.

See also: ℓp

source
LSHFunctions.L2_normMethod
Lp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.

See also: ℓp_norm

source
LSHFunctions.L2_normMethod
Lp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.

See also: ℓp_norm

Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)

Computes the $L^p$ function-space norm of a function $f$, which is given by the equation

$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).

Examples

julia> f(x) = x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true

julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true

julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
source
LSHFunctions.LpFunction
Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)

Computes the $L^p$ distance between two functions, given by

$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

Examples

Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral

$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$

for $p = 1$, $p = 2$, and $p = 3$.

julia> f(x) = x^2 + 1; g(x) = 2x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true

julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true

julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true

See also: Lp_norm, ℓp

source
LSHFunctions.LpMethod
Lp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)

Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.

See also: ℓp

source
LSHFunctions.Lp_normFunction
Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)

Computes the $L^p$ function-space norm of a function $f$, which is given by the equation

$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$

L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).

Examples

julia> f(x) = x;

julia> interval = @interval(0 ≤ x ≤ 1);

julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true

julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true

julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
source
LSHFunctions.Lp_normFunction
Lp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.

See also: ℓp_norm

source
LSHFunctions.cossimMethod
cossim(x,y)

Computes the cosine similarity between two inputs $x$ and $y$. Cosine similarity is defined as

$\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|}$

where $\left\langle\cdot,\cdot\right\rangle$ is an inner product (e.g. dot product) and $\|\cdot\|$ is its derived norm. This is roughly interpreted as being related to the angle between the inputs $x$ and $y$: when $x$ and $y$ have low angle between them, cossim(x,y) is high (close to $1$). When $x$ and $y$ have large angle between them, cossim(x,y) is low (close to $-1$).

Arguments

  • x and y: two inputs for which dot(x,y), norm(x), and norm(y) are defined.

Examples

julia> using LinearAlgebra: dot, norm;

julia> x, y = rand(4), rand(4);

julia> cossim(x,y) == dot(x,y) / (norm(x) * norm(y))
true

julia> z = rand(5);

julia> cossim(x,z)
ERROR: DimensionMismatch("dot product arguments have lengths 4 and 5")

See also: SimHash

source
LSHFunctions.inner_prodMethod
inner_prod(x::AbstractVector, y::AbstractVector)

Computes the $\ell^2$ inner product (dot product)

$\left\langle x, y\right\rangle = \sum_i x_iy_i$

Examples

julia> using LinearAlgebra: dot;

julia> x, y = randn(4), randn(4);

julia> inner_prod(x,y) ≈ dot(x,y)
true
source
LSHFunctions.inner_prodMethod
inner_prod(f, g, interval::LSHFunctions.RealInterval)

Computes the $L^2$ inner product

$\left\langle f, g\right\rangle = \int_a^b f(x)g(x) \hspace{0.15cm} dx$

where the interval we're integrating over is specified by the interval argument.

Examples

julia> f(x) = cos(x); g(x) = sin(x);

julia> inner_prod(f, g, @interval(0 ≤ x ≤ π/2)) ≈ 1/2
true
source
LSHFunctions.jaccardMethod
function jaccard(x::BitArray{1}, y::BitArray{1})

Computes the Jaccard similarity between a pair of binary vectors:

$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$

Arguments

  • x::BitArray{1}, y::BitArray{1}: two binary vectors, in the form of BitArrays.

Examples

julia> x = BitArray([true, false, true, true, false]);

julia> y = BitArray([false, false, true, true, true]);

julia> jaccard(x,y)
0.5
source
LSHFunctions.jaccardMethod
jaccard(A::Set, B::Set) :: Float64

Computes the Jaccard similarity between sets $A$ and $B$, which is defined as

$\text{Jaccard}(A,B) = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}$

Arguments

  • A::Set, B::Set: two sets whose Jaccard similarity we would like to compute.

Examples

julia> A, B = Set([1, 2, 3]), Set([2, 3, 4]);

julia> jaccard(A,B)
0.5

julia> jaccard(A,B) == length(A ∩ B) / length(A ∪ B)
true

See also: MinHash

source
LSHFunctions.jaccardMethod
function jaccard(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})

Computes the Jaccard similarity between a pair of vectors of real numbers:

$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$

Arguments

  • x::AbstractVector{<:Real}, y::AbstractVector{<:Real}: a pair of vectors containing real numbers (subtypes of Real).

Examples

julia> x = [0.8, 0.1, 0.3, 0.4, 0.1];

julia> y = [1.0, 0.6, 0.0, 0.4, 0.5];

julia> jaccard(x,y)
0.5
source
LSHFunctions.jaccardMethod
function jaccard(A::Set{<:K},
                 B::Set{<:K},
                 weights::Dict{K,V}) where {K,V<:Number}

Computes the weighted Jaccard similarity between two sets:

$J(x, y) = \frac{\sum_{x\in A\cap B} w_x}{\sum_{y\in A\cup B} w_y}$

Arguments

  • A::Set, B::Set: two sets whose Jaccard similarity we would like to compute.
  • weights::Dict: a dictionary mapping symbols in the sets A and B to numerical weights. These weights must be positive.

Examples

julia> A = Set(["a", "b", "c"]);

julia> B = Set(["b", "c", "d"]);

julia> W = Dict("a" => 0.2, "b" => 2.4, "c" => 0.6, "d" => 1.8);

julia> jaccard(A,B,W)
0.6
source
LSHFunctions.ℓ1Method
ℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)

Computes the $\ell^p$ distance between a pair of vectors, given by

$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$

ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).

Examples

julia> x = [1, 2, 3];

julia> y = [4, 5, 6];

julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true

julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true

See also: ℓp_norm, L1Hash, L2Hash

source
LSHFunctions.ℓ1_normMethod
ℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a point $x$, defined as

$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$

Examples

julia> x = randn(4);

julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true

julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true

julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true

See also: ℓp, Lp_norm

source
LSHFunctions.ℓ2Method
ℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)

Computes the $\ell^p$ distance between a pair of vectors, given by

$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$

ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).

Examples

julia> x = [1, 2, 3];

julia> y = [4, 5, 6];

julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true

julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true

See also: ℓp_norm, L1Hash, L2Hash

source
LSHFunctions.ℓ2_normMethod
ℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a point $x$, defined as

$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$

Examples

julia> x = randn(4);

julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true

julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true

julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true

See also: ℓp, Lp_norm

source
LSHFunctions.ℓpFunction
ℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)

Computes the $\ell^p$ distance between a pair of vectors, given by

$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$

ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).

Examples

julia> x = [1, 2, 3];

julia> y = [4, 5, 6];

julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true

julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true

See also: ℓp_norm, L1Hash, L2Hash

source
LSHFunctions.ℓp_normFunction
ℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)

Compute the $\ell^p$ norm of a point $x$, defined as

$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$

Examples

julia> x = randn(4);

julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true

julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true

julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true

See also: ℓp, Lp_norm

source

Hashing in function spaces

LSHFunctions.MonteCarloHashType
MonteCarloHash(sim, ω, args...; volume=1.0, n_samples=1024, kws...)

Samples a hash function from an LSH family for the similarity sim defined over the function space $L^p_{\mu}(\Omega)$. sim may be one of the following:

  • L1
  • L2
  • cossim

Given an input function $f\in L^p_{\mu}(\Omega)$, MonteCarloHash works by sampling $f$ at some randomly-selected points in $\Omega$, and then hashing those samples.

Arguments

  • sim: the similarity statistic you want to hash on.
  • ω: a function that takes no inputs and samples a single point from $\Omega$. Alternatively, it can be viewed as a random variable with probability measure

\[\frac{\mu}{\text{vol}_{\mu}(\Omega)} = \frac{\mu}{\int_{\Omega} d\mu}\]

  • args...: arguments to pass on when building the LSHFunction instance underlying the returned MonteCarloHash struct.
  • volume::Real (default: 1.0): the volume of the space $\Omega$, defined as

\[\text{vol}_{\mu}(\Omega) = \int_{\Omega} d\mu\]

  • n_samples::Integer (default: 1024): the number of points to sample from each function that is hashed by the MonteCarloHash. Larger values of n_samples tend to capture the input function better and will thus be more likely to achieve desirable collision probabilities.
  • kws...: keyword arguments to pass on when building the LSHFunction instance underlying the returned MonteCarloHash struct.

Examples

Create a hash function for cosine similarity for functions in $L^2([-1,1])$:

julia> μ() = 2*rand()-1;   # μ samples a random point from [-1,1]

julia> hashfn = MonteCarloHash(cossim, μ, 50; volume=2.0);

julia> n_hashes(hashfn)
50

julia> similarity(hashfn) == cossim
true

julia> hashtype(hashfn)
Bool

Create a hash function for $L^2$ distance in the function space $L^2([0,2\pi])$. Hash the functions f(x) = cos(x) and f(x) = x/(2π) using the returned MonteCarloHash.

julia> μ() = 2π * rand(); # μ samples a random point from [0,2π]

julia> hashfn = MonteCarloHash(L2, μ, 3; volume=2π);

julia> hashfn(cos)
3-element Array{Int32,1}:
 -1
  3
  0

julia> hashfn(x -> x/(2π))
3-element Array{Int32,1}:
 -1
 -2
 -1

Create a hash function with a different number of sample points.

julia> μ() = rand();  # Samples a random point from [0,1]

julia> hashfn = MonteCarloHash(cossim, μ; volume=1.0, n_samples=512);

julia> length(hashfn.sample_points)
512
source

Miscellaneous

LSHFunctions.@intervalMacro
@interval(expr)

Construct a new LSHFunctions.RealInterval representing an interval on the real line from an expression such as

0 ≤ x < 1

The returned expression constructs an LSHFunctions.RealInterval encoding the lower and upper bounds on the interval, as well as whether the ends are opened or closed.

Examples

You can construct an interval using the following syntax:

julia> interval = @interval(0 ≤ x < 1);

There are usually multiple ways of constructing the same interval. For instance, each of the expressions below are equivalent ways of constructing the interval [-1,1].

julia> @interval(-1 ≤  x ≤  1) ==
       @interval(-1 <= x <= 1) ==
       @interval(-1 ≤  y ≤  1) ==
       @interval( 1 ≥  x ≥ -1)
true

You can even create intervals with Inf or -Inf at the endpoints, e.g. @interval(-Inf < x < Inf).

There are two primary operations you can run on an interval: testing for membership and intersection. You can test whether or not x is in an interval using x ∈ interval, as shown below.

julia> interval = @interval(0 ≤ x < 1);

julia> 0 ∈ interval && 1 ∉ interval
true

julia> 0 in interval    # This syntax also works
true

You can also intersect two intervals using the operator (or by using intersect(interval_1, interval_2)).

julia> @interval(0 ≤ x < 1) ∩ @interval(1/2 < x ≤ 1) == @interval(1/2 < x < 1)
true

See also: RealInterval

source

Private interface

LSHFunctions.RealIntervalType
struct RealInterval{T<:Real}

Encodes an interval of the real line, such as [-1,1] or [0,Inf).

Fields

  • lower::T: lower bound on the interval.
  • upper::T: upper bound on the interval.
  • closed_below::Bool: whether or not the interval is closed below.
  • closed_above::Bool: whether or not the interval is closed above.

Examples

The following snippet constructs RealInterval represeting the interval [0,1)

julia> interval = LSHFunctions.RealInterval(0, 1, true, false);

It's generally easier to construct an interval using the @interval macro. Check out the documentation for @interval for more information.

See also: @interval

source
Base.isemptyMethod
isempty(interval::RealInterval)

Returns true if interval is empty (i.e. there doesn't exist an x for which x ∈ interval is true), and false otherwise.

source
LSHFunctions.widthMethod
width(interval::RealInterval)

Return the width of a RealInterval (i.e. the difference between its upper and lower bounds.

source
LSHFunctions.@register_similarity!Macro
register_similarity!(similarity, hashfn)

Register hashfn to the LSH module as the default locality-sensitive hash function to use for similarity function similarity. This makes it possible to construct a new hash function for similarity with LSHFunction(similarity, args...; kws...).

Arguments

  • similarity: the similarity function to register.
  • hashfn: the default locality-sensitive hash function that similarity should be associated with.

Examples

Create a custom implementation of cosine similarity called my_cossim, and associate it with SimHash:

julia> using LinearAlgebra: dot, norm

julia> my_cossim(x,y) = dot(x,y) / (norm(x) * norm(y));

julia> hashfn = LSHFunction(my_cossim);
ERROR: MethodError: no method matching LSHFunction(::typeof(my_cossim))

julia> LSHFunctions.@register_similarity!(my_cossim, SimHash);

julia> hashfn = LSHFunction(my_cossim);

julia> isa(hashfn, SimHash)
true
source