API reference
LSHFunction API
LSHFunctions.LSHFunction — TypeLSHFunction(similarity, args...; kws...)Construct the default LSHFunction subtype that corresponds to the similarity function similarity.
Arguments
similarity: the similarity function you want to use. Can be any of the following:cossiminner_prodjaccardℓ1ℓ2
args...: arguments to pass on to the defaultLSHFunctionconstructor corresponding tosimilarity.kws...: keyword parameters to pass on to the defaultLSHFunctionconstructor corresponding tosimilarity.
Returns
Returns a subtype of LSH.LSHFunction that hashes the similarity function similarity.
Examples
In the snippet below, we construct LSHFunctions.SimHash (the default hash function corresponding to cosine similarity) using LSHFunction():
julia> hashfn = LSHFunction(cossim);
julia> typeof(hashfn) <: LSHFunctions.SimHash <: LSHFunction
trueWe can provide arguments and keyword parameters corresponding to the hash function that we construct:
julia> hashfn = LSHFunction(inner_prod, 100; dtype=Float64, maxnorm=10);
julia> n_hashes(hashfn) == 100 &&
typeof(hashfn) <: SignALSH{Float64} &&
hashfn.maxnorm == 10
trueSee also: lsh_family
LSHFunctions.lsh_family — Functionlsh_family(similarity)Return the default constructor or LSHFunction subtype used to construct a hash function for the similarity function similarity.
The main use of lsh_family is to make it easier to find the documentation for the hash function that's constructed when you call LSHFunction. For instance, if you want to know more about the arguments and keyword parameters that can be given to LSHFunction(inner_prod), you can run
julia> lsh_family(inner_prod)
SignALSH
help?> SignALSHExamples
julia> lsh_family(cossim)
SimHash
julia> lsh_family(ℓ1)
L1HashSee also: LSHFunction
LSHFunctions.hashtype — Functionhashtype(hashfn::LSHFunction)Returns the type of hash generated by a hash function.
Examples
julia> hashfn = LSHFunction(cossim);
julia> hashtype(hashfn)
Bool
julia> hashfn = LSHFunction(ℓ1);
julia> hashtype(hashfn)
Int32LSHFunctions.n_hashes — Functionn_hashes(hashfn::LSHFunction)Return the number of hashes computed by hashfn.
Examples
julia> hashfn = SimHash();
julia> n_hashes(hashfn)
1
julia> hashfn = SimHash(12);
julia> n_hashes(hashfn)
12
julia> hashes = hashfn(rand(25));
julia> length(hashes)
12LSHFunctions.similarity — Functionsimilarity(hashfn::LSHFunction)Returns the similarity function that hashfn hashes on.
Arguments
hashfn::AbstractLSHFunction: the hash function whose similarity we would like to retrieve.
Examples
julia> hashfn = LSHFunction(cossim);
julia> similarity(hashfn) == cossim
true
julia> hashfn = LSHFunction(ℓ1);
julia> similarity(hashfn) == ℓ1
trueLSHFunctions.collision_probability — Functioncollision_probability(
hashfn::H,
sim;
n_hashes::Union{Symbol,Integer}=:auto
) where {H <: LSHFunction}Compute the probability of hash collision between two inputs with similarity sim for an LSHFunction of type H. This function returns the probability that n_hashes hashes simultaneously collide.
Arguments
hashfn::LSHFunction: theLSHFunctionfor which we want to compute the probability of collision.sim: a similarity (or vector of similarities), computed using the similarity function returned bysimilarity(hashfn).
Keyword arguments
n_hashes::Union{Symbol,Integer}(default::auto): the number of hash functions to use to compute the probability of collision. If the probability that a single hash collides isp, then the probability thatn_hasheshashes simultaneously collide isp^n_hashes. As a result,collision_probability(hashfn, sim; n_hashes=N)is the same as
collision_probability(hashfn, sim; n_hashes=1).^NIf
n_hashes = :autothen this function will select the number of hashes to ben_hashes(hashfn)(using then_hashesfunction from theLSHFunctionAPI).
Examples
The probability that a single MinHash hash function causes a hash collision between inputs A and B is equal to jaccard(A,B):
julia> hashfn = MinHash();
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> jaccard(A,B)
0.5
julia> collision_probability(hashfn, jaccard(A,B); n_hashes=1)
0.5If our MinHash struct keeps track of N hash functions simultaneously, then the probability of collision is jaccard(A,B)^N:
julia> hashfn = MinHash(10);
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> collision_probability(hashfn, jaccard(A,B)) ==
collision_probability(hashfn, jaccard(A,B); n_hashes=10) ==
collision_probability(hashfn, jaccard(A,B); n_hashes=1)^10
trueSee also: n_hashes, similarity
collision_probability(hashfn::LSHFunction, x, y;
n_hashes::Union{Symbol,Integer} = :auto)Computes the probability of a hash collision between two inputs x and y for a given hash function hashfn. This is the same as calling
collision_probability(hashfn, similarity(hashfn)(x,y); n_hashes=n_hashes)Examples
The following snippet computes the probability of collision between two sets A and B for a single MinHash. For MinHash, this probability is just equal to the Jaccard similarity between A and B.
julia> hashfn = MinHash();
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["a", "b", "c"]);
julia> similarity(hashfn) == jaccard
true
julia> collision_probability(hashfn, A, B) ==
collision_probability(hashfn, jaccard(A,B)) ==
jaccard(A,B)
trueWe can use the n_hashes argument to specify the probability that n_hashes MinHash hash functions simultaneously collide. If left unspecified, then we'll simply use n_hashes(hashfn) as the number of hash functions:
julia> hashfn = MinHash(10);
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["a", "b", "c"]);
julia> collision_probability(hashfn, A, B) ==
collision_probability(hashfn, A, B; n_hashes=10) ==
collision_probability(hashfn, A, B; n_hashes=1)^10
trueLSHFunctions.index_hash — Functionindex_hash(hashfn::AsymmetricLSHFunction, x)Computes the indexing hash (the hash used to insert items into the hash table) for an AsymmetricLSHFunction with input x.
See also: query_hash, AsymmetricLSHFunction
LSHFunctions.query_hash — Functionquery_hash(hashfn::AsymmetricLSHFunction, x)Computes the querying hash (the hash used to query for items in the hash table) for an AsymmetricLSHFunction with input x.
See also: index_hash, AsymmetricLSHFunction
LSHFunctions.SymmetricLSHFunction — Typeabstract type SymmetricLSHFunction <: LSHFunction endA symmetric locality-sensitive hashing function. A SymmetricLSHFunction uses the same hash function to insert items into a hash table as well as query the table for collisions. If hashfn is a SymmetricLSHFunction, you can compute the hash for an input x as hashfn(x).
See also: AsymmetricLSHFunction
LSHFunctions.AsymmetricLSHFunction — Typeabstract type AsymmetricLSHFunction <: LSHFunction endAn asymmetric locality-sensitive hashing function. An AsymmetricLSHFunction uses one hash function to insert items into a hash table, and a different hash function to query the table for collisions. If hashfn is an AsymmetricLSHFunction, you can compute the indexing hash for an input x with index_hash(hashfn,x), and the querying hash with query_hash(hashfn,x).
See also: SymmetricLSHFunction
Hash functions
LSHFunctions.SimHash — TypeSimHash(n_hashes::Integer = 1;
dtype::Type = Float32,
resize_pow2::Bool = false)Creates a locality-sensitive hash function for cosine similarity.
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Float32): the data type to use in the LSHFunctions.SimHash internals. For performance reasons you should pickdtypeto match the type of the data you're hashing.resize_pow2::Bool(default:false): affects the way in which the returnedLSHFunctions.SimHashresizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true.
Examples
Construct a hash function by calling SimHash with the number of hash functions you want to generate:
julia> hashfn = SimHash(24);
julia> n_hashes(hashfn) == 24 &&
similarity(hashfn) == cossim
trueYou can then call hashfn(x) in order to compute hashes:
julia> hashfn = SimHash(32);
julia> x = randn(30);
julia> hashes = hashfn(x);
References
- Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, page 380–388, New York, NY, USA, 2002. Association for Computing Machinery. 10.1145/509907.509965.
See also: cossim
LSHFunctions.MinHash — TypeMinHash(n_hashes::Integer = 1;
dtype::Type = Any,
symbols::Union{Vector,Set} = Set())Construct a locality-sensitive hash function for Jaccard similarity.
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Any): the type of symbols in the sets you're hashing. This is overriden by the data type contained insymbolswhensymbolsis non-empty.symbols::Union{Vector,Set}: aVectororSetcontaining all of the possible elements ("symbols") of the sets that you will be hashing. If left empty,MinHashwill instead expand its dictionary when it sees new symbols (at small additional computational expense).
Examples
Construct a hash function to hash sets whose elements are integers between 1 and 50:
julia> hashfn = MinHash(40; symbols = Set(1:50));
julia> n_hashes(hashfn) == 40 && similarity(hashfn) == jaccard
true
julia> hashfn(Set([38, 14, 29, 48, 11]));
julia> hashfn([1, 1, 2, 3, 4]); # You can also hash Vectors
julia> hashfn(Set([100]))
ERROR: Symbol 100 not foundIf you aren't sure ahead of time exactly what kinds of elements will be in the sets you're hashing, you can opt not to specify symbols, in which case MinHash will lazily update its hash functions as it encounters new symbols:
julia> hashfn = MinHash();
julia> hashfn(Set([1, 2, 3]));
julia> hashfn(Set(["a", "b", "c"]));
If you don't know what elements you'll encounter, but you know that they'll all be of a specific data type, you can specify the dtype argument for increased efficiency:
julia> hashfn = MinHash(10; dtype = String);
julia> hashfn(Set(["a", "b", "c"]));
References
- Broder, A. On the resemblance and containment of documents. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.
See also: jaccard
LSHFunctions.L1Hash — TypeL1Hash(
n_hashes::Integer = 1;
dtype::Type = Float32,
r::Real = 1.0,
resize_pow2::Bool = false
)Constructs a locality-sensitive hash for $\ell^1$ distance ($\|x - y\|_1$), defined as
$\|x - y\|_1 = \sum_i |x_i - y_i|$
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Float32): the data type to use in the LSHFunctions.L1Hash internals. For performance reasons you should pickdtypeto match the type of the data you're hashing.r::Real(default:1.0): a positive coefficient whose magnitude influences the collision rate. Larger values ofrwill increase the collision rate, even for distant points. See references for more information.resize_pow2::Bool(default:false): affects the way in which the returnedLSHFunctions.L1Hashresizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true.
Examples
Construct an L1Hash with the number of hash functions you want to generate:
julia> hashfn = L1Hash(128);
julia> hashfn.power == 1 &&
n_hashes(hashfn) == 128 &&
similarity(hashfn) == ℓ1
trueAfter creating a hash function, you can compute hashes with hashfn(x):
julia> hashfn = L1Hash(20);
julia> x = rand(4);
julia> hashes = hashfn(x);
References
- Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.
See also: ℓ1
LSHFunctions.L2Hash — TypeL2Hash(
n_hashes::Integer = 1;
dtype::Type = Float32,
r::Real = 1.0,
resize_pow2::Bool = false
)Constructs a locality-sensitive hash for $\ell^2$ distance ($\|x - y\|_2$), defined as
$\|x - y\|_2 = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Float32): the data type to use in the LSHFunctions.L2Hash internals. For performance reasons you should pickdtypeto match the type of the data you're hashing.r::Real(default:1.0): a positive coefficient whose magnitude influences the collision rate. Larger values ofrwill increase the collision rate, even for distant points. See references for more information.resize_pow2::Bool(default:false): affects the way in which the returnedLSHFunctions.L2Hashresizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true.
Examples
Construct an L2Hash with the number of hash functions you want to generate:
julia> hashfn = L2Hash(128);
julia> hashfn.power == 2 &&
n_hashes(hashfn) == 128 &&
similarity(hashfn) == ℓ2
trueAfter creating a hash function, you can compute hashes with hashfn(x):
julia> hashfn = L2Hash(20);
julia> x = rand(4);
julia> hashes = hashfn(x);
References
- Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.
See also: ℓ2
LSHFunctions.SignALSH — TypeSignALSH(n_hashes::Integer = 1,
dtype::Type = Float32,
maxnorm::Union{Nothing,Real} = nothing,
m::Integer = 3,
resize_pow2::Bool = false)Create a SignALSH hash function for hashing on inner product similarity.
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Float32): the data type to use in the LSHFunctions.SignALSH internals. For performance reasons you should pickdtypeto match the type of the data you're hashing.maxnorm::Union{Nothing,Real}(default:nothing): an upper bound on the $\ell^2$-norm of the data points.
The maxnorm keyword parameter must be explicitly specified. If it is left unspecified (or set to nothing), SignALSH() will raise an error.
m::Integer(default:3): parametermthat affects the probability of a hash collision.resize_pow2::Bool(default:false): affects the way in which the returnedLSHFunctions.SignALSHresizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true.
Examples
SignALSH is an AsymmetricLSHFunction, and hence hashes must be computed using index_hash and query_hash.
julia> hashfn = SignALSH(12; maxnorm=10);
julia> x = rand(4);
julia> ih = index_hash(hashfn, x); qh = query_hash(hashfn, x);
julia> length(ih) == length(qh) == 12
true
julia> typeof(ih) == typeof(qh) == BitArray{1}
trueYou need to explicitly specify the maxnorm keyword parameter when constructing SignALSH, otherwise you will get an error.
julia> hashfn = SignALSH(12)
ERROR: maxnorm must be specified for SignALSHYou'll also get an error if you try to hash a vector that has norm greater than the maxnorm that you specified.
julia> hashfn = SignALSH(; maxnorm=1);
julia> index_hash(hashfn, ones(4))
ERROR: norm 2.0 exceeds maxnorm (1.0)References
- Anshumali Shrivastava and Ping Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, page 812–821, Arlington, Virginia, USA, 2015. AUAI Press. 10.5555/3020847.3020931. arXiv:1405.5869
See also: inner_prod, ℓ2_norm
LSHFunctions.MIPSHash — TypeMIPSHash(n_hashes::Integer = 1;
dtype::Datatype = Float32,
maxnorm::Union{Nothing,Real} = nothing,
scale::Real = 1,
m::Integer = 3,
resize_pow2::Bool = false)Create a MIPSHash hash function for hashing on inner product similarity.
Arguments
n_hashes::Integer(default:1): the number of hash functions to generate.
Keyword parameters
dtype::Type(default:Float32): the data type to use in the LSHFunctions.MIPSHash internals. For performance reasons you should pickdtypeto match the type of the data you're hashing.maxnorm::Union{Nothing,Real}(default:nothing): an upper bound on the $\ell^2$-norm of the data points.
The maxnorm keyword parameter must be explicitly specified. If it is left unspecified (or set to nothing), MIPSHash() will raise an error.
scale::Real(default:1): parameter that affects the probability of a hash collision. Large values ofscaleincreases hash collision probability (even for inputs with low inner product similarity); small values ofscalewill decrease hash collision probability.
Examples
MIPSHash is an AsymmetricLSHFunction, and hence hashes must be computed using index_hash and query_hash.
julia> hashfn = MIPSHash(5; maxnorm=10);
julia> x = rand(4);
julia> ih = index_hash(hashfn, x); qh = query_hash(hashfn, x);
julia> length(ih) == length(qh) == 5
true
julia> typeof(ih) == typeof(qh) == Vector{Int32}
trueYou need to explicitly specify the maxnorm keyword parameter when constructing MIPSHash, otherwise you will get an error.
julia> hashfn = MIPSHash(5)
ERROR: maxnorm must be specified for MIPSHashYou'll also get an error if you try to hash a vector that has norm greater than the maxnorm that you specified.
julia> hashfn = MIPSHash(; maxnorm=1);
julia> index_hash(hashfn, ones(4))
ERROR: norm 2.0 exceeds maxnorm (1.0)References
- Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, page 2321–2329, Cambridge, MA, USA, 2014. MIT Press. 10.5555/2969033.2969086. arXiv:1410.5410
See also: inner_prod, ℓ2_norm
Similarity functions
LSHFunctions.L1 — MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.
See also: ℓp
Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
trueLSHFunctions.L1 — MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.
See also: ℓp
LSHFunctions.L1_norm — MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.
See also: ℓp_norm
LSHFunctions.L1_norm — MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.
See also: ℓp_norm
Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
trueLSHFunctions.L2 — MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.
See also: ℓp
Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
trueLSHFunctions.L2 — MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.
See also: ℓp
LSHFunctions.L2_norm — MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.
See also: ℓp_norm
LSHFunctions.L2_norm — MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.
See also: ℓp_norm
Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
trueLSHFunctions.Lp — FunctionLp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
trueLSHFunctions.Lp — MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p), ℓ1(x,y), and ℓ2(x,y), respectively.
See also: ℓp
LSHFunctions.Lp_norm — FunctionLp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval) is the same as Lp_norm(f, interval, 1), and L2_norm(f, interval) is the same as Lp_norm(f, interval, 2).
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
trueLSHFunctions.Lp_norm — FunctionLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p), ℓ1_norm(x), and ℓ2_norm(x), respectively.
See also: ℓp_norm
LSHFunctions.cossim — Methodcossim(x,y)Computes the cosine similarity between two inputs $x$ and $y$. Cosine similarity is defined as
$\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|}$
where $\left\langle\cdot,\cdot\right\rangle$ is an inner product (e.g. dot product) and $\|\cdot\|$ is its derived norm. This is roughly interpreted as being related to the angle between the inputs $x$ and $y$: when $x$ and $y$ have low angle between them, cossim(x,y) is high (close to $1$). When $x$ and $y$ have large angle between them, cossim(x,y) is low (close to $-1$).
Arguments
xandy: two inputs for whichdot(x,y),norm(x), andnorm(y)are defined.
Examples
julia> using LinearAlgebra: dot, norm;
julia> x, y = rand(4), rand(4);
julia> cossim(x,y) == dot(x,y) / (norm(x) * norm(y))
true
julia> z = rand(5);
julia> cossim(x,z)
ERROR: DimensionMismatch("dot product arguments have lengths 4 and 5")See also: SimHash
LSHFunctions.inner_prod — Methodinner_prod(x::AbstractVector, y::AbstractVector)Computes the $\ell^2$ inner product (dot product)
$\left\langle x, y\right\rangle = \sum_i x_iy_i$
Examples
julia> using LinearAlgebra: dot;
julia> x, y = randn(4), randn(4);
julia> inner_prod(x,y) ≈ dot(x,y)
trueLSHFunctions.inner_prod — Methodinner_prod(f, g, interval::LSHFunctions.RealInterval)Computes the $L^2$ inner product
$\left\langle f, g\right\rangle = \int_a^b f(x)g(x) \hspace{0.15cm} dx$
where the interval we're integrating over is specified by the interval argument.
Examples
julia> f(x) = cos(x); g(x) = sin(x);
julia> inner_prod(f, g, @interval(0 ≤ x ≤ π/2)) ≈ 1/2
trueLSHFunctions.jaccard — Methodfunction jaccard(x::BitArray{1}, y::BitArray{1})Computes the Jaccard similarity between a pair of binary vectors:
$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$
Arguments
x::BitArray{1},y::BitArray{1}: two binary vectors, in the form ofBitArrays.
Examples
julia> x = BitArray([true, false, true, true, false]);
julia> y = BitArray([false, false, true, true, true]);
julia> jaccard(x,y)
0.5LSHFunctions.jaccard — Methodjaccard(A::Set, B::Set) :: Float64Computes the Jaccard similarity between sets $A$ and $B$, which is defined as
$\text{Jaccard}(A,B) = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}$
Arguments
A::Set,B::Set: two sets whose Jaccard similarity we would like to compute.
Examples
julia> A, B = Set([1, 2, 3]), Set([2, 3, 4]);
julia> jaccard(A,B)
0.5
julia> jaccard(A,B) == length(A ∩ B) / length(A ∪ B)
trueSee also: MinHash
LSHFunctions.jaccard — Methodfunction jaccard(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})Computes the Jaccard similarity between a pair of vectors of real numbers:
$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$
Arguments
x::AbstractVector{<:Real},y::AbstractVector{<:Real}: a pair of vectors containing real numbers (subtypes ofReal).
Examples
julia> x = [0.8, 0.1, 0.3, 0.4, 0.1];
julia> y = [1.0, 0.6, 0.0, 0.4, 0.5];
julia> jaccard(x,y)
0.5LSHFunctions.jaccard — Methodfunction jaccard(A::Set{<:K},
B::Set{<:K},
weights::Dict{K,V}) where {K,V<:Number}Computes the weighted Jaccard similarity between two sets:
$J(x, y) = \frac{\sum_{x\in A\cap B} w_x}{\sum_{y\in A\cup B} w_y}$
Arguments
A::Set,B::Set: two sets whose Jaccard similarity we would like to compute.weights::Dict: a dictionary mapping symbols in the setsAandBto numerical weights. These weights must be positive.
Examples
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> W = Dict("a" => 0.2, "b" => 2.4, "c" => 0.6, "d" => 1.8);
julia> jaccard(A,B,W)
0.6LSHFunctions.ℓ1 — Methodℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
trueLSHFunctions.ℓ1_norm — Methodℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
trueLSHFunctions.ℓ2 — Methodℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
trueLSHFunctions.ℓ2_norm — Methodℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
trueLSHFunctions.ℓp — Functionℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y) is the same as ℓp(x,y,1), and ℓ2(x,y) is the same as ℓp(x,y,2).
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
trueLSHFunctions.ℓp_norm — Functionℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
trueHashing in function spaces
LSHFunctions.MonteCarloHash — TypeMonteCarloHash(sim, ω, args...; volume=1.0, n_samples=1024, kws...)Samples a hash function from an LSH family for the similarity sim defined over the function space $L^p_{\mu}(\Omega)$. sim may be one of the following:
L1L2cossim
Given an input function $f\in L^p_{\mu}(\Omega)$, MonteCarloHash works by sampling $f$ at some randomly-selected points in $\Omega$, and then hashing those samples.
Arguments
sim: the similarity statistic you want to hash on.ω: a function that takes no inputs and samples a single point from $\Omega$. Alternatively, it can be viewed as a random variable with probability measure
args...: arguments to pass on when building theLSHFunctioninstance underlying the returnedMonteCarloHashstruct.volume::Real(default:1.0): the volume of the space $\Omega$, defined as
n_samples::Integer(default:1024): the number of points to sample from each function that is hashed by theMonteCarloHash. Larger values ofn_samplestend to capture the input function better and will thus be more likely to achieve desirable collision probabilities.kws...: keyword arguments to pass on when building theLSHFunctioninstance underlying the returnedMonteCarloHashstruct.
Examples
Create a hash function for cosine similarity for functions in $L^2([-1,1])$:
julia> μ() = 2*rand()-1; # μ samples a random point from [-1,1]
julia> hashfn = MonteCarloHash(cossim, μ, 50; volume=2.0);
julia> n_hashes(hashfn)
50
julia> similarity(hashfn) == cossim
true
julia> hashtype(hashfn)
BoolCreate a hash function for $L^2$ distance in the function space $L^2([0,2\pi])$. Hash the functions f(x) = cos(x) and f(x) = x/(2π) using the returned MonteCarloHash.
julia> μ() = 2π * rand(); # μ samples a random point from [0,2π]
julia> hashfn = MonteCarloHash(L2, μ, 3; volume=2π);
julia> hashfn(cos)
3-element Array{Int32,1}:
-1
3
0
julia> hashfn(x -> x/(2π))
3-element Array{Int32,1}:
-1
-2
-1Create a hash function with a different number of sample points.
julia> μ() = rand(); # Samples a random point from [0,1]
julia> hashfn = MonteCarloHash(cossim, μ; volume=1.0, n_samples=512);
julia> length(hashfn.sample_points)
512See also: ChebHash
LSHFunctions.ChebHash — TypeChebHash(sim, args...; interval=LSHFunctions.RealInterval{Int64}(-1, 1, true, true), kws...)Samples a hash function from an LSH family for the similarity sim defined over the function space $L^p_{\mu}(\Omega)$. sim may be one of the following:
L2cossim
ChebHash works by approximating a function by Chebyshev polynomials. You can choose the degree of the approximation to trade between speed and generating desirable hash collision probabilities.
ChebHash can only hash function spaces of the form $L^2([a,b])$, where $[a,b]$ is an interval on the real line. For a more versatile option, checkout out MonteCarloHash.
Arguments
sim: the similarity function you want to hash on.args...: arguments to pass on when building theLSHFunctioninstance underlying the returnedChebHashstruct.kws...: keyword arguments to pass on when building theLSHFunctioninstance underlying the returnedChebHashstruct.
Examples
Create a hash function for cosine similarity for functions in $L^2([-1,1])$:
julia> hashfn = ChebHash(cossim, 50; interval=@interval(-1 ≤ x ≤ 1));
julia> n_hashes(hashfn)
50
julia> similarity(hashfn) == cossim
true
julia> hashtype(hashfn)
BoolCreate a hash function for $L^2$ distance defined over $L^2([0,2\pi])$. Hash the functions f(x) = cos(x) and f(x) = x/(2π) using the returned ChebHash:
julia> hashfn = ChebHash(L2, 3; interval=@interval(0 ≤ x ≤ 2π));
julia> hashfn(cos)
3-element Array{Int32,1}:
3
-1
-2
julia> hashfn(x -> x/(2π))
3-element Array{Int32,1}:
0
1
0See also: MonteCarloHash
Miscellaneous
LSHFunctions.HashCompressor — Typestruct HashCompressorA compressor for converting variable-width hashes generated by LSHFunctions into fixed-width hashes. HashCompressor works by taking an array of hashes generated by an LSHFunction, and using SHA-256 to convert it into a fixed-width hash.
LSHFunctions.HashCompressor — Methodfunction HashCompressor(
n_bytes :: Integer = 32,
salt :: Union{Vector{UInt8}} = Vector{UInt8}(undef,0)
)Construct a new HashCompressor. The created HashCompressor will compress hashes into n_bytes bytes, and use the provided salt during hash compression.
Keyword arguments
n_bytes::Integer(default:32): the number of bytes to compress hashes into.salt::Vector{UInt8}(default:Vector{UInt8}(undef,0): a salt to prepend to hashes before compression using SHA-256.
Examples
julia> compressor = HashCompressor(n_bytes=4);
julia> compressor([1, 4, 2, 9, 5, 5])
4-element Array{UInt8,1}:
0xf3
0x91
0x55
0x2eLSHFunctions.@interval — Macro@interval(expr)Construct a new LSHFunctions.RealInterval representing an interval on the real line from an expression such as
0 ≤ x < 1The returned expression constructs an LSHFunctions.RealInterval encoding the lower and upper bounds on the interval, as well as whether the ends are opened or closed.
Examples
You can construct an interval using the following syntax:
julia> interval = @interval(0 ≤ x < 1);There are usually multiple ways of constructing the same interval. For instance, each of the expressions below are equivalent ways of constructing the interval [-1,1].
julia> @interval(-1 ≤ x ≤ 1) ==
@interval(-1 <= x <= 1) ==
@interval(-1 ≤ y ≤ 1) ==
@interval( 1 ≥ x ≥ -1)
trueYou can even create intervals with Inf or -Inf at the endpoints, e.g. @interval(-Inf < x < Inf).
There are two primary operations you can run on an interval: testing for membership and intersection. You can test whether or not x is in an interval using x ∈ interval, as shown below.
julia> interval = @interval(0 ≤ x < 1);
julia> 0 ∈ interval && 1 ∉ interval
true
julia> 0 in interval # This syntax also works
trueYou can also intersect two intervals using the ∩ operator (or by using intersect(interval_1, interval_2)).
julia> @interval(0 ≤ x < 1) ∩ @interval(1/2 < x ≤ 1) == @interval(1/2 < x < 1)
trueSee also: RealInterval
Private interface
LSHFunctions.RealInterval — Typestruct RealInterval{T<:Real}Encodes an interval of the real line, such as [-1,1] or [0,Inf).
Fields
lower::T: lower bound on the interval.upper::T: upper bound on the interval.closed_below::Bool: whether or not the interval is closed below.closed_above::Bool: whether or not the interval is closed above.
Examples
The following snippet constructs RealInterval represeting the interval [0,1)
julia> interval = LSHFunctions.RealInterval(0, 1, true, false);It's generally easier to construct an interval using the @interval macro. Check out the documentation for @interval for more information.
See also: @interval
Base.isempty — Methodisempty(interval::RealInterval)Returns true if interval is empty (i.e. there doesn't exist an x for which x ∈ interval is true), and false otherwise.
LSHFunctions.width — Methodwidth(interval::RealInterval)Return the width of a RealInterval (i.e. the difference between its upper and lower bounds.
LSHFunctions.@register_similarity! — Macroregister_similarity!(similarity, hashfn)Register hashfn to the LSH module as the default locality-sensitive hash function to use for similarity function similarity. This makes it possible to construct a new hash function for similarity with LSHFunction(similarity, args...; kws...).
Arguments
similarity: the similarity function to register.hashfn: the default locality-sensitive hash function thatsimilarityshould be associated with.
Examples
Create a custom implementation of cosine similarity called my_cossim, and associate it with SimHash:
julia> using LinearAlgebra: dot, norm
julia> my_cossim(x,y) = dot(x,y) / (norm(x) * norm(y));
julia> hashfn = LSHFunction(my_cossim);
ERROR: MethodError: no method matching LSHFunction(::typeof(my_cossim))
julia> LSHFunctions.@register_similarity!(my_cossim, SimHash);
julia> hashfn = LSHFunction(my_cossim);
julia> isa(hashfn, SimHash)
true