API reference
LSHFunction API
LSHFunctions.LSHFunction
— TypeLSHFunction(similarity, args...; kws...)
Construct the default LSHFunction
subtype that corresponds to the similarity function similarity
.
Arguments
similarity
: the similarity function you want to use. Can be any of the following:cossim
inner_prod
jaccard
ℓ1
ℓ2
args...
: arguments to pass on to the defaultLSHFunction
constructor corresponding tosimilarity
.kws...
: keyword parameters to pass on to the defaultLSHFunction
constructor corresponding tosimilarity
.
Returns
Returns a subtype of LSH.LSHFunction
that hashes the similarity function similarity
.
Examples
In the snippet below, we construct LSHFunctions.SimHash
(the default hash function corresponding to cosine similarity) using LSHFunction()
:
julia> hashfn = LSHFunction(cossim);
julia> typeof(hashfn) <: LSHFunctions.SimHash <: LSHFunction
true
We can provide arguments and keyword parameters corresponding to the hash function that we construct:
julia> hashfn = LSHFunction(inner_prod, 100; dtype=Float64, maxnorm=10);
julia> n_hashes(hashfn) == 100 &&
typeof(hashfn) <: SignALSH{Float64} &&
hashfn.maxnorm == 10
true
See also: lsh_family
LSHFunctions.lsh_family
— Functionlsh_family(similarity)
Return the default constructor or LSHFunction
subtype used to construct a hash function for the similarity function similarity
.
The main use of lsh_family
is to make it easier to find the documentation for the hash function that's constructed when you call LSHFunction
. For instance, if you want to know more about the arguments and keyword parameters that can be given to LSHFunction(inner_prod)
, you can run
julia> lsh_family(inner_prod)
SignALSH
help?> SignALSH
Examples
julia> lsh_family(cossim)
SimHash
julia> lsh_family(ℓ1)
L1Hash
See also: LSHFunction
LSHFunctions.hashtype
— Functionhashtype(hashfn::LSHFunction)
Returns the type of hash generated by a hash function.
Examples
julia> hashfn = LSHFunction(cossim);
julia> hashtype(hashfn)
Bool
julia> hashfn = LSHFunction(ℓ1);
julia> hashtype(hashfn)
Int32
LSHFunctions.n_hashes
— Functionn_hashes(hashfn::LSHFunction)
Return the number of hashes computed by hashfn
.
Examples
julia> hashfn = SimHash();
julia> n_hashes(hashfn)
1
julia> hashfn = SimHash(12);
julia> n_hashes(hashfn)
12
julia> hashes = hashfn(rand(25));
julia> length(hashes)
12
LSHFunctions.similarity
— Functionsimilarity(hashfn::LSHFunction)
Returns the similarity function that hashfn
hashes on.
Arguments
hashfn::AbstractLSHFunction
: the hash function whose similarity we would like to retrieve.
Examples
julia> hashfn = LSHFunction(cossim);
julia> similarity(hashfn) == cossim
true
julia> hashfn = LSHFunction(ℓ1);
julia> similarity(hashfn) == ℓ1
true
LSHFunctions.collision_probability
— Functioncollision_probability(
hashfn::H,
sim;
n_hashes::Union{Symbol,Integer}=:auto
) where {H <: LSHFunction}
Compute the probability of hash collision between two inputs with similarity sim
for an LSHFunction
of type H
. This function returns the probability that n_hashes
hashes simultaneously collide.
Arguments
hashfn::LSHFunction
: theLSHFunction
for which we want to compute the probability of collision.sim
: a similarity (or vector of similarities), computed using the similarity function returned bysimilarity(hashfn)
.
Keyword arguments
n_hashes::Union{Symbol,Integer}
(default::auto
): the number of hash functions to use to compute the probability of collision. If the probability that a single hash collides isp
, then the probability thatn_hashes
hashes simultaneously collide isp^n_hashes
. As a result,collision_probability(hashfn, sim; n_hashes=N)
is the same as
collision_probability(hashfn, sim; n_hashes=1).^N
If
n_hashes = :auto
then this function will select the number of hashes to ben_hashes(hashfn)
(using then_hashes
function from theLSHFunction
API).
Examples
The probability that a single MinHash hash function causes a hash collision between inputs A
and B
is equal to jaccard(A,B)
:
julia> hashfn = MinHash();
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> jaccard(A,B)
0.5
julia> collision_probability(hashfn, jaccard(A,B); n_hashes=1)
0.5
If our MinHash
struct keeps track of N
hash functions simultaneously, then the probability of collision is jaccard(A,B)^N
:
julia> hashfn = MinHash(10);
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> collision_probability(hashfn, jaccard(A,B)) ==
collision_probability(hashfn, jaccard(A,B); n_hashes=10) ==
collision_probability(hashfn, jaccard(A,B); n_hashes=1)^10
true
See also: n_hashes
, similarity
collision_probability(hashfn::LSHFunction, x, y;
n_hashes::Union{Symbol,Integer} = :auto)
Computes the probability of a hash collision between two inputs x
and y
for a given hash function hashfn
. This is the same as calling
collision_probability(hashfn, similarity(hashfn)(x,y); n_hashes=n_hashes)
Examples
The following snippet computes the probability of collision between two sets A
and B
for a single MinHash. For MinHash, this probability is just equal to the Jaccard similarity between A
and B
.
julia> hashfn = MinHash();
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["a", "b", "c"]);
julia> similarity(hashfn) == jaccard
true
julia> collision_probability(hashfn, A, B) ==
collision_probability(hashfn, jaccard(A,B)) ==
jaccard(A,B)
true
We can use the n_hashes
argument to specify the probability that n_hashes
MinHash hash functions simultaneously collide. If left unspecified, then we'll simply use n_hashes(hashfn)
as the number of hash functions:
julia> hashfn = MinHash(10);
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["a", "b", "c"]);
julia> collision_probability(hashfn, A, B) ==
collision_probability(hashfn, A, B; n_hashes=10) ==
collision_probability(hashfn, A, B; n_hashes=1)^10
true
LSHFunctions.index_hash
— Functionindex_hash(hashfn::AsymmetricLSHFunction, x)
Computes the indexing hash (the hash used to insert items into the hash table) for an AsymmetricLSHFunction
with input x
.
See also: query_hash
, AsymmetricLSHFunction
LSHFunctions.query_hash
— Functionquery_hash(hashfn::AsymmetricLSHFunction, x)
Computes the querying hash (the hash used to query for items in the hash table) for an AsymmetricLSHFunction
with input x
.
See also: index_hash
, AsymmetricLSHFunction
LSHFunctions.SymmetricLSHFunction
— Typeabstract type SymmetricLSHFunction <: LSHFunction end
A symmetric locality-sensitive hashing function. A SymmetricLSHFunction
uses the same hash function to insert items into a hash table as well as query the table for collisions. If hashfn
is a SymmetricLSHFunction
, you can compute the hash for an input x
as hashfn(x)
.
See also: AsymmetricLSHFunction
LSHFunctions.AsymmetricLSHFunction
— Typeabstract type AsymmetricLSHFunction <: LSHFunction end
An asymmetric locality-sensitive hashing function. An AsymmetricLSHFunction
uses one hash function to insert items into a hash table, and a different hash function to query the table for collisions. If hashfn
is an AsymmetricLSHFunction
, you can compute the indexing hash for an input x
with index_hash(hashfn,x)
, and the querying hash with query_hash(hashfn,x)
.
See also: SymmetricLSHFunction
Hash functions
LSHFunctions.SimHash
— TypeSimHash(n_hashes::Integer = 1;
dtype::Type = Float32,
resize_pow2::Bool = false)
Creates a locality-sensitive hash function for cosine similarity.
Arguments
n_hashes::Integer
(default:1
): the number of hash functions to generate.
Keyword parameters
dtype::Type
(default:Float32
): the data type to use in the LSHFunctions.SimHash internals. For performance reasons you should pickdtype
to match the type of the data you're hashing.resize_pow2::Bool
(default:false
): affects the way in which the returnedLSHFunctions.SimHash
resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true
.
Examples
Construct a hash function by calling SimHash
with the number of hash functions you want to generate:
julia> hashfn = SimHash(24);
julia> n_hashes(hashfn) == 24 &&
similarity(hashfn) == cossim
true
You can then call hashfn(x)
in order to compute hashes:
julia> hashfn = SimHash(32);
julia> x = randn(30);
julia> hashes = hashfn(x);
References
- Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, page 380–388, New York, NY, USA, 2002. Association for Computing Machinery. 10.1145/509907.509965.
See also: cossim
LSHFunctions.MinHash
— TypeMinHash(n_hashes::Integer = 1;
dtype::Type = Any,
symbols::Union{Vector,Set} = Set())
Construct a locality-sensitive hash function for Jaccard similarity.
Arguments
n_hashes::Integer
(default:1
): the number of hash functions to generate.
Keyword parameters
dtype::Type
(default:Any
): the type of symbols in the sets you're hashing. This is overriden by the data type contained insymbols
whensymbols
is non-empty.symbols::Union{Vector,Set}
: aVector
orSet
containing all of the possible elements ("symbols") of the sets that you will be hashing. If left empty,MinHash
will instead expand its dictionary when it sees new symbols (at small additional computational expense).
Examples
Construct a hash function to hash sets whose elements are integers between 1
and 50
:
julia> hashfn = MinHash(40; symbols = Set(1:50));
julia> n_hashes(hashfn) == 40 && similarity(hashfn) == jaccard
true
julia> hashfn(Set([38, 14, 29, 48, 11]));
julia> hashfn([1, 1, 2, 3, 4]); # You can also hash Vectors
julia> hashfn(Set([100]))
ERROR: Symbol 100 not found
If you aren't sure ahead of time exactly what kinds of elements will be in the sets you're hashing, you can opt not to specify symbols
, in which case MinHash
will lazily update its hash functions as it encounters new symbols:
julia> hashfn = MinHash();
julia> hashfn(Set([1, 2, 3]));
julia> hashfn(Set(["a", "b", "c"]));
If you don't know what elements you'll encounter, but you know that they'll all be of a specific data type, you can specify the dtype
argument for increased efficiency:
julia> hashfn = MinHash(10; dtype = String);
julia> hashfn(Set(["a", "b", "c"]));
References
- Broder, A. On the resemblance and containment of documents. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.
See also: jaccard
LSHFunctions.L1Hash
— TypeL1Hash(
n_hashes::Integer = 1;
dtype::Type = Float32,
r::Real = 1.0,
resize_pow2::Bool = false
)
Constructs a locality-sensitive hash for $\ell^1$ distance ($\|x - y\|_1$), defined as
$\|x - y\|_1 = \sum_i |x_i - y_i|$
Arguments
n_hashes::Integer
(default:1
): the number of hash functions to generate.
Keyword parameters
dtype::Type
(default:Float32
): the data type to use in the LSHFunctions.L1Hash internals. For performance reasons you should pickdtype
to match the type of the data you're hashing.r::Real
(default:1.0
): a positive coefficient whose magnitude influences the collision rate. Larger values ofr
will increase the collision rate, even for distant points. See references for more information.resize_pow2::Bool
(default:false
): affects the way in which the returnedLSHFunctions.L1Hash
resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true
.
Examples
Construct an L1Hash
with the number of hash functions you want to generate:
julia> hashfn = L1Hash(128);
julia> hashfn.power == 1 &&
n_hashes(hashfn) == 128 &&
similarity(hashfn) == ℓ1
true
After creating a hash function, you can compute hashes with hashfn(x)
:
julia> hashfn = L1Hash(20);
julia> x = rand(4);
julia> hashes = hashfn(x);
References
- Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.
See also: ℓ1
LSHFunctions.L2Hash
— TypeL2Hash(
n_hashes::Integer = 1;
dtype::Type = Float32,
r::Real = 1.0,
resize_pow2::Bool = false
)
Constructs a locality-sensitive hash for $\ell^2$ distance ($\|x - y\|_2$), defined as
$\|x - y\|_2 = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$
Arguments
n_hashes::Integer
(default:1
): the number of hash functions to generate.
Keyword parameters
dtype::Type
(default:Float32
): the data type to use in the LSHFunctions.L2Hash internals. For performance reasons you should pickdtype
to match the type of the data you're hashing.r::Real
(default:1.0
): a positive coefficient whose magnitude influences the collision rate. Larger values ofr
will increase the collision rate, even for distant points. See references for more information.resize_pow2::Bool
(default:false
): affects the way in which the returnedLSHFunctions.L2Hash
resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true
.
Examples
Construct an L2Hash
with the number of hash functions you want to generate:
julia> hashfn = L2Hash(128);
julia> hashfn.power == 2 &&
n_hashes(hashfn) == 128 &&
similarity(hashfn) == ℓ2
true
After creating a hash function, you can compute hashes with hashfn(x)
:
julia> hashfn = L2Hash(20);
julia> x = rand(4);
julia> hashes = hashfn(x);
References
- Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.
See also: ℓ2
LSHFunctions.SignALSH
— TypeSignALSH(n_hashes::Integer = 1,
dtype::Type = Float32,
maxnorm::Union{Nothing,Real} = nothing,
m::Integer = 3,
resize_pow2::Bool = false)
Create a SignALSH
hash function for hashing on inner product similarity.
Arguments
n_hashes::Integer
(default:1
): the number of hash functions to generate.
Keyword parameters
dtype::Type
(default:Float32
): the data type to use in the LSHFunctions.SignALSH internals. For performance reasons you should pickdtype
to match the type of the data you're hashing.maxnorm::Union{Nothing,Real}
(default:nothing
): an upper bound on the $\ell^2$-norm of the data points.
The maxnorm
keyword parameter must be explicitly specified. If it is left unspecified (or set to nothing
), SignALSH()
will raise an error.
m::Integer
(default:3
): parameterm
that affects the probability of a hash collision.resize_pow2::Bool
(default:false
): affects the way in which the returnedLSHFunctions.SignALSH
resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to setresize_pow2 = true
.
Examples
SignALSH
is an AsymmetricLSHFunction
, and hence hashes must be computed using index_hash
and query_hash
.
julia> hashfn = SignALSH(12; maxnorm=10);
julia> x = rand(4);
julia> ih = index_hash(hashfn, x); qh = query_hash(hashfn, x);
julia> length(ih) == length(qh) == 12
true
julia> typeof(ih) == typeof(qh) == BitArray{1}
true
You need to explicitly specify the maxnorm
keyword parameter when constructing SignALSH
, otherwise you will get an error.
julia> hashfn = SignALSH(12)
ERROR: maxnorm must be specified for SignALSH
You'll also get an error if you try to hash a vector that has norm greater than the maxnorm
that you specified.
julia> hashfn = SignALSH(; maxnorm=1);
julia> index_hash(hashfn, ones(4))
ERROR: norm 2.0 exceeds maxnorm (1.0)
References
- Anshumali Shrivastava and Ping Li. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, page 812–821, Arlington, Virginia, USA, 2015. AUAI Press. 10.5555/3020847.3020931. arXiv:1405.5869
See also: inner_prod
, ℓ2_norm
Similarity functions
LSHFunctions.L1
— MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)
Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p)
, ℓ1(x,y)
, and ℓ2(x,y)
, respectively.
See also: ℓp
Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)
Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true
LSHFunctions.L1
— MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)
Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p)
, ℓ1(x,y)
, and ℓ2(x,y)
, respectively.
See also: ℓp
LSHFunctions.L1_norm
— MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p)
, ℓ1_norm(x)
, and ℓ2_norm(x)
, respectively.
See also: ℓp_norm
LSHFunctions.L1_norm
— MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p)
, ℓ1_norm(x)
, and ℓ2_norm(x)
, respectively.
See also: ℓp_norm
Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)
Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval)
is the same as Lp_norm(f, interval, 1)
, and L2_norm(f, interval)
is the same as Lp_norm(f, interval, 2)
.
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
LSHFunctions.L2
— MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)
Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p)
, ℓ1(x,y)
, and ℓ2(x,y)
, respectively.
See also: ℓp
Lp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)
Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true
LSHFunctions.L2
— MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)
Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p)
, ℓ1(x,y)
, and ℓ2(x,y)
, respectively.
See also: ℓp
LSHFunctions.L2_norm
— MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p)
, ℓ1_norm(x)
, and ℓ2_norm(x)
, respectively.
See also: ℓp_norm
LSHFunctions.L2_norm
— MethodLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p)
, ℓ1_norm(x)
, and ℓ2_norm(x)
, respectively.
See also: ℓp_norm
Lp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)
Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval)
is the same as Lp_norm(f, interval, 1)
, and L2_norm(f, interval)
is the same as Lp_norm(f, interval, 2)
.
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
LSHFunctions.Lp
— FunctionLp(f, g, interval::LSHFunctions.RealInterval, p)
L1(f, g, interval::LSHFunctions.RealInterval)
L2(f, g, interval::LSHFunctions.RealInterval)
Computes the $L^p$ distance between two functions, given by
$L^p(f,g) \coloneqq \|f - g\|_p = \left(\int_a^b \left|f(x) - g(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
Examples
Below we compute the $L^1$, $L^2$, and $L^3$ distances between $f(x) = x^2 + 1$ and $g(x) = 2x$ over the interval $[0,1]$. The distances are computed by evaluating the integral
$\left(\int_0^1 \left|f(x) - g(x)\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 \left|x^2 - 2x + 1\right|^p \hspace{0.15cm}dx\right)^{1/p} = \left(\int_0^1 (x - 1)^{2p} \hspace{0.15cm}dx\right)^{1/p}$
for $p = 1$, $p = 2$, and $p = 3$.
julia> f(x) = x^2 + 1; g(x) = 2x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp(f, g, interval, 1) ≈ L1(f, g, interval) ≈ 3^(-1)
true
julia> Lp(f, g, interval, 2) ≈ L2(f, g, interval) ≈ 5^(-1/2)
true
julia> Lp(f, g, interval, 3) ≈ 7^(-1/3)
true
LSHFunctions.Lp
— MethodLp(x::AbstractVector, y::AbstractVector, p::Real=2)
L1(x::AbstractVector, y::AbstractVector)
L2(x::AbstractVector, y::AbstractVector)
Computes the $ℓ^p$ distance between a pair of vectors $x$ and $y$. Identical to ℓp(x,y,p)
, ℓ1(x,y)
, and ℓ2(x,y)
, respectively.
See also: ℓp
LSHFunctions.Lp_norm
— FunctionLp_norm(f, interval::LSHFunctions.RealInterval, p::Real=2)
L1_norm(f, interval::LSHFunctions.RealInterval)
L2_norm(f, interval::LSHFunctions.RealInterval)
Computes the $L^p$ function-space norm of a function $f$, which is given by the equation
$\|f\|_p = \left(\int_a^b \left|f(x)\right|^p \hspace{0.15cm} dx\right)^{1/p}$
L1_norm(f, interval)
is the same as Lp_norm(f, interval, 1)
, and L2_norm(f, interval)
is the same as Lp_norm(f, interval, 2)
.
Examples
julia> f(x) = x;
julia> interval = @interval(0 ≤ x ≤ 1);
julia> Lp_norm(f, interval, 1) ≈ L1_norm(f, interval) ≈ 2^(-1/1)
true
julia> Lp_norm(f, interval, 2) ≈ L2_norm(f, interval) ≈ 3^(-1/2)
true
julia> Lp_norm(f, interval, 3) ≈ 4^(-1/3)
true
LSHFunctions.Lp_norm
— FunctionLp_norm(x::AbstractVector, p::Real = 2)
L1_norm(x::AbstractVector)
L2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a vector $x$. Identical to ℓp_norm(x,p)
, ℓ1_norm(x)
, and ℓ2_norm(x)
, respectively.
See also: ℓp_norm
LSHFunctions.cossim
— Methodcossim(x,y)
Computes the cosine similarity between two inputs $x$ and $y$. Cosine similarity is defined as
$\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|}$
where $\left\langle\cdot,\cdot\right\rangle$ is an inner product (e.g. dot product) and $\|\cdot\|$ is its derived norm. This is roughly interpreted as being related to the angle between the inputs $x$ and $y$: when $x$ and $y$ have low angle between them, cossim(x,y)
is high (close to $1$). When $x$ and $y$ have large angle between them, cossim(x,y)
is low (close to $-1$).
Arguments
x
andy
: two inputs for whichdot(x,y)
,norm(x)
, andnorm(y)
are defined.
Examples
julia> using LinearAlgebra: dot, norm;
julia> x, y = rand(4), rand(4);
julia> cossim(x,y) == dot(x,y) / (norm(x) * norm(y))
true
julia> z = rand(5);
julia> cossim(x,z)
ERROR: DimensionMismatch("dot product arguments have lengths 4 and 5")
See also: SimHash
LSHFunctions.inner_prod
— Methodinner_prod(x::AbstractVector, y::AbstractVector)
Computes the $\ell^2$ inner product (dot product)
$\left\langle x, y\right\rangle = \sum_i x_iy_i$
Examples
julia> using LinearAlgebra: dot;
julia> x, y = randn(4), randn(4);
julia> inner_prod(x,y) ≈ dot(x,y)
true
LSHFunctions.inner_prod
— Methodinner_prod(f, g, interval::LSHFunctions.RealInterval)
Computes the $L^2$ inner product
$\left\langle f, g\right\rangle = \int_a^b f(x)g(x) \hspace{0.15cm} dx$
where the interval we're integrating over is specified by the interval
argument.
Examples
julia> f(x) = cos(x); g(x) = sin(x);
julia> inner_prod(f, g, @interval(0 ≤ x ≤ π/2)) ≈ 1/2
true
LSHFunctions.jaccard
— Methodfunction jaccard(x::BitArray{1}, y::BitArray{1})
Computes the Jaccard similarity between a pair of binary vectors:
$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$
Arguments
x::BitArray{1}
,y::BitArray{1}
: two binary vectors, in the form ofBitArray
s.
Examples
julia> x = BitArray([true, false, true, true, false]);
julia> y = BitArray([false, false, true, true, true]);
julia> jaccard(x,y)
0.5
LSHFunctions.jaccard
— Methodjaccard(A::Set, B::Set) :: Float64
Computes the Jaccard similarity between sets $A$ and $B$, which is defined as
$\text{Jaccard}(A,B) = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}$
Arguments
A::Set
,B::Set
: two sets whose Jaccard similarity we would like to compute.
Examples
julia> A, B = Set([1, 2, 3]), Set([2, 3, 4]);
julia> jaccard(A,B)
0.5
julia> jaccard(A,B) == length(A ∩ B) / length(A ∪ B)
true
See also: MinHash
LSHFunctions.jaccard
— Methodfunction jaccard(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
Computes the Jaccard similarity between a pair of vectors of real numbers:
$J(x, y) = \frac{\sum_{i} \min{(x_i,y_i)}}{\sum_{i} \max{(x_i,y_i)}}$
Arguments
x::AbstractVector{<:Real}
,y::AbstractVector{<:Real}
: a pair of vectors containing real numbers (subtypes ofReal
).
Examples
julia> x = [0.8, 0.1, 0.3, 0.4, 0.1];
julia> y = [1.0, 0.6, 0.0, 0.4, 0.5];
julia> jaccard(x,y)
0.5
LSHFunctions.jaccard
— Methodfunction jaccard(A::Set{<:K},
B::Set{<:K},
weights::Dict{K,V}) where {K,V<:Number}
Computes the weighted Jaccard similarity between two sets:
$J(x, y) = \frac{\sum_{x\in A\cap B} w_x}{\sum_{y\in A\cup B} w_y}$
Arguments
A::Set
,B::Set
: two sets whose Jaccard similarity we would like to compute.weights::Dict
: a dictionary mapping symbols in the setsA
andB
to numerical weights. These weights must be positive.
Examples
julia> A = Set(["a", "b", "c"]);
julia> B = Set(["b", "c", "d"]);
julia> W = Dict("a" => 0.2, "b" => 2.4, "c" => 0.6, "d" => 1.8);
julia> jaccard(A,B,W)
0.6
LSHFunctions.ℓ1
— Methodℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)
Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y)
is the same as ℓp(x,y,1)
, and ℓ2(x,y)
is the same as ℓp(x,y,2)
.
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true
LSHFunctions.ℓ1_norm
— Methodℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true
LSHFunctions.ℓ2
— Methodℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)
Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y)
is the same as ℓp(x,y,1)
, and ℓ2(x,y)
is the same as ℓp(x,y,2)
.
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true
LSHFunctions.ℓ2_norm
— Methodℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true
LSHFunctions.ℓp
— Functionℓp(x::AbstractVector, y::AbstractVector, p::Real=2)
ℓ1(x::AbstractVector, y::AbstractVector)
ℓ2(x::AbstractVector, y::AbstractVector)
Computes the $\ell^p$ distance between a pair of vectors, given by
$\ell^p(x,y) \coloneqq \|x - y\|_p = \left(\sum_i \left|x_i - y_i\right|^p\right)^{1/p}$
ℓ1(x,y)
is the same as ℓp(x,y,1)
, and ℓ2(x,y)
is the same as ℓp(x,y,2)
.
Examples
julia> x = [1, 2, 3];
julia> y = [4, 5, 6];
julia> ℓp(x,y,2) == (abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)^(1/2)
true
julia> ℓp(x,y,3) == (abs(1-4)^3 + abs(2-5)^3 + abs(3-6)^3)^(1/3)
true
LSHFunctions.ℓp_norm
— Functionℓp_norm(x::AbstractVector, p::Real = 2)
ℓ1_norm(x::AbstractVector)
ℓ2_norm(x::AbstractVector)
Compute the $\ell^p$ norm of a point $x$, defined as
$\|x\|_p = \left(\sum_i \left|x_i\right|^p\right)^{1/p}$
Examples
julia> x = randn(4);
julia> ℓp_norm(x, 1) ≈ ℓ1_norm(x) ≈ (map(u -> abs(u)^1, x) |> sum)^(1/1)
true
julia> ℓp_norm(x, 2) ≈ ℓ2_norm(x) ≈ (map(u -> abs(u)^2, x) |> sum)^(1/2)
true
julia> ℓp_norm(x, 3) ≈ (map(u -> abs(u)^3, x) |> sum)^(1/3)
true
Hashing in function spaces
LSHFunctions.MonteCarloHash
— TypeMonteCarloHash(sim, ω, args...; volume=1.0, n_samples=1024, kws...)
Samples a hash function from an LSH family for the similarity sim
defined over the function space $L^p_{\mu}(\Omega)$. sim
may be one of the following:
L1
L2
cossim
Given an input function $f\in L^p_{\mu}(\Omega)$, MonteCarloHash
works by sampling $f$ at some randomly-selected points in $\Omega$, and then hashing those samples.
Arguments
sim
: the similarity statistic you want to hash on.ω
: a function that takes no inputs and samples a single point from $\Omega$. Alternatively, it can be viewed as a random variable with probability measure
\[\frac{\mu}{\text{vol}_{\mu}(\Omega)} = \frac{\mu}{\int_{\Omega} d\mu}\]
args...
: arguments to pass on when building theLSHFunction
instance underlying the returnedMonteCarloHash
struct.volume::Real
(default:1.0
): the volume of the space $\Omega$, defined as
\[\text{vol}_{\mu}(\Omega) = \int_{\Omega} d\mu\]
n_samples::Integer
(default:1024
): the number of points to sample from each function that is hashed by theMonteCarloHash
. Larger values ofn_samples
tend to capture the input function better and will thus be more likely to achieve desirable collision probabilities.kws...
: keyword arguments to pass on when building theLSHFunction
instance underlying the returnedMonteCarloHash
struct.
Examples
Create a hash function for cosine similarity for functions in $L^2([-1,1])$:
julia> μ() = 2*rand()-1; # μ samples a random point from [-1,1]
julia> hashfn = MonteCarloHash(cossim, μ, 50; volume=2.0);
julia> n_hashes(hashfn)
50
julia> similarity(hashfn) == cossim
true
julia> hashtype(hashfn)
Bool
Create a hash function for $L^2$ distance in the function space $L^2([0,2\pi])$. Hash the functions f(x) = cos(x)
and f(x) = x/(2π)
using the returned MonteCarloHash
.
julia> μ() = 2π * rand(); # μ samples a random point from [0,2π]
julia> hashfn = MonteCarloHash(L2, μ, 3; volume=2π);
julia> hashfn(cos)
3-element Array{Int32,1}:
-1
3
0
julia> hashfn(x -> x/(2π))
3-element Array{Int32,1}:
-1
-2
-1
Create a hash function with a different number of sample points.
julia> μ() = rand(); # Samples a random point from [0,1]
julia> hashfn = MonteCarloHash(cossim, μ; volume=1.0, n_samples=512);
julia> length(hashfn.sample_points)
512
Miscellaneous
LSHFunctions.@interval
— Macro@interval(expr)
Construct a new LSHFunctions.RealInterval
representing an interval on the real line from an expression such as
0 ≤ x < 1
The returned expression constructs an LSHFunctions.RealInterval
encoding the lower and upper bounds on the interval, as well as whether the ends are opened or closed.
Examples
You can construct an interval using the following syntax:
julia> interval = @interval(0 ≤ x < 1);
There are usually multiple ways of constructing the same interval. For instance, each of the expressions below are equivalent ways of constructing the interval [-1,1]
.
julia> @interval(-1 ≤ x ≤ 1) ==
@interval(-1 <= x <= 1) ==
@interval(-1 ≤ y ≤ 1) ==
@interval( 1 ≥ x ≥ -1)
true
You can even create intervals with Inf
or -Inf
at the endpoints, e.g. @interval(-Inf < x < Inf)
.
There are two primary operations you can run on an interval: testing for membership and intersection. You can test whether or not x
is in an interval using x ∈ interval
, as shown below.
julia> interval = @interval(0 ≤ x < 1);
julia> 0 ∈ interval && 1 ∉ interval
true
julia> 0 in interval # This syntax also works
true
You can also intersect two intervals using the ∩
operator (or by using intersect(interval_1, interval_2)
).
julia> @interval(0 ≤ x < 1) ∩ @interval(1/2 < x ≤ 1) == @interval(1/2 < x < 1)
true
See also: RealInterval
Private interface
LSHFunctions.RealInterval
— Typestruct RealInterval{T<:Real}
Encodes an interval of the real line, such as [-1,1]
or [0,Inf)
.
Fields
lower::T
: lower bound on the interval.upper::T
: upper bound on the interval.closed_below::Bool
: whether or not the interval is closed below.closed_above::Bool
: whether or not the interval is closed above.
Examples
The following snippet constructs RealInterval
represeting the interval [0,1)
julia> interval = LSHFunctions.RealInterval(0, 1, true, false);
It's generally easier to construct an interval using the @interval
macro. Check out the documentation for @interval
for more information.
See also: @interval
Base.isempty
— Methodisempty(interval::RealInterval)
Returns true
if interval
is empty (i.e. there doesn't exist an x
for which x ∈ interval
is true
), and false
otherwise.
LSHFunctions.width
— Methodwidth(interval::RealInterval)
Return the width of a RealInterval
(i.e. the difference between its upper and lower bounds.
LSHFunctions.@register_similarity!
— Macroregister_similarity!(similarity, hashfn)
Register hashfn
to the LSH
module as the default locality-sensitive hash function to use for similarity function similarity
. This makes it possible to construct a new hash function for similarity
with LSHFunction(similarity, args...; kws...)
.
Arguments
similarity
: the similarity function to register.hashfn
: the default locality-sensitive hash function thatsimilarity
should be associated with.
Examples
Create a custom implementation of cosine similarity called my_cossim
, and associate it with SimHash
:
julia> using LinearAlgebra: dot, norm
julia> my_cossim(x,y) = dot(x,y) / (norm(x) * norm(y));
julia> hashfn = LSHFunction(my_cossim);
ERROR: MethodError: no method matching LSHFunction(::typeof(my_cossim))
julia> LSHFunctions.@register_similarity!(my_cossim, SimHash);
julia> hashfn = LSHFunction(my_cossim);
julia> isa(hashfn, SimHash)
true