Documentation
¶
Overview ¶
Package disfun implements various distance functions.
Index ¶
- Constants
- func BrayCurtis(x, y *mat64.Dense) (result float64)
- func DamerauLevenshtein(s1, s2 string) (distance int)
- func DoubleMetaphone(strInput string, maxLength int) (string, string)
- func Hamming(s1 string, s2 string) (distance int, err error)
- func Haversine(onePhi, oneLambda, twoPhi, twoLambda float64) (c float64)
- func HaversineLatLon(lat1, lon1, lat2, lon2 float64) float64
- func HaversinePoint(pointA1, pointA2, pointB1, pointB2, radius float64) float64
- func Jaro(r1 string, r2 string) (distance float64)
- func JaroWinkler(r1 string, r2 string, longTolerance bool) (distance float64)
- func Lev(s1, s2 string) int
- func LevEditDistance(s1, s2 string) (distance int)
- func Manhattan(x, y *mat64.Dense) (result float64)
- func NYSIIS(s1 string) string
- func OSA(s1 string, s2 string) (distance int)
- func Phonex(s1 string) string
- func SmithWaterman(s1 string, s2 string) float64
- func Soundex(s1 string) string
- type Euclidean
- type Leven
- type Levenshtein
- type Ngram
- type String
Constants ¶
Variables ¶
This section is empty.
Functions ¶
func BrayCurtis ¶
BrayCurtis finds the distance by the ratio of the absolute sum of all differences of points in vectors [u,v]:
Example: dist((u, v) = dist(v, u)) = |v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n| / |v_1 + u_1| + |v_2 + u_2| + ... + |v_n + u_n| References: http://mathworld.wolfram.com/TaxicabMetric.html http://demonstrations.wolfram.com/TaxicabGeometry/
func DamerauLevenshtein ¶
DamerauLevenshtein computes the Damerau-Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to the Optimal String Alignment, algorithm, but is more complex because it allows multiple edits on substrings.
References: This implementation is based off of the one found on Wikipedia at http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions as well as KevinStern's Java implementation found at https://github.com/KevinStern/software-and-algorithms.
func DoubleMetaphone ¶
DoubleMetaphone computes the Double-Metaphone value of the input string. This value is a phonetic representation of how the string sounds, with affordances for many different language dialects. It was originally developed by Lawrence Phillips in the 1990s.
More information about this algorithm can be found on Wikipedia at http://en.wikipedia.org/wiki/Metaphone.
func Hamming ¶
Hamming computes the Hamming distance between two equal-length strings. This is the number of times the two strings differ between characters at the same index. This implementation is based off of the algorithm description found at http://en.wikipedia.org/wiki/Hamming_distance.
func Haversine ¶
Haversine finds the shortest "as-the-crow-flies" distance between 2 points (φ,λ) on a sphere R... accounting for curvuture. In Navigation terms, for the example below, φ (phi) is latitude, λ (lambda) is longitude, R(adius) is earth’s radius (mean radius = 6,371km).
Example: a = sin²(Δφ/2) + cos φ1 ⋅ cos φ2 ⋅ sin²(Δλ/2) c = 2 ⋅ atan2( √a, √(1−a) ) d = R ⋅ c References: https://github.com/kellydunn/golang-geo. http://www.codecodex.com/wiki/Calculate_Distance_Between_Two_Points_on_a_Globe http://www.movable-type.co.uk/scripts/latlong.html https://github.com/reddavis/Distance-Measures/blob/master/lib/distance_measures/haversine.rb
func HaversineLatLon ¶
HaversineLatLon accepts 2 sets of lat/lon, it multiplies the EARTH_RADIUS by the result of the haversine function.
func HaversinePoint ¶
HaversinePoint accepts 2 sets of points with a radius, it multiplies radius by the result of the haversine function.
func Jaro ¶
Jaro computes the Jaro edit distance between two strings. It represents this with a float64 between 0 and 1 inclusive, with 0 indicating the two strings are not at all similar and 1 indicating the two strings are exact matches.
See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for a full description.
func JaroWinkler ¶
JaroWinkler computes the Jaro-Winkler edit distance between two strings. This is a modification of the Jaro algorithm that gives additional weight to prefix matches.
func Lev ¶
Lev (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.
Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.
Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.
func LevEditDistance ¶
LevEditDistance computes the Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, and substitutions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point.
//Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.
func Manhattan ¶
Manhattan (also called TaxiCab) finds the distance by the absolute sum of all differences of points in vectors [u,v]:
Example: dist((u, v) = dist(v, u)) = |v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n| References: http://mathworld.wolfram.com/TaxicabMetric.html http://demonstrations.wolfram.com/TaxicabGeometry/
func NYSIIS ¶
NYSIIS computes the NYSIIS phonetic encoding of the input string. It is a modification of the traditional Soundex algorithm.
func OSA ¶
OSA computes the Optimal String Alignment distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to Damerau-Levenshtein, but is simpler because it does not allow multiple edits on any substring.
func Phonex ¶
Phonex computes the Phonex phonetic encoding of the input string. Phonex is a modification of the venerable Soundex algorithm. It accounts for a few more letter combinations to improve accuracy on some data sets.
This implementation is based off of the original C implementation by the creator - A. J. Lait - as found in his research paper entitled "An Assessment of Name Matching Algorithms."
func SmithWaterman ¶
SmithWaterman computes the Smith-Waterman local sequence alignment for the two input strings. This was originally designed to find similar regions in strings representing DNA or protein sequences.
func Soundex ¶
Soundex computes the Soundex phonetic representation of the input string. It attempts to encode homophones with the same characters. More information can be found at http://en.wikipedia.org/wiki/Soundex.
Types ¶
type Euclidean ¶
type Euclidean struct{}
Euclidean finds the "ordinary" distance between vectors [u,v] given by the Pythagorean formula (sum of squares of all points):
Example: dist((u, v) = dist(v, u)) = √(v_1 - u_1)² + (v_2 - u_2)² + ... + (v_n - u_n)² References: http://reference.wolfram.com/language/ref/EuclideanDistance.html
type Leven ¶
Leven (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.
Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.
Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.
func (*Leven) Similarity ¶
type Levenshtein ¶
type Levenshtein struct { Source []rune Target []rune SourceString string TargeString string RowHeight int ColWidth int M *mat64.Dense }
Levenshtein (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.
Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.
Row == Height Column == Width Demo here: http://andrew.hedges.name/experiments/levenshtein/
Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.
func NewLevenshtein ¶
func NewLevenshtein(source, target string) *Levenshtein
func (*Levenshtein) ComputeMatrix ¶
func (l *Levenshtein) ComputeMatrix() *mat64.Dense
func (*Levenshtein) Similarity ¶
func (l *Levenshtein) Similarity() float64
type Ngram ¶
Ngram is continuous sequence of n-items from a given sequence. The distance is the relative number of items between these two sequences.
References:
https://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf https://en.wikipedia.org/wiki/N-gram http://m.wolframalpha.com/input/?i=n-grams+%22n-gram+example+of+n-grams+in+wolfram+alpha%22&x=0&y=0
func NewNgram ¶
NewNgram initializes and creates data structures for the Ngram struct, which you can then call Similarity() on.
func (*Ngram) JaccardCoEfficient ¶
JaccardCoEfficient calculates the similarity of two sets as the intersection divided by the union of the two sets.
func (*Ngram) Similarity ¶
Similarity bulds the two ngram sets and returns their similarity.
type String ¶
type String struct {
// contains filtered or unexported fields
}
String wraps a regular string with a small structure that provides more efficient indexing by code point index, as opposed to byte index. Scanning incrementally forwards or backwards is O(1) per index operation (although not as fast a range clause going forwards). Random access is O(N) in the length of the string, but the overhead is less than always scanning from the beginning. If the string is ASCII, random access is O(1). Unlike the built-in string type, String has internal mutable state and is not thread-safe.
func (*String) At ¶
At returns the rune with index i in the String. The sequence of runes is the same as iterating over the contents with a "for range" clause.
func (*String) Init ¶
Init initializes an existing String to hold the provided contents. It returns a pointer to the initialized String.
func (*String) IsASCII ¶
IsASCII returns a boolean indicating whether the String contains only ASCII bytes.
func (*String) RuneCount ¶
RuneCount returns the number of runes (Unicode code points) in the String.