disfun

package module

v0.0.0-...-62faa31 Latest Latest Go to latest Published: Sep 12, 2016 License: MIT Imports: 8 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/jbowles/disfun

Links

Open Source Insights

README ¶

DisFun

Inspired by the Ruby Gem Distance-Measures and Measurable and my need to have available a number of distance functions for things like k-Nearest-Neighbor, k-Means, string metrics, and the like....

Algos

I'll add algorithm implementations as they arise. Some imlementations have been forked from existing projects where I have made modifcations to suite my purposes (see external_licenses).

Test Coverage

coverage: 87.3% of statements

Run benchmarks

go test -bench .

## or

go test -bench=.

## or for levenshtien

go test -bench Lev

## or for euclidean
go test -bench Euc

## or use gobenchui
gobenchui .

Documentation ¶

Overview ¶

Package disfun implements various distance functions.

Index ¶

Constants
func BrayCurtis(x, y *mat64.Dense) (result float64)
func DamerauLevenshtein(s1, s2 string) (distance int)
func DoubleMetaphone(strInput string, maxLength int) (string, string)
func Hamming(s1 string, s2 string) (distance int, err error)
func Haversine(onePhi, oneLambda, twoPhi, twoLambda float64) (c float64)
func HaversineLatLon(lat1, lon1, lat2, lon2 float64) float64
func HaversinePoint(pointA1, pointA2, pointB1, pointB2, radius float64) float64
func Jaro(r1 string, r2 string) (distance float64)
func JaroWinkler(r1 string, r2 string, longTolerance bool) (distance float64)
func Lev(s1, s2 string) int
func LevEditDistance(s1, s2 string) (distance int)
func Manhattan(x, y *mat64.Dense) (result float64)
func NYSIIS(s1 string) string
func OSA(s1 string, s2 string) (distance int)
func Phonex(s1 string) string
func SmithWaterman(s1 string, s2 string) float64
func Soundex(s1 string) string
type Euclidean
- func NewEuclidean() *Euclidean
- func (e *Euclidean) Distance(u, v *mat64.Dense) float64
- func (e *Euclidean) InnerProduct(u, v *mat64.Dense) (result float64)
type Leven
- func NewLeven(s1, s2 string) *Leven
- func (l *Leven) Similarity() float64
type Levenshtein
- func NewLevenshtein(source, target string) *Levenshtein
- func (l *Levenshtein) ComputeMatrix() *mat64.Dense
- func (l *Levenshtein) Similarity() float64
type Ngram
- func NewNgram(n int, s1, s2 string) *Ngram
- func (n *Ngram) Build()
- func (n *Ngram) JaccardCoEfficient() float64
- func (n *Ngram) Similarity() float64
type String
- func NewString(contents string) *String
- func (s *String) At(i int) int
- func (s *String) Init(contents string) *String
- func (s *String) IsASCII() bool
- func (s *String) RuneCount() int
- func (s *String) Slice(i, j int) string
- func (s *String) String() string

Constants ¶

View Source

const (
	EarthRadius  = float64(6371)
	Substitution = float64(1)
	Insertion    = float64(1)
	Deletion     = float64(1)
	GapCost      = float64(0.5)
)

Variables ¶

This section is empty.

Functions ¶

func BrayCurtis ¶

func BrayCurtis(x, y *mat64.Dense) (result float64)

BrayCurtis finds the distance by the ratio of the absolute sum of all differences of points in vectors [u,v]:

Example:
 dist((u, v) = dist(v, u)) =
	|v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n| / |v_1 + u_1| + |v_2 + u_2| + ... + |v_n + u_n|

References:
http://mathworld.wolfram.com/TaxicabMetric.html
http://demonstrations.wolfram.com/TaxicabGeometry/

func DamerauLevenshtein ¶

func DamerauLevenshtein(s1, s2 string) (distance int)

DamerauLevenshtein computes the Damerau-Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to the Optimal String Alignment, algorithm, but is more complex because it allows multiple edits on substrings.

References:
This implementation is based off of the one found on Wikipedia at
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
as well as KevinStern's Java implementation found at
https://github.com/KevinStern/software-and-algorithms.

func DoubleMetaphone ¶

func DoubleMetaphone(strInput string, maxLength int) (string, string)

DoubleMetaphone computes the Double-Metaphone value of the input string. This value is a phonetic representation of how the string sounds, with affordances for many different language dialects. It was originally developed by Lawrence Phillips in the 1990s.

More information about this algorithm can be found on Wikipedia at http://en.wikipedia.org/wiki/Metaphone.

func Hamming ¶

func Hamming(s1 string, s2 string) (distance int, err error)

Hamming computes the Hamming distance between two equal-length strings. This is the number of times the two strings differ between characters at the same index. This implementation is based off of the algorithm description found at http://en.wikipedia.org/wiki/Hamming_distance.

func Haversine ¶

func Haversine(onePhi, oneLambda, twoPhi, twoLambda float64) (c float64)

Haversine finds the shortest "as-the-crow-flies" distance between 2 points (φ,λ) on a sphere R... accounting for curvuture. In Navigation terms, for the example below, φ (phi) is latitude, λ (lambda) is longitude, R(adius) is earth’s radius (mean radius = 6,371km).

	Example:
	  a = sin²(Δφ/2) + cos φ1 ⋅ cos φ2 ⋅ sin²(Δλ/2)
	  c = 2 ⋅ atan2( √a, √(1−a) )
	  d = R ⋅ c

	References:
 https://github.com/kellydunn/golang-geo.
 http://www.codecodex.com/wiki/Calculate_Distance_Between_Two_Points_on_a_Globe
 http://www.movable-type.co.uk/scripts/latlong.html
 https://github.com/reddavis/Distance-Measures/blob/master/lib/distance_measures/haversine.rb

func HaversineLatLon ¶

func HaversineLatLon(lat1, lon1, lat2, lon2 float64) float64

HaversineLatLon accepts 2 sets of lat/lon, it multiplies the EARTH_RADIUS by the result of the haversine function.

func HaversinePoint ¶

func HaversinePoint(pointA1, pointA2, pointB1, pointB2, radius float64) float64

HaversinePoint accepts 2 sets of points with a radius, it multiplies radius by the result of the haversine function.

func Jaro ¶

func Jaro(r1 string, r2 string) (distance float64)

Jaro computes the Jaro edit distance between two strings. It represents this with a float64 between 0 and 1 inclusive, with 0 indicating the two strings are not at all similar and 1 indicating the two strings are exact matches.

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for a full description.

func JaroWinkler ¶

func JaroWinkler(r1 string, r2 string, longTolerance bool) (distance float64)

JaroWinkler computes the Jaro-Winkler edit distance between two strings. This is a modification of the Jaro algorithm that gives additional weight to prefix matches.

func Lev ¶

func Lev(s1, s2 string) int

Lev (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func LevEditDistance ¶

func LevEditDistance(s1, s2 string) (distance int)

LevEditDistance computes the Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, and substitutions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point.

//Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

func Manhattan ¶

func Manhattan(x, y *mat64.Dense) (result float64)

Manhattan (also called TaxiCab) finds the distance by the absolute sum of all differences of points in vectors [u,v]:

Example:
 dist((u, v) = dist(v, u)) = |v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n|

References:
http://mathworld.wolfram.com/TaxicabMetric.html
http://demonstrations.wolfram.com/TaxicabGeometry/

func NYSIIS ¶

func NYSIIS(s1 string) string

NYSIIS computes the NYSIIS phonetic encoding of the input string. It is a modification of the traditional Soundex algorithm.

func OSA ¶

func OSA(s1 string, s2 string) (distance int)

OSA computes the Optimal String Alignment distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to Damerau-Levenshtein, but is simpler because it does not allow multiple edits on any substring.

func Phonex ¶

func Phonex(s1 string) string

Phonex computes the Phonex phonetic encoding of the input string. Phonex is a modification of the venerable Soundex algorithm. It accounts for a few more letter combinations to improve accuracy on some data sets.

This implementation is based off of the original C implementation by the creator - A. J. Lait - as found in his research paper entitled "An Assessment of Name Matching Algorithms."

func SmithWaterman ¶

func SmithWaterman(s1 string, s2 string) float64

SmithWaterman computes the Smith-Waterman local sequence alignment for the two input strings. This was originally designed to find similar regions in strings representing DNA or protein sequences.

func Soundex ¶

func Soundex(s1 string) string

Soundex computes the Soundex phonetic representation of the input string. It attempts to encode homophones with the same characters. More information can be found at http://en.wikipedia.org/wiki/Soundex.

Types ¶

type Euclidean ¶

type Euclidean struct{}

Euclidean finds the "ordinary" distance between vectors [u,v] given by the Pythagorean formula (sum of squares of all points):

Example:
  dist((u, v) = dist(v, u)) = √(v_1 - u_1)² + (v_2 - u_2)² + ... + (v_n - u_n)²

References:
http://reference.wolfram.com/language/ref/EuclideanDistance.html

func NewEuclidean ¶

func NewEuclidean() *Euclidean

NewEuclidean initializes the Euclidean struct.

func (*Euclidean) Distance ¶

func (e *Euclidean) Distance(u, v *mat64.Dense) float64

Distance finds the Euclidean distance.

func (*Euclidean) InnerProduct ¶

func (e *Euclidean) InnerProduct(u, v *mat64.Dense) (result float64)

InnerProduct computes a Eucledian inner product.

type Leven ¶

type Leven struct {
	S1         string
	S2         string
	Lens1      int
	Lens2      int
	Width      int
	VectorCell []int
}

Leven (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func NewLeven ¶

func NewLeven(s1, s2 string) *Leven

func (*Leven) Similarity ¶

func (l *Leven) Similarity() float64

type Levenshtein ¶

type Levenshtein struct {
	Source       []rune
	Target       []rune
	SourceString string
	TargeString  string
	RowHeight    int
	ColWidth     int
	M            *mat64.Dense
}

Levenshtein (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Row     == Height
Column  == Width

Demo here: http://andrew.hedges.name/experiments/levenshtein/

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func NewLevenshtein ¶

func NewLevenshtein(source, target string) *Levenshtein

func (*Levenshtein) ComputeMatrix ¶

func (l *Levenshtein) ComputeMatrix() *mat64.Dense

func (*Levenshtein) Similarity ¶

func (l *Levenshtein) Similarity() float64

type Ngram ¶

type Ngram struct {
	Set1 set.Strings
	Set2 set.Strings
	S1   string
	S2   string
	N    int
}

Ngram is continuous sequence of n-items from a given sequence. The distance is the relative number of items between these two sequences.

References:

https://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf
https://en.wikipedia.org/wiki/N-gram
http://m.wolframalpha.com/input/?i=n-grams+%22n-gram+example+of+n-grams+in+wolfram+alpha%22&x=0&y=0

func NewNgram ¶

func NewNgram(n int, s1, s2 string) *Ngram

NewNgram initializes and creates data structures for the Ngram struct, which you can then call Similarity() on.

func (*Ngram) Build ¶

func (n *Ngram) Build()

Build creates Set1 and Set2 of the Ngram.

func (*Ngram) JaccardCoEfficient ¶

func (n *Ngram) JaccardCoEfficient() float64

JaccardCoEfficient calculates the similarity of two sets as the intersection divided by the union of the two sets.

func (*Ngram) Similarity ¶

func (n *Ngram) Similarity() float64

Similarity bulds the two ngram sets and returns their similarity.

type String ¶

type String struct {
	// contains filtered or unexported fields
}

String wraps a regular string with a small structure that provides more efficient indexing by code point index, as opposed to byte index. Scanning incrementally forwards or backwards is O(1) per index operation (although not as fast a range clause going forwards). Random access is O(N) in the length of the string, but the overhead is less than always scanning from the beginning. If the string is ASCII, random access is O(1). Unlike the built-in string type, String has internal mutable state and is not thread-safe.

func NewString ¶

func NewString(contents string) *String

NewString returns a new UTF-8 string with the provided contents.

func (*String) At ¶

func (s *String) At(i int) int

At returns the rune with index i in the String. The sequence of runes is the same as iterating over the contents with a "for range" clause.

func (*String) Init ¶

func (s *String) Init(contents string) *String

Init initializes an existing String to hold the provided contents. It returns a pointer to the initialized String.

func (*String) IsASCII ¶

func (s *String) IsASCII() bool

IsASCII returns a boolean indicating whether the String contains only ASCII bytes.

func (*String) RuneCount ¶

func (s *String) RuneCount() int

RuneCount returns the number of runes (Unicode code points) in the String.

func (*String) Slice ¶

func (s *String) Slice(i, j int) string

Slice returns the string sliced at rune positions [i:j].

func (*String) String ¶

func (s *String) String() string

String returns the contents of the String. This method also means the String is directly printable by fmt.Print.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

README ¶

DisFun

Algos

Test Coverage

Run benchmarks

Documentation ¶

Overview ¶

Index ¶

Constants ¶

Variables ¶

Functions ¶

func BrayCurtis ¶

func DamerauLevenshtein ¶

func DoubleMetaphone ¶

func Hamming ¶

func Haversine ¶

func HaversineLatLon ¶

func HaversinePoint ¶

func Jaro ¶

func JaroWinkler ¶

func Lev ¶

func LevEditDistance ¶

func Manhattan ¶

func NYSIIS ¶

func OSA ¶

func Phonex ¶

func SmithWaterman ¶

func Soundex ¶

Types ¶

type Euclidean ¶

func NewEuclidean ¶

func (*Euclidean) Distance ¶

func (*Euclidean) InnerProduct ¶

type Leven ¶

func NewLeven ¶

func (*Leven) Similarity ¶

type Levenshtein ¶

func NewLevenshtein ¶

func (*Levenshtein) ComputeMatrix ¶

func (*Levenshtein) Similarity ¶

type Ngram ¶

func NewNgram ¶

func (*Ngram) Build ¶

func (*Ngram) JaccardCoEfficient ¶

func (*Ngram) Similarity ¶

type String ¶

func NewString ¶

func (*String) At ¶

func (*String) Init ¶

func (*String) IsASCII ¶

func (*String) RuneCount ¶

func (*String) Slice ¶

func (*String) String ¶

Source Files ¶