Documentation ¶
Overview ¶
Package streamstats includes data structures and algorithms that are O(1) time and space in the number of elements processed.
Moment-Based Statistics ¶
Single variable moments up to fourth order and first-order covariance use the methods of:
"Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments." Philippe P. Pébay, Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.
which extend the results of:
"Note on a method for calculating corrected sums of squares and products".
B. P. Welford (1962). Technometrics 4(3):419–420
(popularized by Donald Knuth in "The Art of Computer Programming")
to arbitrary moments and to combining populations of arbitrary size, allowing parallel aggregation. These moments are also extended to two dependent variables with a covariance Sxy.
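For intuition, here is a minimal, self-contained sketch (not this package's internal code) of how two partial results can be merged without revisiting the data, using the pairwise combination formulas for the count, mean, and second moment:

// Sketch of pairwise combination of streaming moments; names are illustrative.
package main

import "fmt"

type partial struct {
	n    float64
	mean float64
	m2   float64 // sum of squared deviations from the mean
}

// add folds one observation into a partial result (Welford's update).
func (p *partial) add(x float64) {
	p.n++
	delta := x - p.mean
	p.mean += delta / p.n
	p.m2 += delta * (x - p.mean)
}

// combine merges two partial results computed on disjoint chunks of the stream.
func combine(a, b partial) partial {
	n := a.n + b.n
	delta := b.mean - a.mean
	return partial{
		n:    n,
		mean: a.mean + delta*b.n/n,
		m2:   a.m2 + b.m2 + delta*delta*a.n*b.n/n,
	}
}

func main() {
	var left, right partial
	for _, x := range []float64{1, 2, 3, 4} {
		left.add(x)
	}
	for _, x := range []float64{5, 6, 7, 8} {
		right.add(x)
	}
	whole := combine(left, right)
	fmt.Println(whole.mean, whole.m2/(whole.n-1)) // 4.5 and 6, the mean and sample variance of 1..8
}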
The package also includes an exponentially weighted moving average (EWMA) with damping factor lambda, 0 < lambda < 1, using the update formula m = (1-lambda)*m + lambda*x.
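A minimal sketch of that update rule (illustrative names, not the package's API):

// Sketch of the EWMA update m = (1-lambda)*m + lambda*x.
package main

import "fmt"

type ewmaSketch struct {
	lambda float64 // damping factor, 0 < lambda < 1
	m      float64 // current moving average
}

// update applies the EWMA recurrence for one new observation x.
func (e *ewmaSketch) update(x float64) {
	e.m = (1-e.lambda)*e.m + e.lambda*x
}

func main() {
	e := ewmaSketch{lambda: 0.1}
	for _, x := range []float64{10, 10, 10, 10} {
		e.update(x)
	}
	fmt.Println(e.m) // converges toward 10 as more identical samples arrive
}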
Order Statistics ¶
Quantiles and Histograms are based on the P2-algorithm:
"The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations." Raj Jain and Imrich Chlamtac, Communications of the ACM Volume 28 Issue 10, October 1985 Pages 1076-1085
Count Distinct ¶
Count distinct is provided by an implementation of the HyperLogLog data structure based on:
"Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" Philippe Flajolet and Éric Fusy and Olivier Gandouet and et al. in AOFA ’07: PROCEEDINGS OF THE 2007 INTERNATIONAL CONFERENCE ON ANALYSIS OF ALGORITHMS
This implementation includes some of the HyperLogLog++ enhancements, such as the 64-bit hash function, which eliminates the large-cardinality correction for hash collisions, and an empirical bias correction for small cardinalities. The implementation is space-inefficient since whole bytes are used to store the register counts, which can be at most 60 < 2^6 (6 bits per register would suffice).
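For intuition, a minimal sketch of the register update with a 64-bit hash (an assumed layout, not necessarily this package's exact code): the top p bits select a register, and the register keeps the maximum rank of the remaining bits.

// Sketch of a HyperLogLog register update; constants and layout are illustrative.
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

func main() {
	const p = 12 // 2^p registers
	registers := make([]byte, 1<<p)

	h := fnv.New64a()
	h.Write([]byte("example item"))
	x := h.Sum64()

	idx := x >> (64 - p)                        // register index from the top p bits
	rest := x << p                              // remaining 64-p bits, left aligned
	rank := byte(bits.LeadingZeros64(rest)) + 1 // position of the first 1 bit
	if rank > 64-p+1 {                          // cap: only 64-p bits remain after the index
		rank = 64 - p + 1
	}

	if rank > registers[idx] { // each register stores the maximum rank seen
		registers[idx] = rank
	}
	fmt.Println(idx, rank)
}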
An additional LinearCounting implementation backed by a BitVector is also available. If the maximum possible cardinality is known, this structure uses only 12.5% of the memory of the HyperLogLog and runs much faster for both Add and Distinct. However, the data structure saturates at that maximum value, while HyperLogLog can count virtually unlimited cardinalities.
Set Membership ¶
Approximate set membership is provided by a BloomFilter implementation based on:
"Space/time trade-offs in hash coding with allowable errors" Burton H. Bloom Communications of the ACM Volume 13 Issue 7, July 1970 Pages 422-426
The size m of the filter is rounded up to the nearest power of two for speed of addition and membership checks, which can result in a larger filter than the cardinality and false positive target you supply would strictly require. The k different hash functions are derived from the top (h1) and bottom (h2) 32 bits of a 64-bit hash function using h[i] = h1 + i*h2 mod m for i in 0...k-1, based on
"Less hashing, same performance: Building a better Bloom filter" Adam Kirsch, Michael Mitzenmacher Random Structures & Algorithms Volume 33 Issue 2, September 2008 Pages 187-218
Index ¶
- type BitVector
- type BloomFilter
- func (bf *BloomFilter) Add(item []byte)
- func (bf BloomFilter) Check(item []byte) bool
- func (bf BloomFilter) Distinct() uint64
- func (bf BloomFilter) FalsePositiveRate() float64
- func (bf BloomFilter) Intersect(bfB *BloomFilter) (*BloomFilter, error)
- func (bf BloomFilter) Occupancy() float64
- func (bf BloomFilter) Union(bfB *BloomFilter) (*BloomFilter, error)
- type BoxPlot
- func (bp BoxPlot) InterQuartileRange() float64
- func (bp BoxPlot) IsOutlier(x float64) bool
- func (bp BoxPlot) LowerQuartile() float64
- func (bp BoxPlot) LowerWhisker() float64
- func (bp BoxPlot) Median() float64
- func (bp BoxPlot) MidHinge() float64
- func (bp BoxPlot) MidRange() float64
- func (bp BoxPlot) String() string
- func (bp BoxPlot) TriMean() float64
- func (bp BoxPlot) UpperQuartile() float64
- func (bp BoxPlot) UpperWhisker() float64
- type CovarStats
- func (c *CovarStats) Add(x, y float64)
- func (c *CovarStats) Combine(b *CovarStats) CovarStats
- func (c *CovarStats) Correlation() float64
- func (c *CovarStats) Intercept() float64
- func (c *CovarStats) N() uint64
- func (c *CovarStats) Slope() float64
- func (c *CovarStats) XKurtosis() float64
- func (c *CovarStats) XMean() float64
- func (c *CovarStats) XSkewness() float64
- func (c *CovarStats) XStdDev() float64
- func (c *CovarStats) XVariance() float64
- func (c *CovarStats) YKurtosis() float64
- func (c *CovarStats) YMean() float64
- func (c *CovarStats) YSkewness() float64
- func (c *CovarStats) YStdDev() float64
- func (c *CovarStats) YVariance() float64
- type CumulativeDensity
- type EWMA
- type HyperLogLog
- func (hll *HyperLogLog) Add(item []byte)
- func (hll *HyperLogLog) BiasCorrected() uint64
- func (hll *HyperLogLog) Compress(factor byte) *HyperLogLog
- func (hll *HyperLogLog) Distinct() uint64
- func (hll *HyperLogLog) ExpectedError() float64
- func (hll *HyperLogLog) Intersect(hllB *HyperLogLog) (*HyperLogLog, error)
- func (hll *HyperLogLog) LinearCounting() uint64
- func (hll *HyperLogLog) RawEstimate() uint64
- func (hll *HyperLogLog) Reset()
- func (hll *HyperLogLog) String() string
- func (hll *HyperLogLog) Union(hllB *HyperLogLog) (*HyperLogLog, error)
- type LinearCounting
- func (lc *LinearCounting) Add(item []byte)
- func (lc *LinearCounting) Compress(factor byte) *LinearCounting
- func (lc LinearCounting) Distinct() uint64
- func (lc LinearCounting) ExpectedError() float64
- func (lc *LinearCounting) Intersect(lcB *LinearCounting) (*LinearCounting, error)
- func (lc LinearCounting) Occupancy() float64
- func (lc LinearCounting) String() string
- func (lc *LinearCounting) Union(lcB *LinearCounting) (*LinearCounting, error)
- type MomentStats
- func (m *MomentStats) Add(x float64)
- func (m *MomentStats) Combine(b *MomentStats) MomentStats
- func (m *MomentStats) Kurtosis() float64
- func (m *MomentStats) Mean() float64
- func (m *MomentStats) N() uint64
- func (m *MomentStats) Skewness() float64
- func (m *MomentStats) StdDev() float64
- func (m *MomentStats) String() string
- func (m *MomentStats) Variance() float64
- type P2Histogram
- type P2Quantile
- func (p *P2Quantile) Add(x float64)
- func (p *P2Quantile) LowerQuantile() float64
- func (p *P2Quantile) Max() float64
- func (p *P2Quantile) Min() float64
- func (p *P2Quantile) N() uint64
- func (p *P2Quantile) P() float64
- func (p *P2Quantile) Quantile() float64
- func (p *P2Quantile) UpperQuantile() float64
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BitVector ¶
type BitVector []uint64
BitVector represents an arbitrary-length vector of bits backed by 64-bit words. It is used as the data structure backing the BloomFilter and LinearCounting implementations.
func NewBitVector ¶
NewBitVector returns a new BitVector of length L
func (BitVector) Clear ¶
Clear clears the bit at position N. For N >= L this will access memory out of bounds of the backing array and panic.
func (BitVector) Get ¶
Get returns the bit at position N as a uint64. For N >= L this will access memory out of bounds of the backing array and panic.
func (BitVector) PopCount ¶
PopCount returns the number of set bits in the bit vector. The algorithm for PopCount on a single 64-bit word dates from 1957, due to Donald B. Gillies and Jeffrey C. P. Miller, and is referenced by Donald Knuth.
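For reference, a common SWAR formulation of the single-word popcount (illustrative; the standard library's math/bits.OnesCount64 computes the same thing):

// Sketch of a 64-bit SWAR popcount, checked against math/bits.OnesCount64.
package main

import (
	"fmt"
	"math/bits"
)

func popCount64(x uint64) uint64 {
	x = x - ((x >> 1) & 0x5555555555555555)                        // count within bit pairs
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333) // count within nibbles
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f                        // count within bytes
	return (x * 0x0101010101010101) >> 56                          // sum the byte counts
}

func main() {
	fmt.Println(popCount64(0xff), bits.OnesCount64(0xff)) // 8 8
}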
type BloomFilter ¶
type BloomFilter struct {
// contains filtered or unexported fields
}
BloomFilter is a data structure for approximate set membership with no false negatives and limited false positives.
func NewBloomFilter ¶
func NewBloomFilter(Nitems uint64, FalsePositiveRate float64, hash hash.Hash64) *BloomFilter
NewBloomFilter returns a pointer to a new BloomFilter sized with m buckets and k hash functions to meet the given false positive rate at the given number of items, using the given hash function.
func (*BloomFilter) Add ¶
func (bf *BloomFilter) Add(item []byte)
Add puts an item in the set represented by the BloomFilter
func (BloomFilter) Check ¶
func (bf BloomFilter) Check(item []byte) bool
Check returns false if an item is definitely not in the set represented by the BloomFilter.
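A hypothetical end-to-end usage sketch of the documented constructor and methods; the import path is a placeholder:

// Hypothetical BloomFilter usage; the import path below is a placeholder.
package main

import (
	"fmt"
	"hash/fnv"

	"example.com/streamstats" // placeholder import path
)

func main() {
	// Target roughly 1000 items at a 1% false positive rate, using FNV-1a.
	bf := streamstats.NewBloomFilter(1000, 0.01, fnv.New64a())

	bf.Add([]byte("alpha"))
	bf.Add([]byte("beta"))

	fmt.Println(bf.Check([]byte("alpha"))) // true
	fmt.Println(bf.Check([]byte("gamma"))) // almost certainly false
	fmt.Println(bf.Occupancy(), bf.FalsePositiveRate(), bf.Distinct())
}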
func (BloomFilter) Distinct ¶
func (bf BloomFilter) Distinct() uint64
Distinct estimates the number of elements in the filter using the LinearCounting estimate, accounting for the k hash functions calculated for each element.
func (BloomFilter) FalsePositiveRate ¶
func (bf BloomFilter) FalsePositiveRate() float64
FalsePositiveRate returns the expected false positive rate based on the ratio of filled buckets in the BloomFilter
func (BloomFilter) Intersect ¶
func (bf BloomFilter) Intersect(bfB *BloomFilter) (*BloomFilter, error)
Intersect combines two BloomFilters, producing one that contains only the elements in both BloomFilters. The BloomFilters must have the same size m and k and use the same hash function.
func (BloomFilter) Occupancy ¶
func (bf BloomFilter) Occupancy() float64
Occupancy returns the ratio of filled buckets in the BloomFilter
func (BloomFilter) Union ¶
func (bf BloomFilter) Union(bfB *BloomFilter) (*BloomFilter, error)
Union combines two BloomFilters, producing one that contains all of the elements in either BloomFilter. The BloomFilters must have the same size m and k and use the same hash function.
type BoxPlot ¶
type BoxPlot struct {
P2Quantile
}
BoxPlot represents a box plot with interquartile range and whiskers, backed by a P2Quantile tracking the median (p = 0.5).
func (BoxPlot) InterQuartileRange ¶
InterQuartileRange returns the estimated interquartile range
func (BoxPlot) LowerQuartile ¶
LowerQuartile returns the estimated lower quartile
func (BoxPlot) LowerWhisker ¶
LowerWhisker returns the estimated lower whisker, Q1 - 1.5 * IQR
func (BoxPlot) MidHinge ¶
MidHinge returns the midhinge of the data, the average of the upper and lower quartiles.
func (BoxPlot) UpperQuartile ¶
UpperQuartile returns the estimated upper quartile
func (BoxPlot) UpperWhisker ¶
UpperWhisker returns the estimated upper whisker, Q3 + 1.5 * IQR
type CovarStats ¶
type CovarStats struct {
// contains filtered or unexported fields
}
CovarStats is a data structure for computing statistics on two related variables, x and y, from a stream.
func NewCovarStats ¶
func NewCovarStats() *CovarStats
NewCovarStats returns an empty CovarStats structure with no values
func (*CovarStats) Add ¶
func (c *CovarStats) Add(x, y float64)
Add adds a sample of the two variables to the CovarStats data structure
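A hypothetical usage sketch fitting the exact line y = 2x + 1; the import path is a placeholder:

// Hypothetical CovarStats usage; the import path below is a placeholder.
package main

import (
	"fmt"

	"example.com/streamstats" // placeholder import path
)

func main() {
	c := streamstats.NewCovarStats()
	for x := 0.0; x < 100; x++ {
		c.Add(x, 2*x+1)
	}
	// Slope ~2, intercept ~1, correlation ~1 (up to floating point error).
	fmt.Println(c.Slope(), c.Intercept(), c.Correlation())
}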
func (*CovarStats) Combine ¶
func (c *CovarStats) Combine(b *CovarStats) CovarStats
Combine returns the combination of two CovarStats data structures.
func (*CovarStats) Correlation ¶
func (c *CovarStats) Correlation() float64
Correlation returns the Pearson product-moment correlation coefficient of the x and y samples seen so far
func (*CovarStats) Intercept ¶
func (c *CovarStats) Intercept() float64
Intercept returns the intercept of the least-squares linear fit of y on x from the samples seen so far.
func (*CovarStats) Slope ¶
func (c *CovarStats) Slope() float64
Slope returns the slope of the least-squares linear fit of y on x from the samples seen so far.
func (*CovarStats) XKurtosis ¶
func (c *CovarStats) XKurtosis() float64
XKurtosis returns the kurtosis of the x values seen so far.
func (*CovarStats) XMean ¶
func (c *CovarStats) XMean() float64
XMean returns the mean of the x values seen so far
func (*CovarStats) XSkewness ¶
func (c *CovarStats) XSkewness() float64
XSkewness returns the skewness of the x values seen so far
func (*CovarStats) XStdDev ¶
func (c *CovarStats) XStdDev() float64
XStdDev returns the standard deviation of the x values seen so far
func (*CovarStats) XVariance ¶
func (c *CovarStats) XVariance() float64
XVariance returns the variance of the x values seen so far
func (*CovarStats) YKurtosis ¶
func (c *CovarStats) YKurtosis() float64
YKurtosis returns the kurtosis of the y values seen so far
func (*CovarStats) YMean ¶
func (c *CovarStats) YMean() float64
YMean returns the mean of the y values seen so far
func (*CovarStats) YSkewness ¶
func (c *CovarStats) YSkewness() float64
YSkewness returns the skewness of the y values seen so far
func (*CovarStats) YStdDev ¶
func (c *CovarStats) YStdDev() float64
YStdDev returns the standard deviation of the y values seen so far
func (*CovarStats) YVariance ¶
func (c *CovarStats) YVariance() float64
YVariance returns the variance of the y values seen so far
type CumulativeDensity ¶
CumulativeDensity represents the probability P of observing a value less than or equal to X
type EWMA ¶
type EWMA struct {
// contains filtered or unexported fields
}
EWMA is a data structure for an exponentially weighted moving average.
type HyperLogLog ¶
type HyperLogLog struct {
// contains filtered or unexported fields
}
HyperLogLog is a data structure for computing count distinct on arbitrarily sized data.
func NewHyperLogLog ¶
func NewHyperLogLog(p byte, hash hash.Hash64) *HyperLogLog
NewHyperLogLog returns a new HyperLogLog data structure with 2^p buckets, based on "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al., in AofA '07: Proceedings of the 2007 International Conference on Analysis of Algorithms. This implementation does not include the HyperLogLog++ enhancements other than the 64-bit hash function, which eliminates the large-cardinality correction for hash collisions. It is also space-inefficient since whole bytes are used to store the register counts, which can be at most 60 < 2^6.
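A hypothetical usage sketch with p = 14 (16384 registers); the import path is a placeholder:

// Hypothetical HyperLogLog usage; the import path below is a placeholder.
package main

import (
	"fmt"
	"hash/fnv"

	"example.com/streamstats" // placeholder import path
)

func main() {
	hll := streamstats.NewHyperLogLog(14, fnv.New64a())
	for i := 0; i < 100000; i++ {
		hll.Add([]byte(fmt.Sprintf("user-%d", i)))
	}
	// The estimate should land within roughly ExpectedError of 100000.
	fmt.Println(hll.Distinct(), hll.ExpectedError())
}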
func (*HyperLogLog) Add ¶
func (hll *HyperLogLog) Add(item []byte)
Add adds an item to the multiset represented by the HyperLogLog
func (*HyperLogLog) BiasCorrected ¶
func (hll *HyperLogLog) BiasCorrected() uint64
BiasCorrected returns the bias corrected estimated number of distinct items in the multiset
func (*HyperLogLog) Compress ¶
func (hll *HyperLogLog) Compress(factor byte) *HyperLogLog
Compress produces a new HyperLogLog whose size is reduced by a factor of 2^factor, with correspondingly reduced precision. If the new p would be less than minimumHyperLogLogP, p is set to minimumHyperLogLogP. If factor is 0 it simply produces a copy.
func (*HyperLogLog) Distinct ¶
func (hll *HyperLogLog) Distinct() uint64
Distinct returns the estimated number of distinct items in the multiset
func (*HyperLogLog) ExpectedError ¶
func (hll *HyperLogLog) ExpectedError() float64
ExpectedError returns the estimated error in the number of distinct items in the multiset
func (*HyperLogLog) Intersect ¶
func (hll *HyperLogLog) Intersect(hllB *HyperLogLog) (*HyperLogLog, error)
Intersect intersects the estimates of two HyperLogLogs, reducing the precision to the minimum of the two. The function returns nil and an error if the hash functions do not match. Intersect will always overestimate the size of the intersection.
func (*HyperLogLog) LinearCounting ¶
func (hll *HyperLogLog) LinearCounting() uint64
LinearCounting returns the linear counting estimated number of distinct items in the multiset
func (*HyperLogLog) RawEstimate ¶
func (hll *HyperLogLog) RawEstimate() uint64
RawEstimate returns the raw estimated number of distinct items in the multiset
func (*HyperLogLog) Reset ¶
func (hll *HyperLogLog) Reset()
Reset zeros out the estimated number of distinct items in the multiset
func (*HyperLogLog) String ¶
func (hll *HyperLogLog) String() string
func (*HyperLogLog) Union ¶
func (hll *HyperLogLog) Union(hllB *HyperLogLog) (*HyperLogLog, error)
Union combines the estimates of two HyperLogLogs, reducing the precision to the minimum of the two. The function returns nil and an error if the hash functions do not match.
type LinearCounting ¶
type LinearCounting struct {
// contains filtered or unexported fields
}
LinearCounting is a space-efficient data structure for count distinct with a hard upper bound.
func NewLinearCounting ¶
func NewLinearCounting(p byte, hash hash.Hash64) *LinearCounting
NewLinearCounting initializes a LinearCounting structure with size m=2^p and the given hash function
func (*LinearCounting) Add ¶
func (lc *LinearCounting) Add(item []byte)
Add adds an item to the multiset represented by the LinearCounting structure
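A hypothetical usage sketch; with p = 16 the backing BitVector has 2^16 bits, so the count saturates at that many distinct items. The import path is a placeholder:

// Hypothetical LinearCounting usage; the import path below is a placeholder.
package main

import (
	"fmt"
	"hash/fnv"

	"example.com/streamstats" // placeholder import path
)

func main() {
	lc := streamstats.NewLinearCounting(16, fnv.New64a())
	for i := 0; i < 10000; i++ {
		lc.Add([]byte(fmt.Sprintf("item-%d", i)))
	}
	fmt.Println(lc.Distinct(), lc.Occupancy(), lc.ExpectedError())
}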
func (*LinearCounting) Compress ¶
func (lc *LinearCounting) Compress(factor byte) *LinearCounting
Compress produces a new LinearCounting whose size is reduced by a factor of 2^factor, with correspondingly reduced precision. If the new p would be less than minLinearCountingP, p is set to minLinearCountingP. If factor is 0 it simply produces a copy.
func (LinearCounting) Distinct ¶
func (lc LinearCounting) Distinct() uint64
Distinct returns the estimate of the number of distinct elements seen. If the backing BitVector is full it returns m, the size of the BitVector.
func (LinearCounting) ExpectedError ¶
func (lc LinearCounting) ExpectedError() float64
ExpectedError returns the expected error at the current fill level of the LinearCounting.
func (*LinearCounting) Intersect ¶
func (lc *LinearCounting) Intersect(lcB *LinearCounting) (*LinearCounting, error)
Intersect intersects the estimates of two LinearCounting structures, reducing the precision to the minimum of the two. The function returns nil and an error if the hash functions do not match.
func (LinearCounting) Occupancy ¶
func (lc LinearCounting) Occupancy() float64
Occupancy returns the ratio of filled buckets in the LinearCounting bitvector
func (LinearCounting) String ¶
func (lc LinearCounting) String() string
func (*LinearCounting) Union ¶
func (lc *LinearCounting) Union(lcB *LinearCounting) (*LinearCounting, error)
Union combines the estimates of two LinearCounting structures, reducing the precision to the minimum of the two. The function returns nil and an error if the hash functions do not match.
type MomentStats ¶
type MomentStats struct {
// contains filtered or unexported fields
}
MomentStats is a data structure for computing the first four moments of a stream.
func NewMomentStats ¶
func NewMomentStats() *MomentStats
NewMomentStats returns an empty MomentStats structure with no values
func (*MomentStats) Combine ¶
func (m *MomentStats) Combine(b *MomentStats) MomentStats
Combine combines the stats from two MomentStats structures
func (*MomentStats) Kurtosis ¶
func (m *MomentStats) Kurtosis() float64
Kurtosis returns the excess kurtosis of the samples seen so far
func (*MomentStats) Mean ¶
func (m *MomentStats) Mean() float64
Mean returns the mean of the observations seen so far
func (*MomentStats) Skewness ¶
func (m *MomentStats) Skewness() float64
Skewness returns the skewness of the samples seen so far
func (*MomentStats) StdDev ¶
func (m *MomentStats) StdDev() float64
StdDev returns the standard deviation of the samples seen so far
func (*MomentStats) String ¶
func (m *MomentStats) String() string
String returns the standard string representation of the samples seen so far
func (*MomentStats) Variance ¶
func (m *MomentStats) Variance() float64
Variance returns the variance of the observations seen so far
type P2Histogram ¶
type P2Histogram struct {
// contains filtered or unexported fields
}
P2Histogram is an O(1) time and space data structure for estimating evenly spaced histogram bins of a series of N data points, based on "The P2 Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations" by Raj Jain and Imrich Chlamtac, Communications of the ACM Volume 28 Issue 10, Oct. 1985, Pages 1076-1085.
func NewP2Histogram ¶
func NewP2Histogram(b uint64) P2Histogram
NewP2Histogram initializes the data structure to track b bins.
func (*P2Histogram) Add ¶
func (h *P2Histogram) Add(x float64)
Add updates the data structure with a given x value
func (*P2Histogram) CDF ¶
func (h *P2Histogram) CDF(x float64) float64
CDF returns the linear approximation to the CDF at x based on the histogram data
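A hypothetical usage sketch tracking a 10-bin histogram of Uniform(0,1) values; the import path is a placeholder:

// Hypothetical P2Histogram usage; the import path below is a placeholder.
package main

import (
	"fmt"
	"math/rand"

	"example.com/streamstats" // placeholder import path
)

func main() {
	h := streamstats.NewP2Histogram(10)
	for i := 0; i < 10000; i++ {
		h.Add(rand.Float64())
	}
	fmt.Println(h.CDF(0.5))      // roughly 0.5 for Uniform(0,1) data
	fmt.Println(h.Quantile(0.9)) // roughly 0.9
	fmt.Println(h.Min(), h.Max(), h.N())
}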
func (*P2Histogram) Histogram ¶
func (h *P2Histogram) Histogram() []CumulativeDensity
Histogram returns the histogram of observations seen so far
func (*P2Histogram) Max ¶
func (h *P2Histogram) Max() float64
Max returns the maximum of observations seen so far
func (*P2Histogram) Min ¶
func (h *P2Histogram) Min() float64
Min returns the minimum of observations seen so far
func (*P2Histogram) N ¶
func (h *P2Histogram) N() uint64
N returns the number of observations seen so far
func (*P2Histogram) Quantile ¶
func (h *P2Histogram) Quantile(p float64) float64
Quantile returns the linear approximation to the given quantile based on the histogram data
type P2Quantile ¶
type P2Quantile struct {
// contains filtered or unexported fields
}
P2Quantile is an O(1) time and space data structure for estimating the p-quantile of a series of N data points, based on "The P2 Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations" by Raj Jain and Imrich Chlamtac, Communications of the ACM Volume 28 Issue 10, Oct. 1985, Pages 1076-1085.
func NewP2Quantile ¶
func NewP2Quantile(p float64) P2Quantile
NewP2Quantile initializes the data structure to track the p-quantile.
func (*P2Quantile) Add ¶
func (p *P2Quantile) Add(x float64)
Add updates the data structure with a given x value
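A hypothetical usage sketch tracking the 0.95-quantile of a stream; the import path is a placeholder:

// Hypothetical P2Quantile usage; the import path below is a placeholder.
package main

import (
	"fmt"
	"math/rand"

	"example.com/streamstats" // placeholder import path
)

func main() {
	q := streamstats.NewP2Quantile(0.95)
	for i := 0; i < 10000; i++ {
		q.Add(rand.ExpFloat64()) // e.g. simulated latencies
	}
	// The p-quantile estimate plus the exact min, max, and count seen so far.
	fmt.Println(q.Quantile(), q.Min(), q.Max(), q.N())
}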
func (*P2Quantile) LowerQuantile ¶
func (p *P2Quantile) LowerQuantile() float64
LowerQuantile returns the estimate for the lower quantile, p/2
func (*P2Quantile) Max ¶
func (p *P2Quantile) Max() float64
Max returns the exact maximum value seen so far
func (*P2Quantile) Min ¶
func (p *P2Quantile) Min() float64
Min returns the exact minimum value seen so far
func (*P2Quantile) N ¶
func (p *P2Quantile) N() uint64
N returns the number of observations seen so far
func (*P2Quantile) Quantile ¶
func (p *P2Quantile) Quantile() float64
Quantile returns the estimated value for the p-quantile
func (*P2Quantile) UpperQuantile ¶
func (p *P2Quantile) UpperQuantile() float64
UpperQuantile returns the estimate for the upper quantile, (1+p)/2.