Documentation ¶
Overview ¶
Package syzgydb provides an embeddable vector database written in Go, designed to efficiently handle large datasets by keeping data on disk rather than in memory. This makes SyzygyDB ideal for systems with limited memory resources.
What is a Vector Database? ¶
A vector database is a specialized database designed to store and query high-dimensional vector data. Vectors are numerical representations of data points, often used in machine learning and data science to represent features of objects, such as images, text, or audio. Vector databases enable efficient similarity searches, allowing users to find vectors that are close to a given query vector based on a specified distance metric.
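To make "close under a distance metric" concrete, here is a minimal sketch of the two metrics the package names, Euclidean and Cosine. The function names `euclideanDistance` and `cosineDistance` are illustrative and are not part of the syzgydb API. Returning a cosine distance of 1 for a zero-length vector is an assumption of this sketch, not documented library behavior.

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDistance returns the straight-line distance between a and b.
// It assumes both slices have the same length.
func euclideanDistance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosineDistance returns 1 minus the cosine similarity of a and b,
// so 0 means "same direction" and 1 means "orthogonal".
// Assumption of this sketch: a zero-length vector yields distance 1.
func cosineDistance(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	query := []float64{1, 0}
	fmt.Println(euclideanDistance(query, []float64{4, 4})) // 5 (a 3-4-5 triangle)
	fmt.Println(cosineDistance(query, []float64{0, 2}))    // 1 (orthogonal vectors)
}
```

Euclidean distance is sensitive to vector magnitude, while cosine distance only compares direction, which is why the choice of metric is fixed per collection.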
Features ¶
- Disk-Based Storage: Operates with minimal memory usage by storing data on disk.
- Vector Quantization: Supports multiple quantization levels (4, 8, 16, 32, 64 bits) to optimize storage and performance.
- Distance Metrics: Supports Euclidean and Cosine distance calculations for vector similarity.
- Scalable: Efficiently handles large datasets with support for adding, updating, and removing documents.
- Search Capabilities: Provides nearest neighbor and radius-based search functionalities.
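The quantization feature trades precision for storage. This documentation does not describe how the package quantizes vectors internally, so the following is only a generic sketch of uniform 8-bit scalar quantization. The helpers `quantize8` and `dequantize8` and the fixed value range `[lo, hi]` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// quantize8 maps v from the range [lo, hi] onto one of 256 levels (8 bits).
// Values outside the range are clamped to the nearest end.
func quantize8(v, lo, hi float64) uint8 {
	if v < lo {
		v = lo
	}
	if v > hi {
		v = hi
	}
	return uint8(math.Round((v - lo) / (hi - lo) * 255))
}

// dequantize8 maps a stored level back to an approximate value in [lo, hi].
func dequantize8(q uint8, lo, hi float64) float64 {
	return lo + float64(q)/255*(hi-lo)
}

func main() {
	const lo, hi = -1.0, 1.0
	original := 0.3
	stored := quantize8(original, lo, hi)
	restored := dequantize8(stored, lo, hi)
	// The round-trip error is at most half a quantization step.
	fmt.Printf("stored=%d restored=%.4f error=%.4f\n",
		stored, restored, math.Abs(restored-original))
}
```

With 8 bits each component takes one byte instead of the eight bytes of a float64, at the cost of a bounded rounding error. Fewer bits shrink storage further and increase that error.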
Usage ¶
## Creating a Collection
To create a new collection, define the collection options and initialize the collection:
options := CollectionOptions{
    Name:           "example_collection",
    DistanceMethod: Euclidean, // or Cosine
    DimensionCount: 128,       // Number of dimensions for each vector
    Quantization:   64,        // Quantization level (4, 8, 16, 32, 64)
}
collection, err := NewCollection(options)
if err != nil {
    // handle the error
}
## Adding Documents
Add documents to the collection by specifying an ID, vector, and optional metadata:
vector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example vector
metadata := []byte("example metadata")
collection.AddDocument(1, vector, metadata)
## Searching
Perform a search to find similar vectors using either nearest neighbor or radius-based search:
searchVector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example search vector

// Nearest neighbor search
args := SearchArgs{
    Vector: searchVector,
    K:      5, // Return top 5 results
}
results := collection.Search(args)

// Radius-based search
args = SearchArgs{
    Vector: searchVector,
    Radius: 0.5, // Search within a radius of 0.5
}
results = collection.Search(args)
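To show what the two search modes compute, here is a self-contained brute-force sketch that does not use the package. The types `doc` and `hit` and the functions `scan`, `nearest`, and `withinRadius` are hypothetical names. The sketch scans every document, whereas the library reports a percentage of the database searched, which suggests it can skip part of the data.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// doc and hit are illustrative stand-ins, not syzgydb types.
type doc struct {
	ID     uint64
	Vector []float64
}

type hit struct {
	ID       uint64
	Distance float64
}

// distance is the Euclidean distance between equal-length vectors.
func distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// scan measures every document against the query and sorts nearest first,
// breaking ties by ID so the order is deterministic.
func scan(docs []doc, query []float64) []hit {
	hits := make([]hit, 0, len(docs))
	for _, d := range docs {
		hits = append(hits, hit{ID: d.ID, Distance: distance(d.Vector, query)})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Distance != hits[j].Distance {
			return hits[i].Distance < hits[j].Distance
		}
		return hits[i].ID < hits[j].ID
	})
	return hits
}

// nearest returns at most k hits closest to the query.
func nearest(docs []doc, query []float64, k int) []hit {
	hits := scan(docs, query)
	if k >= 0 && k < len(hits) {
		hits = hits[:k]
	}
	return hits
}

// withinRadius returns every hit whose distance is at most radius.
func withinRadius(docs []doc, query []float64, radius float64) []hit {
	hits := scan(docs, query)
	n := sort.Search(len(hits), func(i int) bool { return hits[i].Distance > radius })
	return hits[:n]
}

// sampleDocs returns a tiny two-dimensional data set.
func sampleDocs() []doc {
	return []doc{
		{ID: 1, Vector: []float64{0, 0}},
		{ID: 2, Vector: []float64{3, 4}},
		{ID: 3, Vector: []float64{1, 0}},
		{ID: 4, Vector: []float64{0, 2}},
	}
}

func main() {
	query := []float64{0, 0}
	fmt.Println(nearest(sampleDocs(), query, 2))      // documents 1 and 3
	fmt.Println(withinRadius(sampleDocs(), query, 2)) // documents 1, 3 and 4
}
```

A nearest neighbor search bounds the number of results, while a radius search bounds the distance and may return any number of results, including none.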
## Updating and Removing Documents
Update the metadata of an existing document or remove a document from the collection:
// Update document metadata
err := collection.UpdateDocument(1, []byte("updated metadata"))

// Remove a document (removeDocument is unexported, so this call only compiles inside the package)
err = collection.removeDocument(1)
## Dumping the Collection
To dump the collection for inspection or backup, use the DumpIndex function:
DumpIndex("example_collection")
Contributing ¶
Contributions are welcome! Please feel free to submit a pull request or open an issue to discuss improvements or report bugs.
License ¶
This project is licensed under the MIT License. See the LICENSE file for details.
Index ¶
- Constants
- func Configure(cfg Config)
- func DumpIndex(filename string)
- func EmbedText(texts []string, useCache bool) ([][]float64, error)
- func ExportJSON(c *Collection, w io.Writer) error
- func ImportJSON(collectionName string, r io.Reader) error
- func RunServer()
- func SpanLog(format string, v ...interface{})
- type Collection
- func (c *Collection) AddDocument(id uint64, vector []float64, metadata []byte)
- func (c *Collection) Close() error
- func (c *Collection) ComputeStats() CollectionStats
- func (c *Collection) GetAllIDs() []uint64
- func (c *Collection) GetDocument(id uint64) (*Document, error)
- func (c *Collection) GetDocumentCount() int
- func (c *Collection) GetOptions() CollectionOptions
- func (c *Collection) Search(args SearchArgs) SearchResults
- func (c *Collection) UpdateDocument(id uint64, newMetadata []byte) error
- type CollectionOptions
- type CollectionStats
- type Config
- type DataStream
- type Document
- type EmbedTextFunc
- type FileMode
- type FilterFn
- type FreeSpan
- type IndexEntry
- type SearchArgs
- type SearchResult
- type SearchResults
- type Server
- type Span
- type SpanFile
- func (db *SpanFile) Close() error
- func (db *SpanFile) GetStats() (size uint64, numRecords int)
- func (db *SpanFile) IterateRecords(callback func(recordID string, sr *SpanReader) error) error
- func (db *SpanFile) IterateSortedRecords(callback func(recordID string, sr *SpanReader) error) error
- func (db *SpanFile) ReadRecord(recordID string) (*Span, error)
- func (db *SpanFile) RemoveRecord(recordID string) error
- func (db *SpanFile) WriteRecord(recordID string, dataStreams []DataStream) error
- type SpanReader
Constants ¶
const (
    StopSearch    = iota // Indicates to stop the search due to an error
    PointAccepted        // Indicates the point was accepted and is better
    PointChecked         // Indicates the point was checked unnecessarily
    PointIgnored         // no action taken; pretend point did not exist
)
const (
    Euclidean = iota
    Cosine
)
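Both constant blocks rely on `iota`, which restarts at 0 in each `const` block and increases by one per line. The sketch below redeclares the constants locally to show the resulting values. The helper `distanceName` is hypothetical. It only illustrates how the integer `DistanceMethod` could be turned into the "cosine or euclidean" string that `CollectionStats` mentions.

```go
package main

import "fmt"

// Local copies of the two constant blocks from the documentation.
const (
	StopSearch = iota
	PointAccepted
	PointChecked
	PointIgnored
)

const (
	Euclidean = iota
	Cosine
)

// distanceName is a hypothetical helper, not part of the package.
func distanceName(method int) string {
	switch method {
	case Euclidean:
		return "euclidean"
	case Cosine:
		return "cosine"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(StopSearch, PointAccepted, PointChecked, PointIgnored) // 0 1 2 3
	fmt.Println(Euclidean, Cosine)                                     // 0 1
	fmt.Println(distanceName(Cosine))                                  // cosine
}
```

Because `Euclidean` is 0, a zero-valued `DistanceMethod` field selects the Euclidean metric.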
Variables ¶
This section is empty.
Functions ¶
func DumpIndex ¶
func DumpIndex(filename string)
DumpIndex reads the specified file and displays its contents in a human-readable format.
func EmbedText ¶
func EmbedText(texts []string, useCache bool) ([][]float64, error)
EmbedText connects to the configured Ollama server and runs the configured text model to generate embeddings for the given texts.
func ExportJSON ¶
func ExportJSON(c *Collection, w io.Writer) error
Types ¶
type Collection ¶
type Collection struct {
    CollectionOptions
    // contains filtered or unexported fields
}
Collection represents a collection of documents, supporting operations such as adding, updating, removing, and searching documents.
func NewCollection ¶
func NewCollection(options CollectionOptions) (*Collection, error)
NewCollection creates a new Collection with the specified options. It initializes the collection's memory file and pivots manager.
func (*Collection) AddDocument ¶
func (c *Collection) AddDocument(id uint64, vector []float64, metadata []byte)
AddDocument adds a new document to the collection with the specified ID, vector, and metadata. It manages pivots and encodes the document for storage.
func (*Collection) Close ¶
func (c *Collection) Close() error
Close closes the memfile associated with the collection.
Returns an error if the memfile cannot be closed.
func (*Collection) ComputeStats ¶
func (c *Collection) ComputeStats() CollectionStats
ComputeStats gathers and returns statistics about the collection. It returns a CollectionStats object filled with the relevant statistics.
func (*Collection) GetAllIDs ¶
func (c *Collection) GetAllIDs() []uint64
GetAllIDs returns a sorted list of all document IDs in the collection.
func (*Collection) GetDocument ¶
func (c *Collection) GetDocument(id uint64) (*Document, error)
GetDocument retrieves a document from the collection by its ID. It returns the document or an error if the document is not found.
func (*Collection) GetDocumentCount ¶
func (c *Collection) GetDocumentCount() int
GetDocumentCount returns the total number of documents in the collection.
This method provides a quick way to determine the size of the collection by returning the count of document IDs stored in the memfile.
func (*Collection) GetOptions ¶
func (c *Collection) GetOptions() CollectionOptions
GetOptions returns the collection options used to create the collection.
func (*Collection) Search ¶
func (c *Collection) Search(args SearchArgs) SearchResults
Search returns the search results, including the list of matching documents and the percentage of the database searched.
func (*Collection) UpdateDocument ¶
func (c *Collection) UpdateDocument(id uint64, newMetadata []byte) error
UpdateDocument updates the metadata of an existing document in the collection. It returns an error if the document is not found.
type CollectionOptions ¶
type CollectionOptions struct {
    // Name is the identifier for the collection.
    Name string `json:"name"`

    // DistanceMethod specifies the method used to calculate distances between vectors.
    // It can be either Euclidean or Cosine.
    DistanceMethod int `json:"distance_method"`

    // DimensionCount is the number of dimensions for each vector in the collection.
    DimensionCount int `json:"dimension_count"`

    // Quantization specifies the bit-level quantization for storing vectors.
    // Supported values are 4, 8, 16, 32, and 64, with 64 as the default.
    Quantization int `json:"quantization"`

    // FileMode specifies the mode for opening the memfile.
    FileMode FileMode `json:"-"`
}
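The struct tags determine how the options appear when encoded with `encoding/json`. The sketch below uses a local mirror of the struct to show the resulting field names and the effect of `json:"-"`. The lowercase `collectionOptions` type, the `encodeOptions` helper, and the `int` underlying type chosen for `fileMode` are assumptions, since the real `FileMode` definition is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fileMode is a stand-in; its underlying type is an assumption.
type fileMode int

// collectionOptions mirrors the documented fields and tags.
type collectionOptions struct {
	Name           string   `json:"name"`
	DistanceMethod int      `json:"distance_method"`
	DimensionCount int      `json:"dimension_count"`
	Quantization   int      `json:"quantization"`
	FileMode       fileMode `json:"-"` // never written to or read from JSON
}

// encodeOptions returns the JSON form of the options.
func encodeOptions(o collectionOptions) (string, error) {
	b, err := json.Marshal(o)
	return string(b), err
}

func main() {
	out, err := encodeOptions(collectionOptions{
		Name:           "example_collection",
		DistanceMethod: 0, // Euclidean
		DimensionCount: 128,
		Quantization:   64,
		FileMode:       1,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// {"name":"example_collection","distance_method":0,"dimension_count":128,"quantization":64}
}
```

The `FileMode` field is omitted from the output because of its `json:"-"` tag, so it describes how a collection is opened rather than a property that travels with the encoded options.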
CollectionOptions defines the configuration options for creating a Collection.
type CollectionStats ¶
type CollectionStats struct {
    // Number of documents in the collection
    DocumentCount int `json:"document_count"`

    // Number of dimensions in each document vector
    DimensionCount int `json:"dimension_count"`

    // Quantization level used for storing vectors
    Quantization int `json:"quantization"`

    // Distance method used for calculating distances
    // cosine or euclidean
    DistanceMethod string `json:"distance_method"`

    // Storage on disk used by the collection
    StorageSize int64 `json:"storage_size"`

    // Average distance between random pairs of documents
    AverageDistance float64 `json:"average_distance"`
}
CollectionStats contains statistics about the collection.
type Config ¶
type Config struct {
    OllamaServer string `mapstructure:"ollama_server"`
    TextModel    string `mapstructure:"text_model"`
    ImageModel   string `mapstructure:"image_model"`
    DataFolder   string `mapstructure:"data_folder"`
    SyzgyHost    string `mapstructure:"syzgy_host"`
    HTMLRoot     string `mapstructure:"html_root"`

    // If non-zero, we will use pseudorandom numbers so everything is predictable for testing.
    RandomSeed int64
}
Config holds the configuration settings for the service.
type DataStream ¶
type Document ¶
type Document struct {
    // ID is the unique identifier for the document.
    ID uint64

    // Vector is the numerical representation of the document.
    Vector []float64

    // Metadata is additional information associated with the document.
    Metadata []byte
}
Document represents a single document in the collection, consisting of an ID, vector, and metadata.
type FilterFn ¶
func BuildFilter ¶
BuildFilter compiles the query into a filter function that can be used with SearchArgs.
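Neither the signature of `FilterFn` nor the query syntax accepted by `BuildFilter` is shown here. The sketch below therefore hand-writes a filter instead of compiling one. It assumes a filter is a function of a document's ID and metadata that reports whether to keep the document, which matches the description of the `Filter` field in `SearchArgs`. The names `filterFn`, `record`, `metadataContains`, and `keep` are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// filterFn is an assumed shape for a filter: true means "keep the document".
type filterFn func(id uint64, metadata []byte) bool

// record is an illustrative stand-in for a stored document.
type record struct {
	ID       uint64
	Metadata []byte
}

// metadataContains builds a filter that keeps documents whose
// metadata contains the given byte sequence.
func metadataContains(needle string) filterFn {
	return func(id uint64, metadata []byte) bool {
		return bytes.Contains(metadata, []byte(needle))
	}
}

// keep returns the IDs of the records accepted by f.
// A nil filter accepts every record.
func keep(records []record, f filterFn) []uint64 {
	var ids []uint64
	for _, r := range records {
		if f == nil || f(r.ID, r.Metadata) {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

// sampleRecords returns a few records with plain-text metadata.
func sampleRecords() []record {
	return []record{
		{ID: 1, Metadata: []byte("kind=cat")},
		{ID: 2, Metadata: []byte("kind=dog")},
		{ID: 3, Metadata: []byte("kind=cat")},
	}
}

func main() {
	fmt.Println(keep(sampleRecords(), metadataContains("cat"))) // [1 3]
}
```

Applying the filter while scanning, rather than after ranking, keeps rejected documents from taking up slots in a bounded result list.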
type IndexEntry ¶
type SearchArgs ¶
type SearchArgs struct {
    // Vector is the search vector used to find similar documents.
    Vector []float64

    // Filter is an optional function to filter documents based on their ID and metadata.
    Filter FilterFn

    // K specifies the maximum number of nearest neighbors to return.
    K int

    // Radius specifies the maximum distance for radius-based search.
    Radius float64

    // When K and Radius are both 0, all documents are returned in order of ID.
    // Offset and Limit specify which part of that list to return.
    Offset int
    Limit  int

    Precision string
}
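When neither a neighbor count nor a radius is set, the struct comments say that documents come back in ID order, with `Offset` and `Limit` selecting a window. The following sketch shows that windowing on a plain slice of IDs. The `page` function is hypothetical, and treating a `limit` of 0 or less as "no limit" is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// page returns the IDs in ascending order, skipping the first offset
// entries and returning at most limit entries.
// Assumption of this sketch: limit <= 0 means "no limit".
func page(ids []uint64, offset, limit int) []uint64 {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]uint64(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	if offset < 0 {
		offset = 0
	}
	if offset >= len(sorted) {
		return nil
	}
	end := len(sorted)
	if limit > 0 && offset+limit < end {
		end = offset + limit
	}
	return sorted[offset:end]
}

func main() {
	ids := []uint64{5, 1, 4, 2, 3}
	fmt.Println(page(ids, 0, 2)) // [1 2]
	fmt.Println(page(ids, 2, 2)) // [3 4]
	fmt.Println(page(ids, 4, 2)) // [5]
}
```

Stepping `offset` by `limit` walks the whole collection in stable ID order, provided no documents are added or removed between calls.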
SearchArgs defines the arguments for performing a search in the collection.
type SearchResult ¶
type SearchResult struct {
    // ID is the unique identifier of the document in the search result.
    ID uint64

    // Metadata is the associated metadata of the document in the search result.
    Metadata []byte

    // Distance is the calculated distance from the search vector to the document vector.
    Distance float64
}
SearchResult represents a single result from a search operation, including the document ID, metadata, and distance.
type SearchResults ¶
type SearchResults struct {
    // Results is a slice of SearchResult containing the documents that matched the search criteria.
    Results []SearchResult

    // PercentSearched indicates the percentage of the database that was searched to obtain the results.
    PercentSearched float64
}
SearchResults contains the results of a search operation, including the list of results and the percentage of the database searched.
type SpanFile ¶
type SpanFile struct {
    // contains filtered or unexported fields
}
func (*SpanFile) IterateRecords ¶
func (db *SpanFile) IterateRecords(callback func(recordID string, sr *SpanReader) error) error
func (*SpanFile) IterateSortedRecords ¶
func (db *SpanFile) IterateSortedRecords(callback func(recordID string, sr *SpanReader) error) error
func (*SpanFile) RemoveRecord ¶
func (db *SpanFile) RemoveRecord(recordID string) error
func (*SpanFile) WriteRecord ¶
func (db *SpanFile) WriteRecord(recordID string, dataStreams []DataStream) error
type SpanReader ¶
type SpanReader struct {
    // contains filtered or unexported fields
}