syzgydb

package module
v0.0.0-...-e21bc5c Latest Latest
Published: Nov 1, 2024 License: MIT Imports: 26 Imported by: 1

README


Syzgy DB


Introduction

SyzgyDB is a high-performance, embeddable vector database designed for applications requiring efficient handling of large datasets. Written in Go, it leverages disk-based storage to minimize memory usage, making it ideal for systems with limited resources. SyzgyDB supports a range of distance metrics, including Euclidean and Cosine, and offers multiple quantization levels to optimize storage and search performance.

With built-in integration for the Ollama server, SyzgyDB can automatically generate vector embeddings from text and images, simplifying the process of adding and querying data. This makes it well-suited for use cases such as image and video retrieval, recommendation systems, natural language processing, anomaly detection, and bioinformatics. With its RESTful API, SyzgyDB provides easy integration and management of collections and records, enabling developers to perform fast and flexible vector similarity searches.

Features

  • Disk-Based Storage: Operates with minimal memory usage by storing data on disk.
  • Automatic Embedding Generation: Seamlessly integrates with the Ollama server to generate vector embeddings from text and images, reducing the need for manual preprocessing.
  • Vector Quantization: Supports multiple quantization levels (4, 8, 16, 32, 64 bits) to optimize storage and performance.
  • Distance Metrics: Supports Euclidean and Cosine distance calculations for vector similarity.
  • Scalable: Efficiently handles large datasets with support for adding, updating, and removing documents.
  • Search Capabilities: Provides nearest neighbor and radius-based search functionalities.
  • Python Client: A Python client is available for easy integration with Python projects.
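
To make the two distance metrics named above concrete, here is a minimal sketch in plain Go. It is an illustration only, not SyzgyDB's own implementation; the function names are hypothetical, and treating a zero-length vector as maximally distant is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDistance returns the straight-line distance between a and b.
// Both slices are assumed to have the same length.
func euclideanDistance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosineDistance returns 1 minus the cosine similarity of a and b.
// Assumption: a zero-length vector is treated as maximally distant (1).
func cosineDistance(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	a := []float64{3, 0}
	b := []float64{0, 4}
	fmt.Println(euclideanDistance(a, b)) // 5
	fmt.Println(cosineDistance(a, b))    // 1 (the vectors are orthogonal)
}
```

Euclidean distance depends on vector magnitude, while cosine distance depends only on direction, which is why the latter is a common choice for text embeddings.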

Running with Docker

docker run -p 8080:8080 -v /path/to/your/data:/data smhanov/syzgydb

This command will:

  1. Pull the smhanov/syzgydb image from Docker Hub.
  2. Map port 8080 of the container to port 8080 on your host machine.
  3. Map the /data directory inside the container to /path/to/your/data on your host system, ensuring that your data is persisted outside the container.

Configuration

Configuration settings can be specified on the command line, in environment variables, or in the file /etc/syzgydb.conf.

Configuration Setting | Description                                               | Default Value
DATA_FOLDER           | Specifies where the persistent files are kept.            | ./data (command line) or /data (Docker)
OLLAMA_SERVER         | The optional Ollama server used to create embeddings.     | localhost:11434
TEXT_MODEL            | The name of the text embedding model to use with Ollama.  | all-minilm (384 dimensions)
IMAGE_MODEL           | The name of the image embedding model to use with Ollama. | minicpm-v
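
The environment-variable route can be pictured with the following sketch, which resolves each setting from the environment and falls back to the documented default. It assumes the environment variable names match the setting names in the table exactly; the `settings` type and helper functions are hypothetical and are not how SyzgyDB itself loads its configuration.

```go
package main

import (
	"fmt"
	"os"
)

// settings mirrors the four documented configuration settings (hypothetical type).
type settings struct {
	DataFolder   string
	OllamaServer string
	TextModel    string
	ImageModel   string
}

// lookupFunc matches the signature of os.LookupEnv so a fake can be substituted.
type lookupFunc func(key string) (string, bool)

// valueOr returns the looked-up value, or def when the key is unset or empty.
func valueOr(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// loadSettings applies the defaults listed in the configuration table.
func loadSettings(lookup lookupFunc) settings {
	return settings{
		DataFolder:   valueOr(lookup, "DATA_FOLDER", "./data"),
		OllamaServer: valueOr(lookup, "OLLAMA_SERVER", "localhost:11434"),
		TextModel:    valueOr(lookup, "TEXT_MODEL", "all-minilm"),
		ImageModel:   valueOr(lookup, "IMAGE_MODEL", "minicpm-v"),
	}
}

func main() {
	fmt.Printf("%+v\n", loadSettings(os.LookupEnv))
}
```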

RESTful API

SyzgyDB provides a RESTful API for managing collections and records. Below are the available endpoints and example curl requests.

Collections API

A collection is a database. You can create collections, drop them, and retrieve information about them.

Create a Collection

Endpoint: POST /api/v1/collections
Description: Creates a new collection with the specified parameters.
Request Body (JSON):

{
  "name": "collection_name",
  "vector_size": 128,
  "quantization": 64,
  "distance_function": "cosine"
}

Example curl:

curl -X POST http://localhost:8080/api/v1/collections -H "Content-Type: application/json" -d '{"name":"collection_name","vector_size":128,"quantization":64,"distance_function":"cosine"}'
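
The same request can be built from Go with the standard library. This is a sketch: the struct and function names are hypothetical, the JSON field names are taken from the request body above, and the base URL assumes the Docker setup shown earlier.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// createCollectionBody mirrors the documented JSON request body.
type createCollectionBody struct {
	Name             string `json:"name"`
	VectorSize       int    `json:"vector_size"`
	Quantization     int    `json:"quantization"`
	DistanceFunction string `json:"distance_function"`
}

// newCreateCollectionRequest builds, but does not send, the POST request.
func newCreateCollectionRequest(baseURL string, body createCollectionBody) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/collections", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateCollectionRequest("http://localhost:8080", createCollectionBody{
		Name:             "collection_name",
		VectorSize:       128,
		Quantization:     64,
		DistanceFunction: "cosine",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running server: resp, err := http.DefaultClient.Do(req)
}
```
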
Drop a Collection

Endpoint: DELETE /api/v1/collections/{collection_name}
Description: Deletes the specified collection.
Example curl:

curl -X DELETE http://localhost:8080/api/v1/collections/collection_name
Get Collection Info

Endpoint: GET /api/v1/collections/{collection_name}
Description: Retrieves information about a collection.
Example curl:

curl -X GET http://localhost:8080/api/v1/collections/collection_name
Data API
Insert / update records

Endpoint: POST /api/v1/collections/{collection_name}/records
Description: Inserts multiple records into a collection, overwriting any record whose ID already exists. You can provide either a vector or a text field for each record. If a text field is provided, the server automatically generates the vector embedding using the Ollama server. If an image field is provided, it must be base64-encoded.
Request Body (JSON):

[
  {
    "id": 1234567890,
    "text": "example text", // Optional: Provide text to generate vector
    "vector": [0.1, 0.2, ..., 0.5], // Optional: Directly provide a vector
    "metadata": {
      "key1": "value1",
      "key2": "value2"
    }
  },
  {
    "id": 1234567891,
    "text": "another example text",
    "metadata": {
      "key1": "value3"
    }
  }
]

Example curl:

curl -X POST http://localhost:8080/api/v1/collections/collection_name/records -H "Content-Type: application/json" -d '[{"id":1234567890,"vector":[0.1,0.2,0.3,0.4,0.5],"metadata":{"key1":"value1","key2":"value2"}},{"id":1234567891,"text":"example text","metadata":{"key1":"value1","key2":"value2"}}]'
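
When building this body from Go, `omitempty` tags keep each record down to the fields that were actually set, so a record carries either a vector or text. The sketch below only encodes the JSON; the `record` type and `encodeRecords` helper are hypothetical names, and the field names follow the request body above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record mirrors one element of the documented request body.
// Empty fields are omitted so a record carries either a vector or text.
type record struct {
	ID       uint64                 `json:"id"`
	Text     string                 `json:"text,omitempty"`
	Vector   []float64              `json:"vector,omitempty"`
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

// encodeRecords returns the JSON array to send as the request body.
func encodeRecords(records []record) (string, error) {
	b, err := json.Marshal(records)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := encodeRecords([]record{
		{ID: 1234567890, Vector: []float64{0.1, 0.2, 0.3, 0.4, 0.5}, Metadata: map[string]interface{}{"key1": "value1"}},
		{ID: 1234567891, Text: "example text"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```
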
Update a Record's Metadata

Endpoint: PUT /api/v1/collections/{collection_name}/records/{id}/metadata
Description: Updates metadata for a record.
Request Body (JSON):

{
  "metadata": {
    "key1": "new_value1",
    "key3": "value3"
  }
}

Example curl:

curl -X PUT http://localhost:8080/api/v1/collections/collection_name/records/1234567890/metadata -H "Content-Type: application/json" -d '{"metadata":{"key1":"new_value1","key3":"value3"}}'
Delete a Record

Endpoint: DELETE /api/v1/collections/{collection_name}/records/{id}
Description: Deletes a record.
Example curl:

curl -X DELETE http://localhost:8080/api/v1/collections/collection_name/records/1234567890
Get All Document IDs

Endpoint: GET /api/v1/collections/{collection_name}/ids
Description: Retrieves a JSON array of all document IDs in the specified collection.
Example curl:

curl -X GET http://localhost:8080/api/v1/collections/collection_name/ids
Search Records

Endpoint: POST /api/v1/collections/{collection_name}/search
Description: Searches for records based on the provided criteria. If no search parameters are provided, it lists all records in the collection, allowing pagination with limit and offset.

Request Body (JSON):

{
  "vector": [0.1, 0.2, 0.3, ..., 0.5],            // Optional: Provide a vector for similarity search
  "text": "example text",                         // Optional: Provide text to generate vector for search
  "k": 5,                                         // Optional: Number of nearest neighbors to return
  "radius": 0,                                    // Optional: Radius for range search
  "limit": 0,                                     // Optional: Maximum number of records to return
  "offset": 0,                                    // Optional: Number of records to skip for pagination
  "precision": "",                                // Optional: Set to "exact" for exhaustive search
  "filter": "age >= 18 AND status == \"active\""  // Optional: Query filter expression
}

Parameters Explanation:

  • vector: A numerical array representing the query vector. Used for similarity searches. If provided, the search will be based on this vector.
  • text: A string input that will be converted into a vector using the Ollama server. This is an alternative to providing a vector directly.
  • k: Specifies the number of nearest neighbors to return. Used when performing a k-nearest neighbor search.
  • radius: Defines the radius for a range search. All records within this distance from the query vector will be returned.
  • limit: Limits the number of records returned in the response. Useful for paginating results.
  • offset: Skips the specified number of records before starting to return results. Used in conjunction with limit for pagination.
  • precision: Specifies the search precision. Defaults to "medium". Set to "exact" to perform an exhaustive search of all points.
  • filter: A string containing a query filter expression. This allows for additional filtering of results based on metadata fields. See the Query Filter Language section for more details.

Example curl:

curl -X POST http://localhost:8080/api/v1/collections/collection_name/search -H "Content-Type: application/json" -d '{"vector":[0.1,0.2,0.3,0.4,0.5],"k":5,"limit":10,"offset":0,"filter":"age >= 18 AND status == \"active\""}'

Usage Scenarios:

  • List All Records: Call the endpoint with no parameters to list all records, using limit and offset to paginate.
  • Text-Based Search: Provide a text parameter to perform a search based on the text's vector representation.
  • Vector-Based Search: Use the vector parameter for direct vector similarity searches.
  • Range Query: Specify a radius to perform a range query, returning all records within the specified distance.
  • K-Nearest Neighbors: Use the k parameter to find the top k nearest records to the query vector.
  • Filtered Search: Use the filter parameter to apply additional constraints based on metadata fields.
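
The scenarios above differ only in which fields of the request body are set. A sketch of a Go type for that body follows; the `searchBody` type and `encodeSearch` helper are hypothetical names, while the JSON field names come from the request body shown earlier. HTML escaping is switched off so that `>` and `<` in filter expressions stay readable, and the encoder takes care of escaping the double quotes inside the filter.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// searchBody mirrors the documented search request; unset fields are omitted.
type searchBody struct {
	Vector    []float64 `json:"vector,omitempty"`
	Text      string    `json:"text,omitempty"`
	K         int       `json:"k,omitempty"`
	Radius    float64   `json:"radius,omitempty"`
	Limit     int       `json:"limit,omitempty"`
	Offset    int       `json:"offset,omitempty"`
	Precision string    `json:"precision,omitempty"`
	Filter    string    `json:"filter,omitempty"`
}

// encodeSearch returns the JSON request body as a string.
func encodeSearch(body searchBody) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep ">" and "<" in filter expressions readable
	if err := enc.Encode(body); err != nil {
		return "", err
	}
	return strings.TrimSpace(buf.String()), nil
}

func main() {
	// Text-based k-nearest-neighbor search with a metadata filter.
	body, err := encodeSearch(searchBody{
		Text:   "example text",
		K:      5,
		Filter: `age >= 18 AND status == "active"`,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```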

Usage in a Go Project

You don't need Docker or the REST API to use SyzgyDB. You can embed it directly in your Go project by importing the package:

    import "github.com/smhanov/syzgydb"
Creating a Collection

To create a new collection, define the collection options and initialize the collection:

options := syzgydb.CollectionOptions{
    Name:           "example.dat",
    DistanceMethod: syzgydb.Euclidean, // or syzgydb.Cosine
    DimensionCount: 128,               // Number of dimensions for each vector
    Quantization:   64,                // Quantization level (4, 8, 16, 32, 64)
}

// NewCollection returns the collection and an error.
collection, err := syzgydb.NewCollection(options)
if err != nil {
    log.Fatalf("Error creating collection: %v", err)
}
Adding Documents

Add documents to the collection by specifying an ID, vector, and optional metadata:

vector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example vector
metadata := []byte("example metadata")

collection.AddDocument(1, vector, metadata)
Searching

Perform a search to find similar vectors using either nearest neighbor or radius-based search:

searchVector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example search vector

// Nearest neighbor search
args := syzgydb.SearchArgs{
    Vector:   searchVector,
    K: 5, // Return top 5 results
}

results := collection.Search(args)

// Radius-based search
args = syzgydb.SearchArgs{
    Vector: searchVector,
    Radius: 0.5, // Search within a radius of 0.5
}

results = collection.Search(args)
Using a Filter Function

You can apply a filter function during the search to include only documents that meet certain criteria. There are two ways to create a filter function:

  1. Using a custom function:
filterFn := func(id uint64, metadata []byte) bool {
    return id%2 == 0 // Include only documents with even IDs
}

args := syzgydb.SearchArgs{
    Vector:   searchVector,
    K: 5, // Return top 5 results
    Filter:   filterFn,
}

results := collection.Search(args)
  2. Using the BuildFilter function with a query string:
queryString := `age >= 18 AND status == "active"`
filterFn, err := syzgydb.BuildFilter(queryString)
if err != nil {
    log.Fatalf("Error building filter: %v", err)
}

args := syzgydb.SearchArgs{
    Vector:   searchVector,
    K: 5, // Return top 5 results
    Filter:   filterFn,
}

results := collection.Search(args)

The BuildFilter function creates a filter function from a query string written in the Query Filter Language described in this document. This lets you filter search results on metadata fields without writing custom Go code for each filter.
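
For comparison, a hand-written filter equivalent to `age >= 18 AND status == "active"` might look like the sketch below. It assumes the metadata bytes hold a JSON object, which the Go API does not require since metadata is an opaque byte slice. The local `filterFn` type copies the shape of syzgydb.FilterFn so the example runs without the package; `adultActive` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// filterFn has the same shape as syzgydb.FilterFn.
type filterFn func(id uint64, metadata []byte) bool

// adultActive keeps documents whose JSON metadata has age >= 18 and
// status == "active". Documents with non-JSON metadata are excluded.
func adultActive(id uint64, metadata []byte) bool {
	var m struct {
		Age    float64 `json:"age"`
		Status string  `json:"status"`
	}
	if err := json.Unmarshal(metadata, &m); err != nil {
		return false
	}
	return m.Age >= 18 && m.Status == "active"
}

func main() {
	var f filterFn = adultActive
	fmt.Println(f(1, []byte(`{"age": 30, "status": "active"}`))) // true
	fmt.Println(f(2, []byte(`{"age": 16, "status": "active"}`))) // false
	fmt.Println(f(3, []byte("example metadata")))                // false: not JSON
}
```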

Updating and Removing Documents

Update the metadata of an existing document or remove a document from the collection:

// Update document metadata
err := collection.UpdateDocument(1, []byte("updated metadata"))

// Remove a document
err = collection.RemoveDocument(1)
Dumping the Collection

To dump the collection for inspection or backup, use the DumpIndex function:

syzgydb.DumpIndex("example.dat")

Python Client

A Python client for SyzgyDB is available, making it easy to integrate SyzgyDB with your Python projects.

Installation

You can install the Python client using pip:

pip install syzgy

The Python client package is available on PyPI at https://pypi.org/project/syzgy/0.1.0/

For usage instructions and more details, please refer to the Python client documentation.

Query Filter Language

SyzgyDB supports a powerful query filter language that allows you to filter search results based on metadata fields. This language can be used in the filter parameter of the search API.

Basic Syntax
  • Field Comparison: field_name operator value

    • Example: age >= 18
  • Logical Operations: Combine conditions using AND, OR, NOT

    • Example: (age >= 18 AND status == "active") OR role == "admin"
  • Parentheses: Use to group conditions and control evaluation order

    • Example: (status == "active" AND age >= 18) OR role == "admin"
Supported Operators
  • Comparison: ==, !=, >, <, >=, <=
  • String Operations: CONTAINS, STARTS_WITH, ENDS_WITH, MATCHES (regex)
  • Existence: EXISTS, DOES NOT EXIST
  • Array Operations: IN, NOT IN
Functions
  • field.length: Returns the length of a string or array
Examples
  1. Basic Comparison:

    age >= 18 AND status == "active"
    
  2. String Operations:

    name STARTS_WITH "John" AND email ENDS_WITH "@example.com"
    
  3. Array Operations:

    status IN ["important", "urgent"] 
    
  4. Nested Fields:

    user.profile.verified == true AND user.friends.length > 5
    
  5. Existence Checks:

    phone_number EXISTS AND emergency_contact DOES NOT EXIST
    
  6. Combining Existence with Other Conditions:

    (status == "active" OR status == "pending") AND profile_picture EXISTS
    
  7. Complex Query:

    (status == "active" AND age >= 18) OR (role == "admin" AND NOT (department == "IT")) AND last_login EXISTS
    

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue to discuss improvements or report bugs.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Documentation

Overview

Package syzgydb provides an embeddable vector database written in Go, designed to efficiently handle large datasets by keeping data on disk rather than in memory. This makes SyzgyDB well suited to systems with limited memory resources.

What is a Vector Database?

A vector database is a specialized database designed to store and query high-dimensional vector data. Vectors are numerical representations of data points, often used in machine learning and data science to represent features of objects, such as images, text, or audio. Vector databases enable efficient similarity searches, allowing users to find vectors that are close to a given query vector based on a specified distance metric.
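
The simplest form of similarity search is an exhaustive scan: compute the distance from the query to every stored vector and keep the k closest. The sketch below shows that baseline using Euclidean distance. It is illustrative only and does not reflect SyzgyDB's internal index; the `neighbor` type and `nearest` function are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// neighbor pairs a document ID with its distance to the query.
type neighbor struct {
	ID       uint64
	Distance float64
}

// euclidean returns the Euclidean distance between equal-length vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// nearest returns up to k vectors closest to query by scanning every vector.
func nearest(vectors map[uint64][]float64, query []float64, k int) []neighbor {
	if k < 0 {
		k = 0
	}
	out := make([]neighbor, 0, len(vectors))
	for id, v := range vectors {
		out = append(out, neighbor{ID: id, Distance: euclidean(v, query)})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Distance != out[j].Distance {
			return out[i].Distance < out[j].Distance
		}
		return out[i].ID < out[j].ID // tie-break for deterministic output
	})
	if k < len(out) {
		out = out[:k]
	}
	return out
}

func main() {
	vectors := map[uint64][]float64{
		1: {0, 0},
		2: {1, 0},
		3: {5, 5},
	}
	for _, n := range nearest(vectors, []float64{0.9, 0}, 2) {
		fmt.Printf("id=%d distance=%.2f\n", n.ID, n.Distance)
	}
}
```

A scan like this costs one distance computation per stored vector, which is the cost that index structures in a vector database aim to avoid.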

Features

- Disk-Based Storage: Operates with minimal memory usage by storing data on disk.
- Vector Quantization: Supports multiple quantization levels (4, 8, 16, 32, 64 bits) to optimize storage and performance.
- Distance Metrics: Supports Euclidean and Cosine distance calculations for vector similarity.
- Scalable: Efficiently handles large datasets with support for adding, updating, and removing documents.
- Search Capabilities: Provides nearest neighbor and radius-based search functionalities.

Usage

## Creating a Collection

To create a new collection, define the collection options and initialize the collection:

options := CollectionOptions{
    Name:           "example_collection",
    DistanceMethod: Euclidean, // or Cosine
    DimensionCount: 128,       // Number of dimensions for each vector
    Quantization:   64,        // Quantization level (4, 8, 16, 32, 64)
}

// NewCollection returns the collection and an error.
collection, err := NewCollection(options)
if err != nil {
    log.Fatalf("Error creating collection: %v", err)
}

## Adding Documents

Add documents to the collection by specifying an ID, vector, and optional metadata:

vector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example vector
metadata := []byte("example metadata")

collection.AddDocument(1, vector, metadata)

## Searching

Perform a search to find similar vectors using either nearest neighbor or radius-based search:

searchVector := []float64{0.1, 0.2, 0.3, ..., 0.128} // Example search vector

// Nearest neighbor search
args := SearchArgs{
    Vector:   searchVector,
    K:      5, // Return top 5 results
}

results := collection.Search(args)

// Radius-based search
args = SearchArgs{
    Vector: searchVector,
    Radius: 0.5, // Search within a radius of 0.5
}

results = collection.Search(args)

## Updating and Removing Documents

Update the metadata of an existing document or remove a document from the collection:

// Update document metadata
err := collection.UpdateDocument(1, []byte("updated metadata"))

// Remove a document
err = collection.removeDocument(1)

## Dumping the Collection

To dump the collection for inspection or backup, use the DumpIndex function:

DumpIndex("example_collection")

Index

Constants

const (
	StopSearch    = iota // Indicates to stop the search due to an error
	PointAccepted        // Indicates the point was accepted and is better
	PointChecked         // Indicates the point was checked unnecessarily
	PointIgnored         // no action taken; pretend point did not exist
)
const (
	Euclidean = iota
	Cosine
)

Variables

This section is empty.

Functions

func Configure

func Configure(cfg Config)

func DumpIndex

func DumpIndex(filename string)

DumpIndex reads the specified file and displays its contents in a human-readable format.

func EmbedText

func EmbedText(texts []string, useCache bool) ([][]float64, error)

EmbedText connects to the configured Ollama server and runs the configured text model to generate an embedding for the given text.

func ExportJSON

func ExportJSON(c *Collection, w io.Writer) error

func ImportJSON

func ImportJSON(collectionName string, r io.Reader) error

func RunServer

func RunServer()

func SpanLog

func SpanLog(format string, v ...interface{})

Types

type Collection

type Collection struct {
	CollectionOptions
	// contains filtered or unexported fields
}

Collection represents a collection of documents, supporting operations such as adding, updating, removing, and searching documents.

func NewCollection

func NewCollection(options CollectionOptions) (*Collection, error)

NewCollection creates a new Collection with the specified options. It initializes the collection's memory file and pivots manager.

func (*Collection) AddDocument

func (c *Collection) AddDocument(id uint64, vector []float64, metadata []byte)

AddDocument adds a new document to the collection with the specified ID, vector, and metadata. It manages pivots and encodes the document for storage.

func (*Collection) Close

func (c *Collection) Close() error

Close closes the memfile associated with the collection.

Returns:
- An error if the memfile cannot be closed.

func (*Collection) ComputeStats

func (c *Collection) ComputeStats() CollectionStats

ComputeStats gathers and returns statistics about the collection. It returns a CollectionStats object filled with the relevant statistics.

func (*Collection) GetAllIDs

func (c *Collection) GetAllIDs() []uint64

GetAllIDs returns a sorted list of all document IDs in the collection.

func (*Collection) GetDocument

func (c *Collection) GetDocument(id uint64) (*Document, error)

GetDocument retrieves a document from the collection by its ID. It returns the document or an error if the document is not found.

func (*Collection) GetDocumentCount

func (c *Collection) GetDocumentCount() int

GetDocumentCount returns the total number of documents in the collection.

This method provides a quick way to determine the size of the collection by returning the count of document IDs stored in the memfile.

func (*Collection) GetOptions

func (c *Collection) GetOptions() CollectionOptions

GetOptions returns the collection options used to create the collection.

func (*Collection) Search

func (c *Collection) Search(args SearchArgs) SearchResults

Search returns the search results, including the list of matching documents and the percentage of the database searched.

func (*Collection) UpdateDocument

func (c *Collection) UpdateDocument(id uint64, newMetadata []byte) error

UpdateDocument updates the metadata of an existing document in the collection. It returns an error if the document is not found.

type CollectionOptions

type CollectionOptions struct {
	// Name is the identifier for the collection.
	Name string `json:"name"`

	// DistanceMethod specifies the method used to calculate distances between vectors.
	// It can be either Euclidean or Cosine.
	DistanceMethod int `json:"distance_method"`

	// DimensionCount is the number of dimensions for each vector in the collection.
	DimensionCount int `json:"dimension_count"`

	// Quantization specifies the bit-level quantization for storing vectors.
	// Supported values are 4, 8, 16, 32, and 64, with 64 as the default.
	Quantization int `json:"quantization"`

	// FileMode specifies the mode for opening the memfile.
	FileMode FileMode `json:"-"`
}

CollectionOptions defines the configuration options for creating a Collection.
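
To get a feel for what the Quantization option means for storage, the following sketch estimates the raw size of one vector as dimensions times bits per dimension. This is back-of-the-envelope arithmetic only: it ignores IDs, metadata, and any per-record or index overhead, and it is not a description of SyzgyDB's actual file layout. The `vectorBytes` helper is a hypothetical name.

```go
package main

import "fmt"

// vectorBytes estimates the bytes needed for one vector's components,
// rounding up to a whole number of bytes.
func vectorBytes(dimensions, quantizationBits int) int {
	return (dimensions*quantizationBits + 7) / 8
}

func main() {
	// 384 dimensions matches the default all-minilm text model.
	for _, bits := range []int{4, 8, 16, 32, 64} {
		fmt.Printf("%2d-bit: %d bytes per vector\n", bits, vectorBytes(384, bits))
	}
}
```

Under these assumptions a 384-dimension vector takes 3072 bytes at 64 bits and 192 bytes at 4 bits, a 16-fold reduction.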

type CollectionStats

type CollectionStats struct {
	// Number of documents in the collection
	DocumentCount int `json:"document_count"`

	// Number of dimensions in each document vector
	DimensionCount int `json:"dimension_count"`

	// Quantization level used for storing vectors
	Quantization int `json:"quantization"`

	// Distance method used for calculating distances
	// cosine or euclidean
	DistanceMethod string `json:"distance_method"`

	// Storage on disk used by the collection
	StorageSize int64 `json:"storage_size"`

	// Average distance between random pairs of documents
	AverageDistance float64 `json:"average_distance"`
}

Contains statistics about the collection
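
One way to estimate a statistic like AverageDistance is to sample random pairs of distinct vectors and average their distances. The sketch below does this with Euclidean distance and a caller-supplied seed. How SyzgyDB itself samples pairs is not described here, so the sampling scheme and the `averagePairDistance` name are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// euclidean returns the Euclidean distance between equal-length vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// averagePairDistance samples random pairs of distinct vectors and returns
// their mean distance. It returns 0 when fewer than two vectors exist.
func averagePairDistance(vectors [][]float64, samples int, seed int64) float64 {
	if len(vectors) < 2 || samples <= 0 {
		return 0
	}
	rng := rand.New(rand.NewSource(seed))
	total := 0.0
	for s := 0; s < samples; s++ {
		i := rng.Intn(len(vectors))
		j := rng.Intn(len(vectors) - 1)
		if j >= i {
			j++ // shift so that j never equals i
		}
		total += euclidean(vectors[i], vectors[j])
	}
	return total / float64(samples)
}

func main() {
	vectors := [][]float64{{0, 0}, {3, 4}, {6, 8}}
	fmt.Printf("%.3f\n", averagePairDistance(vectors, 1000, 42))
}
```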

type Config

type Config struct {
	OllamaServer string `mapstructure:"ollama_server"`
	TextModel    string `mapstructure:"text_model"`
	ImageModel   string `mapstructure:"image_model"`
	DataFolder   string `mapstructure:"data_folder"`
	SyzgyHost    string `mapstructure:"syzgy_host"`
	HTMLRoot     string `mapstructure:"html_root"`

	// If non-zero, we will use pseudorandom numbers so everything is predictable for testing.
	RandomSeed int64
}

Config holds the configuration settings for the service.

type DataStream

type DataStream struct {
	StreamID uint8
	Data     []byte
}

type Document

type Document struct {
	// ID is the unique identifier for the document.
	ID uint64

	// Vector is the numerical representation of the document.
	Vector []float64

	// Metadata is additional information associated with the document.
	Metadata []byte
}

Document represents a single document in the collection, consisting of an ID, vector, and metadata.

type EmbedTextFunc

type EmbedTextFunc func(text []string, useCache bool) ([][]float64, error)

type FileMode

type FileMode int
const (
	CreateIfNotExists  FileMode = 0 // Create the file only if it doesn't exist
	ReadWrite          FileMode = 1 // Open the file for read/write access
	ReadOnly           FileMode = 2 // Open the file for read-only access
	CreateAndOverwrite FileMode = 3 // Always create and overwrite the file if it exists
)

type FilterFn

type FilterFn func(id uint64, metadata []byte) bool

func BuildFilter

func BuildFilter(queryIn string) (FilterFn, error)

BuildFilter compiles the query into a filter function that can be used with SearchArgs.

type FreeSpan

type FreeSpan struct {
	Offset uint64
	Length uint64
}

type IndexEntry

type IndexEntry struct {
	Offset         uint64
	Span           *Span
	SequenceNumber uint64
}

type SearchArgs

type SearchArgs struct {
	// Vector is the search vector used to find similar documents.
	Vector []float64

	// Filter is an optional function to filter documents based on their ID and metadata.
	Filter FilterFn

	// K specifies the maximum number of nearest neighbors to return.
	K int

	// Radius specifies the maximum distance for radius-based search.
	Radius float64

	// When K and Radius are both 0, all documents are returned in order of ID.
	// Offset and Limit select the page of results to return.
	Offset    int
	Limit     int
	Precision string
}

SearchArgs defines the arguments for performing a search in the collection.
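
The listing mode controlled by Offset and Limit can be pictured as slicing a sorted list of IDs. The sketch below shows that windowing logic on its own. Treating a Limit of 0 as "no limit" is an assumption, and `page` is a hypothetical helper rather than part of the package.

```go
package main

import "fmt"

// page returns the window of ids selected by offset and limit.
// Assumption: a limit of 0 (or less) means "no limit".
func page(ids []uint64, offset, limit int) []uint64 {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(ids) {
		return nil
	}
	end := len(ids)
	if limit > 0 && offset+limit < end {
		end = offset + limit
	}
	return ids[offset:end]
}

func main() {
	ids := []uint64{1, 2, 3, 4, 5}
	fmt.Println(page(ids, 1, 2)) // [2 3]
	fmt.Println(page(ids, 4, 10)) // [5]
	fmt.Println(page(ids, 0, 0)) // [1 2 3 4 5]
}
```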

type SearchResult

type SearchResult struct {
	// ID is the unique identifier of the document in the search result.
	ID uint64

	// Metadata is the associated metadata of the document in the search result.
	Metadata []byte

	// Distance is the calculated distance from the search vector to the document vector.
	Distance float64
}

SearchResult represents a single result from a search operation, including the document ID, metadata, and distance.

type SearchResults

type SearchResults struct {
	// Results is a slice of SearchResult containing the documents that matched the search criteria.
	Results []SearchResult

	// PercentSearched indicates the percentage of the database that was searched to obtain the results.
	PercentSearched float64
}

SearchResults contains the results of a search operation, including the list of results and the percentage of the database searched.

type Server

type Server struct {
	// contains filtered or unexported fields
}

type Span

type Span struct {
	MagicNumber    uint32
	Length         uint64
	SequenceNumber uint32
	RecordID       string
	DataStreams    []DataStream
	Checksum       uint32
}

type SpanFile

type SpanFile struct {
	// contains filtered or unexported fields
}

func OpenFile

func OpenFile(filename string, mode FileMode) (*SpanFile, error)

func (*SpanFile) Close

func (db *SpanFile) Close() error

func (*SpanFile) GetStats

func (db *SpanFile) GetStats() (size uint64, numRecords int)

func (*SpanFile) IterateRecords

func (db *SpanFile) IterateRecords(callback func(recordID string, sr *SpanReader) error) error

func (*SpanFile) IterateSortedRecords

func (db *SpanFile) IterateSortedRecords(callback func(recordID string, sr *SpanReader) error) error

func (*SpanFile) ReadRecord

func (db *SpanFile) ReadRecord(recordID string) (*Span, error)

func (*SpanFile) RemoveRecord

func (db *SpanFile) RemoveRecord(recordID string) error

func (*SpanFile) WriteRecord

func (db *SpanFile) WriteRecord(recordID string, dataStreams []DataStream) error

type SpanReader

type SpanReader struct {
	// contains filtered or unexported fields
}
