💎 Not Diamond Go SDK

A one-line change to improve the reliability and uptime of LLM requests.

Note: currently supported providers:
- OpenAI models
- Azure models
- Vertex AI
✨ Features:
- Fall back to other models if one fails
- Load balance requests between multiple models
- Max retries and timeout for each model
- Exponential backoff strategy
- Retry based on HTTP status codes
- Average rolling latency fallback
📦 Installation
go get github.com/Not-Diamond/go-notdiamond
🚀 Basic Usage
Error handling is intentionally omitted from the example for simplicity.
Redis must be running and reachable from the machine that executes this code. It is expected on the default port 6379, which can be changed (see Customise Redis below).
// Get keys
openaiApiKey := ""
azureApiKey := ""
azureEndpoint := ""
vertexProjectID := ""           // Your Google Cloud project ID
vertexLocation := "us-central1" // Your Google Cloud region

// Create provider requests
openaiRequest := openai.NewRequest("https://api.openai.com/v1/chat/completions", openaiApiKey)
azureRequest := azure.NewRequest(azureEndpoint, azureApiKey)
vertexRequest := vertex.NewRequest(vertexProjectID, vertexLocation)

// Create config
config := model.Config{
	Clients: []http.Request{openaiRequest, azureRequest, vertexRequest},
	Models: model.OrderedModels{
		"vertex/gemini-pro",
		"azure/gpt-4o-mini",
		"openai/gpt-4o-mini",
	},
	MaxRetries: map[string]int{
		"vertex/gemini-pro":  2,
		"azure/gpt-4o-mini":  2,
		"openai/gpt-4o-mini": 2,
	},
	VertexProjectID: vertexProjectID,
	VertexLocation:  vertexLocation,
}

// Create transport (error handling omitted)
transport, _ := notdiamond.NewTransport(config)

// Create a standard http.Client that uses our transport
client := &http.Client{
	Transport: transport,
}

// Prepare payload
messages := []map[string]string{{"role": "user", "content": "Hello, how are you?"}}
payload := map[string]interface{}{"model": "gpt-4o-mini", "messages": messages}
jsonData, _ := json.Marshal(payload)

// Create request
req, _ := http.NewRequest("POST", "https://api.openai.com/v1/chat/completions", bytes.NewBuffer(jsonData))
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "Bearer "+openaiApiKey)

// Do the request via the standard http.Client with our transport
resp, _ := client.Do(req)
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)

// Final response
fmt.Println(string(body))
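With this setup, each request is tried against vertex/gemini-pro first and, if the call fails, falls back to azure/gpt-4o-mini and then openai/gpt-4o-mini, following the order given in Models.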
Load Balancing
You can configure load balancing between models using weights:

config := model.Config{
	// ... other config ...
	Models: model.WeightedModels{
		"vertex/gemini-pro": 0.4, // 40% of requests
		"azure/gpt-4":       0.3, // 30% of requests
		"openai/gpt-4":      0.3, // 30% of requests
	},
}
Max Retries
Configure a custom maximum number of retries for each model:

// Default max retries is 1
config := model.Config{
	// ... other config ...
	MaxRetries: map[string]int{
		"azure/gpt-4":  3, // 3 retries
		"openai/gpt-4": 2, // 2 retries
	},
}
Timeout
Configure a custom timeout (in seconds) for each model:

// Default timeout is 100 seconds
config := model.Config{
	// ... other config ...
	Timeout: map[string]float64{
		"azure/gpt-4":  10.0, // 10 seconds
		"openai/gpt-4": 5.0,  // 5 seconds
	},
}
Exponential Backoff
Configure custom initial backoff times (in seconds) for each model; the delay doubles on each retry:

// Default backoff is 1 second
config := model.Config{
	// ... other config ...
	Backoff: map[string]float64{
		"azure/gpt-4":  0.5, // Start with 0.5s, then 1s, 2s, etc.
		"openai/gpt-4": 1.0, // Start with 1s, then 2s, 4s, etc.
	},
}
Model-Specific Messages
You can configure system messages that will be prepended to user messages for specific models:

config := model.Config{
	// ... other config ...
	ModelMessages: map[string][]map[string]string{
		"azure/gpt-4": {
			{"role": "system", "content": "You are a helpful assistant."},
		},
		"openai/gpt-4": {
			{"role": "system", "content": "Respond concisely."},
		},
	},
}
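For example, with the configuration above, a request routed to azure/gpt-4 that carries only a user message is effectively sent with the system message prepended (an illustrative sketch, reusing the payload from the basic-usage example; this is not SDK output):

messages := []map[string]string{
	{"role": "system", "content": "You are a helpful assistant."}, // prepended by the SDK
	{"role": "user", "content": "Hello, how are you?"},            // original user message
}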
Status Code Retries
You can configure specific retry behavior for different HTTP status codes, either globally or per model.
Per-model retry behavior:

config := model.Config{
	// ... other config ...
	StatusCodeRetry: map[string]map[string]int{
		"openai/gpt-4": {
			"429": 3, // Retry rate-limit errors 3 times
			"500": 2, // Retry internal server errors 2 times
		},
	},
}

Global retry behavior:

config := model.Config{
	// ... other config ...
	StatusCodeRetry: map[string]int{
		"429": 3, // Retry rate-limit errors 3 times
	},
}
Average Rolling Latency Fallback
Configure a custom rolling-average latency threshold and recovery time for each model:

config := model.Config{
	// ... other config ...
	ModelLatency: map[string]model.RollingAverageLatency{
		"azure/gpt-4": {
			AvgLatencyThreshold: 3.2,             // Fall back to other models if the rolling average latency (seconds) exceeds this
			NoOfCalls:           10,              // Number of recent calls used to compute the average
			RecoveryTime:        3 * time.Second, // Time to wait before trying the model again
		},
	},
}
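With these settings, once the average latency of the last 10 calls to azure/gpt-4 exceeds 3.2 seconds, the model is taken out of rotation in favor of the other models; after the 3-second recovery time it is tried again.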
Model Limits
Configure global limits for model call tracking and recovery:

config := model.Config{
	// ... other config ...
	ModelLimits: model.ModelLimits{
		MaxNoOfCalls:    10000,
		MaxRecoveryTime: time.Hour * 24,
	},
}
Customise Redis
The SDK stores its metrics in Redis (see Redis Data Management below); point it at a custom instance like so:

config := model.Config{
	// ... other config ...
	Redis: model.RedisConfig{
		Addr:     "localhost:6379",
		Password: "password",
		DB:       0,
	},
}
Redis Data Management
The SDK uses Redis to store metrics for model performance tracking, including latency and error rates. To prevent Redis from accumulating excessive data over time, the following data management features are available:
Automatic Cleanup
- When a model exits a recovery period (due to latency or errors), old data is automatically cleaned up
- A periodic background cleanup process can run at configurable intervals
Configure Redis data management through environment variables:
# Redis Data Cleanup Configuration
ENABLE_REDIS_PERIODIC_CLEANUP=true # Enable/disable periodic background cleanup
REDIS_CLEANUP_INTERVAL=6h # How often to run cleanup (accepts Go duration format)
REDIS_DATA_RETENTION=24h # How long to keep data before cleanup
Or add these to your .env file:
# Redis Configuration
REDIS_ADDR=localhost:6379
REDIS_PASSWORD=
REDIS_DB=0
# Redis Data Cleanup Configuration
ENABLE_REDIS_PERIODIC_CLEANUP=true
REDIS_CLEANUP_INTERVAL=6h
REDIS_DATA_RETENTION=24h
The periodic cleanup process:
- Runs in a separate goroutine to avoid impacting application performance
- Identifies all models with data in Redis
- Removes data older than the specified retention period
- Logs cleanup activities for monitoring
Data Retention Policy
By default, the SDK retains 24 hours of data for both latency and error tracking. This allows for:
- Sufficient historical data for performance analysis
- Trend detection for model reliability
- Prevention of Redis memory growth in high-traffic scenarios
Error Rate Fallback
Configure custom error rate thresholds and recovery time for each model, with different thresholds for different status codes:
config := model.Config{
	// ... other config ...
	ModelErrorTracking: model.ModelErrorTracking{
		"openai/gpt-4": &model.RollingErrorTracking{
			StatusConfigs: map[int]*model.StatusCodeConfig{
				401: {
					ErrorThresholdPercentage: 80,              // Fall back if 80% of calls return 401
					NoOfCalls:                5,               // Number of recent calls to consider
					RecoveryTime:             1 * time.Minute, // Time to wait before retrying
				},
				500: {
					ErrorThresholdPercentage: 70, // Fall back if 70% of calls return 500
					NoOfCalls:                5,
					RecoveryTime:             1 * time.Minute,
				},
				502: {
					ErrorThresholdPercentage: 60, // Fall back if 60% of calls return 502
					NoOfCalls:                5,
					RecoveryTime:             1 * time.Minute,
				},
				429: {
					ErrorThresholdPercentage: 40, // Fall back if 40% of calls return 429 (rate limit)
					NoOfCalls:                5,
					RecoveryTime:             30 * time.Second,
				},
			},
		},
	},
}
The error tracking system will:
- Track all HTTP status codes for each model
- Calculate the error percentage for each status code over the last N calls
- Mark the model as unhealthy and fall back to other models if any status code's error percentage exceeds its threshold
- Try the model again once the recovery time has elapsed
This allows fine-grained control over error handling:
- Set different thresholds for different types of errors (e.g., more aggressive fallback for rate limits)
- Configure a different number of calls and recovery time per status code
- Track any HTTP status code you want to monitor
- Set different thresholds for different models based on their reliability
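For example, with the 429 settings above (NoOfCalls: 5, ErrorThresholdPercentage: 40), three rate-limited responses among the last five calls is a 60% error rate, which exceeds the 40% threshold and takes openai/gpt-4 out of rotation for the 30-second recovery time.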
Multi-Region Support
The SDK supports configuring multiple regions for Azure and Vertex AI to improve reliability and reduce latency. This allows you to:
- Fall back to different regions if one region experiences issues
- Load balance requests across multiple regions
- Configure region-specific settings for optimal performance
Model Naming Convention
For region-specific models, use the format provider/model/region:
- Azure: azure/gpt-4o-mini/eastus
- Vertex AI: vertex/gemini-pro/us-central1
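As a quick illustration of the convention (plain Go, not an SDK function), a region-qualified name decomposes as:

parts := strings.SplitN("azure/gpt-4o-mini/eastus", "/", 3)
provider, modelName, region := parts[0], parts[1], parts[2]
// provider = "azure", modelName = "gpt-4o-mini", region = "eastus"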
Azure Multi-Region Configuration
Azure requires explicit configuration of region endpoints in the AzureRegions map:
config := model.Config{
	// ... other config ...
	Models: model.OrderedModels{
		"azure/gpt-4o-mini/eastus",
		"azure/gpt-4o-mini/westeurope",
		"vertex/gemini-pro/us-central1",
	},
	AzureAPIVersion: "2023-05-15", // Required for Azure
	AzureRegions: map[string]string{
		"eastus":     "https://eastus.api.cognitive.microsoft.com",
		"westeurope": "https://westeurope.api.cognitive.microsoft.com",
	},
}
Vertex AI Multi-Region Configuration
Vertex AI uses the region directly in the API endpoint and doesn't require additional configuration:
config := model.Config{
	// ... other config ...
	Models: model.OrderedModels{
		"vertex/gemini-pro/us-central1",
		"vertex/gemini-pro/us-west1",
		"vertex/gemini-pro/europe-west4",
		"azure/gpt-4o-mini/eastus",
	},
	VertexProjectID: "your-project-id", // Required for Vertex AI
	VertexLocation:  "us-central1",     // Default location if not specified in the model name
}
Mixed Provider Configuration with Regions
You can combine providers and regions for comprehensive fallback strategies:
config := model.Config{
	// ... other config ...
	Models: model.OrderedModels{
		"vertex/gemini-pro/us-east4",    // Try Vertex in us-east4 first
		"azure/gpt-4o-mini/eastus",      // Then try Azure in eastus
		"vertex/gemini-pro/us-central1", // Then try Vertex in us-central1
		"azure/gpt-4o-mini/westeurope",  // Then try Azure in westeurope
	},
	AzureAPIVersion: "2023-05-15",
	AzureRegions: map[string]string{
		"eastus":     "https://eastus.api.cognitive.microsoft.com",
		"westeurope": "https://westeurope.api.cognitive.microsoft.com",
	},
	VertexProjectID: "your-project-id",
}
Region-Specific Settings
You can configure different settings for each region-specific model:
config := model.Config{
	// ... other config ...
	MaxRetries: map[string]int{
		"azure/gpt-4o-mini/eastus":      3,
		"azure/gpt-4o-mini/westeurope":  2,
		"vertex/gemini-pro/us-central1": 2,
	},
	Timeout: map[string]float64{
		"azure/gpt-4o-mini/eastus":      10.0,
		"azure/gpt-4o-mini/westeurope":  15.0,
		"vertex/gemini-pro/us-central1": 12.0,
	},
}
Parser
The parser takes a raw API response body and returns it in a structured format:

// Import from https://github.com/Not-Diamond/go-notdiamond/pkg/http/response
result, err := response.Parse(body, startTime)
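A minimal usage sketch, assuming the client and req from the basic-usage example plus the standard fmt, io, log, and time imports; the exact shape of result is whatever the response package defines, so it is printed generically here:

startTime := time.Now()
resp, _ := client.Do(req)
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)

// Parse the raw body into the package's structured result
result, err := response.Parse(body, startTime)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%+v\n", result)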