# Benchmarks
Evaluate AI models on real-world business tasks to select the best models for your specific use cases.
## Overview
Benchmarks provides a framework for evaluating and comparing AI models based on real-world business performance. It enables you to:
- Assess model performance on enterprise-specific tasks
- Compare models across accuracy, cost, latency, and throughput
- Create standardized evaluation suites for business functions
- Maintain leaderboards for different business capabilities
## Features
- **Business Function Evals**: Specialized test suites for Sales, Marketing, Support, Coding, and other business functions
- **Multi-dimensional Scoring**: Evaluate models on accuracy, cost, latency, and throughput (see the sketch after this list)
- **Custom Test Suites**: Create evaluations tailored to your business
- **Automated Testing**: Run benchmarks automatically when new models are released
- **Comparative Analysis**: Track model improvements over time
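
Multi-dimensional scoring can be thought of as a weighted sum over category and metric scores. The snippet below is a minimal illustrative sketch of that idea in plain TypeScript; the `MetricScore` and `CategoryScore` shapes and the `compositeScore` helper are assumptions made for illustration and are not part of the benchmarks.do API.

```typescript
// Illustrative only: these types and the helper are not part of the
// benchmarks.do API; they sketch how weighted, multi-dimensional scoring works.
interface MetricScore {
  name: string // e.g. 'accuracy', 'latency', 'cost'
  weight: number // metric weight; weights sum to 1 across metrics
  score: number // normalized 0..1 score for this metric
}

interface CategoryScore {
  name: string // e.g. 'inquiry_classification'
  weight: number // category weight; weights sum to 1 across categories
  metrics: MetricScore[]
}

// Composite score = sum over categories of (category weight x weighted metric average)
function compositeScore(categories: CategoryScore[]): number {
  return categories.reduce((total, category) => {
    const categoryScore = category.metrics.reduce(
      (sum, metric) => sum + metric.weight * metric.score,
      0,
    )
    return total + category.weight * categoryScore
  }, 0)
}
```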
## Usage
```typescript
import { defineBenchmark, runBenchmark } from 'benchmarks.do'

// Define a customer support benchmark
const customerSupportBenchmark = defineBenchmark({
  name: 'customer_support_quality',
  description: 'Evaluates AI models on customer support tasks',

  // Define test categories
  categories: [
    {
      name: 'inquiry_classification',
      description: 'Classify customer inquiries by type',
      weight: 0.2,
    },
    {
      name: 'response_generation',
      description: 'Generate helpful responses to inquiries',
      weight: 0.5,
    },
    {
      name: 'escalation_detection',
      description: 'Identify when to escalate to human agents',
      weight: 0.3,
    },
  ],

  // Define evaluation metrics
  metrics: [
    { name: 'accuracy', weight: 0.7 },
    { name: 'latency', weight: 0.15 },
    { name: 'cost', weight: 0.15 },
  ],
})

// Run the benchmark against multiple models
const results = await runBenchmark({
  benchmark: customerSupportBenchmark,
  models: ['openai/gpt-4.5', 'anthropic/claude-3-opus', 'google/gemini-pro'],
})

// Get the best model for your use case
const bestModel = results.getBestModel({
  accuracyWeight: 0.8,
  costWeight: 0.1,
  latencyWeight: 0.1,
})
```
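
Because `getBestModel` accepts per-dimension weights, a single benchmark run can be re-weighted for different use cases. The sketch below assumes the `results` object from the run above and that `getBestModel` can be called multiple times; the weight profiles are illustrative, not prescribed values.

```typescript
// Illustrative weight profiles for different use cases (not defined by benchmarks.do).
const profiles = {
  qualityFirst: { accuracyWeight: 0.8, costWeight: 0.1, latencyWeight: 0.1 },
  costSensitive: { accuracyWeight: 0.5, costWeight: 0.4, latencyWeight: 0.1 },
  realTime: { accuracyWeight: 0.5, costWeight: 0.1, latencyWeight: 0.4 },
}

// Pick the best model for each profile from the same benchmark results
for (const [useCase, weights] of Object.entries(profiles)) {
  const model = results.getBestModel(weights)
  console.log(useCase, model)
}
```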