
Evaluation API

The ArtemisKit SDK provides a programmatic API for running LLM evaluations, red team testing, and stress testing directly from your code.

Install the SDK:
bun add @artemiskit/sdk
# or
npm install @artemiskit/sdk

Then run an evaluation from your code:

import { ArtemisKit } from '@artemiskit/sdk';
const kit = new ArtemisKit({
provider: 'openai',
model: 'gpt-4o',
project: 'my-project',
});
// Run scenario-based evaluation
const results = await kit.run({
scenario: './scenarios/quality.yaml',
});
console.log(`Pass rate: ${results.manifest.metrics.pass_rate * 100}%`);
console.log(`Success: ${results.success}`);

Validate scenario files before execution for CI/CD pre-flight checks:

const validation = await kit.validate({
scenario: './scenarios/**/*.yaml', // File path(s) or glob pattern
strict: false, // Fail on warnings (default: false)
});
if (!validation.valid) {
console.error('Scenarios invalid:');
for (const error of validation.errors) {
console.error(` ${error.file}: ${error.message}`);
}
process.exit(1);
}
console.log(`Validated ${validation.scenarios.length} scenarios`);
interface ValidationResult {
valid: boolean;
scenarios: ScenarioValidation[];
errors: ValidationError[];
warnings: ValidationWarning[];
}
interface ScenarioValidation {
file: string;
name: string;
valid: boolean;
caseCount: number;
errors: ValidationError[];
warnings: ValidationWarning[];
}
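
The per-scenario fields above can also be used to report case counts and surface warnings even when validation succeeds. A minimal sketch against the ValidationResult shape shown (the warning object is logged as-is, since its fields are not specified here):

// Report case counts and any warnings per scenario file
for (const scenario of validation.scenarios) {
  const status = scenario.valid ? 'OK' : 'INVALID';
  console.log(`${status} ${scenario.file} (${scenario.name}): ${scenario.caseCount} cases`);
  for (const warning of scenario.warnings) {
    console.warn('  warning:', warning);
  }
}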

Compare two test runs for regression detection:

const comparison = await kit.compare({
baseline: 'baseline-run-id', // Run ID or 'latest' or baseline name
current: 'current-run-id', // Run ID
threshold: 0.05, // Regression threshold (default: 0.05)
});
if (comparison.regression) {
console.error('Regression detected!');
console.error(`Pass rate delta: ${comparison.delta.passRate}%`);
console.error(`Added cases: ${comparison.addedCases.length}`);
console.error(`Removed cases: ${comparison.removedCases.length}`);
process.exit(1);
}
interface ComparisonResult {
regression: boolean;
delta: {
passRate: number; // Difference in pass rate
avgLatency: number; // Difference in avg latency
totalCases: number; // Difference in case count
};
baseline: RunSummary;
current: RunSummary;
addedCases: string[]; // Cases in current but not baseline
removedCases: string[]; // Cases in baseline but not current
changedCases: CaseComparison[];
}
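
Even when no regression is flagged, the delta fields can be logged for visibility. A minimal sketch using the ComparisonResult shape above:

// Summarize the comparison regardless of the regression flag
console.log(`Pass rate delta: ${comparison.delta.passRate}`);
console.log(`Avg latency delta: ${comparison.delta.avgLatency} ms`);
console.log(`Case count delta: ${comparison.delta.totalCases}`);
console.log(`Changed cases: ${comparison.changedCases.length}`);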

The constructor accepts project, provider, and execution defaults:

import { ArtemisKit } from '@artemiskit/sdk';
const kit = new ArtemisKit({
// Project identifier (for organizing results)
project: 'my-llm-app',
// Default provider and model
provider: 'openai', // 'openai' | 'anthropic' | 'azure-openai' | 'vercel-ai'
model: 'gpt-4o',
// Provider configuration
providerConfig: {
apiKey: process.env.OPENAI_API_KEY,
timeout: 60000,
},
// Redaction settings
redaction: {
enabled: true,
patterns: ['email', 'phone', 'ssn', 'api_key'],
},
// Default timeout and retries
timeout: 30000,
retries: 2,
concurrency: 5,
});
const results = await kit.run({
scenario: './scenarios/quality.yaml',
});
console.log('Success:', results.success);
console.log('Pass rate:', results.manifest.metrics.pass_rate);
console.log('Cases:', results.cases);
// Run only cases with specific tags
const results = await kit.run({
scenario: './scenarios/quality.yaml',
tags: ['critical', 'smoke'],
});
// Check tag-specific results
const criticalCases = results.cases.filter(c => c.tags.includes('critical'));
console.log(`Critical cases: ${criticalCases.filter(c => c.ok).length}/${criticalCases.length}`);
// Override provider/model for this run
const results = await kit.run({
scenario: './scenarios/quality.yaml',
provider: 'anthropic',
model: 'claude-3-opus-20240229',
providerConfig: {
apiKey: process.env.ANTHROPIC_API_KEY,
},
});
const results = await kit.run({
scenario: './scenarios/large-suite.yaml',
concurrency: 10, // Run 10 cases in parallel
});

To reuse an existing provider client, create an adapter and pass it to the run:

import { createAdapter } from '@artemiskit/core';
// Create your own client
const client = await createAdapter({
provider: 'openai',
apiKey: process.env.OPENAI_API_KEY,
});
// Use it with ArtemisKit
const results = await kit.run({
scenario: './scenarios/quality.yaml',
client, // Pass existing client
});
// Clean up
await client.close();

Run adversarial security testing against your LLM:

const results = await kit.redteam({
scenario: './scenarios/quality.yaml',
mutations: ['typo', 'role-spoof', 'encoding', 'instruction-flip'],
countPerCase: 5, // 5 mutations per test case
});
console.log('Defense rate:', results.defenseRate);
console.log('Unsafe responses:', results.unsafeCount);
console.log('Success:', results.success); // true if defense rate >= 95%

| Mutation | Description |
| --- | --- |
| typo | Introduce typos to bypass filters |
| role-spoof | Attempt role/persona hijacking |
| instruction-flip | Reverse or contradict instructions |
| cot-injection | Chain-of-thought injection attacks |
| encoding | Use encoding (base64, hex, unicode) to obfuscate |
| multi-turn | Multi-turn conversation attacks |
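
The example above covers four of these; the remaining mutation types can be requested the same way (mutation names taken from the table):

// Target chain-of-thought and multi-turn attacks as well
const advancedResults = await kit.redteam({
  scenario: './scenarios/quality.yaml',
  mutations: ['cot-injection', 'multi-turn'],
  countPerCase: 5,
});
console.log('Defense rate:', advancedResults.defenseRate);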

Check the severity breakdown from the run manifest:

const results = await kit.redteam({
scenario: './scenarios/quality.yaml',
});
// Check severity breakdown
const { by_severity } = results.manifest.metrics;
console.log('Critical:', by_severity.critical);
console.log('High:', by_severity.high);
console.log('Medium:', by_severity.medium);
console.log('Low:', by_severity.low);
// Custom thresholds
const hasCritical = by_severity.critical > 0;
const hasHigh = by_severity.high > 0;
if (hasCritical) {
console.error('CRITICAL vulnerabilities found!');
process.exit(1);
}

Test your LLM application under load:

const results = await kit.stress({
scenario: './scenarios/performance.yaml',
concurrency: 10, // 10 concurrent workers
duration: 60, // Run for 60 seconds
rampUp: 10, // 10 second ramp-up period
});
console.log('RPS:', results.rps);
console.log('P95 Latency:', results.p95LatencyMs, 'ms');
console.log('Success rate:', results.successRate);

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| concurrency | number | 10 | Number of concurrent workers |
| duration | number | 30 | Test duration in seconds |
| rampUp | number | 5 | Ramp-up period in seconds |
| maxRequests | number | - | Maximum requests (optional cap) |
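
maxRequests is not used in the example above; it caps the total number of requests regardless of duration. A bounded run might look like this (the cap value is illustrative):

// Stop after 500 requests even if the duration has not elapsed
const boundedResults = await kit.stress({
  scenario: './scenarios/performance.yaml',
  concurrency: 10,
  duration: 60,
  maxRequests: 500,
});
console.log('Total requests:', boundedResults.manifest.metrics.total_requests);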

The manifest exposes detailed request, latency, and token metrics:

const results = await kit.stress({
scenario: './scenarios/performance.yaml',
concurrency: 10,
duration: 60,
});
const { metrics } = results.manifest;
console.log('Total requests:', metrics.total_requests);
console.log('Successful:', metrics.successful_requests);
console.log('Failed:', metrics.failed_requests);
console.log('Success rate:', metrics.success_rate);
console.log('RPS:', metrics.requests_per_second);
// Latency metrics
console.log('Min latency:', metrics.min_latency_ms, 'ms');
console.log('Max latency:', metrics.max_latency_ms, 'ms');
console.log('Avg latency:', metrics.avg_latency_ms, 'ms');
console.log('P50 latency:', metrics.p50_latency_ms, 'ms');
console.log('P90 latency:', metrics.p90_latency_ms, 'ms');
console.log('P95 latency:', metrics.p95_latency_ms, 'ms');
console.log('P99 latency:', metrics.p99_latency_ms, 'ms');
// Token metrics (if available)
if (metrics.tokens) {
console.log('Total tokens:', metrics.tokens.total_tokens);
console.log('Avg per request:', metrics.tokens.avg_tokens_per_request);
}

Subscribe to events for real-time progress updates:

const kit = new ArtemisKit({ project: 'my-project' });
// Progress updates
kit.on('progress', (event) => {
console.log(`[${event.phase}] ${event.message} (${event.progress}%)`);
});
// Case completion
kit.on('caseComplete', (event) => {
const { result, index, total } = event;
console.log(`Case ${index + 1}/${total}: ${result.ok ? 'PASS' : 'FAIL'}`);
});
// Red team mutation completion
kit.on('redteamMutationComplete', (event) => {
console.log(`Mutation: ${event.mutation}, Status: ${event.status}`);
});
// Stress test request completion
kit.on('stressRequestComplete', (event) => {
console.log(`Request ${event.index}: ${event.result.latencyMs}ms, RPS: ${event.currentRPS}`);
});
// Run with events
const results = await kit.run({
scenario: './scenarios/quality.yaml',
});

Event handlers can also be attached through chainable helper methods:

kit
.onCaseStart((event) => {
console.log('Starting case:', event.caseId);
})
.onCaseComplete((event) => {
console.log('Completed case:', event.result.id);
})
.onProgress((event) => {
console.log('Progress:', event.message);
})
.onRedTeamMutationStart((event) => {
console.log('Starting mutation:', event.mutation);
})
.onRedTeamMutationComplete((event) => {
console.log('Mutation result:', event.status);
})
.onStressRequestComplete((event) => {
console.log('Request latency:', event.result.latencyMs);
});
// Listen for first case completion only
kit.once('caseComplete', (event) => {
console.log('First case completed:', event.result.id);
});

The run, red team, and stress methods return the following result shapes:

interface RunResult {
success: boolean;
manifest: {
version: string;
type: 'evaluation';
run_id: string;
project: string;
start_time: string;
end_time: string;
duration_ms: number;
config: { ... };
metrics: {
total_cases: number;
passed_cases: number;
failed_cases: number;
skipped_cases: number;
pass_rate: number;
latency: {
min_ms: number;
max_ms: number;
avg_ms: number;
p50_ms: number;
p95_ms: number;
p99_ms: number;
};
};
git?: { ... };
provenance?: { ... };
};
cases: CaseResult[];
}
interface RedTeamResult {
success: boolean;
defenseRate: number;
unsafeCount: number;
manifest: {
metrics: {
total_tests: number;
safe_responses: number;
blocked_responses: number;
unsafe_responses: number;
error_responses: number;
defended: number;
defense_rate: number;
by_severity: {
low: number;
medium: number;
high: number;
critical: number;
};
};
results: RedTeamCaseResult[];
};
}
interface StressResult {
success: boolean;
successRate: number;
rps: number;
p95LatencyMs: number;
manifest: {
metrics: {
total_requests: number;
successful_requests: number;
failed_requests: number;
success_rate: number;
requests_per_second: number;
min_latency_ms: number;
max_latency_ms: number;
avg_latency_ms: number;
p50_latency_ms: number;
p90_latency_ms: number;
p95_latency_ms: number;
p99_latency_ms: number;
tokens?: { ... };
};
};
}
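
The evaluation latency block in RunResult is not shown in the earlier examples; it can be read straight from the manifest (a sketch against the shape above):

// Aggregate latency percentiles from an evaluation run
const { latency } = results.manifest.metrics;
console.log(`Avg: ${latency.avg_ms} ms, P95: ${latency.p95_ms} ms, P99: ${latency.p99_ms} ms`);
console.log(`Duration: ${results.manifest.duration_ms} ms across ${results.manifest.metrics.total_cases} cases`);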

You can pass scenarios directly without YAML files:

const results = await kit.run({
scenario: {
name: 'inline-test',
description: 'Test directly in code',
cases: [
{
id: 'math-test',
prompt: 'What is 2 + 2?',
expected: {
type: 'contains',
values: ['4'],
mode: 'any',
},
},
{
id: 'greeting-test',
prompt: 'Say hello',
expected: {
type: 'llm_grader',
criteria: 'Response is a friendly greeting',
minScore: 0.8,
},
},
],
},
});

Handle failed cases and SDK errors explicitly:

try {
const results = await kit.run({
scenario: './scenarios/quality.yaml',
});
if (!results.success) {
const failedCases = results.cases.filter(c => !c.ok);
for (const failed of failedCases) {
console.error(`Failed: ${failed.name} - ${failed.reason}`);
}
process.exit(1);
}
} catch (error) {
if (error.code === 'SCENARIO_NOT_FOUND') {
console.error('Scenario file not found');
} else if (error.code === 'PROVIDER_ERROR') {
console.error('Provider error:', error.message);
} else {
console.error('Unexpected error:', error);
}
process.exit(1);
}

Best practices (a combined sketch follows the list):

  1. Use project names — Set meaningful project names for result organization
  2. Configure timeouts — Set appropriate timeouts for your use case
  3. Enable redaction — Redact PII in results to avoid logging sensitive data
  4. Use tags — Tag test cases for selective execution
  5. Handle events — Subscribe to events for progress monitoring
  6. Clean up clients — Close clients when done if creating your own
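
A minimal sketch combining several of these practices; the project name, tag, and values are illustrative, and the option names match the configuration shown earlier:

import { ArtemisKit } from '@artemiskit/sdk';
import { createAdapter } from '@artemiskit/core';

const kit = new ArtemisKit({
  project: 'checkout-assistant',                                // 1. meaningful project name
  provider: 'openai',
  model: 'gpt-4o',
  timeout: 45000,                                               // 2. timeout suited to the workload
  redaction: { enabled: true, patterns: ['email', 'api_key'] }, // 3. redact PII in results
});

kit.on('progress', (event) => console.log(event.message));      // 5. subscribe to progress events

const client = await createAdapter({
  provider: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
});

const results = await kit.run({
  scenario: './scenarios/quality.yaml',
  tags: ['smoke'],                                              // 4. selective execution via tags
  client,
});

await client.close();                                           // 6. clean up self-created clients
console.log('Pass rate:', results.manifest.metrics.pass_rate);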