
Scenario Builders

The ArtemisKit SDK provides a type-safe fluent API for building evaluation scenarios programmatically, without writing YAML files.

Builders are included in the main SDK package:

bun add @artemiskit/sdk

Import from the builders subpath:

import { scenario, testCase, contains, exact, regex } from '@artemiskit/sdk/builders';
import { ArtemisKit } from '@artemiskit/sdk';
// Build a scenario programmatically
const myScenario = scenario('api-response-tests')
.description('Test API response quality')
.provider('openai')
.model('gpt-4o')
.cases([
testCase('greeting')
.prompt('Say hello to the user')
.expect(contains(['hello', 'hi', 'hey'])),
testCase('math-calculation')
.prompt('What is 15 + 27?')
.expect(exact('42')),
testCase('code-output')
.prompt('Write a function that returns true')
.expect(regex(/return\s+true/)),
])
.build();
// Run the scenario
const kit = new ArtemisKit({ project: 'my-project' });
const results = await kit.run({ scenario: myScenario });

The scenario() function creates a new scenario builder:

import { scenario } from '@artemiskit/sdk/builders';
const myScenario = scenario('scenario-name')
.description('What this scenario tests')
.provider('openai') // 'openai' | 'anthropic' | 'azure-openai' | etc.
.model('gpt-4o')
.timeout(30000) // Timeout per case in ms
.retries(2) // Retry failed cases
.tags(['smoke', 'critical']) // Tags for filtering
.variables({ // Variables for interpolation
userName: 'Alice',
topic: 'TypeScript',
})
.cases([...])
.build();
| Method | Description |
| --- | --- |
| description(text) | Set the scenario description |
| provider(name) | Set the LLM provider |
| model(name) | Set the model name |
| timeout(ms) | Set the per-case timeout |
| retries(count) | Set the retry count |
| tags(tags[]) | Add tags for filtering |
| variables(obj) | Set variables for interpolation |
| cases(cases[]) | Add test cases |
| build() | Build the final scenario object |
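The variables() map is substituted into prompts at run time. A minimal sketch of how such interpolation typically works — the {{name}} placeholder syntax and the interpolate function are illustrative assumptions, not confirmed SDK internals:

```typescript
// Hypothetical sketch: substitute {{key}} placeholders with values from
// the variables map, leaving unknown placeholders untouched.
function interpolate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in variables ? variables[key] : match
  );
}

const prompt = interpolate('Explain {{topic}} to {{userName}}', {
  userName: 'Alice',
  topic: 'TypeScript',
});
// prompt === 'Explain TypeScript to Alice'
```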

The testCase() function creates individual test cases:

import { testCase, contains } from '@artemiskit/sdk/builders';
const tc = testCase('case-id')
.prompt('Your prompt here')
.systemPrompt('Optional system prompt')
.expect(contains(['expected', 'values']))
.tags(['smoke'])
.timeout(10000)
.build();
| Method | Description |
| --- | --- |
| prompt(text) | Set the user prompt |
| systemPrompt(text) | Set the system prompt |
| messages(msgs[]) | Set the full message array |
| expect(expectation) | Set the expected output |
| tags(tags[]) | Add tags |
| timeout(ms) | Override the timeout |
| build() | Build the test case object |
For multi-turn conversations, use messages() instead of prompt():

const tc = testCase('conversation')
.messages([
{ role: 'system', content: 'You are a helpful assistant' },
{ role: 'user', content: 'Hello!' },
{ role: 'assistant', content: 'Hi there! How can I help?' },
{ role: 'user', content: 'What is 2+2?' },
])
.expect(contains(['4']))
.build();

contains(values, options?)

Check whether the response contains the specified values:

import { contains } from '@artemiskit/sdk/builders';
// Any of the values (default)
contains(['hello', 'hi', 'hey'])
// All values required
contains(['hello', 'world'], { mode: 'all' })
// Case insensitive
contains(['HELLO'], { caseInsensitive: true })
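The matching semantics of those options can be sketched as follows — containsCheck is a hypothetical stand-in for the SDK's internal logic, with option names mirroring the builder API:

```typescript
// Illustrative sketch of contains() semantics (not the SDK's actual code).
interface ContainsOptions {
  mode?: 'any' | 'all';       // 'any' is the default
  caseInsensitive?: boolean;
}

function containsCheck(response: string, values: string[], options: ContainsOptions = {}): boolean {
  const haystack = options.caseInsensitive ? response.toLowerCase() : response;
  const needles = options.caseInsensitive ? values.map((v) => v.toLowerCase()) : values;
  return options.mode === 'all'
    ? needles.every((v) => haystack.includes(v))  // every value must appear
    : needles.some((v) => haystack.includes(v));  // any one value suffices
}

containsCheck('Hello there', ['hello'], { caseInsensitive: true }); // true
containsCheck('hello world', ['hello', 'world'], { mode: 'all' });  // true
```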

notContains(values)

Check that the response does NOT contain the given values:

import { notContains } from '@artemiskit/sdk/builders';
notContains(['error', 'failed', 'exception'])

exact(value)

Exact string match:

import { exact } from '@artemiskit/sdk/builders';
exact('42')
exact('Hello, World!')

regex(pattern)

Regular expression match (accepts a RegExp or a string pattern):

import { regex } from '@artemiskit/sdk/builders';
regex(/\d{4}-\d{2}-\d{2}/) // Date pattern
regex(/^(yes|no)$/i) // Yes/No with flags
regex('\\d+') // String pattern
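Accepting both forms usually comes down to normalizing string patterns into RegExp objects before testing; a minimal sketch, with regexCheck as an assumed name:

```typescript
// Hypothetical sketch: normalize a string pattern to a RegExp, then test.
function regexCheck(response: string, pattern: RegExp | string): boolean {
  const re = typeof pattern === 'string' ? new RegExp(pattern) : pattern;
  return re.test(response);
}

regexCheck('2024-01-15', /\d{4}-\d{2}-\d{2}/); // true
regexCheck('Order 42', '\\d+');                // true
```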

fuzzy(value, options?)

Fuzzy string matching using Levenshtein distance:

import { fuzzy } from '@artemiskit/sdk/builders';
fuzzy('Hello World', { threshold: 0.8 })
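A threshold-based fuzzy match can be sketched by normalizing the Levenshtein distance into a 0–1 similarity score; the normalization choice below is an assumption, and the SDK may score differently:

```typescript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

// Hypothetical scoring: 1 - distance / longer length, compared to the threshold.
function fuzzyCheck(response: string, expected: string, threshold: number): boolean {
  const distance = levenshtein(response, expected);
  const score = 1 - distance / Math.max(response.length, expected.length, 1);
  return score >= threshold;
}

fuzzyCheck('Helo World', 'Hello World', 0.8); // true (one edit away)
```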

similarity(text, options?)

Semantic similarity matching:

import { similarity } from '@artemiskit/sdk/builders';
// Embedding-based (default)
similarity('A friendly greeting', { threshold: 0.85 })
// LLM-based semantic comparison
similarity('A helpful response', { mode: 'llm', threshold: 0.9 })
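In embedding mode, the comparison typically embeds both texts and scores them with cosine similarity against the threshold. The cosine part can be shown concretely; the embedding call itself is a stand-in for a real model:

```typescript
// Cosine similarity between two equal-length vectors (e.g. text embeddings).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// With real embeddings, similarity('A friendly greeting', { threshold: 0.85 })
// would presumably pass when cosineSimilarity(embed(response), embed(reference)) >= 0.85.
cosineSimilarity([1, 0, 1], [1, 0, 1]); // 1 (identical direction)
```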

llmGrade(rubric, options?)

LLM-as-judge grading against a natural-language rubric:

import { llmGrade } from '@artemiskit/sdk/builders';
llmGrade('Response should be helpful, accurate, and concise', {
threshold: 0.8,
})
llmGrade('Is this a valid JSON response?', {
threshold: 0.9,
model: 'gpt-4o', // Override grader model
})

jsonSchema(schema)

Validate JSON output against a JSON Schema:

import { jsonSchema } from '@artemiskit/sdk/builders';
jsonSchema({
type: 'object',
required: ['name', 'age'],
properties: {
name: { type: 'string' },
age: { type: 'number', minimum: 0 },
email: { type: 'string', format: 'email' },
},
})
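To make the validation step concrete, here is a toy validator covering only the keywords used above (type, required, properties, minimum); it is illustrative only — a real implementation would use a full JSON Schema library such as Ajv, and this sketch omits keywords like format:

```typescript
// Toy JSON Schema subset for illustration (not the SDK's validator).
interface MiniSchema {
  type?: string;
  required?: string[];
  properties?: Record<string, MiniSchema>;
  minimum?: number;
}

function validate(value: unknown, schema: MiniSchema): boolean {
  if (schema.type === 'object') {
    if (typeof value !== 'object' || value === null || Array.isArray(value)) return false;
    const obj = value as Record<string, unknown>;
    // Every required key must be present.
    if (schema.required?.some((key) => !(key in obj))) return false;
    // Each present property must satisfy its sub-schema.
    for (const [key, sub] of Object.entries(schema.properties ?? {})) {
      if (key in obj && !validate(obj[key], sub)) return false;
    }
    return true;
  }
  if (schema.type === 'string') return typeof value === 'string';
  if (schema.type === 'number') {
    return typeof value === 'number' && (schema.minimum === undefined || value >= schema.minimum);
  }
  return true; // unconstrained
}

validate({ name: 'Ada', age: 36 }, {
  type: 'object',
  required: ['name', 'age'],
  properties: { name: { type: 'string' }, age: { type: 'number', minimum: 0 } },
}); // true
```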

allOf(...expectations) / anyOf(...expectations)


Combine multiple expectations:

import { allOf, anyOf, contains, exact, regex } from '@artemiskit/sdk/builders';
// All must pass
allOf(
contains(['hello']),
regex(/\d+/),
)
// At least one must pass
anyOf(
exact('yes'),
exact('no'),
contains(['maybe']),
)
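The combinator semantics can be sketched by modeling an expectation as a predicate over the response text — the predicate representation is an assumption, and the SDK's internal expectation type certainly differs:

```typescript
// Illustrative model: an expectation is a boolean predicate on the response.
type Expectation = (response: string) => boolean;

const allOf = (...expectations: Expectation[]): Expectation =>
  (response) => expectations.every((e) => e(response)); // all must pass

const anyOf = (...expectations: Expectation[]): Expectation =>
  (response) => expectations.some((e) => e(response));  // at least one must pass

const check = allOf(
  (r) => r.includes('hello'),
  (r) => /\d+/.test(r),
);
check('hello 123'); // true
check('hello');     // false
```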
Putting it all together, a complete example:

import {
scenario,
testCase,
contains,
exact,
regex,
jsonSchema,
llmGrade,
allOf,
} from '@artemiskit/sdk/builders';
import { ArtemisKit } from '@artemiskit/sdk';
const apiTestScenario = scenario('api-quality-tests')
.description('Comprehensive API response quality tests')
.provider('openai')
.model('gpt-4o')
.timeout(30000)
.tags(['api', 'quality'])
.variables({
apiVersion: 'v2',
})
.cases([
testCase('greeting-response')
.prompt('Greet the user warmly')
.expect(contains(['hello', 'hi', 'welcome']))
.tags(['smoke']),
testCase('json-output')
.prompt('Return a JSON object with name and age')
.expect(jsonSchema({
type: 'object',
required: ['name', 'age'],
properties: {
name: { type: 'string' },
age: { type: 'number' },
},
})),
testCase('code-generation')
.systemPrompt('You are a helpful coding assistant')
.prompt('Write a TypeScript function that adds two numbers')
.expect(allOf(
contains(['function']),
regex(/:\s*number/), // Return type
)),
testCase('helpful-response')
.prompt('Explain what an API is to a beginner')
.expect(llmGrade(
'Response should be clear, accurate, and appropriate for beginners',
{ threshold: 0.85 }
)),
])
.build();
// Run the scenario
const kit = new ArtemisKit({
provider: 'openai',
model: 'gpt-4o',
project: 'api-tests',
});
const results = await kit.run({ scenario: apiTestScenario });
console.log(`Pass rate: ${results.manifest.metrics.pass_rate * 100}%`);

All builders are fully typed. TypeScript will catch errors at compile time:

import { testCase, contains, llmGrade } from '@artemiskit/sdk/builders';
// TypeScript error: 'invald' is not a valid mode
contains(['hello'], { mode: 'invald' });
// TypeScript error: threshold must be a number
llmGrade('rubric', { threshold: 'high' });
// TypeScript error: missing required 'prompt' or 'messages'
testCase('test').expect(contains(['x'])).build();

For maximum type safety, use the contract types:

import type {
ScenarioContract,
TestCaseContract,
ExpectationContract,
} from '@artemiskit/sdk/contracts';
const myScenario: ScenarioContract = {
name: 'typed-scenario',
cases: [...],
};