
Scenarios

Scenarios are the core unit of testing in ArtemisKit. A scenario is a collection of test cases that define prompts to send to an LLM and expectations for evaluating the responses.

Every scenario contains:

- Metadata — Name, description, version, tags
- Provider configuration — Which LLM to use and how to configure it
- Setup — System prompts and function definitions
- Test cases — Individual prompts with expected outcomes
- Variables — Template values for dynamic content

For example:

```yaml
name: customer-support-bot
description: Evaluate customer support responses
version: "1.0"
provider: openai
model: gpt-4
setup:
  systemPrompt: "You are a helpful customer support agent."
variables:
  company: "TechCorp"
cases:
  - id: greeting-test
    prompt: "Hello!"
    expected:
      type: contains
      values: ["hello", "hi", "welcome"]
      mode: any
```

Scenarios can specify which LLM provider to use at multiple levels:

| Level | Scope | Example |
| --- | --- | --- |
| Scenario | All cases in the scenario | `provider: openai` at the root |
| Case | A single test case | `provider: anthropic` in the case |

Supported providers:

| Provider | Value | Description |
| --- | --- | --- |
| OpenAI | `openai` | GPT models via the OpenAI API |
| Azure OpenAI | `azure-openai` | GPT models via Azure |
| Anthropic | `anthropic` | Claude models |
| Vercel AI | `vercel-ai` | Vercel AI SDK |
| Google | `google` | Gemini models |
| Mistral | `mistral` | Mistral AI models |
| Cohere | `cohere` | Cohere models |
| Ollama | `ollama` | Local models via Ollama |
| LangChain | `langchain` | LangChain runnables |
| DeepAgents | `deepagents` | DeepAgents systems |
| Custom | `custom` | Custom adapters |

Use providerConfig for provider-specific settings:

```yaml
provider: azure-openai
model: gpt-4
providerConfig:
  resourceName: my-azure-resource
  deploymentName: my-deployment
  apiVersion: 2024-02-15-preview
```

Each test case requires:

  • id — Unique identifier within the scenario
  • prompt — The input to send to the LLM
  • expected — How to evaluate the response

Simple string prompts for single-turn interactions:

```yaml
cases:
  - id: math-test
    prompt: "What is 2 + 2?"
    expected:
      type: contains
      values: ["4"]
```

Use an array of messages for multi-turn conversation testing:

```yaml
cases:
  - id: context-test
    prompt:
      - role: user
        content: "My name is Alice"
      - role: assistant
        content: "Nice to meet you, Alice!"
      - role: user
        content: "What is my name?"
    expected:
      type: contains
      values: ["Alice"]
```

Override scenario defaults at the case level:

```yaml
cases:
  - id: test-with-claude
    prompt: "Hello"
    provider: anthropic
    model: claude-3-5-sonnet-20241022
    timeout: 60000
    retries: 2
    expected:
      type: contains
      values: ["hello"]
```

Variables enable template substitution using {{variable}} syntax:

```yaml
variables:
  product: "ArtemisKit"
  version: "1.0"
cases:
  - id: product-test
    prompt: "Tell me about {{product}} version {{version}}"
    expected:
      type: contains
      values: ["{{product}}"]
    variables:
      version: "2.0" # Case-level override
```

Variables support strings, numbers, and booleans:

```yaml
variables:
  name: "Alice" # string
  count: 42 # number
  enabled: true # boolean
```
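The merge-and-substitute behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not ArtemisKit's implementation; in particular, leaving unknown placeholders intact is an assumption.

```python
import re

def substitute(template, scenario_vars, case_vars=None):
    """Replace {{name}} placeholders, letting case-level values
    override scenario-level ones."""
    merged = {**scenario_vars, **(case_vars or {})}
    return re.sub(
        r"\{\{(\w+)\}\}",
        # Unknown placeholders are left as-is (assumed behavior).
        lambda m: str(merged.get(m.group(1), m.group(0))),
        template,
    )

print(substitute(
    "Tell me about {{product}} version {{version}}",
    {"product": "ArtemisKit", "version": "1.0"},
    {"version": "2.0"},
))
# Tell me about ArtemisKit version 2.0
```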

Tags allow filtering and organizing test cases:

```yaml
# Scenario-level tags (apply to all cases)
name: my-scenario
tags: [production, critical]
cases:
  # Case-level tags (merged with scenario tags)
  - id: quick-test
    tags: [smoke, quick]
    prompt: "Hello"
    expected:
      type: contains
      values: ["hello"]
```

Run only the cases matching specific tags:

```sh
akit run scenarios/ --tags smoke
akit run scenarios/ --tags smoke,critical
```
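The tag semantics above can be modeled in a short Python sketch. It assumes "any" matching for comma-separated `--tags` values (a case runs if it carries at least one requested tag); that assumption, and the function names, are illustrative rather than ArtemisKit's actual code.

```python
def effective_tags(scenario_tags, case_tags):
    # Case-level tags are merged with scenario-level tags.
    return set(scenario_tags) | set(case_tags)

def matches(requested, scenario_tags, case_tags):
    """Assumed 'any' semantics: the case runs if it carries
    at least one of the requested tags."""
    return bool(set(requested) & effective_tags(scenario_tags, case_tags))

# akit run scenarios/ --tags smoke,critical
print(matches(["smoke", "critical"], ["production", "critical"], ["quick"]))  # True
print(matches(["smoke"], ["production"], []))  # False
```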

Configure PII/sensitive data redaction:

```yaml
redaction:
  enabled: true
  patterns:
    - email
    - phone
    - credit_card
    - ssn
    - api_key
  redactPrompts: true
  redactResponses: true
  replacement: "[REDACTED]"
```

Available patterns:

| Pattern | Description |
| --- | --- |
| `email` | Email addresses |
| `phone` | Phone numbers |
| `credit_card` | Credit card numbers |
| `ssn` | Social Security Numbers |
| `api_key` | API keys and tokens |
| `ipv4` | IPv4 addresses |
| `jwt` | JWT tokens |
| `aws_key` | AWS access keys |
| `secrets` | Generic secrets |
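To illustrate what pattern-based redaction does, here is a minimal Python sketch. The regexes are deliberately simplified stand-ins for two of the patterns; the patterns ArtemisKit actually ships are not shown here.

```python
import re

# Illustrative regexes only; real PII patterns are far more thorough.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text, patterns, replacement="[REDACTED]"):
    # Apply each requested pattern in turn, substituting matches.
    for name in patterns:
        text = PATTERNS[name].sub(replacement, text)
    return text

print(redact("Contact alice@example.com or +1 555-123-4567",
             ["email", "phone"]))
# Contact [REDACTED] or [REDACTED]
```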

Configure scenario-level setup and cleanup:

```yaml
setup:
  systemPrompt: "You are a helpful assistant."
  functions: [] # Function definitions for function calling
teardown:
  cleanup: true # Clean up resources after run
```

A complete scenario combining these features:

```yaml
name: customer-support-evaluation
description: Comprehensive test suite for support bot
version: "1.0"
provider: openai
model: gpt-4
seed: 42
temperature: 0.7
tags: [production, support]
variables:
  company: "TechCorp"
  product: "Widget Pro"
setup:
  systemPrompt: |
    You are a customer support agent for {{company}}.
    Be helpful, professional, and accurate.
redaction:
  enabled: true
  patterns: [email, phone]
cases:
  - id: greeting
    name: Greeting Test
    description: Test basic greeting response
    tags: [smoke]
    prompt: "Hello!"
    expected:
      type: contains
      values: ["hello", "hi", "welcome"]
      mode: any
  - id: product-info
    name: Product Information
    tags: [product]
    prompt: "Tell me about {{product}}"
    expected:
      type: llm_grader
      rubric: "Response accurately describes the product"
      threshold: 0.7
  - id: multi-turn
    name: Context Retention
    tags: [advanced]
    prompt:
      - role: user
        content: "I bought {{product}} last week"
      - role: assistant
        content: "Great choice! How can I help with your {{product}}?"
      - role: user
        content: "What did I buy?"
    expected:
      type: contains
      values: ["{{product}}"]
```