Expectations

Expectations define how ArtemisKit evaluates LLM responses. Each test case requires exactly one `expected` object with a `type` field.

| Type | Description |
| --- | --- |
| `contains` | Check if the response contains specific strings |
| `not_contains` | Check if the response does NOT contain specific strings |
| `exact` | Check for an exact string match |
| `regex` | Match against a regular expression |
| `fuzzy` | Approximate string matching |
| `llm_grader` | Use an LLM to evaluate the response |
| `json_schema` | Validate the response against a JSON schema |
| `similarity` | Semantic similarity matching using embeddings or an LLM |
| `inline` | Expression-based matchers defined directly in YAML |
| `combined` | Combine multiple expectations with and/or logic |
| `custom` | Use a custom evaluator |

contains

Check if the response contains specific strings. Use `mode` to control matching behavior.

```yaml
expected:
  type: contains
  values:
    - "hello"
    - "welcome"
  mode: any # 'any' = at least one match, 'all' = all must match (default)
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `contains` |
| `values` | array | Yes | Array of strings to look for |
| `mode` | string | No | `all` (default) or `any` |

Match any of the values:

```yaml
expected:
  type: contains
  values: ["hello", "hi", "hey"]
  mode: any
```

Match all values (default behavior):

```yaml
expected:
  type: contains
  values: ["price", "available"]
  mode: all
```

not_contains

Check if the response does NOT contain specific strings. The inverse of contains.

```yaml
expected:
  type: not_contains
  values:
    - "error"
    - "failed"
  mode: any # 'any' = fail if any value is found, 'all' = fail only if all are found
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `not_contains` |
| `values` | array | Yes | Array of strings that should NOT be present |
| `mode` | string | No | `all` (default) or `any` |

Fail if any forbidden term is found:

```yaml
expected:
  type: not_contains
  values: ["password", "secret", "credential"]
  mode: any
```

Ensure response doesn’t contain error indicators:

```yaml
expected:
  type: not_contains
  values: ["error", "exception", "failed"]
  mode: any
```

exact

Check for an exact string match.

```yaml
expected:
  type: exact
  value: "The answer is 42."
  caseSensitive: true # Optional, default: true
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `exact` |
| `value` | string | Yes | The exact string to match |
| `caseSensitive` | boolean | No | Case-sensitive matching (default: true) |

```yaml
expected:
  type: exact
  value: "Hello, World!"
  caseSensitive: false
```

regex

Match the response against a regular expression.

```yaml
expected:
  type: regex
  pattern: "\\d{4}-\\d{2}-\\d{2}" # Date format YYYY-MM-DD
  flags: "i" # Optional: regex flags
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `regex` |
| `pattern` | string | Yes | Regular expression pattern |
| `flags` | string | No | Regex flags (e.g., `i` for case-insensitive) |

Match an email:

```yaml
expected:
  type: regex
  pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
```

Case-insensitive match:

```yaml
expected:
  type: regex
  pattern: "hello.*world"
  flags: "i"
```
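
Note that backslashes are doubled inside YAML double-quoted strings, so "\\d" reaches the matcher as \d. Under the hood the check presumably behaves like a standard JavaScript regular expression test; the following sketch is an assumption for intuition only, not ArtemisKit's documented behavior:

```ts
// Assumption: pattern and flags are interpreted as a standard JavaScript RegExp.
const response = "Released on 2024-05-01"; // hypothetical LLM output
const pattern = "\\d{4}-\\d{2}-\\d{2}";    // same escaping as in the YAML above
const passed = new RegExp(pattern, "i").test(response); // true
```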

fuzzy

Allow approximate matching using string similarity, computed from Levenshtein (edit) distance.

```yaml
expected:
  type: fuzzy
  value: "approximately this text"
  threshold: 0.8 # 80% similarity required (default)
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `fuzzy` |
| `value` | string | Yes | The expected text |
| `threshold` | number | No | Similarity threshold 0-1 (default: 0.8) |

```yaml
expected:
  type: fuzzy
  value: "The quick brown fox jumps over the lazy dog"
  threshold: 0.75
```
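
For intuition about what the threshold measures, a common approach is to normalize Levenshtein distance by the longer string's length. The sketch below is illustrative only and not necessarily ArtemisKit's exact scoring:

```ts
// Illustrative only: normalized Levenshtein similarity.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// similarity = 1 - distance / longer length
function fuzzySimilarity(expected: string, actual: string): number {
  const maxLen = Math.max(expected.length, actual.length) || 1;
  return 1 - levenshtein(expected, actual) / maxLen;
}
```

Under this normalization, a threshold of 0.8 tolerates edits on roughly 20% of the longer string's characters.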

llm_grader

Use an LLM to evaluate the response quality based on a rubric.

```yaml
expected:
  type: llm_grader
  rubric: |
    Evaluate the response based on:
    - Accuracy of information
    - Helpfulness and clarity
    - Professional tone
  threshold: 0.7 # Minimum score 0-1 (default)
  provider: openai # Optional: override provider
  model: gpt-5 # Optional: override model
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `llm_grader` |
| `rubric` | string | Yes | Evaluation criteria for the grader |
| `threshold` | number | No | Minimum passing score 0-1 (default: 0.7) |
| `provider` | string | No | Provider for the grader LLM |
| `model` | string | No | Model for the grader LLM |

```yaml
expected:
  type: llm_grader
  rubric: |
    Score the response on these criteria:
    1. Does it directly answer the question?
    2. Is the information accurate?
    3. Is it concise without unnecessary information?
  threshold: 0.8
```

json_schema

Validate that the response is valid JSON matching a schema.

```yaml
expected:
  type: json_schema
  schema:
    type: object
    required:
      - name
      - age
    properties:
      name:
        type: string
      age:
        type: number
        minimum: 0
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `json_schema` |
| `schema` | object | Yes | JSON Schema definition |

```yaml
expected:
  type: json_schema
  schema:
    type: object
    required:
      - status
      - data
    properties:
      status:
        type: string
        enum: ["success", "error"]
      data:
        type: array
        items:
          type: object
```
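
For example, a response like the following would satisfy the schema above (the field values are made up for illustration):

```json
{
  "status": "success",
  "data": [
    { "id": 1, "name": "Widget" },
    { "id": 2, "name": "Gadget" }
  ]
}
```

A response whose status was, say, "pending" would fail the enum constraint, and a missing data field would fail the required check.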

similarity

Check if the response is semantically similar to a reference text. Supports two evaluation modes:

  • Embedding mode: Uses vector embeddings for fast, cost-effective comparison
  • LLM mode: Uses an LLM to evaluate semantic similarity (slower but more nuanced)

```yaml
expected:
  type: similarity
  value: "The product is available in three colors: red, blue, and green."
  threshold: 0.75 # Minimum similarity score 0-1 (default: 0.75)
  mode: embedding # Optional: 'embedding', 'llm', or omit for auto
  embeddingModel: text-embedding-3-large # Optional: embedding model (for embedding mode)
  model: gpt-4o # Optional: LLM model (for llm mode)
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `similarity` |
| `value` | string | Yes | The reference text to compare against |
| `threshold` | number | No | Minimum similarity score 0-1 (default: 0.75) |
| `mode` | string | No | `embedding`, `llm`, or omit for auto (tries embedding first) |
| `embeddingModel` | string | No | Embedding model to use (e.g., text-embedding-3-large) |
| `model` | string | No | LLM model for `llm` mode comparison |

  • Auto (default): Tries embedding first, falls back to LLM if embeddings unavailable
  • Embedding: Uses only embeddings; fails if embedding function unavailable
  • LLM: Uses only LLM-based comparison; skips embedding entirely

Embedding mode with specific model:

```yaml
expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-3-large
  value: "The weather today will be sunny with a high of 75°F"
  threshold: 0.8
```

LLM mode for nuanced comparison:

```yaml
expected:
  type: similarity
  mode: llm
  model: gpt-4o
  value: "A helpful explanation of how photosynthesis works"
  threshold: 0.7
```

Auto mode (default behavior):

```yaml
expected:
  type: similarity
  value: "Thank you for your purchase. Your order has been confirmed."
  threshold: 0.6
```

Using Azure OpenAI embeddings:

```yaml
expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-ada-002
  value: "Customer support response acknowledging the issue"
  threshold: 0.75
```
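
Embedding-based similarity is typically a cosine similarity between the embedding vectors of the reference text and the response. A minimal sketch of that comparison, for intuition only (ArtemisKit's exact scoring and fallback behavior may differ):

```ts
// Illustrative only: cosine similarity between two embedding vectors.
// In embedding mode, a score computed along these lines is compared
// against `threshold`.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```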

inline

Define expression-based matchers directly in YAML. This allows flexible matching logic without writing custom evaluators. Expressions are evaluated safely without using eval().

```yaml
expected:
  type: inline
  expression: 'includes("hello") && length > 10'
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `inline` |
| `expression` | string | Yes | Safe expression to evaluate |
| `value` | string | No | Optional value for comparisons |

Length checks:

| Expression | Description |
| --- | --- |
| `length > N` | Response has more than N characters |
| `length < N` | Response has fewer than N characters |
| `length == N` | Response has exactly N characters |
| `length >= N` | Response has N or more characters |
| `length <= N` | Response has N or fewer characters |

String checks:

| Expression | Description |
| --- | --- |
| `startsWith("prefix")` | Response starts with the given text |
| `endsWith("suffix")` | Response ends with the given text |
| `includes("text")` | Response contains the given text |
| `!includes("text")` | Response does NOT contain the given text |

Regex matching:

| Expression | Description |
| --- | --- |
| `matches(/pattern/)` | Response matches the regex pattern |
| `matches(/pattern/i)` | Case-insensitive regex match |
| `matches(/pattern/g)` | Global regex match |

JSON field checks:

| Expression | Description |
| --- | --- |
| `json.field == "value"` | JSON field equals a string value |
| `json.field == 42` | JSON field equals a numeric value |
| `json.field == true` | JSON field equals a boolean |
| `json.nested.field == "value"` | Nested JSON field check |

Logical operators:

| Expression | Description |
| --- | --- |
| `expr1 && expr2` | Both expressions must pass (AND) |
| `expr1 \|\| expr2` | Either expression can pass (OR) |

Check response length:

```yaml
expected:
  type: inline
  expression: 'length >= 50 && length <= 280'
```

Check string format:

```yaml
expected:
  type: inline
  expression: 'startsWith("{") && endsWith("}")'
```

Check for required content:

```yaml
expected:
  type: inline
  expression: 'includes("thank you") || includes("thanks")'
```

Exclude forbidden content:

```yaml
expected:
  type: inline
  expression: '!includes("error") && !includes("failed")'
```

Regex validation (email format):

```yaml
expected:
  type: inline
  expression: 'matches(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/)'
```

Case-insensitive regex:

```yaml
expected:
  type: inline
  expression: 'matches(/^(yes|no)$/i)'
```

JSON field validation:

```yaml
expected:
  type: inline
  expression: 'json.status == "success"'
```

Nested JSON field:

```yaml
expected:
  type: inline
  expression: 'json.user.role == "admin"'
```

JSON boolean check:

```yaml
expected:
  type: inline
  expression: 'json.active == true'
```

Complex combined validation:

```yaml
expected:
  type: inline
  expression: 'startsWith("PROD-") && length >= 10 && length <= 20 && matches(/^[A-Z0-9-]+$/)'
```

API response validation:

```yaml
expected:
  type: inline
  expression: 'json.status == "success" && includes("data")'
```
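
For intuition about how such expressions can be checked without eval(), here is a simplified, hypothetical sketch covering just two of the primitives above (includes and length comparisons joined by &&). The real evaluator also supports startsWith, endsWith, matches, json field access, and ||, as listed in the tables earlier:

```ts
// Hypothetical sketch only: evaluate a tiny subset of inline expressions
// without eval(). Not ArtemisKit's actual parser.
function evaluateInline(expression: string, response: string): boolean {
  return expression.split("&&").every((part) => {
    const expr = part.trim();

    // includes("text") / !includes("text")
    const inc = expr.match(/^(!?)includes\("(.*)"\)$/);
    if (inc) {
      const found = response.includes(inc[2]);
      return inc[1] === "!" ? !found : found;
    }

    // length <op> N
    const len = expr.match(/^length\s*(>=|<=|==|>|<)\s*(\d+)$/);
    if (len) {
      const n = Number(len[2]);
      switch (len[1]) {
        case ">": return response.length > n;
        case "<": return response.length < n;
        case ">=": return response.length >= n;
        case "<=": return response.length <= n;
        default: return response.length === n;
      }
    }

    throw new Error(`Unsupported expression: ${expr}`);
  });
}
```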

combined

Combine multiple expectations with and/or logic. This allows you to create complex evaluation criteria by combining any of the other expectation types.

```yaml
expected:
  type: combined
  operator: and # 'and' = all must pass, 'or' = at least one must pass
  expectations:
    - type: contains
      values: ["hello"]
      mode: any
    - type: not_contains
      values: ["error"]
      mode: any
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `combined` |
| `operator` | string | Yes | `and` or `or` |
| `expectations` | array | Yes | Array of expectation objects to combine |

Response must contain a greeting AND must not contain errors:

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["hello", "hi", "welcome"]
      mode: any
    - type: not_contains
      values: ["error", "failed", "exception"]
      mode: any
```

Response must match a pattern OR contain specific text:

```yaml
expected:
  type: combined
  operator: or
  expectations:
    - type: regex
      pattern: "\\d{3}-\\d{4}"
    - type: contains
      values: ["phone number not available"]
      mode: any
```

Complex validation with multiple criteria:

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["price", "cost"]
      mode: any
    - type: regex
      pattern: "\\$\\d+\\.\\d{2}"
    - type: not_contains
      values: ["unavailable", "out of stock"]
      mode: any
```

custom

Use a custom evaluator function.

```yaml
expected:
  type: custom
  evaluator: "word-count-validator"
  config:
    minWords: 10
    maxWords: 100
```

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `custom` |
| `evaluator` | string | Yes | Name of the custom evaluator |
| `config` | object | No | Configuration for the evaluator |
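
This page doesn't show how custom evaluators are implemented or registered. Purely as a hypothetical sketch (the function signature and result shape below are assumptions, not ArtemisKit's actual interface), a word-count validator might look like:

```ts
// Hypothetical sketch only: the evaluator signature and result shape
// shown here are assumptions, not ArtemisKit's documented API.
interface EvaluatorResult {
  passed: boolean;
  score: number;
  reason?: string;
}

function wordCountValidator(
  response: string,
  config: { minWords?: number; maxWords?: number } = {}
): EvaluatorResult {
  const { minWords = 0, maxWords = Infinity } = config;
  const words = response.trim().split(/\s+/).filter(Boolean).length;
  const passed = words >= minWords && words <= maxWords;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: `Word count ${words} is ${passed ? "within" : "outside"} [${minWords}, ${maxWords}]`,
  };
}
```
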
Best practices

1. Start with `contains`: it's the simplest and most common matcher.
2. Use `mode: any` when checking for synonyms or variations.
3. Use `mode: all` when multiple concepts must be present.
4. Reserve `llm_grader` for complex evaluations: it adds latency and cost.
5. Use `json_schema` for structured outputs, i.e., whenever your LLM returns JSON.
6. Set appropriate thresholds: too strict causes false failures, too lenient misses issues.

Common patterns

Accept any affirmative answer:

```yaml
expected:
  type: contains
  values: ["yes", "correct", "right", "affirmative"]
  mode: any
```

Require multiple topics to be covered:

```yaml
expected:
  type: contains
  values: ["pricing", "features", "support"]
  mode: all
```

Block sensitive terms:

```yaml
expected:
  type: not_contains
  values: ["password", "api_key", "secret", "credential"]
  mode: any
```

Require a numbered list format:

```yaml
expected:
  type: regex
  pattern: "^\\d+\\.\\s+.+" # Numbered list format
```

Grade overall quality with an LLM:

```yaml
expected:
  type: llm_grader
  rubric: "Response is helpful, accurate, and professionally written"
  threshold: 0.8
```

Combine several checks:

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["thank you", "thanks"]
      mode: any
    - type: not_contains
      values: ["error", "cannot", "unable"]
      mode: any
    - type: regex
      pattern: "\\d+" # Must contain at least one number
```