Expectations

Expectations define how ArtemisKit evaluates LLM responses. Each expectation type uses a different matching strategy, from simple string containment to LLM-based semantic grading.

Every test case requires an expected field that specifies:

  • type — The evaluation strategy to use
  • Type-specific fields — Parameters for that evaluator
expected:
  type: contains
  values: ["hello", "world"]
  mode: any
| Type | Use Case | Key Fields |
| --- | --- | --- |
| contains | Response includes text | values, mode |
| not_contains | Response excludes text | values, mode |
| exact | Exact string match | value, caseSensitive |
| regex | Pattern matching | pattern, flags |
| fuzzy | Approximate match | value, threshold |
| similarity | Semantic similarity | value, threshold, mode |
| llm_grader | LLM judges quality | rubric, threshold |
| json_schema | Validate JSON structure | schema |
| combined | AND/OR logic | operator, expectations |
| inline | Custom expressions | expression |
| custom | Custom evaluator | evaluator, config |

contains

Checks if the response contains specified text:

expected:
  type: contains
  values:
    - "hello"
    - "world"
  mode: any # any | all (default: all)
  • mode: all — Response must contain ALL values
  • mode: any — Response must contain at least ONE value
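The any/all semantics above can be sketched in a few lines. `evaluateContains` is a hypothetical helper to illustrate the logic, not ArtemisKit's actual code; in this sketch, matching is plain case-sensitive substring containment.

```typescript
// Sketch of contains-style matching with any/all modes.
function evaluateContains(
  response: string,
  values: string[],
  mode: "any" | "all" = "all"
): boolean {
  const matched = values.filter((v) => response.includes(v));
  // "all": every value must appear; "any": at least one must appear.
  return mode === "all" ? matched.length === values.length : matched.length > 0;
}
```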

not_contains

Ensures the response does NOT contain specified text:

expected:
  type: not_contains
  values:
    - "error"
    - "I don't know"
  mode: all # any | all (default: all)
  • mode: all — Passes only if NONE of the values appear in the response
  • mode: any — Passes if at least one of the values is absent from the response

exact

Requires an exact string match:

expected:
  type: exact
  value: "42"
  caseSensitive: true # default: true

regex

Matches using regular expressions:

expected:
  type: regex
  pattern: "\\d{4}-\\d{2}-\\d{2}" # Date pattern
  flags: "i" # Optional: i, g, m flags
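A regex expectation presumably compiles the pattern and flags and tests them against the response; a minimal sketch with a hypothetical helper name:

```typescript
// Sketch: compile pattern + flags into a RegExp and test the response.
// The `g` flag is harmless here but irrelevant for a single .test() call.
function evaluateRegex(response: string, pattern: string, flags = ""): boolean {
  return new RegExp(pattern, flags).test(response);
}
```

Note that the YAML-escaped pattern `"\\d{4}-\\d{2}-\\d{2}"` reaches the evaluator as `\d{4}-\d{2}-\d{2}`.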

fuzzy

Uses Levenshtein distance for approximate string matching:

expected:
  type: fuzzy
  value: "Hello, world!"
  threshold: 0.8 # 0-1, default: 0.8

A threshold of 0.8 means 80% similarity is required to pass.
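How edit distance becomes a pass/fail decision can be sketched as follows, assuming similarity is the distance normalized by the longer string's length (ArtemisKit's exact normalization may differ):

```typescript
// Classic dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Sketch: normalize distance to a 0-1 similarity, then apply the threshold.
function fuzzyMatch(response: string, value: string, threshold = 0.8): boolean {
  const distance = levenshtein(response, value);
  const similarity = 1 - distance / Math.max(response.length, value.length, 1);
  return similarity >= threshold;
}
```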

similarity

Semantic similarity using embeddings or LLM comparison:

expected:
  type: similarity
  value: "A friendly greeting"
  threshold: 0.75 # default: 0.75
  mode: embedding # embedding | llm
  embeddingModel: text-embedding-3-small
  • embedding — Uses vector embeddings for comparison (faster, cheaper)
  • llm — Uses LLM to judge semantic similarity (more nuanced)
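In embedding mode the comparison itself is typically a cosine similarity between the two vectors. A sketch, leaving the call to the embedding model out of scope:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Sketch: pass when the vectors' cosine similarity meets the threshold.
function similarityPasses(a: number[], b: number[], threshold = 0.75): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```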

llm_grader

Uses an LLM to grade the response against a rubric:

expected:
  type: llm_grader
  rubric: |
    Score the response on:
    1. Accuracy (0-0.4)
    2. Helpfulness (0-0.3)
    3. Clarity (0-0.3)
    Return total score 0-1.
  threshold: 0.7 # default: 0.7
  model: gpt-4 # optional: grader model
  provider: openai # optional: grader provider

The grader returns a score from 0 to 1. The test passes if the score meets or exceeds the threshold.
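A grader pipeline ends by parsing a score out of the model's reply and comparing it to the threshold. A rough sketch of that last step; real graders usually request structured output, and this is not ArtemisKit's actual parser, it just pulls the first 0-1 number from free text:

```typescript
// Sketch: extract the first number in [0, 1] from a grader reply.
function parseScore(graderReply: string): number | null {
  const match = graderReply.match(/(?:0?\.\d+|[01](?:\.\d+)?)/);
  if (!match) return null;
  const score = parseFloat(match[0]);
  return score >= 0 && score <= 1 ? score : null;
}

// Sketch: pass when the parsed score meets or exceeds the threshold.
function graderPasses(graderReply: string, threshold = 0.7): boolean {
  const score = parseScore(graderReply);
  return score !== null && score >= threshold;
}
```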

json_schema

Validates that the response is valid JSON matching a schema:

expected:
  type: json_schema
  schema:
    type: object
    required:
      - name
      - age
    properties:
      name:
        type: string
      age:
        type: number
        minimum: 0
      email:
        type: string
        format: email

The evaluator:

  1. Parses the response as JSON
  2. Validates against the JSON Schema
  3. Returns validation errors if any
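A toy version of these three steps, checking only `required` and per-property `type` (a real validator such as Ajv covers the full JSON Schema vocabulary; this is only an illustration):

```typescript
interface MiniSchema {
  type?: string;
  required?: string[];
  properties?: Record<string, { type?: string }>;
}

// Sketch: parse, validate, and return any errors (empty array = pass).
function validateJson(response: string, schema: MiniSchema): string[] {
  let data: unknown;
  try {
    data = JSON.parse(response); // step 1: parse the response as JSON
  } catch {
    return ["response is not valid JSON"];
  }
  if (schema.type === "object" && (typeof data !== "object" || data === null)) {
    return ["root is not an object"];
  }
  const obj = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required ?? []) { // step 2: required keys
    if (!(key in obj)) errors.push(`missing required field: ${key}`);
  }
  for (const [key, prop] of Object.entries(schema.properties ?? {})) {
    if (key in obj && prop.type && typeof obj[key] !== prop.type) {
      errors.push(`field ${key} should be ${prop.type}`);
    }
  }
  return errors; // step 3: report validation errors, if any
}
```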

combined

Combines multiple expectations with AND/OR logic:

expected:
  type: combined
  operator: and # and | or
  expectations:
    - type: contains
      values: ["thank you"]
    - type: not_contains
      values: ["error"]
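The fold over sub-results is simple. In this sketch, `results` stands in for the pass/fail outcomes of the nested expectations (the helper name is hypothetical):

```typescript
// Sketch: combine nested expectation results with AND/OR logic.
function combine(results: boolean[], operator: "and" | "or"): boolean {
  return operator === "and"
    ? results.every((r) => r) // all sub-expectations must pass
    : results.some((r) => r); // at least one must pass
}
```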

inline

Write custom evaluation expressions:

expected:
  type: inline
  expression: "response.length > 100 && response.includes('hello')"
  value: "hello" # optional: value to use in expression

The expression has access to:

  • response — The LLM response text
  • value — The optional value field
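One way such an expression could be evaluated is by compiling it into a function with `response` and `value` in scope. This is an assumption about the mechanism, not ArtemisKit's actual implementation, and `new Function` has eval-like security implications; a real implementation would likely use a sandboxed expression parser:

```typescript
// Sketch: compile the expression with response/value bound as arguments.
function evaluateInline(expression: string, response: string, value?: string): boolean {
  const fn = new Function("response", "value", `return (${expression});`);
  return Boolean(fn(response, value));
}
```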

custom

Use a custom evaluator by name:

expected:
  type: custom
  evaluator: myCustomEvaluator
  config:
    threshold: 0.5
    customOption: true

Custom evaluators must be registered with ArtemisKit before use.

Result Format

Every evaluator returns a consistent result structure:

interface EvaluatorResult {
  passed: boolean;  // Did the test pass?
  score: number;    // 0-1 score
  reason?: string;  // Human-readable explanation
  details?: object; // Additional metadata
}

Example result:

{
  "passed": true,
  "score": 0.85,
  "reason": "Response contains 2 of 2 required values",
  "details": {
    "matchedValues": ["hello", "world"],
    "mode": "all"
  }
}
Choosing an Expectation Type

| Scenario | Recommended Type |
| --- | --- |
| Must include specific text | contains |
| Must NOT include text | not_contains |
| Exact answer expected | exact |
| Pattern matching (dates, IDs) | regex |
| Typo tolerance needed | fuzzy |
| Semantic meaning matters | similarity |
| Subjective quality assessment | llm_grader |
| JSON API responses | json_schema |
| Multiple conditions | combined |
| Complex custom logic | inline |