Expectations

Expectations define how ArtemisKit evaluates LLM responses. Each test case requires exactly one `expected` object with a `type` field.

Available Expectation Types

Type	Description
`contains`	Check if response contains specific strings
`not_contains`	Check if response does NOT contain specific strings
`exact`	Check for exact string match
`regex`	Match against a regular expression
`fuzzy`	Approximate string matching
`llm_grader`	Use an LLM to evaluate the response
`json_schema`	Validate response against a JSON schema
`similarity`	Semantic similarity matching using embeddings or LLM
`inline`	Expression-based matchers defined directly in YAML
`combined`	Combine multiple expectations with `and`/`or` logic
`custom`	Use a custom evaluator

Contains

Check if the response contains specific strings. Use `mode` to control whether every value must be present or just one.

expected:
  type: contains
  values:
    - "hello"
    - "welcome"
  mode: any  # 'any' = at least one match, 'all' = all must match (default)

Field	Type	Required	Description
`type`	string	Yes	Must be `contains`
`values`	array	Yes	Array of strings to look for
`mode`	string	No	`all` (default) or `any`

Examples

Match any of the values:

expected:
  type: contains
  values: ["hello", "hi", "hey"]
  mode: any

Match all values (default behavior):

expected:
  type: contains
  values: ["price", "available"]
  mode: all

Not Contains

Check if the response does NOT contain specific strings. The inverse of `contains`.

expected:
  type: not_contains
  values:
    - "error"
    - "failed"
  mode: any  # 'any' = fail if any value found, 'all' = fail only if all found

Field	Type	Required	Description
`type`	string	Yes	Must be `not_contains`
`values`	array	Yes	Array of strings that should NOT be present
`mode`	string	No	`all` (default) or `any`

Examples

Fail if any forbidden term is found:

expected:
  type: not_contains
  values: ["password", "secret", "credential"]
  mode: any

Ensure response doesn’t contain error indicators:

expected:
  type: not_contains
  values: ["error", "exception", "failed"]
  mode: any
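
Read literally, the mode comment above means `any` fails as soon as one forbidden string appears, while `all` fails only when every forbidden string appears. A minimal Python sketch of that reading follows; `not_contains_check` is a hypothetical name, and case-sensitive substring matching is assumed rather than documented.

```python
def not_contains_check(response: str, values: list[str], mode: str = "all") -> bool:
    """Return True when the response satisfies a `not_contains` expectation."""
    found = [value in response for value in values]  # assumed: case-sensitive substrings
    if mode == "any":
        return not any(found)  # fail if any forbidden value is found
    if mode == "all":
        return not all(found)  # fail only if all forbidden values are found
    raise ValueError(f"unknown mode: {mode!r}")


response = "An error occurred while loading the page."
print(not_contains_check(response, ["error", "failed"], mode="any"))  # "error" is present
print(not_contains_check(response, ["error", "failed"], mode="all"))  # "failed" is absent
```

Under this reading the default `all` lets through a response that contains only some of the forbidden strings, so `mode: any` is the stricter choice for blocklists.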

Exact

Check for an exact string match.

expected:
  type: exact
  value: "The answer is 42."
  caseSensitive: true  # Optional, default: true

Field	Type	Required	Description
`type`	string	Yes	Must be `exact`
`value`	string	Yes	The exact string to match
`caseSensitive`	boolean	No	Case-sensitive matching (default: true)

Example

expected:
  type: exact
  value: "Hello, World!"
  caseSensitive: false

Regex

Match the response against a regular expression.

expected:
  type: regex
  pattern: "\\d{4}-\\d{2}-\\d{2}"  # Date format YYYY-MM-DD
  flags: "i"  # Optional: regex flags

Field	Type	Required	Description
`type`	string	Yes	Must be `regex`
`pattern`	string	Yes	Regular expression pattern
`flags`	string	No	Regex flags (e.g., `i` for case-insensitive)

Examples

Match an email:

expected:
  type: regex
  pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

Case-insensitive match:

expected:
  type: regex
  pattern: "hello.*world"
  flags: "i"
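
Because YAML double-quoted strings treat the backslash as an escape character, `"\\d{4}"` reaches the regex engine as `\d{4}`. The sketch below mirrors the examples above using Python's `re` module. It is an approximation only: ArtemisKit's own regex engine, its supported flag letters, and whether it matches anywhere in the response (as assumed here) or against the whole response are not stated in this section, and `regex_check` and `FLAG_MAP` are hypothetical names.

```python
import re

# Assumed mapping from single-letter flags to Python's re flags (illustrative subset).
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def regex_check(response: str, pattern: str, flags: str = "") -> bool:
    """Return True if the pattern matches anywhere in the response."""
    compiled_flags = 0
    for letter in flags:
        compiled_flags |= FLAG_MAP[letter]  # KeyError for flags this sketch does not model
    return re.search(pattern, response, compiled_flags) is not None


# The YAML string "\\d{4}-\\d{2}-\\d{2}" becomes this regex after YAML unescaping.
date_pattern = r"\d{4}-\d{2}-\d{2}"
print(regex_check("Shipped on 2024-03-15.", date_pattern))
print(regex_check("HELLO, big WORLD", "hello.*world", flags="i"))
print(regex_check("HELLO, big WORLD", "hello.*world"))
```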

Fuzzy

Allow approximate matching using string similarity based on Levenshtein (edit) distance.

expected:
  type: fuzzy
  value: "approximately this text"
  threshold: 0.8  # 80% similarity required (default)

Field	Type	Required	Description
`type`	string	Yes	Must be `fuzzy`
`value`	string	Yes	The expected text
`threshold`	number	No	Similarity threshold 0-1 (default: 0.8)

Example

expected:
  type: fuzzy
  value: "The quick brown fox jumps over the lazy dog"
  threshold: 0.75
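
To make the threshold concrete, here is a small Python sketch of a Levenshtein-based similarity score. The text says only that Levenshtein distance is used, so the normalization (one minus the distance divided by the longer string's length) is an assumption, and `levenshtein` and `fuzzy_check` are hypothetical names rather than ArtemisKit functions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_check(response: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass when the normalized similarity reaches the threshold."""
    longest = max(len(response), len(expected))
    if longest == 0:
        return True  # two empty strings are identical
    similarity = 1 - levenshtein(response, expected) / longest  # assumed normalization
    return similarity >= threshold


print(levenshtein("kitten", "sitting"))
print(fuzzy_check("The quick brown fax", "The quick brown fox"))
print(fuzzy_check("kitten", "sitting"))
```

With this normalization a single-character typo in a 19-character string scores about 0.95, while "kitten" against "sitting" scores about 0.57 and falls below the default threshold.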

LLM Grader

Use an LLM to evaluate the response quality based on a rubric.

expected:
  type: llm_grader
  rubric: |
    Evaluate the response based on:
    - Accuracy of information
    - Helpfulness and clarity
    - Professional tone
  threshold: 0.7  # Minimum score 0-1 (default)
  provider: openai  # Optional: override provider
  model: gpt-5    # Optional: override model

Field	Type	Required	Description
`type`	string	Yes	Must be `llm_grader`
`rubric`	string	Yes	Evaluation criteria for the grader
`threshold`	number	No	Minimum passing score 0-1 (default: 0.7)
`provider`	string	No	Provider for the grader LLM
`model`	string	No	Model for the grader LLM

Example

expected:
  type: llm_grader
  rubric: |
    Score the response on these criteria:
    1. Does it directly answer the question?
    2. Is the information accurate?
    3. Is it concise without unnecessary information?
  threshold: 0.8

JSON Schema

Validate that the response is valid JSON matching a schema.

expected:
  type: json_schema
  schema:
    type: object
    required:
      - name
      - age
    properties:
      name:
        type: string
      age:
        type: number
        minimum: 0

Field	Type	Required	Description
`type`	string	Yes	Must be `json_schema`
`schema`	object	Yes	JSON Schema definition

Example

expected:
  type: json_schema
  schema:
    type: object
    required:
      - status
      - data
    properties:
      status:
        type: string
        enum: ["success", "error"]
      data:
        type: array
        items:
          type: object
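
The check has two stages: the response must parse as JSON, and the parsed value must satisfy the schema. The Python sketch below illustrates both with a deliberately small subset of JSON Schema (`type`, `required`, `properties`, `enum`, `minimum`, `items`). It is not ArtemisKit's validator and does not cover the full specification. The names `validate` and `json_schema_check` are hypothetical, and the sketch assumes the response is bare JSON with no surrounding text.

```python
import json

# JSON Schema type names mapped to Python types (subset used in this sketch).
TYPE_MAP = {
    "object": dict, "array": list, "string": str,
    "number": (int, float), "integer": int, "boolean": bool, "null": type(None),
}


def validate(instance, schema: dict) -> bool:
    """Validate a parsed JSON value against a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        if not isinstance(instance, TYPE_MAP[expected]):
            return False
        if expected in ("number", "integer") and isinstance(instance, bool):
            return False  # Python's bool subclasses int, but JSON booleans are not numbers
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if "minimum" in schema and isinstance(instance, (int, float)) and instance < schema["minimum"]:
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], subschema):
                return False
    if isinstance(instance, list) and "items" in schema:
        return all(validate(item, schema["items"]) for item in instance)
    return True


def json_schema_check(response: str, schema: dict) -> bool:
    """Stage 1: parse the response as JSON. Stage 2: validate it against the schema."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    return validate(parsed, schema)


person_schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "number", "minimum": 0}},
}
print(json_schema_check('{"name": "Ada", "age": 36}', person_schema))
print(json_schema_check('{"name": "Ada"}', person_schema))
print(json_schema_check("Sure! Here is the JSON you asked for.", person_schema))
```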

Similarity

Check if the response is semantically similar to a reference text. Supports two evaluation modes:

Embedding mode: Uses vector embeddings for fast, cost-effective comparison
LLM mode: Uses an LLM to evaluate semantic similarity (slower but more nuanced)

expected:
  type: similarity
  value: "The product is available in three colors: red, blue, and green."
  threshold: 0.75  # Minimum similarity score 0-1 (default: 0.75)
  mode: embedding  # Optional: 'embedding', 'llm', or omit for auto
  embeddingModel: text-embedding-3-large  # Optional: embedding model (for embedding mode)
  model: gpt-4o  # Optional: LLM model (for llm mode)

Field	Type	Required	Description
`type`	string	Yes	Must be `similarity`
`value`	string	Yes	The reference text to compare against
`threshold`	number	No	Minimum similarity score 0-1 (default: 0.75)
`mode`	string	No	`embedding`, `llm`, or omit for auto (tries embedding first)
`embeddingModel`	string	No	Embedding model to use (e.g., `text-embedding-3-large`)
`model`	string	No	LLM model for llm mode comparison

Mode Behavior

Auto (default): Tries embedding first and falls back to LLM if embeddings are unavailable
Embedding: Uses only embeddings; fails if no embedding function is available
LLM: Uses only LLM-based comparison; skips embedding entirely
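
The three behaviors can be sketched as a small dispatch function. This is a sketch under stated assumptions, not ArtemisKit's code: `similarity_score`, `embed`, and `llm_score` are hypothetical names, the embedding and LLM calls are stand-in callables, and cosine similarity is assumed as the vector comparison because the text does not name one.

```python
import math
from typing import Callable, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_score(
    response: str,
    reference: str,
    mode: Optional[str] = None,  # None models "omit for auto"
    embed: Optional[Callable[[str], Sequence[float]]] = None,  # hypothetical embedding function
    llm_score: Optional[Callable[[str, str], float]] = None,   # hypothetical LLM judge, returns 0-1
) -> float:
    # Auto and embedding modes both prefer embeddings when they are available.
    if mode in (None, "embedding") and embed is not None:
        return cosine_similarity(embed(response), embed(reference))
    # Embedding mode never falls back to the LLM.
    if mode == "embedding":
        raise RuntimeError("embedding mode requested but no embedding function is available")
    # LLM mode, or auto mode without embeddings.
    if llm_score is None:
        raise RuntimeError("no LLM scorer is available")
    return llm_score(response, reference)


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder that counts vowels; real embeddings come from a model."""
    return [float(text.lower().count(ch)) for ch in "aeiou"]


score = similarity_score("Order confirmed, thank you.", "Thank you, order confirmed.", embed=toy_embed)
print(score >= 0.75)  # compare against the expectation's threshold
```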

Examples

Embedding mode with specific model:

expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-3-large
  value: "The weather today will be sunny with a high of 75°F"
  threshold: 0.8

LLM mode for nuanced comparison:

expected:
  type: similarity
  mode: llm
  model: gpt-4o
  value: "A helpful explanation of how photosynthesis works"
  threshold: 0.7

Auto mode (default behavior):

expected:
  type: similarity
  value: "Thank you for your purchase. Your order has been confirmed."
  threshold: 0.6

Using Azure OpenAI embeddings:

expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-ada-002
  value: "Customer support response acknowledging the issue"
  threshold: 0.75

Inline

Define expression-based matchers directly in YAML. This allows flexible matching logic without writing custom evaluators. Expressions are evaluated safely without using eval().

expected:
  type: inline
  expression: 'includes("hello") && length > 10'

Field	Type	Required	Description
`type`	string	Yes	Must be `inline`
`expression`	string	Yes	Safe expression to evaluate
`value`	string	No	Optional value for comparisons

Supported Expressions

Length Checks

Expression	Description
`length > N`	Response has more than N characters
`length < N`	Response has fewer than N characters
`length == N`	Response has exactly N characters
`length >= N`	Response has N or more characters
`length <= N`	Response has N or fewer characters

String Checks

Expression	Description
`startsWith("prefix")`	Response starts with the given text
`endsWith("suffix")`	Response ends with the given text
`includes("text")`	Response contains the given text
`!includes("text")`	Response does NOT contain the given text

Regex Matching

Expression	Description
`matches(/pattern/)`	Response matches the regex pattern
`matches(/pattern/i)`	Case-insensitive regex match
`matches(/pattern/g)`	Global regex match

JSON Field Checks

Expression	Description
`json.field == "value"`	JSON field equals string value
`json.field == 42`	JSON field equals numeric value
`json.field == true`	JSON field equals boolean
`json.nested.field == "value"`	Nested JSON field check
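
A JSON field check implies parsing the response as JSON and walking the dotted path after the `json.` prefix. The Python sketch below shows one way this could work. The function names are hypothetical, and treating invalid JSON or a missing field as a failed check is an assumption, since the text does not say how those cases are handled.

```python
import json

_MISSING = object()  # sentinel for "path could not be resolved"


def json_field(response: str, path: str):
    """Resolve a dotted path such as 'user.role' against a JSON response."""
    try:
        node = json.loads(response)
    except json.JSONDecodeError:
        return _MISSING  # assumption: non-JSON responses fail the check
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING  # assumption: missing fields fail the check
        node = node[key]
    return node


def json_field_equals(response: str, path: str, expected) -> bool:
    """Model `json.<path> == <expected>` for string, number, and boolean literals."""
    actual = json_field(response, path)
    if actual is _MISSING:
        return False
    if isinstance(actual, bool) or isinstance(expected, bool):
        return actual is expected  # keep True distinct from 1, which Python treats as equal
    return actual == expected


print(json_field_equals('{"status": "success"}', "status", "success"))
print(json_field_equals('{"user": {"role": "admin"}}', "user.role", "admin"))
print(json_field_equals('{"active": true}', "active", True))
```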

Combined Expressions

Expression	Description
`expr1 && expr2`	Both expressions must pass (AND)
`expr1 || expr2`	Either expression can pass (OR)

Examples

Check response length:

expected:
  type: inline
  expression: 'length >= 50 && length <= 280'

Check string format:

expected:
  type: inline
  expression: 'startsWith("{") && endsWith("}")'

Check for required content:

expected:
  type: inline
  expression: 'includes("thank you") || includes("thanks")'

Exclude forbidden content:

expected:
  type: inline
  expression: '!includes("error") && !includes("failed")'

Regex validation (email format):

expected:
  type: inline
  expression: 'matches(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/)'

Case-insensitive regex:

expected:
  type: inline
  expression: 'matches(/^(yes|no)$/i)'

JSON field validation:

expected:
  type: inline
  expression: 'json.status == "success"'

Nested JSON field:

expected:
  type: inline
  expression: 'json.user.role == "admin"'

JSON boolean check:

expected:
  type: inline
  expression: 'json.active == true'

Complex combined validation:

expected:
  type: inline
  expression: 'startsWith("PROD-") && length >= 10 && length <= 20 && matches(/^[A-Z0-9-]+$/)'

API response validation:

expected:
  type: inline
  expression: 'json.status == "success" && includes("data")'

Combined

Combine multiple expectations with `and`/`or` logic. This allows you to create complex evaluation criteria by combining any of the other expectation types.

expected:
  type: combined
  operator: and  # 'and' = all must pass, 'or' = at least one must pass
  expectations:
    - type: contains
      values: ["hello"]
      mode: any
    - type: not_contains
      values: ["error"]
      mode: any

Field	Type	Required	Description
`type`	string	Yes	Must be `combined`
`operator`	string	Yes	`and` or `or`
`expectations`	array	Yes	Array of expectation objects to combine

Examples

Response must contain greeting AND not contain errors:

expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["hello", "hi", "welcome"]
      mode: any
    - type: not_contains
      values: ["error", "failed", "exception"]
      mode: any

Response must match a pattern OR contain specific text:

expected:
  type: combined
  operator: or
  expectations:
    - type: regex
      pattern: "\\d{3}-\\d{4}"
    - type: contains
      values: ["phone number not available"]
      mode: any

Complex validation with multiple criteria:

expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["price", "cost"]
      mode: any
    - type: regex
      pattern: "\\$\\d+\\.\\d{2}"
    - type: not_contains
      values: ["unavailable", "out of stock"]
      mode: any
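
The `and`/`or` logic amounts to applying "all" or "any" over the results of the child expectations. The Python sketch below models that for three child types only (`contains`, `not_contains`, `regex`). It is illustrative rather than ArtemisKit's evaluator: the function name `evaluate` is hypothetical, short-circuit evaluation is a choice made by this sketch, and although the recursion would also accept a `combined` child, nesting is not confirmed by the text.

```python
import re


def evaluate(expectation: dict, response: str) -> bool:
    """Evaluate an expectation dict (as parsed from YAML) against a response."""
    kind = expectation["type"]
    if kind == "combined":
        # Generator: all() stops at the first failure, any() at the first pass.
        results = (evaluate(child, response) for child in expectation["expectations"])
        operator = expectation["operator"]
        if operator == "and":
            return all(results)
        if operator == "or":
            return any(results)
        raise ValueError(f"unknown operator: {operator!r}")
    if kind == "regex":
        return re.search(expectation["pattern"], response) is not None
    if kind in ("contains", "not_contains"):
        found = [value in response for value in expectation["values"]]
        hit = any(found) if expectation.get("mode", "all") == "any" else all(found)
        return hit if kind == "contains" else not hit
    raise ValueError(f"type not modelled in this sketch: {kind!r}")


# The "complex validation" example above, written as the dict the YAML would parse to.
price_expectation = {
    "type": "combined",
    "operator": "and",
    "expectations": [
        {"type": "contains", "values": ["price", "cost"], "mode": "any"},
        {"type": "regex", "pattern": r"\$\d+\.\d{2}"},
        {"type": "not_contains", "values": ["unavailable", "out of stock"], "mode": "any"},
    ],
}
print(evaluate(price_expectation, "The cost is $19.99 and it ships today."))
print(evaluate(price_expectation, "The price is $19.99 but it is out of stock."))
```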

Custom

Use a custom evaluator function.

expected:
  type: custom
  evaluator: "word-count-validator"
  config:
    minWords: 10
    maxWords: 100

Field	Type	Required	Description
`type`	string	Yes	Must be `custom`
`evaluator`	string	Yes	Name of the custom evaluator
`config`	object	No	Configuration for the evaluator
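
As an illustration of the kind of logic an evaluator named `word-count-validator` might contain, here is a minimal Python sketch. How ArtemisKit registers and invokes custom evaluators is not described in this section, so the function signature and the handling of missing config keys are hypothetical. Only the `minWords` and `maxWords` keys come from the example above.

```python
def word_count_validator(response: str, config: dict) -> bool:
    """Pass when the whitespace-delimited word count lies within [minWords, maxWords]."""
    word_count = len(response.split())
    minimum = config.get("minWords", 0)             # assumed default: no lower bound
    maximum = config.get("maxWords", float("inf"))  # assumed default: no upper bound
    return minimum <= word_count <= maximum


config = {"minWords": 10, "maxWords": 100}
print(word_count_validator("Too short.", config))
print(word_count_validator("This reply has more than ten words, so it should pass the check.", config))
```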

Best Practices

Start with `contains`: it’s the simplest and most common matcher
Use `mode: any` when checking for synonyms or variations
Use `mode: all` when multiple concepts must be present
Reserve `llm_grader` for complex evaluations, since it adds latency and cost
Use `json_schema` for structured outputs, when your LLM returns JSON
Set appropriate thresholds: too strict causes false failures, too lenient misses issues

Common Patterns

Checking for synonyms

expected:
  type: contains
  values: ["yes", "correct", "right", "affirmative"]
  mode: any

Ensuring multiple topics are covered

expected:
  type: contains
  values: ["pricing", "features", "support"]
  mode: all

Ensuring forbidden content is absent

expected:
  type: not_contains
  values: ["password", "api_key", "secret", "credential"]
  mode: any

Validating formatted output

expected:
  type: regex
  pattern: "^\\d+\\.\\s+.+"  # Numbered list format

Quality evaluation

expected:
  type: llm_grader
  rubric: "Response is helpful, accurate, and professionally written"
  threshold: 0.8

Combining multiple conditions

expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["thank you", "thanks"]
      mode: any
    - type: not_contains
      values: ["error", "cannot", "unable"]
      mode: any
    - type: regex
      pattern: "\\d+"  # Must contain at least one number
