Expectations

Expectations define how ArtemisKit evaluates LLM responses. Each test case requires exactly one `expected` object with a `type` field.
Available Expectation Types

| Type | Description |
|---|---|
| `contains` | Check if response contains specific strings |
| `not_contains` | Check if response does NOT contain specific strings |
| `exact` | Check for exact string match |
| `regex` | Match against a regular expression |
| `fuzzy` | Approximate string matching |
| `llm_grader` | Use an LLM to evaluate the response |
| `json_schema` | Validate response against a JSON schema |
| `similarity` | Semantic similarity matching using embeddings or LLM |
| `inline` | Expression-based matchers defined directly in YAML |
| `combined` | Combine multiple expectations with and/or logic |
| `custom` | Use a custom evaluator |

Contains

Check if the response contains specific strings. Use `mode` to control matching behavior.

```yaml
expected:
  type: contains
  values:
    - "hello"
    - "welcome"
  mode: any  # 'any' = at least one match, 'all' = all must match (default)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `contains` |
| `values` | array | Yes | Array of strings to look for |
| `mode` | string | No | `all` (default) or `any` |

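
The two modes can be pictured with a minimal Python sketch. This is not ArtemisKit's implementation; the function name `contains_passes` is hypothetical, and plain case-sensitive substring matching is an assumption the source does not confirm.

```python
def contains_passes(response: str, values: list[str], mode: str = "all") -> bool:
    """Sketch of a `contains` check (assumes case-sensitive substring matching)."""
    hits = [value in response for value in values]
    if mode == "any":
        return any(hits)  # at least one value is present
    return all(hits)      # default 'all': every value is present


reply = "Hello! The price is $20."
print(contains_passes(reply, ["price", "available"]))         # False
print(contains_passes(reply, ["price", "available"], "any"))  # True
```

With the default `all`, a single missing value fails the check, which is why `any` suits lists of synonyms.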
Examples

Match any of the values:

```yaml
expected:
  type: contains
  values: ["hello", "hi", "hey"]
  mode: any
```

Match all values (default behavior):

```yaml
expected:
  type: contains
  values: ["price", "available"]
  mode: all
```

Not Contains
Check if the response does NOT contain specific strings. The inverse of `contains`.

```yaml
expected:
  type: not_contains
  values:
    - "error"
    - "failed"
  mode: any  # 'any' = fail if any value found, 'all' = fail only if all found
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `not_contains` |
| `values` | array | Yes | Array of strings that should NOT be present |
| `mode` | string | No | `all` (default) or `any` |

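
Read literally, the mode semantics in the comment above invert the `contains` logic. The sketch below follows that reading; the function name is hypothetical and case-sensitive substring matching is assumed.

```python
def not_contains_passes(response: str, values: list[str], mode: str = "all") -> bool:
    """Sketch of a `not_contains` check following the documented mode comments."""
    found = [value in response for value in values]
    if mode == "any":
        return not any(found)  # fail as soon as one forbidden value appears
    return not all(found)      # 'all': fail only when every value appears


reply = "Request failed, please retry."
print(not_contains_passes(reply, ["error", "failed"], "any"))  # False
print(not_contains_passes(reply, ["error", "failed"], "all"))  # True
```

Under this reading, the default `all` lets a response through when only some forbidden terms appear, so deny-lists usually want `mode: any` set explicitly.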
Examples

Fail if any forbidden term is found:

```yaml
expected:
  type: not_contains
  values: ["password", "secret", "credential"]
  mode: any
```

Ensure the response doesn't contain error indicators:

```yaml
expected:
  type: not_contains
  values: ["error", "exception", "failed"]
  mode: any
```

Exact

Check for an exact string match.

```yaml
expected:
  type: exact
  value: "The answer is 42."
  caseSensitive: true  # Optional, default: true
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `exact` |
| `value` | string | Yes | The exact string to match |
| `caseSensitive` | boolean | No | Case-sensitive matching (default: true) |

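
As a rough illustration of what `caseSensitive` changes, here is a small sketch. The function name is hypothetical, and whether ArtemisKit trims whitespace before comparing is not stated in the source, so no trimming is done here.

```python
def exact_passes(response: str, value: str, case_sensitive: bool = True) -> bool:
    """Sketch of an `exact` check; no whitespace trimming is assumed."""
    if case_sensitive:
        return response == value
    return response.casefold() == value.casefold()  # compare ignoring case


print(exact_passes("hello, world!", "Hello, World!"))                        # False
print(exact_passes("hello, world!", "Hello, World!", case_sensitive=False))  # True
```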
Example

```yaml
expected:
  type: exact
  value: "Hello, World!"
  caseSensitive: false
```

Regex

Match the response against a regular expression.

```yaml
expected:
  type: regex
  pattern: "\\d{4}-\\d{2}-\\d{2}"  # Date format YYYY-MM-DD
  flags: "i"  # Optional: regex flags
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `regex` |
| `pattern` | string | Yes | Regular expression pattern |
| `flags` | string | No | Regex flags (e.g., `i` for case-insensitive) |

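
In a double-quoted YAML string, `\\` decodes to a single backslash, so `"\\d{4}"` reaches the regex engine as `\d{4}`. The sketch below shows the decoded pattern in Python. The flag mapping and the choice to search anywhere in the response (rather than require a full match) are assumptions, and the source documents only the `i` flag.

```python
import re

# Hypothetical mapping from flag letters to Python's re flags; only "i" is documented.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def regex_passes(response: str, pattern: str, flags: str = "") -> bool:
    """Sketch of a `regex` check that searches anywhere in the response."""
    combined = 0
    for letter in flags:
        combined |= FLAG_MAP[letter]
    return re.search(pattern, response, combined) is not None


# YAML "\\d{4}-\\d{2}-\\d{2}" decodes to the pattern below.
date_pattern = r"\d{4}-\d{2}-\d{2}"
print(regex_passes("Shipped on 2024-03-15.", date_pattern))  # True
print(regex_passes("HELLO big WORLD", "hello.*world", "i"))  # True
print(regex_passes("HELLO big WORLD", "hello.*world"))       # False
```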
Examples

Match an email:

```yaml
expected:
  type: regex
  pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
```

Case-insensitive match:

```yaml
expected:
  type: regex
  pattern: "hello.*world"
  flags: "i"
```

Fuzzy

Allow approximate matching using string similarity, measured with Levenshtein distance.

```yaml
expected:
  type: fuzzy
  value: "approximately this text"
  threshold: 0.8  # 80% similarity required (default)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `fuzzy` |
| `value` | string | Yes | The expected text |
| `threshold` | number | No | Similarity threshold 0-1 (default: 0.8) |

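
To give a feel for how a Levenshtein distance becomes a 0-1 score, here is a minimal sketch. Normalizing by the longer string's length is a common convention and an assumption here; the source does not say which normalization ArtemisKit uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_score(response: str, expected: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer length (assumed)."""
    longest = max(len(response), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(response, expected) / longest


score = fuzzy_score("The quick brown fox", "The quick brown dog")
print(round(score, 3))  # 0.895
print(score >= 0.8)     # True
```

Two substitutions over 19 characters give about 0.895, which clears the default threshold of 0.8.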
Example

```yaml
expected:
  type: fuzzy
  value: "The quick brown fox jumps over the lazy dog"
  threshold: 0.75
```

LLM Grader
Use an LLM to evaluate the response quality based on a rubric.

```yaml
expected:
  type: llm_grader
  rubric: |
    Evaluate the response based on:
    - Accuracy of information
    - Helpfulness and clarity
    - Professional tone
  threshold: 0.7  # Minimum score 0-1 (default)
  provider: openai  # Optional: override provider
  model: gpt-5  # Optional: override model
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `llm_grader` |
| `rubric` | string | Yes | Evaluation criteria for the grader |
| `threshold` | number | No | Minimum passing score 0-1 (default: 0.7) |
| `provider` | string | No | Provider for the grader LLM |
| `model` | string | No | Model for the grader LLM |

Example

```yaml
expected:
  type: llm_grader
  rubric: |
    Score the response on these criteria:
    1. Does it directly answer the question?
    2. Is the information accurate?
    3. Is it concise without unnecessary information?
  threshold: 0.8
```

JSON Schema
Validate that the response is valid JSON matching a schema.

```yaml
expected:
  type: json_schema
  schema:
    type: object
    required:
      - name
      - age
    properties:
      name:
        type: string
      age:
        type: number
        minimum: 0
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `json_schema` |
| `schema` | object | Yes | JSON Schema definition |

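
The sketch below shows the kind of check this schema implies, using only the Python standard library. It is a hand-rolled subset (`type`, `required`, `properties`, `minimum`) written for illustration, not a full JSON Schema validator and not ArtemisKit's code; all function names are hypothetical.

```python
import json

TYPE_MAP = {"object": dict, "array": list, "string": str, "boolean": bool}


def check(value, schema: dict) -> bool:
    """Validate a parsed value against a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if "minimum" in schema and value < schema["minimum"]:
            return False
    elif expected in TYPE_MAP and not isinstance(value, TYPE_MAP[expected]):
        return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not check(value[key], sub_schema):
                return False
    return True


def json_schema_passes(response: str, schema: dict) -> bool:
    try:
        return check(json.loads(response), schema)
    except json.JSONDecodeError:
        return False  # the response is not valid JSON


schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "number", "minimum": 0}},
}
print(json_schema_passes('{"name": "Ada", "age": 36}', schema))  # True
print(json_schema_passes('{"name": "Ada"}', schema))             # False
print(json_schema_passes('{"name": "Ada", "age": -1}', schema))  # False
```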
Example

```yaml
expected:
  type: json_schema
  schema:
    type: object
    required:
      - status
      - data
    properties:
      status:
        type: string
        enum: ["success", "error"]
      data:
        type: array
        items:
          type: object
```

Similarity
Check if the response is semantically similar to a reference text. Supports two evaluation modes:

- Embedding mode: Uses vector embeddings for fast, cost-effective comparison
- LLM mode: Uses an LLM to evaluate semantic similarity (slower but more nuanced)

```yaml
expected:
  type: similarity
  value: "The product is available in three colors: red, blue, and green."
  threshold: 0.75  # Minimum similarity score 0-1 (default: 0.75)
  mode: embedding  # Optional: 'embedding', 'llm', or omit for auto
  embeddingModel: text-embedding-3-large  # Optional: embedding model (for embedding mode)
  model: gpt-4o  # Optional: LLM model (for llm mode)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `similarity` |
| `value` | string | Yes | The reference text to compare against |
| `threshold` | number | No | Minimum similarity score 0-1 (default: 0.75) |
| `mode` | string | No | `embedding`, `llm`, or omit for auto (tries embedding first) |
| `embeddingModel` | string | No | Embedding model to use (e.g., `text-embedding-3-large`) |
| `model` | string | No | LLM model for `llm` mode comparison |

Mode Behavior

- Auto (default): Tries embedding first, falls back to LLM if embeddings are unavailable
- Embedding: Uses only embeddings; fails if the embedding function is unavailable
- LLM: Uses only LLM-based comparison; skips embedding entirely
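
The fallback rules above can be sketched as a small dispatcher. Everything named here is hypothetical: `embed` and `llm_score` stand in for whatever provider calls are configured, and cosine similarity is assumed as the embedding comparison, which the source does not confirm.

```python
import math
from typing import Callable, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]  # hypothetical: text -> vector
LlmScorer = Callable[[str, str], float]      # hypothetical: (response, reference) -> 0..1


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_score(
    response: str,
    reference: str,
    mode: Optional[str] = None,  # None means auto
    embed: Optional[Embedder] = None,
    llm_score: Optional[LlmScorer] = None,
) -> float:
    if mode != "llm" and embed is not None:
        return cosine_similarity(embed(response), embed(reference))
    if mode == "embedding":
        raise RuntimeError("embedding mode requested but no embedding function available")
    if llm_score is None:
        raise RuntimeError("no LLM scorer available")
    return llm_score(response, reference)  # 'llm' mode, or auto-mode fallback


def fake_llm(response: str, reference: str) -> float:
    return 0.9  # stand-in for a real grader call


# Auto mode with no embedder falls back to the LLM scorer.
print(similarity_score("Order confirmed", "Your order is confirmed", llm_score=fake_llm))  # 0.9
```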
Examples

Embedding mode with specific model:

```yaml
expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-3-large
  value: "The weather today will be sunny with a high of 75°F"
  threshold: 0.8
```

LLM mode for nuanced comparison:

```yaml
expected:
  type: similarity
  mode: llm
  model: gpt-4o
  value: "A helpful explanation of how photosynthesis works"
  threshold: 0.7
```

Auto mode (default behavior):

```yaml
expected:
  type: similarity
  value: "Thank you for your purchase. Your order has been confirmed."
  threshold: 0.6
```

Using Azure OpenAI embeddings:

```yaml
expected:
  type: similarity
  mode: embedding
  embeddingModel: text-embedding-ada-002
  value: "Customer support response acknowledging the issue"
  threshold: 0.75
```

Inline
Define expression-based matchers directly in YAML. This allows flexible matching logic without writing custom evaluators. Expressions are evaluated safely without using `eval()`.

```yaml
expected:
  type: inline
  expression: 'includes("hello") && length > 10'
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `inline` |
| `expression` | string | Yes | Safe expression to evaluate |
| `value` | string | No | Optional value for comparisons |

Supported Expressions

Length Checks

| Expression | Description |
|---|---|
| `length > N` | Response has more than N characters |
| `length < N` | Response has fewer than N characters |
| `length == N` | Response has exactly N characters |
| `length >= N` | Response has N or more characters |
| `length <= N` | Response has N or fewer characters |

String Checks

| Expression | Description |
|---|---|
| `startsWith("prefix")` | Response starts with the given text |
| `endsWith("suffix")` | Response ends with the given text |
| `includes("text")` | Response contains the given text |
| `!includes("text")` | Response does NOT contain the given text |

Regex Matching

| Expression | Description |
|---|---|
| `matches(/pattern/)` | Response matches the regex pattern |
| `matches(/pattern/i)` | Case-insensitive regex match |
| `matches(/pattern/g)` | Global regex match |

JSON Field Checks

| Expression | Description |
|---|---|
| `json.field == "value"` | JSON field equals string value |
| `json.field == 42` | JSON field equals numeric value |
| `json.field == true` | JSON field equals boolean |
| `json.nested.field == "value"` | Nested JSON field check |

Combined Expressions

| Expression | Description |
|---|---|
| `expr1 && expr2` | Both expressions must pass (AND) |
| `expr1 \|\| expr2` | Either expression can pass (OR) |

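
To make the expression forms concrete, the sketch below hand-translates two inline expressions into plain Python. It is not a parser and not how ArtemisKit evaluates expressions. The function names are hypothetical, and treating a non-JSON response as a failed `json.` check is an assumption.

```python
import json
import re


def product_code_check(response: str) -> bool:
    # startsWith("PROD-") && length >= 10 && length <= 20 && matches(/^[A-Z0-9-]+$/)
    return (
        response.startswith("PROD-")
        and 10 <= len(response) <= 20
        and re.search(r"^[A-Z0-9-]+$", response) is not None
    )


def api_status_check(response: str) -> bool:
    # json.status == "success" && includes("data")
    try:
        status_ok = json.loads(response).get("status") == "success"
    except (json.JSONDecodeError, AttributeError):
        status_ok = False  # assumed: non-JSON or non-object responses fail
    return status_ok and "data" in response


print(product_code_check("PROD-2024-A1"))                     # True
print(product_code_check("PROD-1"))                           # False
print(api_status_check('{"status": "success", "data": []}'))  # True
```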
Examples

Check response length:

```yaml
expected:
  type: inline
  expression: 'length >= 50 && length <= 280'
```

Check string format:

```yaml
expected:
  type: inline
  expression: 'startsWith("{") && endsWith("}")'
```

Check for required content:

```yaml
expected:
  type: inline
  expression: 'includes("thank you") || includes("thanks")'
```

Exclude forbidden content:

```yaml
expected:
  type: inline
  expression: '!includes("error") && !includes("failed")'
```

Regex validation (email format):

```yaml
expected:
  type: inline
  expression: 'matches(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/)'
```

Case-insensitive regex:

```yaml
expected:
  type: inline
  expression: 'matches(/^(yes|no)$/i)'
```

JSON field validation:

```yaml
expected:
  type: inline
  expression: 'json.status == "success"'
```

Nested JSON field:

```yaml
expected:
  type: inline
  expression: 'json.user.role == "admin"'
```

JSON boolean check:

```yaml
expected:
  type: inline
  expression: 'json.active == true'
```

Complex combined validation:

```yaml
expected:
  type: inline
  expression: 'startsWith("PROD-") && length >= 10 && length <= 20 && matches(/^[A-Z0-9-]+$/)'
```

API response validation:

```yaml
expected:
  type: inline
  expression: 'json.status == "success" && includes("data")'
```

Combined
Combine multiple expectations with and/or logic. This lets you build complex evaluation criteria from any of the other expectation types.

```yaml
expected:
  type: combined
  operator: and  # 'and' = all must pass, 'or' = at least one must pass
  expectations:
    - type: contains
      values: ["hello"]
      mode: any
    - type: not_contains
      values: ["error"]
      mode: any
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `combined` |
| `operator` | string | Yes | `and` or `or` |
| `expectations` | array | Yes | Array of expectation objects to combine |

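
The operator semantics map directly onto `all` and `any`. The sketch below treats each sub-expectation as a plain function from response to pass/fail. That modelling, the names, and the short-circuit behavior of the generator are illustrative assumptions.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]  # hypothetical stand-in for one sub-expectation


def combined_passes(response: str, operator: str, checks: Sequence[Check]) -> bool:
    results = (check(response) for check in checks)  # evaluated lazily
    if operator == "and":
        return all(results)  # every sub-expectation must pass
    if operator == "or":
        return any(results)  # at least one sub-expectation must pass
    raise ValueError(f"unknown operator: {operator!r}")


def has_greeting(text: str) -> bool:
    return "hello" in text


def has_no_error(text: str) -> bool:
    return "error" not in text


reply = "hello, an error occurred"
print(combined_passes(reply, "and", [has_greeting, has_no_error]))  # False
print(combined_passes(reply, "or", [has_greeting, has_no_error]))   # True
```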
Examples

Response must contain a greeting AND not contain errors:

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["hello", "hi", "welcome"]
      mode: any
    - type: not_contains
      values: ["error", "failed", "exception"]
      mode: any
```

Response must match a pattern OR contain specific text:

```yaml
expected:
  type: combined
  operator: or
  expectations:
    - type: regex
      pattern: "\\d{3}-\\d{4}"
    - type: contains
      values: ["phone number not available"]
      mode: any
```

Complex validation with multiple criteria:

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["price", "cost"]
      mode: any
    - type: regex
      pattern: "\\$\\d+\\.\\d{2}"
    - type: not_contains
      values: ["unavailable", "out of stock"]
      mode: any
```

Custom
Use a custom evaluator function.

```yaml
expected:
  type: custom
  evaluator: "word-count-validator"
  config:
    minWords: 10
    maxWords: 100
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `custom` |
| `evaluator` | string | Yes | Name of the custom evaluator |
| `config` | object | No | Configuration for the evaluator |

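
The source does not describe how custom evaluators are written or registered, so the following is only a guess at the logic an evaluator named `word-count-validator` might apply to its `config`. The function signature and the whitespace-based word count are hypothetical.

```python
def word_count_validator(response: str, config: dict) -> bool:
    """Hypothetical evaluator: pass when the word count is within the configured bounds."""
    words = len(response.split())  # words are whitespace-separated tokens (assumed)
    minimum = config.get("minWords", 0)
    maximum = config.get("maxWords", float("inf"))
    return minimum <= words <= maximum


config = {"minWords": 3, "maxWords": 5}
print(word_count_validator("one two three four", config))  # True
print(word_count_validator("too short", config))           # False
```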
Best Practices

- Start with `contains`: it's the simplest and most common matcher
- Use `mode: any` when checking for synonyms or variations
- Use `mode: all` when multiple concepts must be present
- Reserve `llm_grader` for complex evaluations: it adds latency and cost
- Use `json_schema` for structured outputs, when your LLM returns JSON
- Set appropriate thresholds: too strict causes false failures, too lenient misses issues

Common Patterns

Checking for synonyms

```yaml
expected:
  type: contains
  values: ["yes", "correct", "right", "affirmative"]
  mode: any
```

Ensuring multiple topics are covered

```yaml
expected:
  type: contains
  values: ["pricing", "features", "support"]
  mode: all
```

Ensuring forbidden content is absent

```yaml
expected:
  type: not_contains
  values: ["password", "api_key", "secret", "credential"]
  mode: any
```

Validating formatted output

```yaml
expected:
  type: regex
  pattern: "^\\d+\\.\\s+.+"  # Numbered list format
```

Quality evaluation

```yaml
expected:
  type: llm_grader
  rubric: "Response is helpful, accurate, and professionally written"
  threshold: 0.8
```

Combining multiple conditions

```yaml
expected:
  type: combined
  operator: and
  expectations:
    - type: contains
      values: ["thank you", "thanks"]
      mode: any
    - type: not_contains
      values: ["error", "cannot", "unable"]
      mode: any
    - type: regex
      pattern: "\\d+"  # Must contain at least one number
```

See Also

- Scenario Format: Full scenario structure
- Run Command: Execute scenarios