
Evaluators

Evaluators are the core of ArtemisKit’s testing engine. They take LLM responses and compare them against expectations to determine pass/fail status and compute scores.

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Prompt    │─────▶│   LLM API   │─────▶│  Response   │
└─────────────┘      └─────────────┘      └──────┬──────┘
                                                 │
┌─────────────┐      ┌─────────────┐             │
│   Result    │◀─────│  Evaluator  │◀────────────┘
│ (pass/fail) │      │  (matcher)  │
└─────────────┘      └─────────────┘
  1. Prompt sent — ArtemisKit sends the prompt to the configured LLM
  2. Response received — The LLM generates a response
  3. Evaluator invoked — The appropriate evaluator processes the response
  4. Result produced — Pass/fail status, score, and details returned

All evaluators implement this interface:

interface Evaluator {
  readonly type: string;

  evaluate(
    response: string,
    expected: Expected,
    context?: EvaluatorContext
  ): Promise<EvaluatorResult>;
}

Evaluators receive context that may include:

interface EvaluatorContext {
  client?: ModelClient;  // LLM client for LLM-based evaluation
  testCase?: TestCase;   // Full test case for reference
}

Every evaluator returns:

interface EvaluatorResult {
  passed: boolean;   // Did the test pass?
  score: number;     // 0.0 to 1.0 score
  reason?: string;   // Human-readable explanation
  details?: object;  // Additional metadata
}
ArtemisKit ships with the following built-in evaluators:

| Evaluator    | Type          | Description                             |
| ------------ | ------------- | --------------------------------------- |
| Contains     | `contains`    | Checks if response contains values      |
| Not Contains | `not_contains`| Ensures response excludes values        |
| Exact        | `exact`       | Requires exact string match             |
| Regex        | `regex`       | Matches regular expression patterns     |
| Fuzzy        | `fuzzy`       | Levenshtein distance matching           |
| Similarity   | `similarity`  | Semantic similarity (embedding or LLM)  |
| LLM Grader   | `llm_grader`  | LLM judges against rubric               |
| JSON Schema  | `json_schema` | Validates JSON structure                |
| Combined     | `combined`    | AND/OR logic for multiple expectations  |
| Inline       | `inline`      | Custom JavaScript expressions           |

ArtemisKit automatically selects the evaluator based on the type field:

expected:
  type: contains   # ← Selects ContainsEvaluator
  values: ["hello"]

The evaluator registry maps types to implementations:

const evaluators = {
  'contains': new ContainsEvaluator(),
  'not_contains': new NotContainsEvaluator(),
  'exact': new ExactEvaluator(),
  'regex': new RegexEvaluator(),
  'fuzzy': new FuzzyEvaluator(),
  'similarity': new SimilarityEvaluator(),
  'llm_grader': new LLMGraderEvaluator(),
  'json_schema': new JsonSchemaEvaluator(),
  'combined': new CombinedEvaluator(),
  'inline': new InlineEvaluator(),
  'custom': new CustomEvaluator(),
};
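On top of such a registry, a lookup helper (like the getEvaluator used in the evaluation flow further down) might look like this sketch. The stub entries here stand in for the real evaluator instances:

```typescript
// Stub registry: real entries would be evaluator instances.
const evaluators: Record<string, { type: string }> = {
  contains: { type: 'contains' },
  exact: { type: 'exact' },
};

// Resolve an evaluator by its type string, failing loudly on unknown types.
function getEvaluator(type: string): { type: string } {
  const evaluator = evaluators[type];
  if (!evaluator) {
    throw new Error(`Unknown evaluator type: ${type}`);
  }
  return evaluator;
}
```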

Evaluators use different scoring strategies:

Some evaluators return 0 or 1:

  • exact — 1 if exact match, 0 otherwise
  • contains — 1 if all/any values found, 0 otherwise
  • regex — 1 if pattern matches, 0 otherwise
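The binary strategy can be sketched with a contains-style check — 1 if every expected value appears, 0 otherwise. This is an illustration; the built-in ContainsEvaluator may also support any/all modes and case options:

```typescript
// Binary scoring: 1 if every expected value appears in the response, else 0.
function containsScore(response: string, values: string[]): number {
  return values.every(v => response.includes(v)) ? 1 : 0;
}
```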

Others return values between 0 and 1:

  • fuzzy — Similarity ratio (0.0-1.0)
  • similarity — Semantic similarity score
  • llm_grader — LLM-assigned score based on rubric
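As a sketch of graded scoring, a fuzzy score can be derived from Levenshtein distance: similarity = 1 − distance / max length. The actual fuzzy evaluator may normalize differently:

```typescript
// Classic Levenshtein edit distance via a single-row dynamic program.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,                                // deletion
        dp[j - 1] + 1,                            // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)    // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Similarity ratio in [0.0, 1.0]: identical strings score 1.0.
function fuzzyScore(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}
```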

Many evaluators use thresholds:

expected:
  type: fuzzy
  value: "Hello world"
  threshold: 0.8   # Pass if score >= 0.8
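Applying a threshold reduces to a simple comparison of the graded score; the default of 0.8 in this sketch is an assumption for illustration:

```typescript
// Turn a graded score into a pass/fail decision against a threshold.
// The 0.8 default here is illustrative, not ArtemisKit's actual default.
function applyThreshold(score: number, threshold = 0.8): boolean {
  return score >= threshold;
}
```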

// Simplified evaluation flow
async function evaluateCase(testCase, client) {
  // 1. Send prompt to LLM
  const result = await client.generate({
    prompt: testCase.prompt,
    model: testCase.model,
  });

  // 2. Get appropriate evaluator
  const evaluator = getEvaluator(testCase.expected.type);

  // 3. Evaluate response
  const evalResult = await evaluator.evaluate(
    result.text,
    testCase.expected,
    { client, testCase }
  );

  return {
    caseId: testCase.id,
    response: result.text,
    ...evalResult,
  };
}
// Simplified scenario runner
async function runScenario(scenario, client) {
  const results = [];

  for (const testCase of scenario.cases) {
    const result = await evaluateCase(testCase, client);
    results.push(result);
  }

  return {
    scenario: scenario.name,
    passed: results.every(r => r.passed),
    results,
    stats: computeStats(results),
  };
}

Create custom evaluators for specialized needs:

import type { Evaluator, EvaluatorResult, Expected, EvaluatorContext } from '@artemiskit/core';

class MyCustomEvaluator implements Evaluator {
  readonly type = 'my_custom';

  async evaluate(
    response: string,
    expected: Expected,
    context?: EvaluatorContext
  ): Promise<EvaluatorResult> {
    // Your custom evaluation logic
    const score = this.computeScore(response, expected);

    return {
      passed: score >= (expected.threshold ?? 0.5),
      score,
      reason: `Custom evaluation score: ${score}`,
      details: {
        responseLength: response.length,
      },
    };
  }

  private computeScore(response: string, expected: Expected): number {
    // Custom scoring logic
    return 0.85;
  }
}

Register custom evaluators:

import { registerEvaluator } from '@artemiskit/core';
registerEvaluator('my_custom', new MyCustomEvaluator());

Use in scenarios:

expected:
  type: custom
  evaluator: my_custom
  config:
    threshold: 0.7

Each case result includes:

  • passed — Boolean pass/fail
  • score — Numeric score (0-1)
  • latencyMs — Response time
  • tokens — Token usage (prompt, completion, total)

Each scenario report aggregates:

  • passRate — Percentage of cases that passed
  • avgScore — Average score across cases
  • avgLatency — Average response time
  • totalTokens — Total token usage
  • cost — Estimated API cost
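The scenario runner's computeStats could aggregate these fields roughly as follows. This is a sketch — field names follow the lists above, but the real implementation may differ, and cost is omitted since it depends on per-model pricing:

```typescript
interface CaseResult {
  passed: boolean;
  score: number;
  latencyMs: number;
  tokens: { total: number };
}

// One plausible aggregation of per-case results into scenario stats.
function computeStats(results: CaseResult[]) {
  const n = results.length;
  const sum = (f: (r: CaseResult) => number) =>
    results.reduce((acc, r) => acc + f(r), 0);

  return {
    passRate: n === 0 ? 0 : sum(r => (r.passed ? 1 : 0)) / n,
    avgScore: n === 0 ? 0 : sum(r => r.score) / n,
    avgLatency: n === 0 ? 0 : sum(r => r.latencyMs) / n,
    totalTokens: sum(r => r.tokens.total),
  };
}
```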