
artemiskit compare

Compare metrics between two evaluation runs to detect regressions.

```sh
artemiskit compare <baseline> <current> [options]

# short alias
akit compare <baseline> <current> [options]
```

| Argument | Description |
| --- | --- |
| `baseline` | Run ID of the baseline run |
| `current` | Run ID of the current run |

| Option | Description | Default |
| --- | --- | --- |
| `--threshold <number>` | Regression threshold (0-1) | `0.05` (5%) |
| `--config <path>` | Path to config file | `artemis.config.yaml` |

Compare two runs by their IDs:

```sh
akit compare ar-20260115-abc ar-20260118-def
```

Allow up to 10% regression before failing:

```sh
akit compare ar-20260115-abc ar-20260118-def --threshold 0.1
```

The compare command displays a formatted comparison table:

```
╔════════════════════════════════════════════════════════════╗
║                     COMPARISON RESULTS                     ║
╠════════════════════════════════════════════════════════════╣
║ Metric            Baseline      Current      Delta         ║
╟────────────────────────────────────────────────────────────╢
║ Success Rate      95.0%         92.0%        -3.00%        ║
║ Median Latency    200ms         180ms        -20.00ms      ║
║ Total Tokens      1,250         1,180        -70.00        ║
╚════════════════════════════════════════════════════════════╝
Baseline: ar-20260115-abc
Current:  ar-20260118-def

✓ No regression detected
```

Generate a visual comparison report:

```sh
akit compare ar-baseline ar-current --html
```

The HTML report includes:

  • Metrics Overview: Side-by-side comparison cards for success rate, latency, tokens
  • Change Summary: Badges showing regressions, improvements, unchanged, new, removed cases
  • Case Comparison Table: Filterable list of all test cases with change indicators
  • Response Diff: Click any case to expand and view baseline vs current responses

For programmatic access:

```sh
akit compare ar-baseline ar-current --json
```

In the terminal, deltas are color-coded:

  • Green — Improvement (higher success rate, lower latency/tokens)
  • Red — Regression (lower success rate, higher latency/tokens)
  • Dim — No change
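The `--json` output can be consumed in scripts. Below is a minimal sketch; the payload field names (`baseline`, `current`, `successRate`, `runId`) are assumptions for illustration, not the documented artemiskit schema:

```python
import json

# Hypothetical --json payload; the real field names may differ.
payload = json.loads("""
{
  "baseline": {"runId": "ar-baseline", "successRate": 0.95},
  "current":  {"runId": "ar-current",  "successRate": 0.92}
}
""")

# Compute the success-rate delta shown in the comparison table.
delta = payload["current"]["successRate"] - payload["baseline"]["successRate"]
print(f"success rate delta: {delta:+.2%}")
```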

When success rate drops beyond the threshold:

```
✗ Regression detected! Success rate dropped by 8.0% (threshold: 5%)
```
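The check above can be sketched in a few lines. This is an illustration of the assumed semantics (the threshold bounds the absolute percentage-point drop in success rate), not artemiskit's actual implementation:

```python
# Minimal sketch, assuming --threshold limits the absolute drop
# in success rate (both expressed as fractions between 0 and 1).
def is_regression(baseline_rate: float, current_rate: float,
                  threshold: float = 0.05) -> bool:
    """True when the success rate drops by more than `threshold`."""
    drop = baseline_rate - current_rate
    return drop > threshold

print(is_regression(0.95, 0.92))  # 3-point drop, within the 5% default
print(is_regression(0.95, 0.87))  # 8-point drop, flagged as a regression
```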

In non-TTY environments (CI/CD pipelines, redirected output), a simplified plain-text format is used:

```
=== COMPARISON RESULTS ===
Success Rate: 95.0% -> 92.0% (-3.00%)
Median Latency: 200ms -> 180ms (-20.00ms)
Total Tokens: 1250 -> 1180 (-70.00)
```
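The plain-text line format can be sketched as follows. The format is inferred from the sample output above; the helper name and tuple layout are illustrative, not artemiskit's internals:

```python
# Render metrics as "Name: base -> current (delta)" lines,
# matching the non-TTY sample output shown above.
def render_plain(metrics):
    """metrics: iterable of (name, baseline, current, unit) tuples."""
    lines = ["=== COMPARISON RESULTS ==="]
    for name, base, cur, unit in metrics:
        delta = cur - base
        lines.append(f"{name}: {base}{unit} -> {cur}{unit} ({delta:+.2f}{unit})")
    return "\n".join(lines)

print(render_plain([
    ("Median Latency", 200, 180, "ms"),
    ("Total Tokens", 1250, 1180, ""),
]))
```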
The exit code makes the result easy to gate on in scripts and CI:

| Code | Meaning |
| --- | --- |
| `0` | No significant regressions (within threshold) |
| `1` | Regressions detected (above threshold) |

Use in GitHub Actions to gate deployments:

```yaml
- name: Run evaluation
  run: akit run scenarios/qa.yaml

- name: Compare with baseline
  run: akit compare ${{ env.BASELINE_RUN_ID }} ${{ env.CURRENT_RUN_ID }} --threshold 0.05
```

Choose thresholds based on your use case:

| Threshold | Use Case | Description |
| --- | --- | --- |
| `0.01` (1%) | Production-critical | Strict, blocks small regressions |
| `0.05` (5%) | Standard CI/CD | Default, reasonable tolerance |
| `0.10` (10%) | Development branches | Lenient, catches major issues only |
| `0.20` (20%) | Experimental features | Very lenient, major regressions only |
```sh
# Production deployment - strict
akit compare $BASELINE $CURRENT --threshold 0.01

# PR checks - standard
akit compare $BASELINE $CURRENT --threshold 0.05

# Feature branch - lenient
akit compare $BASELINE $CURRENT --threshold 0.10
```

Configure baseline selection in your config:

artemis.config.yaml

```yaml
ci:
  failOnRegression: true
  regressionThreshold: 0.05
  baselineStrategy: latest # 'latest', 'tagged', or 'specific'
  baselineRunId: ar-20260115-abc # For 'specific' strategy
```

| Strategy | Description | Use Case |
| --- | --- | --- |
| `latest` | Most recent passing run | Continuous improvement tracking |
| `tagged` | Run with a specific tag (e.g., `release-v1.0`) | Release comparisons |
| `specific` | Fixed run ID | A/B testing, audits |
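The strategy table above can be sketched as a selection function. The run-record fields (`id`, `status`, `tags`) and the `tag` argument are assumptions for illustration, not artemiskit's internals:

```python
# Sketch of baseline selection by ci.baselineStrategy.
# Assumes `runs` is ordered oldest -> newest.
def resolve_baseline(ci_config, runs, tag=None):
    """Return the baseline run ID for the configured strategy."""
    strategy = ci_config.get("baselineStrategy", "latest")
    if strategy == "specific":
        return ci_config["baselineRunId"]      # fixed run ID
    if strategy == "tagged":
        tagged = [r for r in runs if tag in r.get("tags", [])]
        return tagged[-1]["id"] if tagged else None
    # "latest": most recent passing run
    passing = [r for r in runs if r.get("status") == "passed"]
    return passing[-1]["id"] if passing else None

runs = [
    {"id": "ar-1", "status": "passed", "tags": ["release-v1.0"]},
    {"id": "ar-2", "status": "failed", "tags": []},
    {"id": "ar-3", "status": "passed", "tags": []},
]
print(resolve_baseline({"baselineStrategy": "latest"}, runs))
print(resolve_baseline({"baselineStrategy": "tagged"}, runs, tag="release-v1.0"))
```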

Use the history command to find run IDs:

```sh
# List recent runs
akit history

# Filter by scenario
akit history --scenario customer-support

# Get the last successful run
akit history --status passed --limit 1
```

If a run ID doesn’t exist, you’ll see a helpful error:

```
┌─────────────────────────────────────────────────────────────┐
│ ✗ Failed to Compare Runs                                    │
├─────────────────────────────────────────────────────────────┤
│ Run not found: ar-nonexistent                               │
│                                                             │
│ Suggestions:                                                │
│   • Check that both run IDs exist                           │
│   • Run "artemiskit history" to see available runs          │
│   • Verify storage configuration in artemis.config.yaml     │
└─────────────────────────────────────────────────────────────┘
```