
artemiskit compare

Compare metrics between two evaluation runs to detect regressions.

artemiskit compare <baseline> <current> [options]
akit compare <baseline> <current> [options]

Argument    Description
baseline    Run ID of the baseline run
current     Run ID of the current run

Option                 Description                  Default
--threshold <number>   Regression threshold (0-1)   0.05 (5%)
--config <path>        Path to config file          artemis.config.yaml

Compare two runs by their IDs:

akit compare ar-20260115-abc ar-20260118-def

Allow up to 10% regression before failing:

akit compare ar-20260115-abc ar-20260118-def --threshold 0.1

The compare command displays a formatted comparison table:

╔════════════════════════════════════════════════════════════╗
║                     COMPARISON RESULTS                     ║
╠════════════════════════════════════════════════════════════╣
║  Metric            Baseline    Current     Delta           ║
╟────────────────────────────────────────────────────────────╢
║  Success Rate      95.0%       92.0%       -3.00%          ║
║  Median Latency    200ms       180ms       -20.00ms        ║
║  Total Tokens      1,250       1,180       -70.00          ║
╚════════════════════════════════════════════════════════════╝
Baseline: ar-20260115-abc
Current: ar-20260118-def
✓ No regression detected

Delta values in the table are color-coded:

  • Green: improvement (higher success rate, lower latency/tokens)
  • Red: regression (lower success rate, higher latency/tokens)
  • Dim: no change

When the success rate drops by more than the threshold, the command reports a regression:

✗ Regression detected! Success rate dropped by 8.0% (threshold: 5%)
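
The check behind this message can be sketched in a few lines of shell. This is an illustration only, not ArtemisKit's actual implementation: the function name `is_regression` is hypothetical, and it assumes the threshold is compared against the absolute drop in success rate (both expressed as fractions in 0-1), which matches the example outputs on this page but is not confirmed by them.

```shell
# Hypothetical helper, not part of ArtemisKit.
# Assumption: a regression is an absolute drop in success rate that is
# strictly greater than the threshold; all values are fractions in 0-1.
is_regression() {
  baseline=$1
  current=$2
  threshold=${3:-0.05}   # same default as --threshold
  # awk handles the floating-point comparison; exit 0 means "regression".
  awk -v b="$baseline" -v c="$current" -v t="$threshold" \
    'BEGIN { if ((b - c) > t) exit 0; exit 1 }'
}

# Example calls (commented out so sourcing this file has no side effects):
#   is_regression 0.95 0.92 0.05   # drop of 0.03: returns 1 (no regression)
#   is_regression 0.95 0.87 0.05   # drop of 0.08: returns 0 (regression)
```

Under this assumption, the 95.0% to 92.0% run shown earlier passes at the default threshold, while an 8-point drop fails.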

In non-TTY environments (CI/CD pipelines, redirected output), a simplified plain-text format is used:

=== COMPARISON RESULTS ===
Success Rate: 95.0% -> 92.0% (-3.00%)
Median Latency: 200ms -> 180ms (-20.00ms)
Total Tokens: 1250 -> 1180 (-70.00)
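
Because the plain-text format is line-oriented, a CI script can pull out a single value with standard tools. The sketch below assumes each metric line has the shape `Name: <baseline> -> <current> (<delta>)`, as in the sample above; the helper name `metric_delta` is hypothetical, and the format may change between versions.

```shell
# Hypothetical helper: print the delta for one metric, reading the
# plain-text compare output from stdin.
# Assumption: lines look like "Success Rate: 95.0% -> 92.0% (-3.00%)".
# The metric name is used inside a sed pattern, so it must not contain
# regex metacharacters or slashes.
metric_delta() {
  sed -n "s/^$1: .* -> .* (\(.*\))\$/\1/p"
}

# Example (commented out; requires akit to be installed):
#   akit compare ar-20260115-abc ar-20260118-def | metric_delta "Success Rate"
```

Lines that do not match the pattern, such as the `=== COMPARISON RESULTS ===` header, are ignored.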

Exit codes:

Code   Meaning
0      No significant regressions (within threshold)
1      Regressions detected (above threshold)
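
In a shell script, these exit codes can drive a pass/fail decision directly. The wrapper below is a minimal sketch; `gate_on_compare` is a hypothetical name, and it treats any non-zero status as a failure, since codes other than 0 and 1 are not documented here.

```shell
# Hypothetical wrapper: run the given command and report based on its
# exit status (0 = within threshold, non-zero = regression or error).
gate_on_compare() {
  if "$@"; then
    echo "no regression"
  else
    status=$?   # exit status of the command that just failed
    echo "regression or error (exit $status)" >&2
    return "$status"
  fi
}

# Example (commented out; requires akit to be installed):
#   gate_on_compare akit compare ar-20260115-abc ar-20260118-def --threshold 0.05
```

Passing the command as arguments keeps the wrapper independent of the run IDs and options being compared.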

Use in GitHub Actions to gate deployments:

- name: Run evaluation
  run: akit run scenarios/qa.yaml
- name: Compare with baseline
  run: akit compare ${{ env.BASELINE_RUN_ID }} ${{ env.CURRENT_RUN_ID }} --threshold 0.05

Configure baseline selection in your config:

# artemis.config.yaml
ci:
  failOnRegression: true
  regressionThreshold: 0.05
  baselineStrategy: latest         # 'latest', 'tagged', or 'specific'
  baselineRunId: ar-20260115-abc   # For 'specific' strategy

If a run ID doesn’t exist, you’ll see a helpful error:

┌─────────────────────────────────────────────────────────────┐
│ ✗ Failed to Compare Runs                                    │
├─────────────────────────────────────────────────────────────┤
│ Run not found: ar-nonexistent                               │
│                                                             │
│ Suggestions:                                                │
│ • Check that both run IDs exist                             │
│ • Run "artemiskit history" to see available runs            │
│ • Verify storage configuration in artemis.config.yaml       │
└─────────────────────────────────────────────────────────────┘