# artemiskit compare

Compare metrics between two evaluation runs to detect regressions.
## Synopsis

```shell
artemiskit compare <baseline> <current> [options]
akit compare <baseline> <current> [options]
```

## Arguments
| Argument | Description |
|---|---|
| `baseline` | Run ID of the baseline run |
| `current` | Run ID of the current run |
## Options

| Option | Description | Default |
|---|---|---|
| `--threshold <number>` | Regression threshold (0-1) | `0.05` (5%) |
| `--config <path>` | Path to config file | `artemis.config.yaml` |
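As a rough illustration of what `--threshold` means, the check can be thought of as comparing the success-rate drop between two runs against the threshold value. This is only a sketch: whether artemiskit measures the drop in absolute percentage points or in relative terms is an assumption to verify against your own output.

```shell
# Rough sketch of the regression check, assuming an absolute
# success-rate drop is compared against the threshold.
baseline=0.95
current=0.87
threshold=0.05

drop=$(awk -v b="$baseline" -v c="$current" 'BEGIN { printf "%.2f", b - c }')
exceeds=$(awk -v d="$drop" -v t="$threshold" 'BEGIN { print (d > t) ? 1 : 0 }')

if [ "$exceeds" -eq 1 ]; then
  echo "regression: success rate dropped by $drop (threshold: $threshold)"
fi
```

With these numbers the drop is 0.08, which exceeds the 0.05 threshold, so the comparison would fail.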
## Examples

### Basic Comparison

Compare two runs by their IDs:

```shell
akit compare ar-20260115-abc ar-20260118-def
```

### With Custom Threshold

Allow up to 10% regression before failing:

```shell
akit compare ar-20260115-abc ar-20260118-def --threshold 0.1
```

## Output Formats
### Terminal Output

The compare command displays a formatted comparison table:

```
╔════════════════════════════════════════════════════════════╗
║                     COMPARISON RESULTS                     ║
╠════════════════════════════════════════════════════════════╣
║ Metric            Baseline      Current       Delta        ║
╟────────────────────────────────────────────────────────────╢
║ Success Rate      95.0%         92.0%         -3.00%       ║
║ Median Latency    200ms         180ms         -20.00ms     ║
║ Total Tokens      1,250         1,180         -70.00       ║
╚════════════════════════════════════════════════════════════╝

Baseline: ar-20260115-abc
Current:  ar-20260118-def

✓ No regression detected
```

### HTML Report
Generate a visual comparison report:

```shell
akit compare ar-baseline ar-current --html
```

The HTML report includes:
- Metrics Overview: Side-by-side comparison cards for success rate, latency, tokens
- Change Summary: Badges showing regressions, improvements, unchanged, new, removed cases
- Case Comparison Table: Filterable list of all test cases with change indicators
- Response Diff: Click any case to expand and view baseline vs current responses
### JSON Output

For programmatic access:

```shell
akit compare ar-baseline ar-current --json
```

### Delta Colors
Section titled “Delta Colors”- Green — Improvement (higher success rate, lower latency/tokens)
- Red — Regression (lower success rate, higher latency/tokens)
- Dim — No change
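The `--json` output is convenient for consuming comparison results in scripts. The sketch below uses a hypothetical payload: the field names are assumptions, not a documented schema, so inspect the real output of `akit compare --json` for the actual structure.

```shell
# Hypothetical --json payload, written to a file so the sketch is
# self-contained; the field names are assumptions, not a documented schema.
cat > compare.json <<'EOF'
{"baseline": "ar-20260115-abc", "current": "ar-20260118-def", "regression": false}
EOF

# Extract the (assumed) regression flag with python3's stdlib json module.
regression=$(python3 -c "import json; print(json.load(open('compare.json'))['regression'])")
echo "regression flag: $regression"
```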
## Regression Detection

When the success rate drops beyond the threshold:

```
✗ Regression detected! Success rate dropped by 8.0% (threshold: 5%)
```

## CI/CD Output
Section titled “CI/CD Output”In non-TTY environments (CI/CD pipelines, redirected output), a simplified plain-text format is used:
```
=== COMPARISON RESULTS ===

Success Rate:   95.0% -> 92.0% (-3.00%)
Median Latency: 200ms -> 180ms (-20.00ms)
Total Tokens:   1250 -> 1180 (-70.00)
```

## Exit Codes
| Code | Meaning |
|---|---|
| 0 | No significant regressions (within threshold) |
| 1 | Regressions detected (above threshold) |
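Because the result is encoded in the exit code, a deploy gate only needs to branch on it. In this sketch, `run_compare` is a stand-in function so the snippet runs anywhere; substitute the real `akit compare` invocation in your pipeline.

```shell
#!/bin/sh
# Stand-in for the real "akit compare <baseline> <current>" call;
# here it simulates exit code 1 ("regressions detected").
run_compare() {
  return 1
}

# Branch on the documented exit codes: 0 = within threshold, 1 = regression.
if run_compare; then
  deploy_ok=yes
  echo "no regression; safe to deploy"
else
  deploy_ok=no
  echo "regression detected; blocking deploy" >&2
fi
```

In CI, simply letting the command's non-zero exit status fail the step (as in the GitHub Actions example below) achieves the same gate without explicit branching.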
## CI/CD Integration

Use in GitHub Actions to gate deployments:

```yaml
- name: Run evaluation
  run: akit run scenarios/qa.yaml

- name: Compare with baseline
  run: akit compare ${{ env.BASELINE_RUN_ID }} ${{ env.CURRENT_RUN_ID }} --threshold 0.05
```

## Setting Regression Thresholds
Choose thresholds based on your use case:

| Threshold | Use Case | Description |
|---|---|---|
| 0.01 (1%) | Production-critical | Strict, blocks small regressions |
| 0.05 (5%) | Standard CI/CD | Default, reasonable tolerance |
| 0.10 (10%) | Development branches | Lenient, catches major issues only |
| 0.20 (20%) | Experimental features | Very lenient, major regressions only |
### Example Threshold Selection

```shell
# Production deployment - strict
akit compare $BASELINE $CURRENT --threshold 0.01

# PR checks - standard
akit compare $BASELINE $CURRENT --threshold 0.05

# Feature branch - lenient
akit compare $BASELINE $CURRENT --threshold 0.10
```

## Baseline Strategies
Configure baseline selection in your config:

```yaml
ci:
  failOnRegression: true
  regressionThreshold: 0.05
  baselineStrategy: latest        # 'latest', 'tagged', or 'specific'
  baselineRunId: ar-20260115-abc  # For 'specific' strategy
```

### Baseline Strategy Options
| Strategy | Description | Use Case |
|---|---|---|
| `latest` | Most recent passing run | Continuous improvement tracking |
| `tagged` | Run with specific tag (e.g., `release-v1.0`) | Release comparisons |
| `specific` | Fixed run ID | A/B testing, audits |
## Finding Baselines

Use the history command to find run IDs:

```shell
# List recent runs
akit history

# Filter by scenario
akit history --scenario customer-support

# Get the last successful run
akit history --status passed --limit 1
```

## Error Handling
If a run ID doesn't exist, you'll see a helpful error:

```
┌─────────────────────────────────────────────────────────────┐
│ ✗ Failed to Compare Runs                                    │
├─────────────────────────────────────────────────────────────┤
│ Run not found: ar-nonexistent                               │
│                                                             │
│ Suggestions:                                                │
│ • Check that both run IDs exist                             │
│ • Run "artemiskit history" to see available runs            │
│ • Verify storage configuration in artemis.config.yaml       │
└─────────────────────────────────────────────────────────────┘
```

## See Also
- Run Command — Run evaluations
- History Command — View available runs
- CI/CD Integration — Automate in your pipeline