# artemiskit compare

Compare metrics between two evaluation runs to detect regressions.
## Synopsis

```shell
artemiskit compare <baseline> <current> [options]
akit compare <baseline> <current> [options]
```

## Arguments
| Argument | Description |
|---|---|
| `baseline` | Run ID of the baseline run |
| `current` | Run ID of the current run |
## Options

| Option | Description | Default |
|---|---|---|
| `--threshold <number>` | Regression threshold (0-1) | `0.05` (5%) |
| `--config <path>` | Path to config file | `artemis.config.yaml` |
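As a rough illustration of what `--threshold` means, the check can be thought of as comparing the success-rate drop between two runs against the threshold value. This is only a sketch: whether artemiskit measures the drop in absolute percentage points or in relative terms is an assumption to verify against your own output.

```shell
# Rough sketch of the regression check, assuming an absolute
# success-rate drop is compared against the threshold.
baseline=0.95
current=0.87
threshold=0.05

drop=$(awk -v b="$baseline" -v c="$current" 'BEGIN { printf "%.2f", b - c }')
exceeds=$(awk -v d="$drop" -v t="$threshold" 'BEGIN { print (d > t) ? 1 : 0 }')

if [ "$exceeds" -eq 1 ]; then
  echo "regression: success rate dropped by $drop (threshold: $threshold)"
fi
```

With these numbers the drop is 0.08, which exceeds the 0.05 threshold, so the comparison would fail.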
## Examples

### Basic Comparison

Compare two runs by their IDs:

```shell
akit compare ar-20260115-abc ar-20260118-def
```

### With Custom Threshold

Allow up to 10% regression before failing:

```shell
akit compare ar-20260115-abc ar-20260118-def --threshold 0.1
```

## Output Formats
### Terminal Output

The compare command displays a formatted comparison table:

```
╔════════════════════════════════════════════════════════════╗
║                     COMPARISON RESULTS                     ║
╠════════════════════════════════════════════════════════════╣
║ Metric            Baseline      Current       Delta        ║
╟────────────────────────────────────────────────────────────╢
║ Success Rate      95.0%         92.0%         -3.00%       ║
║ Median Latency    200ms         180ms         -20.00ms     ║
║ Total Tokens      1,250         1,180         -70.00       ║
╚════════════════════════════════════════════════════════════╝

Baseline: ar-20260115-abc
Current:  ar-20260118-def

✓ No regression detected
```

### HTML Report
Generate a visual comparison report:

```shell
akit compare ar-baseline ar-current --html
```

The HTML report includes:
- Metrics Overview: Side-by-side comparison cards for success rate, latency, tokens
- Change Summary: Badges showing regressions, improvements, unchanged, new, removed cases
- Case Comparison Table: Filterable list of all test cases with change indicators
- Response Diff: Click any case to expand and view baseline vs current responses
### JSON Output

For programmatic access:

```shell
akit compare ar-baseline ar-current --json
```

### Delta Colors
Section titled “Delta Colors”- Green — Improvement (higher success rate, lower latency/tokens)
- Red — Regression (lower success rate, higher latency/tokens)
- Dim — No change
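The `--json` output is convenient for consuming comparison results in scripts. The sketch below uses a hypothetical payload: the field names are assumptions, not a documented schema, so inspect the real output of `akit compare --json` for the actual structure.

```shell
# Hypothetical --json payload, written to a file so the sketch is
# self-contained; the field names are assumptions, not a documented schema.
cat > compare.json <<'EOF'
{"baseline": "ar-20260115-abc", "current": "ar-20260118-def", "regression": false}
EOF

# Extract the (assumed) regression flag with python3's stdlib json module.
regression=$(python3 -c "import json; print(json.load(open('compare.json'))['regression'])")
echo "regression flag: $regression"
```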
## Regression Detection

When the success rate drops beyond the threshold:

```
✗ Regression detected! Success rate dropped by 8.0% (threshold: 5%)
```

## CI/CD Output
Section titled “CI/CD Output”In non-TTY environments (CI/CD pipelines, redirected output), a simplified plain-text format is used:
```
=== COMPARISON RESULTS ===

Success Rate:   95.0% -> 92.0% (-3.00%)
Median Latency: 200ms -> 180ms (-20.00ms)
Total Tokens:   1250 -> 1180 (-70.00)
```

## Exit Codes
| Code | Meaning |
|---|---|
| 0 | No significant regressions (within threshold) |
| 1 | Regressions detected (above threshold) |
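Because the result is encoded in the exit code, a deploy gate only needs to branch on it. In this sketch, `run_compare` is a stand-in function so the snippet runs anywhere; substitute the real `akit compare` invocation in your pipeline.

```shell
#!/bin/sh
# Stand-in for the real "akit compare <baseline> <current>" call;
# here it simulates exit code 1 ("regressions detected").
run_compare() {
  return 1
}

# Branch on the documented exit codes: 0 = within threshold, 1 = regression.
if run_compare; then
  deploy_ok=yes
  echo "no regression; safe to deploy"
else
  deploy_ok=no
  echo "regression detected; blocking deploy" >&2
fi
```

In CI, simply letting the command's non-zero exit status fail the step (as in the GitHub Actions example below) achieves the same gate without explicit branching.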
## CI/CD Integration

Use in GitHub Actions to gate deployments:

```yaml
- name: Run evaluation
  run: akit run scenarios/qa.yaml

- name: Compare with baseline
  run: akit compare ${{ env.BASELINE_RUN_ID }} ${{ env.CURRENT_RUN_ID }} --threshold 0.05
```

## Setting Regression Thresholds
Choose thresholds based on your use case:

| Threshold | Use Case | Description |
|---|---|---|
| 0.01 (1%) | Production-critical | Strict, blocks small regressions |
| 0.05 (5%) | Standard CI/CD | Default, reasonable tolerance |
| 0.10 (10%) | Development branches | Lenient, catches major issues only |
| 0.20 (20%) | Experimental features | Very lenient, major regressions only |
### Example Threshold Selection

```shell
# Production deployment - strict
akit compare $BASELINE $CURRENT --threshold 0.01

# PR checks - standard
akit compare $BASELINE $CURRENT --threshold 0.05

# Feature branch - lenient
akit compare $BASELINE $CURRENT --threshold 0.10
```

## Baseline Strategies
Configure baseline selection in your config:

```yaml
ci:
  failOnRegression: true
  regressionThreshold: 0.05
  baselineStrategy: latest        # 'latest', 'tagged', or 'specific'
  baselineRunId: ar-20260115-abc  # For 'specific' strategy
```

### Baseline Strategy Options
| Strategy | Description | Use Case |
|---|---|---|
| `latest` | Most recent passing run | Continuous improvement tracking |
| `tagged` | Run with specific tag (e.g., `release-v1.0`) | Release comparisons |
| `specific` | Fixed run ID | A/B testing, audits |
## Finding Baselines

Use the history command to find run IDs:

```shell
# List recent runs
akit history

# Filter by scenario
akit history --scenario customer-support

# Get the last successful run
akit history --status passed --limit 1
```

## Error Handling
If a run ID doesn't exist, you'll see a helpful error:

```
┌─────────────────────────────────────────────────────────────┐
│ ✗ Failed to Compare Runs                                    │
├─────────────────────────────────────────────────────────────┤
│ Run not found: ar-nonexistent                               │
│                                                             │
│ Suggestions:                                                │
│ • Check that both run IDs exist                             │
│ • Run "artemiskit history" to see available runs            │
│ • Verify storage configuration in artemis.config.yaml       │
└─────────────────────────────────────────────────────────────┘
```

## See Also
- Run Command — Run evaluations
- History Command — View available runs
- CI/CD Integration — Automate in your pipeline