artemiskit compare
Compare metrics between two evaluation runs to detect regressions.
Synopsis

    artemiskit compare <baseline> <current> [options]
    akit compare <baseline> <current> [options]

Arguments

| Argument | Description |
|---|---|
| `baseline` | Run ID of the baseline run |
| `current` | Run ID of the current run |
Options

| Option | Description | Default |
|---|---|---|
| `--threshold <number>` | Regression threshold (0-1) | `0.05` (5%) |
| `--config <path>` | Path to config file | `artemis.config.yaml` |
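
As a rough sketch of how the `--threshold` value might be applied, the shell function below flags a regression when the success rate drops by more than the threshold. It assumes the threshold is compared against the absolute drop in success rate, with both rates given as fractions between 0 and 1. The function name `is_regression` is hypothetical and is not part of ArtemisKit; the actual comparison logic may differ.

```shell
# Hypothetical helper, not part of the artemiskit CLI.
# Assumption: a regression is an absolute drop in success rate
# (baseline - current) that is strictly greater than the threshold.
# All three arguments are fractions in the range 0-1.
is_regression() {
  baseline=$1
  current=$2
  threshold=${3:-0.05}   # same default as --threshold
  awk -v b="$baseline" -v c="$current" -v t="$threshold" \
    'BEGIN { exit (((b - c) > t) ? 0 : 1) }'
}

# 95% -> 92% is a 3-point drop, within the default 5% threshold
if is_regression 0.95 0.92; then echo "regression"; else echo "ok"; fi

# 95% -> 87% is an 8-point drop, beyond the default threshold
if is_regression 0.95 0.87; then echo "regression"; else echo "ok"; fi
```

Passing a third argument overrides the default, for example `is_regression 0.95 0.87 0.1` to tolerate a drop of up to 10 points.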
Examples
Basic Comparison
Compare two runs by their IDs:

    akit compare ar-20260115-abc ar-20260118-def

With Custom Threshold
Allow the success rate to drop by up to 10% before the comparison fails:

    akit compare ar-20260115-abc ar-20260118-def --threshold 0.1

Output
The compare command displays a formatted comparison table:

    ╔════════════════════════════════════════════════════════════╗
    ║                     COMPARISON RESULTS                     ║
    ╠════════════════════════════════════════════════════════════╣
    ║ Metric              Baseline      Current      Delta       ║
    ╟────────────────────────────────────────────────────────────╢
    ║ Success Rate        95.0%         92.0%        -3.00%      ║
    ║ Median Latency      200ms         180ms        -20.00ms    ║
    ║ Total Tokens        1,250         1,180        -70.00      ║
    ╚════════════════════════════════════════════════════════════╝

    Baseline: ar-20260115-abc
    Current:  ar-20260118-def

    ✓ No regression detected

Delta Colors
- Green: improvement (higher success rate, lower latency/tokens)
- Red: regression (lower success rate, higher latency/tokens)
- Dim: no change
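
The color rules listed above can be sketched as a small shell function. This is only an illustration: the name `classify_delta` and the metric keys (`success_rate`, `median_latency`, `total_tokens`) are hypothetical, and the sketch assumes success rate is the only metric where a higher value is better.

```shell
# Hypothetical helper mirroring the color rules listed above.
# Assumption: success rate is "higher is better"; latency and tokens
# are "lower is better". The metric keys are illustrative only.
classify_delta() {
  metric=$1
  delta=$2
  awk -v m="$metric" -v d="$delta" 'BEGIN {
    if (d == 0) { print "dim"; exit }
    higher_is_better = (m == "success_rate")
    improved = higher_is_better ? (d > 0) : (d < 0)
    print (improved ? "green" : "red")
  }'
}

classify_delta success_rate -3.00   # success rate fell
classify_delta median_latency -20   # latency fell
classify_delta total_tokens 0       # unchanged
```

With the deltas from the sample table, the success rate change counts as a regression while the lower latency counts as an improvement.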
Regression Detection
When the success rate drops by more than the threshold, the command reports a regression:

    ✗ Regression detected! Success rate dropped by 8.0% (threshold: 5%)

CI/CD Output
In non-TTY environments (CI/CD pipelines, redirected output), a simplified plain-text format is used:

    === COMPARISON RESULTS ===
    Success Rate: 95.0% -> 92.0% (-3.00%)
    Median Latency: 200ms -> 180ms (-20.00ms)
    Total Tokens: 1250 -> 1180 (-70.00)

Exit Codes

| Code | Meaning |
|---|---|
| `0` | No significant regressions (within threshold) |
| `1` | Regressions detected (above threshold) |
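
A script can branch on these exit codes to decide whether to continue. The wrapper below is a minimal sketch: `gate_on_compare` is a hypothetical name, and because only codes 0 and 1 are documented here, any other code is treated as an unexpected failure.

```shell
# Hypothetical wrapper: runs the command given as arguments and reports
# on its exit code. Codes other than 0 and 1 are not documented above,
# so they are treated as unexpected failures.
gate_on_compare() {
  if "$@"; then status=0; else status=$?; fi
  case $status in
    0) echo "No regression: safe to continue" ;;
    1) echo "Regression detected: blocking" ;;
    *) echo "Unexpected exit code: $status" ;;
  esac
  return "$status"
}

# Intended usage (requires akit on PATH and real run IDs):
#   gate_on_compare akit compare "$BASELINE_RUN_ID" "$CURRENT_RUN_ID" --threshold 0.05
```

The wrapper returns the original exit code, so a CI step that calls it still fails when a regression is detected.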
CI/CD Integration
Use the command in a GitHub Actions workflow to gate deployments:

    - name: Run evaluation
      run: akit run scenarios/qa.yaml

    - name: Compare with baseline
      run: akit compare ${{ env.BASELINE_RUN_ID }} ${{ env.CURRENT_RUN_ID }} --threshold 0.05

Baseline Strategies
Configure baseline selection in your config:

    ci:
      failOnRegression: true
      regressionThreshold: 0.05
      baselineStrategy: latest        # 'latest', 'tagged', or 'specific'
      baselineRunId: ar-20260115-abc  # For 'specific' strategy

Error Handling
If a run ID doesn't exist, the command prints an error with suggestions:

    ┌─────────────────────────────────────────────────────────────┐
    │  ✗ Failed to Compare Runs                                   │
    ├─────────────────────────────────────────────────────────────┤
    │  Run not found: ar-nonexistent                              │
    │                                                             │
    │  Suggestions:                                               │
    │  • Check that both run IDs exist                            │
    │  • Run "artemiskit history" to see available runs           │
    │  • Verify storage configuration in artemis.config.yaml      │
    └─────────────────────────────────────────────────────────────┘

See Also
- Run Command: Run evaluations
- History Command: View available runs
- CI/CD Integration: Automate in your pipeline