# artemiskit run

Run scenario-based evaluations against your LLM.
## Synopsis

```bash
artemiskit run <scenario> [options]
akit run <scenario> [options]
```

## Arguments

| Argument | Description |
|---|---|
| `scenario` | Path to scenario file, directory, or glob pattern |
The `scenario` argument supports:

- Single file: `scenarios/test.yaml`
- Directory: `scenarios/` (runs all `.yaml`/`.yml` files recursively)
- Glob pattern: `scenarios/**/*.yaml`
## Options

| Option | Short | Description | Default |
|---|---|---|---|
| `--provider` | `-p` | LLM provider to use | From config/scenario |
| `--model` | `-m` | Model name | From config/scenario |
| `--output` | `-o` | Output directory for results | `artemis-output` |
| `--verbose` | `-v` | Verbose output | `false` |
| `--tags` | `-t` | Filter test cases by tags (space-separated) | All cases |
| `--save` | | Save results to storage (enabled by default) | `true` |
| `--concurrency` | `-c` | Number of concurrent test cases per scenario | `1` |
| `--parallel` | | Number of scenarios to run in parallel | Sequential |
| `--timeout` | | Timeout per test case (ms) | From config |
| `--retries` | | Number of retries per test case | From config |
| `--config` | | Path to config file | `artemis.config.yaml` |
| `--redact` | | Enable PII/sensitive data redaction | `false` |
| `--redact-patterns` | | Custom redaction patterns (space-separated) | Default patterns |
| `--ci` | | CI mode: machine-readable output, no colors/spinners | `false` |
| `--summary` | | Summary output format: `json`, `text`, or `security` | `text` |
| `--baseline` | | Compare against a baseline and detect regressions | `false` |
| `--threshold` | | Regression threshold (0-1), e.g. `0.05` for 5% | `0.05` |
| `--budget` | | Maximum budget in USD; fail if the estimated cost exceeds it | None |
| `--export` | | Export format: `markdown` or `junit` | None |
| `--export-output` | | Output directory for exports | `./artemis-exports` |
## Redaction Patterns

Built-in patterns:

- `email` — Email addresses
- `phone` — Phone numbers (various formats)
- `credit_card` — Credit card numbers
- `ssn` — Social Security Numbers
- `api_key` — API keys (common formats)
- `ipv4` — IPv4 addresses
- `jwt` — JWT tokens
- `aws_key` — AWS access keys
- `secrets` — Generic secrets (`password=`, `secret=`, etc.)
Custom regex patterns:

Pass regex patterns as plain strings (without delimiters). The `g` flag is added automatically.
```bash
# Mix built-in and custom patterns
akit run scenario.yaml --redact --redact-patterns email phone "\b[A-Z]{2}\d{6}\b"
```
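As an illustration only (this is not ArtemisKit's implementation, and the built-in regexes below are stand-ins), redaction of this kind can be sketched in a few lines of Python, where `re.sub` already replaces every match, mirroring the automatic `g` flag:

```python
import re

# Stand-in regexes for two of the built-in pattern names; the real
# patterns ArtemisKit ships are not reproduced here.
BUILT_INS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\+?\d[\d\s().-]{7,}\d",
}

def redact(text: str, patterns: list[str]) -> str:
    """Replace every match of each pattern; unknown names are raw regexes."""
    for name in patterns:
        regex = BUILT_INS.get(name, name)
        text = re.sub(regex, "[REDACTED]", text)
    return text

print(redact("Mail alice@example.com or ref AB123456",
             ["email", r"\b[A-Z]{2}\d{6}\b"]))
# Mail [REDACTED] or ref [REDACTED]
```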
## Examples

### Basic Run
Section titled “Basic Run”akit run scenarios/qa-test.yamlRun All Scenarios in a Directory
Section titled “Run All Scenarios in a Directory”akit run scenarios/Run with Glob Pattern
Section titled “Run with Glob Pattern”akit run "scenarios/**/*.yaml"Run Scenarios in Parallel
Section titled “Run Scenarios in Parallel”Run 4 scenarios concurrently:
akit run scenarios/ --parallel 4With Provider Override
Section titled “With Provider Override”akit run scenarios/qa-test.yaml -p anthropic -m claude-3-5-sonnet-20241022Filter by Tags
Section titled “Filter by Tags”akit run scenarios/qa-test.yaml --tags regression securitySave Results
Section titled “Save Results”akit run scenarios/qa-test.yaml --save -o ./reportsConcurrent Execution
Section titled “Concurrent Execution”akit run scenarios/qa-test.yaml --concurrency 5Verbose Output
Section titled “Verbose Output”akit run scenarios/qa-test.yaml -vCustom Config
Section titled “Custom Config”akit run scenarios/qa-test.yaml --config ./custom-config.yamlWith Redaction
Section titled “With Redaction”Redact PII from results using built-in patterns:
akit run scenarios/qa-test.yaml --redactWith specific patterns:
akit run scenarios/qa-test.yaml --redact --redact-patterns email phone api_keyCI/CD Integration
### CI/CD Integration

ArtemisKit provides several options for CI/CD pipeline integration:
CI Mode:
```bash
# Machine-readable output, no colors or spinners
akit run scenarios/ --ci
```

CI mode outputs environment variable-style results:
```
ARTEMISKIT_RESULT=PASS
ARTEMISKIT_SCENARIOS_TOTAL=5
ARTEMISKIT_SCENARIOS_PASSED=5
ARTEMISKIT_CASES_TOTAL=25
ARTEMISKIT_CASES_PASSED=24
ARTEMISKIT_CASES_FAILED=1
ARTEMISKIT_SUCCESS_RATE=96.0
ARTEMISKIT_DURATION_MS=12345
```
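In a pipeline script, those `KEY=VALUE` lines can be captured and parsed with a few lines of code. This is a hypothetical helper, shown in Python, using the variable names from the sample above:

```python
def parse_ci_output(text: str) -> dict[str, str]:
    """Collect ARTEMISKIT_* KEY=VALUE lines into a dict."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("ARTEMISKIT_") and "=" in line:
            key, _, value = line.partition("=")
            stats[key] = value
    return stats

sample = """ARTEMISKIT_RESULT=PASS
ARTEMISKIT_CASES_TOTAL=25
ARTEMISKIT_CASES_FAILED=1"""

stats = parse_ci_output(sample)
print(stats["ARTEMISKIT_RESULT"], stats["ARTEMISKIT_CASES_FAILED"])
# PASS 1
```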
JSON Summary:

```bash
# Get full JSON summary for parsing
akit run scenarios/ --summary json
```
Security Summary:

```bash
# Get security-focused summary for compliance reporting
akit run scenarios/ --summary security
```
## Baseline Comparison

Compare runs against a baseline to detect regressions:
```bash
# First, set a baseline
akit baseline set <run-id>

# Then run with baseline comparison
akit run scenarios/qa-test.yaml --baseline
```

With a custom threshold (10% regression tolerance):
```bash
akit run scenarios/qa-test.yaml --baseline --threshold 0.10
```

In CI pipelines:
```bash
# CI mode with baseline comparison
akit run scenarios/ --ci --baseline --threshold 0.05
```

If a regression is detected (the success rate drops by more than the threshold), the command exits with code 1.
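The regression rule reduces to a single comparison. A minimal sketch (not ArtemisKit's actual code) of the check described above:

```python
def is_regression(baseline_rate: float, current_rate: float,
                  threshold: float = 0.05) -> bool:
    """True when the success rate drops more than `threshold` below baseline."""
    return (baseline_rate - current_rate) > threshold

print(is_regression(0.96, 0.92))        # False: 4% drop, within the 5% default
print(is_regression(0.96, 0.90))        # True: 6% drop exceeds the default
print(is_regression(0.96, 0.90, 0.10))  # False: within the 10% tolerance
```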
## Cost Tracking and Budget

ArtemisKit automatically estimates costs based on token usage. Use `--budget` to enforce cost limits:
```bash
# Fail if the estimated cost exceeds $1.00
akit run scenarios/ --budget 1.00
```

Cost is displayed in the summary output:
```
Results: 10/10 passed (100%)
Tokens: 15,234 (prompt: 12,000, completion: 3,234)
Estimated Cost: $0.0234
```

In CI pipelines with budget enforcement:
```bash
# Fail the pipeline if costs exceed the budget
akit run scenarios/ --ci --budget 5.00
```
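Conceptually, `--budget` is a cost gate applied to the token-based estimate. The sketch below uses made-up per-token prices, not ArtemisKit's pricing data:

```python
def estimated_cost(prompt_tokens: int, completion_tokens: int,
                   prompt_price: float = 3e-6,
                   completion_price: float = 15e-6) -> float:
    """Cost estimate from token counts, with illustrative per-token prices."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price

def within_budget(cost: float, budget: float) -> bool:
    return cost <= budget

cost = estimated_cost(12_000, 3_234)  # token counts from the sample above
print(f"${cost:.4f}", within_budget(cost, 1.00))  # $0.0845 True
```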
## Markdown Export

Export results to markdown format for documentation, compliance reports, or audit trails:
```bash
# Export to the default directory (./artemis-exports)
akit run scenarios/qa-test.yaml --export markdown

# Export to a custom directory
akit run scenarios/qa-test.yaml --export markdown --export-output ./compliance-reports
```

The markdown report includes:
- Summary table with pass/fail rates, latency, and cost metrics
- Detailed results for failed test cases
- Configuration used for the run
- Redaction summary (if enabled)
## JUnit XML Export

Export results as JUnit XML for CI/CD integration with Jenkins, GitHub Actions, GitLab CI, and other systems:
```bash
# Export to the default directory (./artemis-exports)
akit run scenarios/qa-test.yaml --export junit

# Export to a custom directory
akit run scenarios/qa-test.yaml --export junit --export-output ./test-results
```

The JUnit XML includes:
- Test suite metadata (run ID, provider, model, success rate)
- Individual test cases with pass/fail status
- Failure details with matcher type and expected values
- System output with actual responses
- Timing information for each test
GitHub Actions example:
```yaml
- name: Run ArtemisKit tests
  run: akit run scenarios/ --export junit --export-output ./test-results

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: test-results/*.xml
```
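The exported files follow the standard JUnit XML shape, so any XML tooling can post-process them. For example (the sample document below is hand-written for illustration, not actual ArtemisKit output):

```python
import xml.etree.ElementTree as ET

# A minimal JUnit-shaped document: a testsuite with testcase children,
# where a nested <failure> element marks a failed case.
sample = """<testsuite name="qa-test" tests="3" failures="1">
  <testcase name="greeting-test" time="0.234"/>
  <testcase name="math-test" time="0.189"/>
  <testcase name="complex-reasoning" time="0.512">
    <failure message="expected values missing"/>
  </testcase>
</testsuite>"""

suite = ET.fromstring(sample)
cases = suite.findall("testcase")
failed = [c.get("name") for c in cases if c.find("failure") is not None]
print(len(cases), failed)  # 3 ['complex-reasoning']
```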
## Exit Codes

| Code | Meaning |
|---|---|
| 0 | All test cases passed (and no regression when `--baseline` is used) |
| 1 | One or more test cases failed, or regression detected |
| 2 | Configuration or runtime error |
## Output

When `--save` is used, ArtemisKit creates files in the output directory:
- `run_manifest.json` — Complete run metadata and results
The manifest includes:
- Run ID and timestamps
- Provider and model used
- All test case results with pass/fail status
- Response latencies and token counts
- Git information (if in a git repo)
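Since the manifest is plain JSON, it is easy to post-process. The field names in this excerpt are hypothetical (the real schema is not spelled out here), so treat this purely as a sketch:

```python
import json

# Hypothetical run_manifest.json excerpt; field names are illustrative only.
manifest = json.loads("""{
  "runId": "run-123",
  "provider": "openai",
  "results": [
    {"case": "greeting-test", "passed": true, "latencyMs": 234},
    {"case": "complex-reasoning", "passed": false, "latencyMs": 512}
  ]
}""")

failed = [r["case"] for r in manifest["results"] if not r["passed"]]
print(failed)  # ['complex-reasoning']
```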
## Example Output

```
Running scenario: qa-test
Provider: openai (gpt-5)

  ✓ greeting-test (234ms)
  ✓ math-test (189ms)
  ✗ complex-reasoning (512ms)
    Expected: contains ["explanation"]
    Got: Response did not contain expected values

Results: 2/3 passed (66.7%)
Total time: 1.2s
```
## See Also

- Scenario Format
- Expectations
- Red Team Command — Security testing with automated attacks
- Compare Command