artemiskit run

Run scenario-based evaluations against your LLM.

artemiskit run <scenario> [options]
akit run <scenario> [options]
| Argument | Description |
| --- | --- |
| scenario | Path to a scenario file, directory, or glob pattern |

The scenario argument supports:

  • Single file: scenarios/test.yaml
  • Directory: scenarios/ (runs all .yaml/.yml files recursively)
  • Glob pattern: scenarios/**/*.yaml (runs every file matching the pattern)
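For intuition, the resolution of these three cases can be pictured with a short Python sketch; the function name and exact ordering are illustrative assumptions, not ArtemisKit internals:

```python
from glob import glob
from pathlib import Path

def resolve_scenarios(arg: str) -> list[str]:
    """Resolve a scenario argument into YAML files, mirroring the
    file / directory / glob cases described above (illustrative only)."""
    p = Path(arg)
    if p.is_file():
        return [str(p)]
    if p.is_dir():
        # Directories are searched recursively for .yaml/.yml files
        return sorted(str(f) for f in p.rglob("*") if f.suffix in (".yaml", ".yml"))
    # Anything else is treated as a glob pattern
    return sorted(glob(arg, recursive=True))
```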
| Option | Short | Description | Default |
| --- | --- | --- | --- |
| --provider | -p | LLM provider to use | From config/scenario |
| --model | -m | Model name | From config/scenario |
| --output | -o | Output directory for results | artemis-output |
| --verbose | -v | Verbose output | false |
| --tags | -t | Filter test cases by tags (space-separated) | All cases |
| --save | | Save results to storage (enabled by default) | true |
| --concurrency | -c | Number of concurrent test cases per scenario | 1 |
| --parallel | | Number of scenarios to run in parallel | Sequential |
| --timeout | | Timeout per test case (ms) | From config |
| --retries | | Number of retries per test case | From config |
| --config | | Path to config file | artemis.config.yaml |
| --redact | | Enable PII/sensitive data redaction | false |
| --redact-patterns | | Custom redaction patterns (space-separated) | Default patterns |
| --ci | | CI mode: machine-readable output, no colors/spinners | false |
| --summary | | Summary output format: json, text, or security | text |
| --baseline | | Compare against baseline and detect regressions | false |
| --threshold | | Regression threshold (0-1), e.g. 0.05 for 5% | 0.05 |
| --budget | | Maximum budget in USD; fail if estimated cost exceeds it | None |
| --export | | Export format: markdown or junit | None |
| --export-output | | Output directory for exports | ./artemis-exports |

Built-in patterns (usable as names with --redact-patterns):

  • email — Email addresses
  • phone — Phone numbers (various formats)
  • credit_card — Credit card numbers
  • ssn — Social Security Numbers
  • api_key — API keys (common formats)
  • ipv4 — IPv4 addresses
  • jwt — JWT tokens
  • aws_key — AWS access keys
  • secrets — Generic secrets (password=, secret=, etc.)

Custom regex patterns:

Pass regex patterns as plain strings (without delimiters). The g flag is added automatically.

# Mix built-in and custom patterns
akit run scenario.yaml --redact --redact-patterns email phone "\b[A-Z]{2}\d{6}\b"
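To build intuition for how such redaction behaves, here is a minimal Python sketch; the built-in regexes and the [REDACTED] placeholder are assumptions for illustration, not ArtemisKit's actual implementation:

```python
import re

# Illustrative stand-ins for two of the built-in patterns above;
# ArtemisKit's real regexes may differ.
BUILTINS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}

def redact(text: str, patterns: list[str]) -> str:
    """Apply each pattern: known names resolve to a built-in regex,
    anything else is used as a raw regex (like --redact-patterns)."""
    for p in patterns:
        text = re.sub(BUILTINS.get(p, p), "[REDACTED]", text)
    return text
```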
Run a single scenario file:

akit run scenarios/qa-test.yaml

Run every scenario in a directory:

akit run scenarios/

Run scenarios matching a glob pattern:

akit run "scenarios/**/*.yaml"

Run 4 scenarios concurrently:

akit run scenarios/ --parallel 4
Specify the provider and model:

akit run scenarios/qa-test.yaml -p anthropic -m claude-3-5-sonnet-20241022

Run only test cases tagged regression or security:

akit run scenarios/qa-test.yaml --tags regression security

Save results to a custom output directory:

akit run scenarios/qa-test.yaml --save -o ./reports

Run five test cases concurrently within a scenario:

akit run scenarios/qa-test.yaml --concurrency 5

Enable verbose output:

akit run scenarios/qa-test.yaml -v

Use a custom config file:

akit run scenarios/qa-test.yaml --config ./custom-config.yaml

Redact PII from results using built-in patterns:

akit run scenarios/qa-test.yaml --redact

With specific patterns:

akit run scenarios/qa-test.yaml --redact --redact-patterns email phone api_key

ArtemisKit provides several options for CI/CD pipeline integration:

CI Mode:

# Machine-readable output, no colors or spinners
akit run scenarios/ --ci

CI mode prints results as environment-variable-style key=value pairs:

ARTEMISKIT_RESULT=PASS
ARTEMISKIT_SCENARIOS_TOTAL=5
ARTEMISKIT_SCENARIOS_PASSED=5
ARTEMISKIT_CASES_TOTAL=25
ARTEMISKIT_CASES_PASSED=24
ARTEMISKIT_CASES_FAILED=1
ARTEMISKIT_SUCCESS_RATE=96.0
ARTEMISKIT_DURATION_MS=12345
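A CI step can consume these lines directly; the parser below is a sketch that assumes exactly the key=value shape shown above:

```python
def parse_ci_output(output: str) -> dict[str, str]:
    """Collect ARTEMISKIT_* key=value lines into a dict."""
    results = {}
    for line in output.splitlines():
        if line.startswith("ARTEMISKIT_") and "=" in line:
            key, _, value = line.partition("=")
            results[key] = value
    return results
```

A pipeline could then fail the job whenever `parse_ci_output(...)["ARTEMISKIT_RESULT"]` is not `PASS`.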

JSON Summary:

# Get full JSON summary for parsing
akit run scenarios/ --summary json

Security Summary:

# Get security-focused summary for compliance reporting
akit run scenarios/ --summary security

Compare runs against a baseline to detect regressions:

# First, set a baseline
akit baseline set <run-id>
# Then run with baseline comparison
akit run scenarios/qa-test.yaml --baseline

With custom threshold (10% regression tolerance):

akit run scenarios/qa-test.yaml --baseline --threshold 0.10

In CI pipelines:

# CI mode with baseline comparison
akit run scenarios/ --ci --baseline --threshold 0.05

If a regression is detected (success rate drops more than the threshold), the command exits with code 1.
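The comparison described above boils down to a drop in success rate; a minimal sketch (ArtemisKit's exact arithmetic, e.g. any rounding, is an assumption here):

```python
def is_regression(baseline_rate: float, current_rate: float,
                  threshold: float = 0.05) -> bool:
    """True when the success rate drops by more than `threshold`.
    Rates are fractions, e.g. 0.96 for a 96% success rate."""
    return (baseline_rate - current_rate) > threshold
```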

ArtemisKit automatically estimates costs based on token usage. Use --budget to enforce cost limits:

# Fail if estimated cost exceeds $1.00
akit run scenarios/ --budget 1.00

Cost is displayed in the summary output:

Results: 10/10 passed (100%)
Tokens: 15,234 (prompt: 12,000, completion: 3,234)
Estimated Cost: $0.0234

In CI pipelines with budget enforcement:

# Fail the pipeline if costs exceed budget
akit run scenarios/ --ci --budget 5.00
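Conceptually, the budget check multiplies token counts by per-token prices and compares the total against --budget; the per-1K-token prices used in this sketch are placeholders, not real model pricing:

```python
def estimated_cost(prompt_tokens: int, completion_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimated cost in USD from token counts and per-1K-token prices."""
    return ((prompt_tokens / 1000) * price_in_per_1k
            + (completion_tokens / 1000) * price_out_per_1k)

def within_budget(cost: float, budget: float) -> bool:
    """True when the estimated cost stays within the budget."""
    return cost <= budget
```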

Export results to markdown format for documentation, compliance reports, or audit trails:

# Export to default directory (./artemis-exports)
akit run scenarios/qa-test.yaml --export markdown
# Export to custom directory
akit run scenarios/qa-test.yaml --export markdown --export-output ./compliance-reports

The markdown report includes:

  • Summary table with pass/fail rates, latency, and cost metrics
  • Detailed results for failed test cases
  • Configuration used for the run
  • Redaction summary (if enabled)

Export results as JUnit XML for CI/CD integration with Jenkins, GitHub Actions, GitLab CI, and other systems:

# Export to default directory (./artemis-exports)
akit run scenarios/qa-test.yaml --export junit
# Export to custom directory
akit run scenarios/qa-test.yaml --export junit --export-output ./test-results

The JUnit XML includes:

  • Test suite metadata (run ID, provider, model, success rate)
  • Individual test cases with pass/fail status
  • Failure details with matcher type and expected values
  • System output with actual responses
  • Timing information for each test

GitHub Actions example:

- name: Run ArtemisKit tests
  run: akit run scenarios/ --export junit --export-output ./test-results

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: test-results/*.xml
The command exits with one of the following codes:

| Code | Meaning |
| --- | --- |
| 0 | All test cases passed (and no regression if --baseline was used) |
| 1 | One or more test cases failed, or a regression was detected |
| 2 | Configuration or runtime error |
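Wrapper scripts can branch on these codes; a small sketch of the mapping:

```python
def classify_exit(code: int) -> str:
    """Map an artemiskit exit code to its meaning, per the table above."""
    if code == 0:
        return "passed"
    if code == 1:
        return "failures or regression"
    if code == 2:
        return "configuration or runtime error"
    return "unknown"
```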

When --save is used, ArtemisKit creates files in the output directory:

  • run_manifest.json — Complete run metadata and results

The manifest includes:

  • Run ID and timestamps
  • Provider and model used
  • All test case results with pass/fail status
  • Response latencies and token counts
  • Git information (if in a git repo)
Example output:

Running scenario: qa-test
Provider: openai (gpt-5)

✓ greeting-test (234ms)
✓ math-test (189ms)
✗ complex-reasoning (512ms)
  Expected: contains ["explanation"]
  Got: Response did not contain expected values

Results: 2/3 passed (66.7%)
Total time: 1.2s