
CI/CD Integration

Automate LLM evaluation as part of your deployment pipeline.

.github/workflows/llm-evaluation.yml

```yaml
name: LLM Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install ArtemisKit
        run: npm install -g @artemiskit/cli
      - name: Run Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: akit run scenarios/regression.yaml --save
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: artemis-output/
```
To catch regressions, compare each run against a stored baseline:

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Baseline
        uses: actions/download-artifact@v4
        with:
          name: baseline-results
          path: ./baseline
        continue-on-error: true
      - name: Install ArtemisKit
        run: npm install -g @artemiskit/cli
      - name: Run Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: akit run scenarios/regression.yaml --save
      - name: Compare with Baseline
        if: hashFiles('baseline/run_manifest.json') != ''
        run: |
          akit compare baseline/run_manifest.json artemis-output/run_manifest.json --strict
      - name: Update Baseline
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: baseline-results
          path: artemis-output/
```

The `continue-on-error: true` and `hashFiles` guard let the first run succeed before any baseline exists. Note that `actions/download-artifact@v4` only sees artifacts from the same workflow run by default; to pull a baseline produced by a previous run, pass its `run-id` and a `github-token`, or persist the baseline with `actions/cache` instead.

`akit` signals its result through its exit code, which CI systems use to pass or fail the build:

| Code | Meaning             | Action     |
|------|---------------------|------------|
| 0    | All tests passed    | Continue   |
| 1    | Tests failed        | Fail build |
| 2    | Configuration error | Fail build |
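The exit codes above can be handled explicitly in a shell step. This is a sketch: `run_eval` is a hypothetical wrapper function, and only the code meanings (0/1/2) come from the table.

```shell
# Hypothetical wrapper: run the evaluation via the command passed as $1
# and map akit's documented exit codes to a CI pass/fail decision.
run_eval() {
  "$@" run scenarios/regression.yaml --save
  case $? in
    0) echo "All tests passed";     return 0 ;;  # continue the pipeline
    1) echo "Tests failed";         return 1 ;;  # fail the build
    2) echo "Configuration error";  return 1 ;;  # fail the build
    *) echo "Unexpected exit code"; return 1 ;;  # treat anything else as failure
  esac
}
```

In a workflow step this would be invoked as `run_eval akit`, turning a configuration error (code 2) into a build failure alongside ordinary test failures.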

Store API keys as repository secrets and expose them via `env`:

```yaml
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Run evaluations on a schedule:

```yaml
on:
  schedule:
    - cron: '0 0 * * *' # Daily at midnight UTC
```
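Triggers can be combined, and adding `workflow_dispatch` (a standard GitHub Actions event) allows manual runs from the Actions tab. A sketch combining the triggers used in this guide:

```yaml
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 0 * * *'  # daily at midnight UTC
  workflow_dispatch:     # allow manual runs from the Actions tab
```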
The same pipeline in GitLab CI:

.gitlab-ci.yml

```yaml
llm-evaluation:
  image: node:20
  script:
    - npm install -g @artemiskit/cli
    - akit run scenarios/regression.yaml --save
  artifacts:
    paths:
      - artemis-output/
    expire_in: 30 days
  variables:
    OPENAI_API_KEY: $OPENAI_API_KEY
```

Define `OPENAI_API_KEY` as a masked CI/CD variable in the project settings so it never appears in job logs.
  1. Store API keys as secrets: never commit keys to the repository.
  2. Use deterministic seeds: pin seeds so runs are reproducible.
  3. Set reasonable timeouts: prevent hanging builds from blocking the pipeline.
  4. Archive results: keep run history for baseline comparison.
  5. Run on PRs: catch regressions before they merge.
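The timeout practice above can be enforced directly in GitHub Actions with `timeout-minutes`, a built-in Actions setting (not an ArtemisKit feature); the limits shown here are illustrative:

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 30        # cancel the whole job if it exceeds 30 minutes
    steps:
      - name: Run Evaluation
        timeout-minutes: 20    # tighter limit for the evaluation step itself
        run: akit run scenarios/regression.yaml --save
```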