
CI/CD Integration

Automate LLM evaluation as part of your deployment pipeline.

.github/workflows/llm-evaluation.yml

```yaml
name: LLM Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install ArtemisKit
        run: npm install -g @artemiskit/cli
      - name: Run Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: akit run scenarios/regression.yaml --save
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: artemis-output/
```
To catch regressions, compare each run against a stored baseline:

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Baseline
        uses: actions/download-artifact@v4
        with:
          name: baseline-results
          path: ./baseline
        continue-on-error: true
      - name: Install ArtemisKit
        run: npm install -g @artemiskit/cli
      - name: Run Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: akit run scenarios/regression.yaml --save
      - name: Compare with Baseline
        if: hashFiles('baseline/run_manifest.json') != ''
        run: |
          akit compare baseline/run_manifest.json artemis-output/run_manifest.json --strict
      - name: Update Baseline
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: baseline-results
          path: artemis-output/
```

The `continue-on-error: true` and `hashFiles` guard let the first run succeed before any baseline exists. Note that `actions/download-artifact@v4` only sees artifacts from the same workflow run by default; to pull a baseline produced by a previous run, pass its `run-id` and a `github-token`, or persist the baseline with `actions/cache` instead.

`akit` signals its result through its exit code, which CI systems use to pass or fail the build:

| Code | Meaning             | Action     |
|------|---------------------|------------|
| 0    | All tests passed    | Continue   |
| 1    | Tests failed        | Fail build |
| 2    | Configuration error | Fail build |
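The exit codes above can be handled explicitly in a shell step. This is a sketch: `run_eval` is a hypothetical wrapper function, and only the code meanings (0/1/2) come from the table.

```shell
# Hypothetical wrapper: run the evaluation via the command passed as $1
# and map akit's documented exit codes to a CI pass/fail decision.
run_eval() {
  "$@" run scenarios/regression.yaml --save
  case $? in
    0) echo "All tests passed";     return 0 ;;  # continue the pipeline
    1) echo "Tests failed";         return 1 ;;  # fail the build
    2) echo "Configuration error";  return 1 ;;  # fail the build
    *) echo "Unexpected exit code"; return 1 ;;  # treat anything else as failure
  esac
}
```

In a workflow step this would be invoked as `run_eval akit`, turning a configuration error (code 2) into a build failure alongside ordinary test failures.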

Store API keys as repository secrets and expose them via `env`:

```yaml
env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Run evaluations on a schedule:

```yaml
on:
  schedule:
    - cron: '0 0 * * *' # Daily at midnight UTC
```
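Triggers can be combined, and adding `workflow_dispatch` (a standard GitHub Actions event) allows manual runs from the Actions tab. A sketch combining the triggers used in this guide:

```yaml
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 0 * * *'  # daily at midnight UTC
  workflow_dispatch:     # allow manual runs from the Actions tab
```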
The same pipeline in GitLab CI:

.gitlab-ci.yml

```yaml
llm-evaluation:
  image: node:20
  script:
    - npm install -g @artemiskit/cli
    - akit run scenarios/regression.yaml --save
  artifacts:
    paths:
      - artemis-output/
    expire_in: 30 days
  variables:
    OPENAI_API_KEY: $OPENAI_API_KEY
```

Define `OPENAI_API_KEY` as a masked CI/CD variable in the project settings so it never appears in job logs.
  1. Store API keys as secrets: never commit keys to the repository.
  2. Use deterministic seeds: pin seeds so runs are reproducible.
  3. Set reasonable timeouts: prevent hanging builds from blocking the pipeline.
  4. Archive results: keep run history for baseline comparison.
  5. Run on PRs: catch regressions before they merge.
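The timeout practice above can be enforced directly in GitHub Actions with `timeout-minutes`, a built-in Actions setting (not an ArtemisKit feature); the limits shown here are illustrative:

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 30        # cancel the whole job if it exceeds 30 minutes
    steps:
      - name: Run Evaluation
        timeout-minutes: 20    # tighter limit for the evaluation step itself
        run: akit run scenarios/regression.yaml --save
```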