Getting Started with LLM Testing: A Practical Guide
Testing LLM applications is different from traditional software testing. Outputs are non-deterministic, edge cases are infinite, and traditional assertion-based testing doesn’t work. This guide will show you how to set up effective LLM testing with ArtemisKit.
Prerequisites
Before starting, you’ll need:
- Node.js 18+ installed
- An OpenAI API key (or another supported provider)
- A terminal/command line
Step 1: Install ArtemisKit
Install the CLI globally:
npm install -g @artemiskit/cliVerify the installation:
akit --versionStep 2: Configure Your Provider
Set your API key as an environment variable:
export OPENAI_API_KEY=sk-your-key-hereFor persistent configuration, add it to your shell profile (.bashrc, .zshrc, etc.) or use a .env file.
Step 3: Create Your First Scenario
Create a file called scenario.yaml:
name: customer-support-botdescription: Tests for our customer support chatbotprovider: openaimodel: gpt-4o
cases: # Basic functionality test - id: business-hours prompt: "What are your business hours?" expected: type: contains values: - "Monday" - "Friday" mode: any
# Policy question - id: return-policy prompt: "How do I return a product?" expected: type: combined operator: and expectations: - type: contains values: ["30 days"] mode: any - type: not_contains values: ["I don't know", "I'm not sure"] mode: any
# Edge case handling - id: gibberish-input prompt: "asdfghjkl random gibberish" expected: type: llm_grader rubric: "Response should politely ask for clarification" threshold: 0.7Step 4: Run Your First Test
Execute the test:
akit run scenario.yamlYou’ll see output like:
Running scenario: customer-support-botProvider: openai (gpt-4o)
✓ business-hours (234ms) ✓ return-policy (189ms) ✓ gibberish-input (312ms)
Results: 3/3 passed (100%)Run ID: ar-20260130-abc123Step 5: Understanding Evaluators
ArtemisKit provides 11 evaluator types. Here are the most common:
Contains/Not Contains
Simple keyword matching:
expected: type: contains values: - "return policy" - "refund" mode: any # 'any' = at least one, 'all' = all requiredexpected: type: not_contains values: - "I don't know" - "I'm not sure" mode: anySimilarity
Check if the meaning matches (not exact words) using embeddings or LLM:
expected: type: similarity value: "Our return policy allows returns within 30 days of purchase" threshold: 0.75 mode: embedding # or 'llm' for LLM-based comparisonLLM Grader
Use another model to evaluate quality:
expected: type: llm_grader rubric: | Evaluate the response based on: - Is it helpful and accurate? - Is it professionally written? threshold: 0.7JSON Schema
Validate structured outputs:
expected: type: json_schema schema: type: object required: - answer - confidence properties: answer: type: string confidence: type: number minimum: 0 maximum: 1Step 6: Adding Security Tests
Once your functional tests pass, add security red-teaming:
akit redteam scenario.yaml --count 50This generates 50 adversarial prompts testing for:
- Prompt injection
- Jailbreak attempts
- Role spoofing
- Encoding bypasses
Review the report to see if any attacks succeeded.
Step 7: CI/CD Integration
Add ArtemisKit to your CI pipeline. Here’s a GitHub Actions example:
name: LLM Tests
on: [push, pull_request]
jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4
- name: Setup Node.js uses: actions/setup-node@v4 with: node-version: '20'
- name: Install ArtemisKit run: npm install -g @artemiskit/cli
- name: Run Quality Tests run: akit run scenario.yaml env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Run Security Tests run: akit redteam scenario.yaml --count 20 env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}Step 8: Comparing Runs
Compare two test runs to detect regressions:
akit compare ar-baseline-id ar-current-idThis shows metrics comparison and flags any regressions in success rate.
Best Practices
1. Test at Multiple Levels
- Unit tests: Individual prompt/response pairs
- Integration tests: Full conversation flows
- Security tests: Adversarial inputs
- Performance tests: Behavior under load
2. Use Realistic Scenarios
Don’t just test happy paths. Include:
- Ambiguous inputs
- Edge cases
- Malformed requests
- Multi-language inputs (if applicable)
3. Set Thresholds Appropriately
- Start with loose thresholds
- Tighten as your model improves
- Document acceptable failure rates
4. Run Tests Continuously
- Every PR should pass LLM tests
- Security tests on every deployment
- Performance tests before major releases
Next Steps
Now that you have basic testing working:
- Add more scenarios — Cover more of your application’s functionality
- Customize evaluators — Create domain-specific checks
- Integrate stress testing — Know your performance limits
- Set up monitoring — Track quality over time
Resources:
Ready to secure your LLM?
ArtemisKit is free, open-source, and ready to help you test, secure, and stress-test your AI applications.