Understanding Semantic Similarity Evaluation for LLMs
Traditional software testing relies on exact matching: if the output equals the expected value, the test passes. LLMs don’t work that way. Ask the same question twice and you’ll get different—but equally valid—responses.
This is where semantic similarity evaluation comes in.
The Problem with Exact Matching
Consider testing a customer support bot. You expect it to explain the return policy.
Expected response:
“You can return items within 30 days of purchase for a full refund.”
Actual responses might be:
“Our return policy allows returns up to 30 days after your purchase date. You’ll receive a complete refund.”
“Items may be returned within thirty days of purchase for a full refund.”
“You have 30 days from purchase to return items and get your money back.”
All three convey the same information. All would fail an exact match test. All should pass a semantic similarity test.
How Semantic Similarity Works
Semantic similarity compares the meaning of two texts, not their exact wording. Here’s how:
1. Text Embedding
Both texts are converted to dense vector representations (embeddings) using a model trained to capture semantic meaning.
"30-day return policy" → [0.2, -0.1, 0.5, ...] (768+ dimensions)"Items returnable within a month" → [0.19, -0.12, 0.48, ...]2. Cosine Similarity
The vectors are compared using cosine similarity, which measures the angle between them:
```
similarity = (A · B) / (||A|| × ||B||)
```

- 1.0 = Identical meaning
- 0.0 = Unrelated
- -1.0 = Opposite meaning (rare in practice)
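The comparison step can be sketched in a few lines of Python. The toy 3-dimensional vectors below stand in for real embeddings; real vectors have hundreds of dimensions, but the math is identical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" from the example above
policy = [0.2, -0.1, 0.5]
paraphrase = [0.19, -0.12, 0.48]

print(round(cosine_similarity(policy, paraphrase), 3))  # → 0.999
```

Near-identical vectors score close to 1.0 even though the surface wording differs.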
3. Threshold Comparison
If similarity exceeds your threshold, the test passes:
```
0.85 similarity > 0.75 threshold → PASS
```

Using Similarity in ArtemisKit
Configure similarity evaluation in your scenario file:
```yaml
cases:
  - id: return-policy-test
    prompt: "What is your return policy?"
    expected:
      type: similarity
      value: "Items can be returned within 30 days for a full refund"
      threshold: 0.75
```

Configuration Options
```yaml
expected:
  type: similarity

  # Required: the reference text to compare against
  value: "Your expected response content"

  # Optional: similarity threshold (default: 0.75)
  threshold: 0.75

  # Optional: evaluation mode - 'embedding', 'llm', or omit for auto
  mode: embedding

  # Optional: embedding model (for embedding mode)
  embeddingModel: text-embedding-3-large

  # Optional: LLM model (for llm mode)
  model: gpt-4o
```

Mode Behavior
- Auto (default): Tries embedding first, falls back to LLM if embeddings unavailable
- Embedding: Uses only embeddings; fails if embedding function unavailable
- LLM: Uses only LLM-based comparison; skips embedding entirely
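The mode dispatch above can be illustrated with a small function. This is a sketch of the decision logic only, not ArtemisKit's actual implementation; `embed` and `llm_judge` are hypothetical callables.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def evaluate_similarity(response, reference, mode="auto", embed=None, llm_judge=None):
    """Score two texts using the configured mode.

    embed:     text -> vector, or None when no embedding function is available
    llm_judge: (response, reference) -> score in [0, 1]
    """
    if mode == "embedding":
        if embed is None:
            raise RuntimeError("embedding mode requires an embedding function")
        return cosine(embed(response), embed(reference))
    if mode == "llm":
        return llm_judge(response, reference)
    # auto: prefer embeddings, fall back to the LLM judge
    if embed is not None:
        return cosine(embed(response), embed(reference))
    return llm_judge(response, reference)
```

In auto mode, passing `embed=None` routes the comparison to the LLM judge; embedding mode raises instead of falling back.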
Choosing the Right Threshold
Threshold selection depends on your use case:
High Threshold (0.85-0.95)
Use when responses should closely match expected content:
- Factual information that must be accurate
- Regulatory disclosures
- Safety-critical information
```yaml
# Legal disclaimer must be accurate
cases:
  - id: investment-risk-disclosure
    prompt: "What are the risks of this investment?"
    expected:
      type: similarity
      value: "Past performance does not guarantee future results. You may lose some or all of your investment."
      threshold: 0.90
```

Medium Threshold (0.70-0.85)
Use for general content where variations are acceptable:
- Customer support responses
- Product descriptions
- General explanations
```yaml
# General explanation, variations okay
cases:
  - id: shipping-info
    prompt: "How does shipping work?"
    expected:
      type: similarity
      value: "Orders ship within 2-3 business days via standard shipping"
      threshold: 0.75
```

Low Threshold (0.55-0.70)
Use when you care about topic relevance more than exact content:
- Open-ended questions
- Creative responses
- General topic matching
```yaml
# Just needs to be about returns, not specific content
cases:
  - id: returns-topic
    prompt: "Tell me about returns"
    expected:
      type: similarity
      value: "Information about returning products and getting refunds"
      threshold: 0.60
```

Combining with Other Evaluators
Similarity works well combined with other evaluators using the combined type:
```yaml
cases:
  - id: return-policy-comprehensive
    prompt: "What is your return policy?"
    expected:
      type: combined
      operator: and
      expectations:
        # Must mention key facts
        - type: contains
          values: ["30 days", "refund"]
          mode: any

        # Must NOT include disclaimers
        - type: not_contains
          values: ["no returns", "final sale"]
          mode: any

        # Overall meaning should match
        - type: similarity
          value: "Return items within 30 days for a full refund"
          threshold: 0.70
```

When NOT to Use Similarity
Similarity isn’t always the right tool:
1. Structured Outputs
For JSON or structured data, use schema validation:
```yaml
expected:
  type: json_schema
  schema:
    type: object
    required: ["status", "message"]
```

2. Specific Values
When exact values matter, use contains:
```yaml
expected:
  type: contains
  values: ["Order #12345"]
  mode: any
```

3. Format Requirements
When format matters more than content, use regex:
```yaml
expected:
  type: regex
  pattern: "^\\d{3}-\\d{3}-\\d{4}$"  # Phone number format
```

4. Nuanced Quality
When subtle quality matters, use LLM grader:
```yaml
expected:
  type: llm_grader
  rubric: "Response should be empathetic and acknowledge the customer's frustration"
  threshold: 0.7
```

Performance Considerations
Similarity evaluation requires embedding API calls:
- Each comparison = 2 embedding calls (response + reference)
- Consider caching for repeated reference texts
- Batch tests when possible
ArtemisKit caches embeddings automatically within a test run.
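If you roll your own harness, a per-run cache is a one-liner with `functools.lru_cache`. In this sketch, `fake_embed` is a stand-in for a real embedding API call; the counter shows the second lookup never hits the "API".

```python
from functools import lru_cache

def make_cached_embedder(embed_fn):
    """Wrap an embedding function so repeated texts are embedded only once."""
    @lru_cache(maxsize=None)
    def cached(text):
        return tuple(embed_fn(text))  # tuples are hashable and safe to cache
    return cached

calls = {"n": 0}
def fake_embed(text):
    calls["n"] += 1
    return [float(len(text)), 0.0]

embed = make_cached_embedder(fake_embed)
embed("30-day return policy")
embed("30-day return policy")  # served from cache
print(calls["n"])  # → 1
```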
Debugging Failed Tests
When similarity tests fail unexpectedly:
1. Check the Actual Similarity Score
```
akit run scenario.yaml --verbose
```

Output includes the actual score:
```
✗ Scenario: return-policy-test
  Similarity: 0.68 (threshold: 0.75)
  Reference: "Return items within 30 days..."
  Actual: "We don't accept returns after 30 days..."
```

2. Adjust Threshold
If the response is acceptable but slightly below threshold, consider lowering it:
```yaml
threshold: 0.65  # Was 0.75
```

3. Refine Reference Text
Sometimes the reference text needs adjustment:
```yaml
# Too specific
value: "Return items within 30 days of purchase for a full refund to your original payment method"

# Better - captures core meaning
value: "30-day return policy with full refunds"
```

Best Practices
- Start with medium thresholds (0.70-0.75) and adjust based on results
- Use concise reference texts — focus on key meaning, not exact wording
- Combine with other evaluators for comprehensive coverage
- Test your thresholds — run against known-good and known-bad responses
- Document threshold choices — explain why each threshold was chosen
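Testing a threshold can be as simple as replaying similarity scores from responses you have already labeled. A minimal sketch (the scores below are made up for illustration):

```python
def check_threshold(threshold, good_scores, bad_scores):
    """A threshold works if every good response passes and every bad one fails."""
    passes_good = all(s >= threshold for s in good_scores)
    rejects_bad = all(s < threshold for s in bad_scores)
    return passes_good and rejects_bad

good = [0.88, 0.81, 0.79]  # scores from known-acceptable responses
bad = [0.62, 0.58, 0.71]   # scores from off-topic or wrong responses

print(check_threshold(0.75, good, bad))  # → True
print(check_threshold(0.60, good, bad))  # → False (bad scores 0.62 and 0.71 would pass)
```

If no single threshold separates the two sets, that is a signal to refine the reference text rather than keep tuning the number.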
Conclusion
Semantic similarity bridges the gap between rigid exact matching and completely subjective evaluation. It lets you test that LLM outputs convey the right meaning without requiring exact wording.
Start with the default threshold (0.75), observe actual similarity scores, and adjust based on your quality requirements.
Ready to secure your LLM?
ArtemisKit is free, open-source, and ready to help you test, secure, and stress-test your AI applications.