# Similarity Threshold Configuration
## Overview
The similarity threshold sets the minimum similarity score a vector search result must reach to be returned. Raising it trades recall for precision: fewer results come back, and those that do are closer matches to the query.
## Default Configuration
- **Default Threshold**: `0.85` (85% similarity)
- **Environment Variable**: `VECTOR_SIMILARITY_THRESHOLD`
- **Range**: 0.0 to 1.0 (0% to 100% similarity)
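As a minimal sketch of how the default and range above might be applied when reading `VECTOR_SIMILARITY_THRESHOLD`, the hypothetical helper `parseThreshold` below falls back to `0.85` for missing, non-numeric, or out-of-range values. The fallback behavior is an assumption; the actual service may reject invalid values instead.

```typescript
// Hypothetical helper, not part of the documented service API.
const DEFAULT_THRESHOLD = 0.85;

// Parse a raw environment value into a threshold in [0, 1].
// Assumption: invalid input silently falls back to the default.
function parseThreshold(
  raw: string | undefined,
  fallback: number = DEFAULT_THRESHOLD
): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    return fallback;
  }
  return value;
}
```

In a Node.js service this would typically be called as `parseThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD)`.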
## API Endpoints
### 1. Get Current Threshold
```http
GET /api/pgvector/threshold
```
**Response:**
```json
{
  "threshold": 0.85,
  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
}
```
### 2. Set Threshold
```http
POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}
```
**Response:**
```json
{
  "message": "Similarity threshold updated successfully",
  "threshold": 0.9,
  "previousThreshold": 0.85
}
```
### 3. Advanced Vector Search
```http
POST /api/pgvector/advanced-search
Content-Type: application/json

{
  "query": "diabetes mellitus type 2",
  "limit": 10,
  "category": "ICD10",
  "threshold": 0.90
}
```
## Search Methods
### Standard Vector Search
- Uses cosine similarity
- Default threshold from environment variable
- Good for general use cases
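To illustrate what thresholding on cosine similarity means, here is a small in-memory sketch. The real search runs in the database; the `Candidate` type and `searchWithThreshold` function are hypothetical names used only for this example.

```typescript
// Hypothetical in-memory illustration; the service itself queries the database.
interface Candidate {
  code: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / Math.sqrt(normA * normB);
}

// Keep only candidates at or above the threshold, best match first.
function searchWithThreshold(
  query: number[],
  candidates: Candidate[],
  threshold: number,
  limit: number
): { code: string; similarity: number }[] {
  return candidates
    .map((c) => ({ code: c.code, similarity: cosineSimilarity(query, c.embedding) }))
    .filter((r) => r.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}
```

With a query of `[1, 0]`, a candidate `[1, 1]` scores about 0.71, so it is returned at a threshold of 0.70 but dropped at 0.85.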
### Advanced Vector Search
- Combines cosine and Euclidean similarity metrics
- Weighted scoring: 70% cosine + 30% Euclidean
- Higher precision results
- Recommended for production use
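The weighted scoring above can be sketched as follows. How the service turns a Euclidean distance into a similarity is not stated here, so the `1 / (1 + d)` mapping and the function name `combinedScore` are assumptions for illustration.

```typescript
// Weights taken from this document: 70% cosine, 30% Euclidean.
const COSINE_WEIGHT = 0.7;
const EUCLIDEAN_WEIGHT = 0.3;

// Hypothetical combined score.
// Assumption: Euclidean distance d >= 0 is mapped to a similarity in (0, 1]
// via 1 / (1 + d), so identical vectors (d = 0) score 1.
function combinedScore(cosineSim: number, euclideanDistance: number): number {
  if (euclideanDistance < 0) {
    throw new RangeError("Euclidean distance cannot be negative");
  }
  const euclideanSim = 1 / (1 + euclideanDistance);
  return COSINE_WEIGHT * cosineSim + EUCLIDEAN_WEIGHT * euclideanSim;
}
```

Under this mapping, a result with cosine similarity 0.9 and Euclidean distance 1 scores roughly 0.78, which would fall below a 0.85 threshold even though its cosine similarity alone passes.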
### Hybrid Search
- Combines vector similarity with text search
- Uses threshold from environment variable
- Best balance of semantic and text matching
## Threshold Recommendations
### Medical Coding Use Cases
| Use Case | Recommended Threshold | Description |
| ---------------------------- | --------------------- | --------------------------------------------- |
| **High Precision Diagnosis** | 0.90 - 0.95 | Very strict matching for critical diagnoses |
| **Standard Medical Coding** | 0.85 - 0.90 | Recommended for most medical coding scenarios |
| **General Medical Search** | 0.80 - 0.85 | Good balance between precision and recall |
| **Research & Exploration** | 0.70 - 0.80 | More lenient for research purposes |
### Environment-Specific Settings
#### Production Environment
```bash
VECTOR_SIMILARITY_THRESHOLD=0.85
```
#### Development Environment
```bash
VECTOR_SIMILARITY_THRESHOLD=0.70
```
#### Testing Environment
```bash
VECTOR_SIMILARITY_THRESHOLD=0.75
```
## Implementation Details
### Environment Variable
```bash
# Set in .env file
VECTOR_SIMILARITY_THRESHOLD=0.85
# Or set as system environment variable
export VECTOR_SIMILARITY_THRESHOLD=0.85
```
### Runtime Configuration
```typescript
// Get current threshold
const currentThreshold = pgVectorService.getSimilarityThreshold();
// Set new threshold
pgVectorService.setSimilarityThreshold(0.9);
```
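The service implementation is not shown in this document. A minimal sketch of a getter/setter pair with range validation, using a hypothetical `ThresholdStore` class, might look like this; rejecting out-of-range values with a `RangeError` and returning the previous value from the setter are both assumptions.

```typescript
// Hypothetical stand-in for the threshold handling inside pgVectorService.
class ThresholdStore {
  private threshold: number;

  constructor(initial: number = 0.85) {
    this.threshold = ThresholdStore.validate(initial);
  }

  // Assumption: values outside [0, 1] are rejected rather than clamped.
  private static validate(value: number): number {
    if (!Number.isFinite(value) || value < 0 || value > 1) {
      throw new RangeError("threshold must be between 0.0 and 1.0");
    }
    return value;
  }

  getSimilarityThreshold(): number {
    return this.threshold;
  }

  // Returns the previous threshold so callers can report it,
  // mirroring the `previousThreshold` field in the API response.
  setSimilarityThreshold(value: number): number {
    const previous = this.threshold;
    this.threshold = ThresholdStore.validate(value);
    return previous;
  }
}
```

Note that a threshold changed at runtime this way lives in memory only; whether the real service persists it across restarts is not covered here.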
### SQL Query Optimization
The system automatically optimizes SQL queries to:
- Filter results at database level using threshold
- Order results by similarity score
- Use appropriate vector similarity operators
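The three points above can be sketched as a parameterized query builder. pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`. The table name `medical_codes`, its columns, and the function `buildThresholdQuery` are hypothetical; the actual schema and query text may differ.

```typescript
// Shape accepted by common PostgreSQL clients: SQL text plus positional values.
interface SearchQuery {
  text: string;
  values: unknown[];
}

// Hypothetical query builder; table and column names are assumptions.
function buildThresholdQuery(
  embedding: number[],
  threshold: number,
  limit: number,
  category?: string
): SearchQuery {
  // pgvector accepts vectors as text literals such as "[0.1,0.2,0.3]".
  const vectorLiteral = `[${embedding.join(",")}]`;
  const values: unknown[] = [vectorLiteral, threshold];

  // Filter at the database level: cosine similarity = 1 - cosine distance.
  let where = "1 - (embedding <=> $1::vector) >= $2";
  if (category !== undefined) {
    values.push(category);
    where += ` AND category = $${values.length}`;
  }

  values.push(limit);
  const text = [
    "SELECT code, description, 1 - (embedding <=> $1::vector) AS similarity",
    "FROM medical_codes",
    `WHERE ${where}`,
    // Ascending distance is the same as descending similarity.
    "ORDER BY embedding <=> $1::vector",
    `LIMIT $${values.length}`,
  ].join("\n");

  return { text, values };
}
```

Passing the threshold and category as parameters rather than interpolating them keeps the query safe from injection and lets the same statement be reused across threshold changes.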
## Performance Impact
### Higher Threshold (0.90+)
- ✅ Fewer results to process
- ✅ Higher precision
- ❌ May miss relevant results
- ❌ Can be slower when many candidates must be scanned and discarded before enough pass the filter
### Lower Threshold (0.70 and below)
- ✅ Faster query execution
- ✅ More comprehensive results
- ❌ Lower precision
- ❌ More irrelevant results
### Optimal Range (0.80-0.90)
- ✅ Good balance of precision and performance
- ✅ Suitable for most medical coding scenarios
- ✅ Reasonable query execution time
## Troubleshooting
### Common Issues
1. **No Results Returned**
   - Check if threshold is too high
   - Verify embeddings are generated
   - Check database connection
2. **Too Many Results**
   - Increase threshold value
   - Use advanced search method
   - Add category filters
3. **Performance Issues**
   - Optimize threshold for your use case
   - Use database indexes
   - Consider batch processing
### Debug Commands
```bash
# Check current threshold
curl -X GET http://localhost:3000/api/pgvector/threshold
# Get embedding statistics
curl -X GET http://localhost:3000/api/pgvector/stats
# Test with different thresholds
curl -X POST http://localhost:3000/api/pgvector/advanced-search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "threshold": 0.80}'
```
## Best Practices
1. **Start with Default**: Begin with threshold 0.85
2. **Test Incrementally**: Adjust threshold in small increments (0.05)
3. **Monitor Results**: Evaluate precision vs. recall trade-offs
4. **Environment Specific**: Use different thresholds for different environments
5. **Document Changes**: Keep track of threshold changes and their impact
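Practices 2 and 3 can be combined into a small evaluation loop: score a labeled sample once, then compute precision and recall at each threshold step. This is a sketch under assumptions; the `LabeledResult` shape, the function names, and the idea of a hand-labeled sample are illustrative and not part of the service.

```typescript
// Hypothetical evaluation helpers; not part of the documented service.
interface LabeledResult {
  score: number; // similarity score returned by a search
  relevant: boolean; // manual judgment: is this a correct match?
}

interface SweepRow {
  threshold: number;
  returned: number;
  precision: number;
  recall: number;
}

// Precision = relevant returned / returned; recall = relevant returned / all relevant.
// Convention used here: both are reported as 0 when their denominator is 0.
function evaluateAt(results: LabeledResult[], threshold: number): SweepRow {
  const returned = results.filter((r) => r.score >= threshold);
  const truePositives = returned.filter((r) => r.relevant).length;
  const totalRelevant = results.filter((r) => r.relevant).length;
  return {
    threshold,
    returned: returned.length,
    precision: returned.length === 0 ? 0 : truePositives / returned.length,
    recall: totalRelevant === 0 ? 0 : truePositives / totalRelevant,
  };
}

// Step through thresholds in fixed increments (0.05 by default).
function sweepThresholds(
  results: LabeledResult[],
  start: number = 0.7,
  end: number = 0.95,
  step: number = 0.05
): SweepRow[] {
  const steps = Math.round((end - start) / step);
  const rows: SweepRow[] = [];
  for (let i = 0; i <= steps; i++) {
    // Round to two decimals to avoid floating-point drift such as 0.7999999.
    const threshold = Math.round((start + i * step) * 100) / 100;
    rows.push(evaluateAt(results, threshold));
  }
  return rows;
}
```

Recording the resulting rows alongside each threshold change gives a concrete basis for practice 5.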
## Migration Guide
### From Previous Version
If upgrading from a version without configurable threshold:
1. **Set Environment Variable**:
   ```bash
   VECTOR_SIMILARITY_THRESHOLD=0.85
   ```
2. **Update Search Calls**:
   ```typescript
   // Old way (hardcoded 0.7)
   const oldResults = await service.vectorSearch(query, limit, category, 0.7);
   // New way (uses environment variable)
   const results = await service.vectorSearch(query, limit, category);
   ```
3. **Test New Thresholds**:
   ```bash
   # Test with current threshold
   curl -X GET http://localhost:3000/api/pgvector/threshold
   # Adjust if needed
   curl -X POST http://localhost:3000/api/pgvector/threshold \
     -H "Content-Type: application/json" \
     -d '{"threshold": 0.90}'
   ```