Similarity Threshold Configuration

Overview

The similarity threshold feature allows you to control the precision of vector search results by setting a minimum similarity score required for results to be returned. This ensures that only highly relevant matches are included in search results.

Default Configuration

  • Default Threshold: 0.85 (85% similarity)
  • Environment Variable: VECTOR_SIMILARITY_THRESHOLD
  • Range: 0.0 to 1.0 (0% to 100% similarity)
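
The defaults above can be sketched as a small parsing helper. This is a minimal illustration, not the service's actual startup code: `parseThreshold` and `DEFAULT_THRESHOLD` are hypothetical names, and rejecting out-of-range values with an error (rather than clamping them) is an assumption.

```typescript
// Hypothetical helper: not part of the documented service API.
const DEFAULT_THRESHOLD = 0.85;

function parseThreshold(raw: string | undefined): number {
  // Unset or empty variable: fall back to the documented default.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_THRESHOLD;
  }
  const value = Number(raw);
  // Reject non-numeric input and anything outside the documented 0.0 - 1.0 range.
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(
      `VECTOR_SIMILARITY_THRESHOLD must be a number between 0.0 and 1.0, got "${raw}"`
    );
  }
  return value;
}
```

In a Node.js process this would typically be called as `parseThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD)`.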

API Endpoints

1. Get Current Threshold

GET /api/pgvector/threshold

Response:

{
  "threshold": 0.85,
  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
}

2. Set Threshold

POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}

Response:

{
  "message": "Similarity threshold updated successfully",
  "threshold": 0.9,
  "previousThreshold": 0.85
}
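
A handler producing the response shape above might look roughly like the following. It is a sketch under assumptions: `ThresholdStore`, `UpdateResponse`, and `handleSetThreshold` are hypothetical names, and the validation rule is inferred from the documented 0.0 to 1.0 range. Only `getSimilarityThreshold` and `setSimilarityThreshold` come from this document.

```typescript
// Hypothetical types and handler; the real controller may differ.
interface ThresholdStore {
  getSimilarityThreshold(): number;
  setSimilarityThreshold(value: number): void;
}

interface UpdateResponse {
  message: string;
  threshold: number;
  previousThreshold: number;
}

function handleSetThreshold(store: ThresholdStore, body: unknown): UpdateResponse {
  const threshold = (body as { threshold?: unknown } | null)?.threshold;
  // Assumed validation: the value must be a number inside the documented range.
  if (
    typeof threshold !== "number" ||
    Number.isNaN(threshold) ||
    threshold < 0 ||
    threshold > 1
  ) {
    throw new RangeError("threshold must be a number between 0.0 and 1.0");
  }
  // Capture the old value before overwriting it so the response can report both.
  const previousThreshold = store.getSimilarityThreshold();
  store.setSimilarityThreshold(threshold);
  return {
    message: "Similarity threshold updated successfully",
    threshold,
    previousThreshold,
  };
}
```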

3. Search with Custom Threshold

POST /api/pgvector/advanced-search
Content-Type: application/json

{
  "query": "diabetes mellitus type 2",
  "limit": 10,
  "category": "ICD10",
  "threshold": 0.90
}
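
The request above carries its own `threshold`. One plausible reading, stated here as an assumption rather than confirmed behavior, is that a per-request value takes precedence over the configured default. The sketch below shows that precedence on an in-memory list; `ScoredResult` and `filterByThreshold` are hypothetical names, and the real service filters at the database level.

```typescript
// Hypothetical result shape; field names are assumptions.
interface ScoredResult {
  code: string;
  similarity: number; // 0.0 - 1.0, higher is more similar
}

function filterByThreshold(
  results: ScoredResult[],
  limit: number,
  defaultThreshold: number,
  requestThreshold?: number
): ScoredResult[] {
  // Assumed precedence: a threshold in the request body overrides the default.
  const threshold = requestThreshold ?? defaultThreshold;
  return results
    .filter((r) => r.similarity >= threshold) // drop matches below the threshold
    .sort((a, b) => b.similarity - a.similarity) // best matches first
    .slice(0, limit); // honor the requested limit
}
```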

Search Methods

1. Vector Search

  • Uses cosine similarity
  • Default threshold from environment variable
  • Good for general use cases

2. Advanced Search

  • Combines cosine and Euclidean similarity metrics
  • Weighted scoring: 70% cosine + 30% Euclidean
  • Higher precision results
  • Recommended for production use

3. Hybrid Search

  • Combines vector similarity with text search
  • Uses threshold from environment variable
  • Best balance of semantic and text matching
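
To make the 70% cosine + 30% Euclidean weighting concrete, here is a minimal sketch. The document does not say how Euclidean distance is turned into a similarity, so the `1 / (1 + distance)` mapping is an assumption, and all function names are hypothetical.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat its similarity as 0.
  return denom === 0 ? 0 : dot / denom;
}

function euclideanSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumSq += diff * diff;
  }
  // Assumed mapping from distance to similarity: 1 at distance 0, approaching 0 as distance grows.
  return 1 / (1 + Math.sqrt(sumSq));
}

function combinedScore(a: number[], b: number[]): number {
  // Weights taken from the text above: 70% cosine, 30% Euclidean.
  return 0.7 * cosineSimilarity(a, b) + 0.3 * euclideanSimilarity(a, b);
}
```

With this mapping, identical vectors score 1.0, and the Euclidean term still contributes a small positive score for orthogonal vectors.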

Threshold Recommendations

Medical Coding Use Cases

| Use Case | Recommended Threshold | Description |
| --- | --- | --- |
| High Precision Diagnosis | 0.90 - 0.95 | Very strict matching for critical diagnoses |
| Standard Medical Coding | 0.85 - 0.90 | Recommended for most medical coding scenarios |
| General Medical Search | 0.80 - 0.85 | Good balance between precision and recall |
| Research & Exploration | 0.70 - 0.80 | More lenient for research purposes |

Environment-Specific Settings

Production Environment

VECTOR_SIMILARITY_THRESHOLD=0.85

Development Environment

VECTOR_SIMILARITY_THRESHOLD=0.70

Testing Environment

VECTOR_SIMILARITY_THRESHOLD=0.75

Implementation Details

Environment Variable

# Set in .env file
VECTOR_SIMILARITY_THRESHOLD=0.85

# Or set as system environment variable
export VECTOR_SIMILARITY_THRESHOLD=0.85

Runtime Configuration

// Get current threshold
const currentThreshold = pgVectorService.getSimilarityThreshold();

// Set new threshold
pgVectorService.setSimilarityThreshold(0.9);

SQL Query Optimization

The system automatically optimizes SQL queries to:

  • Filter results at database level using threshold
  • Order results by similarity score
  • Use appropriate vector similarity operators
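
The three points above could translate into a query like the one built below. This is a sketch, not the query the service actually issues: the table and column names (`medical_codes`, `code`, `description`, `embedding`) and the `buildVectorSearchQuery` helper are hypothetical. It relies on pgvector's `<=>` operator, which returns cosine distance, so `1 - distance` gives cosine similarity.

```typescript
interface SqlQuery {
  text: string;
  values: unknown[];
}

// Hypothetical table and column names; adjust to the real schema.
function buildVectorSearchQuery(
  embedding: number[],
  threshold: number,
  limit: number
): SqlQuery {
  // pgvector accepts vectors as a bracketed text literal such as '[0.1,0.2,0.3]'.
  const vectorLiteral = `[${embedding.join(",")}]`;
  const text = `
    SELECT code, description, 1 - (embedding <=> $1::vector) AS similarity
    FROM medical_codes
    WHERE 1 - (embedding <=> $1::vector) >= $2  -- filter at database level
    ORDER BY embedding <=> $1::vector           -- smallest distance first
    LIMIT $3`;
  return { text, values: [vectorLiteral, threshold, limit] };
}
```

The returned `{ text, values }` object keeps the threshold as a bound parameter instead of interpolating it into the SQL string.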

Performance Impact

Higher Threshold (0.90+)

  • Fewer results to process
  • Higher precision
  • May miss relevant results
  • Slower query execution (more filtering)

Lower Threshold (0.70 and below)

  • Faster query execution
  • More comprehensive results
  • Lower precision
  • More irrelevant results

Optimal Range (0.80-0.90)

  • Good balance of precision and performance
  • Suitable for most medical coding scenarios
  • Reasonable query execution time

Troubleshooting

Common Issues

  1. No Results Returned

    • Check if threshold is too high
    • Verify embeddings are generated
    • Check database connection
  2. Too Many Results

    • Increase threshold value
    • Use advanced search method
    • Add category filters
  3. Performance Issues

    • Optimize threshold for your use case
    • Use database indexes
    • Consider batch processing

Debug Commands

# Check current threshold
curl -X GET http://localhost:3000/api/pgvector/threshold

# Get embedding statistics
curl -X GET http://localhost:3000/api/pgvector/stats

# Test with different thresholds
curl -X POST http://localhost:3000/api/pgvector/advanced-search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "threshold": 0.80}'

Best Practices

  1. Start with Default: Begin with threshold 0.85
  2. Test Incrementally: Adjust threshold in small increments (0.05)
  3. Monitor Results: Evaluate precision vs. recall trade-offs
  4. Environment Specific: Use different thresholds for different environments
  5. Document Changes: Keep track of threshold changes and their impact
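
Practices 2 and 3 can be combined into a small sweep that reports precision and recall at each 0.05 step. The sketch assumes you have a set of search results that a reviewer has labeled as relevant or not; `LabeledResult`, `SweepRow`, and `sweepThresholds` are hypothetical names.

```typescript
interface LabeledResult {
  similarity: number;
  relevant: boolean; // judged by a human reviewer
}

interface SweepRow {
  threshold: number;
  precision: number;
  recall: number;
}

function sweepThresholds(
  results: LabeledResult[],
  start = 0.7,
  end = 0.95,
  step = 0.05
): SweepRow[] {
  const totalRelevant = results.filter((r) => r.relevant).length;
  const rows: SweepRow[] = [];
  // Iterate by index to avoid accumulating floating-point error in the threshold.
  const steps = Math.round((end - start) / step);
  for (let i = 0; i <= steps; i++) {
    const threshold = Number((start + i * step).toFixed(2));
    const kept = results.filter((r) => r.similarity >= threshold);
    const truePositives = kept.filter((r) => r.relevant).length;
    rows.push({
      threshold,
      // By convention, report 1 when nothing is returned or nothing is relevant.
      precision: kept.length === 0 ? 1 : truePositives / kept.length,
      recall: totalRelevant === 0 ? 1 : truePositives / totalRelevant,
    });
  }
  return rows;
}
```

Reading the rows from low to high threshold shows where precision stops improving and recall starts to drop, which is a reasonable place to settle.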

Migration Guide

From Previous Version

If upgrading from a version without configurable threshold:

  1. Set Environment Variable:

    VECTOR_SIMILARITY_THRESHOLD=0.85
    
  2. Update Search Calls:

    // Old way (threshold hardcoded to 0.7)
    // const results = await service.vectorSearch(query, limit, category, 0.7);
    
    // New way (uses environment variable)
    const results = await service.vectorSearch(query, limit, category);
    
  3. Test New Thresholds:

    # Test with current threshold
    curl -X GET http://localhost:3000/api/pgvector/threshold
    
    # Adjust if needed
    curl -X POST http://localhost:3000/api/pgvector/threshold \
      -H "Content-Type: application/json" \
      -d '{"threshold": 0.90}'