# Similarity Threshold Configuration

## Overview

The similarity threshold feature lets you control the precision of vector search results by setting a minimum similarity score that a result must reach to be returned. This ensures that only highly relevant matches are included in search results.

## Default Configuration

- **Default Threshold**: `0.85` (85% similarity)
- **Environment Variable**: `VECTOR_SIMILARITY_THRESHOLD`
- **Range**: 0.0 to 1.0 (0% to 100% similarity)

## API Endpoints

### 1. Get Current Threshold

```http
GET /api/pgvector/threshold
```

**Response:**

```json
{
  "threshold": 0.85,
  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
}
```

### 2. Set Threshold

```http
POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}
```

**Response:**

```json
{
  "message": "Similarity threshold updated successfully",
  "threshold": 0.9,
  "previousThreshold": 0.85
}
```

### 3. Advanced Vector Search

```http
POST /api/pgvector/advanced-search
Content-Type: application/json

{
  "query": "diabetes mellitus type 2",
  "limit": 10,
  "category": "ICD10",
  "threshold": 0.90
}
```

## Search Methods

### Standard Vector Search

- Uses cosine similarity
- Default threshold from environment variable
- Good for general use cases

### Advanced Vector Search

- Combines cosine and euclidean similarity metrics
- Weighted scoring: 70% cosine + 30% euclidean
- Higher precision results
- Recommended for production use

### Hybrid Search

- Combines vector similarity with text search
- Uses threshold from environment variable
- Best balance of semantic and text matching

## Threshold Recommendations

### Medical Coding Use Cases

| Use Case                     | Recommended Threshold | Description                                   |
| ---------------------------- | --------------------- | --------------------------------------------- |
| **High Precision Diagnosis** | 0.90 - 0.95           | Very strict matching for critical diagnoses   |
| **Standard Medical Coding**  | 0.85 - 0.90           | Recommended for most medical coding scenarios |
| **General Medical Search**   | 0.80 - 0.85           | Good balance between precision and recall     |
| **Research & Exploration**   | 0.70 - 0.80           | More lenient for research purposes            |

### Environment-Specific Settings

#### Production Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.85
```

#### Development Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.70
```

#### Testing Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.75
```

## Implementation Details

### Environment Variable

```bash
# Set in .env file
VECTOR_SIMILARITY_THRESHOLD=0.85

# Or set as system environment variable
export VECTOR_SIMILARITY_THRESHOLD=0.85
```

### Runtime Configuration

```typescript
// Get current threshold
const currentThreshold = pgVectorService.getSimilarityThreshold();

// Set new threshold
pgVectorService.setSimilarityThreshold(0.9);
```

### SQL Query Optimization

The system automatically optimizes SQL queries to:

- Filter results at the database level using the threshold
- Order results by similarity score
- Use appropriate vector similarity operators

## Performance Impact

### Higher Threshold (0.90+)

- ✅ Fewer results to process
- ✅ Higher precision
- ❌ May miss relevant results
- ❌ Slower query execution (more filtering)

### Lower Threshold (below 0.70)

- ✅ Faster query execution
- ✅ More comprehensive results
- ❌ Lower precision
- ❌ More irrelevant results

### Optimal Range (0.80 - 0.90)

- ✅ Good balance of precision and performance
- ✅ Suitable for most medical coding scenarios
- ✅ Reasonable query execution time

## Troubleshooting

### Common Issues

1. **No Results Returned**
   - Check whether the threshold is too high
   - Verify that embeddings have been generated
   - Check the database connection

2. **Too Many Results**
   - Increase the threshold value
   - Use the advanced search method
   - Add category filters

3. **Performance Issues**
   - Optimize the threshold for your use case
   - Use database indexes
   - Consider batch processing

### Debug Commands

```bash
# Check current threshold
curl -X GET http://localhost:3000/api/pgvector/threshold

# Get embedding statistics
curl -X GET http://localhost:3000/api/pgvector/stats

# Test with different thresholds
curl -X POST http://localhost:3000/api/pgvector/advanced-search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "threshold": 0.80}'
```

## Best Practices

1. **Start with Default**: Begin with the default threshold of 0.85
2. **Test Incrementally**: Adjust the threshold in small increments (0.05)
3. **Monitor Results**: Evaluate precision vs. recall trade-offs
4. **Environment Specific**: Use different thresholds for different environments
5. **Document Changes**: Keep track of threshold changes and their impact

## Migration Guide

### From Previous Version

If you are upgrading from a version without a configurable threshold:

1. **Set Environment Variable**:

   ```bash
   VECTOR_SIMILARITY_THRESHOLD=0.85
   ```

2. **Update Search Calls**:

   ```typescript
   // Old way (hardcoded 0.7)
   // const results = await service.vectorSearch(query, limit, category, 0.7);

   // New way (uses environment variable)
   const results = await service.vectorSearch(query, limit, category);
   ```

3. **Test New Thresholds**:

   ```bash
   # Test with current threshold
   curl -X GET http://localhost:3000/api/pgvector/threshold

   # Adjust if needed
   curl -X POST http://localhost:3000/api/pgvector/threshold \
     -H "Content-Type: application/json" \
     -d '{"threshold": 0.90}'
   ```
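
## Example: Validating and Applying a Threshold

The documented 0.0 to 1.0 range, the 0.85 default, and the 70% cosine + 30% euclidean weighting of the advanced search can be illustrated with a small standalone sketch. Every function name below (`resolveThreshold`, `combinedScore`, `filterByThreshold`) is hypothetical and not taken from the service code. The conversion of Euclidean distance to a similarity via `1 / (1 + distance)` is also an assumption, since this guide does not state how the service normalizes that metric.

```typescript
// Illustrative sketch only: names and the euclidean normalization are assumptions.

const DEFAULT_THRESHOLD = 0.85;

/** Parse a raw env value; fall back to the default when missing or out of range. */
function resolveThreshold(
  raw: string | undefined,
  fallback: number = DEFAULT_THRESHOLD
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  // Reject NaN, Infinity, and anything outside the documented 0.0 to 1.0 range.
  if (!Number.isFinite(value) || value < 0 || value > 1) return fallback;
  return value;
}

/** Cosine similarity of two equal-length vectors (0 if either has zero norm). */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Assumed mapping from Euclidean distance to a similarity in (0, 1]. */
function euclideanSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let sumSquares = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumSquares += diff * diff;
  }
  return 1 / (1 + Math.sqrt(sumSquares));
}

/** Weighted score: 70% cosine + 30% euclidean, as described for advanced search. */
function combinedScore(a: number[], b: number[]): number {
  return 0.7 * cosineSimilarity(a, b) + 0.3 * euclideanSimilarity(a, b);
}

/** Keep rows at or above the threshold, ordered by descending score. */
function filterByThreshold<T extends { score: number }>(
  rows: T[],
  threshold: number
): T[] {
  return rows
    .filter((row) => row.score >= threshold)
    .sort((x, y) => y.score - x.score);
}
```

In a Node.js process, `resolveThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD)` would yield the configured value, or 0.85 when the variable is unset or invalid. The real service filters at the database level, so `filterByThreshold` only mirrors that behavior in memory for illustration.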