# Similarity Threshold Configuration

## Overview

The similarity threshold sets the minimum similarity score a vector search result must reach to be returned. Raising it trades recall for precision, so only closer matches are included in search results.

## Default Configuration

- **Default Threshold**: `0.85` (85% similarity)
- **Environment Variable**: `VECTOR_SIMILARITY_THRESHOLD`
- **Range**: 0.0 to 1.0 (0% to 100% similarity)
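
The settings above can be read and validated at startup. The following is a minimal sketch, not the service's actual code: `resolveThreshold` is a hypothetical helper, and falling back to the default on invalid input is an assumption (the real service may reject invalid values instead).

```typescript
// Hypothetical helper; not part of the documented service API.
const DEFAULT_THRESHOLD = 0.85;

// Turns the raw VECTOR_SIMILARITY_THRESHOLD string into a usable number.
// Assumption: missing or invalid values fall back to the documented default.
function resolveThreshold(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_THRESHOLD;
  }
  const value = Number(raw);
  // Reject NaN, Infinity, and anything outside the documented 0.0 to 1.0 range.
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    return DEFAULT_THRESHOLD;
  }
  return value;
}
```

In a Node.js process this would typically be called as `resolveThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD)`.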

## API Endpoints

### 1. Get Current Threshold

```http
GET /api/pgvector/threshold
```

**Response:**

```json
{
  "threshold": 0.85,
  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
}
```

### 2. Set Threshold

```http
POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}
```

**Response:**

```json
{
  "message": "Similarity threshold updated successfully",
  "threshold": 0.9,
  "previousThreshold": 0.85
}
```
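
The update logic behind this endpoint might look roughly like the sketch below. `ThresholdStore` and `ThresholdUpdate` are hypothetical names, and rejecting out-of-range values with a `RangeError` is an assumption; only the response fields come from the example above.

```typescript
// Shape of the documented response body.
interface ThresholdUpdate {
  message: string;
  threshold: number;
  previousThreshold: number;
}

// Hypothetical in-memory holder for the runtime threshold.
class ThresholdStore {
  private threshold: number;

  constructor(initial: number = 0.85) {
    this.threshold = initial;
  }

  get(): number {
    return this.threshold;
  }

  // Validates the new value, swaps it in, and reports the previous one.
  set(next: number): ThresholdUpdate {
    if (!Number.isFinite(next) || next < 0 || next > 1) {
      throw new RangeError("threshold must be a number between 0.0 and 1.0");
    }
    const previousThreshold = this.threshold;
    this.threshold = next;
    return {
      message: "Similarity threshold updated successfully",
      threshold: next,
      previousThreshold,
    };
  }
}
```

Note that a value held only in memory like this would be lost on restart, at which point the environment variable applies again.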

### 3. Advanced Vector Search

```http
POST /api/pgvector/advanced-search
Content-Type: application/json

{
  "query": "diabetes mellitus type 2",
  "limit": 10,
  "category": "ICD10",
  "threshold": 0.90
}
```

## Search Methods

### Standard Vector Search

- Uses cosine similarity
- Default threshold from environment variable
- Good for general use cases

### Advanced Vector Search

- Combines cosine similarity with a Euclidean-distance-based score
- Weighted scoring: 70% cosine + 30% Euclidean
- Higher precision results
- Recommended for production use
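
To illustrate the 70/30 weighting, here is a minimal sketch computed in application code. How the service converts Euclidean distance into a similarity is not documented here; the `1 / (1 + distance)` mapping below is an assumption, and all function names are hypothetical.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumption: Euclidean distance d is mapped to a (0, 1] score via 1 / (1 + d).
function euclideanSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let sumSquares = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumSquares += diff * diff;
  }
  return 1 / (1 + Math.sqrt(sumSquares));
}

// Weighted score: 70% cosine + 30% Euclidean-based similarity.
function combinedScore(a: number[], b: number[]): number {
  return 0.7 * cosineSimilarity(a, b) + 0.3 * euclideanSimilarity(a, b);
}
```

Under this mapping, identical vectors score 1.0, and the threshold would be compared against the combined score rather than the raw cosine similarity.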

### Hybrid Search

- Combines vector similarity with text search
- Uses threshold from environment variable
- Best balance of semantic and text matching

## Threshold Recommendations

### Medical Coding Use Cases

| Use Case                     | Recommended Threshold | Description                                   |
| ---------------------------- | --------------------- | --------------------------------------------- |
| **High Precision Diagnosis** | 0.90 - 0.95           | Very strict matching for critical diagnoses   |
| **Standard Medical Coding**  | 0.85 - 0.90           | Recommended for most medical coding scenarios |
| **General Medical Search**   | 0.80 - 0.85           | Good balance between precision and recall     |
| **Research & Exploration**   | 0.70 - 0.80           | More lenient for research purposes            |

### Environment-Specific Settings

#### Production Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.85
```

#### Development Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.70
```

#### Testing Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.75
```

## Implementation Details

### Environment Variable

```bash
# Set in .env file
VECTOR_SIMILARITY_THRESHOLD=0.85

# Or set as system environment variable
export VECTOR_SIMILARITY_THRESHOLD=0.85
```

### Runtime Configuration

```typescript
// Get current threshold
const currentThreshold = pgVectorService.getSimilarityThreshold();

// Set new threshold
pgVectorService.setSimilarityThreshold(0.9);
```

### SQL Query Optimization

The system automatically optimizes SQL queries to:

- Filter results at database level using threshold
- Order results by similarity score
- Use appropriate vector similarity operators
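
The three points above could translate into a parameterized query like the one built below. This is a sketch under assumptions: the table `medical_codes` and the columns `id`, `code`, `category`, and `embedding` are hypothetical, as is the `buildThresholdQuery` helper. The `<=>` operator is pgvector's cosine distance, so cosine similarity is `1 - distance`.

```typescript
interface SearchQuery {
  text: string;
  values: unknown[];
}

// Builds a parameterized pgvector query that filters by threshold in SQL.
// Table and column names are placeholders, not the service's real schema.
function buildThresholdQuery(
  embedding: number[],
  threshold: number,
  limit: number,
  category?: string
): SearchQuery {
  // pgvector accepts vectors in text form, e.g. '[0.1,0.2,0.3]'.
  const vectorLiteral = `[${embedding.join(",")}]`;
  const values: unknown[] = [vectorLiteral, threshold, limit];

  let categoryClause = "";
  if (category !== undefined) {
    values.push(category);
    categoryClause = ` AND category = $${values.length}`;
  }

  const text = [
    "SELECT id, code, 1 - (embedding <=> $1::vector) AS similarity",
    "FROM medical_codes",
    // Filter at the database level using the threshold.
    `WHERE 1 - (embedding <=> $1::vector) >= $2${categoryClause}`,
    // Smallest cosine distance first, i.e. highest similarity first.
    "ORDER BY embedding <=> $1::vector",
    "LIMIT $3",
  ].join("\n");

  return { text, values };
}
```

The returned `text` and `values` pair matches the shape most PostgreSQL clients accept for parameterized queries, which keeps the embedding and threshold out of the SQL string itself.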

## Performance Impact

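### Higher Threshold (0.90+)

- ✅ Fewer results to process
- ✅ Higher precision
- ❌ May miss relevant results
- ❌ Queries may return fewer rows than the requested `limit`, or none at all

### Lower Threshold (below 0.70)

- ✅ Queries are more likely to fill the requested `limit`
- ✅ More comprehensive results
- ❌ Lower precision
- ❌ More irrelevant results

Query execution time depends mainly on the index and `limit` rather than on the threshold value itself, so benchmark on your own data before assuming a speedup or slowdown.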

### Optimal Range (0.80-0.90)

- ✅ Good balance of precision and performance
- ✅ Suitable for most medical coding scenarios
- ✅ Reasonable query execution time

## Troubleshooting

### Common Issues

1. **No Results Returned**
   - Check if threshold is too high
   - Verify embeddings are generated
   - Check database connection

2. **Too Many Results**
   - Increase threshold value
   - Use advanced search method
   - Add category filters

3. **Performance Issues**
   - Optimize threshold for your use case
   - Use database indexes
   - Consider batch processing

### Debug Commands

```bash
# Check current threshold
curl -X GET http://localhost:3000/api/pgvector/threshold

# Get embedding statistics
curl -X GET http://localhost:3000/api/pgvector/stats

# Test with different thresholds
curl -X POST http://localhost:3000/api/pgvector/advanced-search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "threshold": 0.80}'
```

## Best Practices

1. **Start with Default**: Begin with threshold 0.85
2. **Test Incrementally**: Adjust threshold in small increments (0.05)
3. **Monitor Results**: Evaluate precision vs. recall trade-offs
4. **Environment Specific**: Use different thresholds for different environments
5. **Document Changes**: Keep track of threshold changes and their impact
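
One way to test incrementally is to run a search once with a low threshold, then count how many results would survive at each 0.05 step. The sketch below assumes each result carries a `similarity` field; `ScoredResult` and `sweepThresholds` are hypothetical names. It reports result counts only: judging precision still requires reviewing the matches themselves.

```typescript
interface ScoredResult {
  id: string;
  similarity: number;
}

interface SweepRow {
  threshold: number;
  kept: number;
}

// Counts how many results meet each threshold from `start` to `end` inclusive.
function sweepThresholds(
  results: ScoredResult[],
  start: number,
  end: number,
  step: number = 0.05
): SweepRow[] {
  const rows: SweepRow[] = [];
  // Use an integer loop counter to avoid accumulating floating-point error.
  const steps = Math.round((end - start) / step);
  for (let i = 0; i <= steps; i++) {
    const threshold = Number((start + i * step).toFixed(2));
    const kept = results.filter((r) => r.similarity >= threshold).length;
    rows.push({ threshold, kept });
  }
  return rows;
}
```

For example, `sweepThresholds(results, 0.7, 0.9)` yields one row each for 0.7, 0.75, 0.8, 0.85, and 0.9, which makes it easy to record the impact of a change before applying it.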

## Migration Guide

### From Previous Version

If upgrading from a version without configurable threshold:

1. **Set Environment Variable**:

   ```bash
   VECTOR_SIMILARITY_THRESHOLD=0.85
   ```

2. **Update Search Calls**:

   ```typescript
   // Old way (threshold hardcoded to 0.7 at the call site)
   const oldResults = await service.vectorSearch(query, limit, category, 0.7);

   // New way (omit the argument to use VECTOR_SIMILARITY_THRESHOLD)
   const results = await service.vectorSearch(query, limit, category);
   ```

3. **Test New Thresholds**:

   ```bash
   # Test with current threshold
   curl -X GET http://localhost:3000/api/pgvector/threshold

   # Adjust if needed
   curl -X POST http://localhost:3000/api/pgvector/threshold \
     -H "Content-Type: application/json" \
     -d '{"threshold": 0.90}'
   ```