Try to fix similarity and add seed for master Excel ICD
73
docs/ENVIRONMENT_VARIABLES.md
Normal file
@@ -0,0 +1,73 @@
# Environment Variables

## Database Configuration
- `DATABASE_URL`: PostgreSQL connection string
  - Example: `postgresql://username:password@localhost:5432/claim_guard_db`

## OpenAI Configuration
- `OPENAI_API_KEY`: Your OpenAI API key for embeddings
- `OPENAI_API_MODEL`: OpenAI model for embeddings (default: `text-embedding-ada-002`)

## Vector Search Configuration
- `VECTOR_SIMILARITY_THRESHOLD`: Minimum similarity score for vector search results (default: `0.85`)
  - Range: 0.0 to 1.0
  - Higher values mean stricter matching
  - Recommended: 0.85 for production, 0.7 for development

## Application Configuration
- `PORT`: Application port (default: `3000`)
- `NODE_ENV`: Environment mode (`development` or `production`)

## Example .env file
```bash
# Database
DATABASE_URL="postgresql://username:password@localhost:5432/claim_guard_db"

# OpenAI
OPENAI_API_KEY="your-openai-api-key-here"
OPENAI_API_MODEL="text-embedding-ada-002"

# Vector Search
VECTOR_SIMILARITY_THRESHOLD=0.85

# App
PORT=3000
NODE_ENV=development
```
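The variables above could be read and validated at startup roughly as follows. This is a minimal sketch, not the project's actual loader: `AppConfig`, `loadConfig`, `parseThreshold`, and `requireVar` are hypothetical names, and treating `DATABASE_URL` and `OPENAI_API_KEY` as required, as well as defaulting `NODE_ENV` to `development`, are assumptions.

```typescript
// Hypothetical config shape; field names mirror the documented variables.
interface AppConfig {
  databaseUrl: string;
  openaiApiKey: string;
  openaiApiModel: string;
  similarityThreshold: number;
  port: number;
  nodeEnv: string;
}

type Env = Record<string, string | undefined>;

// Parse VECTOR_SIMILARITY_THRESHOLD, falling back to the documented default of 0.85.
function parseThreshold(raw: string | undefined, fallback = 0.85): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error(`VECTOR_SIMILARITY_THRESHOLD must be between 0.0 and 1.0, got "${raw}"`);
  }
  return value;
}

// Assumption: a missing or empty value for these variables is a startup error.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

function loadConfig(env: Env): AppConfig {
  return {
    databaseUrl: requireVar(env, "DATABASE_URL"),
    openaiApiKey: requireVar(env, "OPENAI_API_KEY"),
    openaiApiModel: env["OPENAI_API_MODEL"] ?? "text-embedding-ada-002",
    similarityThreshold: parseThreshold(env["VECTOR_SIMILARITY_THRESHOLD"]),
    port: Number(env["PORT"] ?? 3000),
    nodeEnv: env["NODE_ENV"] ?? "development", // assumed default
  };
}
```

In a Node.js service this would typically be called once as `loadConfig(process.env)`, so that an out-of-range threshold fails at startup rather than at query time.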

## Similarity Threshold Guidelines

### Production Environment
- **High Precision**: 0.90 - 0.95 (very strict matching)
- **Standard**: 0.85 - 0.90 (recommended for most use cases)
- **Balanced**: 0.80 - 0.85 (good balance between precision and recall)

### Development Environment
- **Testing**: 0.70 - 0.80 (more lenient for testing)
- **Debugging**: 0.60 - 0.70 (very lenient for development)

### How to Set Threshold

#### Via Environment Variable
```bash
export VECTOR_SIMILARITY_THRESHOLD=0.90
```

#### Via .env file
```bash
VECTOR_SIMILARITY_THRESHOLD=0.90
```

#### Via API (Runtime)
```http
POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}
```

## Impact of Threshold Changes

- **Higher Threshold (0.90+)**: Fewer results, higher precision, more relevant matches
- **Lower Threshold (below 0.70)**: More results, lower precision, may include less relevant matches
- **Optimal Range (0.80-0.90)**: Good balance between precision and recall for most medical coding use cases
243
docs/SIMILARITY_THRESHOLD.md
Normal file
@@ -0,0 +1,243 @@
# Similarity Threshold Configuration

## Overview

The similarity threshold sets the minimum similarity score a result must reach to be returned by vector search. Raising it trades recall for precision, so that only closely matching results are included.

## Default Configuration

- **Default Threshold**: `0.85` (85% similarity)
- **Environment Variable**: `VECTOR_SIMILARITY_THRESHOLD`
- **Range**: 0.0 to 1.0 (0% to 100% similarity)

## API Endpoints

### 1. Get Current Threshold

```http
GET /api/pgvector/threshold
```

**Response:**

```json
{
  "threshold": 0.85,
  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
}
```

### 2. Set Threshold

```http
POST /api/pgvector/threshold
Content-Type: application/json

{
  "threshold": 0.90
}
```

**Response:**

```json
{
  "message": "Similarity threshold updated successfully",
  "threshold": 0.9,
  "previousThreshold": 0.85
}
```
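The get/set behaviour above can be sketched as a small in-memory holder that validates the range and returns the response shape shown. `ThresholdStore` and `SetThresholdResponse` are hypothetical names for illustration; how the real service validates or stores the value is not shown in this commit, and keeping the value only in memory is an assumption.

```typescript
// Response shape copied from the documented example.
interface SetThresholdResponse {
  message: string;
  threshold: number;
  previousThreshold: number;
}

// Hypothetical in-memory holder (assumption: the runtime value is not persisted).
class ThresholdStore {
  private threshold: number;

  constructor(initial = 0.85) {
    this.threshold = initial;
  }

  getSimilarityThreshold(): number {
    return this.threshold;
  }

  // Rejects anything outside the documented 0.0 to 1.0 range.
  setSimilarityThreshold(next: number): SetThresholdResponse {
    if (typeof next !== "number" || !Number.isFinite(next) || next < 0 || next > 1) {
      throw new RangeError("threshold must be a number between 0.0 and 1.0");
    }
    const previousThreshold = this.threshold;
    this.threshold = next;
    return {
      message: "Similarity threshold updated successfully",
      threshold: next,
      previousThreshold,
    };
  }
}
```

A request value of `0.90` appears as `0.9` in the response because JSON serialization drops the trailing zero; the two are the same number.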

### 3. Advanced Vector Search

```http
POST /api/pgvector/advanced-search
Content-Type: application/json

{
  "query": "diabetes mellitus type 2",
  "limit": 10,
  "category": "ICD10",
  "threshold": 0.90
}
```

## Search Methods

### Standard Vector Search

- Uses cosine similarity
- Default threshold from environment variable
- Good for general use cases

### Advanced Vector Search

- Combines cosine and Euclidean similarity metrics
- Weighted scoring: 70% cosine + 30% Euclidean
- Higher precision results
- Recommended for production use

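The 70/30 weighting can be illustrated with a small scoring function. This is a sketch under assumptions: the document does not say how Euclidean distance is turned into a similarity, so the `1 / (1 + distance)` mapping below is an illustrative choice, and all function names are hypothetical.

```typescript
// Cosine similarity of two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / Math.sqrt(normA * normB);
}

// Euclidean distance mapped into (0, 1]. ASSUMPTION: 1 / (1 + d) is one common
// mapping; the service may use a different one.
function euclideanSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must be of equal length");
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return 1 / (1 + Math.sqrt(sum));
}

// Weighted score as described: 70% cosine + 30% Euclidean.
function combinedScore(a: number[], b: number[]): number {
  return 0.7 * cosineSimilarity(a, b) + 0.3 * euclideanSimilarity(a, b);
}
```

Under this mapping, identical vectors score approximately 1, while orthogonal unit vectors get no cosine contribution and only a small Euclidean one, so the combined score is compared against the same 0.0 to 1.0 threshold scale.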

### Hybrid Search

- Combines vector similarity with text search
- Uses threshold from environment variable
- Best balance of semantic and text matching

## Threshold Recommendations

### Medical Coding Use Cases

| Use Case                     | Recommended Threshold | Description                                   |
| ---------------------------- | --------------------- | --------------------------------------------- |
| **High Precision Diagnosis** | 0.90 - 0.95           | Very strict matching for critical diagnoses   |
| **Standard Medical Coding**  | 0.85 - 0.90           | Recommended for most medical coding scenarios |
| **General Medical Search**   | 0.80 - 0.85           | Good balance between precision and recall     |
| **Research & Exploration**   | 0.70 - 0.80           | More lenient for research purposes            |

### Environment-Specific Settings

#### Production Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.85
```

#### Development Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.70
```

#### Testing Environment

```bash
VECTOR_SIMILARITY_THRESHOLD=0.75
```

## Implementation Details

### Environment Variable

```bash
# Set in .env file
VECTOR_SIMILARITY_THRESHOLD=0.85

# Or set as system environment variable
export VECTOR_SIMILARITY_THRESHOLD=0.85
```

### Runtime Configuration

```typescript
// Get current threshold
const currentThreshold = pgVectorService.getSimilarityThreshold();

// Set new threshold
pgVectorService.setSimilarityThreshold(0.9);
```

### SQL Query Optimization

The service builds its SQL queries to:

- Filter results at the database level using the threshold
- Order results by similarity score
- Use the appropriate vector similarity operators

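A query with these three properties might look like the sketch below, which uses pgvector's `<=>` cosine distance operator and converts it to a similarity as `1 - distance`. The table name `icd_embeddings`, its columns, and the `buildVectorSearchQuery` helper are hypothetical; the actual query in this commit is not shown here.

```typescript
interface VectorQuery {
  text: string;
  values: (string | number)[];
}

// Builds a parameterized query; nothing is interpolated into the SQL text.
function buildVectorSearchQuery(
  embedding: number[],
  threshold: number,
  limit: number,
): VectorQuery {
  // pgvector accepts a vector literal of the form '[0.1,0.2,...]'.
  const vectorLiteral = `[${embedding.join(",")}]`;
  const text = `
    SELECT id, code, description,
           1 - (embedding <=> $1::vector) AS similarity  -- cosine similarity
    FROM icd_embeddings                                   -- hypothetical table
    WHERE 1 - (embedding <=> $1::vector) >= $2            -- threshold filter in the database
    ORDER BY embedding <=> $1::vector                     -- closest first
    LIMIT $3`;
  return { text, values: [vectorLiteral, threshold, limit] };
}
```

The returned `text` and `values` can be passed to a PostgreSQL client that supports numbered parameters. With an approximate index (IVFFlat or HNSW), pgvector applies the `WHERE` filter after the index scan, so a strict threshold may return fewer rows than `LIMIT` asks for.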

## Performance Impact

### Higher Threshold (0.90+)

- ✅ Fewer results to process
- ✅ Higher precision
- ❌ May miss relevant results
- ❌ Query execution may be slower (more rows filtered out)

### Lower Threshold (below 0.70)

- ✅ Query execution may be faster (less filtering)
- ✅ More comprehensive results
- ❌ Lower precision
- ❌ More irrelevant results

### Optimal Range (0.80-0.90)

- ✅ Good balance of precision and performance
- ✅ Suitable for most medical coding scenarios
- ✅ Reasonable query execution time

## Troubleshooting

### Common Issues

1. **No Results Returned**
   - Check if threshold is too high
   - Verify embeddings are generated
   - Check database connection

2. **Too Many Results**
   - Increase threshold value
   - Use advanced search method
   - Add category filters

3. **Performance Issues**
   - Optimize threshold for your use case
   - Use database indexes
   - Consider batch processing

### Debug Commands

```bash
# Check current threshold
curl -X GET http://localhost:3000/api/pgvector/threshold

# Get embedding statistics
curl -X GET http://localhost:3000/api/pgvector/stats

# Test with different thresholds
curl -X POST http://localhost:3000/api/pgvector/advanced-search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "threshold": 0.80}'
```

## Best Practices

1. **Start with Default**: Begin with threshold 0.85
2. **Test Incrementally**: Adjust threshold in small increments (0.05)
3. **Monitor Results**: Evaluate precision vs. recall trade-offs
4. **Environment Specific**: Use different thresholds for different environments
5. **Document Changes**: Keep track of threshold changes and their impact

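Steps 2 and 3 can be made concrete by sweeping the threshold in 0.05 increments over a small hand-labeled sample and recording precision and recall at each step. The sketch below assumes you already have similarity scores with relevance labels; `ScoredResult`, `SweepRow`, and `sweepThresholds` are hypothetical names, and reporting precision as 1 when nothing is returned is a convention chosen here.

```typescript
interface ScoredResult {
  similarity: number; // score returned by the search
  relevant: boolean;  // manual judgment for this query
}

interface SweepRow {
  threshold: number;
  returned: number;
  precision: number;
  recall: number;
}

// Evaluate thresholds from `from` to `to` (inclusive) in `step` increments.
function sweepThresholds(
  results: ScoredResult[],
  from = 0.7,
  to = 0.95,
  step = 0.05,
): SweepRow[] {
  const totalRelevant = results.filter((r) => r.relevant).length;
  const steps = Math.round((to - from) / step);
  const rows: SweepRow[] = [];
  for (let i = 0; i <= steps; i++) {
    // Round to two decimals to avoid floating-point drift such as 0.7999999.
    const threshold = Number((from + i * step).toFixed(2));
    const kept = results.filter((r) => r.similarity >= threshold);
    const hits = kept.filter((r) => r.relevant).length;
    rows.push({
      threshold,
      returned: kept.length,
      precision: kept.length === 0 ? 1 : hits / kept.length, // convention for empty result sets
      recall: totalRelevant === 0 ? 1 : hits / totalRelevant,
    });
  }
  return rows;
}
```

Logging the resulting rows for a handful of representative queries gives a record of how each threshold change affected results, which also covers step 5.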

## Migration Guide

### From Previous Version

If you are upgrading from a version without a configurable threshold:

1. **Set Environment Variable**:

   ```bash
   VECTOR_SIMILARITY_THRESHOLD=0.85
   ```

2. **Update Search Calls**:

   ```typescript
   // Old way (hardcoded 0.7)
   const legacyResults = await service.vectorSearch(query, limit, category, 0.7);

   // New way (uses environment variable)
   const results = await service.vectorSearch(query, limit, category);
   ```

3. **Test New Thresholds**:

   ```bash
   # Test with current threshold
   curl -X GET http://localhost:3000/api/pgvector/threshold

   # Adjust if needed
   curl -X POST http://localhost:3000/api/pgvector/threshold \
     -H "Content-Type: application/json" \
     -d '{"threshold": 0.90}'
   ```
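The change in step 2 amounts to making the threshold an optional per-call value that falls back to the configured one. A minimal sketch of that fallback, with the hypothetical names `SearchCall` and `resolveThreshold` (the real `vectorSearch` signature may differ):

```typescript
interface SearchCall {
  query: string;
  limit: number;
  category?: string;
  threshold?: number; // optional per-call override
}

// Use the per-call threshold if given, otherwise the configured default.
function resolveThreshold(call: SearchCall, configured: number): number {
  const value = call.threshold ?? configured;
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError("threshold must be between 0.0 and 1.0");
  }
  return value;
}
```

Callers that previously relied on the hardcoded 0.7 will match against 0.85 after migrating, so they should expect fewer results unless they pass a threshold explicitly.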