Try to fix similarity and add seed for master Excel ICD

2025-08-23 03:25:15 +07:00
parent b77beb2d85
commit 0ad656ce35
14 changed files with 1274 additions and 223 deletions
@@ -0,0 +1,73 @@
+# Environment Variables
+
+## Database Configuration
+- `DATABASE_URL`: PostgreSQL connection string
+  - Example: `postgresql://username:password@localhost:5432/claim_guard_db`
+
+## OpenAI Configuration
+- `OPENAI_API_KEY`: Your OpenAI API key for embeddings
+- `OPENAI_API_MODEL`: OpenAI model for embeddings (default: `text-embedding-ada-002`)
+
+## Vector Search Configuration
+- `VECTOR_SIMILARITY_THRESHOLD`: Minimum similarity threshold for vector search (default: `0.85`)
+  - Range: 0.0 to 1.0
+  - Higher values = stricter matching
+  - Recommended: 0.85 for production, 0.7 for development
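+
+A minimal sketch of how the variable could be parsed and range-checked at startup. The helper name `parseThreshold` and the choice to throw on invalid input are assumptions made for illustration, not the project's actual code:
+
+```typescript
+// Documented default for VECTOR_SIMILARITY_THRESHOLD.
+const DEFAULT_THRESHOLD = 0.85;
+
+// Hypothetical helper: turns the raw environment string into a number in [0, 1].
+function parseThreshold(raw: string | undefined, fallback: number = DEFAULT_THRESHOLD): number {
+  if (raw === undefined || raw.trim() === "") {
+    return fallback; // variable not set: use the default
+  }
+  const value = Number(raw);
+  if (!Number.isFinite(value) || value < 0 || value > 1) {
+    // Assumption: fail fast instead of silently clamping a bad value.
+    throw new RangeError(`VECTOR_SIMILARITY_THRESHOLD must be between 0.0 and 1.0, got "${raw}"`);
+  }
+  return value;
+}
+```
+
+In a Node.js application this would typically be called as `parseThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD)`.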
+
+## Application Configuration
+- `PORT`: Application port (default: 3000)
+- `NODE_ENV`: Environment mode (development/production)
+
+## Example .env file
+```bash
+# Database
+DATABASE_URL="postgresql://username:password@localhost:5432/claim_guard_db"
+
+# OpenAI
+OPENAI_API_KEY="your-openai-api-key-here"
+OPENAI_API_MODEL="text-embedding-ada-002"
+
+# Vector Search
+VECTOR_SIMILARITY_THRESHOLD=0.85
+
+# App
+PORT=3000
+NODE_ENV=development
+```
+
+## Similarity Threshold Guidelines
+
+### Production Environment
+- **High Precision**: 0.90 - 0.95 (very strict matching)
+- **Standard**: 0.85 - 0.90 (recommended for most use cases)
+- **Balanced**: 0.80 - 0.85 (good balance between precision and recall)
+
+### Development Environment
+- **Testing**: 0.70 - 0.80 (more lenient for testing)
+- **Debugging**: 0.60 - 0.70 (very lenient for development)
+
+### How to Set Threshold
+
+#### Via Environment Variable
+```bash
+export VECTOR_SIMILARITY_THRESHOLD=0.90
+```
+
+#### Via .env file
+```bash
+VECTOR_SIMILARITY_THRESHOLD=0.90
+```
+
+#### Via API (Runtime)
+```http
+POST /api/pgvector/threshold
+Content-Type: application/json
+
+{
+  "threshold": 0.90
+}
+```
+
+## Impact of Threshold Changes
+
+- **Higher Threshold (0.90+)**: Fewer results, higher precision, more relevant matches
+- **Lower Threshold (0.70 and below)**: More results, lower precision, may include less relevant matches
+- **Optimal Range (0.80-0.90)**: Good balance between precision and recall for most medical coding use cases
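+
+The trade-off above can be made concrete with a small sketch that computes precision and recall at a given threshold. The `ScoredMatch` shape and the sample scores are invented for illustration; real relevance labels would have to come from a manual review:
+
+```typescript
+interface ScoredMatch {
+  code: string;       // placeholder label, not a real ICD code
+  similarity: number; // similarity score in [0, 1]
+  relevant: boolean;  // ground-truth judgement from a reviewer
+}
+
+// Computes precision and recall for the matches at or above the threshold.
+function precisionRecall(
+  matches: ScoredMatch[],
+  threshold: number,
+): { returned: number; precision: number; recall: number } {
+  const returned = matches.filter((m) => m.similarity >= threshold);
+  const truePositives = returned.filter((m) => m.relevant).length;
+  const totalRelevant = matches.filter((m) => m.relevant).length;
+  return {
+    returned: returned.length,
+    // Convention used here: precision is 1 when nothing is returned.
+    precision: returned.length === 0 ? 1 : truePositives / returned.length,
+    recall: totalRelevant === 0 ? 1 : truePositives / totalRelevant,
+  };
+}
+
+// Toy data, invented for illustration only.
+const sample: ScoredMatch[] = [
+  { code: "A", similarity: 0.95, relevant: true },
+  { code: "B", similarity: 0.88, relevant: true },
+  { code: "C", similarity: 0.82, relevant: false },
+  { code: "D", similarity: 0.74, relevant: true },
+];
+
+const strict = precisionRecall(sample, 0.9);  // 1 result returned
+const lenient = precisionRecall(sample, 0.7); // 4 results returned
+```
+
+On this toy data the strict threshold returns only relevant matches but misses two of the three relevant items, while the lenient threshold finds all of them at the cost of one irrelevant match.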
@@ -0,0 +1,243 @@
+# Similarity Threshold Configuration
+
+## Overview
+
+The similarity threshold feature lets you control the precision of vector search results by setting the minimum similarity score a result must reach to be returned. Results that score below the threshold are filtered out, so raising it trades recall for precision.
+
+## Default Configuration
+
+- **Default Threshold**: `0.85` (85% similarity)
+- **Environment Variable**: `VECTOR_SIMILARITY_THRESHOLD`
+- **Range**: 0.0 to 1.0 (0% to 100% similarity)
+
+## API Endpoints
+
+### 1. Get Current Threshold
+
+```http
+GET /api/pgvector/threshold
+```
+
+**Response:**
+
+```json
+{
+  "threshold": 0.85,
+  "description": "Minimum similarity score required for search results (0.0 - 1.0)"
+}
+```
+
+### 2. Set Threshold
+
+```http
+POST /api/pgvector/threshold
+Content-Type: application/json
+
+{
+  "threshold": 0.90
+}
+```
+
+**Response:**
+
+```json
+{
+  "message": "Similarity threshold updated successfully",
+  "threshold": 0.9,
+  "previousThreshold": 0.85
+}
+```
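+
+As a rough client-side sketch, the request for the endpoint above could be assembled as follows. The function name `buildSetThresholdRequest` and the client-side range check are assumptions; only the path, method, header, and body shape come from the documentation above:
+
+```typescript
+// Builds (but does not send) the request for POST /api/pgvector/threshold.
+function buildSetThresholdRequest(
+  baseUrl: string,
+  threshold: number,
+): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
+  // Assumption: reject out-of-range or NaN values before calling the API.
+  if (!(threshold >= 0 && threshold <= 1)) {
+    throw new RangeError("threshold must be between 0.0 and 1.0");
+  }
+  return {
+    // Strip trailing slashes so the path is not doubled up.
+    url: `${baseUrl.replace(/\/+$/, "")}/api/pgvector/threshold`,
+    init: {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      body: JSON.stringify({ threshold }),
+    },
+  };
+}
+```
+
+The result can be handed to `fetch(url, init)`; the JSON response is expected to carry `message`, `threshold`, and `previousThreshold` as shown above.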
+
+### 3. Advanced Vector Search
+
+```http
+POST /api/pgvector/advanced-search
+Content-Type: application/json
+
+{
+  "query": "diabetes mellitus type 2",
+  "limit": 10,
+  "category": "ICD10",
+  "threshold": 0.90
+}
+```
+
+## Search Methods
+
+### Standard Vector Search
+
+- Uses cosine similarity
+- Default threshold from environment variable
+- Good for general use cases
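+
+For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. This in-memory version only illustrates the metric; the service presumably delegates the computation to the database:
+
+```typescript
+// Cosine similarity: dot(a, b) / (|a| * |b|).
+function cosineSimilarity(a: number[], b: number[]): number {
+  if (a.length === 0 || a.length !== b.length) {
+    throw new Error("vectors must be non-empty and of equal length");
+  }
+  let dot = 0;
+  let normA = 0;
+  let normB = 0;
+  for (let i = 0; i < a.length; i++) {
+    dot += a[i] * b[i];
+    normA += a[i] * a[i];
+    normB += b[i] * b[i];
+  }
+  if (normA === 0 || normB === 0) {
+    return 0; // convention chosen here for zero vectors
+  }
+  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
+}
+```
+
+A result is then kept only when `cosineSimilarity(queryEmbedding, candidateEmbedding)` is at least the configured threshold.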
+
+### Advanced Vector Search
+
+- Combines cosine and Euclidean similarity metrics
+- Weighted scoring: 70% cosine + 30% Euclidean
+- Higher precision results
+- Recommended for production use
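+
+The weighting could look roughly like the sketch below. The 70/30 weights come from the description above; mapping Euclidean distance `d` to a similarity as `1 / (1 + d)` is an assumption made for this example, since the documentation does not say how the distance is normalized:
+
+```typescript
+const COSINE_WEIGHT = 0.7;    // from the description above
+const EUCLIDEAN_WEIGHT = 0.3; // from the description above
+
+function cosineSim(a: number[], b: number[]): number {
+  let dot = 0;
+  let normA = 0;
+  let normB = 0;
+  for (let i = 0; i < a.length; i++) {
+    dot += a[i] * b[i];
+    normA += a[i] * a[i];
+    normB += b[i] * b[i];
+  }
+  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
+}
+
+// Assumption: distance d is mapped to a similarity in (0, 1] via 1 / (1 + d).
+function euclideanSim(a: number[], b: number[]): number {
+  let sumSquares = 0;
+  for (let i = 0; i < a.length; i++) {
+    const diff = a[i] - b[i];
+    sumSquares += diff * diff;
+  }
+  return 1 / (1 + Math.sqrt(sumSquares));
+}
+
+// Weighted combination of the two similarity scores.
+function combinedScore(a: number[], b: number[]): number {
+  if (a.length === 0 || a.length !== b.length) {
+    throw new Error("vectors must be non-empty and of equal length");
+  }
+  return COSINE_WEIGHT * cosineSim(a, b) + EUCLIDEAN_WEIGHT * euclideanSim(a, b);
+}
+```
+
+With this normalization, identical vectors score 1 and the combined score is compared against the same 0.0 to 1.0 threshold as the standard search.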
+
+### Hybrid Search
+
+- Combines vector similarity with text search
+- Uses threshold from environment variable
+- Best balance of semantic and text matching
+
+## Threshold Recommendations
+
+### Medical Coding Use Cases
+
+| Use Case                     | Recommended Threshold | Description                                   |
+| ---------------------------- | --------------------- | --------------------------------------------- |
+| **High Precision Diagnosis** | 0.90 - 0.95           | Very strict matching for critical diagnoses   |
+| **Standard Medical Coding**  | 0.85 - 0.90           | Recommended for most medical coding scenarios |
+| **General Medical Search**   | 0.80 - 0.85           | Good balance between precision and recall     |
+| **Research & Exploration**   | 0.70 - 0.80           | More lenient for research purposes            |
+
+### Environment-Specific Settings
+
+#### Production Environment
+
+```bash
+VECTOR_SIMILARITY_THRESHOLD=0.85
+```
+
+#### Development Environment
+
+```bash
+VECTOR_SIMILARITY_THRESHOLD=0.70
+```
+
+#### Testing Environment
+
+```bash
+VECTOR_SIMILARITY_THRESHOLD=0.75
+```
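+
+If the variable is not set, a per-environment default could be chosen in code instead. This is a sketch under assumptions: the helper name `defaultThresholdFor`, the `NODE_ENV` value `test` for the testing environment, and the fallback to the production value are all illustrative choices:
+
+```typescript
+// Per-environment defaults taken from the settings listed above.
+const ENV_DEFAULTS = new Map<string, number>([
+  ["production", 0.85],
+  ["development", 0.7],
+  ["test", 0.75], // assumption: NODE_ENV is "test" in the testing environment
+]);
+
+// Hypothetical helper: unknown or missing environments get the production value.
+function defaultThresholdFor(nodeEnv: string | undefined): number {
+  return ENV_DEFAULTS.get(nodeEnv ?? "") ?? 0.85;
+}
+```
+
+Falling back to the stricter production value means a misconfigured environment errs on the side of precision.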
+
+## Implementation Details
+
+### Environment Variable
+
+```bash
+# Set in .env file
+VECTOR_SIMILARITY_THRESHOLD=0.85
+
+# Or set as system environment variable
+export VECTOR_SIMILARITY_THRESHOLD=0.85
+```
+
+### Runtime Configuration
+
+```typescript
+// Get current threshold
+const currentThreshold = pgVectorService.getSimilarityThreshold();
+
+// Set new threshold
+pgVectorService.setSimilarityThreshold(0.9);
+```
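+
+Behind those two calls, the service needs to hold and validate the value. A minimal sketch, assuming a standalone class named `SimilarityThresholdHolder` (hypothetical) and a setter that returns the previous value so a response can include `previousThreshold`:
+
+```typescript
+class SimilarityThresholdHolder {
+  private threshold: number;
+
+  constructor(initial: number = 0.85) {
+    this.threshold = SimilarityThresholdHolder.validate(initial);
+  }
+
+  getSimilarityThreshold(): number {
+    return this.threshold;
+  }
+
+  // Assumption: returns the previous threshold; the real setter may return nothing.
+  setSimilarityThreshold(next: number): number {
+    const validated = SimilarityThresholdHolder.validate(next); // throws before any state changes
+    const previous = this.threshold;
+    this.threshold = validated;
+    return previous;
+  }
+
+  private static validate(value: number): number {
+    if (!Number.isFinite(value) || value < 0 || value > 1) {
+      throw new RangeError("threshold must be between 0.0 and 1.0");
+    }
+    return value;
+  }
+}
+```
+
+Note that a value set this way lives only in memory, so a restart would presumably revert to the environment variable.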
+
+### SQL Query Optimization
+
+The system automatically optimizes SQL queries to:
+
+- Filter results at the database level using the threshold
+- Order results by similarity score
+- Use appropriate vector similarity operators
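+
+The three points above could translate into a parameterized query like the one sketched here. pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`. The table name `medical_codes` and the columns `id`, `code`, `description`, and `embedding` are hypothetical, as is the helper name:
+
+```typescript
+interface SearchQuery {
+  text: string;
+  values: (string | number)[];
+}
+
+// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
+function toVectorLiteral(embedding: number[]): string {
+  return `[${embedding.join(",")}]`;
+}
+
+// Hypothetical builder: filters by threshold and orders by distance in the database.
+function buildVectorSearchQuery(embedding: number[], threshold: number, limit: number): SearchQuery {
+  const text = [
+    "SELECT id, code, description,",
+    "       1 - (embedding <=> $1::vector) AS similarity",
+    "FROM medical_codes",
+    "WHERE 1 - (embedding <=> $1::vector) >= $2",
+    "ORDER BY embedding <=> $1::vector",
+    "LIMIT $3",
+  ].join("\n");
+  return { text, values: [toVectorLiteral(embedding), threshold, limit] };
+}
+```
+
+The returned `text` and `values` can be passed to a parameterized query call, for example `client.query(text, values)` in node-postgres.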
+
+## Performance Impact
+
+### Higher Threshold (0.90+)
+
+- ✅ Fewer results to process
+- ✅ Higher precision
+- ❌ May miss relevant results
+- ❌ No query-time benefit is guaranteed: similarity is still computed for each candidate row before the filter applies
+
+### Lower Threshold (0.70 and below)
+
+- ✅ Faster query execution
+- ✅ More comprehensive results
+- ❌ Lower precision
+- ❌ More irrelevant results
+
+### Optimal Range (0.80-0.90)
+
+- ✅ Good balance of precision and performance
+- ✅ Suitable for most medical coding scenarios
+- ✅ Reasonable query execution time
+
+## Troubleshooting
+
+### Common Issues
+
+1. **No Results Returned**
+   - Check if threshold is too high
+   - Verify embeddings are generated
+   - Check database connection
+
+2. **Too Many Results**
+   - Increase threshold value
+   - Use advanced search method
+   - Add category filters
+
+3. **Performance Issues**
+   - Optimize threshold for your use case
+   - Use database indexes
+   - Consider batch processing
+
+### Debug Commands
+
+```bash
+# Check current threshold
+curl -X GET http://localhost:3000/api/pgvector/threshold
+
+# Get embedding statistics
+curl -X GET http://localhost:3000/api/pgvector/stats
+
+# Test with different thresholds
+curl -X POST http://localhost:3000/api/pgvector/advanced-search \
+  -H "Content-Type: application/json" \
+  -d '{"query": "test", "threshold": 0.80}'
+```
+
+## Best Practices
+
+1. **Start with Default**: Begin with threshold 0.85
+2. **Test Incrementally**: Adjust threshold in small increments (0.05)
+3. **Monitor Results**: Evaluate precision vs. recall trade-offs
+4. **Environment-Specific**: Use different thresholds for different environments
+5. **Document Changes**: Keep track of threshold changes and their impact
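+
+The incremental testing in step 2 can be sketched as a sweep that counts how many results survive at each threshold. The function name `sweepThresholds` is hypothetical, and the scores would come from a real search run without a threshold filter:
+
+```typescript
+interface SweepRow {
+  threshold: number;
+  returned: number;
+}
+
+// Counts the scores at or above each threshold from start to end (inclusive).
+function sweepThresholds(scores: number[], start: number, end: number, step: number = 0.05): SweepRow[] {
+  const steps = Math.round((end - start) / step);
+  const rows: SweepRow[] = [];
+  for (let i = 0; i <= steps; i++) {
+    // Round to two decimals to avoid floating-point drift such as 0.8500000000000001.
+    const threshold = Number((start + i * step).toFixed(2));
+    rows.push({ threshold, returned: scores.filter((s) => s >= threshold).length });
+  }
+  return rows;
+}
+
+// Example scores, invented for illustration only.
+const rows = sweepThresholds([0.95, 0.88, 0.82, 0.74], 0.8, 0.9);
+```
+
+Recording the resulting rows alongside a manual relevance check gives a simple log of each threshold change and its impact.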
+
+## Migration Guide
+
+### From Previous Version
+
+If upgrading from a version without a configurable threshold:
+
+1. **Set Environment Variable**:
+
+   ```bash
+   VECTOR_SIMILARITY_THRESHOLD=0.85
+   ```
+
+2. **Update Search Calls**:
+
+   ```typescript
+   // Old way (hardcoded 0.7)
+   const results = await service.vectorSearch(query, limit, category, 0.7);
+
+   // New way (uses environment variable)
+   const results = await service.vectorSearch(query, limit, category);
+   ```
+
+3. **Test New Thresholds**:
+
+   ```bash
+   # Test with current threshold
+   curl -X GET http://localhost:3000/api/pgvector/threshold
+
+   # Adjust if needed
+   curl -X POST http://localhost:3000/api/pgvector/threshold \
+     -H "Content-Type: application/json" \
+     -d '{"threshold": 0.90}'
+   ```