try fix similarity and add seed for master excel icd

2025-08-23 03:25:15 +07:00
parent b77beb2d85
commit 0ad656ce35
14 changed files with 1274 additions and 223 deletions
--- a/docs/ENVIRONMENT_VARIABLES.md
+++ b/docs/ENVIRONMENT_VARIABLES.md
@@ -0,0 +1,73 @@
+# Environment Variables
+
+## Database Configuration
+- `DATABASE_URL`: PostgreSQL connection string
+  - Example: `postgresql://username:password@localhost:5432/claim_guard_db`
+
+## OpenAI Configuration
+- `OPENAI_API_KEY`: Your OpenAI API key for embeddings
+- `OPENAI_API_MODEL`: OpenAI model for embeddings (default: `text-embedding-ada-002`)
+
+## Vector Search Configuration
+- `VECTOR_SIMILARITY_THRESHOLD`: Minimum similarity score a vector search result must reach to be returned (default: `0.85`)
+  - Range: 0.0 to 1.0
+  - Higher values mean stricter matching
+  - Recommended: 0.85 for production, 0.70 for development
+
+## Application Configuration
+- `PORT`: Application port (default: `3000`)
+- `NODE_ENV`: Environment mode (`development` or `production`)
+
+## Example .env file
+```bash
+# Database
+DATABASE_URL="postgresql://username:password@localhost:5432/claim_guard_db"
+
+# OpenAI
+OPENAI_API_KEY="your-openai-api-key-here"
+OPENAI_API_MODEL="text-embedding-ada-002"
+
+# Vector Search
+VECTOR_SIMILARITY_THRESHOLD=0.85
+
+# App
+PORT=3000
+NODE_ENV=development
+```
+
+## Similarity Threshold Guidelines
+
+### Production Environment
+- **High Precision**: 0.90 - 0.95 (very strict matching)
+- **Standard**: 0.85 - 0.90 (recommended for most use cases)
+- **Balanced**: 0.80 - 0.85 (good balance between precision and recall)
+
+### Development Environment
+- **Testing**: 0.70 - 0.80 (more lenient for testing)
+- **Debugging**: 0.60 - 0.70 (very lenient for development)
+
+### How to Set Threshold
+
+#### Via Environment Variable
+```bash
+export VECTOR_SIMILARITY_THRESHOLD=0.90
+```
+
+#### Via .env file
+```bash
+VECTOR_SIMILARITY_THRESHOLD=0.90
+```
+
+#### Via API (Runtime)
+```bash
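+# Assumes the app is running locally on the default PORT (3000)
+curl -X POST http://localhost:3000/api/pgvector/threshold \
+  -H "Content-Type: application/json" \
+  -d '{"threshold": 0.90}'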
+```
+
+## Impact of Threshold Changes
+
+- **Higher Threshold (0.90 and above)**: Fewer results, higher precision, more relevant matches
+- **Lower Threshold (0.70 and below)**: More results, lower precision, may include less relevant matches
+- **Optimal Range (0.80-0.90)**: Good balance between precision and recall for most medical coding use cases
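
Review note: the documented range (0.0 to 1.0) and default (`0.85`) for `VECTOR_SIMILARITY_THRESHOLD` could be enforced where the value is read. The following is a minimal sketch, not the code in this commit. `parseThreshold`, `filterByThreshold`, the `Match` shape, and the sample scores are all hypothetical names and values chosen for illustration. It also assumes results carry a similarity score where higher means closer.

```typescript
// Hypothetical helpers; only the variable name VECTOR_SIMILARITY_THRESHOLD,
// its 0.0 to 1.0 range, and its 0.85 default come from the documentation.
const DEFAULT_THRESHOLD = 0.85;

// Parse the raw environment value, falling back to the documented default
// and rejecting anything outside the documented range.
function parseThreshold(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_THRESHOLD;
  }
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error(
      `VECTOR_SIMILARITY_THRESHOLD must be between 0.0 and 1.0, got "${raw}"`
    );
  }
  return value;
}

// Assumed result shape: a code plus a similarity score in [0, 1].
interface Match {
  code: string;
  similarity: number;
}

// Keep only the matches whose similarity reaches the threshold.
function filterByThreshold(matches: Match[], threshold: number): Match[] {
  return matches.filter((m) => m.similarity >= threshold);
}

// Made-up sample scores, for illustration only.
const sample: Match[] = [
  { code: "CODE_A", similarity: 0.93 },
  { code: "CODE_B", similarity: 0.86 },
  { code: "CODE_C", similarity: 0.72 },
];

// In the application this would presumably be called as
// parseThreshold(process.env.VECTOR_SIMILARITY_THRESHOLD).
const strict = filterByThreshold(sample, parseThreshold("0.90"));
const lenient = filterByThreshold(sample, parseThreshold("0.70"));
console.log(strict.length, lenient.length);
```

With these sample scores, raising the threshold from 0.70 to 0.90 drops the two lower-scoring matches, which is the precision and recall trade-off the last section of the document describes.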