Feature: IRIS EMBEDDING (Feature 051)
Status: Production-Ready (v0.5.2+)
Performance: 346x faster than manual embedding generation
IRIS EMBEDDING provides automatic document vectorization with intelligent model caching, eliminating the 720x performance penalty from repeated model loading. When enabled, embedding models stay in memory and process all document insertions and queries through a centralized cache.
Key Benefits:
- ⚡ 346x speedup - 1,746 documents in 3.5 seconds vs 20 minutes
- 🎯 95% cache hit rate - Models persist across requests
- 🚀 50ms average latency - Cached embeddings complete in <100ms
- 💾 Automatic fallback - GPU OOM? Falls back to CPU automatically
- 🔄 Multi-field support - Combine title, abstract, and content into single embeddings
Test Dataset: 1,746 PMC medical papers with multi-field vectorization
| Method | Time | Model Loads | Cache Hit Rate | Docs/Second |
|---|---|---|---|---|
| Manual (baseline) | 20 minutes | 1,746 (every row) | 0% | 1.5 |
| IRIS EMBEDDING | 3.5 seconds | 1 (cached) | 95% | 499 |
| Speedup | 346x faster | 1,746x fewer loads | 95% efficiency | 333x throughput |
Hardware: Apple M1 Max (MPS acceleration)
Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
Configuration: Batch size 32, device auto-selection
Expected speedup by collection size:
- Small collections (<100 docs): 10-50x speedup
- Medium collections (100-1,000 docs): 100-200x speedup
- Large collections (>1,000 docs): 300-500x speedup
Speedup increases with collection size due to model loading overhead amortization.
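The arithmetic behind this scaling is worth a quick sketch. The load and encode times below are illustrative placeholders, not measured values; the point is that a one-off load cost amortizes across the collection, while a per-row reload cost does not:

```python
# Illustrative-only timings: a one-off model load amortizes across the
# collection, while a per-row reload cost grows linearly with it.
MODEL_LOAD_S = 2.0    # hypothetical one-off cost to load the model
ENCODE_S = 0.005      # hypothetical per-document encode cost

for n_docs in (50, 500, 5_000):
    per_row_reload = n_docs * (MODEL_LOAD_S + ENCODE_S)  # reload every row
    cached = MODEL_LOAD_S + n_docs * ENCODE_S            # load once, reuse
    print(f"{n_docs:>5} docs: {per_row_reload / cached:5.0f}x speedup")
# 50 docs: ~45x ... 5000 docs: ~371x, approaching MODEL_LOAD_S / ENCODE_S
```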
Basic usage:

```python
from iris_vector_rag import create_pipeline
from iris_vector_rag.core.models import Document

# Enable IRIS EMBEDDING support
pipeline = create_pipeline(
    'basic',
    embedding_config='medical_embeddings_v1'  # IRIS EMBEDDING config name
)

# Documents auto-vectorize on INSERT with cached models
docs = [
    Document(
        page_content="Type 2 diabetes is characterized by insulin resistance...",
        metadata={"source": "medical_text.pdf", "page": 127}
    )
]
pipeline.load_documents(documents=docs)

# Queries auto-vectorize using the same cached model
result = pipeline.query("What is diabetes?", top_k=5)
```

Create an embedding configuration to define the model, device, and processing parameters:
```python
from iris_vector_rag.embeddings.iris_embedding import configure_embedding

# Create embedding configuration
config = configure_embedding(
    name="medical_embeddings_v1",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto",  # auto, cuda, mps, cpu
    batch_size=32,
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom"]
)

# Use with any pipeline
pipeline = create_pipeline('basic', embedding_config='medical_embeddings_v1')
```

Combine multiple document fields (title, abstract, conclusions) into a single embedding for richer semantic search:
```python
from iris_vector_rag.core.models import Document

# Document with multiple content fields
doc = Document(
    page_content="",  # Will be auto-filled from metadata fields
    metadata={
        "title": "Type 2 Diabetes Treatment",
        "abstract": "A comprehensive review of treatment approaches...",
        "conclusions": "Insulin therapy combined with lifestyle changes...",
        "source": "PMC123456"
    }
)

# Configure multi-field embedding
pipeline = create_pipeline(
    'basic',
    embedding_config='paper_embeddings',
    multi_field_source=['title', 'abstract', 'conclusions']  # Concatenate these fields
)
pipeline.load_documents(documents=[doc])
# → Embedding generated from: "Type 2 Diabetes Treatment. A comprehensive review..."
```

Benefits:
- Captures context from multiple document sections
- Improves search relevance for academic papers and structured content
- Preserves original metadata fields for filtering
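Under the hood, multi-field vectorization reduces to concatenating the configured fields before encoding. A minimal sketch of that step, assuming the default `". "` separator from the configuration reference below; how missing fields are handled (skipped here) is an assumption of the sketch:

```python
# Sketch of the concatenation step behind multi-field vectorization.
fields = ["title", "abstract", "conclusions"]
separator = ". "

metadata = {
    "title": "Type 2 Diabetes Treatment",
    "abstract": "A comprehensive review of treatment approaches...",
    "conclusions": "Insulin therapy combined with lifestyle changes...",
    "source": "PMC123456",  # not in multi_field_source, so excluded
}

# Join only the configured fields, skipping any that are missing or empty.
text_to_embed = separator.join(metadata[f] for f in fields if metadata.get(f))
print(text_to_embed)
# Type 2 Diabetes Treatment. A comprehensive review of treatment approaches...
```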
IRIS EMBEDDING automatically selects the best available device:
```python
config = configure_embedding(
    name="auto_device_config",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto"  # Tries: CUDA → MPS → CPU
)
```

Device Priority:
1. CUDA (NVIDIA GPUs) - Fastest for large models
2. MPS (Apple Silicon) - Optimized for M1/M2 Macs
3. CPU - Universal fallback
Automatic Fallback: If GPU runs out of memory during processing, IRIS EMBEDDING automatically falls back to CPU without failing the operation.
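A minimal sketch of what such a fallback can look like, using `sentence-transformers` and PyTorch directly (iris-vector-rag's internal handling may differ):

```python
import torch
from sentence_transformers import SentenceTransformer

def encode_with_fallback(model: SentenceTransformer, texts, batch_size=32):
    """Encode on the current device; retry on CPU if the GPU runs out of memory."""
    try:
        return model.encode(texts, batch_size=batch_size)
    except torch.cuda.OutOfMemoryError:  # subclass of RuntimeError (PyTorch >= 1.13)
        torch.cuda.empty_cache()  # release cached GPU blocks
        model.to("cpu")           # move weights to system RAM
        return model.encode(texts, batch_size=batch_size)
```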
Configure batch size for optimal throughput:
```python
config = configure_embedding(
    name="batch_optimized",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=64,  # Process 64 documents per batch
    device_preference="cuda"
)
```

Batch Size Guidelines:
- CPU: 8-16 (limited by RAM)
- MPS (Apple Silicon): 32-64 (limited by unified memory)
- CUDA (NVIDIA): 64-128 (limited by VRAM)
Larger batches improve throughput but increase memory usage.
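Because the sweet spot depends on your hardware, a quick sweep with `sentence-transformers` directly can find it empirically (model name and document text here are just examples):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["Type 2 diabetes is characterized by insulin resistance..."] * 512

# Measure throughput at each batch size; pick the largest that fits in memory.
for batch_size in (8, 16, 32, 64, 128):
    start = time.perf_counter()
    model.encode(texts, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:>3}: {len(texts) / elapsed:7.1f} docs/s")
```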
Enable automatic entity extraction during vectorization:
```python
config = configure_embedding(
    name="entity_aware_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom", "Treatment"],
    entity_extraction_model="en_core_web_sm"  # spaCy model
)
```

Benefits:
- Extracted entities stored in metadata for filtering
- Enables hybrid retrieval (semantic + entity-based)
- Powers GraphRAG knowledge graph construction
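For intuition, this is roughly what extraction with a spaCy pipeline amounts to. Note that `en_core_web_sm` ships general-purpose labels (PERSON, ORG, GPE, ...); domain types such as Disease or Medication require a domain-trained model, so the sketch below illustrates the mechanism rather than medical-grade output:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # general-purpose labels only

doc = nlp("Metformin is a first-line treatment for type 2 diabetes.")

# Each entity carries its surface text and a label usable for filtering.
entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
print(entities)
```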
| Use Case | Model | Dimensions | Speed | Quality |
|---|---|---|---|---|
| General purpose | `all-MiniLM-L6-v2` | 384 | Fast | Good |
| High quality | `all-mpnet-base-v2` | 768 | Medium | Excellent |
| Multilingual | `paraphrase-multilingual-mpnet-base-v2` | 768 | Medium | Good |
| Medical domain | `dmis-lab/biobert-base-cased-v1.1` | 768 | Medium | Domain-specific |
| Legal domain | `nlpaueb/legal-bert-base-uncased` | 768 | Medium | Domain-specific |
Use any HuggingFace embedding model:
```python
config = configure_embedding(
    name="custom_model",
    model_name="your-org/your-embedding-model",
    device_preference="auto"
)
```

Requirements:
- Must be compatible with the `sentence-transformers` library
- Must output fixed-dimension vectors
- Must be accessible via the HuggingFace model hub or a local path
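A quick way to check a candidate model against these requirements before wiring it into a configuration (the model name is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Placeholder name: substitute your HuggingFace model ID or a local path.
model = SentenceTransformer("your-org/your-embedding-model")

vectors = model.encode(["probe sentence one", "probe sentence two"])
assert vectors.ndim == 2 and vectors.shape[0] == 2, "expected one fixed-size vector per input"

print(vectors.shape)                             # e.g. (2, 384)
print(model.get_sentence_embedding_dimension())  # dimension to use in VECTOR(DOUBLE, n)
```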
Use IRIS EMBEDDING when:
- Large document collections (>1,000 documents)
- Frequent re-indexing or incremental updates
- Real-time vectorization requirements
- Memory-constrained environments (model stays in memory, no repeated loading)
- Multi-field vectorization needs (academic papers, structured documents)
- Entity-aware retrieval (medical, legal, scientific domains)
Avoid IRIS EMBEDDING when:
- Small collections (<100 documents) - Overhead not worth the benefit
- One-time indexing - Model caching provides minimal value
- Custom embedding logic - Use manual embeddings if you need full control
- External embedding services (OpenAI, Cohere) - Use API-based embeddings instead
```python
config = configure_embedding(
    # Required
    name="config_name",                                   # Unique configuration identifier
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # HuggingFace model

    # Device and Performance
    device_preference="auto",    # auto | cuda | mps | cpu
    batch_size=32,               # Documents per batch
    max_seq_length=512,          # Max tokens per document
    normalize_embeddings=True,   # L2 normalization

    # Entity Extraction (optional)
    enable_entity_extraction=False,            # Extract entities during vectorization
    entity_types=["Disease", "Medication"],    # Entity types to extract
    entity_extraction_model="en_core_web_sm",  # spaCy model for extraction

    # Multi-Field Vectorization (optional)
    multi_field_source=["title", "abstract"],  # Metadata fields to concatenate
    multi_field_separator=". ",                # Separator between fields

    # Advanced
    cache_folder="./model_cache",  # Model cache directory
    trust_remote_code=False,       # Trust remote HuggingFace code
    model_kwargs={},               # Additional model arguments
)
```

Related environment variables:

```bash
# Model cache location
export SENTENCE_TRANSFORMERS_HOME=/path/to/cache
# HuggingFace token (for private models)
export HUGGINGFACE_TOKEN=your_token_here
# Device override
export CUDA_VISIBLE_DEVICES=0,1
```

Symptom: `RuntimeError: CUDA out of memory`
Solutions:
- Reduce batch size: `batch_size=16` or `batch_size=8`
- Use a smaller model: `all-MiniLM-L6-v2` (384D) instead of `all-mpnet-base-v2` (768D)
- Enable automatic fallback: `device_preference="auto"` (falls back to CPU)
- Clear the CUDA cache: `torch.cuda.empty_cache()`
Symptom: Vectorization slower than expected
Solutions:
- Check the device: `print(config.device)` should be `cuda` or `mps`, not `cpu`
- Increase batch size: `batch_size=64` or `batch_size=128` (if memory allows)
- Reduce `max_seq_length`: `max_seq_length=256` (if documents are short)
- Verify the model is cached: the first run loads the model; subsequent runs should be 10-100x faster
Symptom: `OSError: Model 'model-name' not found`
Solutions:
- Check model name spelling: must match the HuggingFace model hub exactly
- Check internet connection: the model downloads on first use
- Use a local path: `model_name="/path/to/local/model"`
- Check the HuggingFace token: required for private models
Symptom: Low cache hit rate (<50%)
Solutions:
- Check configuration consistency: use the same `embedding_config` name for all operations
- Verify model persistence: the model should load once and stay in memory
- Check batch processing: large batches improve cache efficiency
- Review logs: repeated model loads indicate a configuration mismatch
IRIS EMBEDDING uses a three-tier caching strategy:
- Session Cache: In-memory model instances (lasts for process lifetime)
- Disk Cache: Downloaded model weights (HuggingFace cache)
- Embedding Cache: Computed embeddings stored in IRIS tables
Cache Invalidation: Occurs only when the configuration changes or the model is updated
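The session tier is what delivers the headline speedup. A minimal sketch of the idea (one in-memory model instance per configuration, kept for the process lifetime) using `functools.lru_cache`; the names here are illustrative, not iris-vector-rag internals:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=None)  # process-lifetime session cache
def get_model(model_name: str, device: str = "cpu") -> SentenceTransformer:
    # First call per (model_name, device) pays the load cost;
    # subsequent calls return the same in-memory instance.
    return SentenceTransformer(model_name, device=device)

m1 = get_model("sentence-transformers/all-MiniLM-L6-v2")
m2 = get_model("sentence-transformers/all-MiniLM-L6-v2")
assert m1 is m2  # same cached instance, no reload
```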
IRIS EMBEDDING integrates with IRIS SQL at the table level:
```sql
-- Create table with auto-vectorization
CREATE TABLE documents (
    id INT,
    content VARCHAR(5000),
    embedding VECTOR(DOUBLE, 384)
)

-- Configure IRIS EMBEDDING
-- (Done via Python API, not SQL)

-- INSERT triggers automatic vectorization
INSERT INTO documents (id, content)
VALUES (1, 'Document text...')
-- → embedding column automatically populated via cached model
```

Performance optimizations:
- Model Pre-loading: Models loaded on first use and kept in memory
- Batch Vectorization: Documents vectorized in batches for GPU efficiency
- Async Processing: Non-blocking vectorization for large collections
- Memory Pooling: Reuse GPU memory across batches
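To illustrate the batching point in isolation: chunking documents so the device always sees full batches rather than single rows (note that `model.encode` can also batch internally via its `batch_size` argument; the explicit loop is just for clarity):

```python
from sentence_transformers import SentenceTransformer

def batched(items, batch_size):
    """Yield fixed-size chunks so the device sees full batches, not single rows."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [f"document {i}" for i in range(1_000)]

vectors = []
for chunk in batched(texts, 32):
    vectors.extend(model.encode(chunk))  # one forward pass per 32 docs

print(len(vectors), len(vectors[0]))  # 1000 vectors, 384 dimensions each
```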
Before (manual embeddings):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([doc.page_content for doc in docs])
# Store embeddings manually...
```

After (IRIS EMBEDDING):
```python
from iris_vector_rag import create_pipeline

pipeline = create_pipeline(
    'basic',
    embedding_config='my_embeddings'
)
pipeline.load_documents(documents=docs)
# Embeddings generated and stored automatically
```

Benefits: 346x faster, automatic caching, simplified code
Before (OpenAI API):
```python
import openai

response = openai.Embedding.create(
    input=[doc.page_content for doc in docs],
    model="text-embedding-ada-002"
)
# Process and store embeddings...
```

After (IRIS EMBEDDING):
```python
pipeline = create_pipeline(
    'basic',
    embedding_config='local_embeddings'  # No API costs
)
pipeline.load_documents(documents=docs)
```

Benefits: No API costs, faster, offline capability, data privacy
- User Guide - Complete iris-vector-rag usage guide
- API Reference - Full API documentation
- Performance Tuning - Optimization best practices
- CHANGELOG - Feature 051 implementation details