Feature: IRIS EMBEDDING (Feature 051)
Status: Production-Ready (v0.5.2+)
Performance: 346x faster than manual embedding generation
IRIS EMBEDDING provides automatic document vectorization with intelligent model caching, eliminating the 720x performance penalty from repeated model loading. When enabled, embedding models stay in memory and process all document insertions and queries through a centralized cache.
Key Benefits:
- ⚡ 346x speedup - 1,746 documents in 3.5 seconds vs 20 minutes
- 🎯 95% cache hit rate - Models persist across requests
- 🚀 50ms average latency - Cached embeddings complete in <100ms
- 💾 Automatic fallback - GPU OOM? Falls back to CPU automatically
- 🔄 Multi-field support - Combine title, abstract, and content into single embeddings
Test Dataset: 1,746 PMC medical papers with multi-field vectorization
| Method | Time | Model Loads | Cache Hit Rate | Docs/Second |
|---|---|---|---|---|
| Manual (baseline) | 20 minutes | 1,746 (every row) | 0% | 1.5 |
| IRIS EMBEDDING | 3.5 seconds | 1 (cached) | 95% | 499 |
| Speedup | 346x faster | 1,746x fewer loads | 95% efficiency | 333x throughput |
Hardware: Apple M1 Max (MPS acceleration)
Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
Configuration: Batch size 32, device auto-selection
Expected speedup by collection size:
- Small collections (<100 docs): 10-50x speedup
- Medium collections (100-1,000 docs): 100-200x speedup
- Large collections (>1,000 docs): 300-500x speedup
Speedup increases with collection size due to model loading overhead amortization.
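The arithmetic behind this scaling is worth a quick sketch. The load and encode times below are illustrative placeholders, not measured values; the point is that a one-off load cost amortizes across the collection, while a per-row reload cost does not:

```python
# Illustrative-only timings: a one-off model load amortizes across the
# collection, while a per-row reload cost grows linearly with it.
MODEL_LOAD_S = 2.0    # hypothetical one-off cost to load the model
ENCODE_S = 0.005      # hypothetical per-document encode cost

for n_docs in (50, 500, 5_000):
    per_row_reload = n_docs * (MODEL_LOAD_S + ENCODE_S)  # reload every row
    cached = MODEL_LOAD_S + n_docs * ENCODE_S            # load once, reuse
    print(f"{n_docs:>5} docs: {per_row_reload / cached:5.0f}x speedup")
# 50 docs: ~45x ... 5000 docs: ~371x, approaching MODEL_LOAD_S / ENCODE_S
```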
Basic usage:

```python
from iris_vector_rag import create_pipeline
from iris_vector_rag.core.models import Document

# Enable IRIS EMBEDDING support
pipeline = create_pipeline(
    'basic',
    embedding_config='medical_embeddings_v1'  # IRIS EMBEDDING config name
)

# Documents auto-vectorize on INSERT with cached models
docs = [
    Document(
        page_content="Type 2 diabetes is characterized by insulin resistance...",
        metadata={"source": "medical_text.pdf", "page": 127}
    )
]
pipeline.load_documents(documents=docs)

# Queries auto-vectorize using the same cached model
result = pipeline.query("What is diabetes?", top_k=5)
```

Create an embedding configuration to define the model, device, and processing parameters:
```python
from iris_vector_rag.embeddings.iris_embedding import configure_embedding

# Create embedding configuration
config = configure_embedding(
    name="medical_embeddings_v1",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto",  # auto, cuda, mps, cpu
    batch_size=32,
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom"]
)

# Use with any pipeline
pipeline = create_pipeline('basic', embedding_config='medical_embeddings_v1')
```

Combine multiple document fields (title, abstract, conclusions) into a single embedding for richer semantic search:
```python
from iris_vector_rag.core.models import Document

# Document with multiple content fields
doc = Document(
    page_content="",  # Will be auto-filled from metadata fields
    metadata={
        "title": "Type 2 Diabetes Treatment",
        "abstract": "A comprehensive review of treatment approaches...",
        "conclusions": "Insulin therapy combined with lifestyle changes...",
        "source": "PMC123456"
    }
)

# Configure multi-field embedding
pipeline = create_pipeline(
    'basic',
    embedding_config='paper_embeddings',
    multi_field_source=['title', 'abstract', 'conclusions']  # Concatenate these fields
)
pipeline.load_documents(documents=[doc])
# → Embedding generated from: "Type 2 Diabetes Treatment. A comprehensive review..."
```

Benefits:
- Captures context from multiple document sections
- Improves search relevance for academic papers and structured content
- Preserves original metadata fields for filtering
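Under the hood, multi-field vectorization reduces to concatenating the configured fields before encoding. A minimal sketch of that step, assuming the default `". "` separator from the configuration reference below; how missing fields are handled (skipped here) is an assumption of the sketch:

```python
# Sketch of the concatenation step behind multi-field vectorization.
fields = ["title", "abstract", "conclusions"]
separator = ". "

metadata = {
    "title": "Type 2 Diabetes Treatment",
    "abstract": "A comprehensive review of treatment approaches...",
    "conclusions": "Insulin therapy combined with lifestyle changes...",
    "source": "PMC123456",  # not in multi_field_source, so excluded
}

# Join only the configured fields, skipping any that are missing or empty.
text_to_embed = separator.join(metadata[f] for f in fields if metadata.get(f))
print(text_to_embed)
# Type 2 Diabetes Treatment. A comprehensive review of treatment approaches...
```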
IRIS EMBEDDING automatically selects the best available device:
```python
config = configure_embedding(
    name="auto_device_config",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto"  # Tries: CUDA → MPS → CPU
)
```

Device Priority:
1. CUDA (NVIDIA GPUs) - Fastest for large models
2. MPS (Apple Silicon) - Optimized for M1/M2 Macs
3. CPU - Universal fallback
Automatic Fallback: If GPU runs out of memory during processing, IRIS EMBEDDING automatically falls back to CPU without failing the operation.
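A minimal sketch of what such a fallback can look like, using `sentence-transformers` and PyTorch directly (iris-vector-rag's internal handling may differ):

```python
import torch
from sentence_transformers import SentenceTransformer

def encode_with_fallback(model: SentenceTransformer, texts, batch_size=32):
    """Encode on the current device; retry on CPU if the GPU runs out of memory."""
    try:
        return model.encode(texts, batch_size=batch_size)
    except torch.cuda.OutOfMemoryError:  # subclass of RuntimeError (PyTorch >= 1.13)
        torch.cuda.empty_cache()  # release cached GPU blocks
        model.to("cpu")           # move weights to system RAM
        return model.encode(texts, batch_size=batch_size)
```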
Configure batch size for optimal throughput:
```python
config = configure_embedding(
    name="batch_optimized",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=64,  # Process 64 documents per batch
    device_preference="cuda"
)
```

Batch Size Guidelines:
- CPU: 8-16 (limited by RAM)
- MPS (Apple Silicon): 32-64 (limited by unified memory)
- CUDA (NVIDIA): 64-128 (limited by VRAM)
Larger batches improve throughput but increase memory usage.
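Because the sweet spot depends on your hardware, a quick sweep with `sentence-transformers` directly can find it empirically (model name and document text here are just examples):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["Type 2 diabetes is characterized by insulin resistance..."] * 512

# Measure throughput at each batch size; pick the largest that fits in memory.
for batch_size in (8, 16, 32, 64, 128):
    start = time.perf_counter()
    model.encode(texts, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:>3}: {len(texts) / elapsed:7.1f} docs/s")
```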
Enable automatic entity extraction during vectorization:
```python
config = configure_embedding(
    name="entity_aware_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom", "Treatment"],
    entity_extraction_model="en_core_web_sm"  # spaCy model
)
```

Benefits:
- Extracted entities stored in metadata for filtering
- Enables hybrid retrieval (semantic + entity-based)
- Powers GraphRAG knowledge graph construction
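For intuition, this is roughly what extraction with a spaCy pipeline amounts to. Note that `en_core_web_sm` ships general-purpose labels (PERSON, ORG, GPE, ...); domain types such as Disease or Medication require a domain-trained model, so the sketch below illustrates the mechanism rather than medical-grade output:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # general-purpose labels only

doc = nlp("Metformin is a first-line treatment for type 2 diabetes.")

# Each entity carries its surface text and a label usable for filtering.
entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
print(entities)
```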
| Use Case | Model | Dimensions | Speed | Quality |
|---|---|---|---|---|
| General purpose | `all-MiniLM-L6-v2` | 384 | Fast | Good |
| High quality | `all-mpnet-base-v2` | 768 | Medium | Excellent |
| Multilingual | `paraphrase-multilingual-mpnet-base-v2` | 768 | Medium | Good |
| Medical domain | `dmis-lab/biobert-base-cased-v1.1` | 768 | Medium | Domain-specific |
| Legal domain | `nlpaueb/legal-bert-base-uncased` | 768 | Medium | Domain-specific |
Use any HuggingFace embedding model:
```python
config = configure_embedding(
    name="custom_model",
    model_name="your-org/your-embedding-model",
    device_preference="auto"
)
```

Requirements:
- Must be compatible with the `sentence-transformers` library
- Must output fixed-dimension vectors
- Must be accessible via the HuggingFace model hub or a local path
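A quick way to check a candidate model against these requirements before wiring it into a configuration (the model name is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Placeholder name: substitute your HuggingFace model ID or a local path.
model = SentenceTransformer("your-org/your-embedding-model")

vectors = model.encode(["probe sentence one", "probe sentence two"])
assert vectors.ndim == 2 and vectors.shape[0] == 2, "expected one fixed-size vector per input"

print(vectors.shape)                             # e.g. (2, 384)
print(model.get_sentence_embedding_dimension())  # dimension to use in VECTOR(DOUBLE, n)
```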
Use IRIS EMBEDDING when:
- Large document collections (>1,000 documents)
- Frequent re-indexing or incremental updates
- Real-time vectorization requirements
- Memory-constrained environments (model stays in memory, no repeated loading)
- Multi-field vectorization needs (academic papers, structured documents)
- Entity-aware retrieval (medical, legal, scientific domains)
Avoid IRIS EMBEDDING when:
- Small collections (<100 documents) - Overhead not worth the benefit
- One-time indexing - Model caching provides minimal value
- Custom embedding logic - Use manual embeddings if you need full control
- External embedding services (OpenAI, Cohere) - Use API-based embeddings instead
```python
config = configure_embedding(
    # Required
    name="config_name",                                   # Unique configuration identifier
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # HuggingFace model

    # Device and Performance
    device_preference="auto",    # auto | cuda | mps | cpu
    batch_size=32,               # Documents per batch
    max_seq_length=512,          # Max tokens per document
    normalize_embeddings=True,   # L2 normalization

    # Entity Extraction (optional)
    enable_entity_extraction=False,            # Extract entities during vectorization
    entity_types=["Disease", "Medication"],    # Entity types to extract
    entity_extraction_model="en_core_web_sm",  # spaCy model for extraction

    # Multi-Field Vectorization (optional)
    multi_field_source=["title", "abstract"],  # Metadata fields to concatenate
    multi_field_separator=". ",                # Separator between fields

    # Advanced
    cache_folder="./model_cache",  # Model cache directory
    trust_remote_code=False,       # Trust remote HuggingFace code
    model_kwargs={},               # Additional model arguments
)
```

Related environment variables:

```bash
# Model cache location
export SENTENCE_TRANSFORMERS_HOME=/path/to/cache
# HuggingFace token (for private models)
export HUGGINGFACE_TOKEN=your_token_here
# Device override
export CUDA_VISIBLE_DEVICES=0,1
```

Symptom: `RuntimeError: CUDA out of memory`
Solutions:
- Reduce batch size: `batch_size=16` or `batch_size=8`
- Use a smaller model: `all-MiniLM-L6-v2` (384D) instead of `all-mpnet-base-v2` (768D)
- Enable automatic fallback: `device_preference="auto"` (falls back to CPU)
- Clear the CUDA cache: `torch.cuda.empty_cache()`
Symptom: Vectorization slower than expected
Solutions:
- Check the device: `print(config.device)` should be `cuda` or `mps`, not `cpu`
- Increase batch size: `batch_size=64` or `batch_size=128` (if memory allows)
- Reduce `max_seq_length`: `max_seq_length=256` (if documents are short)
- Verify the model is cached: the first run loads the model; subsequent runs should be 10-100x faster
Symptom: `OSError: Model 'model-name' not found`
Solutions:
- Check model name spelling: must match the HuggingFace model hub exactly
- Check internet connection: the model downloads on first use
- Use a local path: `model_name="/path/to/local/model"`
- Check the HuggingFace token: required for private models
Symptom: Low cache hit rate (<50%)
Solutions:
- Check configuration consistency: use the same `embedding_config` name for all operations
- Verify model persistence: the model should load once and stay in memory
- Check batch processing: large batches improve cache efficiency
- Review logs: repeated model loads indicate a configuration mismatch
IRIS EMBEDDING uses a three-tier caching strategy:
- Session Cache: In-memory model instances (lasts for process lifetime)
- Disk Cache: Downloaded model weights (HuggingFace cache)
- Embedding Cache: Computed embeddings stored in IRIS tables
Cache Invalidation: Occurs only when the configuration changes or the model is updated
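The session tier is what delivers the headline speedup. A minimal sketch of the idea (one in-memory model instance per configuration, kept for the process lifetime) using `functools.lru_cache`; the names here are illustrative, not iris-vector-rag internals:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=None)  # process-lifetime session cache
def get_model(model_name: str, device: str = "cpu") -> SentenceTransformer:
    # First call per (model_name, device) pays the load cost;
    # subsequent calls return the same in-memory instance.
    return SentenceTransformer(model_name, device=device)

m1 = get_model("sentence-transformers/all-MiniLM-L6-v2")
m2 = get_model("sentence-transformers/all-MiniLM-L6-v2")
assert m1 is m2  # same cached instance, no reload
```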
IRIS EMBEDDING integrates with IRIS SQL at the table level:
```sql
-- Create table with auto-vectorization
CREATE TABLE documents (
    id INT,
    content VARCHAR(5000),
    embedding VECTOR(DOUBLE, 384)
)

-- Configure IRIS EMBEDDING
-- (Done via Python API, not SQL)

-- INSERT triggers automatic vectorization
INSERT INTO documents (id, content)
VALUES (1, 'Document text...')
-- → embedding column automatically populated via cached model
```

Performance optimizations:
- Model Pre-loading: Models loaded on first use and kept in memory
- Batch Vectorization: Documents vectorized in batches for GPU efficiency
- Async Processing: Non-blocking vectorization for large collections
- Memory Pooling: Reuse GPU memory across batches
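To illustrate the batching point in isolation: chunking documents so the device always sees full batches rather than single rows (note that `model.encode` can also batch internally via its `batch_size` argument; the explicit loop is just for clarity):

```python
from sentence_transformers import SentenceTransformer

def batched(items, batch_size):
    """Yield fixed-size chunks so the device sees full batches, not single rows."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [f"document {i}" for i in range(1_000)]

vectors = []
for chunk in batched(texts, 32):
    vectors.extend(model.encode(chunk))  # one forward pass per 32 docs

print(len(vectors), len(vectors[0]))  # 1000 vectors, 384 dimensions each
```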
Before (manual embeddings):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([doc.page_content for doc in docs])
# Store embeddings manually...
```

After (IRIS EMBEDDING):
```python
from iris_vector_rag import create_pipeline

pipeline = create_pipeline(
    'basic',
    embedding_config='my_embeddings'
)
pipeline.load_documents(documents=docs)
# Embeddings generated and stored automatically
```

Benefits: 346x faster, automatic caching, simplified code
Before (OpenAI API):
```python
import openai

response = openai.Embedding.create(
    input=[doc.page_content for doc in docs],
    model="text-embedding-ada-002"
)
# Process and store embeddings...
```

After (IRIS EMBEDDING):
```python
pipeline = create_pipeline(
    'basic',
    embedding_config='local_embeddings'  # No API costs
)
pipeline.load_documents(documents=docs)
```

Benefits: No API costs, faster, offline capability, data privacy
- User Guide - Complete iris-vector-rag usage guide
- API Reference - Full API documentation
- Performance Tuning - Optimization best practices
- CHANGELOG - Feature 051 implementation details