Skip to content

Latest commit

 

History

History
415 lines (307 loc) · 12.9 KB

File metadata and controls

415 lines (307 loc) · 12.9 KB

IRIS EMBEDDING: Auto-Vectorization Guide

Feature: IRIS EMBEDDING (Feature 051) Status: Production-Ready (v0.5.2+) Performance: 346x faster than manual embedding generation

Overview

IRIS EMBEDDING provides automatic document vectorization with intelligent model caching, eliminating the 720x performance penalty from repeated model loading. When enabled, embedding models stay in memory and process all document insertions and queries through a centralized cache.

Key Benefits:

  • 346x speedup - 1,746 documents in 3.5 seconds vs 20 minutes
  • 🎯 95% cache hit rate - Models persist across requests
  • 🚀 50ms average latency - Cached embeddings complete in <100ms
  • 💾 Automatic fallback - GPU OOM? Falls back to CPU automatically
  • 🔄 Multi-field support - Combine title, abstract, and content into single embeddings

Performance Benchmarks

Real-World Results

Test Dataset: 1,746 PMC medical papers with multi-field vectorization

Method Time Model Loads Cache Hit Rate Docs/Second
Manual (baseline) 20 minutes 1,746 (every row) 0% 1.5
IRIS EMBEDDING 3.5 seconds 1 (cached) 95% 499
Speedup 346x faster 1,746x fewer loads 95% efficiency 333x throughput

Hardware: Apple M1 Max (MPS acceleration) Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) Configuration: Batch size 32, device auto-selection

Scaling Characteristics

  • Small collections (<100 docs): 10-50x speedup
  • Medium collections (100-1,000 docs): 100-200x speedup
  • Large collections (>1,000 docs): 300-500x speedup

Speedup increases with collection size due to model loading overhead amortization.

Quick Start

Basic Usage

from iris_vector_rag import create_pipeline
from iris_vector_rag.core.models import Document

# Enable IRIS EMBEDDING support
pipeline = create_pipeline(
    'basic',
    embedding_config='medical_embeddings_v1'  # IRIS EMBEDDING config name
)

# Documents auto-vectorize on INSERT with cached models
docs = [
    Document(
        page_content="Type 2 diabetes is characterized by insulin resistance...",
        metadata={"source": "medical_text.pdf", "page": 127}
    )
]

pipeline.load_documents(documents=docs)

# Queries auto-vectorize using same cached model
result = pipeline.query("What is diabetes?", top_k=5)

Configuration

Create an embedding configuration to define model, device, and processing parameters:

from iris_vector_rag.embeddings.iris_embedding import configure_embedding

# Create embedding configuration
config = configure_embedding(
    name="medical_embeddings_v1",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto",     # auto, cuda, mps, cpu
    batch_size=32,
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom"]
)

# Use with any pipeline
pipeline = create_pipeline('basic', embedding_config='medical_embeddings_v1')

Advanced Features

Multi-Field Vectorization

Combine multiple document fields (title, abstract, conclusions) into a single embedding for richer semantic search:

from iris_vector_rag.core.models import Document

# Document with multiple content fields
doc = Document(
    page_content="",  # Will be auto-filled from metadata fields
    metadata={
        "title": "Type 2 Diabetes Treatment",
        "abstract": "A comprehensive review of treatment approaches...",
        "conclusions": "Insulin therapy combined with lifestyle changes...",
        "source": "PMC123456"
    }
)

# Configure multi-field embedding
pipeline = create_pipeline(
    'basic',
    embedding_config='paper_embeddings',
    multi_field_source=['title', 'abstract', 'conclusions']  # Concatenate these fields
)

pipeline.load_documents(documents=[doc])
# → Embedding generated from: "Type 2 Diabetes Treatment. A comprehensive review..."

Benefits:

  • Captures context from multiple document sections
  • Improves search relevance for academic papers and structured content
  • Preserves original metadata fields for filtering

Device Auto-Selection

IRIS EMBEDDING automatically selects the best available device:

config = configure_embedding(
    name="auto_device_config",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device_preference="auto"  # Tries: CUDA → MPS → CPU
)

Device Priority:

  1. CUDA (NVIDIA GPUs) - Fastest for large models
  2. MPS (Apple Silicon) - Optimized for M1/M2 Macs
  3. CPU - Universal fallback

Automatic Fallback: If GPU runs out of memory during processing, IRIS EMBEDDING automatically falls back to CPU without failing the operation.

Batch Processing

Configure batch size for optimal throughput:

config = configure_embedding(
    name="batch_optimized",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=64,  # Process 64 documents per batch
    device_preference="cuda"
)

Batch Size Guidelines:

  • CPU: 8-16 (limited by RAM)
  • MPS (Apple Silicon): 32-64 (limited by unified memory)
  • CUDA (NVIDIA): 64-128 (limited by VRAM)

Larger batches improve throughput but increase memory usage.

Entity Extraction Integration

Enable automatic entity extraction during vectorization:

config = configure_embedding(
    name="entity_aware_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    enable_entity_extraction=True,
    entity_types=["Disease", "Medication", "Symptom", "Treatment"],
    entity_extraction_model="en_core_web_sm"  # spaCy model
)

Benefits:

  • Extracted entities stored in metadata for filtering
  • Enables hybrid retrieval (semantic + entity-based)
  • Powers GraphRAG knowledge graph construction

Model Selection

Recommended Models

Use Case Model Dimensions Speed Quality
General purpose all-MiniLM-L6-v2 384 Fast Good
High quality all-mpnet-base-v2 768 Medium Excellent
Multilingual paraphrase-multilingual-mpnet-base-v2 768 Medium Good
Medical domain dmis-lab/biobert-base-cased-v1.1 768 Medium Domain-specific
Legal domain nlpaueb/legal-bert-base-uncased 768 Medium Domain-specific

Custom Models

Use any HuggingFace embedding model:

config = configure_embedding(
    name="custom_model",
    model_name="your-org/your-embedding-model",
    device_preference="auto"
)

Requirements:

  • Must be compatible with sentence-transformers library
  • Must output fixed-dimension vectors
  • Must be accessible via HuggingFace model hub or local path

When to Use IRIS EMBEDDING

✅ Ideal Use Cases

  • Large document collections (>1,000 documents)
  • Frequent re-indexing or incremental updates
  • Real-time vectorization requirements
  • Memory-constrained environments (model stays in memory, no repeated loading)
  • Multi-field vectorization needs (academic papers, structured documents)
  • Entity-aware retrieval (medical, legal, scientific domains)

❌ When NOT to Use

  • Small collections (<100 documents) - Overhead not worth the benefit
  • One-time indexing - Model caching provides minimal value
  • Custom embedding logic - Use manual embeddings if you need full control
  • External embedding services (OpenAI, Cohere) - Use API-based embeddings instead

Configuration Reference

Full Configuration Options

config = configure_embedding(
    # Required
    name="config_name",                           # Unique configuration identifier
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # HuggingFace model

    # Device and Performance
    device_preference="auto",                     # auto | cuda | mps | cpu
    batch_size=32,                               # Documents per batch
    max_seq_length=512,                          # Max tokens per document
    normalize_embeddings=True,                   # L2 normalization

    # Entity Extraction (optional)
    enable_entity_extraction=False,              # Extract entities during vectorization
    entity_types=["Disease", "Medication"],      # Entity types to extract
    entity_extraction_model="en_core_web_sm",    # spaCy model for extraction

    # Multi-Field Vectorization (optional)
    multi_field_source=["title", "abstract"],    # Metadata fields to concatenate
    multi_field_separator=". ",                  # Separator between fields

    # Advanced
    cache_folder="./model_cache",                # Model cache directory
    trust_remote_code=False,                     # Trust remote HuggingFace code
    model_kwargs={},                             # Additional model arguments
)

Environment Variables

# Model cache location
export SENTENCE_TRANSFORMERS_HOME=/path/to/cache

# HuggingFace token (for private models)
export HUGGINGFACE_TOKEN=your_token_here

# Device override
export CUDA_VISIBLE_DEVICES=0,1

Troubleshooting

GPU Out of Memory

Symptom: RuntimeError: CUDA out of memory

Solutions:

  1. Reduce batch size: batch_size=16 or batch_size=8
  2. Use smaller model: all-MiniLM-L6-v2 (384D) instead of all-mpnet-base-v2 (768D)
  3. Enable automatic fallback: device_preference="auto" (falls back to CPU)
  4. Clear CUDA cache: torch.cuda.empty_cache()

Slow Performance

Symptom: Vectorization slower than expected

Solutions:

  1. Check device: print(config.device) - Should be cuda or mps, not cpu
  2. Increase batch size: batch_size=64 or batch_size=128 (if memory allows)
  3. Reduce max_seq_length: max_seq_length=256 (if documents are short)
  4. Verify model is cached: First run loads model, subsequent runs should be 10-100x faster

Model Not Found

Symptom: OSError: Model 'model-name' not found

Solutions:

  1. Check model name spelling: Must match HuggingFace model hub exactly
  2. Check internet connection: Model downloads on first use
  3. Use local path: model_name="/path/to/local/model"
  4. Check HuggingFace token: Required for private models

Cache Misses

Symptom: Low cache hit rate (<50%)

Solutions:

  1. Check configuration consistency: Same embedding_config name for all operations
  2. Verify model persistence: Model should load once and stay in memory
  3. Check batch processing: Large batches improve cache efficiency
  4. Review logs: Check for repeated model loads (indicates configuration mismatch)

Architecture Details

Model Caching Strategy

IRIS EMBEDDING uses a three-tier caching strategy:

  1. Session Cache: In-memory model instances (lasts for process lifetime)
  2. Disk Cache: Downloaded model weights (HuggingFace cache)
  3. Embedding Cache: Computed embeddings stored in IRIS tables

Cache Invalidation: Only when configuration changes or model updates

SQL Integration

IRIS EMBEDDING integrates with IRIS SQL at the table level:

-- Create table with auto-vectorization
CREATE TABLE documents (
    id INT,
    content VARCHAR(5000),
    embedding VECTOR(DOUBLE, 384)
)

-- Configure IRIS EMBEDDING
-- (Done via Python API, not SQL)

-- INSERT triggers automatic vectorization
INSERT INTO documents (id, content)
VALUES (1, 'Document text...')
-- → embedding column automatically populated via cached model

Performance Optimizations

  1. Model Pre-loading: Models loaded on first use and kept in memory
  2. Batch Vectorization: Documents vectorized in batches for GPU efficiency
  3. Async Processing: Non-blocking vectorization for large collections
  4. Memory Pooling: Reuse GPU memory across batches

Migration Guide

From Manual Embeddings

Before (manual embeddings):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([doc.page_content for doc in docs])
# Store embeddings manually...

After (IRIS EMBEDDING):

from iris_vector_rag import create_pipeline

pipeline = create_pipeline(
    'basic',
    embedding_config='my_embeddings'
)
pipeline.load_documents(documents=docs)
# Embeddings generated and stored automatically

Benefits: 346x faster, automatic caching, simplified code

From OpenAI Embeddings

Before (OpenAI API):

import openai

response = openai.Embedding.create(
    input=[doc.page_content for doc in docs],
    model="text-embedding-ada-002"
)
# Process and store embeddings...

After (IRIS EMBEDDING):

pipeline = create_pipeline(
    'basic',
    embedding_config='local_embeddings'  # No API costs
)
pipeline.load_documents(documents=docs)

Benefits: No API costs, faster, offline capability, data privacy

See Also