FORetrieval is a multimodal document retrieval library built on top of colpali-engine. It indexes document pages as images using late-interaction models (ColPali, ColQwen2, ColQwen2.5) and retrieves the most relevant pages for a given query. It is used by FORag as its retrieval backend.
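Late-interaction scoring, as used by ColPali-style models, compares every query-token embedding with every page-patch embedding and sums the best match per query token (often called MaxSim). The sketch below illustrates that operation in plain Python on toy 2-dimensional vectors. The function names and data are illustrative only; they are not part of the FORetrieval API, which performs this on model tensors.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(
    query_tokens: Sequence[Vector],
    pages: Dict[str, Sequence[Vector]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (page_id, score) pairs with the highest MaxSim score."""
    scored = [(page_id, maxsim_score(query_tokens, patches)) for page_id, patches in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data: two query tokens, two pages with two patch embeddings each.
query = [(1.0, 0.0), (0.0, 1.0)]
pages = {
    "page_a": [(1.0, 0.0), (0.0, 1.0)],  # matches both query tokens
    "page_b": [(0.5, 0.0), (0.0, 0.0)],  # weak match for the first token only
}
ranking = rank_pages(query, pages, k=2)
```

Because each query token keeps its own best match, a page can score well when different regions of it answer different parts of the query.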
Key features:
- Two storage backends: local file-based (ColPali legacy .pt files) or Qdrant embedded vector store (default)
- Metadata generation: filesystem metadata always; AI-generated tags, language detection, and short descriptions optionally
- Metadata filtering: filter the retrieval pool by ext, mtime, language, tags, document_type, or arbitrary regex patterns before scoring
- Docling ingestion: optional semantic PDF chunking using Docling, producing image chunks aligned with document structure
- Heatmap and circle visualisation: relevance overlays for retrieved pages

uv sync
# Optional extras:
uv sync --extra qdrant # Qdrant storage backend (recommended for large indexes)
uv sync --extra docling # Docling-based PDF chunking
uv sync --extra embedding_server # Remote vLLM embedding server (adds paramiko for auto-deploy)
uv sync --extra quantization # 4-bit / 8-bit local model quantization (adds bitsandbytes)

Poppler is required by pdf2image for PDF-to-image conversion:
# Debian / Ubuntu
sudo apt-get install -y poppler-utils

Flash Attention speeds up ColQwen2 / Gemma-based models significantly:
uv pip install flash-attn

ColPali uses multi-billion parameter models. A GPU is strongly recommended for indexing and search. Weak or older GPUs (sm_70+) work fine; CPU is supported but slow.

from foretrieval import MultiModalRetrieverModel
# Index a folder of PDFs
model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    index_root="my_indexes",
    storage_qdrant=True,  # use Qdrant backend (default)
)
model.index(
    input_path="path/to/docs/",
    index_name="my_index",
    store_collection_with_index=True,
)
# Load an existing index and search
model = MultiModalRetrieverModel.from_index(
    index_path="my_index",
    index_root="my_indexes",
)
results = model.search("maximum output current", k=3)
for r in results:
    print(r.doc_id, r.page_num, r.score)

FORetrieval supports two backends for storing embeddings:

| Backend | Constructor flag | Description |
|---|---|---|
| Qdrant (default) | storage_qdrant=True | Embeddings stored in a local embedded Qdrant database under <index_root>/<index_name>/qdrant/. Does not load all embeddings into RAM. Requires foretrieval[qdrant]. |
| Local | storage_qdrant=False | Embeddings saved as .pt files, loaded into memory at search time. No extra dependency. |

When loading an existing index with from_index(), the backend is read automatically from the saved index_config.json.gz — no manual flag needed.
# Create with Qdrant backend
model = MultiModalRetrieverModel.from_pretrained(..., storage_qdrant=True)
# Create with local backend
model = MultiModalRetrieverModel.from_pretrained(..., storage_qdrant=False)
# Load existing index — backend auto-detected
model = MultiModalRetrieverModel.from_index(index_path="my_index", index_root=".")

Metadata can be attached to each document at indexing time. Two levels are available:
Filesystem metadata (no AI required): always populated from the file itself.
| Field | Source |
|---|---|
| stem, ext, mime | filename and MIME type |
| mtime | file modification time (ISO-8601 UTC) |
| page_count | number of pages (PDFs only) |
| author, title | embedded PDF metadata (may be absent) |
| image_width, image_height | dimensions (images only) |

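As a rough illustration of where the simpler filesystem fields come from, the sketch below derives stem, ext, mime, and mtime with the standard library only. The helper name `filesystem_metadata` is hypothetical and this is not the library's implementation; page_count, author, title, and image dimensions are left out because they need a PDF or image parser.

```python
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def filesystem_metadata(path: Path) -> dict:
    """Collect metadata that can be read from the file itself (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path.name)
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "stem": path.stem,
        "ext": path.suffix.lower(),
        "mime": mime,  # None when the extension is unknown
        "mtime": mtime.isoformat(),  # ISO-8601 string in UTC
    }


# Demonstrate on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    meta = filesystem_metadata(sample)
```

Storing mtime as a timezone-aware ISO-8601 string keeps it directly comparable with the date filters described later.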
AI-generated metadata (requires an LLM provider): language, tags, document_type, short_description.
from pathlib import Path

from foretrieval.metadata import ai_metadata_provider_factory
from foretrieval.models_metadata import build_metadata_list_for_dir
# No-AI provider: filesystem fields only
provider = ai_metadata_provider_factory(None)
# AI provider: enriches with language, tags, document_type, short_description
provider = ai_metadata_provider_factory({
    "provider": "openrouter",
    "name": "mistralai/mistral-small-3.2-24b-instruct",
    "api_key": "...",
})
metadata_list = build_metadata_list_for_dir(Path("docs/"), provider)
model.index(
    input_path="docs/",
    index_name="my_index",
    metadata=metadata_list,
)

When an index was built with metadata, search() accepts a filter_metadata dict that restricts the scoring pool to matching documents only.
from foretrieval.models_metadata import MetadataFilter
# Only PDF files
results = model.search("max current", k=3, filter_metadata={"ext": ".pdf"})
# Files modified after a date
results = model.search("max current", k=3, filter_metadata={
    "mtime": {">=": "2025-01-01T00:00:00Z"}
})
# Multiple criteria (AND by default)
results = model.search("max current", k=3, filter_metadata={
    "ext": ".pdf",
    "language": "en",
})
# OR logic
results = model.search("max current", k=3, filter_metadata={
    "ext": [".pdf", ".docx"],
    "logic": "OR",
})

| Filter field | Type | Description |
|---|---|---|
| ext | str or list[str] | File extension(s) |
| mtime | dict | Operators: >=, <=, >, <, == against ISO-8601 string |
| language | str or list[str] | Language code(s), e.g. "en" |
| tags | str or list[str] | Any tag in common (requires AI metadata) |
| document_type | str or list[str] | Document type (requires AI metadata) |
| logic | "AND" or "OR" | How to combine criteria (default: "AND") |

Any other key is matched by exact string equality against the stored metadata dict.
Use the regex field for substring or pattern matching on any text field. Patterns use Python re.search and are always case-insensitive:
# Files whose name contains "general"
results = model.search("max current", k=3, filter_metadata={
    "regex": {"stem": "general"}
})
# Title contains "motor" or "pump"
results = model.search("specs", k=3, filter_metadata={
    "regex": {"title": "motor|pump"}
})
# Combine with ext filter
results = model.search("specs", k=3, filter_metadata={
    "ext": ".pdf",
    "regex": {"stem": "^report_2025"},
})

When the filter matches no documents, search() returns an empty list [] without raising.
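To make the filter semantics above concrete, here is a minimal sketch of how such a filter dict could be evaluated against one document's metadata. It is an illustration written from the rules in this section, not FORetrieval's actual code: the `matches` function and its helpers are hypothetical, and the sketch assumes stored mtime values are timezone-aware ISO-8601 strings.

```python
import re
from datetime import datetime

# Comparison operators accepted for the mtime criterion.
_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def _parse_iso(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def _as_list(value):
    """Wrap a scalar in a list so str and list[str] criteria share one code path."""
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def matches(meta: dict, flt: dict) -> bool:
    """Return True when a document's metadata satisfies the filter dict."""
    checks = []
    for key, wanted in flt.items():
        if key == "logic":
            continue
        if key == "mtime":
            if "mtime" not in meta:
                checks.append(False)
                continue
            stamp = _parse_iso(meta["mtime"])
            checks.extend(_OPS[op](stamp, _parse_iso(ref)) for op, ref in wanted.items())
        elif key == "regex":
            # re.search with IGNORECASE, one pattern per metadata field.
            checks.extend(
                re.search(pattern, str(meta.get(field, "")), re.IGNORECASE) is not None
                for field, pattern in wanted.items()
            )
        elif key == "tags":
            # "Any tag in common": non-empty intersection.
            checks.append(bool(set(_as_list(wanted)) & set(meta.get("tags") or [])))
        else:
            checks.append(meta.get(key) in _as_list(wanted))
    if not checks:
        return True  # an empty filter keeps every document
    combine = any if flt.get("logic", "AND") == "OR" else all
    return combine(checks)


docs = [
    {"stem": "general_specs", "ext": ".pdf", "language": "en",
     "mtime": "2025-03-01T12:00:00+00:00", "tags": ["motor"]},
    {"stem": "notes", "ext": ".docx", "language": "fr",
     "mtime": "2024-06-01T08:00:00+00:00", "tags": []},
]
pool = [d["stem"] for d in docs
        if matches(d, {"ext": ".pdf", "mtime": {">=": "2025-01-01T00:00:00Z"}})]
```

Filtering before scoring means the relevance computation only runs on the documents left in the pool, and an empty pool naturally yields an empty result list.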
FORetrieval optionally uses Docling to convert PDFs into semantically meaningful image chunks rather than whole pages. Each chunk corresponds to a coherent region of text and associated figures.
model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    ingestion={"backend": "docling"},
    index_root="my_indexes",
)
model.index(input_path="docs/", index_name="chunked_index")

Results include a chunk_num field identifying the exact Docling chunk within the page.
Install the dev dependencies first:
uv sync --extra dev

No API keys or GPU required; runs in seconds:
pytest -m "not slow and not integration"

pytest tests/test_metadata_no_ai.py

Set at least one API key:
export OPENROUTER_API_KEY=...
export OPENAI_API_KEY=...
export MISTRAL_API_KEY=...
export OLLAMA_HOST=http://localhost:11434 # + optionally OLLAMA_MODEL (default: mistral-small-latest)

pytest tests/test_metadata_ai.py -v

All available backends are detected automatically and the suite runs once per backend.
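One plausible way to implement this kind of detection is to check which of the environment variables listed above are set. The sketch below is an assumption about the mechanism, not the test suite's actual code; `detect_backends` and the backend labels are hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Environment variable -> backend label (labels are illustrative).
BACKEND_ENV_VARS = {
    "OPENROUTER_API_KEY": "openrouter",
    "OPENAI_API_KEY": "openai",
    "MISTRAL_API_KEY": "mistral",
    "OLLAMA_HOST": "ollama",
}


def detect_backends(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the backends whose environment variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for var, name in BACKEND_ENV_VARS.items() if env.get(var)]


# Example with an explicit mapping instead of the real environment.
available = detect_backends({
    "OPENAI_API_KEY": "sk-example",
    "OLLAMA_HOST": "http://localhost:11434",
})
```

A suite could pass such a list to pytest.mark.parametrize so that each test runs once per detected backend.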
# Unit tests (no GPU needed, Qdrant mocked)
pytest tests/test_qdrant.py -m "not slow and not integration"
# Full integration test (GPU + qdrant-client required)
pytest tests/test_qdrant.py -m "slow and integration"

pytest tests/test_metadata_filter.py

Full ColPali indexing and search:
pytest -m slow

| Marker | Meaning |
|---|---|
| slow | GPU-dependent or computationally expensive |
| integration | Requires a live API key or Ollama daemon |

FORetrieval can offload all embedding computation to a remote GPU server running vLLM. The local machine only loads the processor (tokenizer + image preprocessor) — no model weights, no GPU required locally.
Requirements:
- vLLM ≥ 0.19.0 on the remote server
- Only ColQwen3 / ColQwen3.5 models are supported by the vLLM /pooling endpoint. ColPali, ColQwen2, and ColQwen2.5 are not supported.
- Recommended model: athrael-soju/colqwen3.5-4.5B-v3 (rank 3 on ViDoRe V3, 320-dim, Apache 2.0)

from foretrieval import MultiModalRetrieverModel
from foretrieval.embedding_server import EmbeddingServerConfig
cfg = EmbeddingServerConfig(
    url="http://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
)
model = MultiModalRetrieverModel.from_pretrained(
    "athrael-soju/colqwen3.5-4.5B-v3",
    index_root="my_indexes",
    embedding_server=cfg,
)
model.index("path/to/docs/", index_name="my_index")
results = model.search("maximum altitude", k=3)

Set auto_deploy=True to have FORetrieval SSH to the GPU server and start the vLLM Docker container automatically if it is not already running. Requires foretrieval[embedding_server] (adds paramiko).
cfg = EmbeddingServerConfig(
    url="http://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
    auto_deploy=True,
    ssh_host="gpu-server",  # SSH target
    ssh_user="myuser",  # optional, defaults to $USER
    n_gpus=-1,  # -1 = all available GPUs (auto-detected via nvidia-smi)
)

The manager pulls vllm/vllm-openai:latest, starts the container with --tensor-parallel-size N, and writes a metadata file at ~/.foretrieval/deployment.json on the remote. Subsequent calls detect the running container and skip redeployment.
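The mapping from n_gpus to the --tensor-parallel-size value can be pictured as follows. This is a sketch under assumptions: it parses the text printed by `nvidia-smi -L` (one "GPU N: ..." line per device), the function names are hypothetical, and the sample listing is made up for illustration. The deployment manager's real logic may differ.

```python
def count_gpus(nvidia_smi_listing: str) -> int:
    """Count devices in the output of `nvidia-smi -L` (one 'GPU N: ...' line each)."""
    return sum(
        1 for line in nvidia_smi_listing.splitlines() if line.strip().startswith("GPU ")
    )


def resolve_tensor_parallel_size(n_gpus: int, nvidia_smi_listing: str) -> int:
    """Turn the n_gpus setting into a concrete tensor-parallel size."""
    available = count_gpus(nvidia_smi_listing)
    if available == 0:
        raise RuntimeError("no GPU reported by nvidia-smi")
    if n_gpus == -1:
        return available  # -1 means "use every GPU that was detected"
    if not 1 <= n_gpus <= available:
        raise ValueError(f"n_gpus={n_gpus} but {available} GPU(s) available")
    return n_gpus


# Made-up listing for a two-GPU host.
listing = (
    "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-aaaa)\n"
    "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-bbbb)\n"
)
tp_size = resolve_tensor_parallel_size(-1, listing)
```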
cfg = EmbeddingServerConfig(
    url="https://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
    api_key="my-secret-token",  # Authorization: Bearer header
    verify_ssl=False,  # for self-signed certificates
)

Deploy vLLM with --api-key my-secret-token to require authentication.
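To check by hand that the token and TLS settings are accepted, a request like the one below can be built with the standard library. It is a sketch, not FORetrieval's client code: the helper names are hypothetical, and it assumes the server exposes the OpenAI-compatible /v1/models listing.

```python
import ssl
import urllib.request
from typing import Optional


def build_models_request(base_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the model listing, with optional bearer authentication."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return urllib.request.Request(base_url.rstrip("/") + "/v1/models", headers=headers)


def build_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Return an SSL context; verification is disabled only when verify_ssl is False."""
    context = ssl.create_default_context()
    if not verify_ssl:
        context.check_hostname = False  # must be cleared before changing verify_mode
        context.verify_mode = ssl.CERT_NONE
    return context


request = build_models_request("https://gpu-server:8000", api_key="my-secret-token")
context = build_ssl_context(verify_ssl=False)

# To actually probe the server (requires network access):
# with urllib.request.urlopen(request, context=context, timeout=10) as response:
#     print(response.status)
```

Disabling certificate verification is only appropriate for self-signed certificates on a trusted network.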
If port 8000 is not directly reachable, open an SSH tunnel first:
ssh -fNL 8000:localhost:8000 gpu-server

Then use http://localhost:8000 as the URL.

| Field | Default | Description |
|---|---|---|
| url | required | Base URL of the vLLM server |
| model_name | required | HuggingFace model ID (must contain colqwen3) |
| auto_deploy | false | SSH + Docker auto-deploy |
| ssh_host | None | SSH hostname (required when auto_deploy=True) |
| ssh_user | None | SSH username (defaults to $USER) |
| ssh_key_path | None | Path to SSH private key (defaults to SSH agent) |
| n_gpus | -1 | Number of GPUs (-1 = all available) |
| port | 8000 | Port exposed on the remote server |
| hf_token | None | HuggingFace token for gated models |
| api_key | None | Bearer token for server authentication |
| verify_ssl | True | Verify SSL certificates |
| batch_size | 4 | Images per request (auto-halved on OOM) |
| request_timeout | 120 | HTTP timeout in seconds |

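The "auto-halved on OOM" behaviour of batch_size can be understood as a retry loop that shrinks the batch whenever a request fails for lack of memory. The sketch below shows one way to write such a loop. The `OutOfMemory` exception, `embed_all`, and `fake_server` are hypothetical stand-ins, not the library's internals.

```python
from typing import Callable, List, Sequence


class OutOfMemory(Exception):
    """Stand-in for whatever error signals that a batch was too large."""


def embed_all(
    items: Sequence,
    send_batch: Callable[[Sequence], List],
    batch_size: int = 4,
) -> List:
    """Embed items in batches, halving the batch size whenever a request runs out of memory."""
    results: List = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(send_batch(batch))
        except OutOfMemory:
            if batch_size == 1:
                raise  # a single item does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with the smaller batch
        start += len(batch)
    return results


def fake_server(batch: Sequence) -> List[str]:
    """Toy backend that can only handle two items per request."""
    if len(batch) > 2:
        raise OutOfMemory
    return [f"emb({item})" for item in batch]


embeddings = embed_all(["p1", "p2", "p3", "p4", "p5"], fake_server, batch_size=4)
```

In this sketch the reduced batch size is kept for the rest of the run, which avoids hitting the same failure on every subsequent batch.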
For local (non-remote) inference, 4-bit and 8-bit quantization reduce VRAM usage via BitsAndBytes. Requires foretrieval[quantization] and a CUDA device.
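As a back-of-the-envelope illustration of the saving, weight memory scales with the number of bits stored per parameter. The sketch below estimates it for a hypothetical 3-billion-parameter model. It covers weights only and ignores quantization constants, activations, and any layers left unquantized, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


params = 3e9  # hypothetical 3-billion-parameter model
estimates = {bits: round(weight_memory_gib(params, bits), 2) for bits in (16, 8, 4)}
# estimates -> {16: 5.59, 8: 2.79, 4: 1.4}
```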
model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    load_in_4bit=True,  # or load_in_8bit=True
    bnb_4bit_quant_type="nf4",  # "nf4" (default) or "fp4"
    bnb_4bit_compute_dtype="float16",  # compute dtype
)

FORetrieval was originally forked from Byaldi, a wrapper around the ColPali repository. It has since diverged significantly to add metadata generation and filtering, Qdrant storage, Docling ingestion, and heatmap visualisation.