Commit e3d8851

fix(wsi): Add multi-file pyramid + WSI selection for DICOM files (#270)
* Add `select_dicom_files`, which filters out non-WSI files using `SOPClassUID` and selects only the highest resolution image from multi-file pyramids to be processed
* Integrate into `wsi` service to handle files with `.dcm` extension
* Refactor `_scan_files` in `PydicomHandler` to reduce complexity, and add WSI/pyramid filtering as an option
* Update docs accordingly, including mention of highdicom now being optional
1 parent a13d24b commit e3d8851

8 files changed

Lines changed: 752 additions & 176 deletions

src/aignostics/CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -246,7 +246,7 @@ aignostics application run submit --application-id heta --files slide.svs
 aignostics dataset download --collection-id TCGA-LUAD --output-dir ./data
 
 # Get WSI info
-aignostics wsi info slide.svs
+aignostics wsi inspect slide.svs
 ```
 
 ## GUI Launch

src/aignostics/application/_service.py

Lines changed: 27 additions & 31 deletions
@@ -14,9 +14,7 @@
 from loguru import logger
 
 from aignostics.bucket import Service as BucketService
-from aignostics.constants import (
-    TEST_APP_APPLICATION_ID,
-)
+from aignostics.constants import TEST_APP_APPLICATION_ID
 from aignostics.platform import (
     LIST_APPLICATION_RUNS_MAX_PAGE_SIZE,
     ApiException,
@@ -32,9 +30,7 @@
     RunOutput,
     RunState,
 )
-from aignostics.platform import (
-    Service as PlatformService,
-)
+from aignostics.platform import Service as PlatformService
 from aignostics.utils import BaseService, Health, sanitize_path_component
 from aignostics.wsi import Service as WSIService
 
@@ -324,60 +320,60 @@ def generate_metadata_from_source_directory( # noqa: PLR0913, PLR0917
         """Generate metadata from the source directory.
 
         Steps:
-        1. Recursively files ending with supported extensions in the source directory
-        2. Creates a dict with the following columns
+        1. Recursively scans files ending with supported extensions in the source directory
+        2. For DICOM files (.dcm), filters out auxiliary and redundant files
+        3. Creates a dict for each file with the following fields:
             - external_id (str): The external_id of the file, by default equivalent to the absolute file name
             - source (str): The absolute filename
-            - checksum_base64_crc32c (str): The CRC32C checksum of the file constructed, base64 encoded
+            - checksum_base64_crc32c (str): The CRC32C checksum of the file, base64 encoded
             - resolution_mpp (float): The microns per pixel, inspecting the base layer
-            - height_px: The height of the image in pixels, inspecting the base layer
-            - width_px: The width of the image in pixels, inspecting the base layer
-            - Further attributes depending on the application and it's version
-        3. Applies the optional mappings to fill in additional metadata fields in the dict.
+            - height_px (int): The height of the image in pixels, inspecting the base layer
+            - width_px (int): The width of the image in pixels, inspecting the base layer
+            - Further attributes depending on the application and its version
+        4. Applies the optional mappings to fill in additional metadata fields in the dict
 
         Args:
-            source_directory (Path): The source directory to generate metadata from.
-            application_id (str): The ID of the application.
-            application_version (str|None): The version of the application (semver).
-                If not given latest version is used.
-            with_gui_metadata (bool): If True, include additional metadata for GUI.
-            mappings (list[str]): Mappings of the form '<regexp>:<key>=<value>,<key>=<value>,...'.
+            source_directory: The source directory to generate metadata from.
+            application_id: The ID of the application.
+            application_version: The version of the application (semver).
+                If not given, latest version is used.
+            with_gui_metadata: If True, include additional metadata for GUI display.
+            mappings: Mappings of the form '<regexp>:<key>=<value>,<key>=<value>,...'.
                 The regular expression is matched against the external_id attribute of the entry.
                 The key/value pairs are applied to the entry if the pattern matches.
-            with_extra_metadata (bool): If True, include extra metadata from the WSIService.
+            with_extra_metadata: If True, include extra metadata from the WSIService.
 
         Returns:
-            dict[str, Any]: The generated metadata.
+            List of metadata dictionaries, one per processable file found.
 
         Raises:
-            Exception: If the metadata cannot be generated.
-
-        Raises:
-            NotFoundError: If the application version with the given ID is not found.
-            ValueError: If
-                the source directory does not exist
-                or is not a directory.
+            NotFoundException: If the application version with the given ID is not found.
+            ValueError: If the source directory does not exist or is not a directory.
             RuntimeError: If the metadata generation fails unexpectedly.
         """
         logger.trace("Generating metadata from source directory: {}", source_directory)
 
         # TODO(Helmut): Use it
         _ = Service().application_version(application_id, application_version)
 
-        metadata = []
+        metadata: list[dict[str, Any]] = []
+        wsi_service = WSIService()
 
         try:
             extensions = get_supported_extensions_for_application(application_id)
             for extension in extensions:
-                for file_path in source_directory.glob(f"**/*{extension}"):
+                files_to_process = wsi_service.get_wsi_files_to_process(source_directory, extension)
+
+                for file_path in files_to_process:
                     # Generate CRC32C checksum with crc32c and encode as base64
                     hash_sum = crc32c.CRC32CHash()
                     with file_path.open("rb") as f:
                         while chunk := f.read(1024):
                             hash_sum.update(chunk)
                     checksum = str(base64.b64encode(hash_sum.digest()), "UTF-8")
+
                     try:
-                        image_metadata = WSIService().get_metadata(file_path)
+                        image_metadata = wsi_service.get_metadata(file_path)
                         width = image_metadata["dimensions"]["width"]
                         height = image_metadata["dimensions"]["height"]
                         mpp = image_metadata["resolution"]["mpp_x"]
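
The chunked checksum loop in this hunk can be tried in isolation. The following is a minimal sketch under stated assumptions, not the SDK code: it replaces the third-party `crc32c.CRC32CHash` used in the diff with a small pure-Python CRC32C so that it runs without extra dependencies, and it assumes the digest is the 4-byte big-endian encoding of the checksum. The names `crc32c_update` and `checksum_base64_crc32c` are hypothetical.

```python
import base64
import io
from typing import BinaryIO

# Reflected Castagnoli polynomial used by CRC32C.
_POLY = 0x82F63B78


def _build_table() -> list[int]:
    """Precompute the 256-entry lookup table for byte-wise CRC32C."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c_update(crc: int, data: bytes) -> int:
    """Fold `data` into a running CRC32C value (start with crc=0)."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def checksum_base64_crc32c(stream: BinaryIO, chunk_size: int = 1024) -> str:
    """Read the stream in chunks and return the base64-encoded CRC32C."""
    crc = 0
    while chunk := stream.read(chunk_size):
        crc = crc32c_update(crc, chunk)
    # Assumption: big-endian byte order for the 4-byte digest.
    return base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")


if __name__ == "__main__":
    print(checksum_base64_crc32c(io.BytesIO(b"123456789"), chunk_size=4))
```

Reading in fixed-size chunks keeps memory use constant regardless of slide size, which matters for multi-gigabyte WSI files; the chunk size only affects throughput, not the resulting checksum.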

src/aignostics/wsi/CLAUDE.md

Lines changed: 146 additions & 8 deletions
@@ -19,8 +19,8 @@ The WSI module provides comprehensive support for medical imaging files, particu
 **CLI Commands (`_cli.py`):**
 
 - `wsi inspect` - Display WSI file metadata and properties
-- `wsi dicom-inspect` - Inspect DICOM-specific metadata
-- `wsi dicom-geojson-import` - Import GeoJSON annotations to DICOM
+- `wsi dicom inspect` - Inspect DICOM-specific metadata
+- `wsi dicom geojson_import` - Import GeoJSON annotations to DICOM
 
 **GUI Component (`_gui.py`):**
 
@@ -210,6 +210,138 @@ def get_tile(
     return tile
 ```
 
+
+### DICOM WSI File Filtering
+
+**Multi-File DICOM Pyramid Selection (`_utils.select_dicom_files()`):**
+
+The WSI module automatically handles multi-file DICOM pyramids (whole slide images stored across multiple DICOM instances) by selecting only the highest resolution file from each pyramid. This prevents redundant processing since OpenSlide can automatically find related pyramid files in the same directory.
+
+**Implementation Location:**
+
+The DICOM file selection logic is implemented in `_utils.py` as `select_dicom_files()`. This function **only depends on pydicom** (not highdicom), making it compatible with Python 3.14+ where highdicom is not available.
+
+**Service Integration (`Service.get_wsi_files_to_process()`):**
+```python
+from aignostics.wsi import Service
+from pathlib import Path
+
+# Get filtered DICOM files
+files = Service.get_wsi_files_to_process(
+    path=Path("/data/dicoms"),
+    extension=".dcm"
+)
+# Returns only highest resolution WSI files
+
+# For non-DICOM formats, returns all files
+tiff_files = Service.get_wsi_files_to_process(
+    path=Path("/data/slides"),
+    extension=".tiff"
+)
+# Returns all .tiff files (no filtering)
+```
+
+**Direct Usage (Advanced):**
+```python
+from aignostics.wsi._utils import select_dicom_files
+from pathlib import Path
+
+# Directly filter DICOM files (used internally by Service)
+dicom_files = select_dicom_files(Path("/data/dicoms"))
+# Returns only highest resolution WSI files
+```
+
+**Filtering Strategy:**
+```python
+def select_dicom_files(path: Path) -> list[Path]:
+    """Select WSI files only, excluding auxiliary and redundant files.
+
+    Filtering Strategy:
+    1. SOPClassUID filtering - Only process VL Whole Slide Microscopy Image Storage
+       - Include: 1.2.840.10008.5.1.4.1.1.77.1.6 (VL WSI)
+       - Exclude: 1.2.840.10008.5.1.4.1.1.66.4 (Segmentation Storage)
+       - Exclude: Other non-WSI DICOM types
+
+    2. ImageType filtering - Exclude auxiliary images
+       - Keep: VOLUME images only
+       - Exclude: THUMBNAIL, LABEL, OVERVIEW, MACRO, ANNOTATION, LOCALIZER
+
+    3. PyramidUID grouping - Group multi-file pyramids
+       - Files with same PyramidUID are part of one logical WSI
+       - Files without PyramidUID are treated as standalone WSIs
+
+    4. Resolution selection - Keep highest resolution per pyramid
+       - Based on TotalPixelMatrixRows × TotalPixelMatrixColumns
+       - Excludes all lower resolution levels
+
+    Reference: https://dicom.nema.org/medical/dicom/current/output/chtml/part03/chapter_7.html
+    """
+```
+
+**Key Behaviors:**
+
+- **SOPClassUID validation**: Only processes VL Whole Slide Microscopy Image Storage files (1.2.840.10008.5.1.4.1.1.77.1.6)
+- **Non-WSI exclusion**: Automatically excludes segmentations (1.2.840.10008.5.1.4.1.1.66.4), annotations, and other DICOM object types
+- **ImageType filtering**: Excludes THUMBNAIL, LABEL, OVERVIEW, MACRO, ANNOTATION, and LOCALIZER image types
+- **PyramidUID grouping**: Groups files by PyramidUID (DICOM tag identifying multi-resolution pyramids)
+- **Resolution selection**: For each pyramid, keeps only the file with largest TotalPixelMatrixRows × TotalPixelMatrixColumns
+- **Standalone handling**: Files without PyramidUID are treated as standalone WSI images and preserved
+- **Graceful degradation**: Files with missing attributes are logged and treated as standalone (not excluded)
+- **Debug logging**: Excluded files are logged at DEBUG level with pyramid/exclusion details
+
+**DICOM WSI Structure:**
+
+In the DICOM Whole Slide Imaging standard:
+- **PyramidUID**: Uniquely identifies a single multi-resolution pyramid that may span multiple files
+- **SeriesInstanceUID**: Groups related images (may include multiple pyramids, thumbnails, labels)
+- **TotalPixelMatrixRows/Columns**: Represents full image dimensions at the highest resolution level
+
+**Example Scenario:**
+```
+Input Directory:
+├── pyramid_level_0.dcm (10000×10000 px, PyramidUID: ABC123) ← KEPT
+├── pyramid_level_1.dcm (5000×5000 px, PyramidUID: ABC123) ← EXCLUDED
+├── pyramid_level_2.dcm (2500×2500 px, PyramidUID: ABC123) ← EXCLUDED
+├── thumbnail.dcm (256×256 px, PyramidUID: ABC123, ImageType: THUMBNAIL) ← EXCLUDED
+├── segmentation.dcm (10000×10000 px, SOPClassUID: Segmentation) ← EXCLUDED
+└── standalone.dcm (8000×8000 px, No PyramidUID) ← KEPT
+
+Result: Only pyramid_level_0.dcm and standalone.dcm are processed
+```
+
+**Error Handling:**
+
+- Files with missing SOPClassUID are logged as warnings and excluded (malformed DICOM)
+- Files with PyramidUID but missing TotalPixelMatrix* attributes are treated as standalone
+- Files that cannot be read by pydicom are logged at DEBUG level and skipped
+- AttributeError and general exceptions are caught to prevent processing pipeline failure
+
+**Integration with Application Module:**
+
+The Application module uses this filtering automatically when generating metadata:
+```python
+# In Application Service
+from aignostics.wsi import Service as WSIService
+
+# Filtering happens automatically for DICOM files
+files = WSIService.get_wsi_files_to_process(source_directory, ".dcm")
+for file_path in files:
+    # Only highest resolution WSI files are processed
+    metadata = WSIService.get_metadata(file_path)
+```
+
+**Module Architecture:**
+
+The DICOM file selection functionality is organized as follows:
+- **`_utils.py`**: Contains `select_dicom_files()` and `_find_highest_resolution_files()` helper
+  - Only depends on `pydicom`, `pathlib`, `collections.defaultdict`, and `loguru`
+  - Compatible with Python 3.14+ (no highdicom dependency)
+- **`_service.py`**: Uses `select_dicom_files()` in `get_wsi_files_to_process()`
+- **`_pydicom_handler.py`**: Uses `select_dicom_files()` for metadata extraction with `wsi_only=True`
+  - This module still requires highdicom for annotation/measurement features
+  - Only the CLI commands that need highdicom (geojson import, detailed inspection) use PydicomHandler
+
+
 ## Usage Patterns
 
 ### Basic WSI Operations
@@ -264,13 +396,10 @@ for wsi in wsi_files:
 aignostics wsi inspect slide.svs
 
 # Inspect DICOM metadata
-aignostics wsi dicom-inspect scan.dcm
+aignostics wsi dicom inspect scan.dcm
 
 # Import GeoJSON annotations
-aignostics wsi dicom-geojson-import \
-  --dicom-file scan.dcm \
-  --geojson-file annotations.json \
-  --output annotated.dcm
+aignostics wsi dicom geojson_import scan.dcm annotations.json
 ```
 
 ## Dependencies & Integration
@@ -285,8 +414,17 @@ aignostics wsi dicom-geojson-import \
 
 - `openslide-python` - Core WSI reading functionality
 - `Pillow` - Image processing and thumbnail generation
-- `pydicom` - DICOM file handling
+- `pydicom` - DICOM file handling (required for basic DICOM WSI operations)
 - `numpy` - Array manipulation for pixel data
+- `highdicom` - DICOM annotation/measurement features (optional, not available on Python 3.14+)
+
+**Python 3.14+ Compatibility:**
+
+The core WSI functionality (thumbnail generation, metadata extraction, DICOM file selection) works on Python 3.14+ without highdicom. Only the following CLI commands require highdicom and are unavailable on Python 3.14+:
+- `aignostics wsi dicom geojson_import` - Import GeoJSON to DICOM annotations
+- Detailed annotation/measurement inspection features
+
+The DICOM file selection logic (`select_dicom_files()`) works on all Python versions since it only depends on `pydicom`.
 
 ### Format Support Matrix
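
The four-step filtering strategy documented in this file can be illustrated without reading any DICOM data. The following is a rough sketch only, not the implementation of `select_dicom_files()`: the `Header` dataclass and the `select_files` function are hypothetical stand-ins that operate on already-extracted attributes, and the sketch assumes that excluding the listed auxiliary `ImageType` values is equivalent to keeping `VOLUME` images.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

VL_WSI_SOP_CLASS_UID = "1.2.840.10008.5.1.4.1.1.77.1.6"
AUXILIARY_IMAGE_TYPES = {"THUMBNAIL", "LABEL", "OVERVIEW", "MACRO", "ANNOTATION", "LOCALIZER"}


@dataclass(frozen=True)
class Header:
    """Hypothetical record of the attributes the filter looks at."""

    name: str
    sop_class_uid: str | None
    image_type: tuple[str, ...] = ()
    pyramid_uid: str | None = None
    rows: int | None = None  # TotalPixelMatrixRows
    cols: int | None = None  # TotalPixelMatrixColumns


def select_files(headers: list[Header]) -> list[str]:
    """Return the names of files kept after the four filtering steps."""
    standalone: list[str] = []
    pyramids: dict[str, list[Header]] = defaultdict(list)
    for h in headers:
        # Step 1: keep only VL Whole Slide Microscopy Image Storage.
        if h.sop_class_uid != VL_WSI_SOP_CLASS_UID:
            continue
        # Step 2: drop auxiliary images such as thumbnails and labels.
        if AUXILIARY_IMAGE_TYPES.intersection(h.image_type):
            continue
        # Step 3: group by PyramidUID; incomplete headers count as standalone.
        if h.pyramid_uid is None or h.rows is None or h.cols is None:
            standalone.append(h.name)
        else:
            pyramids[h.pyramid_uid].append(h)
    # Step 4: keep the largest pixel matrix per pyramid.
    best = [max(group, key=lambda h: h.rows * h.cols).name for group in pyramids.values()]
    return sorted(best + standalone)


if __name__ == "__main__":
    volume = ("ORIGINAL", "PRIMARY", "VOLUME", "NONE")
    scenario = [
        Header("pyramid_level_0.dcm", VL_WSI_SOP_CLASS_UID, volume, "ABC123", 10000, 10000),
        Header("pyramid_level_1.dcm", VL_WSI_SOP_CLASS_UID, volume, "ABC123", 5000, 5000),
        Header("thumbnail.dcm", VL_WSI_SOP_CLASS_UID, ("DERIVED", "PRIMARY", "THUMBNAIL"), "ABC123", 256, 256),
        Header("segmentation.dcm", "1.2.840.10008.5.1.4.1.1.66.4"),
        Header("standalone.dcm", VL_WSI_SOP_CLASS_UID, volume, None, 8000, 8000),
    ]
    print(select_files(scenario))
```

Grouping first and taking a per-group maximum keeps the selection independent of file order in the directory, which is one plausible reason to organise the logic this way.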

src/aignostics/wsi/_cli.py

Lines changed: 2 additions & 1 deletion
@@ -132,6 +132,7 @@ def dicom_inspect(
     ],
     verbose: Annotated[bool, typer.Option(help="Verbose output")] = False,
     summary: Annotated[bool, typer.Option(help="Show only summary information")] = False,
+    wsi_only: Annotated[bool, typer.Option(help="Filter to WSI files only")] = False,
 ) -> None: # pylint: disable=W0613
     """Inspect DICOM files at any hierarchy level."""
     if not _check_highdicom_available():
@@ -142,7 +143,7 @@ def dicom_inspect(
 
     try:
         with PydicomHandler.from_file(str(path)) as handler:
-            metadata = handler.get_metadata(verbose)
+            metadata = handler.get_metadata(verbose, wsi_only)
 
             if metadata["type"] == "empty":
                 console.print("[bold red]No DICOM files found in the specified path.[/bold red]")
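
The new `wsi_only` option is off by default, so existing invocations keep their behaviour. How the flag is consumed inside `PydicomHandler` is not shown in this diff; the sketch below only illustrates the general pattern of an opt-in filter, using a hypothetical `scan` function over plain dictionaries rather than real DICOM datasets.

```python
from collections.abc import Iterable

VL_WSI_SOP_CLASS_UID = "1.2.840.10008.5.1.4.1.1.77.1.6"


def scan(records: Iterable[dict[str, str]], wsi_only: bool = False) -> list[str]:
    """Return file names, optionally restricted to VL WSI instances.

    Hypothetical helper: each record is a dict with a "name" key and an
    optional "SOPClassUID" key standing in for a parsed DICOM header.
    """
    names = []
    for record in records:
        # The filter applies only when the caller opts in.
        if wsi_only and record.get("SOPClassUID") != VL_WSI_SOP_CLASS_UID:
            continue
        names.append(record["name"])
    return names


if __name__ == "__main__":
    records = [
        {"name": "slide.dcm", "SOPClassUID": VL_WSI_SOP_CLASS_UID},
        {"name": "seg.dcm", "SOPClassUID": "1.2.840.10008.5.1.4.1.1.66.4"},
    ]
    print(scan(records))
    print(scan(records, wsi_only=True))
```

Defaulting the flag to `False` means a plain `aignostics wsi dicom inspect` still reports every DICOM object it finds, while callers that only care about slides can opt in.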
