
Thread Flare - Ray Debug Container

A Python-based debug container for testing Ray cluster resources, thread spawning, and cgroup limits on K8s variants.
Vibe-coded ♾️ by Claude Sonnet 4 + Windsurf

Container Variants

Thread Flare comes in two variants to suit different deployment needs:

Slim Variant (Recommended)

  • Base: Python 3.10-slim
  • Size: ~400MB
  • Use Case: CPU-only environments, general debugging
  • Ray: 2.37.0 with default components
  • Build: ./build-and-deploy.sh slim
  • Run: ./run-and-capture-logs.sh slim

CUDA Variant (GPU-Enabled)

  • Base: NVIDIA CUDA 12.4.1 + Ubuntu 22.04
  • Size: ~2GB
  • Use Case: GPU-enabled environments, NVIDIA nv-ingest debugging
  • Ray: 2.37.0 with full GPU support
  • Build: ./build-and-deploy.sh cuda
  • Run: ./run-and-capture-logs.sh cuda

Features

  • Comprehensive Ray Testing: Tests Ray 2.37.0+ with cluster_resources, available_resources, nodes() APIs
  • Ray Cgroup Detection: Tests how Ray detects and uses cgroup memory/CPU limits
  • Ray Pipeline Simulation: Simulates nv-ingest Ray pipeline patterns with remote tasks
  • Cgroup v1 & v2 Detection: Comprehensive testing of both cgroup versions (a detection sketch follows this list)
  • Cgroup Limits Analysis: Tests pids.max, memory limits, and CPU limits in both versions
  • Multiprocessing Fork Testing: Tests fork context used by nv-ingest
  • Subprocess Spawning: Tests process group creation patterns
  • Signal Handling: Tests PDEATHSIG and signal availability
  • Thread Limit Testing: Spawns threads until failure to test system limits
  • File Descriptor Limits: Tests FD limits that affect Ray/multiprocessing
  • System Introspection: Uses psutil for CPU and memory information
  • Real-time Logging: Outputs timestamped logs to stdout
  • OpenShift Support: Runs as a pod on OpenShift
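
For reference, the cgroup probing called out above reduces to reading a handful of well-known files. Below is a minimal Python sketch, assuming the standard kernel mount at /sys/fs/cgroup; it is illustrative only, not the actual thread_flare.py code:

# Minimal sketch of cgroup v1/v2 limit probing (illustrative only).
from pathlib import Path

def read_limit(path: str) -> str:
    """Return the file's contents, or a marker if it is absent."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else "<not present>"

# cgroup v2 exposes a unified hierarchy with a cgroup.controllers file.
if Path("/sys/fs/cgroup/cgroup.controllers").exists():
    print("cgroup v2 detected")
    print("pids.max:", read_limit("/sys/fs/cgroup/pids.max"))
    print("memory.max:", read_limit("/sys/fs/cgroup/memory.max"))
    print("cpu.max:", read_limit("/sys/fs/cgroup/cpu.max"))
else:
    print("cgroup v1 (or hybrid) layout")
    print("pids.max:", read_limit("/sys/fs/cgroup/pids/pids.max"))
    print("memory.limit_in_bytes:", read_limit("/sys/fs/cgroup/memory/memory.limit_in_bytes"))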

Quick Start

Slim Variant (Default)

  1. Build the container:

    ./build-and-deploy.sh slim
    # OR: docker build -f Dockerfile.slim -t thread-flare-slim:latest .
  2. Run locally:

    ./run-and-capture-logs.sh slim
    # OR: docker run --rm thread-flare-slim:latest
  3. Deploy to K8s:

    kubectl apply -f pod.yaml  # Uses slim variant by default
    kubectl logs -f pod/thread-flare

CUDA Variant (GPU-enabled)

  1. Build the container:

    ./build-and-deploy.sh cuda
    # OR: docker build -f Dockerfile.cuda -t thread-flare-cuda:latest .
  2. Run locally (requires nvidia-container-toolkit):

    ./run-and-capture-logs.sh cuda
    # OR: docker run --gpus all --rm thread-flare-cuda:latest
  3. Deploy to K8s (requires GPU nodes):

    # Edit pod.yaml to use thread-flare-cuda:latest
    kubectl apply -f pod.yaml
    kubectl logs -f pod/thread-flare
  4. View logs (on OpenShift, use oc instead of kubectl):

    oc logs -f pod/thread-flare

Log Capture and Analysis

Thread Flare includes scripts to automatically capture and analyze detailed test results:

Local Testing with Log Capture

# Run Thread Flare locally and save logs to timestamped files
./run-and-capture-logs.sh [slim|cuda] [--thread-limit N]

  • slim or cuda selects the container variant (default: slim)
  • --thread-limit N limits the number of threads spawned (overrides the env var)

This creates:

  • logs/thread_flare_YYYYMMDD_HHMMSS.log - Complete detailed output
  • logs/thread_flare_YYYYMMDD_HHMMSS_summary.txt - Key metrics summary

Kubernetes Testing with Log Capture

# Deploy to K8s and capture logs
./run-k8s-and-capture-logs.sh [--thread-limit N]

# With custom namespace and timeout
NAMESPACE=my-namespace TIMEOUT=600 ./run-k8s-and-capture-logs.sh --thread-limit 5000

  • --thread-limit N limits the number of threads spawned (overrides the env var)

You can also pass the flag directly when running the Python script:

docker run --rm thread-flare-slim:latest python /workspace/thread_flare.py --thread-limit 5000
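
The flag/env-var precedence could be wired roughly as sketched below. This is illustrative only; the environment variable name THREAD_LIMIT is an assumption for the example, not something this README specifies.

# Sketch of --thread-limit parsing with an env-var fallback.
# THREAD_LIMIT is a hypothetical variable name used for illustration.
import argparse
import os

parser = argparse.ArgumentParser(description="Thread Flare debug tests")
parser.add_argument("--thread-limit", type=int, default=None,
                    help="Max threads to spawn (overrides the env var)")
args = parser.parse_args()

# The CLI flag wins; otherwise fall back to the environment, then unlimited.
env_limit = os.environ.get("THREAD_LIMIT")
thread_limit = args.thread_limit if args.thread_limit is not None else (
    int(env_limit) if env_limit else None
)
print(f"Effective thread limit: {thread_limit or 'unlimited'}")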

This automatically:

  1. Deploys the Thread Flare pod
  2. Waits for the pod to become ready
  3. Captures all output to timestamped log files
  4. Generates a summary with key metrics
  5. Cleans up the pod when done
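
The real automation lives in the shell script, but the flow maps onto a few kubectl calls. Here is a rough Python equivalent, assuming the pod name pod/thread-flare, the pod.yaml spec, and the NAMESPACE/TIMEOUT variables used elsewhere in this README:

# Rough sketch of the run-k8s-and-capture-logs.sh flow (the real
# script is shell; this mirrors its steps via subprocess + kubectl).
import os
import subprocess
from datetime import datetime

ns = os.environ.get("NAMESPACE", "default")
timeout = os.environ.get("TIMEOUT", "300")
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
os.makedirs("logs", exist_ok=True)

subprocess.run(["kubectl", "apply", "-f", "pod.yaml", "-n", ns], check=True)
subprocess.run(["kubectl", "wait", "--for=condition=Ready",
                "pod/thread-flare", "-n", ns, f"--timeout={timeout}s"], check=True)
with open(f"logs/thread_flare_{stamp}.log", "w") as log:
    subprocess.run(["kubectl", "logs", "-f", "pod/thread-flare", "-n", ns],
                   stdout=log, check=True)
subprocess.run(["kubectl", "delete", "pod/thread-flare", "-n", ns], check=True)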

Log Analysis

The summary files extract key information:

  • Python version and system resources
  • Cgroup detection results (v1/v2)
  • Ray resource detection and comparison with system
  • Thread limit testing results
  • Test status and any failures

Example summary snippet:

Python Version: 3.12.11
System Resources: 16 CPU cores, 15.03 GB RAM, 10.72 GB available, 28.7% used
Environment: Docker container, x86_64 architecture
GPU Detection: nvidia-smi found 2 GPUs (RTX 4090, 24GB each) or "Command not found"
=== Cgroup/Process Limits ===
/sys/fs/cgroup/pids/pids.max: 18456
/proc/self/limits (processes): Max processes            4096                 4096                 processes
Cgroup v2: pids.max: 18456, memory.max: unlimited
Ray Resources: 16 CPU, 9.97GB memory, 4.27GB object store
Thread Test: Created 18,391 threads before failure
Warnings: /dev/shm size warning, Ray deprecation warning
Test Status: ✅ All tests completed successfully

Files

  • Dockerfile.slim / Dockerfile.cuda - Container definitions (slim and CUDA variants) with Ray 2.37.0+ and psutil
  • thread_flare.py - Main Thread Flare script that runs all tests
  • pod.yaml - k8s pod specification
  • build-and-deploy.sh - Build and deployment helper script
  • run-and-capture-logs.sh - NEW: Run locally and capture detailed logs to files
  • run-k8s-and-capture-logs.sh - NEW: Deploy to K8s and capture logs

What It Tests

  1. Process Limits: Reads /proc/self/limits and ulimit -u
  2. System Resources: CPU cores, memory usage, and availability via psutil
  3. Environment Detection: Container type (Docker/Podman), Kubernetes environment
  4. GPU Detection: NVIDIA GPU detection via multiple methods:
    • nvidia-smi command output with GPU names, memory, and driver versions
    • /proc/driver/nvidia driver detection
    • GPU device files in /dev/ (nvidia0, nvidia1, etc.)
    • Ray GPU resource detection
  5. Platform Information: Architecture, OS platform, processor type
  6. Cgroup Detection: Both v1 and v2 cgroup mounts, pids.max, memory limits
  7. Ray Resource Detection: Multiple Ray APIs for comprehensive cluster resource detection (see the sketch after this list)
  8. Thread Spawning: Creates threads until system failure to test limits
  9. Multiprocessing: Fork context testing like nv-ingest uses
  10. Subprocess Spawning: Process group management and signal handling
  11. File Descriptor Limits: Tests FD limits that affect Ray/multiprocessing
  12. Ray Pipeline Simulation: Simulates nv-ingest Ray remote task patterns
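
As referenced in item 7, the Ray resource checks boil down to a few public APIs. Below is a minimal sketch of the Ray-vs-system comparison using only documented Ray and psutil calls; it is not the actual test code:

# Sketch of Ray resource detection compared against the psutil view.
import psutil
import ray

ray.init(ignore_reinit_error=True)

cluster = ray.cluster_resources()        # totals Ray believes exist
available = ray.available_resources()    # what is currently schedulable
print(f"Ray CPUs: {cluster.get('CPU')} (available: {available.get('CPU')})")
print(f"Ray memory: {cluster.get('memory', 0) / 1e9:.2f} GB")
print(f"Ray object store: {cluster.get('object_store_memory', 0) / 1e9:.2f} GB")
print(f"Ray nodes: {len(ray.nodes())}")

# psutil reports the host/cgroup view; any gap between these numbers is
# exactly what the cgroup-detection tests are meant to surface.
print(f"psutil CPUs: {psutil.cpu_count()}")
print(f"psutil memory: {psutil.virtual_memory().total / 1e9:.2f} GB")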

Expected Output

The container will output timestamped logs showing:

  • Python version (should be 3.12.x)
  • System process limits and file descriptor limits
  • CPU and memory information via psutil
  • GPU info if available (using -cuda variant)
  • Comprehensive cgroup v1 detection (mounts, pids.max, memory limits)
  • Comprehensive cgroup v2 detection (mounts, pids.max, memory.max, cpu.max)
  • Signal handling capabilities (including PDEATHSIG)
  • Multiprocessing fork context testing
  • Subprocess spawning and process group creation
  • Ray comprehensive resource detection (cluster_resources, available_resources, nodes)
  • Ray vs system resource comparison (memory/CPU detection differences)
  • Ray pipeline simulation with remote tasks
  • Thread creation progress until failure (sketched below)
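
The spawn-until-failure test at the heart of Thread Flare amounts to the following sketch (illustrative; the real script adds timestamped logging and honors --thread-limit):

# Sketch of the spawn-until-failure thread test: park each thread on an
# Event and count how many start before the OS refuses to create more.
import threading

parked = threading.Event()
threads = []
try:
    while True:
        t = threading.Thread(target=parked.wait, daemon=True)
        t.start()                      # raises RuntimeError at the limit
        threads.append(t)
        if len(threads) % 1000 == 0:
            print(f"{len(threads)} threads alive...")
except RuntimeError as exc:            # "can't start new thread"
    print(f"Failed after {len(threads)} threads: {exc}")
finally:
    parked.set()                       # let every parked thread exit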

Troubleshooting

  • Ensure your container registry is accessible from your K8s cluster
  • Check resource limits in the pod specification
  • Verify Ray version compatibility with your environment
