
Commit 0395d66

aayush-fadia, Abhishek-Iyer1, p-amyjiang, ranai-srivastav, and nevalsar authored
Add article on thermal camera perception and vision tasks (#211)
Introduces a new wiki page covering thermal perception, including sensor types, thermal-to-RGB mapping, and common perception tasks like depth estimation and SLAM.

---------

Co-authored-by: Abhishek-Iyer1 <iyer.abhishek18@gmail.com>
Co-authored-by: p-amyjiang <52675002+p-amyjiang@users.noreply.github.com>
Co-authored-by: Amy Jiang <p.amyjiang@gmail.com>
Co-authored-by: ranai-srivastav <ranaisrivastav@gmail.com>
Co-authored-by: Nevin Valsaraj <nevin.valsaraj32@gmail.com>
1 parent ad2ee34 commit 0395d66

5 files changed

Lines changed: 260 additions & 0 deletions

File tree

_data/navigation.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -157,6 +157,8 @@ wiki:
       url: /wiki/sensing/apple-vision-pro/
     - title: Robotics with the Microsoft Hololens2
       url: /wiki/sensing/hololens-101/
+    - title: Perception via Thermal Imaging
+      url: /wiki/sensing/thermal-perception/
   - title: Controls & Actuation
     url: /wiki/actuation/
     children:
```
Two binary image files added (843 KB and 93.1 KB).

wiki/sensing/index.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -71,6 +71,9 @@ This section dives into various sensing modalities such as GPS modules, fiducial
 - **[Thermal Cameras](/wiki/sensing/thermal-cameras/):**
   Examines the use of thermal cameras in robotics, including types of thermal cameras, calibration techniques, and debug tips.
 
+- **[Perception via Thermal Imaging](/wiki/sensing/thermal-perception/):**
+  Discusses strategies to implement key steps in a robotic perception pipeline using thermal cameras, including depth estimation and metric recovery.
+
 - **[Tracking Vehicles Using a Static Traffic Camera](/wiki/sensing/trajectory_extraction_static_camera/):**
   Describes a system for extracting vehicle trajectories using static traffic cameras, incorporating detection, tracking, and homography estimation.
```

wiki/sensing/thermal-perception.md

Lines changed: 255 additions & 0 deletions
---
# Jekyll 'Front Matter' goes here. Most are set by default, and should NOT be
# overwritten except in special circumstances.
# You should set the date the article was last updated like this:
date: 2026-04-26 # YYYY-MM-DD
# This will be displayed at the bottom of the article
# You should set the article's title:
title: Perception via Thermal Imaging
# The 'title' is automatically displayed at the top of the page
# and used in other parts of the site.
---
In this article, we discuss strategies to implement key steps in a robotic perception pipeline using thermal cameras. Specifically, we discuss the conditions under which a thermal camera provides more utility than an RGB camera, followed by implementation details for camera calibration, dense depth estimation, and odometry using thermal cameras.
## Why Thermal Cameras?

Thermal cameras are useful in key situations where normal RGB cameras fail, notably under perceptual degradation such as smoke and darkness. Furthermore, unlike LiDAR and RADAR, thermal cameras do not emit any detectable radiation. If your robot is expected to operate in darkness or smoke-filled areas, thermal cameras are a means for your robot to perceive the environment in nearly the same way as visual cameras would in ideal conditions.
## Why Depth is Hard in Thermal

Depth perception, i.e. inferring the 3D structure of a scene, generally relies on texture-rich, high-contrast inputs. Thermal imagery tends to violate these assumptions:

- **Low Texture**: Stereo matching algorithms depend on local patches with distinctive features. Thermal scenes often lack these.
- **High Noise**: Infrared sensors may introduce non-Gaussian noise, which confuses pixel-level correspondence.
- **Limited Resolution**: Consumer-grade thermal cameras are often below 640×480, constraining disparity accuracy.
- **Spectral Domain Shift**: Models trained on RGB datasets fail to generalize directly to the thermal domain.

_________________________
## Calibration

Calibration is the process by which we estimate the internal and external parameters of a camera. The camera intrinsics matrix usually contains the following numbers:

- fx, fy: the focal lengths of the camera in the x and y directions **in the camera's frame**, in px/distance_unit
- cx, cy (or px, py): the principal point, i.e. the optical center of the image
- distortion coefficients (2 to 6 numbers, depending on the distortion model used)
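For reference, these parameters are conventionally arranged into the standard pinhole intrinsics matrix (distortion coefficients are handled separately by the distortion model):

$$
K = \begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
$$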
Additionally, we must also estimate the camera extrinsics, i.e. the pose of the camera relative to another sensor: typically the IMU (a robot's body frame is often defined to coincide with the IMU frame), or another camera in the case of a multi-camera system.

- This will be a series of 12 numbers: 9 for the rotation matrix and 3 for the translation vector.
- *NOTE*: BE VERY CAREFUL OF COORDINATE FRAMES.
- If using more than one sensor, time synchronization will help you.
- Calibrating thermal cameras is quite similar to calibrating any other RGB sensor. To accomplish this you must have a checkerboard pattern, an ArUco grid, or some other calibration target.
- A square checkerboard is not ideal because it is symmetric, which makes it hard for the algorithm to tell whether the orientation of the board has changed.
- An ArUco grid gives precise orientation and is the most reliable option, but it is not strictly necessary.
General tips:

- For a thermal camera you will need a target with distinct hot and cold edges, e.g. a thermal checkerboard.
- Ensure that the edges on the checkerboard are visible and not fuzzy. If they are, adjust the focus, wipe the lens, and check whether any blurring is being applied.
- Ensure the hot parts of the checkerboard are the hottest things in the picture. This will make the checkerboard easier to detect.
- Thermal cameras give 16-bit output by default. You will need to convert this to an 8-bit grayscale image (see the sketch after this list).
- Other than the checkerboard, the fewer things visible in the image, the better your calibration will be.
- If possible, preprocess your images so that distracting features are suppressed.
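A minimal sketch of the 16-bit to 8-bit conversion, assuming raw frames arrive as NumPy arrays; the percentile clip is one common choice for suppressing outlier pixels, not something mandated by any particular camera SDK:

```python
import numpy as np

def thermal_to_8bit(raw16: np.ndarray) -> np.ndarray:
    """Normalize a 16-bit thermal frame to an 8-bit grayscale image."""
    # Clip to the 1st-99th percentile range to suppress hot/dead pixels.
    lo, hi = np.percentile(raw16, (1, 99))
    clipped = np.clip(raw16.astype(np.float32), lo, hi)
    # Stretch the remaining range onto [0, 255].
    norm = (clipped - lo) / max(hi - lo, 1.0)
    return (norm * 255.0).astype(np.uint8)
```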
### Camera Intrinsics

Calibrating thermal camera intrinsics will give you fx, fy, cx, cy, and the respective distortion coefficients:

1. Heat up the checkerboard.
2. Record a rosbag with the necessary topics.
3. Preprocess your images.
4. Run them through OpenCV or Kalibr. There are plenty of good resources online.
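As a concrete example of step 4, here is a minimal OpenCV sketch; the pattern size, square size, and file paths are placeholders for your own setup:

```python
import glob
import cv2
import numpy as np

pattern = (8, 6)   # inner-corner count of the checkerboard (placeholder)
square = 0.03      # square edge length in meters (placeholder)

# 3D corner positions on the (flat) board, in board coordinates.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts = [], []
for path in glob.glob("thermal_frames/*.png"):  # preprocessed 8-bit frames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        # Refine corner locations to sub-pixel accuracy.
        corners = cv2.cornerSubPix(
            gray, corners, (5, 5), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("K =\n", K)                     # contains fx, fy, cx, cy
print("distortion =", dist.ravel())   # radtan-style coefficients
```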
Example output from Kalibr:

```text
cam0:
  cam_overlaps: []
  camera_model: pinhole
  distortion_coeffs: [-0.3418843277284295, 0.09554844659447544, 0.0006766728551819399, 0.00013250437150091342]
  distortion_model: radtan
  intrinsics: [404.9842534577856, 405.0992911907136, 313.1521147858522, 237.73982476898445]
  resolution: [640, 512]
  rostopic: /thermal_left/image
```
### Thermal Camera Peculiarities

- Thermal cameras are extremely noisy. There are ways you can reduce this noise.
- **Camera gain calibration:** The gain values on the camera are used to reduce or increase the intensity of the noise in the image.
  - Note that higher gain also increases the static noise you are trying to estimate and remove from the image (see FFC below).
- **Flat Field Correction (FFC)**: FFC is used to remove lens effects such as vignetting and fixed thermal patterns from the images.
  - FFC is carried out by placing a uniform object in front of the camera and taking a picture.
  - The noise patterns and vignetting effects are then estimated and removed from the camera's output.
  - FLIR thermal cameras constantly "click"; this is the camera placing a shutter in front of the sensor, taking a picture, and correcting for any noise.
  - The FLIR documentation describes Supplemental FFC (SFFC), in which the user performs FFC manually. It is recommended that this be performed when the cameras are in their operating conditions.
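To make the flat-field idea concrete, here is a minimal software-side sketch, assuming you have averaged several frames of a uniform target into `flat` (FLIR cameras perform this internally with the shutter; this version corrects only per-pixel offsets, not gain):

```python
import numpy as np

def flat_field_correct(raw: np.ndarray, flat: np.ndarray) -> np.ndarray:
    """Subtract the fixed pattern estimated from a uniform-scene frame."""
    # The flat frame's deviation from its own mean captures vignetting
    # and per-pixel offsets; removing it flattens the image.
    fixed_pattern = flat.astype(np.float32) - flat.mean()
    return raw.astype(np.float32) - fixed_pattern
```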
### Camera Extrinsics

- Relative camera pose is necessary to perform depth estimation. Kalibr calls this a camchain.
- Camera-IMU calibration is necessary to perform sensor fusion and integrate both sensors together. This can also be estimated from CAD.
- Time-sync is extremely important here because the sensor readings need to be taken at the exact same time for the algorithm to effectively estimate poses.
- While performing extrinsics calibration, ensure that all axes are excited sufficiently (up-down, left-right, fwd-back, roll, pitch, yaw). ENSURE that you move slowly enough that there is no motion blur on the calibration target, but fast enough to excite the axes.

________
## Our Depth Estimation Pipeline Evolution

### 1. **Stereo Block Matching**

We started with classical stereo techniques. Given left and right images $I_L, I_R$, stereo block matching computes disparity $d(x, y)$ using a sliding window that minimizes a similarity cost (e.g., sum of absolute differences):

$d(x, y) = \arg\min_d \, \mathrm{Cost}(x, y, d)$

In broad strokes, this brute-force approach compares blocks from $I_L$ and $I_R$. For each block it computes a cost based on pixel-to-pixel similarity (generally using a loss between feature descriptors). Finally, once a block match is found, the disparity is obtained by checking how far each pixel has moved in the x direction.

As you can imagine, this approach is simple and lightweight. However, it depends on many things, such as the noise in your images and the contrast separation, and it will struggle to find accurate matches on textureless and colorless inputs (like a wall in a thermal image). The algorithm performed better than expected, but we chose not to go ahead with it.
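For reference, a minimal OpenCV block-matching sketch on a rectified, preprocessed 8-bit pair (file names and parameters are placeholders):

```python
import cv2

# Rectified, 8-bit grayscale thermal pair (placeholder file names).
left = cv2.imread("left_rect_8bit.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect_8bit.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize is the window size.
bm = cv2.StereoBM_create(numDisparities=64, blockSize=15)

# compute() returns fixed-point disparity scaled by 16.
disparity = bm.compute(left, right).astype("float32") / 16.0
```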
---

### 2. **Monocular Relative Depth with MoGe**

If you are using a single-camera setup, this is called a monocular approach. One issue is that this problem is ill-posed. For example, objects can be placed at twice the distance and scaled to twice their size to yield the same image. This scale ambiguity exists in any monocular depth estimation method. Therefore, learning-based models are employed to "guess" the right depth, based on data-driven priors such as the typical size of everyday objects like chairs. One such model is MoGe (Monocular Geometry), which estimates *relative* depth $z'$ from a single image. These estimates are affine-invariant, meaning we need to apply a scale and a shift to retrieve metric depth:

$z = s \cdot z' + t$

This means they look visually coherent (see the right side of the image below), but the ambiguity limits metric 3D use (e.g., SLAM-based applications).

![Relative Depth on Thermal Images](/assets/images/moge-relative-thermal.png)
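For illustration, if a handful of metric depth references were available (e.g., from a hypothetical rangefinder), the scale $s$ and shift $t$ could be recovered with a least-squares fit:

```python
import numpy as np

# z_rel: affine-invariant depths sampled at a few pixels (model output)
# z_ref: metric depths at the same pixels (hypothetical reference sensor)
z_rel = np.array([0.8, 1.1, 1.6, 2.0])
z_ref = np.array([2.1, 2.7, 3.8, 4.6])

# Solve z_ref ≈ s * z_rel + t in the least-squares sense.
A = np.stack([z_rel, np.ones_like(z_rel)], axis=1)
(s, t), *_ = np.linalg.lstsq(A, z_ref, rcond=None)
z_metric = s * z_rel + t  # apply to the full depth map in practice
```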
---

### 3. **MADPose Solver for Metric Recovery**

To determine global scale and shift, we incorporated a stereo system and inferred relative depth from both cameras. We then utilized the MADPose solver to find the scale and shift of both relative depth images that makes them align, i.e. both depth maps, after being made metric, should describe the same 3D structure. This optimizer also estimates other properties such as the extrinsics between the cameras, solving for more unknowns than necessary. Additionally, there is no temporal constraint imposed (you are looking at mostly the same things between timesteps $T$ and $T+1$). This meant that the metric depth we recovered would keep changing significantly across frames, resulting in point clouds of different sizes and distances across timesteps. This method, while sound in theory, did not work out very well in practice.
---

### 4. **Monocular Metric Depth Predictors**

We also tested monocular models trained to output metric depth directly. This is the most ill-posed formulation, as such models tend to overfit to the camera setup and scene scale of their training data, and the approach fails to generalize to other setups. These models treat depth as a regression problem from a single input $I$:

$z(x, y) = f(I(x, y))$

Thermal's lack of depth cues and color made the problem even harder, and the models performed poorly.
---

### 5. **Stereo Networks Trained on RGB (e.g., MS2, KITTI)**

Alternatively, when a dual-camera setup is used, we call it a stereo approach. This is inherently a much simpler problem to solve, as you have two rays that intersect at the captured point. We encourage looking at the following set of videos to understand epipolar geometry and the formulation behind the stereo camera setup: [Link](https://www.youtube.com/watch?v=6kpBqfgSPRc).

We evaluated multiple pretrained stereo disparity networks. However, there were many differences between the datasets used for pretraining and our data distribution. These models failed to generalize due to:

- Domain mismatch (RGB → thermal)
- Texture reliance
- Exposure to only outdoor content
- Reduced exposure
---

## Final Approach: FoundationStereo

Our final and most successful solution was [FoundationStereo](https://github.com/NVlabs/FoundationStereo), a foundation model for depth estimation that generalizes to unseen domains without retraining. It is trained on large-scale synthetic stereo data and supports robust zero-shot inference.

### Why It Works:

- **Zero-shot Generalization**: No need for thermal-specific fine-tuning.
- **Strong Priors**: Learned over large datasets of scenes with varied geometry and lighting. (These variations helped overcome the RGB-to-thermal domain shift and textureless regions.)
- **Robust Matching**: Confidence estimation allows the model to ignore uncertain matches rather than hallucinate.
- **Formulation**: Formulating the problem as a dense disparity-matching problem also served well. This allowed generalization to any baseline by constraining the output to pixel space.

Stereo-rectified thermal image pairs are given to FoundationStereo, which gives us clean disparity maps (in image space). We recover metric depth using the camera intrinsics and the baseline. Finally, we can reproject this into 3D space to get consistent point clouds:

$$
z = \frac{f \cdot B}{d}
$$

Where:

- $f$ = focal length,
- $B$ = baseline between cameras,
- $d$ = disparity at the pixel.
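A minimal NumPy sketch of this back-projection under the pinhole model (variable names are ours, not FoundationStereo's API):

```python
import numpy as np

def disparity_to_pointcloud(disp, fx, fy, cx, cy, B):
    """Back-project a disparity map (pixels) into metric 3D points."""
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disp > 0                     # ignore invalid/zero disparities
    z = fx * B / disp[valid]             # z = f * B / d
    x = (u[valid] - cx) * z / fx         # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) point cloud
```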
An example output is given below (preprocessed thermal on the top left, disparity on the middle left, and the metric point cloud on the right).

![Metric Depth using Foundation Models](/assets/images/foundation-stereo.png)
## Lessons Learned

1. **Texture matters**: Thermal's low detail forces the need for models that use global context.
2. **Don’t trust pretrained RGB models**: They often don’t generalize without retraining.
3. **Stereo > Monocular for thermal**: Even noisy stereo is better than ill-posed monocular predictions.
4. **Foundation models are promising**: Large-scale pretrained vision backbones like FoundationStereo are surprisingly effective out of the box.

## Conclusion

Recovering depth from thermal imagery is hard, but not impossible. While classical and RGB-trained methods struggled, modern foundation stereo models overcame the domain gap with minimal effort. Our experience suggests that for any team facing depth recovery in non-traditional modalities, foundation models are a compelling place to start.

## See Also

- The [Thermal Cameras wiki page](https://roboticsknowledgebase.com/wiki/sensing/thermal-cameras/) goes into more depth about how thermal cameras function.
