---
# Jekyll 'Front Matter' goes here. Most are set by default, and should NOT be
# overwritten except in special circumstances.
# You should set the date the article was last updated like this:
date: 2026-04-26 # YYYY-MM-DD
# This will be displayed at the bottom of the article
# You should set the article's title:
title: Perception via Thermal Imaging
# The 'title' is automatically displayed at the top of the page
# and used in other parts of the site.
---

In this article, we discuss strategies to implement key steps in a robotic perception pipeline using thermal cameras.
Specifically, we discuss the conditions under which a thermal camera provides more utility than an RGB camera, followed
by implementation details for camera calibration, dense depth estimation, and odometry with thermal cameras.

## Why Thermal Cameras?

Thermal cameras are useful in key situations where standard RGB cameras fail, notably under perceptual degradation
such as smoke and darkness.
Furthermore, unlike LiDAR and RADAR, thermal cameras do not emit any detectable radiation.
If your robot is expected to operate in darkness or smoke-filled areas, thermal cameras let it perceive the
environment in nearly the same way a visual camera would in ideal conditions.

## Why Depth is Hard in Thermal

Depth perception (inferring the 3D structure of a scene) generally relies on texture-rich, high-contrast inputs.
Thermal imagery tends to violate these assumptions:

- **Low Texture**: Stereo matching algorithms depend on local patches with distinctive features. Thermal scenes often
  lack these.
- **High Noise**: Infrared sensors may introduce non-Gaussian noise, which confuses pixel-level correspondence.
- **Limited Resolution**: Consumer-grade thermal cameras are often <640×480, constraining disparity accuracy.
- **Spectral Domain Shift**: Models trained on RGB datasets fail to generalize directly to the thermal domain.

---

## Calibration

Calibration is the process of estimating the internal and external parameters of a camera. The camera intrinsics
typically consist of the following numbers:

- fx, fy - the focal lengths of the camera in the x and y directions **in the camera's frame**, expressed in pixels
- cx, cy (or px, py) - the principal point, i.e., the optical center of the image
- distortion coefficients (2 - 6 numbers depending on the distortion model used)

Additionally, we must also estimate the camera extrinsics, which is the pose of the camera relative to another sensor,
such as the body frame of the robot (commonly defined to coincide with the IMU) or another camera in a multi-camera
system.

- These will be a series of 12 numbers: 9 for the rotation matrix and 3 for the translation vector
- *NOTE*: BE VERY CAREFUL OF COORDINATE FRAMES (see the sketch after this list)
- If using more than one sensor, time synchronization between the sensors will help you.
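
To make the coordinate-frame warning concrete, here is a minimal sketch (frame names and values are hypothetical)
showing how the 12 extrinsic numbers form a homogeneous transform, and how to invert it when a tool reports the pose
in the opposite direction rather than guessing at signs:

```python
import numpy as np

# Hypothetical extrinsics: pose of the camera in the IMU frame (T_imu_cam),
# i.e., p_imu = R @ p_cam + t. 9 rotation numbers plus 3 translation numbers.
R = np.eye(3)                      # replace with your 3x3 rotation matrix
t = np.array([0.1, 0.0, 0.05])     # replace with your translation (meters)

T_imu_cam = np.eye(4)
T_imu_cam[:3, :3] = R
T_imu_cam[:3, 3] = t

# If a tool reports T_cam_imu instead, invert the transform explicitly.
T_cam_imu = np.linalg.inv(T_imu_cam)
```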

- Calibrating thermal cameras is quite similar to calibrating any other RGB sensor. To accomplish this you must have a
  checkerboard pattern, an ArUco grid, or some other calibration target.
  - A square checkerboard is not ideal because it is symmetric, making it hard for the algorithm to determine whether
    the orientation of the board has changed.
  - An ArUco grid gives precise orientation and is the most reliable option, but it is not strictly necessary.

General tips

- For a thermal camera you will need a target with distinct hot and cold edges, e.g., a thermal checkerboard.
- Ensure that the edges on the checkerboard are visible and not fuzzy. If they are, adjust the focus, wipe the lens,
  and check whether any blurring is being applied.
- Ensure the hot parts of the checkerboard are the hottest things in the picture. This makes the checkerboard easier
  to detect.
- Thermal cameras give 16-bit output by default. You will need to convert this to an 8-bit grayscale image (see the
  sketch after this list).
- Other than the checkerboard, the fewer things visible in the image, the better your calibration will be.
- If possible, preprocess your image so that other distracting features are ignored.
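
As an illustration of the 16-bit to 8-bit conversion, here is a minimal sketch using OpenCV and NumPy; the
percentile-based normalization range is an assumed heuristic, not a prescribed value:

```python
import cv2
import numpy as np

def thermal_16_to_8(img16: np.ndarray) -> np.ndarray:
    """Convert a 16-bit thermal image to 8-bit grayscale.

    Clips to a percentile range (an assumed heuristic) so that a few
    extremely hot or cold pixels do not wash out the contrast.
    """
    lo, hi = np.percentile(img16, (1, 99))
    img = np.clip(img16.astype(np.float32), lo, hi)
    img = (img - lo) / max(hi - lo, 1e-6)  # normalize to [0, 1]
    return (img * 255.0).astype(np.uint8)

# Usage: read the 16-bit image unchanged, then convert.
img16 = cv2.imread("thermal_frame.png", cv2.IMREAD_UNCHANGED)
img8 = thermal_16_to_8(img16)
```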

### Camera Intrinsics

- Calibrating thermal camera intrinsics will give you fx, fy, cx, cy, and the respective distortion coefficients.

1. Heat up the checkerboard.
2. Record a rosbag with the necessary topics.
3. Preprocess your images.
4. Run them through OpenCV or Kalibr (a minimal OpenCV sketch follows the example output below). There are plenty of
   good resources online.

Example output from Kalibr:

```text
cam0:
  cam_overlaps: []
  camera_model: pinhole
  distortion_coeffs: [-0.3418843277284295, 0.09554844659447544, 0.0006766728551819399, 0.00013250437150091342]
  distortion_model: radtan
  intrinsics: [404.9842534577856, 405.0992911907136, 313.1521147858522, 237.73982476898445]
  resolution: [640, 512]
  rostopic: /thermal_left/image
```

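For step 4, a minimal OpenCV intrinsics-calibration sketch might look like the following; the board dimensions, square
size, and the `calib/*.png` file list (preprocessed 8-bit frames) are assumptions for illustration:

```python
import glob
import cv2
import numpy as np

# Assumed board geometry: 6x9 inner corners, 40 mm squares.
ROWS, COLS, SQUARE = 6, 9, 0.04

# 3D corner positions on the board plane (z = 0), in meters.
objp = np.zeros((ROWS * COLS, 3), np.float32)
objp[:, :2] = np.mgrid[0:COLS, 0:ROWS].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):  # preprocessed 8-bit frames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (COLS, ROWS))
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("Intrinsics (fx, fy, cx, cy):", K[0, 0], K[1, 1], K[0, 2], K[1, 2])
```
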
### Thermal Camera peculiarities

- Thermal cameras are extremely noisy. There are ways you can reduce this noise.
- **Camera gain calibration:** The gain values on the camera increase or reduce the intensity of the noise in the
  image.
  - Higher gain amplifies the noise, which matters when you are trying to estimate the static noise pattern and remove
    it from the image (see FFC below).

- **Flat Field Correction (FFC)**: FFC is used to remove lens effects in the image, such as vignetting and fixed
  thermal patterns (a rough sketch of the idea follows this list).
  - FFC is carried out by placing a uniform object in front of the camera and taking a picture.
  - The noise patterns and vignetting effects are then estimated and removed from the camera's output.
  - FLIR thermal cameras constantly "click"; this is the camera placing a shutter in front of the sensor, taking a
    picture, and correcting for any noise.
  - The FLIR documentation describes Supplemental FFC (SFFC), which is the user performing FFC manually. It is
    recommended that this be performed with the cameras in their operating conditions.
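
As a rough illustration of the offset-correction idea behind FFC, here is a sketch of one simple scheme (an assumption
for illustration, not FLIR's actual algorithm): average several frames of a uniform target to estimate the fixed
pattern, then subtract its deviation from the mean:

```python
import numpy as np

def estimate_flat_field(uniform_frames: np.ndarray) -> np.ndarray:
    """Estimate the fixed pattern from N frames of a uniform target.

    uniform_frames: (N, H, W) array captured while a flat, uniform
    object (or the shutter) covers the lens.
    """
    pattern = uniform_frames.mean(axis=0)   # per-pixel mean
    return pattern - pattern.mean()         # deviation from a flat image

def apply_ffc(frame: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Subtract the fixed pattern (vignetting + static noise)."""
    return frame.astype(np.float32) - pattern

# Usage with synthetic data, for illustration only.
rng = np.random.default_rng(0)
frames = 1000.0 + rng.normal(0, 2, size=(32, 512, 640)).astype(np.float32)
pattern = estimate_flat_field(frames)
corrected = apply_ffc(frames[0], pattern)
```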

### Camera Extrinsics

- Relative camera pose is necessary to perform depth estimation. Kalibr calls this a camchain.
- Camera-IMU calibration is necessary to perform sensor fusion and integrate the two sensors. This can also be
  estimated from CAD.
- Time-sync is extremely important here because the sensor readings need to be taken at exactly the same time for the
  algorithm to effectively estimate poses.
- While performing extrinsics calibration, ensure that all axes are excited sufficiently (up-down, left-right,
  fwd-back, roll, pitch, yaw). ENSURE that you move slowly enough that there is no motion blur on the calibration
  target, but fast enough to excite the axes.

---

## Our Depth Estimation Pipeline Evolution

### 1. **Stereo Block Matching**

We started with classical stereo techniques. Given left and right images $I_L, I_R$, stereo block matching computes
disparity $d(x, y)$ using a sliding window that minimizes a matching cost (e.g., sum of absolute differences):

$d(x, y) = \arg\min_{d'} \, \mathrm{Cost}(x, y, d')$

In broad strokes, this brute-force approach compares blocks from $I_L$ and $I_R$. For each block it computes a cost
based on pixel-to-pixel similarity (generally a loss between feature descriptors). Once a block match is found, the
disparity is obtained by checking how far each pixel has moved in the x direction.

As you can imagine, this approach is simple and lightweight. However, it depends on many things, such as the noise in
your images and the contrast separation, and it struggles to find accurate matches on textureless, colorless inputs
(like a wall in a thermal image). The algorithm performed better than expected, but we chose not to go ahead with it
(a minimal usage sketch follows).
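
For reference, a minimal OpenCV block-matching sketch on rectified 8-bit images might look like this; the file names
and parameter values are assumptions to tune for your data:

```python
import cv2

# Rectified, 8-bit grayscale stereo pair (see the conversion sketch above).
left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize must be odd.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

# compute() returns fixed-point disparities scaled by 16.
disparity = stereo.compute(left, right).astype("float32") / 16.0
```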

---

### 2. **Monocular Relative Depth with MoGe**

If you are using a single-camera setup, this is called a monocular approach. One issue is that this problem is
ill-posed: for example, objects can be placed at twice the distance and scaled to twice their size to yield the same
image. This scale ambiguity exists in any monocular depth estimation method. Therefore, learning-based models are
employed to "guess" the right depth (based on data-driven priors, such as the typical size of everyday objects like
chairs). One such model is MoGe (Monocular Geometry), which estimates *relative* depth $z'$ from a single image. These
estimates are affine-invariant, meaning we need to apply a scale and a shift to retrieve metric depth:

$z = s \cdot z' + t$

This means the estimates look visually coherent (see the image below on the right), but the ambiguity limits metric 3D
use (e.g., SLAM-based applications). A toy sketch of fitting the scale and shift follows below.

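To illustrate what recovering $s$ and $t$ involves, here is a toy sketch (purely hypothetical, not MoGe's or
MADPose's actual procedure) that fits the affine correction by least squares against a few sparse metric depth
measurements:

```python
import numpy as np

def fit_scale_shift(z_rel: np.ndarray, z_metric: np.ndarray):
    """Solve min over (s, t) of ||s * z_rel + t - z_metric||^2 by least squares."""
    A = np.stack([z_rel, np.ones_like(z_rel)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, z_metric, rcond=None)
    return s, t

# Toy example: relative depths at a few pixels with known metric depths
# (e.g., from a rangefinder); the values here are made up.
z_rel = np.array([0.2, 0.5, 0.9, 1.4])
z_metric = np.array([1.1, 2.0, 3.2, 4.7])
s, t = fit_scale_shift(z_rel, z_metric)
print(f"z = {s:.3f} * z' + {t:.3f}")
```
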
---

### 3. **MADPose Solver for Metric Recovery**

To determine the global scale and shift, we incorporated a stereo system and inferred relative depth from both views.
We then utilized the MADPose solver to find the scale and shift of both relative depth images that make them align,
i.e., both depth maps, once made metric, should describe the same 3D structure. This optimizer also estimates other
properties, such as the extrinsics between the cameras, solving for more unknowns than necessary. Additionally, no
temporal constraint is imposed, even though consecutive timesteps $T$ and $T+1$ mostly observe the same scene. This
meant that the metric depth we recovered kept changing significantly across frames, resulting in point clouds of
different sizes and distances across timesteps. This method, while sound in theory, did not work out very well in
practice.

---

### 4. **Monocular Metric Depth Predictors**

We also tested monocular models trained to output metric depth directly. This is the most ill-posed formulation: the
model inevitably overfits to the camera setup and scene scale of its training data and fails to generalize to other
setups. These models treat depth as a regression problem from a single input $I$:

$z(x, y) = f(I(x, y))$

Thermal's lack of depth cues and color made the problem even harder, and the models performed poorly.

---

### 5. **Stereo Networks Trained on RGB (e.g., MS2, KITTI)**

Alternatively, when a dual-camera setup is used, we call it a stereo approach. This is inherently a much simpler
problem to solve, as you have two rays that intersect at the captured 3D point. I encourage looking at the following
set of videos to understand epipolar geometry and the formulation behind the stereo camera
setup: [Link](https://www.youtube.com/watch?v=6kpBqfgSPRc).

We evaluated multiple pretrained stereo disparity networks. However, there were substantial differences between the
datasets used for pretraining and our data distribution. These models failed to generalize due to:

- Domain mismatch (RGB → thermal)
- Reliance on texture
- Exposure to only outdoor content
- Reduced exposure

---

## Final Approach: FoundationStereo

Our final and most successful solution was [FoundationStereo](https://github.com/NVlabs/FoundationStereo), a foundation
model for depth estimation that generalizes to unseen domains without retraining. It is trained on large-scale synthetic
stereo data and supports robust zero-shot inference.

### Why It Works:

- **Zero-shot Generalization**: No need for thermal-specific fine-tuning.
- **Strong Priors**: Learned over large datasets of scenes with varied geometry and lighting. (These variations helped
  overcome the RGB-to-thermal domain shift and the lack of texture cues.)
- **Robust Matching**: Confidence estimation allows the model to ignore uncertain matches rather than hallucinate.
- **Formulation**: Formulating the problem as dense matching in pixel (disparity) space also served well. This allowed
  generalization to any baseline by constraining the output to the pixel space.

Stereo-rectified thermal image pairs are given to FoundationStereo, which returns clean disparity maps (in image
space). We recover metric depth using the camera intrinsics and the baseline, as sketched after the equation below.
Finally, we can reproject this into 3D space to get consistent point clouds:

$$
z = \frac{f \cdot B}{d}
$$

Where:

- $f$ = focal length,
- $B$ = baseline between the cameras,
- $d$ = disparity at the pixel.
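
A minimal sketch of this recovery step, assuming a disparity map plus the calibrated intrinsics from earlier (the
calibration values and baseline below are placeholders):

```python
import numpy as np

# Placeholder calibration values; use your own camchain results.
fx, fy, cx, cy = 404.98, 405.10, 313.15, 237.74  # pixels
B = 0.12                                         # baseline in meters (assumed)

def disparity_to_pointcloud(disparity: np.ndarray) -> np.ndarray:
    """Back-project a disparity map (H, W) to an (N, 3) point cloud."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0                # ignore invalid/zero disparities
    z = fx * B / disparity[valid]        # z = f * B / d
    x = (u[valid] - cx) * z / fx         # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```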

An example output is given below (thermal preprocessed on the top left, disparity in the middle left, and the metric
pointcloud on the right).

## Lessons Learned

1. **Texture matters**: Thermal's low detail forces the need for models that use global context.
2. **Don’t trust pretrained RGB models**: They often don’t generalize without retraining.
3. **Stereo > Monocular for thermal**: Even noisy stereo is better than ill-posed monocular predictions.
4. **Foundation models are promising**: Large-scale pretrained vision backbones like FoundationStereo are surprisingly
   effective out-of-the-box.

## Conclusion

Recovering depth from thermal imagery is hard, but not impossible. While classical and RGB-trained methods struggled,
modern foundation stereo models overcame the domain gap with minimal effort. Our experience suggests that for any team
facing depth recovery in non-traditional modalities, foundation models are a compelling place to start.

## See Also

- The [Thermal Cameras wiki page](https://roboticsknowledgebase.com/wiki/sensing/thermal-cameras/) goes into more depth
  about how thermal cameras function.