
Commit 37fc9d3

Merge branch SeFlow bib into DeFlow_public
2 parents 1bff622 + 2142663

6 files changed: 119 additions & 7 deletions


README.md (8 additions & 5 deletions)

````diff
@@ -90,8 +90,8 @@ unzip demo_data.zip -p /home/kin/data/av2
 
 #### Prepare raw data
 
-Checking more information (step for downloading raw data, storage size, #frame etc) in [dataprocess/README.md](dataprocess/README.md). Extract all data to unified h5 format.
-[Runtime: Normally need 10 mins finished run following commands totally in my desktop, 45 mins for the cluster I used]
+Check more information (steps for downloading raw data, storage size, #frames, etc.) in [dataprocess/README.md](dataprocess/README.md). Extract all data to the unified `.h5` format.
+[Runtime: normally about 45 minutes to run all the following commands in the setup described in our paper.]
 ```bash
 python dataprocess/extract_av2.py --av2_type sensor --data_mode train --argo_dir /home/kin/data/av2 --output_dir /home/kin/data/av2/preprocess_v2
 python dataprocess/extract_av2.py --av2_type sensor --data_mode val --mask_dir /home/kin/data/av2/3d_scene_flow
@@ -185,11 +185,14 @@ https://github.com/KTH-RPL/DeFlow/assets/35365764/9b265d56-06a9-4300-899c-96047a
 pages={2105-2111},
 doi={10.1109/ICRA57147.2024.10610278}
 }
-@article{zhang2024seflow,
+@inproceedings{zhang2024seflow,
 author={Zhang, Qingwen and Yang, Yi and Li, Peizheng and Andersson, Olov and Jensfelt, Patric},
 title={{SeFlow}: A Self-Supervised Scene Flow Method in Autonomous Driving},
-journal={arXiv preprint arXiv:2407.01702},
-year={2024}
+booktitle={European Conference on Computer Vision (ECCV)},
+year={2024},
+pages={353--369},
+organization={Springer},
+doi={10.1007/978-3-031-73232-4_20},
 }
 ```
 
````
assets/cuda/chamfer3D/README.md (70 additions & 0 deletions)

````diff
@@ -75,6 +75,76 @@ Chamfer Distance Cal time: 1.814 ms
 loss: tensor(0.1710, device='cuda:0', grad_fn=<AddBackward0>)
 ```
 
+## Misc
+
+### Note for CUDA ChamferDis
+
+I wrote this two months ago and can no longer follow it; the problem was that the result was always off by about 0.0003 in precision (I am a bit obsessive about precision).
+At first I assumed I had a bug, but it turned out that with this kind of block-level parallelization, different thread block sizes change the order of CUDA floating-point operations, so a small precision gap is expected. If that bothers you, you can use the pytorch3d version instead (about 4x slower, from 15 ms to 80 ms).
+
+To restate how shared memory is used here:
+1. Each point first gets its global index via `int tid = blockIdx.x * blockDim.x + threadIdx.x;`. Note that every point is handled independently, because finding the nearest pc1 neighbor of a pc0 point does not depend on any other pc0 point.
+2. Inside the kernel we declare a `__shared__` buffer for pc1; since shared memory is limited, it holds only THREADS_PER_BLOCK points at a time.
+3. Loading those THREADS_PER_BLOCK points is also done cooperatively by the threads; before comparing distances we call `__syncthreads();` to make sure all THREADS_PER_BLOCK pc1 points have arrived.
+4. Next, we compare against the `num_elems` points of this chunk and update the running best.
+5. Finally, the per-point best is written to the global `result`.
+
+Note that this aggressive parallelization slightly affects precision. If you are curious, you can tune `#define THREADS_PER_BLOCK 256`: different threads-per-block settings change the result (the effect is small: the ground truth is 0.1710, while the CUDA result lands between 0.1711 and 0.1713).
+
+The following explanation is from ChatGPT:
+One reason for the precision difference may be that the order of floating-point operations changes with the thread block size. Since floating-point addition is not associative (i.e. (a + b) + c may not equal a + (b + c)), changing the order of operations can cause slight differences in the result.
+
+This kind of precision variation is very common in GPU computing, especially with large datasets and many floating-point operations. Eliminating it entirely is very hard, because even tiny implementation changes (a different thread block size, a restructured loop, or even different GPU hardware or CUDA versions) can slightly change the floating-point operation order.
+
+If you need reproducible results, consider the following:
+
+1. Fixed thread block size: pick one thread block size and always use it.
+
+2. Double precision: use `double` instead of `float` for higher precision, at the cost of more memory and possibly lower performance.
+
+3. Numerically stable algorithms: prefer numerically stable algorithms, although they can be complex and slower on the GPU.
+
+4. Less parallelism: reduce the degree of parallelism to reduce differences caused by thread execution order, usually at the cost of performance.
+
+The shared-memory copy part of the code:
+```cpp
+
+for (int i = 0; i < pc1_n; i += THREADS_PER_BLOCK) {
+    // Copy a block of pc1 to shared memory
+    int pc1_idx = i + threadIdx.x;
+    if (pc1_idx < pc1_n) {
+        shared_pc1[threadIdx.x * 3 + 0] = pc1_xyz[pc1_idx * 3 + 0];
+        shared_pc1[threadIdx.x * 3 + 1] = pc1_xyz[pc1_idx * 3 + 1];
+        shared_pc1[threadIdx.x * 3 + 2] = pc1_xyz[pc1_idx * 3 + 2];
+    }
+
+    __syncthreads();
+
+    // Compute the distance between pc0[tid] and the points in shared_pc1
+    // NOTE(Qingwen): since after two months I forgot what I did here, I write some notes for future me
+    // 0. One reason for the difference in precision may be due to the changing order of floating point
+    //    operations at different thread block sizes. But I think it's fine to lose 0.0001 precision for a 4x speedup.
+    // 1. Since we use shared memory to store pc1, every BLOCK gets a new shared_pc1 starting from 0.
+    // 2. We use THREADS_PER_BLOCK to loop over pc1, so we need to check if the last block is not full.
+    // 3. Based on the CUDA documentation, this second __syncthreads() is not necessary here, but we keep it for safety.
+    // 4. After one pass, we go to the next block of pc1 and find the best in that batch.
+
+    int num_elems = min(THREADS_PER_BLOCK, pc1_n - i);
+    for (int j = 0; j < num_elems; j++) {
+        float x1 = shared_pc1[j * 3 + 0];
+        float y1 = shared_pc1[j * 3 + 1];
+        float z1 = shared_pc1[j * 3 + 2];
+        float d = (x1 - x0) * (x1 - x0) + (y1 - y0) * (y1 - y0) + (z1 - z0) * (z1 - z0);
+        if (d < best) {
+            best = d;
+            best_i = j + i;
+        }
+    }
+    __syncthreads();
+}
+```
 
 ## Other issues
 In cluster when build cuda things, you may occur problem:
````
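The non-associativity described in the note above is easy to reproduce even on the CPU; this minimal Python sketch is purely illustrative and independent of the CUDA kernel:

```python
# Floating-point addition is not associative: (a + b) + c can differ from
# a + (b + c). This is why changing THREADS_PER_BLOCK (and hence the order
# in which partial results are combined) shifts the loss in the last digits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
print(left == right)  # False

# Accumulating many terms in different orders drifts the same way:
print(sum([0.1] * 10) == 1.0)  # False
```

The GPU case is the same effect multiplied across thousands of threads and reduction orders, which is why 0.1710 vs. 0.1711-0.1713 is expected behavior rather than a bug.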

dataprocess/README.md (7 additions & 0 deletions)

````diff
@@ -12,6 +12,13 @@ We've updated the process dataset for:
 - [x] Waymo: check [here](#waymo-dataset). The process script was involved from [SeFlow](https://github.com/KTH-RPL/SeFlow).
 - [ ] nuScenes: done coding, public after review. Will be involved later by another paper.
 
+If you want to use all of the datasets above, there is a dedicated processing environment in [envprocess.yml](../envprocess.yml) that installs all the necessary packages, since the Waymo package has a different configuration and conflicts with the main environment. Set it up with the following commands:
+
+```bash
+conda env create -f envprocess.yml
+conda activate dataprocess
+```
+
 ## Download
 
 ### Argoverse 2.0
````

dataprocess/extract_av2.py (4 additions & 2 deletions)

```diff
@@ -208,6 +208,8 @@ def create_group_data(group, pc, gm, pose, flow_0to1=None, flow_valid=None, flow
                        for file in os.listdir(data_dir / log_id / "sensors/lidar")
                        if file.endswith('.feather')])
 
+    gt_flow_flag = False if not (data_dir / log_id / "annotations.feather").exists() else True
+
     # if n is not None:
     #     iter_bar = tqdm(zip(timestamps, timestamps[1:]), leave=False,
     #                     total=len(timestamps) - 1, position=n,
@@ -222,7 +224,7 @@ def create_group_data(group, pc, gm, pose, flow_0to1=None, flow_valid=None, flow
         if pc0.shape[0] < 256:
             print(f'{log_id}/{ts0} has less than 256 points, skip this scenarios. Please check the data if needed.')
             break
-        if cnt == len(timestamps) - 1:
+        if cnt == len(timestamps) - 1 or not gt_flow_flag:
             create_group_data(group, pc0, is_ground_0.astype(np.bool_), pose0.transform_matrix.astype(np.float32))
         else:
             ts1 = timestamps[cnt + 1]
@@ -269,7 +271,7 @@ def main(
     argo_dir: str = "/home/kin/data/av2",
     output_dir: str ="/home/kin/data/av2/preprocess",
     av2_type: str = "sensor",
-    data_mode: str = "test",
+    data_mode: str = "val",
     mask_dir: str = "/home/kin/data/av2/3d_scene_flow",
     nproc: int = (multiprocessing.cpu_count() - 1),
     only_index: bool = False,
```
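The `gt_flow_flag` added above just records whether a log ships ground-truth annotations, so logs without `annotations.feather` fall back to storing pose and ground data only. A minimal standalone sketch of that check (the directory layout is illustrative and `has_gt_flow` is a hypothetical helper, not a function from this repo); note that `False if not p.exists() else True` reduces to `p.exists()`:

```python
from pathlib import Path
import tempfile

def has_gt_flow(log_dir: Path) -> bool:
    # Same condition as the commit's
    # `False if not (data_dir / log_id / "annotations.feather").exists() else True`,
    # written directly: ground-truth flow is available iff the file exists.
    return (log_dir / "annotations.feather").exists()

with tempfile.TemporaryDirectory() as d:
    log_dir = Path(d)
    print(has_gt_flow(log_dir))   # False: no annotations, store pose/ground only
    (log_dir / "annotations.feather").touch()
    print(has_gt_flow(log_dir))   # True: flow between consecutive frames can be stored
```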

environment.yaml (2 additions & 0 deletions)

```diff
@@ -27,10 +27,12 @@ dependencies:
   - open3d==0.18.0
   - dztimer
   - av2==0.2.1
+  - dufomap==1.0.0
 
 # Reason about the version fixed:
 # setuptools==68.5.1: https://github.com/aws-neuron/aws-neuron-sdk/issues/893
 # mkl==2024.0.0: https://github.com/pytorch/pytorch/issues/123097#issue-2218541307
 # av2==0.2.1: in case other version deleted some functions.
 # lightning==2.0.1: https://stackoverflow.com/questions/76647518/how-to-fix-error-cannot-import-name-modelmetaclass-from-pydantic-main
 # open3d==0.18.0: because 0.17.0 have bug on set the view json file
+# dufomap==1.0.0: in case a later update is not compatible with the code.
```

envprocess.yaml (28 additions & 0 deletions)

```diff
@@ -0,0 +1,28 @@
+name: dataprocess
+channels:
+  - conda-forge
+  - pytorch
+dependencies:
+  - python=3.8
+  - pytorch::pytorch=2.0.0
+  - pytorch::torchvision
+  - numba
+  - numpy==1.22
+  - pandas
+  - pip
+  - scipy
+  - tqdm
+  - scikit-learn
+  - fire
+  - pip:
+    - nuscenes-devkit
+    - av2==0.2.1
+    - waymo-open-dataset-tf-2.11.0==1.5.0
+    - open3d==0.16.0
+    - linefit
+    - dztimer
+    - dufomap==1.0.0
+    - evalai
+
+# Reason about the version fixed:
+# numpy==1.22: package conflicts; need numpy >= 1.22
```
