# Stage Advantage Pipeline

This module implements a pipeline for training an **Advantage Estimator** and using it in **Advantage-Weighted Behavior Cloning (AWBC)**.

## Pipeline Overview

```
┌──────────────────────────────────────────────────────────────────────────┐
│ Stage 0: GT Labeling (annotation/gt_labeling.sh + gt_label.py)           │
│ Compute advantage (from progress or from Stage 2 output) → task_index    │
├──────────────────────────────────────────────────────────────────────────┤
│ Stage 1: Train Advantage Estimator (annotation/train_estimator.sh)       │
│ Fine-tune pi0 model to predict advantage from observations               │
├──────────────────────────────────────────────────────────────────────────┤
│ Stage 2: Advantage Estimation on New Data (annotation/eval.py)           │
│ Use trained estimator → parquets with data_PI06_* / data_KAI0_*          │
├──────────────────────────────────────────────────────────────────────────┤
│ Stage 3: AWBC Training (scripts/train.py pi05_*_awbc)                    │
│ Train policy with advantage-weighted behavior cloning (prompt_from_task) │
└──────────────────────────────────────────────────────────────────────────┘
```

**End-to-end order for AWBC:** (1) Stage 0 on data with `progress` (optional, to prepare Stage 1 training data). (2) Stage 1: train the estimator. (3) Stage 2: run eval on your dataset so it gains `data_PI06_100000/` or `data_KAI0_100000/` with advantage columns. (4) Run Stage 0 again with `--advantage-source absolute_advantage` on that dataset (e.g. via `gt_labeling.sh` with `DATA_PATH` set to the repo you ran eval on, and source subdirs `data_PI06_100000` / `data_KAI0_100000`). (5) Point the AWBC config's `repo_id` at the resulting advantage-labeled directory and run Stage 3 training.

---

## Stage 0: GT Data Labeling

**Goal**: Compute advantage values (from `progress` or from Stage 2's `absolute_advantage`) and label each frame with a discretized `task_index`; write `meta/tasks.jsonl` (prompt strings per `task_index`).

**Script**: `annotation/gt_labeling.sh` (calls `annotation/gt_label.py`)

**For AWBC:** Run Stage 2 (eval) first so the dataset has `data_PI06_100000/` or `data_KAI0_100000/` with advantage columns. Then run Stage 0 with `--advantage-source absolute_advantage` on that output (e.g. set `gt_labeling.sh`'s `DATA_PATH` to the eval repo and use source subdirs `data_PI06_100000` / `data_KAI0_100000`; the script copies them into the target's `data/` and runs `gt_label.py`).

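Before re-running Stage 0, it can help to confirm the eval output actually carries the advantage column. A minimal sketch in pandas (the `check_stage2_output` helper and the path in the usage comment are hypothetical; only the `absolute_advantage` column name comes from this pipeline):

```python
import pandas as pd

def check_stage2_output(df: pd.DataFrame) -> None:
    """Sanity-check that a Stage 2 parquet carries the advantage column
    Stage 0 reads with --advantage-source absolute_advantage.
    (Helper name and check are illustrative, not part of the repo.)
    """
    if "absolute_advantage" not in df.columns:
        raise ValueError(
            "parquet is missing the 'absolute_advantage' column; "
            "run Stage 2 (annotation/eval.py) first"
        )

# Usage (path is illustrative):
# df = pd.read_parquet("data_PI06_100000/episode_000000.parquet")
# check_stage2_output(df)
```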
### How it works

1. **Prepare dataset directory**: Copy/link the source (parquet + videos + meta) into a new working directory with standard LeRobot layout. For AWBC, the source parquets are the Stage 2 output (with `absolute_advantage`).
2. **Compute advantage**: For each frame `i`, the advantage is defined as:
   ```
   advantage[i] = progress[i + chunk_size] - progress[i]
   ```
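The labeling above can be sketched in a few lines of numpy. This is a minimal illustration, not the repo's implementation: the `chunk_size` value, the end-of-episode padding, and the uniform bins over `[-1, 1]` are assumptions here, and `gt_label.py` defines the actual parameters.

```python
import numpy as np

def label_episode(progress, chunk_size=20, n_bins=10):
    """Compute advantage[i] = progress[i + chunk_size] - progress[i]
    and discretize it into a per-frame task_index.

    chunk_size, the terminal padding, and the uniform bins over [-1, 1]
    are illustrative assumptions; see gt_label.py for the real values.
    """
    progress = np.asarray(progress, dtype=np.float64)
    # Pad with the final progress value so frames near the end of the
    # episode compare against the trajectory's terminal progress.
    padded = np.concatenate([progress, np.full(chunk_size, progress[-1])])
    advantage = padded[chunk_size:chunk_size + len(progress)] - progress
    # Discretize into n_bins labels; each label would map to a prompt
    # string in meta/tasks.jsonl.
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    task_index = np.clip(np.digitize(advantage, edges) - 1, 0, n_bins - 1)
    return advantage, task_index
```

For a linearly increasing `progress`, every frame more than `chunk_size` steps from the end gets the same positive advantage, and the last frames shrink toward zero because they compare against the terminal progress value.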

### Before training

1. **Produce the advantage dataset:** Run Stage 2 (eval) on your dataset so it has `data_PI06_100000/` or `data_KAI0_100000/`. Then run Stage 0 (e.g. `gt_labeling.sh`) with `DATA_PATH` set to that repo and source subdirs `data_PI06_100000` / `data_KAI0_100000`; the script outputs a directory with `data/` (parquets with `task_index`), `meta/tasks.jsonl`, and `videos`. Use that directory as the advantage dataset (e.g. copy or link it to `./data/FlattenFold/advantage`).
2. In `config.py`, set **`repo_id`** to that advantage dataset path and **`weight_loader`** to your π₀.5 base checkpoint for the AWBC config(s) you use.
3. **Compute norm stats:**
   `uv run python scripts/compute_norm_states_fast.py --config-name pi05_flatten_fold_awbc`
   (and similarly for `pi05_tee_shirt_sort_awbc` / `pi05_hang_cloth_awbc` if needed).
284 | 288 |