Deep Q-Network implementation for optimal bridge maintenance planning using Markov Decision Process formulation with vectorized parallel training and Noisy Networks for Exploration (ICLR 2018).
Based on Phase 3 (Vectorized DQN) of the dql-maintenance-faster project, plus Noisy Networks.
This project extends Phase 3 (Vectorized DQN) to implement a Markov maintenance policy using DQN with:
- Explicit state transition modeling
- Policy optimization based on Markov Decision Process theory
- Vectorized parallel training (AsyncVectorEnv)
- GPU-accelerated training with Mixed Precision (AMP)
- Noisy Networks for Exploration (ICLR 2018) - eliminates ε-greedy exploration
- 14x Faster Training: AsyncVectorEnv with 4 parallel environments
- Stable Convergence: Prioritized Experience Replay (PER)
- GPU-Accelerated: CUDA support with Mixed Precision Training
- Production-Ready: Validated on 30-year maintenance simulations
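The "14x faster" figure above comes from stepping several environments in parallel with Gymnasium's AsyncVectorEnv. The sketch below shows that pattern in a minimal, runnable form; CartPole-v1 is only a stand-in for the fleet environment wrapper in fleet_environment_gym.py, and the loop body is a simplified placeholder for the DQN interaction loop.

```python
import numpy as np
import gymnasium as gym
from gymnasium.vector import AsyncVectorEnv

def make_env():
    # Stand-in factory: the project builds its env from fleet_environment_gym.py instead.
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    n_envs = 4
    envs = AsyncVectorEnv([make_env for _ in range(n_envs)])
    obs, infos = envs.reset(seed=42)          # obs shape: (n_envs, obs_dim)
    for _ in range(100):
        # Random actions as a placeholder; the agent would instead take argmax_a Q(s, a)
        actions = envs.action_space.sample()  # batched sample, one action per sub-environment
        obs, rewards, terminations, truncations, infos = envs.step(actions)
        # Each (s, a, r, s', done) tuple from all n_envs goes into the shared replay buffer.
    envs.close()
```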
- Parameter-Space Exploration: Factorised Gaussian noise in network weights
- No ε-greedy Needed: Automatic exploration through stochastic policy
- Better Sample Efficiency: Learned exploration strategy
- Based on: Fortunato et al., "Noisy Networks for Exploration" (ICLR 2018)
| Metric | v0.6 (ε-greedy) | v0.7 (Noisy Net) | Improvement |
|---|---|---|---|
| Final Reward (mean) | 1,144.10 | 1,385.51 | +21.1% ✓ |
| Final Reward (std) | 561.61 | 433.97 | -22.7% ✓ |
| Best Reward (MA100) | 1,369.45 | 1,509.88 | +10.3% ✓ |
| Episodes to Best | 3,965 | 3,629 | -336 ep ✓ |
| Training Stability | Moderate | High | ✓ |
Key Findings:
- ✓ 21.1% performance improvement with Noisy Networks
- ✓ 22.7% reduction in reward standard deviation - more stable learning
- ✓ Faster convergence - reaches optimal policy 336 episodes earlier
- ✓ No hyperparameter tuning - automatic exploration without ε-greedy scheduling
Figure: Comprehensive performance comparison between v0.6 (ε-greedy) and v0.7 (Noisy Networks) over 5000 episodes. Top row shows reward and cost progression with moving averages. Middle row displays final performance distribution (boxplot) and learning progress (cumulative best). Bottom row presents sample efficiency and detailed statistics table. Noisy Networks (red) consistently outperform ε-greedy (blue) across all metrics.
- Mixed Precision Training (AMP)
- Double DQN - Reduces overestimation bias
- Dueling DQN Architecture
- N-step Learning (n=3)
- Prioritized Experience Replay (PER)
- AsyncVectorEnv (4 parallel)
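The dueling architecture in the list above splits the network into a state-value stream V(s) and an advantage stream A(s,a), recombined as Q(s,a) = V(s) + A(s,a) - mean_a A(s,a). A minimal PyTorch sketch with placeholder layer sizes (v0.7 replaces the Linear heads with NoisyLinear):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    Layer sizes are placeholders, not the exact ones in src/."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value(h)                          # (batch, 1) state value
        a = self.advantage(h)                      # (batch, n_actions) advantages
        # Subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```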
v0.6:
- Markov Maintenance Policy: Explicit MDP formulation
- State Transition Modeling: P(s'|s,a) representation
- Policy Optimization: Bellman optimality with DQN
v0.7:
- Noisy Networks: NoisyLinear layers for exploration
- Factorised Gaussian Noise: Efficient parameter-space noise
- Automatic Exploration: No manual ε-greedy tuning required
- Reset Noise per Episode: Fresh exploration each episode
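The NoisyLinear layer listed above is the core v0.7 change. Below is a minimal PyTorch sketch of factorised Gaussian noise as described by Fortunato et al.; the σ₀ value and attribute names are illustrative rather than copied from src/.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised Gaussian NoisyLinear (Fortunato et al., ICLR 2018).
    Illustrative sketch; attribute names and sigma_0 are not copied from src/."""

    def __init__(self, in_features: int, out_features: int, sigma_0: float = 0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable means and noise scales for weights and biases
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers: resampled by reset_noise(), never trained
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.sigma_0 = sigma_0
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma_0 / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma_0 / math.sqrt(self.in_features))

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        # Factorised-noise transform: f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised noise: one vector per input unit and one per output unit
        eps_in = self._f(torch.randn(self.in_features, device=self.weight_mu.device))
        eps_out = self._f(torch.randn(self.out_features, device=self.weight_mu.device))
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:
            # Evaluation mode: deterministic mean weights (no noise)
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```

Calling reset_noise() at the start of each episode, as in the list above, resamples ε, so the greedy argmax over Q remains exploratory without any ε schedule.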
graph TB
A["AsyncVectorEnv<br/>16 Parallel Environments"] --> B["MarkovFleetEnvironment<br/>100 Bridges: 20 Urban + 80 Rural"]
B --> C["State Space<br/>3 States: Good, Fair, Poor"]
B --> D["Action Space<br/>6 Actions: None, Work31-38"]
C --> E["Transition Matrices<br/>P(s'|s,a)<br/>6 actions × 3×3 matrices"]
D --> E
E --> F["State Transition<br/>s' ~ P(·|s,a)"]
F --> G["Reward: HEALTH_REWARD(s,s')"]
F --> H["Cost: ACTION_COST(a)"]
G --> I["Experience Generation<br/>(s, a, r, s', done, cost)"]
H --> I
style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
style B fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
style E fill:#fff4e1,stroke:#ff9900,stroke-width:2px
style F fill:#fff4e1,stroke:#ff9900,stroke-width:2px
style I fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
Components:
- Environment (Blue): Vectorized parallel execution with 16 environments
- Markov Model (Yellow): Explicit P(s'|s,a) transitions for 6 maintenance actions
- Experience (Green): Tuple generation with rewards and costs
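A compact sketch of the Markov step in the diagram: per-action transition matrices P(s'|s,a), a health reward indexed by (s, s'), and an action cost. All numeric values below are illustrative placeholders, not the calibrated values in markov_fleet_environment.py.

```python
import numpy as np

N_STATES = 3    # 0: Good, 1: Fair, 2: Poor
N_ACTIONS = 6   # 0: do nothing, 1-5: maintenance actions

# One row-stochastic 3x3 matrix per action: P[a, s, s'] = P(s' | s, a).
# Placeholder numbers only; the calibrated matrices live in markov_fleet_environment.py.
P = np.zeros((N_ACTIONS, N_STATES, N_STATES))
P[0] = [[0.85, 0.13, 0.02],    # "do nothing": deterioration drift
        [0.00, 0.80, 0.20],
        [0.00, 0.00, 1.00]]
P[1:] = [[0.95, 0.05, 0.00],   # interventions: improvement drift (shared here for brevity)
         [0.70, 0.28, 0.02],
         [0.30, 0.50, 0.20]]

# Reward for a transition s -> s' (the diagram's HEALTH_REWARD(s, s')); placeholder values
HEALTH_REWARD = np.array([[10.0, -5.0, -25.0],
                          [15.0,  2.0, -20.0],
                          [25.0, 10.0, -30.0]])
# Cost of each action (the diagram's ACTION_COST(a)); placeholder values
ACTION_COST = np.array([0.0, 50.0, 80.0, 120.0, 200.0, 350.0])

def step_bridge(state: int, action: int, rng: np.random.Generator):
    """One Markov step for a single bridge: s' ~ P(.|s,a), then reward and cost."""
    next_state = int(rng.choice(N_STATES, p=P[action, state]))
    return next_state, HEALTH_REWARD[state, next_state], ACTION_COST[action]

rng = np.random.default_rng(0)
print(step_bridge(state=1, action=2, rng=rng))   # -> (next_state, reward, cost)
```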
graph TB
A["Experience<br/>(s, a, r, s', done)"] --> B["Prioritized Replay Buffer<br/>Capacity: 10k<br/>Priority: TD-error"]
B --> C["Sample Mini-batch<br/>Batch size: 64"]
C --> D["N-step Returns<br/>n=3, γ=0.95"]
D --> E["Double DQN Target<br/>Q_target = r + γ Q_target(s', argmax Q_online(s'))"]
E --> F["Dueling Network<br/>with Noisy Layers"]
F --> G["Value Stream V(s)<br/>NoisyLinear(256→128→1)"]
F --> H["Advantage Stream A(s,a)<br/>NoisyLinear(256→128→600)"]
G --> I["Q(s,a) = V(s) + A(s,a) - mean(A)"]
H --> I
I --> J["TD-error<br/>δ = Q_target - Q(s,a)"]
J --> K["MSE Loss<br/>L = (Q_target - Q)²"]
K --> L["AMP Backpropagation<br/>Mixed Precision"]
L --> M["Update Q-network<br/>θ ← θ - α∇L<br/>(includes noise params σ)"]
M --> N["Update Buffer Priorities<br/>priority ← abs(δ)"]
N --> O{"Target Sync?<br/>Every 500 steps"}
O -->|Yes| P["θ_target ← θ_online"]
O -->|No| Q["Continue Training"]
P --> Q
Q --> R["Reset Noise<br/>Sample ε ~ N(0,1)<br/>apply f(ε) = sgn(ε)√|ε|"]
R --> S["Greedy Action Selection<br/>a = argmax Q(s,a)<br/>(exploration via noise)"]
S --> A
style B fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
style E fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
style G fill:#ffccff,stroke:#cc0066,stroke-width:3px
style H fill:#ffccff,stroke:#cc0066,stroke-width:3px
style I fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
style L fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
style P fill:#fff4e1,stroke:#ff9900,stroke-width:2px
style R fill:#ffe6e6,stroke:#ff0000,stroke-width:2px
Components:
- Replay Buffer (Pink): Prioritized experience sampling
- Double DQN (Pink): Reduces Q-value overestimation
- Dueling Architecture with Noisy Layers (Purple): NoisyLinear in value/advantage streams
- AMP Training (Green): GPU-accelerated mixed precision
- Target Network (Yellow): Periodic synchronization for stability
- Noise Reset (Red): Factorised Gaussian noise for automatic exploration
Key Innovation (v0.7): NoisyLinear layers eliminate ε-greedy exploration by injecting learnable stochastic noise directly into network parameters, yielding a 21.1% improvement in final reward and a 22.7% reduction in reward standard deviation relative to ε-greedy.
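A minimal sketch of the target and priority computation shown in the loop above (Double DQN with n-step returns, n=3, γ=0.95, and PER priorities from |δ|); function and tensor names are illustrative and the replay-buffer interface is assumed rather than taken from src/.

```python
import torch

GAMMA, N_STEP = 0.95, 3

@torch.no_grad()
def double_dqn_targets(online_net, target_net, n_step_rewards, next_states, dones):
    """y = r^(n) + gamma^n * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)    # select with online net
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)   # evaluate with target net
    return n_step_rewards + (GAMMA ** N_STEP) * next_q * (1.0 - dones)

def td_error_and_loss(online_net, states, actions, targets, is_weights):
    """TD-error for PER priority updates plus importance-weighted loss."""
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_error = targets - q
    loss = (is_weights * td_error.pow(2)).mean()   # importance-sampling-weighted MSE
    return td_error.detach().abs(), loss           # new priorities <- |delta|
```

In the full loop, this loss is backpropagated under torch.cuda.amp autocast with a GradScaler, the target network is copied from the online network every 500 steps, and noise is resampled before each greedy action selection.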
graph TB
A["Training Loop"] --> B["Collect Episode Data"]
B --> C["Rewards History"]
B --> D["Costs History"]
B --> E["Loss History"]
C --> G["Episode Statistics<br/>Mean reward: +1,385<br/>Best reward: +1,510 (MA100)"]
D --> G
E --> G
G --> H["Save Checkpoint<br/>Every 1000 episodes"]
H --> I["Model State Dict<br/>θ_online, θ_target"]
H --> J["Training History<br/>rewards, costs, losses"]
H --> K["Hyperparameters<br/>lr, ε, γ, etc."]
I --> L["Checkpoint File<br/>.pt format"]
J --> L
K --> L
L --> M["visualize_markov_v06.py"]
L --> N["analyze_markov_v06.py"]
M --> O["Training Curves<br/>6-panel figure"]
M --> P["Learning Progress<br/>Phase analysis"]
N --> Q["Action Analysis<br/>Policy behavior"]
N --> R["Cost Distribution<br/>Mean: $2.59M"]
style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
style G fill:#fff4e1,stroke:#ff9900,stroke-width:2px
style L fill:#f5e1ff,stroke:#9900cc,stroke-width:2px
style O fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
style P fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
style Q fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
style R fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
Components:
- Data Collection (Blue): Real-time metric tracking during training
- Statistics (Yellow): Aggregated performance metrics
- Checkpointing (Purple): Persistent storage of model and history
- Visualization (Green): Post-training analysis and plotting
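A minimal sketch of the checkpointing step in the diagram, bundling model weights, training history, and hyperparameters into a single .pt file; the dictionary keys are illustrative, not necessarily those written by train_markov_fleet.py.

```python
import torch

def save_checkpoint(path, online_net, target_net, optimizer, history, hparams, episode):
    """Bundle model weights, training history, and hyperparameters into one .pt file."""
    torch.save({
        "episode": episode,
        "online_state_dict": online_net.state_dict(),
        "target_state_dict": target_net.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "history": history,    # e.g. {"rewards": [...], "costs": [...], "losses": [...]}
        "hparams": hparams,    # e.g. {"lr": 1e-4, "gamma": 0.95, "batch_size": 64}
    }, path)

def load_checkpoint(path, online_net, target_net, optimizer=None, device="cpu"):
    """Restore networks (and optionally the optimizer) for analysis or resumed training."""
    ckpt = torch.load(path, map_location=device)
    online_net.load_state_dict(ckpt["online_state_dict"])
    target_net.load_state_dict(ckpt["target_state_dict"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["history"], ckpt["hparams"], ckpt["episode"]
```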
markov-dqn-v07-noisy/
README.md # This file
config.yaml # Configuration (v0.7)
requirements.txt # Dependencies
NOISY_NETWORKS.md # Implementation details
src/
  markov_fleet_environment.py # Markov MDP environment
  fleet_environment_gym.py # Gymnasium wrapper
  __init__.py
train_markov_fleet.py # Training script (v0.7, Noisy Net)
test_noisy_net.py # Verification script
compare_v06_v07.py # Performance comparison tool
- Python 3.12+
- NVIDIA GPU with CUDA 12.4+
- 16GB+ VRAM recommended
# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1
# Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# Install dependencies
pip install gymnasium numpy matplotlib pyyaml tqdm
# Verify implementation
python test_noisy_net.py
# Quick test (100 episodes)
python train_markov_fleet.py --episodes 100 --n-envs 4 --device cuda --output test_v07
# Standard training (1000 episodes)
python train_markov_fleet.py --episodes 1000 --n-envs 4 --device cuda --output outputs_v07_1k
# Production training (5000 episodes, recommended)
python train_markov_fleet.py --episodes 5000 --n-envs 16 --device cuda --output outputs_v07_5k
# Compare with v0.6
python compare_v06_v07.py # Requires both v0.6 and v0.7 checkpoints
Note: No ε-greedy parameters needed! Exploration is automatic via Noisy Networks.
# Visualize v0.7 training curves
python visualize_markov_v07.py --checkpoint outputs_v07_5k/models/markov_fleet_dqn_final_5000ep.pt
# Analyze v0.7 learned policy
python analyze_markov_v07.py --checkpoint outputs_v07_5k/models/markov_fleet_dqn_final_5000ep.pt --device cuda
Figure 1: Comprehensive training progress for v0.7 with Noisy Networks. (Top-left) Episode rewards with 50-episode moving average showing stable convergence. (Top-center) Total maintenance costs over episodes. (Top-right) Reward-cost trade-off colored by episode progression. (Bottom-left) Training loss with logarithmic scale. (Bottom-center) Reward distribution histogram. (Bottom-right) Training statistics table highlighting automatic exploration without ε-greedy.
Figure 2: Learning progress analysis across 5 training phases. (Top-left) Phase-wise reward distribution showing improvement over time. (Top-right) Cumulative mean reward with confidence intervals demonstrating convergence. (Bottom-left) Best reward trajectory showing continuous improvement. (Bottom-right) Phase-wise statistics table with mean, std, max, and min values.
Figure 3: Learned policy behavior analysis for v0.7 in evaluation mode (deterministic, no noise). (Top-left) Overall action distribution across all bridges and 30-year horizon. (Top-center) Urban vs Rural action comparison. (Top-right) Bridge state evolution showing maintenance effectiveness. (Bottom-left) Annual maintenance costs. (Bottom-center) Annual rewards. (Bottom-right) Performance summary with final states and action statistics.
| Metric | Value |
|---|---|
| Episodes Trained | 5,000 |
| Final Reward (last 100) | 1,368.31 |
| Best Reward | 2,754.75 |
| Final Cost (last 100) | $3,054,951k |
| Test Episode Reward | 1,387.10 |
| Test Episode Cost | $2,956,404k |
| Exploration Method | Noisy Networks |
| ε-greedy Used | No |
- Phase 3 Base: dql-maintenance-faster
- v0.6: [markov-dql-vectorized]
- Original Implementation: Multi-Bridge Fleet Maintenance with Vectorized DQN
MIT License
For questions or collaboration, please open an issue.
Version: 0.7
Last Updated: 2025-12-08
Based On: Phase 3 Vectorized DQN + Noisy Networks (ICLR 2018)
Performance: 21.1% higher final reward than v0.6, 22.7% lower reward standard deviation



