
Markov Decision Process DQN with Noisy Networks (v0.7)

Deep Q-Network implementation for optimal bridge maintenance planning, formulated as a Markov Decision Process, with vectorized parallel training and Noisy Networks for Exploration (ICLR 2018).

Based on Phase 3 (Vectorized DQN) + Noisy Networks from dql-maintenance-faster project.

Project Overview

This project extends Phase 3 (Vectorized DQN) to implement a Markov Maintenance Policy using DQN with:

  • Explicit state transition modeling
  • Policy optimization based on Markov Decision Process theory
  • Vectorized parallel training (AsyncVectorEnv)
  • GPU-accelerated training with Mixed Precision (AMP)
  • Noisy Networks for Exploration (ICLR 2018) - eliminates ε-greedy exploration

Key Features (Inherited from Phase 3)

  • 14x Faster Training: AsyncVectorEnv with 4 parallel environments
  • Stable Convergence: Prioritized Experience Replay (PER)
  • GPU-Accelerated: CUDA support with Mixed Precision Training
  • Production-Ready: Validated on 30-year maintenance simulations

New in v0.7: Noisy Networks

  • Parameter-Space Exploration: Factorised Gaussian noise in network weights
  • No ε-greedy Needed: Automatic exploration through stochastic policy
  • Better Sample Efficiency: Learned exploration strategy
  • Based on: Fortunato et al., "Noisy Networks for Exploration" (ICLR 2018)
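
The core mechanism can be sketched as a single factorised-noise layer in PyTorch. This is a minimal illustration following Fortunato et al. (ICLR 2018), not the repository's actual `NoisyLinear` code; the initialisation constants (σ₀ = 0.5, μ bounds of ±1/√fan_in) are the paper's defaults.

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with factorised Gaussian noise (Fortunato et al., 2018)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Learnable mean and noise-scale parameters
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers: not trained, resampled via reset_noise()
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _f(x):
        # Factorised noise transform: f(x) = sgn(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised noise: outer product of two small noise vectors
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        self.weight_eps.copy_(eps_out.outer(eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:
            w = self.weight_mu + self.weight_sigma * self.weight_eps
            b = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # evaluation: deterministic mean weights only
            w, b = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, w, b)
```

Calling `reset_noise()` once per episode resamples ε, while `eval()` mode uses only the mean weights, matching the deterministic evaluation mode used for policy analysis.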

Performance Results: v0.6 vs v0.7 (5000 Episodes)

Experimental Comparison

| Metric | v0.6 (ε-greedy) | v0.7 (Noisy Net) | Improvement |
|---|---|---|---|
| Final Reward (mean) | 1,144.10 | 1,385.51 | +21.1% |
| Final Reward (std) | 561.61 | 433.97 | -22.7% |
| Best Reward (MA100) | 1,369.45 | 1,509.88 | +10.3% |
| Episodes to Best | 3,965 | 3,629 | -336 ep |
| Training Stability | Moderate | High | |

Key Findings:

  • 21.1% performance improvement with Noisy Networks
  • 22.7% reduction in variance - more stable learning
  • Faster convergence - reaches optimal policy 336 episodes earlier
  • No exploration schedule to tune - automatic exploration replaces ε-greedy annealing

Performance Comparison Visualization

v0.6 vs v0.7 Comparison

Figure: Comprehensive performance comparison between v0.6 (ε-greedy) and v0.7 (Noisy Networks) over 5000 episodes. Top row shows reward and cost progression with moving averages. Middle row displays final performance distribution (boxplot) and learning progress (cumulative best). Bottom row presents sample efficiency and detailed statistics table. Noisy Networks (red) consistently outperform ε-greedy (blue) across all metrics.

Technical Stack

Core Technologies (from Phase 3)

  1. Mixed Precision Training (AMP)
  2. Double DQN - Reduces overestimation bias
  3. Dueling DQN Architecture
  4. N-step Learning (n=3)
  5. Prioritized Experience Replay (PER)
  6. AsyncVectorEnv (4 parallel)
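
As a quick illustration of n-step learning (item 4), the n-step target folds the next n rewards into a single backup before bootstrapping from the target network's value. A minimal sketch, with an illustrative function name and signature:

```python
def n_step_return(rewards, bootstrap_q, gamma=0.95):
    """Truncated n-step return:
    G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * Q(s_n),
    computed backwards via the recursion G = r + gamma * G."""
    g = bootstrap_q
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With n=3 and γ=0.95 as in the diagram below, `n_step_return([1.0, 1.0, 1.0], 0.0)` gives 1 + 0.95 + 0.95² = 2.8525.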

New Features (v0.6 -> v0.7)

v0.6:

  • Markov Maintenance Policy: Explicit MDP formulation
  • State Transition Modeling: P(s'|s,a) representation
  • Policy Optimization: Bellman optimality with DQN

v0.7:

  • Noisy Networks: NoisyLinear layers for exploration
  • Factorised Gaussian Noise: Efficient parameter-space noise
  • Automatic Exploration: No manual ε-greedy tuning required
  • Reset Noise per Episode: Fresh exploration each episode

Markov DQN Learning Flow

1. Environment Setup and Markov Transition Model

```mermaid
graph TB
    A["AsyncVectorEnv<br/>16 Parallel Environments"] --> B["MarkovFleetEnvironment<br/>100 Bridges: 20 Urban + 80 Rural"]
    B --> C["State Space<br/>3 States: Good, Fair, Poor"]
    B --> D["Action Space<br/>6 Actions: None, Work31-38"]

    C --> E["Transition Matrices<br/>P(s'|s,a)<br/>6 actions × 3×3 matrices"]
    D --> E

    E --> F["State Transition<br/>s' ~ P(·|s,a)"]
    F --> G["Reward: HEALTH_REWARD(s,s')"]
    F --> H["Cost: ACTION_COST(a)"]

    G --> I["Experience Generation<br/>(s, a, r, s', done, cost)"]
    H --> I

    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style B fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style E fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style F fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style I fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
```

Components:

  • Environment (Blue): Vectorized parallel execution with 16 environments
  • Markov Model (Yellow): Explicit P(s'|s,a) transitions for 6 maintenance actions
  • Experience (Green): Tuple generation with rewards and costs
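
The Markov model above reduces to one row-stochastic 3×3 matrix per action, and a step samples s' ~ P(·|s,a). A sketch with an illustrative no-action matrix (the values below are made up for the example, not the repository's calibrated transition probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative P(s'|s, a=None): states 0=Good, 1=Fair, 2=Poor; rows sum to 1
P_no_action = np.array([
    [0.90, 0.08, 0.02],   # Good mostly stays Good, slowly degrades
    [0.00, 0.85, 0.15],   # Fair degrades toward Poor
    [0.00, 0.00, 1.00],   # Poor is absorbing without repair
])

def step_state(state, P, rng):
    """Sample the next condition state s' ~ P(.|s, a) for one bridge."""
    return rng.choice(3, p=P[state])

s_next = step_state(0, P_no_action, rng)
```

In the vectorized environment this sampling runs for all 100 bridges per step, in 16 environments in parallel.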

2. DQN Training Loop with Noisy Networks (v0.7)

```mermaid
graph TB
    A["Experience<br/>(s, a, r, s', done)"] --> B["Prioritized Replay Buffer<br/>Capacity: 10k<br/>Priority: TD-error"]

    B --> C["Sample Mini-batch<br/>Batch size: 64"]
    C --> D["N-step Returns<br/>n=3, γ=0.95"]
    D --> E["Double DQN Target<br/>Q_target = r + γ Q_target(s', argmax Q_online(s'))"]

    E --> F["Dueling Network<br/>with Noisy Layers"]
    F --> G["Value Stream V(s)<br/>NoisyLinear(256→128→1)"]
    F --> H["Advantage Stream A(s,a)<br/>NoisyLinear(256→128→600)"]

    G --> I["Q(s,a) = V(s) + A(s,a) - mean(A)"]
    H --> I

    I --> J["TD-error<br/>δ = Q_target - Q(s,a)"]
    J --> K["MSE Loss<br/>L = (Q_target - Q)²"]
    K --> L["AMP Backpropagation<br/>Mixed Precision"]
    L --> M["Update Q-network<br/>θ ← θ - α∇L<br/>(includes noise params σ)"]

    M --> N["Update Buffer Priorities<br/>priority ← abs(δ)"]
    N --> O{"Target Sync?<br/>Every 500 steps"}
    O -->|Yes| P["θ_target ← θ_online"]
    O -->|No| Q["Continue Training"]
    P --> Q

    Q --> R["Reset Noise<br/>Sample new ε ~ f(x)<br/>f(x) = sgn(x)√|x|"]
    R --> S["Greedy Action Selection<br/>a = argmax Q(s,a)<br/>(exploration via noise)"]
    S --> A

    style B fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style E fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style G fill:#ffccff,stroke:#cc0066,stroke-width:3px
    style H fill:#ffccff,stroke:#cc0066,stroke-width:3px
    style I fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style L fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style R fill:#ffe6e6,stroke:#ff0000,stroke-width:2px
```

Components:

  • Replay Buffer (Pink): Prioritized experience sampling
  • Double DQN (Pink): Reduces Q-value overestimation
  • Dueling Architecture with Noisy Layers (Purple): NoisyLinear in value/advantage streams
  • AMP Training (Green): GPU-accelerated mixed precision
  • Target Network (Yellow): Periodic synchronization for stability
  • Noise Reset (Red): Factorised Gaussian noise for automatic exploration

Key Innovation (v0.7): NoisyLinear layers eliminate ε-greedy exploration by injecting learnable stochastic noise directly into the network parameters. In this comparison they yield a 21.1% higher mean final reward and a 22.7% lower reward variance than ε-greedy.
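
The two pink steps in the diagram, dueling aggregation and the Double DQN target, can be sketched directly. This is an illustration of the standard formulas named above, not the repository's code; function names are made up, and the γⁿ discount follows the n-step setting (n=3):

```python
import torch

def dueling_q(value, advantage):
    """Q(s,a) = V(s) + A(s,a) - mean_a A(s,a), as in the dueling head above.
    value: (batch, 1), advantage: (batch, n_actions)."""
    return value + advantage - advantage.mean(dim=1, keepdim=True)

@torch.no_grad()
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next, n=3):
    """y = r + gamma^n * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    a_star = q_online_next.argmax(dim=1, keepdim=True)   # action picked by online net
    q_next = q_target_next.gather(1, a_star).squeeze(1)  # value from target net
    return reward + (gamma ** n) * q_next * (1.0 - done)
```

Selecting the action with the online network but evaluating it with the target network is what reduces the overestimation bias listed under Core Technologies.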

3. Monitoring and Output Visualization

```mermaid
graph TB
    A["Training Loop"] --> B["Collect Episode Data"]

    B --> C["Rewards History"]
    B --> D["Costs History"]
    B --> E["Loss History"]

    C --> G["Episode Statistics<br/>Mean reward: +1,385<br/>Best reward: +1,510 (MA100)"]
    D --> G
    E --> G

    G --> H["Save Checkpoint<br/>Every 1000 episodes"]
    H --> I["Model State Dict<br/>θ_online, θ_target"]
    H --> J["Training History<br/>rewards, costs, losses"]
    H --> K["Hyperparameters<br/>lr, γ, etc."]

    I --> L["Checkpoint File<br/>.pt format"]
    J --> L
    K --> L

    L --> M["visualize_markov_v07.py"]
    L --> N["analyze_markov_v07.py"]

    M --> O["Training Curves<br/>6-panel figure"]
    M --> P["Learning Progress<br/>Phase analysis"]

    N --> Q["Action Analysis<br/>Policy behavior"]
    N --> R["Cost Distribution<br/>Mean: $2.59M"]

    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style G fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style L fill:#f5e1ff,stroke:#9900cc,stroke-width:2px
    style O fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style Q fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style R fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
```

Components:

  • Data Collection (Blue): Real-time metric tracking during training
  • Statistics (Yellow): Aggregated performance metrics
  • Checkpointing (Purple): Persistent storage of model and history
  • Visualization (Green): Post-training analysis and plotting
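
The checkpointing step above amounts to bundling the two state dicts, the history lists, and the hyperparameters into one `.pt` file. A minimal sketch; the key names are illustrative, not necessarily the repo's exact schema:

```python
import torch

def save_checkpoint(path, online_net, target_net, history, hparams):
    """Bundle model weights, training history, and hyperparameters into one .pt file.
    Key names here are illustrative, not the repository's exact schema."""
    torch.save({
        "online_state_dict": online_net.state_dict(),
        "target_state_dict": target_net.state_dict(),
        "history": history,   # e.g. {"rewards": [...], "costs": [...], "losses": [...]}
        "hparams": hparams,   # e.g. {"lr": 1e-4, "gamma": 0.95}
    }, path)
```

Storing everything in one dict lets the visualization and analysis scripts recover both the policy and the full training trajectory from a single file.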

Project Structure

```
markov-dqn-v07-noisy/
├── README.md                        # This file
├── config.yaml                      # Configuration (v0.7)
├── requirements.txt                 # Dependencies
├── NOISY_NETWORKS.md                # Implementation details
├── src/
│   ├── markov_fleet_environment.py  # Markov MDP environment
│   ├── fleet_environment_gym.py     # Gymnasium wrapper
│   └── __init__.py
├── train_markov_fleet.py            # Training script (v0.7, Noisy Net)
├── test_noisy_net.py                # Verification script
└── compare_v06_v07.py               # Performance comparison tool
```

Quick Start

Prerequisites

  • Python 3.12+
  • NVIDIA GPU with CUDA 12.4+
  • 16GB+ VRAM recommended

Installation

```powershell
# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install gymnasium numpy matplotlib pyyaml tqdm
```

Training

```powershell
# Verify implementation
python test_noisy_net.py

# Quick test (100 episodes)
python train_markov_fleet.py --episodes 100 --n-envs 4 --device cuda --output test_v07

# Standard training (1000 episodes)
python train_markov_fleet.py --episodes 1000 --n-envs 4 --device cuda --output outputs_v07_1k

# Production training (5000 episodes, recommended)
python train_markov_fleet.py --episodes 5000 --n-envs 16 --device cuda --output outputs_v07_5k

# Compare with v0.6
python compare_v06_v07.py  # Requires both v0.6 and v0.7 checkpoints
```

Note: No ε-greedy parameters needed! Exploration is automatic via Noisy Networks.

Visualization & Analysis

```powershell
# Visualize v0.7 training curves
python visualize_markov_v07.py --checkpoint outputs_v07_5k/models/markov_fleet_dqn_final_5000ep.pt

# Analyze v0.7 learned policy
python analyze_markov_v07.py --checkpoint outputs_v07_5k/models/markov_fleet_dqn_final_5000ep.pt --device cuda
```

Training Results (v0.7 - 5000 Episodes)

Training Curves

Training Curves v0.7

Figure 1: Comprehensive training progress for v0.7 with Noisy Networks. (Top-left) Episode rewards with 50-episode moving average showing stable convergence. (Top-center) Total maintenance costs over episodes. (Top-right) Reward-cost trade-off colored by episode progression. (Bottom-left) Training loss with logarithmic scale. (Bottom-center) Reward distribution histogram. (Bottom-right) Training statistics table highlighting automatic exploration without ε-greedy.

Learning Progress

Learning Progress v0.7

Figure 2: Learning progress analysis across 5 training phases. (Top-left) Phase-wise reward distribution showing improvement over time. (Top-right) Cumulative mean reward with confidence intervals demonstrating convergence. (Bottom-left) Best reward trajectory showing continuous improvement. (Bottom-right) Phase-wise statistics table with mean, std, max, and min values.

Policy Analysis

Action Analysis v0.7

Figure 3: Learned policy behavior analysis for v0.7 in evaluation mode (deterministic, no noise). (Top-left) Overall action distribution across all bridges and 30-year horizon. (Top-center) Urban vs Rural action comparison. (Top-right) Bridge state evolution showing maintenance effectiveness. (Bottom-left) Annual maintenance costs. (Bottom-center) Annual rewards. (Bottom-right) Performance summary with final states and action statistics.

Performance Summary

| Metric | Value |
|---|---|
| Episodes Trained | 5,000 |
| Final Reward (last 100) | 1,368.31 |
| Best Reward | 2,754.75 |
| Final Cost (last 100) | $3,054,951k |
| Test Episode Reward | 1,387.10 |
| Test Episode Cost | $2,956,404k |
| Exploration Method | Noisy Networks |
| ε-greedy Used | No |

Related Projects

  • Phase 3 Base: dql-maintenance-faster
  • v0.6: [markov-dql-vectorized]
  • Original Implementation: Multi-Bridge Fleet Maintenance with Vectorized DQN

License

MIT License

Contact

For questions or collaboration, please open an issue.


Version: 0.7
Last Updated: 2025-12-08
Based On: Phase 3 Vectorized DQN + Noisy Networks (ICLR 2018)
Performance: 21.1% higher mean final reward and 22.7% lower reward variance than v0.6
