This repository presents a production-grade Deep Q-Network (DQN) implementation that teaches an agent to autonomously land a spacecraft on the lunar surface. The agent learns entirely from raw 8-dimensional state observations through trial-and-error interaction with the environment — no hand-crafted heuristics, no human demonstrations.
The project implements the foundational algorithm from DeepMind's landmark paper "Human-level control through deep reinforcement learning" (Mnih et al., Nature 2015), adapted for continuous-state, discrete-action control.
The environment is considered "solved" when the agent achieves an average reward of ≥ 200 over 100 consecutive episodes.
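For reference, that criterion is straightforward to track with a fixed-length window. A minimal sketch, not the repository's exact bookkeeping:

```python
from collections import deque

recent = deque(maxlen=100)  # rewards of the last 100 episodes

def record(episode_reward: float) -> bool:
    """Append an episode reward and report whether the environment is solved."""
    recent.append(episode_reward)
    return len(recent) == 100 and sum(recent) / 100 >= 200.0
```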
┌─────────────────────────────────────────────────────────────────┐
│ DQN Agent Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────┐ ┌──────────┐ ┌───────────────────────┐ │
│ │ LunarLa- │ │ ε-Greedy │ │ Q-Network │ │
│ │ nder-v2 │───▶│ Policy │───▶│ ┌─────────────────┐ │ │
│ │ (Gym) │ │ │ │ │ Input: 8 dims │ │ │
│ └─────┬─────┘ └──────────┘ │ │ Hidden: 256×ReLU │ │ │
│ │ │ │ Hidden: 256×ReLU │ │ │
│ │ (s, a, r, s', done) │ │ Output: 4 actions│ │ │
│ │ │ └─────────────────┘ │ │
│ ▼ └───────────┬───────────┘ │
│ ┌─────────────┐ │ │
│ │ Replay │◀─────── Sample Batch ────────┘ │
│ │ Buffer │ (batch=64) │
│ │ (1M trans) │ │
│ └─────────────┘ │
│ │
│ Loss = MSE( Q(s,a) , r + γ·max_a' Q(s',a')·(1-done) ) │
│ │
└─────────────────────────────────────────────────────────────────┘
| Component | Choice | Rationale |
|---|---|---|
| Q-Network | 2-layer MLP (256 units each) | Sufficient capacity for 8D→4 mapping without overfitting |
| Activation | ReLU | Efficient gradients, avoids vanishing gradient in shallow nets |
| Optimizer | Adam (lr=0.001) | Adaptive learning rate, fast convergence |
| Replay Buffer | 1M transitions, circular | Breaks temporal correlation, improves sample efficiency |
| Exploration | ε-greedy, exponential decay (0.9995) | Smooth transition from exploration to exploitation |
| Discount (γ) | 0.99 | Long planning horizon — landing requires sustained strategy |
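Taken together, those choices yield a very small model. The sketch below assumes Keras, consistent with the saved `.weights.h5` format; it mirrors the table rather than the repository's exact code:

```python
from tensorflow import keras

def build_q_network(state_dim: int = 8, n_actions: int = 4,
                    lr: float = 0.001) -> keras.Model:
    """2-layer MLP mapping an 8-D state to one Q-value per action."""
    model = keras.Sequential([
        keras.layers.Input(shape=(state_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),  # Q(s, ·)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse")
    return model
```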
| Index | Feature | Description |
|---|---|---|
| 0 | x | Horizontal position |
| 1 | y | Vertical position |
| 2 | vx | Horizontal velocity |
| 3 | vy | Vertical velocity |
| 4 | θ | Angle |
| 5 | ω | Angular velocity |
| 6 | left_leg | Left leg ground contact (bool) |
| 7 | right_leg | Right leg ground contact (bool) |
| Action | Description |
|---|---|
| 0 | Do nothing |
| 1 | Fire left engine |
| 2 | Fire main engine |
| 3 | Fire right engine |
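For orientation, one interaction step under the Gymnasium API looks like this. This is a sketch, assuming a Gymnasium version that still registers `LunarLander-v2`; the loop in `train.py` wraps the same calls:

```python
import gymnasium as gym

env = gym.make("LunarLander-v2")
state, info = env.reset(seed=0)   # state: 8-dimensional NumPy array
next_state, reward, terminated, truncated, info = env.step(2)  # fire main engine
done = terminated or truncated    # the (s, a, r, s', done) tuple goes to the buffer
env.close()
```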
The agent converges to a stable landing policy within ~400 episodes, consistently surpassing the solved threshold of +200 average reward.
├── dqn/ # Core DQN package
│ ├── __init__.py # Package exports
│ ├── agent.py # DQNAgent — network, action selection, learning
│ ├── replay_buffer.py # ReplayBuffer — circular experience storage
│ └── config.py # DQNConfig — centralized hyperparameters
│
├── train.py # CLI training script with full arg parsing
├── evaluate.py # CLI evaluation script with statistics
│
├── Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2.py
│ # Original monolithic training script
├── DQN_for_Gym_LunarLander.ipynb # Interactive Jupyter notebook version
│
├── requirements.txt # Pinned dependencies
├── pyproject.toml # Modern Python packaging (PEP 621)
├── .gitignore # Git ignore rules
├── LICENSE # MIT License
├── CITATION.cff # Academic citation metadata
└── README.md # ← You are here
- Python 3.10+
- SWIG (required for Box2D compilation)
# Clone the repository
git clone https://github.com/AbirGadworker/Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python.git
cd Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

# Default training (500 episodes)
python train.py
# Custom configuration
python train.py --episodes 1000 --batch-size 128 --gamma 0.995
# Train with live rendering
python train.py --render --episodes 200

# Run 10 greedy evaluation episodes
python evaluate.py
# Evaluate with visual rendering
python evaluate.py --episodes 50 --render
# Use specific weights
python evaluate.py --weights results/DQN_LunarLanderV2.weights.h5

The Q-function is updated toward the temporal difference target:

$$y = r + \gamma \,(1 - \text{done}) \,\max_{a'} Q(s', a')$$

where r is the immediate reward, γ is the discount factor, and done = 1 for terminal transitions. The loss is the mean squared error between Q(s, a) and y, exactly as in the pipeline diagram above.
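In code, one gradient step on a sampled mini-batch reduces to a few lines. This is a NumPy/Keras sketch, not the exact logic of `dqn/agent.py`; it assumes `batch` is a tuple of NumPy arrays and `model` is the network built above:

```python
import numpy as np

def train_step(model, batch, gamma: float = 0.99) -> None:
    """One DQN update on a uniformly sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = batch
    # Bootstrapped target: r + γ·max_a' Q(s', a'), cut off at terminal states
    next_q = model.predict(next_states, verbose=0).max(axis=1)
    targets = model.predict(states, verbose=0)   # start from current Q(s, ·)
    targets[np.arange(len(actions)), actions] = (
        rewards + gamma * (1.0 - dones) * next_q
    )
    model.fit(states, targets, verbose=0)        # MSE against the target
```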
Instead of learning from sequential transitions (which are highly correlated), we store all transitions in a circular buffer of 1M capacity and sample uniformly at random in mini-batches of 64. This provides:
- Decorrelation — breaks the temporal dependency between consecutive samples
- Data efficiency — each transition can be reused across multiple gradient updates
- Stability — smooths out the non-stationary distribution of incoming data
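A buffer with exactly these properties fits in a dozen lines. A minimal sketch; the packaged `ReplayBuffer` in `dqn/replay_buffer.py` may differ in its details:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity circular store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks temporal correlation
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```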
The exploration rate ε follows an exponential decay with a hard floor: after each episode, ε ← max(ε_min, ε · 0.9995). Starting from ε = 1.0, the agent acts almost entirely at random early in training and settles toward near-greedy behavior (ε_min = 0.01) as the policy matures.
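The schedule itself is a single line per episode. The sketch below uses the values from the hyperparameter table; whether the repository applies the decay per step or per episode is a detail of `train.py`:

```python
EPS_START, EPS_DECAY, EPS_MIN = 1.0, 0.9995, 0.01

epsilon = EPS_START
for episode in range(500):
    # ... run one ε-greedy episode with the current epsilon ...
    epsilon = max(EPS_MIN, epsilon * EPS_DECAY)  # exponential decay with a floor
```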
| Parameter | Value | CLI Flag |
|---|---|---|
| Discount factor (γ) | 0.99 | --gamma |
| Initial ε | 1.0 | — |
| ε decay rate | 0.9995 | — |
| Minimum ε | 0.01 | — |
| Batch size | 64 | --batch-size |
| Replay buffer size | 1,000,000 | — |
| Hidden layers | 2 × 256 | — |
| Learning rate | 0.001 | --lr |
| Episodes | 500 | --episodes |
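These defaults are centralized in `dqn/config.py`. A dataclass along the following lines would capture them; the field names here are illustrative, not necessarily the repository's:

```python
from dataclasses import dataclass

@dataclass
class DQNConfig:
    gamma: float = 0.99
    epsilon_start: float = 1.0
    epsilon_decay: float = 0.9995
    epsilon_min: float = 0.01
    batch_size: int = 64
    buffer_size: int = 1_000_000
    hidden_units: int = 256
    learning_rate: float = 0.001
    episodes: int = 500
```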
If you use this work in your research, please cite:
@software{ag2026dqn,
author = {AG},
title = {Deep Q-Network for Solving OpenAI Gym LunarLander-v2},
year = {2026},
url = {https://github.com/AbirGadworker/Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python},
license = {MIT}
}

- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3), 293–321.
- Gymnasium Documentation — LunarLander-v2
Built with purpose. Trained with patience. Landed with precision.
MIT License © 2026 AG


