
Deep Q-Network for Autonomous Lunar Landing

A from-scratch implementation of Deep Reinforcement Learning for OpenAI Gym's LunarLander-v2

Python 3.10+ · TensorFlow · Keras · Gymnasium · License: MIT

Developed by AG — Chief AI Officer, Google


(Demo: a trained DQN agent landing in LunarLander-v2)

Overview

This repository presents a production-grade Deep Q-Network (DQN) implementation that teaches an agent to autonomously land a spacecraft on the lunar surface. The agent learns entirely from raw 8-dimensional state observations through trial-and-error interaction with the environment — no hand-crafted heuristics, no human demonstrations.

The project implements the foundational algorithm from DeepMind's landmark paper "Human-level control through deep reinforcement learning" (Mnih et al., Nature 2015), adapted for continuous-state, discrete-action control.

The environment is considered "solved" when the agent achieves an average reward of ≥ 200 over 100 consecutive episodes.
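In code, the solved check is just a moving average over the reward history. A minimal sketch (is_solved is an illustrative helper, not a function from this repository):

import numpy as np

def is_solved(episode_rewards, threshold=200.0, window=100):
    """True once the mean reward over the last `window` episodes reaches the threshold."""
    if len(episode_rewards) < window:
        return False
    return float(np.mean(episode_rewards[-window:])) >= threshold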


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        DQN Agent Pipeline                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌───────────┐    ┌──────────┐    ┌────────────────────────┐    │
│   │ LunarLan- │    │ ε-Greedy │    │       Q-Network        │    │
│   │ der-v2    │───▶│  Policy  │───▶│  ┌──────────────────┐  │    │
│   │ (Gym)     │    │          │    │  │ Input:  8 dims   │  │    │
│   └─────┬─────┘    └──────────┘    │  │ Hidden: 256×ReLU │  │    │
│         │                          │  │ Hidden: 256×ReLU │  │    │
│         │  (s, a, r, s', done)     │  │ Output: 4 actions│  │    │
│         │                          │  └──────────────────┘  │    │
│         ▼                          └────────────┬───────────┘    │
│   ┌─────────────┐                               │                │
│   │   Replay    │◀──────── Sample Batch ────────┘                │
│   │   Buffer    │          (batch=64)                            │
│   │  (1M trans) │                                                │
│   └─────────────┘                                                │
│                                                                  │
│   Loss = MSE( Q(s,a), r + γ·max_a' Q(s',a')·(1-done) )           │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Key Design Decisions

Component      Choice                                  Rationale
Q-Network      2-layer MLP (256 units each)            Sufficient capacity for the 8D→4 mapping without overfitting
Activation     ReLU                                    Efficient gradients; avoids vanishing gradients in shallow nets
Optimizer      Adam (lr=0.001)                         Adaptive learning rate, fast convergence
Replay Buffer  1M transitions, circular                Breaks temporal correlation, improves sample efficiency
Exploration    ε-greedy, exponential decay (0.9995)    Smooth transition from exploration to exploitation
Discount (γ)   0.99                                    Long planning horizon: landing requires a sustained strategy
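
In Keras, the network described above comes out to a handful of lines. A sketch under the table's hyperparameters (build_q_network is an illustrative name, not necessarily what dqn/agent.py defines):

from tensorflow import keras

def build_q_network(state_dim=8, n_actions=4, lr=0.001):
    """Two hidden ReLU layers of 256 units each, topped by a linear Q-value head."""
    model = keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),  # one Q-value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

Training then minimizes the MSE between Q(s, a) and the Bellman target described in the Algorithm Deep Dive below.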

State & Action Space

Observation (8-dimensional continuous vector)

Index  Feature    Description
0      x          Horizontal position
1      y          Vertical position
2      vx         Horizontal velocity
3      vy         Vertical velocity
4      θ          Angle
5      ω          Angular velocity
6      left_leg   Left leg ground contact (bool)
7      right_leg  Right leg ground contact (bool)

Actions (4 discrete)

Action  Description
0       Do nothing
1       Fire left engine
2       Fire main engine
3       Fire right engine
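
Both spaces can be inspected directly from the environment. A sketch, assuming a Gymnasium release that still ships LunarLander-v2 (recent versions renamed it LunarLander-v3):

import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,)  -> the state vector above
print(env.action_space.n)           # 4     -> the discrete actions above

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
done = terminated or truncated  # episode ends on landing, crashing, or timeout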

Training Results

The agent converges to a stable landing policy within ~400 episodes, consistently surpassing the solved threshold of +200 average reward:

(Plots: Total Episode Rewards per episode and the Running Average Reward)

Project Structure

.
├── dqn/                          # Core DQN package
│   ├── __init__.py               # Package exports
│   ├── agent.py                  # DQNAgent — network, action selection, learning
│   ├── replay_buffer.py          # ReplayBuffer — circular experience storage
│   └── config.py                 # DQNConfig — centralized hyperparameters
│
├── train.py                      # CLI training script with full arg parsing
├── evaluate.py                   # CLI evaluation script with statistics
│
├── Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2.py
│                                 # Original monolithic training script
├── DQN_for_Gym_LunarLander.ipynb # Interactive Jupyter notebook version
│
├── requirements.txt              # Pinned dependencies
├── pyproject.toml                # Modern Python packaging (PEP 621)
├── .gitignore                    # Git ignore rules
├── LICENSE                       # MIT License
├── CITATION.cff                  # Academic citation metadata
└── README.md                     # ← You are here

Quick Start

Prerequisites

  • Python 3.10+
  • SWIG (required for Box2D compilation)

Installation

# Clone the repository
git clone https://github.com/AbirGadworker/Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python.git
cd Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
.venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt

Train

# Default training (500 episodes)
python train.py

# Custom configuration
python train.py --episodes 1000 --batch-size 128 --gamma 0.995

# Train with live rendering
python train.py --render --episodes 200

Evaluate

# Run 10 greedy evaluation episodes
python evaluate.py

# Evaluate with visual rendering
python evaluate.py --episodes 50 --render

# Use specific weights
python evaluate.py --weights results/DQN_LunarLanderV2.weights.h5

Algorithm Deep Dive

The Bellman Equation at the Core

The Q-function is updated toward the temporal difference target:

$$Q(s, a) \leftarrow r + \gamma \cdot \max_{a'} Q(s', a') \cdot (1 - \text{done})$$

where $r$ is the immediate reward, $\gamma$ is the discount factor, and the $(1 - \text{done})$ term zeroes out future value at terminal states.
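
In batched form, the target is a single vectorized expression. A NumPy sketch with illustrative names:

import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Per-sample Bellman targets: r + gamma * max_a' Q(s', a') * (1 - done)."""
    return rewards + gamma * np.max(next_q_values, axis=1) * (1.0 - dones)

The network is then fit so that Q(s, a) for the taken actions matches these targets under the MSE loss.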

Experience Replay

Instead of learning from sequential transitions (which are highly correlated), we store all transitions in a circular buffer of 1M capacity and sample uniformly at random in mini-batches of 64. This provides:

  1. Decorrelation — breaks the temporal dependency between consecutive samples
  2. Data efficiency — each transition can be reused across multiple gradient updates
  3. Stability — smooths out the non-stationary distribution of incoming data
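
A circular buffer with uniform sampling can be sketched with a deque (illustrative only; dqn/replay_buffer.py may organize storage differently):

import random
from collections import deque

class ReplayBuffer:
    """Circular experience storage: the oldest transitions are evicted at capacity."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation between transitions
        return random.sample(self.buffer, batch_size)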

Exploration Schedule

The exploration rate $\varepsilon$ decays exponentially:

$$\varepsilon_{t+1} = \max(\varepsilon_t \times 0.9995, \ 0.01)$$

Starting from $\varepsilon = 1.0$ (fully random), the agent gradually shifts to exploitation while maintaining a 1% exploration floor to prevent convergence to suboptimal deterministic policies.
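
Action selection and the decay step then look roughly like this (a sketch; select_action and decay_epsilon are illustrative helpers):

import numpy as np

def select_action(q_network, state, epsilon, n_actions=4):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily on Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_network.predict(state[None, :], verbose=0)  # Keras forward pass, batch of one
    return int(np.argmax(q_values[0]))

def decay_epsilon(epsilon, rate=0.9995, floor=0.01):
    """One exponential decay step, clamped at the exploration floor."""
    return max(epsilon * rate, floor)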


Hyperparameter Reference

Parameter            Value      CLI Flag
Discount factor (γ)  0.99       --gamma
Initial ε            1.0
ε decay rate         0.9995
Minimum ε            0.01
Batch size           64         --batch-size
Replay buffer size   1,000,000
Hidden layers        2 × 256
Learning rate        0.001      --lr
Episodes             500        --episodes

Citation

If you use this work in your research, please cite:

@software{ag2026dqn,
  author    = {AG},
  title     = {Deep Q-Network for Solving OpenAI Gym LunarLander-v2},
  year      = {2026},
  url       = {https://github.com/AbirGadworker/Deep-Q-Learning-for-solving-OpenAi-Gym-LunarLander-v2-in-python},
  license   = {MIT}
}

References

  1. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
  2. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3), 293–321.
  3. Gymnasium Documentation — LunarLander-v2

Built with purpose. Trained with patience. Landed with precision.

MIT License © 2026 AG
