
# 🧠 Monte Carlo Reinforcement Learning — 4×4 GridWorld

A First-Principles Implementation of Model-Free Policy Optimization


Demonstrating core reinforcement learning principles through a clean, from-scratch implementation — no frameworks, no shortcuts.


![GridWorld Environment](assets/gridworld_environment.png)

## Overview

This project implements a Monte Carlo (MC) control algorithm with ε-greedy exploration to solve a 4×4 GridWorld navigation problem from scratch. The agent learns an optimal policy to navigate from a start cell to a goal cell while avoiding penalty states — using only sampled episode returns, with no model of the environment dynamics.

Unlike implementations that rely on OpenAI Gym or stable-baselines, this is a ground-up implementation of the environment, Q-table, policy iteration loop, and Monte Carlo sampling — providing full transparency into every component of the RL pipeline.


πŸ—οΈ Environment Architecture

The GridWorld is a $4 \times 4$ discrete state space with 15 navigable states and 1 terminal goal state:

```text
┌─────┬─────┬─────┬─────┐
│  S  │  ·  │  ·  │  ·  │    S = Start (0,0)
├─────┼─────┼─────┼─────┤    · = Normal cell (reward = 0)
│  ·  │  ·  │  ·  │  ✕  │    ✕ = Penalty cell (negative reward)
├─────┼─────┼─────┼─────┤    T = Terminal / Goal (positive reward)
│  ·  │  ✕  │  ·  │  ·  │
├─────┼─────┼─────┼─────┤
│  ·  │  ✕  │  ·  │  T  │
└─────┴─────┴─────┴─────┘
```
| Property | Value |
| --- | --- |
| State space | 16 cells (15 navigable + 1 terminal) |
| Action space | {↑, ↓, ←, →} per state (boundary-aware) |
| Reward structure | Goal: +0.03; penalties: −0.01 to −0.011 |
| Discount factor (γ) | 0.9 |
| Exploration rate (ε) | 0.05 |
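As a concrete reference point, below is a minimal sketch of how such an environment can be encoded in plain Python. The names (`REWARDS`, `MOVES`, `step`) are illustrative rather than the repository's exact API; the reward dictionary mirrors the one shown in the Customization section.

```python
# Minimal GridWorld sketch (illustrative names, not the repo's exact API).
GOAL = (3, 3)
REWARDS = {(3, 3): 0.03, (1, 3): -0.01, (2, 1): -0.011, (3, 1): -0.01}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply a move; return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    next_state = (state[0] + dr, state[1] + dc)
    reward = REWARDS.get(next_state, 0.0)      # normal cells yield 0
    return next_state, reward, next_state == GOAL
```

Boundary handling lives in the per-state action sets (only legal moves are offered from each cell), so `step` never needs to clamp coordinates.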

## 🔬 Algorithm: Monte Carlo Control with ε-Greedy Policy

The algorithm follows the Generalized Policy Iteration (GPI) paradigm, alternating between:

  1. Policy Evaluation — Estimate $Q(s, a)$ via Monte Carlo sampling of full episodes
  2. Policy Improvement — Greedily update the policy: $\pi(s) = \arg\max_a Q(s, a)$

### Mathematical Foundation

For each episode trajectory $\tau = \{(s_0, a_0, r_1), (s_1, a_1, r_2), \ldots\}$, the return is computed backward from the end of the episode:

$$G_t = r_{t+1} + \gamma \, G_{t+1}$$
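A short sketch of that backward pass, assuming an episode is stored as a list of `(state, action, reward)` triples:

```python
def compute_returns(episode, gamma=0.9):
    """Walk the episode backward, accumulating G_t = r_{t+1} + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g             # discounted return from this step on
        returns.append((state, action, g))
    returns.reverse()                      # restore chronological order
    return returns
```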

Q-values are updated using incremental mean estimation, where $N(s, a)$ is the number of sampled returns for the pair:

$$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \left( G_t - Q(s, a) \right)$$

A soft update blends estimated values into the persistent Q-table:

$$Q_{\text{table}}(s, a) \leftarrow Q_{\text{table}}(s, a) + \alpha \left( Q_{\text{est}}(s, a) - Q_{\text{table}}(s, a) \right)$$

where $\alpha = 0.05$ is the learning rate.
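Both updates fit in a few lines. In the sketch below, `q_table` maps `(state, action)` pairs to values and `update_rate` plays the role of $\alpha$; the names and batching scheme are assumptions, not the repository's exact code.

```python
from collections import defaultdict

def update_q_table(q_table, returns, update_rate=0.05):
    """Blend fresh Monte Carlo estimates into a persistent Q-table."""
    q_est, counts = defaultdict(float), defaultdict(int)
    for state, action, g in returns:
        sa = (state, action)
        counts[sa] += 1
        # Incremental mean: Q_est <- Q_est + (G - Q_est) / N
        q_est[sa] += (g - q_est[sa]) / counts[sa]
    for sa, est in q_est.items():
        # Soft update: Q <- Q + alpha * (Q_est - Q)
        old = q_table.get(sa, 0.0)
        q_table[sa] = old + update_rate * (est - old)
    return q_table
```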

### ε-Greedy Action Selection

$$a = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \arg\max_{a'} Q(s, a') & \text{with probability } 1 - \varepsilon \end{cases}$$

This ensures continuous exploration while predominantly exploiting learned values.
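A minimal sketch of this selection rule, reusing the boundary-aware action lists from the environment (names are illustrative):

```python
import random

def epsilon_greedy(q_table, state, legal_actions, epsilon=0.05):
    """Random action with probability epsilon, otherwise the greedy choice."""
    if random.random() < epsilon:
        return random.choice(legal_actions)   # explore
    # exploit: pick the legal action with the highest current Q-value
    return max(legal_actions, key=lambda a: q_table.get((state, a), 0.0))
```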


## 📊 Results & Visualizations

### Policy Evolution During Training

The agent starts with a random policy and converges to an optimal navigation strategy:

![Policy Evolution](assets/policy_evolution.png)

**Key Insight:** The policy stabilizes within ~25 iterations, demonstrating the rapid convergence of Monte Carlo methods in small state spaces.


### Learned Optimal Policy & Q-Value Landscape

![Optimal Policy Heatmap](assets/optimal_policy_heatmap.png)

The heatmap shows the maximum Q-value at each state. Higher values (green) indicate states closer to the goal on the optimal path. The arrows represent the learned greedy policy.


### Optimal Path Trace

![Optimal Path](assets/optimal_path.png)

The agent discovers the shortest viable path from Start to Goal while steering clear of every penalty cell.


### Q-Value Convergence & Stability

![Convergence Plot](assets/convergence.png)

- Blue curve: Maximum Q-value across all state-action pairs — monotonically increasing and plateauing, confirming convergence.
- Orange curve: Sum of absolute Q-value changes per iteration — decaying toward zero, indicating policy stability.
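Both curves can be recovered from a history of Q-table snapshots. A sketch, assuming each snapshot is a NumPy array of shape `(n_states, n_actions)` (the plotting script may organize this differently):

```python
import numpy as np

def convergence_metrics(q_history):
    """Return the two curves: max Q per snapshot, summed |delta Q| between snapshots."""
    max_q = [q.max() for q in q_history]              # blue curve
    deltas = [np.abs(b - a).sum()                     # orange curve
              for a, b in zip(q_history, q_history[1:])]
    return max_q, deltas
```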

### Exploration vs. Exploitation Balance

![Exploration vs Exploitation](assets/exploration_exploitation.png)

With ε = 0.05, the agent exploits its learned policy ~95% of the time while keeping 5% random exploration — matching the configured ε-greedy rate.


## 🚀 Quick Start

```bash
# Clone the repository
git clone https://github.com/<your-username>/Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Monte-Carlo-in-python.git
cd Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Monte-Carlo-in-python

# Install dependencies
pip install numpy matplotlib

# Run the main RL training
python RL-Monte-Carlo-Gridworld.py

# Generate all visualizations
python generate_plots.py
```

Or open the Jupyter notebook for an interactive walkthrough:

```bash
jupyter notebook Reinforcement_Learning_solving_a_simple_4by4_Gridworld_using_Monte_Carlo.ipynb
```

πŸ“ Project Structure

```text
├── RL-Monte-Carlo-Gridworld.py          # Core RL implementation
├── generate_plots.py                    # Visualization & analysis pipeline
├── Reinforcement_Learning_...ipynb      # Interactive notebook version
├── assets/                              # Generated plots
│   ├── gridworld_environment.png
│   ├── policy_evolution.png
│   ├── optimal_policy_heatmap.png
│   ├── optimal_path.png
│   ├── convergence.png
│   └── exploration_exploitation.png
└── README.md
```

## 🔧 Customization

The environment is fully configurable:

```python
# Modify reward structure
self.rewards = {(3, 3): 0.03, (1, 3): -0.01, (2, 1): -0.011, (3, 1): -0.01}

# Adjust exploration rate
exploreRate = 0.05  # Increase for more exploration

# Change learning rate for Q-table soft updates
updateRate = 0.05

# Extend grid size by adding states to self.actions dictionary (see the sketch below)
```
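For the last point, a hypothetical helper that builds a boundary-aware action dictionary for an arbitrary N×N grid; since the repo's `self.actions` is populated by hand for the 4×4 case, this generator is only a convenience sketch:

```python
def build_actions(size):
    """Map each cell to the moves that keep the agent on the grid."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    return {
        (r, c): [a for a, (dr, dc) in moves.items()
                 if 0 <= r + dr < size and 0 <= c + dc < size]
        for r in range(size) for c in range(size)
    }

actions = build_actions(6)   # e.g. a 6x6 grid
print(actions[(0, 0)])       # corner cell -> ['down', 'right']
```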

## 📚 Theoretical Context

| Concept | Implementation Detail |
| --- | --- |
| Monte Carlo method | First-visit MC — returns computed from complete episodes |
| Policy iteration | On-policy GPI with greedy improvement |
| Exploration strategy | ε-greedy with ε = 0.05 |
| Value function | Action-value Q(s, a) stored in tabular form |
| Discount factor | γ = 0.9 — balances immediate vs. future rewards |
| Episode truncation | Max 30 steps to prevent infinite loops in early training |
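For illustration, the first-visit rule from the table can be expressed as one pass over the chronologically ordered returns; this function is a sketch, not the repository's code:

```python
def first_visit_returns(returns):
    """Keep only the first occurrence of each (state, action) pair in an episode."""
    seen, first = set(), []
    for state, action, g in returns:        # chronological order matters
        if (state, action) not in seen:
            seen.add((state, action))
            first.append((state, action, g))
    return first
```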



Built from first principles. No frameworks. Pure understanding.

Author: MohammadAsadolahi
