Demonstrating core reinforcement learning principles through a clean, from-scratch implementation: no frameworks, no shortcuts.
This project implements Monte Carlo (MC) control with ε-greedy exploration to solve a 4×4 GridWorld navigation problem from scratch. The agent learns an optimal policy for navigating from a start cell to a goal cell while avoiding penalty states, using only sampled episode returns and no model of the environment dynamics.
Unlike implementations that rely on OpenAI Gym or stable-baselines, this is a ground-up implementation of the environment, Q-table, policy iteration loop, and Monte Carlo sampling, providing full transparency into every component of the RL pipeline.
The GridWorld is a 4×4 grid:

```
┌─────┬─────┬─────┬─────┐
│  S  │  ·  │  ·  │  ·  │    S = Start (0,0)
├─────┼─────┼─────┼─────┤    · = Normal cell (reward = 0)
│  ·  │  ·  │  ·  │  ■  │    ■ = Penalty cell (negative reward)
├─────┼─────┼─────┼─────┤    T = Terminal / Goal (positive reward)
│  ·  │  ■  │  ·  │  ·  │
├─────┼─────┼─────┼─────┤
│  ·  │  ■  │  ·  │  T  │
└─────┴─────┴─────┴─────┘
```
| Property | Value |
|---|---|
| State space | 16 cells (15 navigable + 1 terminal) |
| Action space | {↑, ↓, ←, →} per state (boundary-aware) |
| Reward structure | Goal: +0.03; penalties: −0.01 to −0.011 |
| Discount factor (γ) | 0.9 |
| Exploration rate (ε) | 0.05 |
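For reference, here is a minimal sketch of how such an environment can be represented. The class and method names (`GridWorld`, `actions`, `step`) are illustrative assumptions, not necessarily the repo's actual identifiers:

```python
class GridWorld:
    """Illustrative 4x4 GridWorld; names are assumptions, not the repo's exact API."""

    def __init__(self):
        self.size = 4
        self.start = (0, 0)
        self.terminal = (3, 3)
        # Reward structure from the table above; unlisted cells yield 0
        self.rewards = {(3, 3): 0.03, (1, 3): -0.01, (2, 1): -0.011, (3, 1): -0.01}
        self.moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def actions(self, state):
        # Boundary-aware action set: only moves that stay inside the grid
        r, c = state
        return [a for a, (dr, dc) in self.moves.items()
                if 0 <= r + dr < self.size and 0 <= c + dc < self.size]

    def step(self, state, action):
        # Deterministic transition; returns (next_state, reward, done)
        dr, dc = self.moves[action]
        nxt = (state[0] + dr, state[1] + dc)
        return nxt, self.rewards.get(nxt, 0.0), nxt == self.terminal
```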
The algorithm follows the Generalized Policy Iteration (GPI) paradigm, alternating between:
- **Policy Evaluation**: estimate $Q(s, a)$ via Monte Carlo sampling of full episodes.
- **Policy Improvement**: greedily update the policy: $\pi(s) = \arg\max_a Q(s, a)$.
For each episode trajectory $(s_0, a_0, r_1, s_1, a_1, \dots, r_T, s_T)$, the return from step $t$ is the discounted sum of subsequent rewards, computed backwards as $G_t = r_{t+1} + \gamma\, G_{t+1}$.

Q-values are updated using incremental mean estimation:

$$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \bigl( G_t - Q(s, a) \bigr)$$

A soft update blends estimated values into the persistent Q-table:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, \hat{Q}(s, a)$$

where $\hat{Q}(s, a)$ is the current iteration's Monte Carlo estimate and $\alpha$ is the update rate (`updateRate = 0.05`). During episode generation, actions are chosen ε-greedily: the greedy action with probability $1 - \varepsilon$, a uniformly random one with probability $\varepsilon$. This ensures continuous exploration while predominantly exploiting learned values.
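Putting the pieces together, here is a condensed sketch of one GPI iteration, reusing the `GridWorld` sketch above. Function names and the episodes-per-iteration count are illustrative assumptions, not the repo's actual code:

```python
import random
from collections import defaultdict

GAMMA, EPSILON, UPDATE_RATE, MAX_STEPS = 0.9, 0.05, 0.05, 30

def epsilon_greedy(Q, env, state):
    """Greedy action with probability 1 - epsilon, a random legal action otherwise."""
    legal = env.actions(state)
    if random.random() < EPSILON:
        return random.choice(legal)
    return max(legal, key=lambda a: Q[(state, a)])

def generate_episode(Q, env):
    """Roll out one episode, truncated at MAX_STEPS to avoid early infinite loops."""
    state, trajectory = env.start, []
    for _ in range(MAX_STEPS):
        action = epsilon_greedy(Q, env, state)
        nxt, reward, done = env.step(state, action)
        trajectory.append((state, action, reward))
        state = nxt
        if done:
            break
    return trajectory

def first_visit_returns(trajectory):
    """Return from the first visit of each (state, action) pair."""
    returns, G = {}, 0.0
    # Walking backwards accumulates G; earlier-in-time writes overwrite later
    # ones, so each pair ends up with its first-visit return.
    for state, action, reward in reversed(trajectory):
        G = reward + GAMMA * G
        returns[(state, action)] = G
    return returns

def mc_iteration(Q, env, episodes=100):
    """Policy evaluation via first-visit MC, then a soft update of the Q-table."""
    estimates, counts = defaultdict(float), defaultdict(int)
    for _ in range(episodes):
        for sa, G in first_visit_returns(generate_episode(Q, env)).items():
            counts[sa] += 1
            estimates[sa] += (G - estimates[sa]) / counts[sa]  # incremental mean
    for sa, q_hat in estimates.items():
        Q[sa] = (1 - UPDATE_RATE) * Q[sa] + UPDATE_RATE * q_hat  # soft update

# Example usage: a fresh Q-table improves over repeated GPI iterations
Q = defaultdict(float)
for _ in range(25):
    mc_iteration(Q, GridWorld())
```

Policy improvement is implicit here: `epsilon_greedy` always reads the freshest Q-values, so each iteration acts greedily with respect to the updated table.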
The agent starts with a random policy and converges to an optimal navigation strategy:
Key Insight: The policy stabilizes within ~25 iterations, demonstrating rapid convergence of Monte Carlo methods in small state spaces.
The heatmap shows the maximum Q-value at each state. Higher values (green) indicate states closer to the goal on the optimal path. The arrows represent the learned greedy policy.
The agent discovers the shortest viable path from Start to Goal while circumnavigating all penalty cells.
- Blue curve: maximum Q-value across all state-action pairs, monotonically increasing and plateauing, confirming convergence.
- Orange curve: sum of absolute Q-value changes per iteration, decaying toward zero, indicating policy stability (a plotting sketch follows below).
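For illustration, a minimal matplotlib sketch of such a dual-axis convergence plot. The curves here are synthetic placeholders; in the project they would come from per-iteration statistics recorded during training:

```python
import matplotlib.pyplot as plt

# Synthetic placeholder curves standing in for recorded training statistics
iters = list(range(30))
max_q = [0.03 * (1 - 0.85 ** i) for i in iters]   # rises, then plateaus
delta_q = [0.02 * 0.8 ** i for i in iters]        # decays toward zero

fig, ax1 = plt.subplots()
ax1.plot(iters, max_q, color='tab:blue')
ax1.set_xlabel('iteration')
ax1.set_ylabel('max Q(s, a)', color='tab:blue')

ax2 = ax1.twinx()                                 # second y-axis for the deltas
ax2.plot(iters, delta_q, color='tab:orange')
ax2.set_ylabel('sum of |ΔQ| per iteration', color='tab:orange')

fig.suptitle('Convergence diagnostics')
plt.show()
```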
With ε = 0.05, the agent exploits its learned policy ~95% of the time while maintaining 5% random exploration, consistent with the ε-greedy theoretical guarantee.
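A quick, illustrative sanity check of that 95/5 split (not part of the repo):

```python
import random

epsilon, n = 0.05, 100_000
# Each draw below epsilon is an exploration step; the rest are exploitation
exploits = sum(random.random() >= epsilon for _ in range(n))
print(f"exploit fraction: {exploits / n:.3f}")  # prints ~0.950
```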
```bash
# Clone the repository
git clone https://github.com/<your-username>/Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Monte-Carlo-in-python.git
cd Reinforcement-Learning-solving-a-simple-4by4-Gridworld-using-Monte-Carlo-in-python

# Install dependencies
pip install numpy matplotlib

# Run the main RL training
python RL-Monte-Carlo-Gridworld.py

# Generate all visualizations
python generate_plots.py
```

Or open the Jupyter notebook for an interactive walkthrough:

```bash
jupyter notebook Reinforcement_Learning_solving_a_simple_4by4_Gridworld_using_Monte_Carlo.ipynb
```

Repository layout:

```
├── RL-Monte-Carlo-Gridworld.py        # Core RL implementation
├── generate_plots.py                  # Visualization & analysis pipeline
├── Reinforcement_Learning_...ipynb    # Interactive notebook version
├── assets/                            # Generated plots
│   ├── gridworld_environment.png
│   ├── policy_evolution.png
│   ├── optimal_policy_heatmap.png
│   ├── optimal_path.png
│   ├── convergence.png
│   └── exploration_exploitation.png
└── README.md
```
The environment is fully configurable:
```python
# Modify reward structure
self.rewards = {(3, 3): 0.03, (1, 3): -0.01, (2, 1): -0.011, (3, 1): -0.01}

# Adjust exploration rate
exploreRate = 0.05  # Increase for more exploration

# Change learning rate for Q-table soft updates
updateRate = 0.05

# Extend grid size by adding states to self.actions dictionary
```

| Concept | Implementation Detail |
|---|---|
| Monte Carlo Method | First-visit MC; returns computed from complete episodes |
| Policy Iteration | On-policy GPI with greedy improvement |
| Exploration Strategy | ε-greedy with ε = 0.05 |
| Value Function | Action-value Q(s, a) stored in tabular form |
| Discount Factor | γ = 0.9; balances immediate vs. future rewards |
| Episode Truncation | Max 30 steps to prevent infinite loops in early training |
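To make the discount-factor row concrete: with γ = 0.9, every extra step shrinks the goal reward's contribution to the start-state return, which is why the agent prefers the shortest penalty-free route (here, 6 moves from Start to Goal). A worked example:

$$G_{\text{6 steps}} = 0.9^{5} \times 0.03 \approx 0.0177, \qquad G_{\text{8 steps}} = 0.9^{7} \times 0.03 \approx 0.0143$$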
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- ε-Greedy Algorithm in Reinforcement Learning
- OpenAI Gym FrozenLake-v1, conceptual inspiration for the grid environment
Built from first principles. No frameworks. Pure understanding.
Author: Chief AI Officer, Google




