Reinforcement Learning Agents

A focused collection of standard RL algorithms, each implemented as a clean Jupyter notebook (.ipynb) — built for understanding, not abstraction. Every notebook runs end-to-end, logs metrics to Weights & Biases, and is written so the algorithm speaks for itself.

Philosophy

Most RL codebases bury the algorithm under layers of wrappers and utility classes. This repo does the opposite — each notebook is self-contained, linearly readable, and follows the same structure:

Intuition → Math → Implementation → Training → Results

No base classes. No hidden logic. Just the algorithm.

Implementations

Algorithm	Environment	Notebook
DQN	Atari Pong	`Pong-DQN/`
Double DQN	CartPole	`CartPole-Double-DQN/`
A2C	Gymnasium	`A2C/`
PPO	Continuous Control	`Proximal Policy Optimization (PPO)/`
PPO	Acrobot-v1	`Proximal Policy Optimization - Acrobot/`
DDPG	Continuous Control	`DDPG/`
TD3	Continuous Control	`TD3/`
REINFORCE + Baseline	Gymnasium	`Reinforce with Baseline (MC)/`
GRPO Fine-Tuning	LLM Fine-Tuning	`Group Relative Policy Optimization (GRPO)/`

W&B Logging

All notebooks log training metrics (rewards, losses, episode lengths) directly to Weights & Biases.

Setup — one step:

import wandb
wandb.login(key="YOUR_WANDB_API_KEY")  # Get your key at https://wandb.ai/authorize

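Replace YOUR_WANDB_API_KEY with your key from wandb.ai/authorize, and keep the real key out of committed notebooks: calling wandb.login() with no arguments prompts for the key instead, and the WANDB_API_KEY environment variable is also read. Metrics stream automatically once training starts.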

Each notebook initializes its own wandb.init(project=..., config=...) run. You'll see reward curves, loss plots, and hyperparameter sweeps live in your W&B dashboard.
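As a rough illustration of what such a run can look like (a sketch, not code taken from any notebook): the first helper collapses one episode into the kinds of metrics mentioned above, and the second logs them through wandb.init and run.log. The project name rl-agents-demo, the metric keys, and both function names are assumptions made for this example. mode="disabled" keeps the demo from contacting W&B.

```python
from statistics import mean


def summarize_episode(rewards, losses):
    """Collapse one episode's per-step values into a flat metrics dict.

    The key names are illustrative; the notebooks may use different ones.
    """
    return {
        "episode/reward": float(sum(rewards)),
        "episode/length": len(rewards),
        "episode/mean_loss": float(mean(losses)) if losses else 0.0,
    }


def log_episodes(episodes, project="rl-agents-demo", mode="disabled"):
    """Log a list of (rewards, losses) pairs, one W&B step per episode.

    `project` is a hypothetical name. Pass mode="online" to stream to
    the dashboard after logging in.
    """
    import wandb  # imported here so summarize_episode works without wandb

    run = wandb.init(
        project=project,
        config={"num_episodes": len(episodes)},
        mode=mode,
    )
    for step, (rewards, losses) in enumerate(episodes):
        run.log(summarize_episode(rewards, losses), step=step)
    run.finish()
```

Calling log_episodes([([1.0, 1.0, 1.0], [0.5, 1.5])]) would record a single episode with a total reward of 3.0 and a length of 3.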

Running a Notebook

Each directory is self-contained. Navigate to any algorithm folder and open the .ipynb:

cd "Proximal Policy Optimization (PPO)"
jupyter notebook ppo.ipynb

Install dependencies:

pip install torch gymnasium wandb numpy matplotlib

Some environments (Atari, MuJoCo) need additional setup — see the README inside each subdirectory.
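Before opening a notebook, it can help to confirm that the packages from the pip command are importable. The sketch below uses only the standard library. The function name missing_packages is made up for this example, and the list mirrors the install line above on the assumption that each import name matches its package name. Atari and MuJoCo extras are not covered.

```python
import sys
from importlib.util import find_spec

# Mirrors the pip install line; extend it for environment-specific extras.
REQUIRED = ["torch", "gymnasium", "wandb", "numpy", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 8):
        print("Python 3.8+ is required, found", sys.version.split()[0])
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```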

Repository Structure

reinforcement-learning-agents/
├── CartPole-Double-DQN/
├── Pong-DQN/
├── A2C/
├── DDPG/
├── TD3/
├── Proximal Policy Optimization (PPO)/
├── Proximal Policy Optimization - Acrobot/
├── Reinforce with Baseline (MC)/
├── Group Relative Policy Optimization (GRPO)/
└── README.md

Requirements

Python 3.8+
PyTorch
Gymnasium
Weights & Biases (wandb)
NumPy, Matplotlib

Contributing

New implementations should follow the same notebook format: intuition first, then derivation, then clean code. Include W&B logging and a short results section at the end of the notebook.
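One way a contributor might self-check the format is sketched below. It reads a notebook's JSON and reports which of the five sections (Intuition, Math, Implementation, Training, Results) never appear, in that order, as markdown headings. Matching on heading text is an assumption of this sketch, since the existing notebooks may title their sections differently. The function names are hypothetical.

```python
import json

EXPECTED_SECTIONS = ["Intuition", "Math", "Implementation", "Training", "Results"]


def markdown_headings(notebook):
    """Collect lower-cased heading text from the notebook's markdown cells."""
    headings = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.lstrip().startswith("#"):
                headings.append(line.lstrip("# ").strip().lower())
    return headings


def missing_sections(notebook, expected=EXPECTED_SECTIONS):
    """Return expected sections not found, in order, among the headings."""
    headings = markdown_headings(notebook)
    missing, position = [], 0
    for name in expected:
        for i in range(position, len(headings)):
            if name.lower() in headings[i]:
                position = i + 1  # later sections must come after this one
                break
        else:
            missing.append(name)
    return missing


def check_file(path):
    """Load an .ipynb file and return its missing sections."""
    with open(path, encoding="utf-8") as handle:
        return missing_sections(json.load(handle))
```

For example, check_file("ppo.ipynb") returns an empty list when all five headings are present in order.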

References

Implementations follow the original papers. Citations are included at the top of each notebook.
