vismaychuriwala/Optimal-Strategies-in-Multi-Armed-Bandits

Optimal Strategies in Multi-Armed Bandits: A Cricket Simulation

I built this as a course project to explore how different MAB algorithms behave when you change what they're optimizing for. Cricket batting turns out to be a surprisingly clean environment for this: every ball, the batsman picks a shot. Each shot has a different run potential and dismissal risk. Play defensively and you stay in but don't score. Play aggressively and you score fast but risk getting out.

The naive way to frame this as a bandit problem is to maximize runs per ball, and it gives a clean answer that turns out to be wrong in an actual game. I implemented four agents with different objectives and ran them in both the naive setting and the real one (with limited wickets) to show the key differences.

*(Figure: innings performance)*

The environment

Six shots, each with a fixed dismissal probability p(out) and run-scoring probability p(run):

| Shot  | Runs | p(out) | p(run) | E[runs/ball] |
|-------|-----:|-------:|-------:|-------------:|
| Block | 0    | 0.1%   | 100%   | 0.00         |
| Nudge | 1    | 1.0%   | 90%    | 0.89         |
| Drive | 2    | 2.0%   | 85%    | 1.67         |
| Sweep | 3    | 3.0%   | 80%    | 2.33         |
| Pull  | 4    | 10%    | 75%    | 2.70         |
| Slog  | 6    | 30%    | 70%    | 2.94         |

Here E[runs/ball] = runs × p(run) × (1 − p(out)).

There is no single best shot. Block is safest. Slog scores fastest. Nudge has the best runs-per-wicket ratio. Which arm an agent converges to depends entirely on what reward it's been given.
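A minimal Python sketch of this environment, using the numbers from the table; the `SHOTS` dict and `play_ball` function are my own naming, not necessarily the repo's:

```python
import random

# Shot table from above: name -> (runs, p(out), p(run)).
SHOTS = {
    "Block": (0, 0.001, 1.00),
    "Nudge": (1, 0.010, 0.90),
    "Drive": (2, 0.020, 0.85),
    "Sweep": (3, 0.030, 0.80),
    "Pull":  (4, 0.100, 0.75),
    "Slog":  (6, 0.300, 0.70),
}

def play_ball(shot, rng=random):
    """Simulate one ball: returns (runs_scored, dismissed)."""
    runs, p_out, p_run = SHOTS[shot]
    if rng.random() < p_out:
        return 0, True                                # dismissed, no runs
    return (runs if rng.random() < p_run else 0), False

def expected_runs(shot):
    """Matches the table's E[runs/ball] = runs * p(run) * (1 - p(out))."""
    runs, p_out, p_run = SHOTS[shot]
    return runs * p_run * (1 - p_out)
```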

*(Figure: environment)*

Agents

I defined four agents, each with a different reward function. All start with a round-robin warm-up (ten pulls per arm) before switching to their main strategy.

KL-UCB Survival

Reward is 1 - wicket (binary): the agent gets 1 for surviving a ball, 0 for being dismissed. KL-UCB picks the arm with the highest upper confidence bound, where the bound is the largest q consistent with the observed data:

UCB(arm) = max{ q : KL(p̂, q) ≤ (log t + 3 log log t) / n }

The bound is tighter than UCB1 for Bernoulli rewards because it uses the actual shape of the distribution rather than Hoeffding's inequality. I expected this to converge to Block, which it does.
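In code, that bound can be computed by bisection, since KL(p̂, q) is increasing in q for q ≥ p̂. A sketch under the formula above (the guard on log log t for small t is my own addition, not necessarily the repo's):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n, t):
    """Largest q >= p_hat with KL(p_hat, q) <= (log t + 3 log log t) / n."""
    threshold = (math.log(t) + 3 * math.log(max(math.log(t), 1.0001))) / n
    lo, hi = p_hat, 1.0
    for _ in range(50):          # bisection: the KL is monotone in q here
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

Note the index can never leave [0, 1], since the search interval is [p̂, 1].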


KL-UCB Reward

Reward is runs_scored / 6, normalized to [0,1]. Same KL-UCB algorithm, but now the agent tracks scoring rather than survival. Should converge to Slog.


UCB1 Reward

Same reward as KL-UCB Reward, looser bound:

UCB(arm) = p̂ + sqrt(2 log t / n)

I included this mostly as a baseline to compare against KL-UCB under the same objective. The difference in regret between the two shows what the tighter bound actually buys, and later it matters for a different reason entirely.
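One concrete way to see "looser" (a throwaway sketch, names mine): because the Hoeffding bonus ignores the Bernoulli shape, the UCB1 index can exceed 1, something a Bernoulli mean never does, whereas the KL-UCB index stays in [0, 1] by construction.

```python
import math

# UCB1 index from the formula above.
def ucb1_index(p_hat, n, t):
    return p_hat + math.sqrt(2 * math.log(t) / n)

# After 50 pulls with empirical mean 0.9 at t = 300, the index is ~1.38:
# the Hoeffding bonus overshoots the [0, 1] range the true mean lives in.
```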


Risk-Adjusted Successive Elimination

Reward metric: ρ = E[runs] / p(out), runs per wicket risked. The idea is to find the shot with the best scoring rate per unit of dismissal risk, which turns out to be Nudge, not Slog.
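The Nudge claim is easy to verify from the table's numbers (a throwaway check, not repo code):

```python
# rho = E[runs/ball] / p(out), using the table above.
SHOT_STATS = {  # name: (E[runs/ball], p(out))
    "Block": (0.00, 0.001),
    "Nudge": (0.89, 0.010),
    "Drive": (1.67, 0.020),
    "Sweep": (2.33, 0.030),
    "Pull":  (2.70, 0.100),
    "Slog":  (2.94, 0.300),
}
rho = {name: er / p_out for name, (er, p_out) in SHOT_STATS.items()}
best = max(rho, key=rho.get)
# best == "Nudge": ~89 runs per wicket risked, vs ~9.8 for Slog.
```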

Instead of an optimism bonus, this agent keeps a set of active arms and prunes any arm that is statistically worse than the current best:

eliminate arm i  if  (r̂* − 2b_t) / (ĉ* + 2b_t)  >  (r̂_i + 2b_t) / (ĉ_i − 2b_t)

where r̂ is the empirical mean runs, ĉ is the empirical dismissal probability, and b_t = sqrt(log(4K/δ) / (2t)). The left side is a pessimistic estimate of the best arm's ratio; the right side is an optimistic estimate of arm i's. Once an arm is eliminated it stays eliminated, with no re-exploration.
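A sketch of that test in Python, with r̂ as the empirical mean runs and ĉ as the empirical p(out); the variable names and the denominator guard are mine, not necessarily the repo's:

```python
import math

def should_eliminate(r_best, c_best, r_i, c_i, t, K, delta=0.05):
    """Eliminate arm i if the best arm's pessimistic ratio beats
    arm i's optimistic ratio. r = mean runs, c = estimated p(out)."""
    b_t = math.sqrt(math.log(4 * K / delta) / (2 * t))
    lower_best = (r_best - 2 * b_t) / (c_best + 2 * b_t)
    upper_i = (r_i + 2 * b_t) / max(c_i - 2 * b_t, 1e-9)  # guard denominator
    return lower_best > upper_i
```

Early on, b_t is large and nothing can be eliminated; as t grows, clearly dominated arms (like Slog against Nudge under this metric) get pruned.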


Part 1: The naive approach (Unlimited Wickets)

The obvious way to frame this as a bandit problem: reward = runs scored per ball, minimize regret over a fixed 300-ball horizon. Wickets are tracked but don't end the innings; the agent just keeps playing. This is the standard MAB setup, and the algorithms perform exactly as you'd expect.
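For concreteness, a sketch of how the Part-1 regret against the always-Slog reference can be computed (assuming Slog's 2.94 E[runs/ball] from the table; function name is mine):

```python
E_SLOG = 2.94                     # best per-ball expectation from the table

def score_regret(runs_per_ball):
    """Cumulative regret after each ball of a fixed-length innings."""
    regret, total = [], 0.0
    for t, r in enumerate(runs_per_ball, start=1):
        total += r
        regret.append(t * E_SLOG - total)
    return regret
```

An agent that locks onto Slog early has a regret curve that flattens; one stuck on Block accrues 2.94 regret per ball indefinitely.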

Regret

Cumulative score regret vs always playing Slog:

*(Figure: regret curves)*

KL-UCB Reward gets the lowest regret (~125 by ball 300) and flattens around ball 50 once it locks onto Slog. UCB1 Reward trails at ~200, taking longer to commit. Risk-Adj SE lands in a similar range to UCB1: it converges to Nudge, which has lower E[runs/ball] than Slog, so there's a persistent gap. KL-UCB Survival has the highest regret by far (~600, still growing linearly at ball 300); it never sharply converges, instead drifting slowly through arms due to tie-breaking at the start, and keeps leaving runs on the table.

Arm convergence

*(Figure: arm selection)*

KL-UCB Reward snaps to Slog by ball ~50, very clean. UCB1 Reward is similar but slightly slower. Risk-Adj SE shows a clear round-robin stripe for the first 60 balls (10 pulls × 6 arms), then commits to Nudge. KL-UCB Survival shows a gradual drift: it starts at the high-index arms due to tie-breaking after the warm-up, and slowly works down toward Block over hundreds of balls. No sharp convergence.

Runs and wickets

*(Figure: runs/wickets tradeoff)*

KL-UCB Reward scores the most (~750 median) but loses the most wickets (~55 median). Risk-Adj SE is nearly as high on runs (~720) with far fewer wickets (~28), the best runs-per-wicket ratio in the group, which is exactly what it optimizes for. KL-UCB Survival scores the least (~270) and barely loses any wickets (~5), consistent with its slow drift toward Block. In this fixed-ball setting the scoring agents look dominant.


Part 2: Real Innings

A real innings ends when 10 wickets fall. The naive framing completely ignores this. Once you put it back in, the ideal strategy looks very different.

The bandit ranking and the cricket ranking are a perfect reversal. KL-UCB Survival had the highest regret in Part 1 and scores the most runs here (270 median). KL-UCB Reward had the lowest regret and scores the fewest (138). Risk-Adj SE is second (234), UCB1 Reward third (187).

KL-UCB Reward finds the globally optimal shot per-ball, but Slog at p(out) = 30% burns through 10 wickets in roughly 33 balls of exploitation. There's barely time to accumulate runs before the innings is over. UCB1 Reward, slower to commit, spends more time on Pull (p(out) = 10%) and faces ~90 balls: lower scoring rate, but a longer innings and more total runs.
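A back-of-envelope version of this argument, assuming an innings spent on a single shot lasts roughly wickets / p(out) balls on average (a simplification that ignores exploration):

```python
def expected_innings(e_runs, p_out, wickets=10, max_balls=300):
    """Rough expected innings length and total runs for one fixed shot."""
    balls = min(wickets / p_out, max_balls)
    return balls, balls * e_runs

slog = expected_innings(2.94, 0.30)   # ~33 balls, ~98 runs
pull = expected_innings(2.70, 0.10)   # ~100 balls, ~270 runs
```

Pull's per-ball expectation is lower, but its innings lasts three times as long, so it comes out far ahead on total runs.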

How it plays out ball by ball

*(Figure: innings regret)*

Left panel: KL-UCB Reward (orange) rises steeply through ball ~65, then stops: innings over, wickets gone. UCB1 Reward (green) and Risk-Adj SE (red) rise more slowly but keep going for ~90 and ~110 balls respectively, ending up ahead on total runs. KL-UCB Survival (blue) keeps accumulating all the way to ball 300, since its slow exploration means it rarely burns through all 10 wickets, so it just keeps playing.

Right panel: once an innings ends the agent scores 0 while the reference keeps ticking, so regret grows linearly from that point. KL-UCB Reward's regret shoots up steeply after ball 65 and ends highest. KL-UCB Survival has the lowest innings regret, scoring throughout all 300 balls and never stopped by wickets.

Distribution of outcomes

*(Figure: full innings distributions)*


About

This repository contains several implementations of multi-armed bandit (MAB) agents applied to a simulated cricket match where an agent selects among different strategies with the goal of maximizing runs while minimizing the risk of getting out.
