This article is meant to be an introduction to Multi-Armed Bandits, without getting too much into the mathematical details (I'll leave that for a follow-up article!). We'll talk about reinforcement learning, multi-armed bandit optimization, and the exploration/exploitation trade-off.


**What is Reinforcement Learning?**

From Wikipedia: "Reinforcement learning is an area of *machine learning* inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward." Let us break this definition down:

- Reinforcement learning maps **situations** to **actions** to **maximize** a numerical reward.
- Unlike supervised learning, the learner is not told which actions to take but must **discover** which actions yield the most reward **by trying** them.
- It is highly useful in cases of significant uncertainty about the environment.

A reinforcement learning agent gathers new information about its environment (called "**exploration**") and optimizes its decisions based on existing knowledge (called "**exploitation**"). Multi-Armed Bandits are a class of sequential resource allocation problems that embody the exploration/exploitation dilemma.

**What are Multi-Armed Bandits (MAB)?**

*"Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions (e.g. acquiring information or preparing the ground) whose benefit will come only later"* - P. Whittle (1980).

MAB is best understood through this analogy:

A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a reward from a distribution specific to that machine. The objective is to maximize the sum of rewards earned through a sequence of lever pulls.
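This setup can be sketched as a tiny simulation. The class below is an illustrative sketch only, assuming each machine pays out 1 or 0 according to a hidden Bernoulli probability (the probabilities are made up for illustration):

```python
import random

class BernoulliBandit:
    """A row of slot machines; arm a pays reward 1 with hidden probability probs[a]."""

    def __init__(self, probs):
        self.probs = probs  # the true mean reward of each arm, unknown to the gambler

    def pull(self, arm):
        # One lever pull: sample a reward from that machine's distribution.
        return 1 if random.random() < self.probs[arm] else 0

# Three machines; the gambler does not know that the last one is best.
bandit = BernoulliBandit([0.10, 0.30, 0.55])
reward = bandit.pull(2)  # 0 or 1
```

The gambler's whole problem is that `probs` is hidden: the only way to learn about a machine is to spend a pull on it.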

Some real-world exploration/exploitation examples:


Restaurant Selection

- Exploitation: Go to your favorite restaurant
- Exploration: Try a new restaurant

Oil Drilling

- Exploitation: Continue using the existing well
- Exploration: Drill at a new location

Online Advertising

- Exploitation: Show the most successful advert
- Exploration: Show a different advert

- Let us index the arms by *a*; the probability distribution over possible rewards *r* for each arm *a* can be written as *p_a(r)*.
- We have to find the arm with the largest mean reward *μ_a = E_a[r]*.
- In practice the *p_a(r)* are non-stationary.

So how does one come up with an optimal (albeit approximate) strategy to explore and exploit so as to reap maximum rewards? There are two classes of well-known strategies in the literature:

- Greedy: the *best* lever (based on previous trials) is always pulled except when a (uniformly) random action is taken. A popular example of a Greedy strategy is called **Epsilon Greedy**.
- Probabilistic Matching: the number of pulls for a given lever should *match* its actual probability of being the optimal lever. Example: **Thompson Sampling**.

**Epsilon Greedy**

Select the best lever most of the time, and pull a random lever some of the time (show random ads sometimes, and the best ad most of the time). In this strategy we choose to pull a new lever (explore) with a frequency of *epsilon* (hence the name). Hence *epsilon* is the fraction of times we sample a lever randomly, and *1 - epsilon* is the fraction of times we choose optimally. An epsilon value of 1.0 will cause it to always explore, while a value of 0.0 will cause it to always exploit the best performing lever.

Key Advantages:

- Very easy to implement.
- Will not get stuck in some local optimal state.
- The best performing arm will be used most of the time.

Key Disadvantages:

- How do you pick the value of *epsilon*? This is a tricky problem to solve, and the wrong epsilon value could lead to either zero exploration or too much exploration.
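The strategy can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation; the `pull` callback is a stand-in for whatever returns a reward for one lever pull (an ad impression, a product view, etc.):

```python
import random

def epsilon_greedy(pull, n_arms, epsilon, n_steps):
    """Explore a random lever with probability epsilon; otherwise exploit
    the lever with the best empirical mean reward seen so far."""
    counts = [0] * n_arms    # number of pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    total = 0.0
    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                     # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total += reward
    return total, values
```

With `epsilon = 0.0` the loop always exploits, and with `epsilon = 1.0` it is pure random exploration, matching the description above.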

**Thompson Sampling**

Thompson Sampling is a randomized algorithm based on Bayesian ideas. The first version of this Bayesian heuristic is more than 80 years old, dating to Thompson (1933). It is a member of the family of randomized probability matching algorithms. The basic idea is to *assume a simple prior distribution* on the underlying parameters of the reward distribution of every lever, and at every time step, play a lever according to its *posterior probability* of being the best arm.

In other words, we encode our belief about where the expected reward *μ_a* for lever *a* is in a probability distribution *p(μ_a | data)*. For example, if each lever is an advert, *p(μ_a | data)* could be the probability distribution of the mean CTR of the advert, given historical data about the advert. When the reward is binomial (click vs. no click), a Beta-Binomial model is usually a convenient choice.

Key Advantages:

- Easy to implement.
- Robust against delayed feedback: often, feedback about rewards is not immediately available. Imagine each lever being a product on sale on an eCommerce site. The reward is sold vs. not sold, and the knowledge that a product was sold may not be immediately available to the bandit system. In this case, Thompson Sampling (being randomized) will keep exploring instead of getting stuck showing an under-performing product.

Key Disadvantages:

- Exploration may cease if the algorithm converges quickly. Continuing with the previous example, a product that gets a few sales may continue getting more sales (simply because it is shown more) and may crowd out new products from being shown. Thompson Sampling hence needs careful tuning and experimentation.
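For the click vs. no-click case, a Beta-Bernoulli version of Thompson Sampling is only slightly more code than Epsilon Greedy: each arm keeps a Beta(successes + 1, failures + 1) posterior over its mean reward, and at each step we play the arm whose posterior sample is largest. A sketch, where the `pull` callback again stands in for the real reward signal:

```python
import random

def thompson_sampling(pull, n_arms, n_steps):
    """Beta-Bernoulli Thompson Sampling: sample mu_a ~ p(mu_a | data) for
    every arm and play the arm with the largest sampled mean reward."""
    successes = [0] * n_arms  # observed clicks per arm
    failures = [0] * n_arms   # observed non-clicks per arm
    total = 0
    for _ in range(n_steps):
        # One draw from each arm's Beta posterior (uniform Beta(1, 1) prior).
        draws = [random.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: draws[a])
        reward = pull(arm)  # assumed to be 0 or 1
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total += reward
    return total, successes, failures
```

Note how exploration is automatic: an arm with little data has a wide posterior, so it occasionally produces the largest draw and gets played, while a clearly inferior arm is sampled less and less often.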

References

- Kuleshov, Volodymyr, and Doina Precup. "Algorithms for multi-armed bandit problems." *arXiv preprint arXiv:1402.6028* (2014).
- Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." *Advances in Neural Information Processing Systems*. 2011.
- Agrawal, Shipra, and Navin Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." *arXiv preprint arXiv:1111.1797* (2011).
- https://en.wikipedia.org/wiki/Multi-armed_bandit