MultiArmed Bandits
Introduction
Reinforcement Learning is different from other machine learning in the aspect that it evaluates the actions rather than instructing than instructing the correct actions.

Purely evaluative feedback indicates how good an action is , but not whether it is best or worst action possible.

Purely instructive feedback indicates the correct action to take independent of the action already taken.
KArmed Bandit Problem
The karmed bandit problem is similar to onearmed bandits, except that it has k levers instead of one. Here we are faced repeatedly with a choice among k different levers, or actions. Here the rewards are the payoffs for hitting the jackpot.
In the problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action.We denote the action selected on time step t as $A_t$ , and the corresponding reward as $R_t$. The value then of an arbitrary action $a$, denoted $q_∗(a)$, is the expected reward given that a is selected:
$q_∗(a) = E[R_t  A_t = a]$
We denote the estimated value of action $a$ at time $t$ as $Q_t(a) \approx q_*(a)$ .
Actions whose $Q_t$ value is the highest at time $t$, are called, greedy actions.All the other actions at time $t$ are called nongreedy actions.Selecting a greedy action is said to be exploitation whereas selecting a nongreedy action is said to be evaluation.Exploitation is the right thing to do to maximize the expected on the one step, but exploration may produce greater total reward in the long run.
Action Value Methods
So now we know that the true value of an action is the mean reward when the action is chosen.There can be many ways of calcuate this, one way can be:
$$Q_t(a) = \frac{\text{sum of rewards when a is taken prior to t}}{\text{number of times a is taken prior to t}} = \frac{\sum_{i=0}^{t1} R_i \cdot 1_{A_i = a}}{\sum_{i=0}^{t1} 1_{A_i = a}}$$
where $1_predicate$ denotes the random variable that is $1$ if $predicate$ is true and $0$ if it is not. If the denominator is zero, then we instead denote $Q_t(a)$ as some default value, such as $Q_1(a) = 0$ . As the denominator goes to infnity, by the law of large numbers, $Q_t(a)$ converges to $q_*(a)$. We call this the *sampleaverage* method for estimating action values because each *estimate* is an average of the *sample* of relevant rewards.
The greedy actions selection method (a.k.a exploitation) can be represented as:
$A_t = \underset{x}{\operatorname{argmax}}Q_t(a)$
Where $argmax_a$ denotes the value of $a$ at which the expressio that follows is maximized(with ties broken arbitrarily). A simple alternative to this purely greedy approach is a $\varepsilongreedy$ approach where with probability $\varepsilon$ , select an action from the set of nongreedy actions unformly and randomly.
Incremental Implementation
We would now dive into the implementation of the above formulae to reinforcement learning problems. Lets concentrate on a single action $a$. Let $R_i$ denote the reward received after the $ith$ selection of this action.Let $Q_n$ denote the estimate of this action after it has been selected n1 times. Then we can write:
$$Q_n(a) = \frac{R_1 + R_2 +…+ R_{n1}}{n1}$$
Keeping a record of all the rewards will be inefficient in terms of memory, so we would tweak the formula a little bit:
A Simple Bandit Algorithm
The $\frac{1}{N(A)}$ parameter (a.k.a StepSize) changes from time step to time step and can also be written as $\alpha_{t}(a)$.
$\alpha_{t}(a)$ is any function that satisfies the following conditions :
$\sum_{t=1}^{\infty}\alpha_{t}(a) = \infty \text{ and } \sum_{t=1}^{\infty} \alpha_{t}^2(a) \lt \infty $
For the timebeing let us consider $\alpha_t$ to be a constant (say $\alpha \epsilon (0,1]$ ). We can also write $Q_{n+1}$ in terms of $Q_1$:
UpperConfidenceBound Action Selection Method
$\varepsilon$greedy action selection method forces the nongreedy actions to be tried, but indiscriminately, with no preference to those that are nearly greedy. It would be better to select among the nongreedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertanities in those estimates.
One effective way of doing this is to select action as:
$A_t = \underset{a}{\operatorname{argmax}}\Bigl[Qt(a) + \sqrt[\leftroot{2}\uproot{2}c]{\tfrac{\log t}{N_t(a)}} \Bigr]$
where $\log t$ denotes the natural logarithm of $t$ , $N_t(a)$ denotes the number of times that action $a$ has been selected prior to time $t$ and the number $ c \gt 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.
The idea of this method is that the squareroot term is a measure of the uncertainity or variance in the measure of $a$'s value.Each time action $a$ is selected $N_t(a)$ increases by one and the contribution of the squareroot term in selecting action $A_t$ is decreased.That means that if $a$ is chosen a lot of times then chances of it being selected depends mostly on $Q_t$, which should be the behaviour of our model, logically.
Gradient Bandit Algorithms
In this section we consider learning a numerical preference $H_t(a)$ for each action $a$. The larger the preference more often that action is tken, but the preference has no interpretation in terms of reward.Since the relative preference is important we take softmax over all the preferences at time $t$ which we denote as $\pi_{t}(a)$.
$Pr{A_t=a} = \frac{\varepsilon^{H_t(a)}}{\sum_{b=1}^{k} \varepsilon^{H_t(b)}} = \pi_{t}(a)$
There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent.On each step, after selecting the action $A_t$ and receiving the reward $R_t$, the preferences are updated by:
where $\alpha \gt 0$ is stepsize parameter, and $\bar R_t \in \mathbb R$ is the average of all the rewards up throught and including time $t$, which can be computed incrementally.The $\bar R_t$ serves as the baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased, and if the reward is below baseline, then the probability is decreased.The nonselected actions move in the opposite direction.
For the represenation of the Bandit Gradient Algorithm as Stochastic Gradient Ascent please refer section2.8(page 29) of the book.