Tutorial: Introduction to Bandits: Algorithms and Theory
This tutorial intends to be an introduction to stochastic and adversarial multi-armed bandit algorithms and to survey some of the recent advances. In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm), and receives a reward from it. The agent aims at maximizing his rewards. Since he does not know the process generating the rewards, he needs to explore (try) the different actions and yet, exploit (concentrate its draws on) the seemingly most rewarding arms.
The bandit problem has been increasingly popular in the machine learning community. It is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertizement [1, 6], economics [2, 12], games  and optimization [10, 5, 9, 3], model selection and machine learning algorithms itself [13, 4]. It can be a central building block of larger systems, like in evolutionary programming  and reinforcement learning , in particular in large state space Markovian Decision Problems .