Robust Multiarmed Bandit Problems

The multiarmed bandit problem is a popular framework for studying the exploration versus exploitation trade-off. Recent applications include dynamic assortment design, Internet advertising, dynamic pricing, and the control of queues. The standard mathematical formulation for a bandit problem makes t...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Management science Ročník 62; číslo 1; s. 264 - 285
Hlavní autoři: Kim, Michael Jong, Lim, Andrew E.B.
Médium: Journal Article
Jazyk:angličtina
Vydáno: Linthicum Institute for Operations Research and the Management Sciences 01.01.2016
Témata:
ISSN:0025-1909, 1526-5501
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:The multiarmed bandit problem is a popular framework for studying the exploration versus exploitation trade-off. Recent applications include dynamic assortment design, Internet advertising, dynamic pricing, and the control of queues. The standard mathematical formulation for a bandit problem makes the strong assumption that the decision maker has a full characterization of the joint distribution of the rewards, and that ″arms″ under this distribution are independent. These assumptions are not satisfied in many applications, and the outof-sample performance of policies that optimize a misspecified model can be poor. Motivated by these concerns, we formulate a robust bandit problem in which a decision maker accounts for distrust in the nominal model by solving a worst-case problem against an adversary (″nature″) who has the ability to alter the underlying reward distribution and does so to minimize the decision maker's expected total profit. Structural properties of the optimal worst-case policy are characterized by using the robust Bellman (dynamic programming) equation, and arms are shown to be no longer independent under nature's worst-case response. One implication of this is that index policies are not optimal for the robust problem, and we propose, as an alternative, a robust version of the Gittins index. Performance bounds for the robust Gittins index are derived by using structural properties of the value function together with ideas from stochastic dynamic programming duality. We also investigate the performance of the robust Gittins index policy when applied to a Bayesian webpage design problem. In the presence of model misspecification, numerical experiments show that the robust Gittins index policy not only outperforms the classical Gittins index policy, but also substantially reduces the variability in the out-of-sample performance.
Bibliografie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0025-1909
1526-5501
DOI:10.1287/mnsc.2015.2153