A tractable online learning algorithm for the multinomial logit contextual bandit



Published in: European Journal of Operational Research, Volume 310, Issue 2, pp. 737–750
Main authors: Agrawal, Priyank; Tulabandhula, Theja; Avadhanula, Vashist
Format: Journal Article
Language: English
Publication details: Elsevier B.V., 16 October 2023
ISSN:0377-2217, 1872-6860
Description

Highlights:
• This work considers the dynamic assortment optimization problem.
• Consumer choice is modelled via the multinomial logit contextual model.
• The worst-case regret bound is free of the multiplicative problem-dependent factor κ and improves upon previous bounds in the literature.
• The κ factor can be exponentially large for some common problem instances.

Summary: In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes the products, and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used multinomial logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon T. Though this problem has recently attracted considerable attention, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, current algorithms for this problem have regret bounded by O(κ√(dT)), where κ is a problem-dependent constant that may have an exponential dependency on the number of attributes, d. In this paper, we propose an optimistic algorithm and show that the regret is bounded by O(√(dT) + κ), significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favorable regret guarantee. We also demonstrate through numerical experiments that our algorithm has robust performance for varying κ values.
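To make the choice model concrete: under the MNL model with linear mean utilities described in the abstract, the probability of a consumer purchasing a product from an offered assortment follows a softmax over utilities, with the no-purchase (outside) option's utility normalized to zero. Below is a minimal sketch of these probabilities and the resulting expected revenue of an assortment; it is illustrative only (all function names and the price vector are assumptions, not part of the paper), and it does not implement the paper's learning algorithm.

```python
import numpy as np

def mnl_choice_probs(X, theta):
    """MNL purchase probabilities for an offered assortment.

    X     : (k, d) array of attribute vectors for the k offered products
    theta : (d,) utility parameter; mean utility of product i is X[i] @ theta
    Returns a length-(k + 1) array: probability of purchasing each product,
    with the last entry the no-purchase probability (outside utility = 0).
    """
    utilities = X @ theta              # mean utilities, linear in attributes
    weights = np.exp(utilities)
    denom = 1.0 + weights.sum()        # the 1 accounts for the outside option
    return np.append(weights / denom, 1.0 / denom)

def expected_revenue(X, theta, prices):
    """Expected revenue of an assortment: sum_i prices[i] * P(choose i)."""
    purchase_probs = mnl_choice_probs(X, theta)[:-1]
    return float(prices @ purchase_probs)
```

The decision-maker's per-round problem is then to choose the assortment (subset of products) maximizing `expected_revenue`, which becomes the hard step when theta must be learned online; the paper's contribution is a tractable, convex-relaxed version of this optimization with the improved regret guarantee.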
DOI:10.1016/j.ejor.2023.02.036