{"title": "Stochastic Bandits with Context Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 14113, "page_last": 14122, "abstract": "We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broad range of applications where the context is stochastic or when the learner needs to predict the context. We adapt the UCB algorithm to this setting and show that it achieves an order-optimal high-probability bound on the cumulative regret for linear and kernelized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further analyze a variant where the learner observes the realized context after choosing the action. Finally, we demonstrate the proposed method on synthetic and real-world datasets.", "full_text": "Stochastic Bandits with Context Distributions

Johannes Kirschner
Department of Computer Science
ETH Zurich
jkirschner@inf.ethz.ch

Andreas Krause
Department of Computer Science
ETH Zurich
krausea@ethz.ch

Abstract

We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broad range of applications where the context is stochastic or when the learner needs to predict the context. 
We adapt the UCB algorithm to this setting and show that it achieves an order-optimal high-probability bound on the cumulative regret for linear and kernelized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further analyze a variant where the learner observes the realized context after choosing the action. Finally, we demonstrate the proposed method on synthetic and real-world datasets.

1 Introduction

In the contextual bandit model, a learner interacts with an environment over several rounds. At the beginning of each round, the environment provides a context, and in turn the learner chooses an action, which leads to an a priori unknown reward. The learner's goal is to choose actions that maximize the cumulative reward, and eventually to compete with the best mapping from context observations to actions. This model creates a dilemma of exploration and exploitation, as the learner needs to balance exploratory actions to estimate the environment's reward function, and exploitative actions that maximize the total return. Contextual bandit algorithms have been successfully used in many applications, including online advertisement, recommender systems and experimental design.
The contextual bandit model, as usually studied in the literature, assumes that the context is observed exactly. This is not always the case in applications, for instance when the context is itself a noisy measurement or a forecasting mechanism. An example of such a context could be a weather or stock market prediction. In other cases, such as recommender systems, privacy constraints can restrict access to certain user features, but we might instead be able to infer a distribution over those. 
To allow for uncertainty in the context, we consider a setting where the environment provides a distribution over the context set. The exact context is assumed to be a sample from this distribution, but remains hidden from the learner. Such a model, to the best of our knowledge, has not been discussed in the literature before. Not knowing the context realization makes the learning problem more difficult, because the learner needs to estimate the reward function from noisy observations and without knowing the exact context that generated the reward. Our setting recovers the classical contextual bandit setting when the context distribution is a Dirac delta distribution. We also analyze a natural variant of the problem, where the exact context is observed after the player has chosen the action. This allows for different applications, where at the time of decision the context needs to be predicted (e.g. weather conditions), but when the reward is obtained, the exact context can be measured.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We focus on the setting where the reward function is linear in terms of action-context feature vectors. For this case, we leverage the UCB algorithm on a specifically designed bandit instance without feature uncertainty to recover an $O(d\sqrt{T})$ high-probability bound on the cumulative regret. Our analysis includes a practical variant of the algorithm that requires only sampling access to the context distributions provided by the environment. We also extend our results to the kernelized setting, where the reward function is contained in a known reproducing kernel Hilbert space (RKHS). For this case, we highlight an interesting connection to distributional risk minimization, and we show that the natural estimator for the reward function is based on so-called kernel mean embeddings. 
We discuss related work in Section 6.

2 Stochastic Bandits with Context Distributions

We formally define the setting of stochastic bandits with context distributions as outlined in the introduction. Let X be a set of actions and C a context set. The environment is defined by a fixed but unknown reward function f : X × C → R. At iteration t ∈ N, the environment chooses a distribution µ_t ∈ P(C) over the context set and samples a context realization c_t ∼ µ_t. The learner observes only µ_t but not c_t, and then chooses an action x_t ∈ X. We allow an adaptive adversary to choose the context distribution, that is, µ_t may depend in an arbitrary way on the learner's previous choices up to time t. Given the learner's choice x_t, the environment provides a reward y_t = f(x_t, c_t) + ε_t, where ε_t is σ-subgaussian, additive noise. The learner's goal is to maximize the cumulative reward $\sum_{t=1}^T f(x_t, c_t)$, or equivalently, minimize the cumulative regret

$$R_T = \sum_{t=1}^{T} f(x_t^*, c_t) - f(x_t, c_t) \qquad (1)$$

where $x_t^* = \arg\max_{x \in X} \mathbb{E}_{c \sim \mu_t}[f(x, c)]$ is the best action provided that we know f and µ_t, but not c_t. Note that this way, we compete with the best possible mapping π* : P(C) → X from the observed context distribution to actions that maximizes the expected reward $\sum_{t=1}^T \mathbb{E}_{c_t \sim \mu_t}[f(\pi^*(\mu_t), c_t) \mid \mathcal{F}_{t-1}, \mu_t]$, where $\mathcal{F}_t = \{(x_s, \mu_s, y_s)\}_{s=1}^t$ is the filtration that contains all information available at the end of round t. It is natural to ask whether it is possible to compete with the stronger baseline that chooses actions given the context realization c_t, i.e. $\tilde{x}_t^* = \arg\max_{x \in X} f(x, c_t)$. While this can be possible in special cases, a simple example shows that in general the learner would suffer Ω(T) regret. 
In particular, assume that c_t ∼ Bernoulli(0.6) and X = {0, 1}. Let f(0, c) = c and f(1, c) = 1 − c. Clearly, any policy that does not know the realizations c_t must have Ω(T) regret when competing against $\tilde{x}_t^*$.
From now on, we focus on linearly parameterized reward functions $f(x, c) = \phi_{x,c}^\top \theta$ with given feature vectors $\phi_{x,c} \in \mathbb{R}^d$ for x ∈ X and c ∈ C, and unknown parameter $\theta \in \mathbb{R}^d$. This setup is commonly referred to as the linear bandit setting. For the analysis we require standard boundedness assumptions, which for the sake of simplicity we set to $\|\phi_{x,c}\|_2 \le 1$ and $\|\theta\|_2 \le 1$. In Section 4.2, we further consider a variant of the problem where the learner observes c_t after taking the decision x_t. This simplifies the estimation problem, because we have data {(x_t, c_t, y_t)} with exact context c_t available, just like in the standard setting. The exploration problem however remains subtle, as at the time of decision the learner still knows only µ_t and not c_t. In Section 4.3 we extend our algorithm and analysis to kernelized bandits, where f ∈ H is contained in a reproducing kernel Hilbert space H.

3 Background

We briefly review standard results from the linear contextual bandit literature and the upper confidence bound (UCB) algorithm that we build on later (Abbasi-Yadkori et al., 2011). The linear contextual bandit setting can be defined as a special case of our setup, where the choice of µ_t is restricted to Dirac delta distributions µ_t = δ_{c_t}, and therefore the learner knows beforehand the exact context which is used to generate the reward. 
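The Ω(T) example from Section 2 can also be checked numerically. The following sketch (our own illustration, not from the paper) plays the action that is optimal given only µ_t and measures regret against both baselines:

```python
import random

random.seed(0)

def f(x, c):
    # reward from the example: f(0, c) = c, f(1, c) = 1 - c
    return c if x == 0 else 1 - c

T = 10_000
regret_vs_expected = 0.0   # against x_t^* = argmax_x E_{c ~ mu_t}[f(x, c)]
regret_vs_realized = 0.0   # against x~_t^* = argmax_x f(x, c_t)

for _ in range(T):
    c = 1 if random.random() < 0.6 else 0   # c_t ~ Bernoulli(0.6)
    x = 0   # optimal given only mu_t: E[f(0, c)] = 0.6 > 0.4 = E[f(1, c)]
    regret_vs_expected += f(0, c) - f(x, c)               # stays 0
    regret_vs_realized += max(f(0, c), f(1, c)) - f(x, c)

print(regret_vs_expected)       # 0.0
print(regret_vs_realized / T)   # roughly P(c = 0) = 0.4 per round, i.e. linear regret
```

Against the fair baseline the regret is exactly zero, while against the realized-context baseline it grows at a constant rate per round, matching the Ω(T) claim.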
In an equivalent formulation, the environment provides at time t a set of action-context feature vectors $\Psi_t = \{\phi_{x,c_t} : x \in X\} \subset \mathbb{R}^d$ and the algorithm chooses an action x_t with corresponding features $\phi_t := \phi_{x_t,c_t} \in \Psi_t$. We emphasize that in this formulation the context c_t is extraneous to the algorithm, and everything can be defined in terms of the time-varying action-feature sets Ψ_t. As before, the learner obtains a noisy reward observation $y_t = \phi_t^\top\theta + \epsilon_t$, where ε_t is conditionally ρ-subgaussian with variance proxy ρ², i.e.

$$\forall \lambda \in \mathbb{R}, \quad \mathbb{E}[e^{\lambda\epsilon_t} \mid \mathcal{F}_{t-1}, \phi_t] \le \exp(\lambda^2\rho^2/2)\,.$$

Also here, the standard objective is to minimize the cumulative regret $R_T = \sum_{t=1}^T \phi_t^{*\top}\theta - \phi_t^\top\theta$, where $\phi_t^* = \arg\max_{\phi\in\Psi_t}\phi^\top\theta$ is the feature vector of the best action at time t.
To define the UCB algorithm, we make use of the confidence sets derived by Abbasi-Yadkori et al. (2011) for online least squares regression. At the end of round t, the algorithm has adaptively collected data $\{(\phi_1, y_1), \dots, (\phi_t, y_t)\}$ that we use to compute the regularized least squares estimate $\hat\theta_t = \arg\min_{\theta'\in\mathbb{R}^d}\sum_{s=1}^t(\phi_s^\top\theta' - y_s)^2 + \lambda\|\theta'\|_2^2$ with λ > 0. We denote the closed-form solution by $\hat\theta_t = V_t^{-1}\sum_{s=1}^t\phi_s y_s$ with $V_t = V_{t-1} + \phi_t\phi_t^\top$, $V_0 = \lambda I_d$, where $I_d \in \mathbb{R}^{d\times d}$ is the identity matrix.
Lemma 1 (Abbasi-Yadkori et al. (2011)). 
For any stochastic sequence $\{(\phi_t, y_t)\}_t$ and estimator $\hat\theta_t$ as defined above for ρ-subgaussian observations, with probability at least 1 − δ, at any time t ∈ N,

$$\|\theta - \hat\theta_t\|_{V_t} \le \beta_t\,, \qquad \text{where } \beta_t = \beta_t(\rho,\delta) = \rho\sqrt{2\log\left(\frac{\det(V_t)^{1/2}}{\delta\det(V_0)^{1/2}}\right)} + \lambda^{1/2}\|\theta\|_2\,.$$

Note that the size of the confidence set depends on the variance proxy ρ, which will be important in the following. In each round t + 1, the UCB algorithm chooses an action φ_{t+1} that maximizes an upper confidence bound on the reward,

$$\phi_{t+1} := \arg\max_{\phi\in\Psi_{t+1}} \phi^\top\hat\theta_t + \beta_t\|\phi\|_{V_t^{-1}}\,.$$

The following result shows that the UCB policy achieves sublinear regret (Dani et al., 2008; Abbasi-Yadkori et al., 2011).
Lemma 2. In the standard contextual bandit setting with ρ-subgaussian observation noise, the regret of the UCB policy with β_t = β_t(ρ, δ) is bounded with probability 1 − δ by

$$R_T^{UCB} \le \beta_T\sqrt{8T\log\left(\frac{\det V_T}{\det V_0}\right)}\,.$$

The data-dependent terms can be further upper-bounded to obtain $R_T^{UCB} \le \tilde O(d\sqrt{T})$ up to logarithmic factors in T (Abbasi-Yadkori et al., 2011, Theorem 3). A matching lower bound is given by Dani et al. (2008, Theorem 3).

4 UCB with Context Distributions

In our setting, where we only observe a context distribution µ_t (e.g. a weather prediction) instead of the context c_t (e.g. 
realized weather conditions), also the features $\phi_{x,c_t}$ (e.g. the last layer of a neural network that models the reward $f(x, c) = \phi_{x,c}^\top\theta$) are uncertain. We propose an approach that transforms the problem such that we can directly use a contextual bandit algorithm as for the standard setting. Given the distribution µ_t, we define a new set of feature vectors $\Psi_t = \{\bar\psi_{x,\mu_t} : x \in X\}$, where we denote by $\bar\psi_{x,\mu_t} = \mathbb{E}_{c\sim\mu_t}[\phi_{x,c} \mid \mathcal{F}_{t-1}, \mu_t]$ the expected feature vector of action x under µ_t. Each feature $\bar\psi_{x,\mu_t}$ corresponds to exactly one action x ∈ X, so we can use Ψ_t as feature context set at time t and use the UCB algorithm to choose an action x_t. The choice of the UCB algorithm here is only for the sake of the analysis; any other algorithm that works in the linear contextual bandit setting can be used. The complete algorithm is summarized in Algorithm 1. We compute the UCB action x_t with corresponding expected features $\psi_t := \bar\psi_{x_t,\mu_t} \in \Psi_t$, and the learner provides x_t to the environment. We then proceed and use the reward observation y_t to update the least squares estimate. That this is a sensible approach is not immediate, because y_t is a noisy observation of $\phi_{x_t,c_t}^\top\theta$, whereas UCB expects the reward $\psi_t^\top\theta$. We address this issue by constructing the feature set Ψ_t in such a way that y_t acts as an unbiased observation also for the action choice ψ_t. As computing exact expectations can be difficult, and in applications often only sampling access to µ_t is possible, we also analyze a variant of Algorithm 1 where we use finite sample averages $\tilde\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,\tilde c_l}$ for L ∈ N i.i.d. samples $\tilde c_l \sim \mu_t$ instead of the expected features $\bar\psi_{x,\mu_t}$. The corresponding feature set is $\tilde\Psi_t = \{\tilde\psi_{x,\mu_t} : x \in X\}$. 
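As a concrete illustration of this reduction, the sketch below runs the sampled-feature variant end to end on a toy linear model. The feature map `phi`, the Gaussian context distributions, and the fixed confidence width `beta` are our own hypothetical choices for illustration, not specifics from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, actions, lam, sigma, L = 3, [0, 1, 2], 1.0, 0.1, 100
theta = np.array([0.5, -0.3, 0.2])            # unknown reward parameter

def phi(x, c):
    # hypothetical feature map phi_{x,c} in R^d
    return np.array([c, np.sin(x + c), float(x == 1) * c])

V, b = lam * np.eye(d), np.zeros(d)
for t in range(200):
    m = rng.normal()                          # environment picks mu_t = N(m, 0.5^2)
    # sampled variant: Monte Carlo estimate of the expected features psi~_{x,mu_t}
    cs = rng.normal(m, 0.5, size=L)
    Psi = {x: np.mean([phi(x, c) for c in cs], axis=0) for x in actions}
    theta_hat = np.linalg.solve(V, b)
    beta = 1.0                                # confidence width, tuned in practice
    ucb = {x: p @ theta_hat + beta * np.sqrt(p @ np.linalg.solve(V, p))
           for x, p in Psi.items()}
    x_t = max(ucb, key=ucb.get)               # UCB action on the reduced instance
    c_t = rng.normal(m, 0.5)                  # hidden context realization
    y_t = phi(x_t, c_t) @ theta + sigma * rng.normal()
    V += np.outer(Psi[x_t], Psi[x_t])         # least-squares update uses psi, not phi
    b += Psi[x_t] * y_t

print(np.linalg.solve(V, b))                  # estimate of theta from psi-based regression
```

Note that the regression runs entirely on the expected features, while the reward comes from the hidden realization; this is exactly the point made above about y_t acting as an observation for ψ_t.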
For both variants of the algorithm we show the following regret bound.

Algorithm 1 UCB for linear stochastic bandits with context distributions
  Initialize $\hat\theta = 0 \in \mathbb{R}^d$, $V_0 = \lambda I \in \mathbb{R}^{d\times d}$
  For step t = 1, 2, ..., T:
    Environment chooses $\mu_t \in P(C)$  // context distribution
    Learner observes µ_t
    Set $\Psi_t = \{\bar\psi_{x,\mu_t} : x \in X\}$ with $\bar\psi_{x,\mu_t} := \mathbb{E}_{c\sim\mu_t}[\phi_{x,c}]$  // variant 1, expected version
    Alternatively, sample $c_{t,1}, \dots, c_{t,L} \sim \mu_t$ for L = t, and set $\Psi_t = \{\tilde\psi_{x,\mu_t} : x \in X\}$ with $\tilde\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,c_{t,l}}$  // variant 2, sampled version
    Run UCB step with Ψ_t as context set  // reduction
    Choose action $x_t = \arg\max_{\psi_{x,\mu_t}\in\Psi_t} \psi_{x,\mu_t}^\top\hat\theta_{t-1} + \beta_t\|\psi_{x,\mu_t}\|_{V_{t-1}^{-1}}$  // UCB action
    Environment provides $y_t = \phi_{x_t,c_t}^\top\theta + \epsilon_t$ where $c_t \sim \mu_t$  // reward observation
    Update $V_t = V_{t-1} + \psi_{x_t,\mu_t}\psi_{x_t,\mu_t}^\top$, $\hat\theta_t = V_t^{-1}\sum_{s=1}^t \psi_{x_s,\mu_s} y_s$  // least-squares update

Theorem 1. 
The regret of Algorithm 1 with expected feature set Ψ_t and $\beta_t = \beta_t(\sqrt{4+\sigma^2}, \delta/2)$ is bounded at time T with probability at least 1 − δ by

$$R_T \le \beta_T\sqrt{8T\log\left(\frac{\det V_T}{\det V_0}\right)} + 4\sqrt{2T\log\frac{4}{\delta}}\,.$$

Further, for finite action sets X, if the algorithm uses sampled feature sets $\tilde\Psi_t$ with L = t and $\beta_t = \tilde\beta_t$ as defined in (11), Appendix A.2, then with probability at least 1 − δ,

$$R_T \le \tilde\beta_T\sqrt{8T\log\left(\frac{\det V_T}{\det V_0}\right)} + 4\sqrt{2T\log\frac{2|X|\pi T}{3\delta}}\,.$$

As before, one can further upper bound the data-dependent terms to obtain an overall regret bound of order $R_T \le \tilde O(d\sqrt{T})$, see (Abbasi-Yadkori et al., 2011, Theorem 2). With iterative updates of the least-squares estimator, the per-step computational complexity is $O(Ld^2|X|)$ if the UCB action is computed by a simple enumeration over all actions.

4.1 Regret analysis: Proof of Theorem 1

Recall that x_t is the action that the UCB algorithm selects at time t, ψ_t the corresponding feature vector in Ψ_t, and we define $\psi_t^* = \arg\max_{\psi\in\Psi_t}\psi^\top\theta$. We show that the regret R_T is bounded in terms of the regret $R_T^{UCB} := \sum_{t=1}^T \psi_t^{*\top}\theta - \psi_t^\top\theta$ of the UCB algorithm on the contextual bandit defined by the sequence of action feature sets Ψ_t.
Lemma 3. 
The regret of Algorithm 1 with the expected feature set Ψ_t is bounded at time T with probability at least 1 − δ by

$$R_T \le R_T^{UCB} + 4\sqrt{2T\log\frac{1}{\delta}}\,.$$

Further, if the algorithm uses the sample-based features $\tilde\Psi_t$ with L = t at iteration t, the regret is bounded at time T with probability at least 1 − δ by

$$R_T \le R_T^{UCB} + 4\sqrt{2T\log\frac{|X|\pi T}{3\delta}}\,.$$

Proof. Consider first the case where we use the expected features $\bar\psi_{x_t,\mu_t}$. We add and subtract $(\bar\psi_{x_t^*,\mu_t} - \bar\psi_{x_t,\mu_t})^\top\theta$ and use $\bar\psi_{x_t^*,\mu_t}^\top\theta \le \psi_t^{*\top}\theta$ to bound the regret by

$$R_T \le R_T^{UCB} + \sum_{t=1}^T D_t\,,$$

where we defined $D_t = (\phi_{x_t^*,c_t} - \bar\psi_{x_t^*,\mu_t} + \bar\psi_{x_t,\mu_t} - \phi_{x_t,c_t})^\top\theta$. It is easy to verify that $\mathbb{E}_{c_t\sim\mu_t}[D_t \mid \mathcal{F}_{t-1}, \mu_t, x_t] = 0$, that is, $D_t$ is a martingale difference sequence with $|D_t| \le 4$, and $M_T = \sum_{t=1}^T D_t$ is a martingale. The first part of the lemma therefore follows from the Azuma-Hoeffding inequality (Lemma 4, Appendix). For the sample-based version, the reasoning is similar, but we need to ensure that the features $\tilde\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,\tilde c_l}$ are sufficiently concentrated around their expected counterparts $\bar\psi_{x,\mu_t}$ for any x ∈ X and t ∈ N. We provide details in Appendix A.1.

Proof of Theorem 1. 
Clearly, the lemma gives a regret bound for Algorithm 1 if the regret term $R_T^{UCB}$ is bounded. The main difficulty is that the reward observation $y_t = \phi_{x_t,c_t}^\top\theta + \epsilon_t$ is generated from a different feature vector than the feature $\psi_t \in \Psi_t$ that is chosen by UCB. Note that in general, it is not even true that $\phi_{x_t,c_t} \in \Psi_t$. However, a closer inspection of the reward signal reveals that y_t can be written as

$$y_t = \psi_t^\top\theta + \xi_t + \epsilon_t \qquad \text{with } \xi_t := (\phi_{x_t,c_t} - \psi_t)^\top\theta \qquad (2)$$

For the variant that uses the expected features $\psi_t = \bar\psi_{x_t,\mu_t}$, our construction already ensures that $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}, \mu_t, x_t] = \mathbb{E}[\phi_{x_t,c_t} - \bar\psi_{x_t,\mu_t} \mid \mathcal{F}_{t-1}, \mu_t, x_t]^\top\theta = 0$. Note that the distribution of ξ_t depends on x_t and is therefore heteroscedastic in general. However, by boundedness of the rewards, |ξ_t| ≤ 2, and hence ξ_t is 2-subgaussian, which allows us to continue with a homoscedastic noise bound. We see that y_t acts like an observation of $\psi_t^\top\theta$ perturbed by $\sqrt{4+\sigma^2}$-subgaussian noise (for two independent random variables X and Y that are σ₁- and σ₂-subgaussian respectively, X + Y is $\sqrt{\sigma_1^2+\sigma_2^2}$-subgaussian). Therefore, the construction of the confidence bounds for the least squares estimator w.r.t. ψ_t remains valid at the cost of an increased variance proxy, and we are required to use β_t with $\rho = \sqrt{4+\sigma^2}$ in the definition of the confidence set. The regret bound for the UCB algorithm (Lemma 2) and an application of the union bound complete the proof for this case. When we use the sample-based features $\tilde\psi_{x,\mu_t}$, the noise term ξ_t can be biased, because x_t depends on the sampled features and $\mathbb{E}[\tilde\psi_{x_t,\mu_t} \mid \mathcal{F}_{t-1}, \mu_t] \ne \bar\psi_{x_t,\mu_t}$. This bias carries over to the least-squares estimator, but can be controlled by a more careful analysis. 
See Appendix A.2 for details.

4.2 When the context realization is observed

We now turn our attention to the alternative setting, where it is possible to observe the realized context c_t (e.g. actual weather measurements) after the learner has chosen x_t. In Algorithm 1, so far our estimate $\hat\theta_t$ only uses the data $\{(x_s, \mu_s, y_s)\}_{s=1}^t$, but with the context observation we have $\{(x_s, c_s, y_s)\}_{s=1}^t$ available. It makes sense to use the additional information to improve our estimate $\hat\theta_t$, and as we show below this reduces the amount the UCB algorithm explores. The pseudo code of the modified algorithm is given in Algorithm 2 (Appendix B), where the only difference is that we replace the estimate of θ by the least squares estimate $\hat\theta_t = \arg\min_{\theta'\in\mathbb{R}^d}\sum_{s=1}^t(\phi_{x_s,c_s}^\top\theta' - y_s)^2 + \lambda\|\theta'\|_2^2$. Since now the observation noise $\epsilon_t = y_t - \phi_{x_t,c_t}^\top\theta$ is only σ-subgaussian (instead of $\sqrt{4+\sigma^2}$-subgaussian), we can use the smaller scaling factor β_t with ρ = σ to obtain a tighter upper confidence bound.
Theorem 2. The regret of Algorithm 2 based on the expected feature sets Ψ_t and β_t = β_t(σ, δ/3) is bounded with probability at least 1 − δ by

$$R_T \le \beta_T\sqrt{8T\log\left(\frac{\det V_T}{\det V_0}\right)} + 4(1 + \lambda^{-1/2}\beta_T)\sqrt{2T\log\frac{3}{\delta}}$$

The importance of this result is that it justifies the use of the smaller scaling β_t of the confidence set, which affects the action choice of the UCB algorithm. In practice, β_t has a large impact on the amount of exploration, and a tighter choice can significantly reduce the regret, as we show in our experiments. 
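The single change in Algorithm 2 is which features enter the regression once c_t is revealed. A minimal sketch of that update (with hypothetical helper names, not the paper's code) makes the contrast explicit:

```python
import numpy as np

def ls_update(V, b, psi_choice, phi_realized, y, context_observed):
    """One regularized least-squares update. With an observed context
    (Algorithm 2) the realized features phi_{x_t,c_t} enter the regression;
    otherwise (Algorithm 1) the expected features psi_t are used."""
    feat = phi_realized if context_observed else psi_choice
    return V + np.outer(feat, feat), b + feat * y

# toy check: with observed contexts the estimator only sees the
# sigma-subgaussian measurement noise, so it concentrates quickly
rng = np.random.default_rng(1)
theta = np.array([1.0, -1.0])
V, b = np.eye(2), np.zeros(2)
for _ in range(500):
    phi_r = rng.normal(size=2)            # realized features phi_{x_t,c_t}
    psi = phi_r + rng.normal(size=2)      # expected features differ by context noise
    y = phi_r @ theta + 0.1 * rng.normal()
    V, b = ls_update(V, b, psi, phi_r, y, context_observed=True)
print(np.round(np.linalg.solve(V, b), 1))
```

The action choice is unchanged and still uses only µ_t; only the regression targets switch to the realized features, which is what permits the smaller ρ = σ in β_t.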
We note that in this case, the reduction to the regret bound of UCB is slightly more involved than previously. As before, we use Lemma 3 to reduce a regret bound on R_T to the regret $R_T^{UCB}$ that the UCB algorithm obtains on the sequence of context-feature sets Ψ_t. Since the UCB action is now based on tighter confidence bounds, we expect the regret $R_T^{UCB}$ to be smaller, too. This does not follow directly from the UCB analysis, as there the estimator is based on the features $\bar\psi_{x_t,\mu_t}$ instead of $\phi_{x_t,c_t}$. We defer the complete proof to Appendix B.1, where we also show a similar result for the sample-based feature sets $\tilde\Psi_t$, analogous to Theorem 1.

4.3 Kernelized stochastic bandits with context distributions

In the kernelized setting, the reward function f : X × C → R is a member of a known reproducing kernel Hilbert space (RKHS) H with kernel function k : (X × C)² → R. In the following, let $\|\cdot\|_{\mathcal H}$ be the Hilbert norm, and we denote by $k_{x,c} := k(x, c, \cdot, \cdot) \in \mathcal H$ the kernel features. For the analysis we further make the standard boundedness assumptions $\|f\|_{\mathcal H} \le 1$ and $\|k_{x,c}\|_{\mathcal H} \le 1$. We provide details on how to estimate f given data $\{(x_s, \mu_s, y_s)\}_{s=1}^t$ with uncertain context. As in the linear case, the estimator $\hat f_t$ can be defined as an empirical risk minimizer with parameter λ > 0,

$$\hat f_t = \arg\min_{f\in\mathcal H} \sum_{s=1}^t \big(\mathbb{E}_{c\sim\mu_s}[f(x_s, c)] - y_s\big)^2 + \lambda\|f\|_{\mathcal H}^2\,. \qquad (3)$$

In the literature this is known as distributional risk minimization (Muandet et al., 2017, Section 3.7.3). The following representer theorem shows that the solution can be expressed as a linear combination of kernel mean embeddings $\bar k_{x,\mu} := \mathbb{E}_{c\sim\mu}[k_{x,c}] \in \mathcal H$.
Theorem 3 (Muandet et al. (2012, Theorem 1)). 
Any f ∈ H that minimizes the regularized risk functional (3) admits a representation of the form $f = \sum_{s=1}^t \alpha_s \bar k_{x_s,\mu_s}$ for some $\alpha_s \in \mathbb{R}$.

It is easy to verify that the solution to (3) can be written as

$$\hat f_t(x, c) = k_t(x, c)^\top (K_t + \lambda I)^{-1} y_t \qquad (4)$$

where $k_t(x, c) = [\bar k_{x_1,\mu_1}(x, c), \dots, \bar k_{x_t,\mu_t}(x, c)]^\top$, $(K_t)_{a,b} = \mathbb{E}_{c\sim\mu_b}[\bar k_{x_a,\mu_a}(x_b, c)]$ for $1 \le a, b \le t$ is the kernel matrix, and $y_t = [y_1, \dots, y_t]^\top$ denotes the vector of observations. Likewise, the estimator can be computed from sample-based kernel mean embeddings $\tilde k^L_{x,\mu} := \frac{1}{L}\sum_{i=1}^L k(x, \tilde c_i, \cdot, \cdot) \in \mathcal H$ for i.i.d. samples $\tilde c_i \sim \mu$. This allows for an efficient implementation also in the kernelized setting, at the usual cost of inverting the kernel matrix. With iterative updates the overall cost amounts to $O(LT^3)$. The cubic scaling in T can be avoided with finite-dimensional feature approximations or inducing point methods, e.g. Rahimi and Recht (2008); Mutny and Krause (2018).
The UCB algorithm can be defined using an analogous concentration result for the RKHS setting (Abbasi-Yadkori, 2012). We provide details and the complete kernelized algorithm (Algorithm 3) in Appendix C. The corresponding regret bound is summarized in the following theorem.
Theorem 4. 
At any time T ∈ N, the regret of Algorithm 3 with exact kernel mean embeddings $\bar k_{x,\mu}$ and β_t as defined in Lemma 6 in Appendix C is bounded with probability at least 1 − δ by

$$R_T \le \beta_T\sqrt{8T\log\big(\det(I + (\lambda\rho)^{-1}K_T)\big)} + 4\sqrt{2T\log\frac{2}{\delta}}$$

Again, the data-dependent log-determinant in the regret bound can be replaced with kernel-specific bounds, referred to as the maximum information gain $\gamma_T$ (Srinivas et al., 2010).

5 Experiments

We evaluate the proposed method on a synthetic example as well as on two benchmarks that we construct from real-world data. Our focus is on understanding the effect of the sample size L used to define the context set $\tilde\Psi_t$. We compare three different observational modes, with decreasing amount of information available to the learner. First, in the exact setting, we allow the algorithm to observe the context realization before choosing an action, akin to the usual contextual bandit setting. Note that this variant possibly obtains negative reward on the regret objective (1), because $x_t^*$ is computed to maximize the expected reward over the context distribution independent of c_t. Second, in the observed setting, decisions are based on the context distribution, but the regression is based on the exact context realization. Last, only the context distribution is used in the hidden setting. We evaluate the effect of the sample sizes L = 10, 100 and compare to the variant that uses the exact expectation of the features. As is common practice, we treat the confidence parameter β_T as a tuning parameter that we choose to minimize the regret after T = 1000 steps. Below we provide details on the experimental setup, and the evaluation is shown in Figure 1. 
In all experiments, the 'exact' version significantly outperforms the distributional variants or even achieves negative regret, as anticipated. Consistent with our theory, observing the exact context after the action choice improves performance compared to the unobserved variant. The sample-based algorithm is competitive with the expected features already for L = 100 samples.

Figure 1: (a) Synthetic Example, (b) Movielens, (c) Wheat Yield Data. The plots show cumulative regret as defined in (1). As expected, the variant that does not observe the context (hidden) is outperformed by the variant that uses the context realization for regression (obs). The sample size L used to construct the feature sets from the context distribution has a significant effect on the regret, where with L = 100 performance is already competitive with the policy that uses the exact expectation over features (E). The exact baseline, which has access to the context realization before taking the decision, achieves negative regret on benchmarks (a) and (b), as the regret objective (1) compares to the action maximizing the expected reward. The error bars show two times the standard error over 100 trials for (a) and (c), and 200 trials for (b). The variance in the movielens experiment is fairly large, likely because our linear model is misspecified; at first glance, it looks like the sample-based version outperforms the expected version in one case. From repeated trials we confirmed that this is only an effect of the randomness in the results.

Synthetic Example As a simple synthetic benchmark we set the reward function to $f(x, c) = \sum_{i=1}^5 (x_i - c_i)^2$, where both actions and context are vectors in $\mathbb{R}^5$. We choose this quadratic form to create a setting where the optimal action strongly depends on the context $c_i$. As linear parametrization we choose $\phi(x, c) = (x_1^2, \dots, x_5^2, c_1^2, \dots, c_5^2, x_1c_1, \dots, x_5c_5)$. 
The action set consists of k = 100 elements that we sample at the beginning of each trial from a standard Gaussian distribution. For the context distribution, we first sample a random element $m_t \in \mathbb{R}^5$, again from a multivariate normal distribution, and then set $\mu_t = \mathcal{N}(m_t, 1)$. Observation noise is Gaussian with standard deviation 0.1.

Movielens Data Using matrix factorization we construct 6-dimensional features for user ratings of movies in the movielens-1m dataset (Harper and Konstan, 2016). We use the learned embedding as ground truth to generate the reward, which we round to half-integers between 0 and 5, like the actual ratings. Therefore our model is misspecified in this experiment. Besides the movie ratings, the dataset provides basic demographic data for each user. In the interactive setting, the context realization is a randomly sampled user from the data. The context distribution is set to the empirical distribution of users in the dataset with the same demographic data. The setup is motivated by a setting where the system interacts with new users, for which we have already obtained the basic demographic data, but not yet the exact user's features (which in collaborative filtering are computed from the user's ratings). We provide further details in Appendix D.1.

Crop Yield Data We use a wheat yield dataset that was systematically collected by the Agroscope institute in Switzerland over 15 years on 10 different sites. For each site and year, a 16-dimensional suitability factor based on recorded weather conditions is available. The dataset contains 8849 yield measurements for 198 crops. 
From this we construct a data set D = {(x_i, w_i, y_i)}, where x_i is the identifier of the tested crop, w_i ∈ R^{16+10} is the 16-dimensional suitability factor obtained from weather measurements, augmented with a 1-hot encoding of the 10 sites, and y_i is the normalized crop yield. We fit a bilinear model y_i ≈ w_i^T W V x_i to get 5-dimensional features V x for each variety x and site features w^T W that take the weather conditions w of the site into account. From this model, we generate the ground-truth reward. Our goal is to provide crop recommendations that maximize yield on a given site with characteristics w. Since w is based on weather measurements that are not available ahead of time, we set the context distribution such that each feature of w is perturbed by a Gaussian distribution centered around the true w. We set the variance of the perturbation to the empirical variance of the features for the current site over all 15 years. Further details are in Appendix D.2.

6 Related Work

There is a large array of work on bandit algorithms; for a survey see Bubeck and Cesa-Bianchi (2012) or the book by Lattimore and Szepesvári (2018). Of interest to us is the stochastic contextual bandit problem, where the learner chooses actions after seeing a context, and the goal is to compete with a class of policies that map contexts to actions. This is akin to reinforcement learning (Sutton and Barto, 2018), but the contextual bandit problem is different in that the sequence of contexts is typically allowed to be arbitrary (even adversarially chosen) and does not necessarily follow a specific transition model. The contextual bandit problem in this formulation dates back to at least Abe and Long (1999) and Langford and Zhang (2007).
The perhaps best understood instance of this model is the linear contextual bandit, where the reward function is a linear map of feature vectors (Auer et al., 2002). One of the most popular algorithms is the Upper Confidence Bound (UCB) algorithm, first introduced by Auer (2002) for the multi-armed bandit problem, and later extended to the linear case by Li et al. (2010). The analysis of this algorithm was improved by Dani et al. (2008), Abbasi-Yadkori et al. (2011) and Li et al. (2019), where the main technical challenge is to construct tight confidence sets for an online version of the least squares estimator. Alternative exploration strategies have been considered as well, for instance Thompson sampling (Thompson, 1933), which was analyzed for the linear model by Agrawal and Goyal (2013) and Abeille and Lazaric (2017). Other notable approaches include an algorithm that uses a perturbed data history as exploration mechanism (Kveton et al., 2019), and a mostly greedy algorithm that leverages the randomness in the context to obtain sufficient exploration (Bastani et al., 2017). In kernelized bandits the reward function is contained in a given reproducing kernel Hilbert space (RKHS). This setting is closely related to Bayesian optimization (Mockus, 1982). Again, the analysis hinges on the construction of confidence sets and on bounding a quantity referred to as the information gain by the decay of the kernel's eigenspectrum. An analysis of the UCB algorithm for this setting was provided by Srinivas et al. (2010). It was later refined by Abbasi-Yadkori (2012); Valko et al. (2013); Chowdhury and Gopalan (2017); Durand et al. (2018) and extended to the contextual setting by Krause and Ong (2011). Interestingly, in our reduction the noise distribution depends on the action, a setting also referred to as heteroscedastic bandits. Heteroscedastic bandits were previously considered by Hsieh et al. (2019) and Kirschner and Krause (2018).
Stochastic uncertainty in the action choice has been studied by Oliveira et al. (2019) in the context of Bayesian optimization. More closely related is the work by Yun et al. (2017), who introduce a linear contextual bandit model where the observed feature is perturbed by noise and the objective is to compete with the best policy that has access to the unperturbed feature vector. The main difference to our setting is that we assume that the environment provides a distribution of feature vectors (instead of a single, perturbed vector) and we compute the best action as a function of the distribution. As a consequence, we are able to obtain O(√T) regret bounds without further assumptions on the context distribution, while Yun et al. (2017) get O(T^{7/8}) with identical noise on each feature, and O(T^{2/3}) for Gaussian feature distributions. Most closely related is the work by Lamprier et al. (2018) on linear bandits with stochastic context. The main difference to our setting is that the context distribution in Lamprier et al. (2018) is fixed over time, which allows building aggregated estimates of the mean feature vector over time. Our setting is more general in that it allows an arbitrary sequence of distributions as well as correlation between the feature distributions of different actions. Moreover, in contrast to previous work, we discuss the kernelized setting and the variant where the context is observed exactly after the action choice. Finally, adversarial contextual bandit algorithms also apply in our setting, for example the EXP4 algorithm of Auer et al. (2002) or ILTCB of Agarwal et al. (2014). Here, the objective is to compete with the best policy in a given class of policies, which in our setting would require working with a covering of the set of distributions P(C).
However, these algorithms do not exploit the linear reward assumption and are therefore arguably less practical in our setting.

7 Conclusion

We introduced context distributions for stochastic bandits, a model that is naturally motivated in many applications and allows capturing the learner's uncertainty in the context realization. The method we propose is based on the UCB algorithm, and in fact, both our model and algorithm strictly generalize the standard setting in the sense that we recover the usual model and the UCB algorithm if the environment chooses only Dirac delta distributions. The most practical variant of the proposed algorithm requires only sample access to the context distributions and satisfies a high-probability regret bound that is order-optimal in the feature dimension and the horizon up to logarithmic factors.

Acknowledgments

The authors thank Agroscope for providing the crop yield data set, in particular Didier Pellet, Lilia Levy and Juan Herrera, who collected the winter wheat data, and Annelie Holzkämper, who developed the environmental suitability factors model. Further, the authors acknowledge the work by Mariyana Koleva and Dejan Mirčić, who performed the initial data cleaning and exploration as part of their Master's theses.
This research was supported by SNSF grant 200020 159557 and has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 815943.

References

Abbasi-Yadkori, Y. (2012). Online Learning for Linearly Parametrized Control Problems. PhD thesis.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts.
In Proceedings of the Sixteenth International Conference on Machine Learning, pages 3–11. Morgan Kaufmann Publishers Inc.

Abeille, M. and Lazaric, A. (2017). Linear Thompson sampling revisited. In Artificial Intelligence and Statistics, pages 176–184.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646.

Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.

Bastani, H., Bayati, M., and Khosravi, K. (2017). Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1).

Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 844–853. JMLR.org.

Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In COLT, pages 355–366. Omnipress.

Durand, A., Maillard, O.-A., and Pineau, J. (2018). Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650–683.

Harper, F. M. and Konstan, J. A. (2016). The movielens datasets: History and context.
ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.

Holzkämper, A., Calanca, P., and Fuhrer, J. (2013). Identifying climatic limitations to grain maize yield potentials using a suitability evaluation approach. Agricultural and Forest Meteorology, 168:149–159.

Hsieh, P.-C., Liu, X., Bhattacharya, A., and Kumar, P. (2019). Stay with me: Lifetime maximization through heteroscedastic linear bandits with reneging. In International Conference on Machine Learning, pages 2800–2809.

Hug, N. (2017). Surprise, a Python library for recommender systems. http://surpriselib.com.

Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT).

Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8):30–37.

Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455.

Kveton, B., Szepesvari, C., Ghavamzadeh, M., and Boutilier, C. (2019). Perturbed-history exploration in stochastic linear bandits. arXiv preprint arXiv:1903.09132.

Lamprier, S., Gisselbrecht, T., and Gallinari, P. (2018). Profile-based bandit with unknown profiles. The Journal of Machine Learning Research, 19(1):2060–2099.

Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer.

Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670.
ACM.

Li, Y., Wang, Y., and Zhou, Y. (2019). Tight regret bounds for infinite-armed linear contextual bandits.

Mockus, J. (1982). The Bayesian approach to global optimization. System Modeling and Optimization, pages 473–481.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. (2012). Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, pages 10–18.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141.

Mutny, M. and Krause, A. (2018). Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, pages 9005–9016.

Oliveira, R., Ott, L., and Ramos, F. (2019). Bayesian optimisation under uncertain inputs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1177–1184.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184.

Srinivas, N., Krause, A., Seeger, M., and Kakade, S. M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arXiv:1309.6869 [cs, stat].

Yun, S.-Y., Nam, J. H., Mo, S., and Shin, J. (2017).
Contextual multi-armed bandits under feature uncertainty. arXiv preprint arXiv:1703.01347.