We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget, where the ranges of both rewards and noise are unknown. We introduce \platypoos, an adaptive, robust, and efficient alternative to the \OLOP (open-loop optimistic planning) algorithm. Whereas \OLOP requires a priori knowledge of the ranges of both rewards and noise, \platypoos dynamically adapts its behavior to both. This makes \platypoos immune to two vulnerabilities of \OLOP: failure when the ranges of noise and rewards are underestimated, and inefficiency when they are overestimated. \Platypoos additionally adapts to the global smoothness of the value function. We assess the performance of \platypoos in terms of the simple regret, the expected loss incurred by choosing the action it recommends rather than an optimal one. We show that \platypoos is provably more efficient than \OLOP when \OLOP is given an overestimated reward range, and that in the noiseless case, \platypoos learns exponentially faster than \OLOP.
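For concreteness, a standard way to write the simple regret after a budget of $n$ samples is the following; the symbols $a(n)$, $a^\star$, and $V$ are illustrative notation introduced here, not fixed by the abstract:
\[
r_n \;=\; \mathbb{E}\big[\, V(a^\star) - V(a(n)) \,\big],
\]
where $a(n)$ is the action recommended by the planner once the budget is exhausted, $a^\star$ is an optimal first action, and $V(a)$ denotes the discounted value of the best open-loop action sequence starting with $a$.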