Scale-free adaptive planning for deterministic dynamics & discounted rewards

Proceedings of the 2019 International Conference on Machine Learning (ICML 2019)

Published September 3, 2018

Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko

We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget, where the ranges of both rewards and noise are unknown. We introduce PlaTγPOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTγPOOS dynamically adapts its behavior to both. This makes PlaTγPOOS immune to two vulnerabilities of OLOP: failure when the ranges of noise and rewards are underestimated, and inefficiency when they are overestimated. PlaTγPOOS additionally adapts to the global smoothness of the value function. We assess the performance of PlaTγPOOS in terms of the simple regret, the expected loss resulting from choosing our algorithm's recommended action rather than an optimal one. We show that PlaTγPOOS is provably more efficient than OLOP when OLOP is given an overestimated reward range, and that in the noise-free case PlaTγPOOS learns exponentially faster than OLOP.
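
As a rough illustration of the simple regret mentioned in the abstract, the following Python sketch computes it in a toy deterministic environment. Everything in it is an assumption made for the example and does not come from the paper: the two-action environment, the reward function `mean_reward`, the discount `GAMMA`, and the truncation depth `HORIZON`. It uses noise-free mean rewards and exhaustive enumeration, so it shows only how the regret of a recommended first action is measured, not the planning algorithm introduced in the paper.

```python
from itertools import product

GAMMA = 0.5      # discount factor (illustrative value, not from the paper)
ACTIONS = (0, 1)  # hypothetical two-action environment
HORIZON = 3      # truncate the infinite discounted sum for illustration


def mean_reward(state, action):
    """Hypothetical mean reward: action 1 always pays more than action 0."""
    return 1.0 if action == 1 else 0.5


def step(state, action):
    """Deterministic dynamics: the state is the tuple of actions taken so far."""
    return state + (action,)


def sequence_value(actions):
    """Discounted sum of mean rewards along a fixed (open-loop) action sequence."""
    state, total = (), 0.0
    for t, action in enumerate(actions):
        total += GAMMA ** t * mean_reward(state, action)
        state = step(state, action)
    return total


def first_action_value(first):
    """Best truncated value among sequences that start with `first`."""
    return max(
        sequence_value((first,) + rest)
        for rest in product(ACTIONS, repeat=HORIZON - 1)
    )


def simple_regret(recommended):
    """Loss from recommending `recommended` instead of an optimal first action."""
    best = max(first_action_value(a) for a in ACTIONS)
    return best - first_action_value(recommended)


if __name__ == "__main__":
    for a in ACTIONS:
        print(a, simple_regret(a))
```

In this toy setting, recommending action 1 incurs zero simple regret, while recommending action 0 loses exactly the first-step reward gap, since both choices can be followed by the same best continuation.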

Research Area: AI & Machine Learning