We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting when the
goal is to find an action-value function with the smallest Bellman error among a countable set of candidates functions. We
propose a complexity regularization-based model selection algorithm,
BERMIN\ensuremath{\mbox{\textsc {BErMin}}}, and prove that it enjoys an oracle-like property: the estimator’s error differs from that of an oracle, who selects the
candidate with the minimum Bellman error, by only a constant factor and a small remainder term that vanishes at a parametric
rate as the number of samples increases. As an application, we consider a problem when the true action-value function belongs
to an unknown member of a nested sequence of function spaces. We show that under some additional technical conditions
BERMIN\ensuremath{\mbox{\textsc {BErMin}}} leads to a procedure whose rate of convergence, up to a constant factor, matches that of an oracle who knows which of the
nested function spaces the true action-value function belongs to, i.e., the procedure achieves
adaptivity.
Keywords Reinforcement learning – Model selection – Complexity regularization – Adaptivity – Offline learning – Off-policy learning – Finite-sample bounds
Editors: S. Whiteson and M. Littman.