This paper presents a framework for tuning continual exploration optimally. It first quantifies the rate of
exploration by defining the degree of exploration of a state as the entropy of the probability distribution over the admissible actions in that state. Then, the exploration/exploitation
tradeoff is stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost, while maintaining fixed degrees of exploration
at the nodes. In other words, “exploitation” is maximized for constant “exploration”. This formulation leads to a set of
nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be
proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations
reduce to the Bellman equations for finding the shortest path whereas, when exploration is maximal, a fully “blind” exploration is performed.
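
To make the entropy-based measure concrete, the following minimal sketch computes the degree of exploration of a single state from its action-selection probabilities. The function name `degree_of_exploration`, the use of natural logarithms, and the three example distributions are illustrative assumptions, not notation taken from the paper.

```python
import math


def degree_of_exploration(action_probs):
    """Shannon entropy (in nats) of a state's action-selection distribution.

    Hypothetical helper for illustration; the log base is an assumption.
    """
    if any(p < 0.0 for p in action_probs):
        raise ValueError("action probabilities must be non-negative")
    if abs(sum(action_probs) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    # Actions with zero probability contribute nothing to the entropy,
    # since p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


# Three illustrative policies over three admissible actions in one state.
greedy = [1.0, 0.0, 0.0]       # no exploration: one action is always chosen
mixed = [0.7, 0.2, 0.1]        # intermediate exploration
uniform = [1 / 3, 1 / 3, 1 / 3]  # "blind" exploration: all actions equally likely

entropy_greedy = degree_of_exploration(greedy)
entropy_mixed = degree_of_exploration(mixed)
entropy_uniform = degree_of_exploration(uniform)
```

Under these assumptions, the deterministic choice has zero entropy, the uniform choice attains the maximum value log(3) for three actions, and any other distribution falls strictly between the two. These are the two limiting regimes mentioned above.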