This talk presents some of our recent work on optimality-preserving operators on Q-functions. Our starting point is the consistent Bellman operator, a new operator for tabular representations. The consistent Bellman operator incorporates a notion of local policy consistency, which we show leads to an increase in the action gap at each state. In this talk I argue that increasing the action gap mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. I will present empirical results showing that this operator can also be applied to discretized continuous-space and continuous-time problems, where it performs remarkably well. We then extend the idea of a locally consistent operator and derive sufficient conditions for an operator to preserve optimality. This leads to a family of operators that includes our consistent Bellman operator. As corollaries of our main result, we obtain a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. To conclude, I will describe an empirical study on 57 Atari 2600 games illustrating the strong potential of these new operators.
Full text: http://arxiv.org/abs/1512.04860
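To make the core idea concrete, here is a minimal tabular sketch. It compares the standard Bellman optimality operator with the consistent Bellman operator from the paper, T_C Q(s,a) = T Q(s,a) - γ P(s | s,a) (max_b Q(s,b) - Q(s,a)), on a hypothetical two-state, two-action MDP (the transition matrix `P` and reward matrix `R` below are made-up illustrative values, not from the paper). At the fixed points, the greedy policies coincide while the action gap, the per-state difference between the best and second-best Q-values, is larger under the consistent operator.

```python
import numpy as np

# Toy MDP (illustrative values): P[s, a, s'] transition probabilities,
# R[s, a] rewards, gamma the discount factor.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.5, 0.5], [0.3, 0.7]],   # transitions from state 1
])
R = np.array([[1.0, 0.5],
              [0.2, 0.8]])

def bellman(Q):
    """Standard Bellman optimality operator:
    T Q(s,a) = R(s,a) + gamma * E_{s'}[max_b Q(s',b)]."""
    V = Q.max(axis=1)
    return R + gamma * (P @ V)

def consistent_bellman(Q):
    """Consistent Bellman operator (Bellemare et al., 2016):
    T_C Q(s,a) = T Q(s,a) - gamma * P(s | s,a) * (max_b Q(s,b) - Q(s,a)).
    The correction term, active on self-transitions, lowers the value of
    suboptimal actions and thereby widens the action gap."""
    V = Q.max(axis=1)
    self_p = np.stack([P[s, :, s] for s in range(P.shape[0])])  # P(s | s, a)
    return R + gamma * (P @ V) - gamma * self_p * (V[:, None] - Q)

def fixed_point(op, iters=1000):
    """Iterate the operator to (numerical) convergence; both operators
    are gamma-contractions in the sup norm on this tabular problem."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = op(Q)
    return Q

def action_gap(Q):
    """Per-state gap between the best and second-best action values."""
    s = np.sort(Q, axis=1)
    return s[:, -1] - s[:, -2]

Q_std = fixed_point(bellman)
Q_con = fixed_point(consistent_bellman)

# Optimality is preserved: the greedy policies agree...
print(Q_std.argmax(axis=1), Q_con.argmax(axis=1))
# ...while the action gap increases at every state.
print(action_gap(Q_std), action_gap(Q_con))
```

Advantage learning, recovered as a corollary in the paper, fits the same template: it replaces the self-transition coefficient γ P(s | s,a) with a constant α, giving T_AL Q(s,a) = T Q(s,a) - α (max_b Q(s,b) - Q(s,a)).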