Reinforcement learning methods for discrete and semi-Markov decision problems, such as Real-Time Dynamic Programming, can be generalized to Controlled Diffusion Processes. The optimal control problem reduces to a boundary value problem for a fully nonlinear second-order elliptic differential equation of Hamilton-Jacobi-Bellman (HJB) type. Numerical analysis provides multi-grid methods for this kind of equation. In the case of Learning Control, however, the systems of equations on the various grid levels are obtained from observed information (transitions and local cost). To ensure consistency, special attention must be paid to the type of time and space discretization used during observation. An algorithm for multi-grid observation is proposed and demonstrated on a simple queuing problem.
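For context, the equation in question has a standard form. For a controlled diffusion dX_t = b(X_t, u_t) dt + σ(X_t, u_t) dW_t on a bounded domain Ω, with running cost c, discount rate ρ > 0, and boundary cost g, the value function V solves (this is the generic textbook form, not necessarily the paper's exact formulation):

\[
\rho V(x) \;=\; \min_{u \in U}\Big\{\, c(x,u) \;+\; b(x,u)\cdot\nabla V(x) \;+\; \tfrac{1}{2}\,\operatorname{tr}\!\big(\sigma(x,u)\,\sigma(x,u)^{\top}\,\nabla^2 V(x)\big) \Big\}, \qquad x \in \Omega,
\]

with V = g on ∂Ω. The minimization over the control u is what makes the equation fully nonlinear, and hence what the multi-grid solver has to accommodate.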
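For reference, here is a minimal sketch of tabular Real-Time Dynamic Programming, the discrete method the abstract takes as its starting point. Everything in it is an illustrative assumption: the transition model P and cost table c are taken as given, whereas in the Learning Control setting described above they would instead be estimated from observed transitions and local costs.

```python
import numpy as np

def rtdp(P, c, goal, gamma=0.99, n_trials=200, max_steps=500, rng=None):
    """Tabular Real-Time Dynamic Programming (Barto, Bradtke & Singh, 1995):
    asynchronous value iteration that backs up only the states visited
    along simulated trajectories.

    P    -- transition probabilities, shape (n_actions, n_states, n_states)
    c    -- immediate costs, shape (n_actions, n_states)
    goal -- set of absorbing goal states
    """
    if rng is None:
        rng = np.random.default_rng()
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                         # value-function table
    for _ in range(n_trials):
        s = int(rng.integers(n_states))            # random start state
        for _ in range(max_steps):
            if s in goal:
                break
            q = c[:, s] + gamma * P[:, s, :] @ V   # full Bellman backup at s
            a = int(np.argmin(q))                  # greedy (cost-minimizing) action
            V[s] = q[a]                            # update only the visited state
            s = int(rng.choice(n_states, p=P[a, s]))  # simulate one transition
    return V
```

A multi-grid variant would maintain such value tables on several discretization levels of the state space; the consistency issue raised in the abstract concerns how the observed transitions and costs must be collected so that the equations assembled on the coarser grids remain valid.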