This article proposes several two-timescale simulation-based actor-critic algorithms for solution of infinite horizon Markov
Decision Processes with finite state-space under the average cost criterion. Two of the algorithms are for the compact (non-discrete)
action setting while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient
search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient
estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and
an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented.
Next, we discuss a memory efficient implementation that uses a feature-based representation of the state-space and performs
TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed
to do well on our setting. Numerical experiments on a problem of rate based flow control are presented using the proposed
algorithms. We consider here the model of a single bottleneck node in the continuous time queueing framework. We show performance
comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (
1999) and Bhatnagar and Kumar (
2004). Our algorithms exhibit more than an order of magnitude better performance over those of Konda and Borkar (
1999).
Keywords Actor-critic algorithms - Two timescale stochastic approximation - Markov decision processes - Policy iteration - Simultaneous perturbation stochastic approximation - Normalized Hadamard matrices - Reinforcement learning - TD-learning
This work was supported in part by Grant no. SR/S3/EE/43/2002-SERC-Engg from the Department of Science and Technology, Government
of India.