Policy Gradient methods are model-free reinforcement learning algorithms that in recent years have been successfully applied
to many real-world problems. Typically, Likelihood Ratio (LR) methods are used to estimate the gradient, but they suffer from
high variance due to random exploration at every time step of each training episode. Our solution to this problem is to introduce
a state-dependent exploration function (SDE) that, within an episode, returns the same action for any given state. This results
in lower variance per episode and faster convergence. SDE also finds solutions overlooked by other methods, and even improves
upon state-of-the-art gradient estimators such as Natural Actor-Critic. We systematically derive SDE and apply it to several
illustrative toy problems and a challenging robotics simulation task, where SDE greatly outperforms random exploration.
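
The contrast between per-step random exploration and state-dependent exploration can be illustrated with a minimal sketch. It assumes a linear controller and a linear exploration function whose parameters are drawn from a zero-mean Gaussian once per episode; these choices, and all names below (`random_exploration_action`, `StateDependentExploration`, `theta_hat`, `sigma`), are illustrative assumptions rather than the formulation derived in the paper.

```python
import random


def random_exploration_action(state, theta, sigma, rng):
    """Per-step exploration: fresh Gaussian noise is added on every call."""
    mean = sum(t * s for t, s in zip(theta, state))
    return mean + rng.gauss(0.0, sigma)


class StateDependentExploration:
    """Hypothetical sketch: exploration parameters are redrawn once per episode."""

    def __init__(self, theta, sigma, rng):
        self.theta = list(theta)   # controller parameters (assumed linear)
        self.sigma = sigma         # exploration width (assumed Gaussian)
        self.rng = rng
        self.theta_hat = [0.0] * len(self.theta)

    def start_episode(self):
        # Draw the exploration parameters once; they stay fixed until
        # the next call to start_episode().
        self.theta_hat = [self.rng.gauss(0.0, self.sigma) for _ in self.theta]

    def action(self, state):
        # Deterministic within an episode: the exploration offset depends
        # only on the state and the episode's theta_hat.
        mean = sum(t * s for t, s in zip(self.theta, state))
        offset = sum(th * s for th, s in zip(self.theta_hat, state))
        return mean + offset


rng = random.Random(0)
sde = StateDependentExploration(theta=[0.5, -0.2], sigma=0.1, rng=rng)
sde.start_episode()
state = [1.0, 2.0]
same_within_episode = sde.action(state) == sde.action(state)
```

In this sketch, repeated visits to the same state within an episode yield the same action, while a new call to `start_episode()` changes the exploration function. The per-step variant instead returns a different action on each call, even for an identical state.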