The central problem of reinforcement learning is to fit the parameters of an agent's policy in order to make the agent chose "good" decisions according to some unknown objective.

Let's consider a very simple case. We have an MDP which is basically stateless. So to say, all the world is doing is to map the actions of an agent directly to a reward. Consider this function to be a mixture of two gaussians, with one mode at -3 and one at 3. Say we pick a single mode gaussian as a model of the world.

After fitting our model to our data, gathered via a rollout, weighting it by the reward - we will get a gaussian that has a mode somewhere around 0.

A better strategy for an agent would be to just stick to one of the modes. Even though this model of the world is not more correct, it does lead to a better policy. Picking 0 as an action will lead to a reward of approximately zero while always picking -3 or always picking 3 will always return 0.5 as a reward.

Under the assumption of a single mode gaussian and via the maximum likelihood approach, a non optimal solution is picked.

Of course, there are superior methods. For example, one could estimate the gradient and just move up to one of the modes. But what happens if the gradient information is incomplete? This can happen if our world has time dependencies which are not modelled (e.g. non markovian environments). It can also happen if our observations are not complete but only partial (as in POMDPs). In that case, direct methods are prone to fail, again.

Imagine our world to be only accessible through a proxy, which happens to turn the complete state information into an observation via a many-to-one mapping. For example, two different states s and s' map to the same observation o. However, the rewards r(s) and r(s') are completely different, say -1 and 1. In that case our sampling will lead to a completely wrong impression.

Stochastic search does not suffer from this problem. A direct method uses an approximation of the world's dynamics consisting of the observations and the rewards as well as the current agent's policy in order to generate a new (hopefully better) policy. A simple hill climber uses only the current policy and its expected reward. Since the non-use of local information does keep it from making wrong decisions due to wrong information, it is in theory more robust if the environment has a lot of wrong local information.