The central problem of reinforcement learning is to fit the parameters of an agent's policy in order to make the agent chose "good" decisions according to some unknown objective.

Let's consider a very simple case. We have an MDP which is basically stateless. So to say, all the world is doing is to map the actions of an agent directly to a reward. Consider this function to be a mixture of two gaussians, with one mode at -3 and one at 3. Say we pick a single mode gaussian as a model of the world.

After fitting our model to our data, gathered via a rollout, weighting it by the reward - we will get a gaussian that has a mode somewhere around 0.

A better strategy for an agent would be to just stick to one of the modes. Even though this model of the world is not more correct, it does lead to a better policy. Picking 0 as an action will lead to a reward of approximately zero while always picking -3 or always picking 3 will always return 0.5 as a reward.

Under the assumption of a single mode gaussian and via the maximum likelihood approach, a non optimal solution is picked.

Of course, there are superior methods. For example, one could estimate the gradient and just move up to one of the modes. But what happens if the gradient information is incomplete? This can happen if our world has time dependencies which are not modelled (e.g. non markovian environments). It can also happen if our observations are not complete but only partial (as in POMDPs). In that case, direct methods are prone to fail, again.

Imagine our world to be only accessible through a proxy, which happens to turn the complete state information into an observation via a many-to-one mapping. For example, two different states s and s' map to the same observation o. However, the rewards r(s) and r(s') are completely different, say -1 and 1. In that case our sampling will lead to a completely wrong impression.

Stochastic search does not suffer from this problem. A direct method uses an approximation of the world's dynamics consisting of the observations and the rewards as well as the current agent's policy in order to generate a new (hopefully better) policy. A simple hill climber uses only the current policy and its expected reward. Since the non-use of local information does keep it from making wrong decisions due to wrong information, it is in theory more robust if the environment has a lot of wrong local information.

## 1 comment:

Hi Justin,

How's the diploma thesis going? I haven't heard from you for quite a bit, but I stumbled accross your page quite by chance, you discusss quite an interesting topic here. One topic which has occupied me lately has been metrics and aggregate functions, which I assume are very important in machine learning. Currently there is no defacto metric for evaluating how well medical images are aligned to each other (registered), although there has been much research on this topic there is still no standard method avaailabe to align one medical image to another (independent of modality of course). From my perspective it is more important to develop an optimal metric than it is to develop an optimal search method. My question is, how you do you know your similarity measure that you use delivers an optimum for the problem you wish to solve using machine learning methods? Frohes Schaffen with your thesis,

best regards,

Brian

Post a Comment