## Thursday 17 May 2007

### Simulation of Markov Decision Processes with enhanced generators in Python (with blood)

The language war will go on forever, and I consider it my duty to promote the Python programming language no matter what happens (even though I know there is no silver bullet). Thus, I have to use every opportunity to show the world how easy it is to connect formalisms of high scientific importance with not-so-important-but-still-cool language features.

Whatever.

Markov Decision Processes are big in machine learning. The whole field of reinforcement learning is based heavily on them since they are a good way to formalize environments in which an agent acts.

Python 2.5 brought a powerful new feature to the Python community: enhanced generators. If you want to give a function state, you might consider overriding the __call__ method of a class - or you might go an even more straightforward way by using a generator function. If you are not familiar with Python's generators, you can read about them here.
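For comparison, the __call__ route mentioned above might look like this - a minimal sketch with a hypothetical Adder class that keeps a running total (the generator version of the same idea follows below):

```
class Adder(object):
    """A callable object that keeps a running total as its state."""
    def __init__(self, n=0):
        self.n = n

    def __call__(self, inc):
        self.n += inc
        return self.n

a = Adder()
a(2)   # 2
a(3)   # 5
```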

Sending values into the generator is somewhat more complex than it has to be if you want to send in a value every time you retrieve one. For example, you might want to have an adder that adds up numbers and always returns the current result:

```
def adder(n=0):
    while True:
        inc = (yield None)
        n += inc
        (yield n)
```
In the REPL, that gives us:
```
>>> a = adder()
>>> a.next()  # You have to forward the generator once
>>> a.send(2) # Sending in a value gives you a result
2
>>> a.next()  # Forward again
>>> a.send(3)
5
```
This can be eased with the following coroutine decorator:
```
def coroutine(func):
    def wrapper(*args, **kwargs):
        f = func(*args, **kwargs)
        f.next()    # Advance to the first yield
        def inner(message):
            # StopIteration might be raised here, which is okay
            r = f.send(message)
            f.next()
            return r
        return inner
    return wrapper
```
```
>>> coAdder = coroutine(adder)
>>> a = coAdder()
>>> a(2)
2
>>> a(3)
5
```

We can use these features to simulate a Markov Decision Process. First, let's recall a somewhat simplified definition: A Markov Decision Process (MDP) is defined by

• A set of states
• A set of actions
• Transition probabilities P(s, a, s') that describe the probability of transferring from state s to state s' if action a is taken
• An initial state from the set of states

You can think of an MDP as a something that has a state and that you feed with actions - upon every action, the MDP transfers into a (possibly) new state. Let's have a look at an example:

Quentin lives together with his friends Bill and Beatrix. Bill and Beatrix have been going out with each other for some years, but today they are having a serious argument. They run around the flat and scream at each other, swing their Hattori Hanzo blades and are about to cover the whole apartment in blood, gore and bones. Quentin is interested in not being in the same room as them, since his two daikatanas are at the polisher's. Bill and Beatrix keep chasing each other around the flat, having a movie-like fight. The apartment is a little small and only has two rooms. Quentin has two actions: he can either switch rooms or stay in the one he is in at the moment.

There are two states - either Quentin is in the same room as the couple or he is in the other room. The fact that there are two physically different rooms does not matter for now, since we consider them distinguishable only by who is in them.

At every time step, the couple has a probability of 30% of switching rooms. This leads to the following transition probabilities:

• Same room, switch: 0.7 other room, 0.3 same room
• Same room, stay: 0.3 other room, 0.7 same room
• Other room, switch: 0.3 other room, 0.7 same room
• Other room, stay: 0.7 other room, 0.3 same room

Of course, the best strategy for Quentin would be to switch rooms if the couple is in the same room and to stay if they are in the other room. The uber strategy "Leaving the flat" is not considered, for educational reasons. ;)

How can this be implemented as a generator function? First we should think about how to represent the transition probabilities in Python, which can be done extremely easily with a dictionary:

```
transitions = {
    ('same room', 'switch'):  ((0.3, 'same room'), (0.7, 'other room')),
    ('same room', 'stay'):    ((0.7, 'same room'), (0.3, 'other room')),
    ('other room', 'switch'): ((0.7, 'same room'), (0.3, 'other room')),
    ('other room', 'stay'):   ((0.3, 'same room'), (0.7, 'other room'))}
```

This dictionary maps each (state, action) pair to a tuple of (probability, state') pairs, where each probability is the chance that the next state is state'. The sets of states and actions are given implicitly by the transition probabilities.

To pick an item from a list depending on a probability, we can use this handy function:

```
from random import random

def randomPick(iterable):
    """Return one item of an iterable which consists of pairs (probability, item).

    The iterable is consumed."""
    # The probabilities are not checked to sum up to one; this is left
    # to the programmer, so the iterable only has to be consumed once
    s, i = 0.0, random()
    for prob, item in iterable:
        s += prob
        if i < s:
            return item
```
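As a quick sanity check (not part of the original post), we can draw a large number of samples and verify that the empirical frequencies come out near the given probabilities; randomPick is repeated here so the snippet runs on its own:

```
from random import random

def randomPick(iterable):
    """Return one item of an iterable of (probability, item) pairs."""
    s, i = 0.0, random()
    for prob, item in iterable:
        s += prob
        if i < s:
            return item

# Sample 10000 times from a 30/70 distribution and count the outcomes.
counts = {'same room': 0, 'other room': 0}
for _ in range(10000):
    pick = randomPick(((0.3, 'same room'), (0.7, 'other room')))
    counts[pick] += 1

# counts['same room'] should come out roughly as 3000.
```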

Now all we have to do to get our MDP is a very simple generator function:

```
@coroutine
def markovProcess(transDict, initialState):
    state = initialState
    while True:
        action = (yield None)
        state = randomPick(transDict[state, action])
        (yield state)
```

And this is as easy as it gets. This could of course be enhanced by using a function instead of a dictionary for the transition probabilities. But since the transitions are passed by reference and not by value, you can change the dictionary from the outside and the MDP will reflect this.
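To round things off, here is a self-contained simulation of Quentin's obvious policy - switch when the couple is in the same room, stay otherwise. One liberty taken: it uses the next() builtin (available from Python 2.6 on) instead of the .next() method, so the snippet also runs on modern Python:

```
from random import random

def coroutine(func):
    def wrapper(*args, **kwargs):
        f = func(*args, **kwargs)
        next(f)                     # Advance to the first yield
        def inner(message):
            r = f.send(message)
            next(f)
            return r
        return inner
    return wrapper

def randomPick(iterable):
    """Return one item of an iterable of (probability, item) pairs."""
    s, i = 0.0, random()
    for prob, item in iterable:
        s += prob
        if i < s:
            return item

@coroutine
def markovProcess(transDict, initialState):
    state = initialState
    while True:
        action = (yield None)
        state = randomPick(transDict[state, action])
        (yield state)

transitions = {
    ('same room', 'switch'):  ((0.3, 'same room'), (0.7, 'other room')),
    ('same room', 'stay'):    ((0.7, 'same room'), (0.3, 'other room')),
    ('other room', 'switch'): ((0.7, 'same room'), (0.3, 'other room')),
    ('other room', 'stay'):   ((0.3, 'same room'), (0.7, 'other room'))}

mdp = markovProcess(transitions, 'same room')

# Quentin's policy: switch rooms if the couple is there, stay otherwise.
state, bloody_steps = 'same room', 0
for _ in range(1000):
    action = 'switch' if state == 'same room' else 'stay'
    state = mdp(action)
    if state == 'same room':
        bloody_steps += 1

# With this policy Quentin shares the room roughly 30% of the time.
```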

#### 1 comment:

Anonymous said...

Neat idea. It's good to see someone else using Python for AI. :) FWIW, I wrote a similar article a while back on creating robot behaviors with Python generators.