A Markov Decision Process (MDP) implementation that uses value iteration and policy iteration to compute the optimal policy.
The provided inputs are a 4x3 world, a 5x5 world, and an 11x11 world; they can easily be modified as desired. A hedged usage sketch follows the world definitions below.
4x3 world:
([[-0.04, -0.04, -0.04, +1],
[-0.04, None, -0.04, -1],
[-0.04, -0.04, -0.04, -0.04]],
terminals=[(3, 2), (3, 1)])
5x5 world:
([[-0.04, -0.04, -0.04, -0.04, +1],
[-0.04, -0.04, -0.04, -0.04, -1],
[-0.04, -0.04, None, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04]],
terminals=[(4, 4), (4, 3)])
11x11 world:
([[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, +1],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -1],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, None, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, None, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, None, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04],
[-0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04, -0.04]],
terminals=[(10, 10), (10, 9)])
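
Below is a minimal, self-contained sketch of how a world in this format (per-cell rewards, None for walls, a terminals list of (x, y) coordinates counted from the bottom-left) can be solved with value iteration. It assumes the classic stochastic motion model (0.8 probability of the intended move, 0.1 for each perpendicular move); the function and parameter names here are illustrative and may differ from the actual code in this repository.

```python
def value_iteration(grid, terminals, gamma=0.9, epsilon=1e-3):
    """Return a dict mapping (x, y) -> utility for a grid world.

    Assumes the input format shown above: grid[0] is the top row,
    None marks an unreachable cell, and terminals use (x, y) with
    (0, 0) at the bottom-left corner.
    """
    rows, cols = len(grid), len(grid[0])
    states = {(x, y) for y in range(rows) for x in range(cols)
              if grid[rows - 1 - y][x] is not None}
    actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

    def reward(s):
        x, y = s
        return grid[rows - 1 - y][x]

    def transitions(s, a):
        # Assumed motion model: 0.8 intended direction, 0.1 for each
        # perpendicular direction; moves into a wall or off the grid
        # leave the agent in place.
        def go(d):
            target = (s[0] + d[0], s[1] + d[1])
            return target if target in states else s
        perp1, perp2 = (-a[1], a[0]), (a[1], -a[0])
        return [(0.8, go(a)), (0.1, go(perp1)), (0.1, go(perp2))]

    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        U_next = {}
        for s in states:
            if s in terminals:
                U_next[s] = reward(s)  # a terminal's utility is its reward
            else:
                U_next[s] = reward(s) + gamma * max(
                    sum(p * U[t] for p, t in transitions(s, a))
                    for a in actions)
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:  # standard stopping bound
            return U
```

For example, running the sketch on the 4x3 world above:

```python
grid_4x3 = [[-0.04, -0.04, -0.04, +1],
            [-0.04, None,  -0.04, -1],
            [-0.04, -0.04, -0.04, -0.04]]
U = value_iteration(grid_4x3, terminals=[(3, 2), (3, 1)])
print(U[(0, 0)])  # utility of the bottom-left start state
```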