Another Example of Value Iteration (Software Implementation)
Consider the same one-dimensional grid, with the same reward values, as in the first few problems in this vertical, but with the following change to the transition probabilities. At any grid location the agent can choose either to stay where it is or to move to an adjacent grid location. If the agent chooses to stay, the action succeeds with probability $1/2$; otherwise,
if the agent is at the leftmost or rightmost grid location, it ends up at its single neighboring grid location with probability $1/2$,
if the agent is at any of the inner grid locations, it ends up at each of its two neighboring locations with probability $1/4$.
If the agent chooses to move (either left or right) from any of the inner grid locations, the action succeeds with probability $1/3$, and with probability $2/3$ the agent fails to move. Moreover,
if the agent chooses to move left at the leftmost grid location, the action behaves exactly like choosing to stay: the agent remains at the leftmost grid location with probability $1/2$ and ends up at its neighboring grid location with probability $1/2$;
likewise, if the agent chooses to move right at the rightmost grid location, the action behaves exactly like choosing to stay: the agent remains at the rightmost grid location with probability $1/2$ and ends up at its neighboring grid location with probability $1/2$.
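For example, from an inner grid location, choosing to stay leaves the agent in place with probability $1/2$ and sends it to each of its two neighbors with probability $1/4$ ($1/2 + 1/4 + 1/4 = 1$), while choosing to move right sends it to the right neighbor with probability $1/3$ and leaves it in place with probability $2/3$ ($1/3 + 2/3 = 1$).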
Note that in this setting we assume the game does not halt after reaching the rightmost cell.
Let $\gamma = 0.5$.
Run the value iteration algorithm for 100 iterations. Use any computational software of your choice.
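The listing below implements value iteration directly from this description, with the five grid locations labeled 0 (leftmost) through 4 (rightmost). Starting from $V_0(s) = 0$ for every state, each iteration applies the Bellman optimality update
$$V_{k+1}(s) = \max_{a \in \{L,\,P,\,R\}} \sum_{s'} T(s, a, s')\,\bigl[R(s) + \gamma\, V_k(s')\bigr],$$
where $T(s, a, s')$ is the transition probability described above, $R(s)$ is the reward of the current state (as implemented in the function R below), and the action $P$ means 'stay'.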
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Apr 28 10:51:05 2023
@author: UniversoMath
"""
import numpy as np
def T(i_state, action, f_state):
    """
    Return the probability of reaching state f_state from state i_state
    when taking the given action.

    Parameters
    ----------
    i_state : int
        initial state
    action : string
        action ('L', 'P', 'R'): move left, stay, move right
    f_state : int
        final state

    Returns
    -------
    transition probability
    """
    inner = {1, 2, 3}   # inner grid locations
    # probabilities for the 'stay' action 'P'
    if (action == 'P') and (i_state == f_state):  # staying succeeds
        return 1/2
    elif (action == 'P') and (i_state == 0) and (f_state == 1):
        return 1/2
    elif (action == 'P') and (i_state == 4) and (f_state == 3):
        return 1/2
    elif (action == 'P') and (i_state in inner) and (np.abs(f_state - i_state) == 1):
        return 1/4
    # probabilities for the move actions 'L' and 'R'
    elif (action == 'L') and (i_state in inner) and (i_state - f_state == 1):
        return 1/3
    elif (action == 'R') and (i_state in inner) and (f_state - i_state == 1):
        return 1/3
    elif (action == 'R') and (i_state == 0) and (f_state == 1):
        return 1/3
    elif (action == 'L') and (i_state == 4) and (f_state == 3):
        return 1/3
    elif (action in {'L', 'R'}) and (i_state in inner) and (f_state == i_state):
        return 2/3
    elif (action == 'L') and (i_state == 4) and (f_state == 4):
        return 2/3
    elif (action == 'R') and (i_state == 0) and (f_state == 0):
        return 2/3
    # moving off the grid behaves exactly like choosing to stay
    elif (action == 'L') and (i_state == 0) and (f_state == 0):
        return 1/2
    elif (action == 'L') and (i_state == 0) and (f_state == 1):
        return 1/2
    elif (action == 'R') and (i_state == 4) and (f_state == 4):
        return 1/2
    elif (action == 'R') and (i_state == 4) and (f_state == 3):
        return 1/2
    else:
        return 0
def R(state):
    """Reward: 1 at the rightmost grid location (state 4), 0 elsewhere."""
    if state == 4:
        return 1
    else:
        return 0
def expresion2(s, action, states2, V_i):
    """One-step lookahead: sum over s' of T(s, a, s') * (R(s) + gamma * V_i[s'])."""
    n = 0
    for sp in states2:
        Re = R(s)
        Tr = T(s, action, sp)
        V_anterior = V_i[sp]
        n += Tr * (Re + gamma * V_anterior)
    return n
states2 = [0, 1, 2, 3, 4]
V_i = [0, 0, 0, 0, 0]   # initial value estimates V_0
gamma = 0.5

# value iteration: 100 iterations, as required by the problem statement
for i in range(100):
    V_f = [0, 0, 0, 0, 0]   # next value estimates V_{k+1}
    for s in states2:
        a = []
        for action in ['L', 'P', 'R']:
            a.append(expresion2(s, action, states2, V_i))
        b = np.max(a)
        V_f[s] = b
    V_i = V_f

print(V_f)
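As an optional cross-check, not part of the original listing, the same 100 updates can be carried out with NumPy arrays by first building one $5 \times 5$ transition matrix per action from T and then applying the Bellman update to the whole state vector at once. The sketch below reuses T, R and gamma defined above; the names P_mats, Rvec and Q are illustrative choices.

# Optional vectorized cross-check (sketch); reuses T, R and gamma from above.
# P_mats, Rvec and Q are illustrative names, not part of the original code.
import numpy as np

states = [0, 1, 2, 3, 4]
actions = ['L', 'P', 'R']

# P_mats[a][s, sp] = T(s, a, sp): one 5x5 transition matrix per action
P_mats = {a: np.array([[T(s, a, sp) for sp in states] for s in states])
          for a in actions}
Rvec = np.array([R(s) for s in states], dtype=float)

V = np.zeros(len(states))
for _ in range(100):
    # Q[a][s] = sum over s' of T(s, a, s') * (R(s) + gamma * V(s'))
    Q = np.stack([(P_mats[a] * (Rvec[:, None] + gamma * V[None, :])).sum(axis=1)
                  for a in actions])
    V = Q.max(axis=0)   # Bellman optimality update

print(V)   # should match V_f from the loop above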