RL-VI
[Starter Notebook] RL - Value Iteration
This is a getting-started notebook for the Value Iteration problem in the RL course. It contains basic instructions for using the notebook to make submissions, as well as the tasks to perform and questions to answer. Please read the instructions carefully before proceeding, and create your own copy of the notebook before you start working with it.
Happy Solving!😀
What is the notebook about?¶
Problem - Value Iteration¶
This problem deals with a grid world and stochastic actions. The tasks you have to do are:
- Complete the Environment
- Write code for Value Iteration
- Visualize Results
- Explain the results
How to use this notebook? 📝¶
This is a shared template and any edits you make here will not be saved. You should make a copy in your own drive: click the "File" menu (top-left), then "Save a Copy in Drive". Work in your copy however you like.
Update the config parameters. You can define the common variables here:

Variable | Description
---|---
`AICROWD_DATASET_PATH` | Path to the file containing test data. This should be an absolute path.
`AICROWD_RESULTS_DIR` | Path to write the output to.
`AICROWD_ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
- Installing packages: please use the Install packages 🗃 section for any package installations.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -U git+https://gitlab.aicrowd.com/aicrowd/aicrowd-cli.git@notebook-submission-v2 > /dev/null
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters.
import os
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/9db39385-0a4b-47db-8d20-fffb0480e47e_hw2_q2.zip")
AICROWD_RESULTS_DIR = os.getenv("OUTPUTS_DIR", "results")
API_KEY = "3df984307aad79642a8beea9e32c5d23" # Get your key from https://www.aicrowd.com/participants/me (ctrl + click the link)
!aicrowd login --api-key $API_KEY
!aicrowd dataset download -c rl-vi
!unzip $AICROWD_DATASET_PATH
DATASET_DIR = 'hw2_q2/'
!mkdir {DATASET_DIR}results/
Install packages 🗃¶
Please add all package installations in this section
Import packages 💻¶
import numpy as np
import matplotlib.pyplot as plt
import os
# ADD ANY IMPORTS YOU WANT HERE
Task 0 - Complete the environment¶
You need to complete the part of the environment that calculates the possible next states, their probabilities, and the rewards.
class GridEnv_HW2:
    def __init__(self,
                 goal_location,
                 action_stochasticity,
                 non_terminal_reward,
                 terminal_reward,
                 grey_in,
                 brown_in,
                 grey_out,
                 brown_out
                 ):

        # Do not edit this section
        self.action_stochasticity = action_stochasticity
        self.non_terminal_reward = non_terminal_reward
        self.terminal_reward = terminal_reward
        self.grid_size = [10, 10]

        # Index of the actions
        self.actions = {'N': (-1, 0),
                        'E': (0, 1),
                        'S': (1, 0),
                        'W': (0, -1)}
        # Do not worry about the names not matching the direction you expect
        # Think of them as generic names and use the mapping to get the action direction and stochasticity
        self.perpendicular_order = ['N', 'E', 'S', 'W']

        l = ['normal' for _ in range(self.grid_size[0])]
        self.grid = np.array([l for _ in range(self.grid_size[1])], dtype=object)

        self.grid[goal_location[0], goal_location[1]] = 'goal'
        self.goal_location = goal_location

        for gi in grey_in:
            self.grid[gi[0], gi[1]] = 'grey_in'

        for bi in brown_in:
            self.grid[bi[0], bi[1]] = 'brown_in'

        self.grey_out = go = grey_out
        self.brown_out = bo = brown_out

        self.grid[go[0], go[1]] = 'grey_out'
        self.grid[bo[0], bo[1]] = 'brown_out'

        self.states_sanity_check()

    def states_sanity_check(self):
        """ Implement to prevent cases where the goal gets overwritten, etc. """
        pass

    def visualize_grid(self):
        pass

    def _out_of_grid(self, state):
        """ Returns True if the given state lies outside the 10x10 grid. """
        if state[0] < 0 or state[1] < 0:
            return True
        elif state[0] > self.grid_size[0] - 1:
            return True
        elif state[1] > self.grid_size[1] - 1:
            return True
        else:
            return False

    def _grid_state(self, state):
        """ Returns the cell type ('normal', 'goal', 'grey_in', ...) at the given state. """
        return self.grid[state[0], state[1]]

    def get_transition_probabilites_and_reward(self, state, action):
        """
        Returns the probability of all possible transitions for the given action in the form:
        A list of tuples of (next_state, probability, reward)
        Note that based on the state and action there can be several different next states.
        Unless the state is terminal, all the probabilities of next states should add up to 1.
        """

        grid_state = self._grid_state(state)

        if grid_state == 'goal':
            return [(self.goal_location, 1.0, 0.0)]
        elif grid_state == 'grey_in':
            return [(self.grey_out, 1.0, self.non_terminal_reward)]
        elif grid_state == 'brown_in':
            return [(self.brown_out, 1.0, self.non_terminal_reward)]

        direction = self.actions.get(action, None)
        if direction is None:
            raise ValueError("Invalid action %s, please select among" % action, list(self.actions.keys()))

        nextstates_prob_rews = []

        # TASK 0 - Complete the environment
        # ADD YOUR CODE BELOW - DO NOT EDIT ABOVE THIS LINE
        # Hints:
        # Get access to all actions with self.actions
        # Use self.action_stochasticity for the probabilities of the other actions
        # The array holds probabilities for [0, 90, 180, -90] degrees relative to the chosen action
        # So self.action_stochasticity = [0.8, 0.1, 0.0, 0.1] means 0.8 for forward and 0.1 each for left and right
        # Remember that you need to return a list of tuples of the form (next_state, probability, reward)
        # If you have 3 possible next states, you should return [(ns1, p1, r1), (ns2, p2, r2), (ns3, p3, r3)]
        # Use the helper function self._out_of_grid to check if any state is outside the grid
        # Important Note:
        # Do not hard code any state locations, they may be changed in the submissions

        # DO NOT EDIT BELOW THIS LINE
        return nextstates_prob_rews
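To make the stochasticity hint concrete, here is a minimal standalone sketch (not part of the required implementation, and independent of the env class above). It assumes that each 90° step corresponds to moving one position through perpendicular_order, so action_stochasticity = [0.8, 0.1, 0.0, 0.1] spreads probability over the chosen direction and the directions 90°, 180° and -90° away from it. The variable names below (chosen, offset, prob) are illustrative only.

# Illustrative sketch: how the chosen action and its rotations line up with action_stochasticity
actions = {'N': (-1, 0), 'E': (0, 1), 'S': (1, 0), 'W': (0, -1)}
perpendicular_order = ['N', 'E', 'S', 'W']
action_stochasticity = [0.8, 0.1, 0.0, 0.1]  # probabilities for [0, 90, 180, -90] degrees

chosen = 'E'
start = perpendicular_order.index(chosen)
for offset, prob in enumerate(action_stochasticity):
    # Step `offset` positions through perpendicular_order to get the rotated direction
    name = perpendicular_order[(start + offset) % 4]
    direction = actions[name]
    print(name, direction, prob)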
Question - When do you decide to stop value iteration¶
Modify this cell and add your answer
Task 1¶
a) Implement Value iteration¶
def value_iter(env):
    value_grid = np.zeros((10, 10))        # Marked as J(s) in the homework pdf
    policy = np.zeros((10, 10), np.int32)  # Marked as pi(s) in the homework pdf

    value_grids = []  # Store all the J(s) grids at every iteration in this list
    policies = []     # Store all the pi(s) grids at every iteration in this list

    # ADD YOUR CODE BELOW - DO NOT EDIT ABOVE THIS LINE
    # Important Note:
    # The action names are strings but the expected output is an integer array
    # To get the corresponding mapping of integer values use -> env.perpendicular_order
    # So if your action is 'E'
    # Your result would be: env.perpendicular_order.index('E') -> 1

    # DO NOT EDIT BELOW THIS LINE
    results = {"value_grid": value_grid, "pi_s": policy}
    return results, value_grids, policies
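As a reference for the backup step itself, here is a minimal standalone sketch of value iteration on a tiny two-state MDP. The transition table P uses the same (next_state, probability, reward) format returned by get_transition_probabilites_and_reward; the states, actions, discount gamma and number of sweeps are made up for illustration and are not part of the assignment.

# Toy value-iteration sketch on a 2-state MDP (illustrative only, not the grid world above)
P = {
    (0, 'stay'): [(0, 1.0, 0.0)],
    (0, 'go'):   [(1, 0.9, 1.0), (0, 0.1, 0.0)],
    (1, 'stay'): [(1, 1.0, 0.0)],
    (1, 'go'):   [(0, 1.0, 0.5)],
}
states, actions, gamma = [0, 1], ['stay', 'go'], 0.9

J = {s: 0.0 for s in states}
for sweep in range(50):  # a fixed number of sweeps, for illustration
    J_new = {}
    for s in states:
        # Bellman backup: best expected reward plus discounted value over actions
        J_new[s] = max(sum(p * (r + gamma * J[ns]) for ns, p, r in P[(s, a)])
                       for a in actions)
    J = J_new

# Greedy policy with respect to the final J
pi = {s: max(actions, key=lambda a: sum(p * (r + gamma * J[ns]) for ns, p, r in P[(s, a)]))
      for s in states}
print(J, pi)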
Here is an example of what the "results" output from the value_iter function should look like¶
The numpy array may look different from the image given in the pdf because the origins are different; do not worry about this, use the indexing and directions provided by the env class (env.actions) and do not hard code anything.
Of course, it won't be all zeros.
{'value_grid': array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
'pi_s': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)}
# DO NOT EDIT THIS CELL, DURING EVALUATION THE DATASET DIR WILL CHANGE
!mkdir $AICROWD_RESULTS_DIR
input_dir = os.path.join(DATASET_DIR, 'inputs')
for params_file in os.listdir(input_dir):
    kwargs = np.load(os.path.join(input_dir, params_file), allow_pickle=True).item()
    env = GridEnv_HW2(**kwargs)
    results, value_grids, policies = value_iter(env)
    idx = params_file.split('_')[-1][:-4]
    np.save(os.path.join(AICROWD_RESULTS_DIR, 'results_' + idx), results)
Task 2 - The value iteration loop goes to infinity (refer to the pseudocode given above), so when would you stop your value iteration?¶
Modify this cell and add your answer
Task 3- Plot graph of $||J_{i+1}(s) - J_i(s)||$¶
Example plotting code is provided below, but you can change it if you want.
import matplotlib.pyplot as plt

diffs = []
for ii in range(len(value_grids) - 1):
    # Norm of the change in J(s) between successive iterations
    diff = np.linalg.norm(value_grids[ii + 1] - value_grids[ii])
    diffs.append(diff)

plt.plot(diffs)
plt.xlabel('Iteration')
plt.ylabel('||J_{i+1}(s) - J_i(s)||')
Task 4 - Show $J(s)$ and $\pi(s)$ after 10, 25 and the final iteration.¶
# Use any visualization code you want to show the value and policies
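If you want a starting point, here is a minimal sketch of one possible visualization (not a required format). It assumes the `results` dictionary and an `env` instance from the evaluation loop above are still defined; it shows J(s) as a heatmap and labels each cell with its greedy action name via env.perpendicular_order (remember these names are generic and need not match compass directions).

# Illustrative visualization sketch (assumes `results` and `env` from the cells above)
value_grid = results['value_grid']
pi_s = results['pi_s']

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(value_grid)  # heatmap of J(s)
fig.colorbar(im, ax=ax)
for r in range(pi_s.shape[0]):
    for c in range(pi_s.shape[1]):
        # Label each cell with its greedy action name
        ax.text(c, r, env.perpendicular_order[pi_s[r, c]],
                ha='center', va='center', color='white', fontsize=8)
ax.set_title('J(s) with greedy policy pi(s)')
plt.show()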
Task 5 - Consider a new gridworld (GridWorld-2) as shown in Figure 2 (GridWorld-2 differs from GridWorld-1 only in the position of the "Goal" state). Compare and contrast the behavior of $J$ and the greedy policy $\pi$ for GridWorld-1 and GridWorld-2¶
Modify this cell and add your answer
Submit to AIcrowd 🚀¶
!DATASET_PATH=$AICROWD_DATASET_PATH aicrowd notebook submit -c rl-vi -a assets