NeurIPS 2023 - The Neural MMO Challenge
Preparing the pkl submission file for a custom policy
Look at "create_custom_policy_pt()" function and modify as you see fit.
Open in Colab doesn't seem to work. Click this url instead: https://colab.research.google.com/drive/1UVQgJGgTOwphb2F9cXj1KZsrPCri5gYs
Set up your instance - gpu and google drive¶
In [1]:
# Check that an (NVIDIA) GPU is available
import torch
assert torch.cuda.is_available(), "CUDA GPU not available"
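If you want to confirm which GPU Colab assigned you, you can also print the device name (optional check, not required for the rest of the notebook):
In [ ]:
# Optional: print the name of the visible CUDA device
print(torch.cuda.get_device_name(0))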
In [2]:
# Set up the work directory
import os
assert os.path.exists("/content/drive/MyDrive"), "Google Drive not mounted"
work_dir = "/content/drive/MyDrive/nmmo/"
Train your agent¶
Install nmmo env and pufferlib¶
In [3]:
# Install nmmo env
!pip install pufferlib nmmo > /dev/null
!pip show nmmo # should be 2.0.3
!pip show pufferlib # should be 0.4.5
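If you prefer a programmatic check over reading the pip show output, here is a minimal sketch using the versions noted above:
In [ ]:
# Optional: verify the installed versions match the ones this notebook was tested with
from importlib.metadata import version
assert version("nmmo") == "2.0.3", version("nmmo")
assert version("pufferlib") == "0.4.5", version("pufferlib")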
Install the baselines¶
In [ ]:
# Create the work directory, download the baselines code
%mkdir $work_dir
%cd $work_dir
!git clone https://github.com/carperai/nmmo-baselines baselines --depth=1
In [4]:
# Install libs to run the baselines
%cd $work_dir
%cd baselines
# Create a requirements_colab.txt
with open(work_dir + 'baselines/requirements_colab.txt', "w") as f:
    f.write("""
accelerate==0.21.0
bitsandbytes==0.41.1
dash==2.11.1
openelm
pandas
plotly==5.15.0
psutil==5.9.3
scikit-learn==1.3.0
tensorboard==2.11.2
tiktoken==0.4.0
torch
transformers==4.31.0
wandb==0.13.7
""")
!pip install -r requirements_colab.txt > /dev/null
Run python train.py¶
In [ ]:
# Just to check if the training flow works. The checkpoints are saved under nmmo/runs
%cd $work_dir
%cd baselines
ckpt_dir = work_dir + "runs"
!python train.py --runs-dir $ckpt_dir --local-mode true --train-num-steps=5_000
In [ ]:
# The new policy store should create `_state.pth` files, which contain only the model's state_dict
!ls /content/drive/MyDrive/nmmo/runs/nmmo_20231210_014715/policy_store/
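To confirm that a `_state.pth` file really contains only a state_dict, you can load it and inspect the keys. A minimal sketch; the run directory name below is just the example run from this notebook, so substitute your own:
In [ ]:
# Optional: inspect a saved checkpoint; it should be a plain dict of parameter tensors (a state_dict)
import torch
ckpt = torch.load(ckpt_dir + "/nmmo_20231210_014715/policy_store/nmmo_20231210_014715.000004_state.pth",
                  map_location="cpu")
print(type(ckpt), len(ckpt))
for name, tensor in list(ckpt.items())[:5]:
    print(name, tuple(tensor.shape))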
Test python evaluate.py with the custom policy checkpoints¶
In [ ]:
%cd $work_dir
%cd baselines
!wget https://kywch.github.io/replays/different_policy.pkl -P policies/
!wget https://kywch.github.io/replays/random_policy.pkl -P policies/
!ls policies/
In [ ]:
!python evaluate.py -p policies/
Create a checkpoint with custom policy¶
NOTE: Please check that the evaluation works with your checkpoint WITHOUT your custom files present, i.e., the submission .pkl must be self-contained.
In [6]:
# replace policy.py with your file
custom_policy_file = work_dir + "baselines/reinforcement_learning/" + "policy.py"
assert os.path.exists(custom_policy_file), "CANNOT find the policy file"
print(custom_policy_file)
In [5]:
# Replace the checkpoint path below with your own
checkpoint_to_submit = work_dir + "runs/nmmo_20231210_014715/policy_store/nmmo_20231210_014715.000004_state.pth"
assert os.path.exists(checkpoint_to_submit), "CANNOT find the checkpoint file"
assert checkpoint_to_submit.endswith("_state.pth"), "the checkpoint file must end with _state.pth"
print(checkpoint_to_submit)
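If you would rather pick the most recent checkpoint automatically instead of hard-coding the run name, here is a minimal alternative sketch using glob (it assumes the runs/*/policy_store layout shown above):
In [ ]:
# Optional alternative: find the newest *_state.pth across all runs instead of hard-coding the path
import glob
candidates = glob.glob(work_dir + "runs/*/policy_store/*_state.pth")
assert len(candidates) > 0, "No *_state.pth files found under " + work_dir + "runs"
checkpoint_to_submit = max(candidates, key=os.path.getmtime)
print(checkpoint_to_submit)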
In [10]:
import pickle
import torch
def create_custom_policy_pt(policy_file, pth_file, out_name="my_submission.pkl"):
    assert out_name.endswith(".pkl"), "The file name must end with .pkl"
    with open(policy_file, "r") as f:
        src_code = f.read()

    # add the make_policy() function
    # YOU SHOULD CHECK the name of your policy (if not Baseline),
    # and the args that go into the policy
    src_code += """

class Config(nmmo.config.Default):
    PROVIDE_ACTION_TARGETS = True
    PROVIDE_NOOP_ACTION_TARGET = True
    MAP_FORCE_GENERATION = False
    TASK_EMBED_DIM = 4096
    COMMUNICATION_SYSTEM_ENABLED = False

def make_policy():
    from pufferlib.frameworks import cleanrl
    env = pufferlib.emulation.PettingZooPufferEnv(nmmo.Env(Config()))
    # Parameters to your model should match your configuration
    learner_policy = Baseline(
        env,
        input_size=256,
        hidden_size=256,
        task_size=4096
    )
    return cleanrl.Policy(learner_policy)
"""

    state_dict = torch.load(pth_file, map_location="cpu")
    checkpoint = {
        "policy_src": src_code,
        "state_dict": state_dict,
    }
    with open(out_name, "wb") as out_file:
        pickle.dump(checkpoint, out_file)
In [11]:
%cd $work_dir
%cd baselines
# put the checkpoint into the policies directory
create_custom_policy_pt(custom_policy_file, checkpoint_to_submit,
                        out_name=work_dir + "baselines/policies/new_submission2.pkl")
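Before running the evaluation, it can be worth loading the generated .pkl back in to confirm it has the structure that create_custom_policy_pt() writes (a policy_src string plus a state_dict). A minimal sanity-check sketch:
In [ ]:
# Optional: sanity-check the generated submission file
with open(work_dir + "baselines/policies/new_submission2.pkl", "rb") as f:
    submission = pickle.load(f)
assert set(submission.keys()) == {"policy_src", "state_dict"}
print(len(submission["policy_src"]), "chars of policy source")
print(len(submission["state_dict"]), "tensors in state_dict")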
In [13]:
# see if new_submission2.pkl works with the other checkpoints
!python evaluate.py -p policies/
Comments
Your notebook is excellent! However, I have a couple of questions: first, where can I modify the RL algorithm? Is it in clean_pufferl.py? Second, where can I adjust the rewards, in task_api.py?