Measure sample efficiency and generalization in reinforcement learning using procedurally generated environments
NeurIPS 2020: Procgen Competition
Solution Summary (8th) and Thoughts
About 4 years ago
Thank you for sharing! I will write up a small bit about what I did as well. I find it so interesting that you scored so well with your solution. I implemented everything you did, but was unable to improve my score much. I implemented a deeper version of IMPALA (talked about briefly in the coinrun paper) with channels of [32, 64, 64, 64, 64] and [32, 64, 128, 128, 128], with Fixup Initialization (https://arxiv.org/abs/1901.09321). I think I may have gone overboard tinkering with model architectures, but I was sold on this network because I did so well in the warm-up round (and semi-decently in round 1) with it. I did try shallower networks for a few submissions but got worse scores every time. I did use PyTorch, and I'm thinking that may have had some effect on performance that I was missing. I wish I had implemented it in TensorFlow as well.
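For anyone curious what that deeper trunk looked like, here is a rough PyTorch sketch. The conv-sequence layout follows the standard IMPALA CNN, and the Fixup handling is simplified to zero-initializing the last conv of each residual branch rather than the full scheme from the paper, so treat it as an illustration rather than my exact model:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """IMPALA residual block; Fixup simplified to zero-init of the second conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        nn.init.zeros_(self.conv2.weight)  # block starts as identity
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):
        out = self.conv1(torch.relu(x))
        out = self.conv2(torch.relu(out))
        return x + out

class ConvSequence(nn.Module):
    """Conv + max-pool downsample followed by two residual blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.res1 = ResidualBlock(out_ch)
        self.res2 = ResidualBlock(out_ch)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.res2(self.res1(x))

class DeepImpalaCNN(nn.Module):
    """Deeper IMPALA trunk with the channel widths mentioned above."""
    def __init__(self, channels=(32, 64, 64, 64, 64), in_ch=3):
        super().__init__()
        seqs, prev = [], in_ch
        for ch in channels:
            seqs.append(ConvSequence(prev, ch))
            prev = ch
        self.body = nn.Sequential(*seqs)

    def forward(self, x):
        # Flatten the spatial map into a feature vector for the policy/value heads.
        return torch.relu(self.body(x).flatten(1))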
I tried IMPALA, Ape-X and Rainbow as well, but found PPO worked best and was the most consistent.
I implemented the improvements from this paper: https://arxiv.org/pdf/2005.12729.pdf, namely reward normalization, orthogonal initialization and learning rate annealing. I played with all sorts of LR schedules and entropy coefficient schedules. Lots and lots of hyperparameter tuning. I also tried a number of different reward scaling schemes, including log scales, and intrinsic bonuses (histogram-based curiosity, bonuses for collecting rewards quickly, penalties for dying). None of these paid off when evaluated across all environments, even though some environments showed improvement.
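As a rough illustration of two of those pieces in PyTorch (the values here are placeholders, not my tuned settings), orthogonal initialization and linear learning rate annealing can look like this:

import numpy as np
import torch
import torch.nn as nn

def ortho_init(module, gain=np.sqrt(2)):
    """Orthogonal weights with zero biases for linear and conv layers."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 15))
model.apply(ortho_init)

# Linear learning-rate annealing: decay from the initial LR to zero over training.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
total_updates = 2000  # illustrative; roughly timestep budget / rollout batch size
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(1.0 - step / total_updates, 0.0)
)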
I played with all sorts of framestacks: all combinations of 4, 3, 2 and skipped frames. I also found good success with a frame-difference stack. I think this helps a lot more in some environments than others.
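A minimal sketch of the frame-difference idea as a gym wrapper (the offset and scaling choices here are illustrative, not the exact ones I submitted):

import gym
import numpy as np

class FrameDiffStack(gym.ObservationWrapper):
    """Stack the current frame with its difference from the previous frame."""

    def __init__(self, env):
        super().__init__(env)
        h, w, c = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(h, w, 2 * c), dtype=np.uint8
        )
        self._prev = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._prev = obs
        return self.observation(obs)

    def observation(self, obs):
        if self._prev is None:
            self._prev = obs
        # Halve and offset the signed difference so it stays in uint8 range.
        diff = ((obs.astype(np.int16) - self._prev.astype(np.int16)) // 2 + 128).astype(np.uint8)
        self._prev = obs
        return np.concatenate([obs, diff], axis=-1)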
I also did some clever environment wrapping. I replicated sequential_levels and saw great success in some environments and terrible performance in others; I wasn't able to identify why that was the case. I also played with some "checkpointing" ideas. Those didn't work in the end, but there are some cool ideas there for sure.
I implemented a generalized version of action reduction. It worked really well for the warm-up round and round 1, when there are fewer edge cases to consider, but my solution ended up being brittle across all environments (especially the private ones). I think there's a lot of potential for future research on action reduction in particular. It's very promising and shows a drastic performance increase.
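The basic shape of action reduction, sketched as a gym wrapper with a hypothetical hard-coded subset (my actual version generalized this rather than hard-coding the list):

import gym

class ReducedActions(gym.ActionWrapper):
    """Expose a smaller discrete action space mapped onto Procgen's 15 actions."""

    ALLOWED = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical subset for illustration

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.Discrete(len(self.ALLOWED))

    def action(self, act):
        # Translate the reduced action index back into the original action space.
        return self.ALLOWED[act]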
And lastly, I implemented some ensembling techniques that I'm writing a paper on.
I did explore augmentations as well: crops, scaling and color jitter. I think this is a technique that would have helped the generalization track and not so much sample efficiency. For the extra computation time, it did not seem worth adding, but for generalization it may be very important. Who knows. I really think generalization should have been given more thought in this competition. It feels like an afterthought, when it really should have been a primary goal for this round.
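For example, the pad-and-random-crop augmentation I experimented with looked roughly like this (padding and crop parameters are illustrative, not my submitted values):

import numpy as np

def random_crop(batch, pad=4):
    """Pad each HWC uint8 observation and take a random crop of the original size."""
    b, h, w, c = batch.shape
    padded = np.pad(batch, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(batch)
    for i in range(b):
        y = np.random.randint(0, 2 * pad + 1)
        x = np.random.randint(0, 2 * pad + 1)
        out[i] = padded[i, y:y + h, x:x + w]
    return out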
The hardest part of this competition was optimizing for so many different environments. There was a lot of back and forth where I would zero in on my weaknesses and focus on plunder or bigfish, get upwards of 18 on plunder and 25 on bigfish, and then miner and starpilot would suffer. This happened so many times. Implementing a method for testing on all environments locally would have been huge, instead of the manual approach I was using.
Notably, I did struggle with the private environments, hovercraft and safezone. My guess is these were the biggest detractors from my score, with returns of ~2.3 and ~1.7 respectively. I'm really hoping these environments are released so that I can see what's going on.
While I ended up in 11th place, I think I finished 1st for most submissions (sorry for using up your compute budget, AICrowd).
Can't wait to read more about everyone's solutions. It's been real fun to follow everyone's progress and to have a platform to try so many ideas. I think Procgen is really an amazing competition environment. Looking forward to the next one.
Anyone else having trouble with Plunder rollout timeout?
About 4 years ago
Yes! Same problem here. All of my submissions this round have had evaluation timeouts. Glad to know I'm not the only one. Would love a more relaxed time limit.
How to find subtle implementation details
Over 4 years ago
Good question. I think reading code, research papers and experimentation is the only way. But with your post here, I'm left wondering if I missed something in the torch/tf implementation differences, since you ended up with such a good score in round 1!
Getting Rmax from environment
Over 4 years ago
@jyotish Curious how to reconcile this post with the new announcement of no environment-specific logic. Can we make use of knowing the max and min rewards of each specific environment?
Change in daily submission limits for round 1
Over 4 years ago
Thanks @shivam. Curious if there's any update on the evaluation time as of this week?
Rllib custom env
Over 4 years ago
Thanks @jyotish. So can you confirm the config looks like this?
config:
  env_config:
    env_name: coinrun
    num_levels: 0
    start_level: 0
    paint_vel_info: False
    use_generated_assets: False
    distribution_mode: easy
    center_agent: True
    use_sequential_levels: False
    use_backgrounds: True
    restrict_themes: False
    use_monochrome_assets: False
    rollout: False
  ...
Can you post a quick example of where we could access it in an env wrapper? I was trying something like what you posted above, but could not get it to work:
from ray.tune import registry
from envs.procgen_env_wrapper import ProcgenEnvWrapper  # starter-kit path (assumed)

def create_env(config):
    rollout = config.pop("rollout")  # custom key added in the yaml above
    procgen = ProcgenEnvWrapper(config)
    env = MyWrapper(procgen, rollout)
    return env

registry.register_env(
    "my_wrapper", create_env,
)
Ray/rllib appears to call my create_env() function more than once and errors out because the rollout key was popped.
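A minimal sketch of a possible workaround, assuming the error really is the missing key on the second call: copy the config before mutating it, so every invocation of create_env still sees "rollout". Untested, and MyWrapper is my own wrapper from above:

from ray.tune import registry
from envs.procgen_env_wrapper import ProcgenEnvWrapper  # starter-kit path (assumed)

def create_env(config):
    local = dict(config)                 # copy so the shared config is never mutated
    rollout = local.pop("rollout", False)
    procgen = ProcgenEnvWrapper(local)
    return MyWrapper(procgen, rollout)

registry.register_env("my_wrapper", create_env)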
Rllib custom env
Over 4 years ago
Has anyone found a way to do what Anton asked? Is it possible for the evaluators to add a field in the env_config: part of the yaml configuration that says is_training: true or false? I remember trying some trickery in the warm-up round by modifying run.sh, but I don't think it worked and I gave up on that idea.
Change in daily submission limits for round 1
Over 4 years ago
@jyotish or @mohanty, with the submission limit being dropped to 2 per day, is it possible to relax the evaluation time limit a bit? Maybe from 30 minutes to 45 minutes or an hour? I find that miner and bigfish are susceptible to longer episodes, and the evaluation time limit can be hit even with a model that performs well on the other environments.
Edit: I actually did some measurements and have some data that might be interesting. Bigfish is especially problematic. Playing it myself as a human, I found I could beat it in approximately 600-700 timesteps. I timed the rollout.py script and found that a baseline IMPALA PPO model was giving me a throughput of ~450 timesteps per second. As far as I know, the rollouts are performed sequentially, not with multiple workers. This means that optimal human-level performance would equate to:
(1000 episodes * 650 steps) / (450 timesteps per second) = ~24 minutes.
This is under the 30 minute time limit, but it's pretty close. A better model could actually take longer and fail due to the 30 minute limit.
I think it makes sense, and would be fair, for the evaluation limit to at least be long enough to handle a baseline model running the eval episode count (1000) * the max steps per episode (1000) / baseline model throughput (450), which is ~37 minutes. It's a small increase, but it could make a difference for some of the failures we're seeing. 45 minutes would be nice for algos that need a bit more computation than the baseline.
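For reference, the arithmetic behind those estimates (the 450 steps/sec figure comes from my own timing of rollout.py):

# Rough rollout-time estimates for 1000 sequential evaluation episodes.
EPISODES = 1000
THROUGHPUT = 450  # timesteps per second, baseline model, sequential rollouts

human_level = EPISODES * 650 / THROUGHPUT / 60    # ~24 minutes
worst_case = EPISODES * 1000 / THROUGHPUT / 60    # ~37 minutes
print(f"human-level: {human_level:.0f} min, worst case: {worst_case:.0f} min")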
Submission Limits
Over 4 years ago
Yea @maraoz, if you click on create submission on the AICrowd competition page (https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/submissions/new), it shows you how many submissions you have left for that day and when the next one is available. Also, I'm fairly sure that submissions that fail because you have no submissions left don't actually count towards your total submission number.
2 hours training time limit
Over 4 years ago
@jyotish @shivam Are the contest organizers willing to impose the 2 hour limit by forcing time_total_s: 7200, like @xiaocheng_tang suggested, instead of the current time limit the clusters are using? I'm finding that some of my training sessions time out before they've had a full two hours to train. For example, my last submission failed at 5883.03 seconds. The clusters are much busier this past week with more people submitting, and I'm guessing that a lot of time is being burnt on scheduling and overhead. Enforcing the 7200 second limit with the yaml file seems much more consistent and lets everyone have the same amount of training without being limited by how busy the clusters are.
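For concreteness, a sketch of what that stop criterion would look like if enforced in code rather than by the cluster wall clock; the trainer name and registered env name here are placeholders, not the exact starter-kit experiment file:

from ray import tune

tune.run(
    "PPO",                                  # placeholder trainer name
    stop={
        "timesteps_total": 8_000_000,       # the 8M timestep budget
        "time_total_s": 7200,               # hard 2-hour training cap
    },
    config={"env": "procgen_env_wrapper"},  # placeholder registered env name
)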
Selecting seeds during training
Over 4 years ago
Just to add a bit more discussion here. I thought a little more about curriculum learning, and perhaps it's a bit against the spirit of the competition. For round 1 it doesn't matter, but when we get into the final round, where num_levels=200, curriculum learning just seems like a way to skirt that rule by having more levels to work with. This would only be within the rules if you're careful to allot only x levels for easy and 200-x levels for the hard distribution. Just something to keep in mind if anyone else wants to explore this idea.
Selecting seeds during training
Over 4 years ago
Agreed. I wanted to try curriculum learning (and a few other ideas) but basically wrote them off as not possible, since we can't change any of the environment wrapper code. Would appreciate it if anyone found a workaround.
Training Error?
Over 4 years ago
I just had a submission error as well. It stopped in the middle of training (after 5 million timesteps), still under the 2 hour time limit and with no out-of-memory errors as far as I can tell. Interestingly, it did move on to the rollout phase and give me a score.
Impala Pytorch Baseline Bug(s)
Over 4 years ago
I just want to confirm that the PyTorch IMPALA baseline model does seem bugged. No changes, same model included in the starter kit, use_pytorch set to True in the yaml config. Extremely low throughput.
Here is something I see in the metrics logs
(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
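That warning may or may not be related to the throughput problem, but the usual fix for it (sketched generically here, not against the baseline's actual code) is to hand torch a writeable copy of the NumPy array:

import numpy as np
import torch

def obs_to_tensor(obs: np.ndarray) -> torch.Tensor:
    # Copying makes the array writeable, which silences the UserWarning.
    return torch.from_numpy(np.array(obs, copy=True))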
Submission Compute Time Limits
Over 4 years ago
Sure thing, thanks for the response. I see in my most recent submission #68496 in the GitLab issue that "Time Elapsed: 2 hours 56 minutes 11 seconds". That seems like a lot of extra time for resource provisioning, but maybe that's normal.
Submission Compute Time Limits
Over 4 years ago
The competition readme states that we are given 8 million timesteps and 2 hours' worth of compute. I've found myself going a bit over the 2 hour limit a couple of times by accident. I'm wondering if this 2 hour limit will be enforced in the next rounds of the competition? If it is enforced, will it be enforced at the submission level (i.e., will AICrowd stop training automatically after 2 hours), or is it on us to enforce the 2 hour limit somehow through our configuration? Or is the 2 hour compute budget not a hard requirement like the 8M timesteps is?
Several questions about the competition
Over 4 years ago
@jyotish Above you mentioned that any changes to train.py will be dropped when submitted; can I assume the same applies to rollout.py? I'm brand new to RLlib and Ray, so this warm-up phase is very helpful. If we have rollout-specific logic, is there a place you'd suggest implementing it?
Submission Limits
Over 4 years ago
I have a couple of questions regarding submission limits. What are the best practices for figuring out when you can submit? I know there's a hard 99 submission limit for the round, but I'm also running into a daily limit, and I find myself wasting submissions only to have them fail and be told to come back in a few hours to try again. This daily limit wasn't clear in the competition documentation.
To go along with that, is there a way to retry previously failed submissions instead of pushing new tags?
How to find subtle implementation details
About 4 years ago
@lars12lit Just for some extra data, I used PyTorch and rllib's PPO. Seems like a significant difference between where I ended up (11th place) and everyone else in the top 10. I did a lot of tuning too. My hunch is PyTorch is the culprit.