tim_whitaker

2 Followers · 0 Following

Location: US

Challenges Entered

NeurIPS 2020: Procgen Competition (measure sample efficiency and generalization in reinforcement learning using procedurally generated environments)

Latest submissions

#94645 (graded)
#93873 (graded)
#93849 (graded)
Participant          Rating
jon_baer             0
anton_makiievskyi    0

tim_whitaker has not joined any teams yet...

NeurIPS 2020: Procgen Competition

How to find subtle implementation details

About 4 years ago

@lars12lit Just for some extra data: I used PyTorch and RLlib's PPO. There seems to be a significant difference between where I ended up (11th place) and everyone else in the top 10. I did a lot of tuning too. My hunch is that PyTorch is the culprit.

Solution Summary (8th) and Thoughts

About 4 years ago

Thank you for sharing! I will write up a small bit about what I did as well. I find it so interesting that you scored so well with your solution. I implemented everything you did, but was unable to improve my score much. I implemented a deeper version of IMPALA (talked about briefly in the CoinRun paper) with channels of [32, 64, 64, 64, 64] and [32, 64, 128, 128, 128], with Fixup Initialization (https://arxiv.org/abs/1901.09321). I think I may have gone overboard tinkering with model architectures, but I stuck with this network because I did so well with it in the warm-up round (and semi-decently in round 1). I did try shallower networks for a few submissions but got worse scores every time. I did use PyTorch, and I'm thinking that may have had some effect on performance that I was missing. I wish I had implemented it in TensorFlow as well.
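
Roughly, the deeper torso looks like the sketch below (a simplified PyTorch sketch using those channel lists; the Fixup initialization part is omitted and everything else is illustrative, not my exact competition model):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv0(torch.relu(x))
        out = self.conv1(torch.relu(out))
        return x + out

class ConvSequence(nn.Module):
    # conv -> maxpool (stride 2) -> two residual blocks, as in the IMPALA CNN
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.res1(self.res0(x))

class DeepImpalaTorso(nn.Module):
    # channels=[32, 64, 64, 64, 64] or [32, 64, 128, 128, 128] are the variants
    # mentioned above; the standard IMPALA torso uses [16, 32, 32].
    def __init__(self, channels=(32, 64, 64, 64, 64), in_channels=3):
        super().__init__()
        blocks, c = [], in_channels
        for out_c in channels:
            blocks.append(ConvSequence(c, out_c))
            c = out_c
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return torch.relu(self.blocks(x))

features = DeepImpalaTorso()(torch.zeros(1, 3, 64, 64))  # -> shape [1, 64, 2, 2]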

I tried IMPALA, Ape-X, and Rainbow as well, but found PPO worked best and was the most consistent.

I implemented the improvements from this paper: https://arxiv.org/pdf/2005.12729.pdf, namely normalized rewards, orthogonal initialization, and learning rate annealing. I played with all sorts of different learning rate and entropy coefficient schedules, plus lots and lots of hyperparameter tuning. I also tried a number of different reward scaling schemes, including log scales and intrinsic bonuses (histogram-based curiosity, bonuses for collecting rewards quickly, penalties for dying). None of these paid off when evaluated on all environments, even though some environments showed improvement.
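
As a rough illustration of two of those tweaks, orthogonal initialization and a linear learning rate anneal, in PyTorch (the model and schedule length are placeholders, not my actual training setup):

import torch.nn as nn
import torch.optim as optim

def init_orthogonal(module, gain=2 ** 0.5):
    # Orthogonal init for conv/linear weights, zero biases.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 64 * 64, 15),
)
model.apply(init_orthogonal)

optimizer = optim.Adam(model.parameters(), lr=5e-4)
# Linearly anneal the LR to zero over some number of updates (1000 here is arbitrary).
total_updates = 1000
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_updates))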

I played with all sorts of frame stacks: combinations of 4, 3, and 2 frames, with and without frame skipping. I also found good success with a frame-difference stack. I think this helps a lot more in some environments than others.
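
To give an idea of what I mean by a frame-difference stack, something along these lines (a gym-style sketch assuming HWC uint8 observations, not the exact wrapper I submitted):

import gym
import numpy as np

class FrameDiffStack(gym.Wrapper):
    # Concatenate the current frame with its difference from the previous frame.
    def __init__(self, env):
        super().__init__(env)
        h, w, c = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=-255, high=255, shape=(h, w, 2 * c), dtype=np.int16)
        self._prev = None

    def _stack(self, obs):
        obs16 = obs.astype(np.int16)
        diff = obs16 - self._prev
        self._prev = obs16
        return np.concatenate([obs16, diff], axis=-1)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._prev = obs.astype(np.int16)  # first diff is all zeros
        return self._stack(obs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._stack(obs), reward, done, info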

I also did some clever environment wrapping. I replicated sequential_levels and saw great success in some environments and terrible performance in others; I wasn't able to identify why. I also played with some "checkpointing" ideas. These didn't work in the end, but there are some cool ideas there for sure.

I implemented a generalized version of action reduction. It worked really well for the warm-up round and round 1, when there were fewer edge cases to consider, but my solution ended up being brittle across all environments (especially the private ones). I think there's a lot of potential for future research on action reduction in particular; it's very promising and shows a drastic performance increase.
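
In spirit, action reduction is just an action wrapper that exposes a subset of Procgen's 15 discrete actions. A bare-bones sketch (the kept action indices below are made up for illustration):

import gym

class ReducedActions(gym.ActionWrapper):
    # Map a smaller Discrete action space onto a chosen subset of the full one.
    def __init__(self, env, kept_actions):
        super().__init__(env)
        self.kept_actions = list(kept_actions)
        self.action_space = gym.spaces.Discrete(len(self.kept_actions))

    def action(self, act):
        return self.kept_actions[act]

# e.g. env = ReducedActions(ProcgenEnvWrapper(config), kept_actions=[0, 1, 3, 5, 7])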

And lastly, I implemented some ensembling techniques that I’m writing a paper on.

I did explore augmentations as well: crops, scaling, and color jitter. I think this is a technique that would have helped on the generalization track, not so much on sample efficiency. For the extra computation time, it did not seem worth adding, but for generalization it may be super important. Who knows. I really think generalization should have been given more thought in this competition. It feels like an afterthought, when it really should have been a primary goal for this round.
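
For concreteness, the kind of augmentation pipeline I'm describing, torchvision-style (purely illustrative, not the code I submitted):

import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(64, scale=(0.8, 1.0)),  # random crop, rescaled back to 64x64
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

obs = torch.rand(3, 64, 64)   # stand-in for a normalized CHW observation
augmented = augment(obs)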

The hardest part of this competition was optimizing for so many different environments. There was a lot of back and forth where I would zero in on my weaknesses and focus on plunder or bigfish, and could get upwards of 18 for plunder and 25 for bigfish, but then miner and starpilot would suffer. This happened so many times. I think implementing a method for testing on all environments locally would have been huge, instead of the manual approach I was using.

Notably, I did struggle with the private environments, hovercraft and safezone. My guess is these were the biggest drags on my score, with returns of ~2.3 and ~1.7 respectively. I'm really hoping these environments are released so that I can see what's going on.

While I ended up in 11th place, I think I ended up in 1st for the most submissions (sorry for using up your compute budget AICrowd) :slight_smile:

Can’t wait to read more about everyone’s solutions. It’s been real fun to follow everyone’s progress and to have a platform to try so many ideas. I think procgen is really an amazing competition environment. Looking forward to the next one.

Anyone else having trouble with Plunder rollout timeout?

About 4 years ago

Yes! Same problem here. All of my submissions this round have had evaluation timeouts. Glad to know I’m not the only one. Would love a more relaxed time limit.

How to find subtle implementation details

Over 4 years ago

Good question. I think reading code, research papers and experimentation is the only way. But with your post here, I’m left wondering if I missed something in the torch/tf implementation differences since you ended with such a good score in round 1!

Getting Rmax from environment

Over 4 years ago

@jyotish Curious how to reconcile this post with the new announcement of no environment-specific logic. Can we make use of knowing the max and min rewards of each specific environment?

Change in daily submission limits for round 1

Over 4 years ago

Thanks @shivam. Curious if there’s any update on the evaluation time as of this week?

Rllib custom env

Over 4 years ago

Ah of course. That makes sense. Thanks @dipam_chakraborty.

Rllib custom env

Over 4 years ago

Thanks @jyotish. So can you confirm the config looks like this?

config:
    env_config:
        env_name: coinrun
        num_levels: 0
        start_level: 0
        paint_vel_info: False
        use_generated_assets: False
        distribution_mode: easy
        center_agent: True
        use_sequential_levels: False
        use_backgrounds: True
        restrict_themes: False
        use_monochrome_assets: False
        rollout: False
...

Can you post a quick example of where we could access it in an env wrapper? I was trying something like what you posted above, but could not get it to work:

def create_env(config):
    rollout = config.pop("rollout")
    procgen = ProcgenEnvWrapper(config)
    env = MyWrapper(procgen, rollout)
    return env

registry.register_env(
    "my_wrapper", create_env,
)

Ray/rllib appears to call my create_env() function more than once and errors out because the rollout key was popped.
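
For anyone hitting the same thing, one workaround sketch (untested here) is to pop from a copy so the shared dict is never mutated across calls:

def create_env(config):
    config = dict(config)                  # copy, so repeated calls still see "rollout"
    rollout = config.pop("rollout", False)
    procgen = ProcgenEnvWrapper(config)
    return MyWrapper(procgen, rollout)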

Rllib custom env

Over 4 years ago

Has anyone found a way to do what Anton asked? Is it possible for the evaluators to add a field in the "env_config:" part of the yaml configuration that says is_training: true or false? I remember trying some trickery in the warm-up round by modifying run.sh, but I don't think it worked and I gave up on that idea.

Change in daily submission limits for round 1

Over 4 years ago

@jyotish or @mohanty, with the submission limit being dropped to 2 per day, is it possible to relax the evaluation time limit a bit? Maybe from 30 minutes to 45 minutes or an hour? I find that miner and bigfish are susceptible to longer episodes, and the evaluation time limit can be hit even with a model that performs well on the other environments.

Edit: I actually did some measurements and have some data that might be interesting. Bigfish is especially problematic. As a human player, I found I could beat it in approximately 600-700 timesteps. I timed the rollout.py script and found that a baseline IMPALA PPO model gave me a throughput of ~450 timesteps per second. As far as I know, the rollouts are performed sequentially and not with multiple workers. This means that optimal human-level performance would equate to:

(1000 episodes * 650 steps) / (450 timesteps per second) = ~24 minutes.

This is under the 30 minute time limit but it’s pretty close. A better model could actually take longer and fail due to the 30 minute limit.

I think it makes sense, and would be fair, for the evaluation limit to at least be long enough for a baseline model to run the full evaluation: 1000 eval episodes * 1000 max steps per episode / 450 timesteps per second of baseline throughput, which is ~37 minutes. It's a small increase, but it could make a difference for some of the failures we're seeing. 45 minutes would be nice for algorithms that need a bit more computation than the baseline.
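
For anyone who wants to plug in their own numbers, the same back-of-the-envelope math in Python (the inputs are my rough local estimates from above, not official figures):

episodes = 1000          # evaluation episodes
human_steps = 650        # approx. steps a human needs to beat bigfish
max_steps = 1000         # max steps per episode
throughput = 450         # timesteps/second measured locally with rollout.py

print(episodes * human_steps / throughput / 60)  # ~24 minutes
print(episodes * max_steps / throughput / 60)    # ~37 minutes worst case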

Submission Limits

Over 4 years ago

Yeah @maraoz, if you click on create submission on the AICrowd competition page (https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/submissions/new), it shows you how many submissions you have left for that day and when the next one is available. Also, I'm fairly sure that submissions that fail because you have no submissions left don't actually count towards your total submission number.

2 hours training time limit

Over 4 years ago

@jyotish @shivam Are the contest organizers willing to enforce the 2-hour limit by forcing time_total_s: 7200, as @xiaocheng_tang suggested, instead of the current wall-clock limit the clusters are using? I'm finding that some of my training sessions time out before they've had a full two hours to train. For example, my last submission failed at 5883.03 seconds. The clusters are much busier this past week with more people submitting, and I'm guessing that a lot of time is being burnt on scheduling and overhead. Enforcing the 7200-second limit in the yaml file seems much more consistent and lets everyone have the same amount of training without being limited by how busy the clusters are.
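
To be concrete, what I mean is enforcing the cap as a stop criterion rather than relying on the cluster, along these lines (an illustrative sketch; in practice this lives in the experiment yaml rather than a direct tune.run call, and the env here is a placeholder):

from ray import tune

tune.run(
    "PPO",
    stop={
        "timesteps_total": 8_000_000,  # 8M env steps, per the rules
        "time_total_s": 7200,          # hard 2-hour wall-clock cap
    },
    # "CartPole-v0" is just a stand-in; the competition runner would use
    # the registered procgen wrapper env instead.
    config={"env": "CartPole-v0"},
)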

Selecting seeds during training

Over 4 years ago

Just to add a bit more discussion here: I thought a little more about curriculum learning, and perhaps it's a bit against the spirit of the competition. For round 1 it doesn't matter, but when we get to the final round where num_levels=200, curriculum learning just seems like a way to skirt that rule by having more levels to work with. It would only be legitimate if you're careful to allot x levels to the easy distribution and 200-x levels to the hard distribution. Just something to keep in mind if anyone else wants to explore this idea.

Selecting seeds during training

Over 4 years ago

Agreed. I wanted to try curriculum learning (and a few other ideas) but basically wrote them off as not possible since we can’t change any of the environment wrapper code. Would appreciate it if anyone found a workaround.

Training Error?

Over 4 years ago

I just had a submission error as well. Stopped in the middle of training (after 5 million timesteps). Still under the 2 hour time limit and no out of memory errors as far as I can tell. Interestingly, it did move on to the rollout phase and give me a score.

Impala Pytorch Baseline Bug(s)

Over 4 years ago

I just want to confirm that the PyTorch IMPALA baseline model does seem to be bugged: no changes, the same model included in the starter kit, use_pytorch set to True in the yaml config, and extremely low throughput.

Here is something I see in the metrics logs

(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.

Submission Compute Time Limits

Over 4 years ago

Sure thing, thanks for the response. I see in the GitLab issue for my most recent submission #68496 that "Time Elapsed: 2 hours 56 minutes 11 seconds". That seems like a lot of extra time for resource provisioning, but maybe that's normal.

Submission Compute Time Limits

Over 4 years ago

The competition readme states that we are given 8 million timesteps and 2 hours' worth of compute. I've found myself going a bit over the 2-hour limit a couple of times by accident. I'm wondering if this 2-hour limit will be enforced in the next rounds of the competition. If it is enforced, will it be enforced at the submission level (i.e., will AICrowd stop training automatically after 2 hours), or is it on us to enforce the 2-hour limit somehow through our configuration? Or is the 2-hour compute budget not a hard requirement like the 8M timesteps?

Several questions about the competition

Over 4 years ago

@jyotish Above you mentioned that any changes to train.py will be dropped when submitted; can I assume that the same applies to rollout.py? I'm brand new to RLlib and Ray, so this warm-up phase is very helpful. If we have rollout-specific logic, is there a place you'd suggest implementing it?

Submission Limits

Over 4 years ago

I have a couple of questions regarding submission limits. What are the best practices for figuring out when you can submit? I know there's a hard 99-submission limit for the round, but I'm also running into a daily limit and finding myself wasting submissions, only for them to fail and tell me to come back in a few hours to try again. This daily limit wasn't clear in the competition documentation.

To go along with that, is there a way to retry previously failed submissions instead of pushing new tags?
