NeurIPS 2022: MineRL BASALT Competition
Challenge Rules
Through the following rules, we aim to capture the spirit of the competition. Any submissions found to be violating the rules may be deemed ineligible for participation by the organizers.
Team formation
- There is no limit to the team size.
- Each participant must create their own AIcrowd account; a team may not share a single AIcrowd account.
- Each person may only be part of one team; i.e., a participant may not create a submission both as a team of one (themselves) and as part of another team. Submissions can be deleted upon request if mishaps happen.
Cheating
- As with journal retractions, the organizers reserve the right to disqualify participants if cheating is discovered at any point.
- We want you to succeed. Please reach out to us if you are uncertain about whether your current approach violates the rules.
Winning and receiving prizes
- For a team to be eligible to win, each member must satisfy the following conditions:
- be at least 18 years old and at least the age of majority in their place of residence;
- not reside in any region or country subject to U.S. Export Regulations; and
- not be an organizer of this competition nor a family member of a competition organizer.
- To receive any awards from our sponsors, competition winners must attend the NeurIPS workshop (held online).
- Entries to the MineRL BASALT competition must be “open”. Teams will be expected to reveal most details of their method, including source code and instructions for human feedback collection (special exceptions may be made for pending publications).
Rule clarifications
- Official rule clarifications will be made in the FAQ section on the AIcrowd website.
- Answers within the FAQ are official answers to questions. Any informal answers to questions (e.g., via email or Discord) are superseded by answers added to the FAQ.
BASALT track submission
- Each submission should contain both training and testing code along with trained models.
- The models you provide will be used along with the testing code to evaluate your submission during the competition.
- The training code is used in the validation phase (see “Evaluation”) to ensure rule compliance.
- The training instance will already contain the following items which you do not need to upload and are free to use in any way:
- The BASALT dataset(s) provided by the competition organizers (this is not the dataset shared by OpenAI)
- The OpenAI VPT models which were shared publicly: https://github.com/openai/Video-Pre-Training
- For training, all files larger than 30 MB will be removed from your submission to ensure rule compliance.
- You may upload small datasets of your own data (the total size should not exceed 30 MB).
- You may also use Minecraft data provided by the MineDojo project (https://minedojo.org/); you need to upload this data yourself and respect the 30 MB limit. The same restriction applies to OpenAI’s contractor data: https://github.com/openai/Video-Pre-Training.
- You may also use other publicly available pretrained models as long as they were not trained on Minecraft or on Minecraft data, and were publicly available on July 1, 2022. The intent of this rule is to allow participants to use models which are, for example, trained on ImageNet or similar datasets.
- Clarification: You are allowed to use the OpenAI VPT models released on the following page; they will be pre-loaded on the training instance (you do not need to upload them): https://github.com/openai/Video-Pre-Training. These do not count towards the 30 MB limit.
- Clarification: Uploading pretrained models does count towards the 30 MB limit.
BASALT track training and developing
- You cannot pull information out of the underlying Minecraft simulator; only information provided in the interfaces of the environments we give is allowed (i.e., through the OpenAI Gym interface). This applies to both training and testing code.
- You are allowed to use domain knowledge as part of your training and testing code submissions, e.g., hardcoded rules to move the player around or computer vision to detect objects in the image.
- Submissions are limited to four days of compute on prespecified computing hardware to train models for all of the tasks. Hardware specifications will be shared later on the competition’s AIcrowd page. In the previous year’s competition, this machine contained 6 CPU cores, 56 GB of RAM, and a single K80 GPU (12 GB of VRAM).
- If you train using in-the-loop human feedback, you are limited to 10 hours of human feedback over the course of training all four models. You can create, for example, a GUI or a command-line interface for collecting this feedback (a minimal sketch is given after this list). A person fluent in English should be able to understand how to provide feedback after reading a document of at most 10 pages (e.g., a Google Doc). This is necessary for retraining, since we will have to replicate both your algorithm’s computation and its requests for human feedback.
- During retraining (see “Evaluation”), while we aim to get human feedback to you as soon as possible, your program may have to wait for a few hours for human feedback to be available. (This will not count against the four-day compute budget, though you are allowed to continue background computation during this time.)
- Human feedback will be provided by remote contractors, so your code should be resilient to network delays. (In particular, contractors may find it challenging to play Minecraft well over this connection.)
- The retraining code will be run on a machine with a display, and contractors will be able to access it like any other desktop machine with a display.
- You are permitted to ask for human feedback in separate batches (e.g., every hour or so, you ask for 10 minutes of human feedback).
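To make the Gym-only interaction and batched human feedback rules concrete, here is a minimal, hypothetical sketch of a training loop that touches the task only through the Gym interface and requests command-line feedback in batches. The environment id, batch sizes, and prompt wording are illustrative assumptions, not requirements.

```python
# Hypothetical sketch only: interacts with the task solely through the Gym
# interface and collects human feedback in batches via a command-line prompt.
# The env id, batching scheme, and prompt wording are illustrative assumptions.
import gym
import minerl  # registers the MineRL/BASALT environments with Gym

def collect_feedback(num_episodes):
    """Ask the human rater to grade the last batch of episodes from 1 to 5."""
    ratings = []
    for i in range(num_episodes):
        answer = input(f"Rate episode {i + 1} from 1 (bad) to 5 (good): ")
        ratings.append(int(answer))
    return ratings

env = gym.make("MineRLBasaltFindCave-v0")   # assumed task id
for batch in range(3):                      # e.g., ask for feedback every few episodes
    for episode in range(2):
        obs = env.reset()
        done = False
        while not done:
            action = env.action_space.noop()        # replace with your policy's action
            obs, reward, done, info = env.step(action)
    ratings = collect_feedback(num_episodes=2)
    # ... update the policy from `ratings` here ...
env.close()
```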
BASALT track evaluation
- Online Evaluation (while submissions are accepted):
- During the competition, the online leaderboard will use a quick evaluation by the organizers: competition organizers will inspect a single recording of the submission and assign the recording a grade from 1 to 5 (1 = random agent, 2 = baseline, 5 = human level). The leaderboard score is the average of these grades over the four tasks.
- This metric will not be used for the final evaluation in any way. It is intended to give a rough ordering of the submissions during the competition.
- Phase 1 Evaluation:
- After submissions close, up to 50 teams will be included in the first round of TrueSkill evaluation using human feedback.
- During this round, a mixture of crowdsourced workers (e.g., MTurk) and hired contractors will evaluate submissions by indicating which submissions were better at solving the task. See the AIcrowd page on evaluation for details.
- The top 20 teams, as measured by the TrueSkill rating, will proceed to the Phase 2 evaluation.
- If more than 50 teams have submitted solutions, organizers reserve the right to select 50 teams for the Phase 1 Evaluation as they see fit (e.g., a fast round of evaluations, grading by organizers).
- Each team may only have one submission during evaluations. You will be given a chance to choose your final submission before the evaluation begins. The most recent submission will be the final submission by default.
- Phase 2 Evaluation:
- The top 20 teams will be included in a longer round of TrueSkill evaluation, where human evaluators (crowdsourced or contractors) provide more feedback on the solutions until the ordering of solutions converges, as measured by the TrueSkill rating (a minimal sketch of how pairwise comparisons update TrueSkill ratings follows this section).
- The top 3 teams and other potential winners for other prizes will move to the validation phase.
- Validation phase
- The submission code of potential winners will be examined for rule compliance.
- The submissions of potential winners will be retrained using the training code and instructions the team provided. Organizers may contact the team if issues arise.
- If the retrained solutions are significantly worse than what was submitted (judged by organizers), organizers will contact the team. If no legitimate reason for the discrepancy is found (e.g., an error in retraining), the team is disqualified. (This check is meant to prevent teams from submitting agents that were produced by a training process other than the one submitted.)
- If a team’s submission violates the rules, the organizers will contact the team for an appeal. Unless a successful appeal is made, the organizers will remove the submission from the competition.
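For intuition only, the sketch below shows how pairwise human judgments (“which submission solved the task better?”) could update TrueSkill ratings, using the open-source `trueskill` Python package. The team names and the conservative ranking rule are illustrative assumptions, not the organizers’ exact evaluation code.

```python
# Illustrative sketch only, not the organizers' evaluation code.
# Each submission carries a TrueSkill rating; every human judgment of
# "A solved the task better than B" updates both ratings.
import trueskill

ratings = {"team_a": trueskill.Rating(), "team_b": trueskill.Rating()}  # hypothetical teams

# A human evaluator preferred team_a's recording over team_b's on one task.
ratings["team_a"], ratings["team_b"] = trueskill.rate_1vs1(
    ratings["team_a"], ratings["team_b"]
)

# Rank by a conservative skill estimate; the ordering has converged once
# further comparisons no longer change it.
leaderboard = sorted(
    ratings, key=lambda team: ratings[team].mu - 3 * ratings[team].sigma, reverse=True
)
print(leaderboard)
```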
Intro track submission and evaluation
- The above rules apply to the Intro track, with the following exceptions:
- You only need to submit the testing code and the accompanying models as part of the submission.
- The submission is evaluated by taking the maximum of the episodic rewards over 20 episodes on fixed seeds, with no human evaluation (a short sketch of this scoring rule follows this list). We use the maximum rather than the average to reflect the best-case performance of the submission.
- There is only online evaluation; after submissions close and all submissions have been evaluated, the final leaderboard ranking determines the winners.
- The submissions of the Intro track are in no way connected to BASALT track submissions: the same team can create submissions for both tracks.
- Intro track submissions do not receive “top solution” prizes. However, interesting solutions in the track may receive research/specialization prizes, at the organizers’ discretion (see description of prizes for further details).
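A rough sketch of the Intro track scoring rule above, assuming a placeholder environment id, placeholder seeds, and a no-op policy standing in for the submitted agent:

```python
# Hypothetical sketch of the Intro track scoring rule: the score is the maximum
# episodic reward over 20 fixed-seed episodes (no human evaluation).
# The env id, seed values, and policy are illustrative placeholders.
import gym
import minerl  # registers the MineRL environments with Gym

FIXED_SEEDS = list(range(20))  # the actual seeds are chosen by the organizers

def run_episode(env):
    """Roll out one episode and return the total (episodic) reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = env.action_space.noop()            # replace with the submitted agent
        obs, reward, done, info = env.step(action)
        total += reward
    return total

def evaluate(env_id):
    scores = []
    for seed in FIXED_SEEDS:
        env = gym.make(env_id)
        env.seed(seed)                              # fix the seed (exact seeding API may differ)
        scores.append(run_episode(env))
        env.close()
    return max(scores)                              # maximum, not average: best-case performance
```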