
🚀 Starter Kit

🚀 Official Baseline

👥 Looking for teammates or advice? Join the competition Slack!

🙋 NLP Task: Asking Clarifying Questions

This task is about determining when and what clarifying questions to ask. Given an instruction from the Architect (e.g., "Help me build a house."), the Builder needs to decide whether it has sufficient information to carry out the described task or whether further clarification is needed. For instance, the Builder might ask "What material should I use to build the house?" or "Where do you want it?". In this NLP task, we focus on the research question "what to ask to clarify a given instruction" independently of learning to interact with the 3D environment. The original instruction and its clarification can be used as input for the Builder to guide its progress.

Top: the architect's instruction was clear, so no clarifying question gets asked. Bottom: 'leftmost' is ambiguous, so the Builder asks a clarifying question.

The original description of the baselines and the methodologies can be found in the following paper:

@inproceedings{aliannejadi-etal-2021-building,
    title = "Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions",
    author = "Aliannejadi, Mohammad and Kiseleva, Julia and Chuklin, Aleksandr and Dalton, Jeff and Burtsev, Mikhail",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    doi = "10.18653/v1/2021.emnlp-main.367",
    pages = "4473--4484",
}

🖊 Evaluation

Models submitted to this track are evaluated on both when to ask and what to ask criteria with a two-step scoring process.

  • When to ask: This is a binary classification problem: does the provided instruction require a clarifying question? We use the macro-averaged F1 score to evaluate your classifier. However, we do not believe heavily optimizing this metric is the best use of your time from a research perspective, so we quantize the F1 score into the following bins:

  • 0.90 - 1.0

  • 0.85 - 0.90

  • 0.75 - 0.85

  • 0.65 - 0.75

  • 0.50 - 0.65

  • 0.35 - 0.50

  • 0.0 - 0.35

    So if your classifier gets an F1 score of 0.82, the binned F1 score will be 0.75. For an F1 score of 0.93, the binned score will be 0.90, and so on.

  • What to ask: The second problem evaluates how well your model can rank a list of human-issued clarifying questions for a given ambiguous instruction. Your model will be evaluated on Mean Reciprocal Rank (MRR), rounded off to 3 significant digits.

The leaderboard will be ranked by the binned F1 score; submissions with the same binned F1 score will be ordered by MRR. An illustrative sketch of both metrics follows the note below.

Please note that the above metrics are subject to change after the warm-up phase of the competition is complete.
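For concreteness, here is a minimal sketch of how the binned F1 and MRR described above could be computed. The bin edges follow the list above, but the use of scikit-learn, the tie handling, and the rounding are assumptions for illustration and may differ from the official evaluator.

    # Illustrative sketch only; bin edges follow the list above, everything else
    # (scikit-learn, rounding) is an assumption and may differ from the official evaluator.
    from sklearn.metrics import f1_score

    F1_BIN_EDGES = [0.90, 0.85, 0.75, 0.65, 0.50, 0.35, 0.0]  # lower edge of each bin

    def binned_f1(y_true, y_pred):
        """Macro-averaged F1, quantized down to the lower edge of its bin."""
        raw = f1_score(y_true, y_pred, average="macro")
        for edge in F1_BIN_EDGES:
            if raw >= edge:
                return edge
        return 0.0

    def mean_reciprocal_rank(rankings, relevant_qids):
        """MRR over unclear instructions: rankings[i] is the model's ranked list of
        qids for instruction i, relevant_qids[i] is the single relevant qid (qrel)."""
        total = 0.0
        for ranking, rel in zip(rankings, relevant_qids):
            if rel in ranking:
                total += 1.0 / (ranking.index(rel) + 1)
        return round(total / len(rankings), 3)  # rounded to three digits, per the description above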

💾 Dataset

Download the public dataset for this task using the link below. You will need to accept the rules of the competition to access the data.

https://www.aicrowd.com/challenges/neurips-2022-iglu-challenge-nlp-task/dataset_files

The dataset consists of:

  • clarifying_questions_train.csv
  • question_bank.csv
  • initial_world_paths folder

clarifying_questions_train.csv has the following columns:

  • GameId - Id of the game session.
  • InitializedWorldPath - Path to the file under initial_world_paths that contains the state of the world initialized for the architect. The architect provides an instruction to build based on this world state. More information will follow on how the world state can be parsed/visualized.
  • InputInstruction - Instruction provided by the architect.
  • IsInstructionClear - Specifies whether the instruction provided by the architect is clear. This has been marked by another annotator, not the architect.
  • ClarifyingQuestion - Question asked by the annotator upon marking the instruction as unclear.
  • qrel - Question id (qid) of the relevant clarifying question for the current instruction.
  • qbank - List of clarifying question ids that need to be ranked for each unclear instruction. The mapping between clarifying questions and ids is present in the question_bank.csv.

Merging the ids in the qrel and qbank columns gives the full list of qids to be ranked for each unclear instruction (see the loading sketch below).

question_bank.csv: This file maps the qids mentioned in the qrel and qbank columns of clarifying_questions_train.csv to the bank of clarifying questions issued by annotators.
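As a starting point, the snippet below shows one way the files described above might be loaded and the candidate question lists assembled. It is a minimal sketch: the column names in question_bank.csv, the encoding of the qbank list, and the values of IsInstructionClear are assumptions and may need adjusting to the actual files.

    # Minimal sketch, assuming the layout described above; the column names in
    # question_bank.csv ("qid", "question"), the stringified-list encoding of
    # qbank, and the "Yes"/"No" values of IsInstructionClear are assumptions.
    import ast
    import pandas as pd

    train = pd.read_csv("clarifying_questions_train.csv")
    question_bank = pd.read_csv("question_bank.csv")

    # Map qid -> clarifying question text.
    qid_to_question = dict(zip(question_bank["qid"], question_bank["question"]))

    # For each unclear instruction, merge qrel and qbank to get all qids to rank.
    unclear = train[train["IsInstructionClear"] == "No"]
    for _, row in unclear.iterrows():
        candidate_qids = ast.literal_eval(row["qbank"]) + [row["qrel"]]
        candidates = [qid_to_question[qid] for qid in candidate_qids]
        # ... rank `candidates` against row["InputInstruction"] with your model ...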

🚀 Getting Started

Make your first submission using the Starter Kit. 🚀

📅 Timeline

  • July: Releasing materials: IGLU framework and baselines code.
  • 29th July: Competition begins! Participants are invited to start submitting their solutions.
  • 31st October: Submission deadline. Submissions are closed and organizers begin the evaluation process.
  • November: Winners are announced and are invited to contribute to the competition writeup.
  • 2nd-3rd of December: Presentation at NeurIPS 2022 (online/virtual).

๐Ÿ† Prizes

The challenge features a Total Cash Prize Pool of $16,500 USD.

The prize pool for the NLP Task is divided as follows:

  • 1st place: $4,000 USD
  • 2nd place: $1,500 USD
  • 3rd place: $1,000 USD

Research prizes: $3,500 USD

Task Winners. For each task, we will evaluate submissions as described in the Evaluation section. The three teams that score highest on this evaluation will receive prizes of $4,000 USD, $1,500 USD, and $1,000 USD.

Research prizes. We have reserved $3,500 USD of the prize pool to be given out at the organizers' discretion to submissions that we think made a particularly interesting or valuable research contribution. If you wish to be considered for a research prize, please include some details on interesting research-relevant results in the README for your submission. We expect to award around 2-5 research prizes in total.


Baselines

We will be releasing the baselines soon; keep an eye on the forums. In the meantime, the sketch below illustrates the BM25 ranking idea from the community baseline notebook listed under Notebooks at the bottom of this page.
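This is a minimal, illustrative sketch of ranking candidate clarifying questions with BM25 for the "what to ask" subtask; the rank_bm25 package and whitespace tokenization are illustrative choices, not the official baseline code.

    # Rough sketch of a BM25 ranker for candidate clarifying questions; the
    # rank_bm25 package and simple whitespace tokenization are illustrative
    # choices, not the official baseline implementation.
    from rank_bm25 import BM25Okapi

    def rank_questions(instruction, candidate_questions):
        """Return candidate questions sorted by BM25 relevance to the instruction."""
        tokenized_corpus = [q.lower().split() for q in candidate_questions]
        bm25 = BM25Okapi(tokenized_corpus)
        scores = bm25.get_scores(instruction.lower().split())
        order = sorted(range(len(candidate_questions)), key=lambda i: -scores[i])
        return [candidate_questions[i] for i in order]

    # Example with a made-up instruction and candidate questions:
    print(rank_questions(
        "Build a tower on the leftmost corner",
        ["What color should the tower be?",
         "How tall should the tower be?",
         "Which corner do you mean by leftmost?"],
    ))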

👥 Team

The organizing team:

  • Julia Kiseleva (Microsoft Research)
  • Alexey Skrynnik (MIPT)
  • Artem Zholus (MIPT)
  • Shrestha Mohanty (Microsoft Research)
  • Negar Arabzadeh (University of Waterloo)
  • Marc-Alexandre Côté (Microsoft Research)
  • Mohammad Aliannejadi (University of Amsterdam)
  • Milagro Teruel (Microsoft Research)
  • Ziming Li (Amazon Alexa)
  • Mikhail Burtsev (MIPT)
  • Maartje ter Hoeve (University of Amsterdam)
  • Zoya Volovikova (MIPT)
  • Aleksandr Panov (MIPT)
  • Yuxuan Sun (Meta AI)
  • Kavya Srinet (Meta AI)
  • Arthur Szlam (Meta AI)
  • Ahmed Awadallah (Microsoft Research)
  • Dipam Chakraborty (AIcrowd)

The advisory board:

  • Tim Rocktäschel (UCL & DeepMind)
  • Julia Hockenmaier (University of Illinois at Urbana-Champaign)
  • Bill Dolan (Microsoft Research)
  • Ryen W. White (Microsoft Research)
  • Maarten de Rijke (University of Amsterdam)
  • Oleg Rokhlenko (Amazon Alexa Shopping)
  • Sharada Mohanty (AIcrowd)

๐Ÿค Sponsors

Special thanks to our sponsors for their contributions.

📱 Contact


We encourage participants to join our Slack workspace for discussions and questions.

You can also reach us at info@iglu-contest.net or via the AICrowd discussion forum.

Notebooks

  • Baseline - BERT Classifier - BM25 Ranker (by dipam)