AIcrowd | Amazon KDD Cup 2024: Multi-Task Online Shopping Challenge for LLMs

Round 1: Completed

Round 2: Completed

final-evaluations: Completed #llm #recommender_system

Amazon Search

Problem Statements

Weight: 1.0

Amazon KDD Cup 24: Understanding Shopping Concepts

Decode Complex Shopping Concepts And Terminologies.

#nlp #llm #summarization #information_extraction

Weight: 1.0

Amazon KDD Cup 24: Shopping Knowledge Reasoning

Make Informed Decisions with Shopping Knowledge

#nlp #world_knowledge #llm #reasoning

Weight: 1.0

Amazon KDD Cup 24: User Behavior Alignment

Understand Dynamic Customer Behaviour

#nlp #llm #predictive_modeling #behavior_modeling

Weight: 1.0

Amazon KDD Cup 24: Multi-Lingual Abilities

Shopping Across Languages

#nlp #llm #multilingual_processing

Weight: 1.0

Amazon KDD Cup 24: All-Around

Multi-Task Online Shopping to the Max!

#llm #recommendation_system #specialized_llm

📰 News

(July 19th, 2024) 🎉Winners of each track are out🎉! Check here and here for details.

(July 13th, 2024) The call for workshop papers is out! We are looking forward to your insights and ingenious solutions. Deadline: August 2nd, AoE.

(July 10th, 2024) Please see this note about selecting your submissions for our final evaluation.

(July 4th, 2024) There was a glitch in our Track 3 evaluators earlier today, causing all Track 3 submissions to fail. We have debugged that. Feel free to submit them again.

(July 2nd, 2024) We double the submission quota once more to 20 successful submissions per week per track, and 160 submission failures across all tracks per week. Have fun!

(June 19th, 2024) Please take a look at this reminder about setting the backend of vLLM and cleaning unwanted files in your repo.

(June 8th, 2024) As we increased the AWS quota, we will increase the number of parallel submissions to 5 across all tracks.

(June 3rd, 2024) For participants who use LoRA, please take a look at this.

(May 25th, 2024) Phase 2 is now live! Read the announcement to see what's new here.

📄 External Resource

Since we do not provide large-scale training datasets, all solutions have to rely heavily on external resources. We would like to highlight that all solutions submitted to this challenge should be based on resources (e.g. datasets and models) that are publicly available. Submissions should not contain proprietary data or model checkpoints. Participants can paraphrase or extend upon existing datasets (e.g. manual labeling, or labeling/generation with GPT), but should make their extended datasets available after the competition.

We list some public resources that may be helpful to your solutions.

ECInstruct: An instruction tuning dataset also based on Amazon raw data.
Amazon-M2: A multi-lingual Amazon session dataset with rich meta-data used for KDD Cup 2023.
Amazon-ESCI: A multi-lingual Amazon query-product relation dataset used for KDD Cup 2022.

Challenge at a Glance

Imagine you're trying to find the perfect gift for a friend's birthday through an online store. You have to go through countless products, read reviews to gauge quality, compare prices, and finally decide on a purchase. This process is time-consuming, and the sheer volume of information and options can be overwhelming: you must navigate a web of products, reviews, and prices while trying to make the best decision based on your own understanding and preferences.

This challenge aims to simplify the process with Large Language Models (LLMs). While current techniques often fall short in understanding the nuances of specific shopping terms and knowledge, customer behaviors, preferences, and the diverse nature of products and languages, we believe that LLMs, with their multi-task and few-shot learning abilities, have the potential to master such complexities of online shopping. Motivated by the potential, this challenge introduces Shopping MMLU, a comprehensive benchmark that mimics these real-world online shopping complexities. We invite participants to design powerful LLMs to improve how state-of-the-art techniques can better assist us in navigating online shopping, making it a more intuitive and satisfying experience, much like a knowledgeable shopping assistant would in real life.

🛍️ Introduction

Online shopping is complex, involving various tasks from browsing to purchasing, all requiring insights into customer behavior and intentions. This necessitates multi-task learning models that can leverage shared knowledge across tasks. Yet, many current models are task-specific, increasing development costs and limiting effectiveness. Large language models (LLMs) have the potential to change this by handling multiple tasks through a single model with minor prompt adjustments. Furthermore, LLMs can also improve customer experiences by providing interactive and timely recommendations. However, online shopping, as a highly specialized domain, features a wide range of domain-specific concepts (e.g. brands, product lines) and knowledge (e.g. which brand produces which products), making it challenging to adapt existing powerful LLMs from general domains to online shopping.

Motivated by the potential and the challenges of LLMs, we present Shopping MMLU, a massive challenge for online shopping, with 57 tasks and ~20000 questions, derived from real-world Amazon shopping data. All questions in this challenge are re-formulated into a unified text-to-text generation format to accommodate the exploration of LLM-based solutions. Shopping MMLU focuses on four key shopping skills (which will serve as Tracks 1-4):

shopping concept understanding
shopping knowledge reasoning
user behavior alignment
multi-lingual abilities

In addition, we set up Track 5: All-around to encourage even more versatile and all-around solutions. Track 5 requires participants to solve all questions in Tracks 1-4 with a single solution, which is expected to be more principled and unified than track-specific solutions to Tracks 1-4. We will correspondingly assign larger awards to Track 5.

We hope that this challenge can provide participants with valuable hands-on experience in developing state-of-the-art LLM-based techniques for real-world problems. We also believe that the challenge will benefit the industry of online user-oriented services with strong and ready-to-use LLM-based solutions, as well as the whole machine learning community with helpful insights and guidelines on LLM training and development.

📅 Timeline

There will be two phases in the challenge. Phase 1 will be open to all teams who sign up. After Phase 1, we will apply a top-75 cutoff, and only teams in the top 75 of Phase 1 will proceed to Phase 2. The cutoff of 75 is tentative and may increase slightly.

Correspondingly, Shopping MMLU will be split into two disjoint test sets, with Phase 2 containing harder samples and tasks. The final winners will be determined solely with Phase 2 data.

Phase 2 Start Date: 25th May, 2024
End Date: 10th July, 2024 23:55 UTC
Winner Notification: 15th July, 2024
Winner Announcement: 26th August, 2024 (At KDD 2024)

🏆 Prizes

The challenge carries a prize pool of $41,500 categorized into the following three types of prizes:

Winner Prizes: We will award winners (first, second, and third places) in each track with cash prizes.
AWS Credits: Teams immediately after the winners in each track will be awarded with AWS credits.
Student Awards: We are aware that developing LLMs requires significant computational resources and engineering effort, neither of which is readily accessible to students. Therefore, we set up a dedicated student award for the best student teams (i.e. all participants are students) in each track to motivate students to develop resource-efficient solutions.

Specifically, Tracks 1-4 carry the following prizes:

🥇 First place: $2,000
🥈 Second place: $1,000
🥉 Third place: $500
4th-7th places: AWS Credit $500
🏅 Student Award: $750

Track 5 (all-around) carries the following prizes:

🥇 First place: $7,000
🥈 Second place: $3,500
🥉 Third place: $1,500
4th-8th places: AWS Credit $500
🏅 Student Award: $2,000

All awards are cumulative. For example, if your solution ranks 2nd in Track 5 (all-around) and also ranks 3rd in Track 4, you can get a total cash prize of $3,500 + $500 = $4,000. However, Track 5 solutions will not be automatically considered eligible for Tracks 1-4. You have to make a submission to a track to be eligible for its prizes.
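As a quick check of the arithmetic, the sketch below adds up the prizes listed above and reproduces the cumulative example. It assumes one student award per track and counts AWS credits at face value; the variable and function names are made up for illustration.

```python
# Prize amounts in USD, copied from the two lists above.
TRACKS_1_TO_4_CASH = {"first": 2000, "second": 1000, "third": 500, "student": 750}
TRACK_5_CASH = {"first": 7000, "second": 3500, "third": 1500, "student": 2000}
AWS_CREDIT = 500


def track_total(cash_prizes, n_credit_places):
    """Cash prizes plus AWS credits for a single track."""
    return sum(cash_prizes.values()) + n_credit_places * AWS_CREDIT


# Tracks 1-4 give credits to places 4-7 (4 teams); Track 5 to places 4-8 (5 teams).
prize_pool = 4 * track_total(TRACKS_1_TO_4_CASH, 4) + track_total(TRACK_5_CASH, 5)

# Cumulative example from the text: 2nd in Track 5 plus 3rd in Track 4.
example_cash = TRACK_5_CASH["second"] + TRACKS_1_TO_4_CASH["third"]

print(prize_pool, example_cash)
```

Under these assumptions the total matches the stated prize pool of $41,500.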

In addition to cash prizes, the winning teams will also have the opportunity to present their work at the KDD Cup workshop 2024, held in conjunction with ACM SIGKDD 2024.

📊 Dataset

Shopping MMLU, the benchmark used in this challenge, is an anonymized, multi-task dataset sampled from real-world Amazon shopping data. Statistics of Shopping MMLU are given in the following table.

# Tasks	# Questions	# Products	# Product Category	# Attributes	# Reviews	# Queries
57	20598	~13300	400	1032	~11200	~4500

Shopping MMLU is split into a few-shot development set and a test set to better mimic real-world applications --- where you never know the customer's questions beforehand. With this setting, we encourage participants to use any resource that is publicly available (e.g. pre-trained models, text datasets) to construct their solutions, instead of overfitting the given development data (e.g. generating pseudo data samples with GPT).

The development datasets will be given in json format with the following fields.

input_field: This field contains the instructions and the question that should be answered by the model.
output_field: This field contains the ground truth answer to the question.
task_type: This field contains the type of the task (Details in the next Section, "Tasks")
task_name: This field contains the name of the task. However, the exact task names are redacted. We provide hashed task names instead (e.g. task1, task10).
metric: This field contains the metric used to evaluate the question (Details in Section "Evaluation Metrics").
track: This field specifies the track the question comes from.

However, the test dataset (which will be hidden from participants) will have a different format with only two fields:

input_field, which is the same as above.
is_multiple_choice: This field contains a True or False that indicates whether the question is a multiple choice or not. The detailed 'task_type' will not be given to participants.
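To make the two formats concrete, here is a minimal sketch that validates development records and reduces one to the two fields exposed by the hidden test set. The field names come from the lists above; the one-object-per-line (JSON Lines) layout, the "multiple-choice" label, and the example record are assumptions for illustration, and the helper names are hypothetical.

```python
import json

# Field names from the development-set description above.
DEV_FIELDS = {"input_field", "output_field", "task_type",
              "task_name", "metric", "track"}


def load_dev_records(lines):
    """Parse JSON Lines text and check that every record has all dev fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = DEV_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def to_test_format(record, multiple_choice_label="multiple-choice"):
    """Keep only the two fields that the hidden test set exposes.

    The label string is an assumption; check the actual task_type values
    in the development data before relying on it.
    """
    return {
        "input_field": record["input_field"],
        "is_multiple_choice": record["task_type"] == multiple_choice_label,
    }


# Invented record, not taken from the real development set.
example_line = json.dumps({
    "input_field": "Which category fits 'running shoes'?\n0. Footwear\n1. Kitchen\nAnswer:",
    "output_field": "0",
    "task_type": "multiple-choice",
    "task_name": "task1",
    "metric": "accuracy",
    "track": "track1",
})
records = load_dev_records([example_line, ""])
print(to_test_format(records[0]))
```

Running local experiments against the reduced format helps avoid accidentally depending on fields such as task_type or metric that will not be available at test time.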

👨‍💻👩‍💻 Tasks

Shopping MMLU is constructed to evaluate four important shopping skills, which correspond to Tracks 1-4 of the challenge.

Shopping Concept Understanding: There are many domain-specific concepts in online shopping, such as brands, product lines, etc. Moreover, these concepts often exist in short texts, such as queries, making it even more challenging for models to understand them without adequate contexts. This skill emphasizes the ability of LLMs to understand and answer questions related to these concepts.
Shopping Knowledge Reasoning: Complex reasoning with implicit knowledge is involved when people make shopping decisions, such as numeric reasoning (e.g. calculating the total amount of a product pack), multi-step reasoning (e.g. identifying whether two products are compatible with each other). This skill focuses on evaluating the model's reasoning ability on products or product attributes with domain-specific implicit knowledge.
User Behavior Alignment: User behavior modeling is of paramount importance in online shopping. However, user behaviors are highly diverse, including browsing, purchasing, query-then-clicking, etc. Moreover, most of them are implicit and not expressed in texts. Therefore, aligning with heterogeneous and implicit shopping behaviors is a unique challenge for language models in online shopping, which is the primary aim of this track.
Multi-lingual Abilities: Multi-lingual models are especially desired in online shopping as they can be deployed in multiple marketplaces without re-training. Therefore, we include a separate multi-lingual track, including multi-lingual concept understanding and user behavior alignment, to evaluate how a single model performs in different shopping locales without re-training.

In addition, we set up Track 5: All-around, requiring participants to solve all questions in Tracks 1-4 with a unified solution to further emphasize the generalizability and the versatility of the solutions.

Shopping MMLU involves a total of 5 types of tasks, all of which are re-formulated to text-to-text generation to accommodate LLM-based solutions.

Multiple Choice: Each question is associated with several choices, and the model is required to output a single correct choice.
Retrieval: Each question is associated with a requirement and a list of candidate items, and the model is required to retrieve all items that satisfy the requirement.
Ranking: Each question is associated with a requirement and a list of candidate items, and the model is required to re-rank all items according to how each item satisfies the requirement.
Named Entity Recognition: Each question is associated with a piece of text and an entity type. The model is required to extract all phrases from the text that fall in the entity type.
Generation: Each question is associated with an instruction and a question, and the model is required to generate text pieces following the instruction to answer the question. There are multiple types of generation questions, including extractive generation, translation, elaboration, etc.

To test the generalization ability of the solutions, the development set will only cover a subset of the 57 tasks, so some tasks will remain unseen throughout the challenge. However, all 5 task types will be covered in the development set to help participants understand the prompts and output formats.

🖊 Evaluation Framework

Evaluation Protocol

To ensure a thorough and unbiased evaluation, the challenge uses a hidden test set that will remain undisclosed to participants to prevent manual labeling or manipulation, and to promote generalizable solutions.

Evaluation Metrics

Shopping MMLU includes multiple types of tasks, each requiring specific metrics for evaluation. The metrics selected are as follows:

Multiple Choice: Accuracy is used to measure the performance for multiple choice questions.
Ranking: Normalized Discounted Cumulative Gain (NDCG) is used to evaluate ranking tasks.
Named Entity Recognition (NER): Micro-F1 score is used to assess NER tasks.
Retrieval: Hit@3 is used to assess retrieval tasks. The number of positive samples does not exceed 3 for any retrieval question in Shopping MMLU.
Generation: Metrics vary based on the task type:
Extraction tasks (e.g., keyphrase extraction) use ROUGE-L.
Translation tasks use the BLEU score.
For other generation tasks, we employ Sentence Transformer to calculate sentence embeddings of the generated text x_gen and the ground truth text x_gt. We then compute the cosine similarity between the embeddings of x_gen and x_gt (clipped to [0, 1]) as the metric. This approach evaluates text semantics rather than just token-level accuracy.
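The sketch below gives plain-Python versions of several of the metrics listed above. It is not the official evaluator: Hit@3 is read here as the fraction of positives found in the top 3 (one reading consistent with the note that positives never exceed 3), NDCG uses linear gain with a log2 discount, and the cosine function takes precomputed embedding vectors instead of calling a sentence embedding model. All function names are hypothetical.

```python
import math


def hit_at_3(retrieved, positives):
    """Fraction of positive items that appear among the top 3 retrieved items."""
    if not positives:
        return 0.0
    return len(set(retrieved[:3]) & set(positives)) / len(positives)


def ndcg(relevance_in_predicted_order):
    """NDCG with linear gain; input is the true relevance of each ranked item."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal = dcg(sorted(relevance_in_predicted_order, reverse=True))
    return dcg(relevance_in_predicted_order) / ideal if ideal > 0 else 0.0


def micro_f1(predicted_sets, gold_sets):
    """Micro-F1 over all questions: pool true positives before computing F1."""
    tp = sum(len(p & g) for p, g in zip(predicted_sets, gold_sets))
    n_pred = sum(len(p) for p in predicted_sets)
    n_gold = sum(len(g) for g in gold_sets)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def clipped_cosine(u, v):
    """Cosine similarity of two embedding vectors, clipped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))


print(hit_at_3([5, 2, 9, 1], [2, 1]))      # item 1 is ranked 4th, so it is missed
print(clipped_cosine([1, 0], [-1, 0]))     # negative similarity is clipped
```

Clipping matters because raw cosine similarity can be negative, which would otherwise break the assumption that every metric lies in [0, 1].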

As all tasks are converted into text generation tasks, rule-based parsers will parse the answers from participants' solutions. Answers that parsers cannot process will be scored as 0. The parsers will be available to participants.
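To illustrate what rule-based parsing implies for output formatting, here is a hypothetical parser for multiple choice answers. It assumes that choices are numbered and that the first integer in the generated text is the answer; the official parsers may apply different rules, so treat this only as a sketch of the "unparseable means 0" behavior.

```python
import re


def parse_multiple_choice(generation):
    """Return the first integer found in the generated text, or None."""
    match = re.search(r"\d+", generation)
    return int(match.group()) if match else None


def score_multiple_choice(generation, gold_choice):
    """Accuracy for one question; output that cannot be parsed scores 0."""
    predicted = parse_multiple_choice(generation)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold_choice else 0.0


print(score_multiple_choice("The answer is 2.", 2))
print(score_multiple_choice("I am not sure.", 2))
```

Because a parser like this cannot recover an answer from free-form text, constraining generation to short, well-formed outputs is usually safer than post-hoc cleanup.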

Since all these metrics lie in the range [0, 1], we calculate the average metric over all tasks within each track (macro-averaged) to determine the overall score for a track and identify track winners. Track 5 applies the same rule: the metrics of all tasks are macro-averaged directly, rather than averaging the scores of Tracks 1-4.
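The macro-averaging rule can be sketched as follows: scores are first averaged within each task, and the task means are then averaged with equal weight. The function name and the (task_name, score) input layout are assumptions for illustration.

```python
from collections import defaultdict


def track_score(question_scores):
    """Macro-average: mean of per-task mean scores.

    question_scores is an iterable of (task_name, score) pairs,
    with every score in [0, 1].
    """
    per_task = defaultdict(list)
    for task_name, score in question_scores:
        per_task[task_name].append(score)
    if not per_task:
        raise ValueError("no scores given")
    task_means = [sum(scores) / len(scores) for scores in per_task.values()]
    return sum(task_means) / len(task_means)


# task1 has four questions (mean 0.75); task2 has one question (mean 0.5).
scores = [("task1", 1.0), ("task1", 1.0), ("task1", 1.0), ("task1", 0.0),
          ("task2", 0.5)]
print(track_score(scores))
```

In this example the macro-average is (0.75 + 0.5) / 2 = 0.625, whereas a plain average over the five questions would be 0.7. Small tasks therefore weigh as much as large ones.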

🗃️ Submission

The challenge will be evaluated as a code competition. Participants must submit their code and essential resources, such as fine-tuned model weights and indices for Retrieval-Augmented Generation (RAG). The submitted code will be run on our servers to generate results, which will then be evaluated.

Submission Instructions

For submission instructions, please see the starter kit and the submission guideline.

Hardware and System Configuration

We apply a limit on the hardware available to each participant to run their solutions. Specifically,

All solutions will be run on AWS g4dn.12xlarge instances equipped with NVIDIA T4 GPUs.
Solutions for Phase 2 will have access to 4 x NVIDIA T4 GPUs. Please note that the NVIDIA T4 uses an older architecture and is thus not compatible with certain acceleration toolkits (e.g. Flash Attention 2), so please be careful about compatibility.
The maximum repo size is 200GB.

Besides, the following restrictions will also be imposed.

Network connection will be disabled.
Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows.

Phase	Track 1	Track 2	Track 3	Track 4	Track 5
Phase 2	70 minutes	20 minutes	30 minutes	20 minutes	140 minutes

For reference, the baseline solution with zero-shot LLaMA3-8B-instruct (with vLLM) consumes the following amount of time.

Phase	Track 1	Track 2	Track 3	Track 4
Phase 2	1504s	406s	618s	374s
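For a rough sense of the available headroom, the sketch below compares the baseline runtimes with the tentative Phase 2 limits from the two tables above. The dictionary and function names are illustrative; no baseline time is given for Track 5, so it is left out.

```python
# Tentative Phase 2 time limits (minutes) and baseline runtimes (seconds),
# copied from the two tables above.
LIMIT_MINUTES = {"Track 1": 70, "Track 2": 20, "Track 3": 30, "Track 4": 20}
BASELINE_SECONDS = {"Track 1": 1504, "Track 2": 406, "Track 3": 618, "Track 4": 374}


def budget_fraction(track):
    """Share of the time limit consumed by the baseline on a given track."""
    return BASELINE_SECONDS[track] / (LIMIT_MINUTES[track] * 60)


for track in LIMIT_MINUTES:
    print(f"{track}: {budget_fraction(track):.1%} of the limit")
```

The baseline uses roughly a third of each limit, so a solution that is about three times slower per question than zero-shot LLaMA3-8B-instruct would approach the cutoff.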

Each team will be able to make up to 5 submissions per week per track.

Note: It is important to load the models in torch.float16 (rather than torch.bfloat16, which is not supported by the NVIDIA T4).
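One way to guard against the wrong dtype is to choose it from the GPU's compute capability. The helper below is a hypothetical sketch, not part of the starter kit: it relies on the facts that the T4 has compute capability 7.5 and that native bfloat16 support starts at 8.0. It is kept free of imports so that it runs anywhere.

```python
# The NVIDIA T4 is a Turing GPU with compute capability 7.5.
T4_COMPUTE_CAPABILITY = (7, 5)


def pick_dtype_name(compute_capability):
    """Return "bfloat16" only on GPUs with capability 8.0 or newer, else "float16"."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# With PyTorch installed, the capability tuple can be read with
# torch.cuda.get_device_capability(), and the returned name mapped to a
# dtype with getattr(torch, name).
print(pick_dtype_name(T4_COMPUTE_CAPABILITY))
```

Since the evaluation hardware is fixed, hard-coding float16 is equally valid; the check is mainly useful when the same code is also run on newer GPUs during development.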

Evaluation and Leaderboard

The evaluation uses undisclosed test datasets in a few-shot setting, both to maintain a live leaderboard and to determine the final winners.

Use of External Resources

By only providing a few-shot development set, we encourage participants to exploit public resources to build their solutions. However, participants should ensure that the datasets or models they use are publicly available and equally accessible to all participants. Such a constraint rules out proprietary datasets and models owned by large corporations. Participants are allowed to re-formulate existing datasets (e.g. adding additional data/labels manually or with ChatGPT), but should make them publicly available after the competition.

Technical Report and Code Submission

At the end of the competition, we will notify potential winners, who will be required to submit a technical report describing their solutions, as well as the code necessary to reproduce them. The organizers will review the submitted materials to check whether each solution follows the rules of the challenge. Teams whose solutions pass the review will get the chance to present their solutions at the KDD Cup 2024 Workshop.

🏛️ KDD Cup Workshop

KDD Cup is an annual data mining and knowledge discovery competition organised by the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD). The competition aims to promote research and development in data mining and knowledge discovery by providing a platform for researchers and practitioners to share their innovative solutions to challenging problems in various domains. The KDD Cup Workshop 2024 will be held in Barcelona, Spain, from Sunday, August 25, 2024, to Thursday, August 29, 2024, in conjunction with ACM SIGKDD 2024.