Problem Statements

You're on vacation, strolling through ancient sites as your smart glasses share their history. Later, at a local restaurant, they translate the menu, helping you order with confidence. As the day winds down, you head back to the parking lot: no searching, no stress; your glasses pull up an image reminder of exactly where you parked.
Wearable devices are revolutionizing how we communicate, work, and experience the world. But to be truly valuable in everyday life, they must provide relevant, accurate, and reliable information tailored to users' needs.
Introducing the Meta CRAG-MM: Comprehensive RAG Benchmark for Multi-modal, Multi-turn Challenge!
Introduction
Vision Large Language Models (VLLMs) have undergone significant advancements in recent years, empowering multi-modal understanding and visual question-answering (VQA) capabilities behind smart glasses. Despite the progress, VLLMs still face a major challenge: generating hallucinated answers. Studies have shown that VLLMs encounter substantial difficulties in handling queries involving long-tail entities [1]; these models also encounter challenges in handling complex queries that require the integration of different capabilities: recognition, OCR, knowledge, and generation [2].
The Retrieval-Augmented Generation (RAG) paradigm has been expanded to accommodate multi-modal (MM) input and has demonstrated promise in addressing the knowledge limitations of VLLMs. Given an image and a question, an MM-RAG system constructs a search query by synthesizing information from the image and the question, searches external sources to retrieve relevant information, and then provides grounded answers to address the question [3].
Figure 1: MM-RAG
Despite its potential, MM-RAG still faces many challenges, such as recognizing the correct subject and comprehending the visual context of the image to understand the question, performing effective searches to retrieve useful information, synthesizing information from different sources to generate coherent and informative answers, and engaging in smooth multi-turn conversations. A comprehensive benchmark that provides a standardized framework and clear metrics is urgently needed to enable reliable and informative assessment of MM-RAG systems and to facilitate and advance innovation.
What is CRAG-MM: Comprehensive RAG Benchmark for Multi-modal, Multi-turn Question Answering?
CRAG-MM is a visual question-answering benchmark that focuses on factual questions, offering a unique collection of image and question-answering sets to enable comprehensive assessment of wearable devices. Specifically, CRAG-MM features a diverse collection of 5k images, including 3k egocentric ones captured by Ray-Ban Meta smart glasses, covering 14 domains and reflecting real-world challenges associated with handling egocentric images.
The benchmark includes 4 types of questions, ranging from simple queries that can be answered by looking at the image only to complex ones that require retrieving information from multiple sources and performing reasoning.
Moreover, CRAG-MM encompasses both single-turn and multi-turn conversations, providing a more overarching evaluation of MM-RAG solutions.
Timeline
There will be two phases in the challenge. Phase 1 will be open to all teams who sign up. All teams that have at least one successful submission in Phase 1 can enter Phase 2.
Phase 1: Open Competition
- Website Open, Sample Data Available, and Registration Begins: March 6, 2025, 23:55 UTC
- Data Available: March 15, 2025, 23:55 UTC
- Phase 1 Submission Start Date: March 17, 2025, 23:55 UTC
- Phase 1 Submission End Date: May 10, 2025, 23:55 UTC
Phase 2: Competition for Top Teams
- Phase 2 Start Date: May 11, 2025, 23:55 UTC
- Registration and Team Freeze Deadline: May 21, 2025, 23:55 UTC
- Phase 2 End Date: Jun 1, 2025, 23:55 UTC
Winners Announcement
- Winner Notification: July 1, 2025
- Winner Public Announcement: August 5, 2025 (At KDD Cup Winners event)
Prizes
The challenge boasts a prize pool of USD 33,000. There are prizes for all three tasks.
Grand Prize
- The team with the highest score on egocentric images: $5,000
For Each Task
- First Place: $4,000
- Second Place: $2,500
- Third Place: $1,500
Special Awards
- First Place for each of the 4 question types: $1,000
The first, second, and third prize winners are not eligible to win prizes for complex question types.
META CRAG-MM Challenge
An MM-RAG QA system takes as input an image I and a question Q, and outputs an answer A; the answer is generated by MM-LLMs according to information retrieved from external sources, combined with knowledge internalized in the model. A Multi-turn MM-RAG QA system, in addition, takes questions and answers from previous turns as context to answer new questions. The answer should provide useful information to answer the question without adding any hallucination.
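To make this input/output contract concrete, below is a minimal Python sketch of a single- and multi-turn MM-RAG query along with an illustrative answering pipeline. The type names and helper functions are assumptions for illustration only and are not part of the official challenge API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One turn of a (possibly multi-turn) conversation."""
    question: str
    answer: Optional[str] = None  # filled in once the system has responded

@dataclass
class MMRAGQuery:
    """Input to an MM-RAG QA system: an image I, a question Q, and prior turns as context."""
    image_path: str
    question: str
    history: List[Turn] = field(default_factory=list)  # empty for single-turn QA

def synthesize_search_query(query: MMRAGQuery) -> str:
    # Placeholder: a real system would use the VLLM to rewrite image + question (+ history) into a search query.
    return query.question

def retrieve(search_query: str) -> List[str]:
    # Placeholder: a real system would call the image-KG and/or web search mock APIs here.
    return []

def generate_answer(query: MMRAGQuery, evidence: List[str]) -> str:
    # Placeholder: a real system would prompt an MM-LLM with the image, question, history, and evidence.
    return ""

def answer_question(query: MMRAGQuery) -> str:
    """Illustrative end-to-end flow: synthesize a search query, retrieve evidence, generate a grounded answer A."""
    evidence = retrieve(synthesize_search_query(query))
    answer = generate_answer(query, evidence)
    return answer if answer else "I don't know."  # missing answers use the standard response
```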
We first define four types of questions in our benchmark:
- Simple questions: Questions asking for simple facts.
  - Simple recognition: Can be directly answered from the image (e.g., "What brand is the milk?" or "Who wrote this book?", where the brand name or the book author is shown in the image).
  - Simple knowledge: Requires external knowledge for the answer (e.g., "What's the price of this sofa on Amazon?").
- Multi-hop questions: Questions that require chaining multiple pieces of information to compose the answer (e.g., "What other movies has the director of this movie directed in the past?").
- Comparison and Aggregation questions: Questions requiring aggregating or comparing multiple pieces of information (e.g., "Which drinks do not contain added sugar among these?" or "Is this cheaper on Amazon?").
- Reasoning questions: Questions about an entity that cannot be directly looked up and require reasoning to answer (e.g., "Can the dryer be used in Europe?", where the image shows a dryer).
We designed three competition tasks. As shown in Figure 2, Task #1 and Task #2 contain single-turn questions, where the former provides image-KG-based retrieval, and the latter additionally introduces web retrieval; Task #3 focuses on multi-turn conversations.
Here, we provide the content that can be leveraged in QA to ensure fair competition. We describe how we generated the data in the next section.
Task #1: Single-source Augmentation
- Goal: To test the basic answer generation capability of MM-RAG systems.
- Provides an image mock API to access information from an underlying image-based mock KG. The mock KG is indexed by the image and stores structured data associated with the image. Answers to the questions may or may not exist in the mock KG. The mock API takes an image as input and returns similar images from the mock KG, along with structured data associated with each image to support answer generation.
Task #2: Multi-source Augmentation
- Goal: To test how well the MM-RAG system synthesizes information from different sources.
- In addition to Task #1, this task provides a web search mock API as a second retrieval source. The web pages may provide useful information for answering the question but also contain noise.
Task #3: Multi-turn QA
- Goal: To test context understanding for smooth multi-turn conversations.
- This task tests the system's ability to conduct multi-turn conversations. Each conversation contains 2 to 6 turns. Except for the first turn, questions in later turns may or may not require the image for answering.
The three tasks, each building upon the previous one, guide competition teams to build end-to-end RAG systems for multi-modal, multi-turn QA.
Figure 2: CRAG-MM Tasks
Evaluation Metrics
We adopt exactly the same metrics and methods used in the CRAG competition [4] to assess the performance of MM-RAG systems. Below is a brief description of the evaluation criteria.
Single-turn QA
- For each question in the evaluation set, the answer is scored as:
- Perfect (fully correct) – Score: 1.0
- Acceptable (useful but with minor non-harmful errors) – Score: 0.5
- Missing (e.g., "I don't know", "I'm sorry I can't find …") – Score: 0.0
- Incorrect (wrong or irrelevant answer) – Score: -1.0
- Truthfulness Score: The average score across all examples in the evaluation set for a given MM-RAG system.
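For concreteness, here is a minimal sketch of how the per-answer scores above aggregate into the truthfulness score; the label strings are illustrative assumptions, and the official scoring harness may differ.

```python
from typing import List

# Per-answer scores as defined above; label strings are illustrative.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def truthfulness_score(labels: List[str]) -> float:
    """Average per-answer score across all examples in the evaluation set."""
    return sum(SCORES[label] for label in labels) / len(labels) if labels else 0.0

# Example: two perfect, one missing, one incorrect -> (1.0 + 1.0 + 0.0 - 1.0) / 4 = 0.25
print(truthfulness_score(["perfect", "perfect", "missing", "incorrect"]))
```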
Multi-turn QA
There is no dominant way to evaluate answer quality for multi-turn conversations. We adapt the method in [5], which is closest to the information-seeking flavor of conversations (in contrast to task fulfillment). In particular, we stop a conversation when the answers in two consecutive turns are wrong and consider answers to all remaining questions in the same conversation as missing, mimicking the behavior of real users when they lose trust or feel frustrated after repeated failures. We then take the average score of all multi-turn conversations.
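A minimal sketch of this stopping rule, assuming per-turn labels that follow the single-turn rubric above (the organizers' actual implementation may differ):

```python
from typing import List

SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def conversation_score(turn_labels: List[str]) -> float:
    """Score one conversation: after two consecutive incorrect answers, all
    remaining turns are treated as missing (score 0.0)."""
    scores: List[float] = []
    consecutive_wrong = 0
    for label in turn_labels:
        if consecutive_wrong >= 2:
            scores.append(SCORES["missing"])  # the user has lost trust; remaining turns count as missing
            continue
        scores.append(SCORES[label])
        consecutive_wrong = consecutive_wrong + 1 if label == "incorrect" else 0
    return sum(scores) / len(scores)

# The benchmark then averages conversation_score over all multi-turn conversations.
print(conversation_score(["perfect", "incorrect", "incorrect", "perfect"]))  # last turn counted as missing -> -0.25
```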
Evaluation Techniques
- Auto-evaluation is used to display scores on the leaderboard.
- In the final round, the top-10 teams are selected via auto-evaluation.
- Manual annotations determine the final top teams for each task.
Performance Constraints:
- 10-second timeout after the first token is generated.
- Only answer texts generated within 30 seconds are considered.
- Auto-evaluation truncates answers to 75 tokens to encourage conciseness.
- Full responses are checked manually for hallucination.
Evaluation Details
- All missing answers should return a standard response:
"I don't know."
- Every query is associated with a query time (when the query was made).
- Dynamic questions may have different correct answers depending on query time.
- Ground truth answers are correct at the time the data was collected.
CRAG-MM Dataset Description
CRAG-MM contains three parts:
- The Image Set
- The QA Set
- Retrieval Contents
Image Set
- CRAG-MM contains two types of images:
- Egocentric images: Captured using Ray-Ban Meta Smart Glasses.
- Normal images: Collected from public sources.
Question-Answer Pairs
- Covers 14 domains, including Books, Food, Math & Science, Shopping, Animal, Vehicles, and more.
- Includes 4 types of questions: Simple (recognition and knowledge), Multi-hop, Comparison and Aggregation, and Reasoning.
- Contains both single-turn and multi-turn conversations.
Retrieval Contents
Image Search
- A mock image search API takes an image as input.
- Returns similar images with structured metadata from a mock KG.
- Example: Querying with a landmark image returns similar images with metadata.
Text-Based Web Search
- A text search API takes a text query as input.
- Returns relevant web pages (URLs, page titles, snippets, last updated time).
Both APIs include hard negative data to simulate real-world challenges.
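To illustrate how these two retrieval sources might be combined in a solution, here is a hedged Python sketch. The `mock_image_search` and `mock_web_search` functions and their return fields are hypothetical stand-ins for the actual mock APIs provided with the challenge, not the official interface.

```python
from typing import Any, Dict, List

def mock_image_search(image_path: str, k: int = 5) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the image mock API: returns images similar to the
    query image, each with the structured metadata stored in the mock KG."""
    return []  # e.g., [{"image_url": ..., "entities": [...], "attributes": {...}}]

def mock_web_search(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the web search mock API: returns web pages with
    URL, page title, snippet, and last-updated time."""
    return []  # e.g., [{"url": ..., "title": ..., "snippet": ..., "last_updated": ...}]

def gather_evidence(image_path: str, question: str) -> List[str]:
    """Combine both sources into a flat evidence list. Results may contain hard
    negatives, so generation should treat them as noisy context, not ground truth."""
    evidence: List[str] = []
    for hit in mock_image_search(image_path):
        evidence.append(str(hit.get("attributes", {})))
    for page in mock_web_search(question):
        evidence.append(f"{page.get('title', '')}: {page.get('snippet', '')}")
    return evidence
```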
Submission and Participation
Participants must submit their code and model weights to run on the host's server for evaluation.
Model
This KDD Cup requires participants to use Llama models to build their RAG solutions. Specifically, participants can use or fine-tune the following Llama 3.2 models from https://llama.meta.com/llama-downloads:
- Llama 3.2 11B
- Llama 3.2 90B
Any other non-Llama models used must be under the 1.5B-parameter size limit.
Hardware and system configuration
We set a limit on the hardware available to each participant to run their solution. Specifically, all submissions will be run on a single AWS G6e instance with an NVIDIA L40S GPU with 48GB of GPU memory. Please note that:
- Llama 3.2 11B in full precision can run directly.
- Llama 3.2 90B in full precision cannot be run directly on this GPU instance. Quantization or other techniques need to be applied to make the model runnable (see the sketch after this list).
- The NVIDIA L40S does not use the latest GPU architecture and hence might not be compatible with certain acceleration toolkits, so please make sure the submitted solution is compatible with the configuration.
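As one possible way to fit the 90B model within the memory budget, here is a hedged sketch of 4-bit quantized loading with Hugging Face transformers and bitsandbytes. The model id is an assumed repository name, and other quantization schemes or serving stacks are equally valid.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

# Assumed Hugging Face repo id for the Llama 3.2 90B Vision model.
MODEL_ID = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# 4-bit NF4 weights roughly quarter the weight memory relative to bf16;
# computation runs in bfloat16 for numerical stability.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",  # places layers on the GPU and offloads to CPU if memory runs out
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```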
Moreover, the following restrictions will also be imposed.
- The network connection will be disabled.
- Each example will have a time-out limit of 30 seconds. [TO BE TESTED WITH AICROWD SUBMISSION SYSTEM].
- To encourage concise answers, each answer will be truncated to the first 75 BPE tokens in the auto-eval. In human-eval, graders will check the first 75 BPE tokens to find valid answers, but will check the whole response to judge hallucination (a truncation sketch follows this list).
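For reference, a minimal sketch of what 75-token truncation might look like with a Hugging Face tokenizer; the tokenizer id is an assumption, and the official auto-eval tokenizer is not specified here.

```python
from transformers import AutoTokenizer

# Assumed tokenizer id; the auto-evaluator's exact tokenizer may differ.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

def truncate_answer(answer: str, max_tokens: int = 75) -> str:
    """Keep only the first `max_tokens` BPE tokens of an answer, mirroring auto-eval truncation."""
    token_ids = tokenizer.encode(answer, add_special_tokens=False)
    return tokenizer.decode(token_ids[:max_tokens])
```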
Use of external resources
By providing only a small development set, we encourage participants to exploit public resources to build their solutions. However, participants should ensure that the datasets and models they use are publicly available and equally accessible to all participants. This constraint rules out proprietary datasets and models owned by large corporations. Participants are allowed to re-formulate existing datasets (e.g., adding additional data/labels manually or with Llama models), but award winners are required to make them publicly available after the competition.
Baseline implementation
We provide baseline RAG implementations based on the Llama 3.2 11B model to help participants onboard quickly.
Participation and Submission
Registration
- Teams of 1–5 participants can register on this page before submitting solutions.
- Team freeze deadline: May 21, 2025.
Solution Submission
- Phase I: Each team can make 6 submissions per week across all three tracks. [TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]
- Phase II: Each team can make 6 total submissions across all three tracks. [TO BE TESTED WITH AICROWD SUBMISSION SYSTEM]
Technical Report Submission
- Potential winners must submit a technical report.
- The report must include a solution description and code for reproducibility.
- Compliance with challenge rules is required for winning teams.
KDD Cup Workshop
The KDD Cup is an annual data mining and knowledge discovery competition organized by ACM SIGKDD.
- KDD Cup 2025 will be held in Toronto, Canada.
- Dates: August 3–7, 2025.
What Makes CRAG-MM Stand Out?
- The first publicly released wearable-device QA benchmark
  - Includes real-world challenges from wearable-device QA.
- Rich and insightful benchmark
  - Covers diverse images and questions (e.g., low-quality images, long-tail entities).
- Reliable and fair evaluation
  - Equal access to retrieval resources (image KG + web corpus).
Related Work
To the best of our knowledge, the Meta CRAG-MM challenge is the first MM-RAG challenge, both within KDD Cups and more broadly. CRAG-MM uniquely features natural use cases for wearable devices based on egocentric images. Moreover, it encompasses a variety of domains and question types, effectively evaluating different capabilities of MM-RAG systems: entity recognition, OCR, query rewriting, answer generation, and so on. Furthermore, CRAG-MM extends beyond single-turn QA by including multi-turn conversations, a common and critical use case for smart assistants.
Contact
For inquiries, contact:
Email: crag-kddcup-2025@meta.com
The organizers of this KDD Cup are scientists and engineers from Meta Reality Labs and Meta GenAI:
Xiao Yang, Jiaqi Wang, Shervin Ghasemlou, Parth Suresh, Sanat Sharma, Surya Appini, Haidar Khan, Roy Luo, Ziqiang Guan, Juheon Lee, Prashan Wanigasekara, Lingkun Kong, Sajal Choudhary, Tammy Stark, Chen Zhou, Kai Sun, Shane Moon, Nicolas Scheffer, Zhaleh Feizollahi, Mangesh Pujari, Andrea Jessee, Rakesh Wanga, Rohit Patel, Anuj Kumar, Xin Luna Dong
Competition rules: https://www.aicrowd.com/challenges/meta-crag-mm-challenge-2025/challenge_rules
References
[1] Qiu et al., "SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM". Available at: https://aclanthology.org/2024.findings-emnlp.14/
[2] Yu et al., "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities". Available at: https://arxiv.org/abs/2308.02490
[3] Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey". Available at: https://arxiv.org/abs/2312.10997
[4] Yang et al., "CRAG - Comprehensive RAG Benchmark". Available at: https://proceedings.neurips.cc/paper_files/paper/2024/hash/1435d2d0fca85a84d83ddcb754f58c29-Abstract-Datasets_and_Benchmarks_Track.html
[5] Bai et al., "MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues". Available at: https://aclanthology.org/2024.acl-long.401/