💻 Make your first submission using the starter kit.
🗃️ Find the challenge resources over here.

๐Ÿ•ต๏ธ Introduction

Currently available generative models are typically restricted to a single modality, and generating both video and audio remains a challenging task. In this challenge, we focus on achieving spatial alignment between the generated video and audio. Participants take the initiative in generating both modalities in an unconditional manner. The dataset used in this challenge primarily features humans and musical instruments, with sound captured by microphones so that the audio reflects the activities producing sound in the corresponding video.

📜 Task – Spatially Aligned Audio and Video Generation

Goal: The objective of the spatial alignment track is to develop a generative model that creates videos with corresponding stereo audio that are spatially aligned, using 5-second videos with associated stereo audio as training data. The resulting model should produce high-quality videos with matching audio. We evaluate the results based on the quality of the generated video and audio as well as a spatial alignment score, and participants are encouraged to achieve the highest possible scores on these metrics.

Unconditional Generation: This track focuses on unconditional generation, meaning participants need to build and train models that generate audio and video without any specific conditions or prompts.

Format: The target video resolution is 256x256, and the audio has 2 channels (stereo).
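
For concreteness, below is a sketch of what one generated 5-second sample could look like as arrays. The frame rate and audio sample rate are illustrative assumptions (they are not specified in this description); only the 256x256 resolution and the two audio channels are fixed by the task.

```python
import numpy as np

# One generated 5-second sample in the target format.
FPS = 16                # assumed frame rate (placeholder, not fixed by the task description)
SAMPLE_RATE = 16_000    # assumed audio sample rate (placeholder)
DURATION_S = 5

video = np.zeros((DURATION_S * FPS, 256, 256, 3), dtype=np.uint8)   # (frames, H, W, RGB)
audio = np.zeros((2, DURATION_S * SAMPLE_RATE), dtype=np.float32)   # (stereo channels, samples)

assert video.shape[1:3] == (256, 256)   # 256x256 resolution required by the task
assert audio.shape[0] == 2              # stereo audio required by the task
```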

 

โ€๐Ÿ“ Dataset

For this task, we are using a customised dataset named SVGSA24 derived from the STARSS23 dataset, where the original videos with an equirectangular view and Ambisonics audio have been converted to videos with a perspective view and stereo audio. Additionally, we have curated content focusing on on-screen speech and instrument sounds.

Some examples are shown below.

Content: The dataset predominantly features humans and musical instruments, providing a diverse range of audio-visual scenarios. This includes human speech as well as a variety of musical sounds.

Audio: The audio component is captured using high-quality microphones, ensuring that the sounds accurately reflect the activities depicted in the video. This includes speech, musical notes, and background sounds.

Video: The video component includes visual data captured from different viewing angles. Each video is 5 seconds long.

Split: We release the development set to the public and keep the evaluation set for the challenge evaluation. The evaluation set serves as the target distribution used to quantify the quality of the generated video and audio.

📊 Evaluation Metrics

In this challenge, we use the following evaluation scores for quality measurement:

  • Fréchet Video Distance (FVD Score): To assess the quality of the generated videos.
  • Fréchet Audio Distance (FAD Score): To assess the quality of the generated audio.

We have prepared an isolated dataset for evaluation.
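
For reference, both FVD and FAD are Fréchet distances between Gaussians fitted to embeddings of real and generated samples (commonly I3D features for video and VGGish-style features for audio). The sketch below only illustrates this shared formula; the exact feature extractors and implementation used for the official evaluation are determined by the organisers.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of embeddings.

    real_feats, gen_feats: arrays of shape (num_samples, feature_dim), e.g.
    video or audio embeddings of real and generated clips.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```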

To quantify the spatial alignment, we use the metric below:

  • Spatial AV-Align: To evaluate spatial alignment, we employ Spatial AV-Align, a novel metric built on pretrained object detection and sound event localization and detection (SELD) models.


More specifically, the metric is computed as follows:

  1. We first detect candidate positions of sounding objects per frame in each modality separately.

  2. Then, for each position detected in the audio, we check whether a corresponding position is also detected in the video.

  3. We determine whether a SELD result overlaps with an object detection result. If there is an area of overlap, it counts as a true positive (TP); if not, as a false negative (FN).

    When matching, we allow the use of object detection results from frames t-1 and t+1 in addition to frame t.

    (We do not validate whether each position in the video is detected in the audio, because the dataset includes persons who do not talk or play instruments.)

  4. Finally, we calculate a recall metric as the alignment score, ranging between zero and one. Given TP and FN, the alignment score is defined as TP / (TP + FN). A minimal code sketch of this computation is given after this list.
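
Below is a minimal sketch of this recall computation. It assumes the SELD outputs and the object detections have already been converted to 2D boxes in image coordinates and that "overlap" means a non-empty intersection; the official implementation may represent audio positions and test overlap differently.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def spatial_av_align(
    audio_dets: Dict[int, List[Box]],   # frame index -> SELD (audio) detections
    video_dets: Dict[int, List[Box]],   # frame index -> object (video) detections
) -> float:
    """Recall-style alignment score TP / (TP + FN).

    An audio detection at frame t counts as a TP if it overlaps any video
    detection at frame t-1, t, or t+1; otherwise it counts as an FN.
    """
    tp = fn = 0
    for t, audio_boxes in audio_dets.items():
        # Pool object detections from the neighbouring frames as well.
        candidates = (
            video_dets.get(t - 1, []) + video_dets.get(t, []) + video_dets.get(t + 1, [])
        )
        for a_box in audio_boxes:
            if any(boxes_overlap(a_box, v_box) for v_box in candidates):
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```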


💰 Prizes

Spatial Alignment Track
Total Prize: 17,500 USD

  • 🥇 1st: 10,000 USD
  • 🥈 2nd: 5,000 USD
  • 🥉 3rd: 2,000 USD

More details about the leaderboards and the prizes will be announced soon. Please refer to the Challenge Rules for details on the open-sourcing criteria each leaderboard must satisfy to be eligible for the associated prizes.

💪 Getting Started

Make your first submission to the challenge using this easy-to-follow starter kit. Find the challenge resources over here.

🚀 Baseline System

You can find the baseline pretrained models in the starter kit. The baseline consists of a base diffusion model that generates samples at 64x64 resolution and a super-resolution model that upsamples the generated video to 256x256. Please note that we might add more baselines during the challenge. Below we provide the scores for our baseline model, which is based on MM-Diffusion; in this implementation, we handle stereo audio and train the model on the introduced SVGSA24 dataset.
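
To make the two-stage (cascaded) structure concrete, the sketch below uses stand-in functions with random tensors in place of the real MM-Diffusion-based checkpoints; the actual model classes, frame count, and audio sample rate are defined by the starter kit and will differ from these placeholders.

```python
import torch

def sample_base(batch: int = 1, frames: int = 80, audio_len: int = 80_000):
    """Stand-in for the base joint diffusion model.

    Returns a 64x64 video and a stereo waveform. The random tensors only
    illustrate shapes and data flow; the real baseline samples from the
    MM-Diffusion-based checkpoint, and the frame count and audio length
    depend on its frame rate and sample rate.
    """
    video_64 = torch.rand(batch, frames, 3, 64, 64)   # (B, T, C, H, W)
    audio = torch.rand(batch, 2, audio_len)           # (B, stereo channels, samples)
    return video_64, audio

def super_resolve(video_64: torch.Tensor) -> torch.Tensor:
    """Stand-in for the super-resolution stage: upsample 64x64 frames to 256x256."""
    b, t, c, h, w = video_64.shape
    frames_256 = torch.nn.functional.interpolate(
        video_64.reshape(b * t, c, h, w),
        size=(256, 256), mode="bilinear", align_corners=False,
    )
    return frames_256.reshape(b, t, c, 256, 256)

video_64, audio = sample_base()      # stage 1: joint audio-video generation at 64x64
video_256 = super_resolve(video_64)  # stage 2: video super-resolution to 256x256
print(video_256.shape, audio.shape)  # torch.Size([1, 80, 3, 256, 256]) torch.Size([1, 2, 80000])
```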

Below are some generated results (256x256) produced with our provided pretrained models, using DPM-Solver for inference.

 

| | FVD ↓ | FAD ↓ | Spatial AV-Align ↑ |
|---|---|---|---|
| Baseline | 1050.3 | 9.65 | 0.48 |
| Ground Truth | 572.05 | 3.70 | 0.92 |

 

 


📅 Timeline

The SVG Challenge takes place in two rounds, with an additional warm-up round. The tentative launch dates are:

  • Warm-up Round: 29th Oct 2024
  • Phase I: 2nd Dec 2024
  • Phase II: 3rd Jan 2025
  • Challenge End: 25th Mar 2025

📖 Citing the Challenge

Please consider citing the following paper if you participate in this challenge and/or use the datasets involved.

@article{shimada2024savgbench,
  title={{SAVGBench}: Benchmarking Spatially Aligned Audio-Video Generation},
  author={Shimada, Kazuki and Simon, Christian and Shibuya, Takashi and Takahashi, Shusuke and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2412.13462},
  year={2024}
}

📱 Challenge Organising Committee

Sounding Video Generation - Spatial Alignment Track
Kazuki Shimada, Christian Simon, Shusuke Takahashi, Shiqi Yang, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji (Sony)