AIcrowd | Temporal Alignment Track

Phase 1: Completed

Phase 2: Completed Weight: 1.0

Sony Group Corporation

💻 Watch the Sounding Video Generation Challenge Townhall Recording!

‼️Key Updates: Local End-to-End Evaluation Script (Docker) and more.

💻 Make your first submission using the starter-kit.

🗃️ Find the challenge resources over here.

🎥 The Sounding Video Generation Challenge: Temporal Alignment

Welcome to the Temporal Alignment Task of the Sounding Video Generation Challenge 2024! This task focuses on generating videos where audio and video elements are perfectly synchronized in time, advancing the frontier of multi-modal content creation.

🕵️ Introduction

As multi-modal generation technology evolves, tight synchronization between audio and visual content has become increasingly important. Temporal alignment means that each sound occurs at the moment of the visual event that produces it, and achieving it remains a significant challenge in contemporary research. This task invites participants to develop innovative solutions that improve the temporal synchronization of audio and video in generated content.

📜 Task: Temporally Aligned Audio and Video Generation

The core objective of this track is to generate videos that are both temporally and semantically aligned with their respective audio tracks. Participants must produce high-resolution videos (minimum 256x256 pixels, 8 fps) accompanied by monaural audio (1 channel, minimum 16 kHz). There are two primary types of alignment to be achieved:

Semantic Alignment: The audio’s semantic content should match the video. For example, if a video shows a dog barking, the audio must accurately represent a barking sound.
Temporal Alignment: The audio should be precisely synchronized with the video actions. For instance, the barking sound should occur exactly when the dog is seen barking in the video.
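
To make the temporal-alignment requirement concrete, here is a minimal sketch of a toy score that checks whether visual events and audio events occur at the same times. It is not one of the challenge metrics; the function name, the event lists, and the tolerance of one frame at 8 fps are all assumptions made for illustration.

```python
def event_match_score(video_events, audio_events, tolerance=0.125):
    """Toy IoU-style score: matched events divided by all distinct events.

    video_events / audio_events: event times in seconds (hypothetical inputs,
    e.g. frames where an impact is visible and detected audio onsets).
    tolerance: largest time gap in seconds for two events to count as aligned.
    The default 0.125 s is one frame at 8 fps (an assumption).
    """
    video_events = sorted(video_events)
    audio_events = sorted(audio_events)
    matched = 0
    i = j = 0
    # Two-pointer sweep over both sorted lists; each event is matched at most once.
    while i < len(video_events) and j < len(audio_events):
        gap = video_events[i] - audio_events[j]
        if abs(gap) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif gap > 0:
            j += 1  # this audio event is too early for the current video event
        else:
            i += 1  # this video event is too early for the current audio event
    union = len(video_events) + len(audio_events) - matched
    return matched / union if union else 1.0


# Two of three events line up within the tolerance: 2 matched / 4 distinct.
print(event_match_score([0.5, 1.5, 2.5], [0.5, 1.55, 3.4]))  # 0.5
```

A score of 1.0 means every visual event has a sound within the tolerance and vice versa; unmatched events in either modality lower the score.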

Participants will be assessed based on how well their submissions synchronize audio and video over time. Training will be conducted using a customized Greatest Hits dataset, which includes video captions for ease of training. A baseline model, built on AnimateDiff and AudioLDM, is available for reference. Submissions will be evaluated using a specific set of text prompts to determine the level of synchronization achieved.

📁 Dataset

This task uses a customized dataset named SVGTA24, derived from the Greatest Hits dataset. It contains 977 videos of humans hitting various objects with a drumstick. Given the prominence of sound and motion in these videos, this dataset is well suited to assessing the temporal alignment between generated audio and video. The original videos have been segmented into non-overlapping four-second clips, resulting in 8,217 clips for training. While the dataset includes predefined train/validation splits, participants can use the entire dataset for training purposes.
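
The segmentation into non-overlapping four-second clips can be sketched as follows. This is an illustration only: the helper name is hypothetical, and dropping a trailing remainder shorter than four seconds is an assumption, since the page does not say how leftover footage was handled.

```python
def clip_boundaries(duration_s, clip_len_s=4.0):
    """Return (start, end) times in seconds of non-overlapping clips.

    Only clips that fit entirely inside the video are returned; any shorter
    remainder at the end is dropped (an assumption, not stated by the source).
    """
    n_clips = int(duration_s // clip_len_s)
    return [(k * clip_len_s, (k + 1) * clip_len_s) for k in range(n_clips)]


# A 35-second video yields 8 full clips; the last 3 seconds are unused.
bounds = clip_boundaries(35.0)
print(len(bounds), bounds[-1])  # 8 (28.0, 32.0)
```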

To adapt this dataset for text-conditional sounding video generation, captions for all video clips were automatically generated using LLaVA-Next. These captions are provided along with the video clips. Additionally, a set of synthetic captions, unseen in the dataset, has been created and will be used for testing the submitted models.

Audio: The audio track is monaural, primarily featuring the sound of objects being struck by a drumstick, with a sampling rate of 16 kHz.
Video: The video captures scenes of a person striking objects with a drumstick. The camera angle remains largely fixed, and each video clip lasts four seconds, with eight frames per second. The resolution is 256x256 pixels.
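
The clip format above implies 32 frames (4 s at 8 fps) and 64,000 audio samples (4 s at 16 kHz) per clip. The sketch below checks an in-memory clip against those numbers. The array layouts and the function name are assumptions, and this is not the official evaluation check.

```python
import numpy as np

FPS = 8               # frames per second, from the dataset description
SAMPLE_RATE = 16_000  # audio sampling rate in Hz
CLIP_SECONDS = 4      # clip length in seconds
MIN_SIDE = 256        # minimum height and width in pixels


def check_clip(video, audio):
    """Return a list of problems; an empty list means the pair matches the format.

    video: array shaped (frames, height, width, channels) (assumed layout).
    audio: 1-D array of mono samples (assumed layout).
    """
    problems = []
    if video.ndim != 4:
        problems.append(f"video should have 4 dimensions, got {video.ndim}")
    else:
        frames, height, width, _ = video.shape
        if frames != FPS * CLIP_SECONDS:
            problems.append(f"expected {FPS * CLIP_SECONDS} frames, got {frames}")
        if min(height, width) < MIN_SIDE:
            problems.append(f"resolution {width}x{height} is below {MIN_SIDE}x{MIN_SIDE}")
    if audio.ndim != 1:
        problems.append("audio should be a single mono channel")
    elif audio.shape[0] != SAMPLE_RATE * CLIP_SECONDS:
        problems.append(
            f"expected {SAMPLE_RATE * CLIP_SECONDS} samples, got {audio.shape[0]}"
        )
    return problems


video = np.zeros((32, 256, 256, 3), dtype=np.uint8)
audio = np.zeros(64_000, dtype=np.float32)
print(check_clip(video, audio))  # []
```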

📊 Evaluation Metrics

We use several metrics to evaluate both semantic and temporal alignment:

FVD (Fréchet Video Distance):
FVD is used to assess the quality of generated video clips. It measures the distance between the distributions of real and generated video clips in a high-dimensional feature space obtained from a pre-trained neural network. A lower FVD indicates that the generated videos are more similar to real videos, suggesting higher quality and realism.
FAD (Fréchet Audio Distance):
This metric evaluates the quality of generated audio samples and is commonly used in tasks like speech synthesis or music generation. FAD calculates the distance between the distributions of real and synthesized audio samples in a feature space obtained from a pre-trained model. A lower FAD score indicates that the generated audio is closer to the real audio in terms of quality and characteristics.
AV-Align score:
The metric is based on separately detecting energy peaks (optical flow for video and onset for audio) in both modalities and measuring their alignment. The premise behind this metric is that fast temporal energy changes in the audio signal often correspond to an object movement producing this sound.
CAVP score:
This metric is expected to measure both the semantic and temporal similarity of audio-video pairs, using a pre-trained contrastive audio-video model called CAVP.
ImageBind score:
This metric is used to measure the similarity of text-audio, text-video, and audio-video pairs, using the multimodal contrastive model ImageBind.
LanguageBind score:
This metric is computed in the same way as the ImageBind score, but uses the LanguageBind model instead of ImageBind.
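
FVD and FAD share the same core computation: a Fréchet distance between Gaussians fitted to real and generated features. Below is a minimal NumPy sketch of that computation, assuming the features have already been extracted. The feature extractors themselves are not specified here and are out of scope, and the function name is illustrative rather than part of the official evaluation code.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays shaped (num_samples, feature_dim).
    Computes |mu_r - mu_g|^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr((C_r C_g)^(1/2)) is the sum of the square roots of the eigenvalues
    # of C_r @ C_g; tiny negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))            # stand-in for real-clip features
fake = real + np.array([3.0, 0.0, 0.0, 0.0])  # same spread, shifted mean
print(round(frechet_distance(real, fake), 6))  # 9.0
```

In the usage example the two sets have identical covariance, so the distance reduces to the squared distance between the means.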

Evaluations run on a g5.2xlarge instance with 8 vCPUs, 32 GB RAM, and one NVIDIA A10G GPU. Each prediction must finish within 120 seconds.

🚀 Baseline System

A baseline system is provided to generate 4-second audio-video pairs. The videos are produced at eight frames per second with a resolution of 256x256 pixels, while the audio is sampled at 16 kHz. Make your first submission using the starter-kit. Find the challenge resources over here.

💰 Prizes

Total Prize Pool for Temporal Alignment Track: $17,500

🥇 1st Place: $10,000
🥈 2nd Place: $5,000
🥉 3rd Place: $2,000

More details about the leaderboards and prizes will be announced soon. Please refer to the Challenge Rules for more information on the Open Sourcing criteria required for eligibility for the associated prizes.

🗓 Timeline

The SVG Challenge takes place in two rounds, with an additional warm-up round. The tentative launch dates are:

Warmup Round: 29th Oct 2024
Phase I: 2nd Dec 2024
Phase II: 3rd Jan 2025
Challenge End: 25th Mar 2025

📖 Citing the Challenge

We are preparing a paper on our baseline model and dataset and will make it public soon. Consider citing it if you are participating in this challenge and/or the datasets involved.

Challenge Organizing Committee

Sounding Video Generation – Temporal Alignment Track
Masato Ishii, Shiqi Yang, Takashi Shibuya, Christian Simon, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji (Sony)