AIcrowd | Sounding Video Generation (SVG) Challenge 2024

Phase 1: Completed

Phase 2: Completed

Sony Group Corporation

Problem Statements

Temporal Alignment Track (Weight: 1.0): Generate Videos with Temporal and Semantic Audio Sync
Spatial Alignment Track (Weight: 1.0): Create Videos with Spatially Aligned Stereo Audio

💻 Watch the Sounding Video Generation Challenge Townhall Recording!

‼️Key Updates: Local End-to-End Evaluation Script (Docker) and more.
⏰ The challenge is now live!

💻 Don't know where to start? Check out the starter-kits for Temporal Alignment Track and Spatial Alignment Track.

Generate Synchronized & Contextually Accurate Videos

Welcome to the Sounding Video Generation (SVG) Challenge 2024!

The Sounding Video Generation (SVG) Challenge 2024 is a competition to create AI models that generate videos whose visuals and sound match, such as a dog seen barking at the moment the bark is heard. Participants will work to improve how well sounds and scenes align, with prizes for the best results.

This challenge invites you to build models that generate synchronized and contextually accurate videos. You can showcase your skills and push the boundaries of sounding video generation in two tracks:

Temporal Alignment
Spatial Alignment

📜 Introduction

Video generation research has progressed significantly, with large-scale diffusion models producing realistic videos. However, sounding video generation, which involves well-aligned video and audio modalities, remains underexplored. The SVG Challenge aims to advance this field by providing a platform for benchmarking and showcasing state-of-the-art models.

🎥 The Sounding Video Generation Challenge

Build state-of-the-art AI models to generate videos, ensuring the audio is synchronized and contextually appropriate.

⏰ Temporal Alignment Track

This track aims to generate videos that are temporally and semantically aligned with their corresponding audio. This involves producing high-resolution videos (256x256 pixels, 8fps) with monaural audio (1 channel, 16kHz).

You will tackle two types of alignment:

Semantic Alignment: The audio’s semantic class should match the video. For instance, if the video shows a dog barking, the audio should contain a barking sound.
Temporal Alignment: The audio should be synchronized with the video. For example, the barking sound should occur precisely when the dog is seen barking.
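
To make the timing concrete, here is a minimal sketch of how video frames map onto audio samples at this track's stated rates (8fps video, 16kHz audio). The function name `frame_to_sample_range` is hypothetical and is not part of the starter kit; it only illustrates the arithmetic, under the assumption that frame 0 and audio sample 0 start at the same instant.

```python
def frame_to_sample_range(frame_idx: int, fps: int = 8,
                          sample_rate: int = 16_000) -> tuple[int, int]:
    """Return the half-open [start, end) audio sample range spanned by one frame.

    Assumption: video and audio start at the same instant (t = 0).
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    if sample_rate % fps != 0:
        raise ValueError("sample_rate must be divisible by fps for an exact mapping")
    samples_per_frame = sample_rate // fps  # 16000 / 8 = 2000 samples per frame
    start = frame_idx * samples_per_frame
    return start, start + samples_per_frame


print(frame_to_sample_range(0))  # (0, 2000)
print(frame_to_sample_range(3))  # (6000, 8000)
```

Under these assumptions, a bark visible in frame 3 should be audible somewhere around samples 6000 to 8000, which is 0.375 s to 0.5 s into the clip.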

In this track, submissions will be evaluated on how well the audio and video synchronize over time. For training, participants will use a customised dataset named SVGTA24, derived from the Greatest Hits dataset and supplied with prepared video captions. A baseline model based on AnimateDiff and AudioLDM is provided. Submissions will be tested on a set of text prompts to assess synchronization.

More details are available on the Temporal Alignment Track page.

🌐 Spatial Alignment Track

This track aims to create videos with spatially aligned audio, giving a sense of space and direction. This involves producing high-resolution videos (256x256 pixels, 4fps) with stereo audio (2 channels, 16kHz).
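
The output formats stated for the two tracks can be written down as a small sanity check to run before submitting. The sketch below is illustrative only: the `TrackSpec` class, the `check_clip` function, and the array layouts (video as `(frames, height, width, 3)`, audio as `(channels, samples)`) are assumptions made for this example and are not the official evaluation interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TrackSpec:
    height: int
    width: int
    fps: int
    audio_channels: int
    sample_rate: int


# Values taken from the track descriptions on this page.
TEMPORAL = TrackSpec(height=256, width=256, fps=8, audio_channels=1, sample_rate=16_000)
SPATIAL = TrackSpec(height=256, width=256, fps=4, audio_channels=2, sample_rate=16_000)


def check_clip(video: np.ndarray, audio: np.ndarray, spec: TrackSpec) -> list[str]:
    """Return a list of format problems (empty if the clip matches the spec).

    Assumed layouts: video (frames, H, W, 3), audio (channels, samples).
    """
    problems = []
    if video.ndim != 4 or video.shape[1:3] != (spec.height, spec.width):
        problems.append(f"video must be (frames, {spec.height}, {spec.width}, 3)")
    if audio.ndim != 2 or audio.shape[0] != spec.audio_channels:
        problems.append(f"audio must be ({spec.audio_channels}, samples)")
    if not problems:
        # Only compare durations once both shapes are valid.
        video_seconds = video.shape[0] / spec.fps
        audio_seconds = audio.shape[1] / spec.sample_rate
        if abs(video_seconds - audio_seconds) > 1 / spec.fps:  # one-frame tolerance
            problems.append("video and audio durations differ by more than one frame")
    return problems


# A 2-second stereo clip at 4 fps passes the spatial spec.
video = np.zeros((8, 256, 256, 3), dtype=np.uint8)
audio = np.zeros((2, 32_000), dtype=np.float32)
print(check_clip(video, audio, SPATIAL))  # []
```

The one-frame duration tolerance is an arbitrary choice for this sketch; the track pages and starter kits remain the authority on exact requirements.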

Participants should focus on generating videos where the spatial alignment of the audio enhances the sense of space and direction, ensuring that the audio and video components are well-integrated.

Participants will use a customized dataset named SVGSA24, derived from the STARSS23 dataset: the original videos, recorded with an equirectangular view and Ambisonics audio, have been converted to videos with a perspective view and stereo audio. The content has additionally been curated to focus on on-screen speech and instrument sounds. Participants will train on this dataset and submit systems that generate video together with 2-channel audio signals. A baseline model based on MM-Diffusion is provided. Evaluation will consider how well the generated video and audio align spatially.
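
For intuition about the Ambisonics-to-stereo conversion mentioned above, here is one simple, commonly used approach: decoding first-order Ambisonics to a pair of virtual cardioid microphones pointing left and right. This is only a sketch and is not necessarily how the organizers produced SVGSA24. It assumes a 4-channel signal in ACN channel order (W, Y, Z, X) with SN3D normalization, and it ignores the rotation that would be needed to match a particular perspective view. The function name `foa_to_stereo` is hypothetical.

```python
import numpy as np


def foa_to_stereo(foa: np.ndarray) -> np.ndarray:
    """Decode first-order Ambisonics to stereo with two virtual cardioids.

    Assumptions (not confirmed by the challenge page):
      - foa has shape (4, samples), ACN order W, Y, Z, X, SN3D normalization
      - the listener faces the +X axis, so +Y is to the left
    Returns an array of shape (2, samples): [left, right].
    """
    if foa.ndim != 2 or foa.shape[0] != 4:
        raise ValueError("expected an array of shape (4, samples)")
    w, y = foa[0], foa[1]
    left = 0.5 * (w + y)   # cardioid aimed at +Y (left)
    right = 0.5 * (w - y)  # cardioid aimed at -Y (right)
    return np.stack([left, right])


# A source located fully to the left encodes as W = s, Y = s, Z = 0, X = 0.
s = np.ones(4)
stereo = foa_to_stereo(np.stack([s, s, np.zeros(4), np.zeros(4)]))
print(stereo[0], stereo[1])  # left channel carries the signal, right is silent
```

With this decode, a source on the left appears only in the left channel and a source straight ahead appears equally in both, which is the kind of audio-visual correspondence the spatial track evaluates.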

More details are available on the Spatial Alignment Track page.

🗓 Timeline

The SVG Challenge takes place in two phases, preceded by a warm-up round. The tentative launch dates are:

Warmup Round: 29th Oct 2024
Phase I: 2nd Dec 2024
Phase II: 3rd Jan 2025
Challenge End: 25th Mar 2025

🏆 Prizes

The total prize pool is $35,000, divided between the two tracks. Teams can win prizes across multiple leaderboards.

Track 1: Temporal Alignment ($17,500)

First place: $10,000
Second place: $5,000
Third place: $2,500

Track 2: Spatial Alignment ($17,500)

First place: $10,000
Second place: $5,000
Third place: $2,500

Please refer to the Challenge Rules for more details on the Open Sourcing criteria for eligibility.