About UniAVGen

UniAVGen is a unified framework for high-fidelity joint audio-video generation. It addresses key limitations of existing methods, including poor lip synchronization, insufficient semantic consistency, and limited task generalization. The framework produces temporally aligned audio and video that maintain strong semantic correspondence between the two modalities.

What is UniAVGen?

At its core, UniAVGen adopts a symmetric dual-branch architecture with parallel Diffusion Transformers for audio and video. This design builds a cohesive cross-modal latent space in which audio and video information can interact effectively.

The framework introduces three critical innovations. First, the Asymmetric Cross-Modal Interaction mechanism enables bidirectional temporal alignment between audio and video, ensuring precise spatiotemporal synchronization. Second, Face-Aware Modulation dynamically prioritizes salient facial regions during the interaction process, focusing computational resources on areas that matter most for synchronization. Third, Modality-Aware Classifier-Free Guidance amplifies cross-modal correlation signals during inference, improving the quality of generated content.
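
To make Face-Aware Modulation concrete, here is a minimal PyTorch sketch. It assumes the modulation can be expressed as a learned, per-token gate driven by a face mask that upweights facial regions before cross-modal attention; the class and argument names (FaceAwareModulation, face_mask) are illustrative, not the paper's actual API.

```python
import torch
import torch.nn as nn

class FaceAwareModulation(nn.Module):
    """Illustrative sketch: upweight video tokens inside detected face
    regions before they participate in cross-modal attention."""

    def __init__(self, dim: int):
        super().__init__()
        # Learned gate mapping the face-mask signal to a per-token scale.
        self.gate = nn.Sequential(nn.Linear(1, dim), nn.Sigmoid())

    def forward(self, video_tokens: torch.Tensor, face_mask: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, tokens, dim); face_mask: (batch, tokens),
        # 1.0 inside facial regions, 0.0 elsewhere (e.g., from a face detector).
        scale = self.gate(face_mask.unsqueeze(-1))  # (batch, tokens, dim)
        return video_tokens * (1.0 + scale)         # emphasize face tokens

# Toy usage: 2 clips, 16 tokens, 64-dim features.
mod = FaceAwareModulation(dim=64)
tokens = torch.randn(2, 16, 64)
mask = torch.zeros(2, 16)
mask[:, :4] = 1.0  # pretend the first 4 tokens cover the face
print(mod(tokens, mask).shape)  # torch.Size([2, 16, 64])
```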

UniAVGen also achieves superior performance with far fewer training samples: where some approaches require over 30 million samples, UniAVGen delivers better results with just 1.3 million. This efficiency makes the framework more accessible and practical for real-world applications.

Key Features

  • Asymmetric Cross-Modal Interaction: Enables bidirectional temporal alignment between audio and video for precise synchronization.
  • Face-Aware Modulation: Dynamically prioritizes salient facial regions during cross-modal interactions for better lip synchronization.
  • Modality-Aware Classifier-Free Guidance: Amplifies cross-modal correlation signals during inference to improve generation quality (see the sketch after this list).
  • Dual-Branch Architecture: Uses parallel Diffusion Transformers for balanced audio and video processing.
  • Multi-Task Capability: Supports multiple audio-video generation tasks within a single unified framework.
  • Training Efficiency: Achieves superior performance with significantly fewer training samples compared to other methods.
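
The guidance formula below is one plausible reading of Modality-Aware Classifier-Free Guidance, not the paper's exact formulation: on top of standard classifier-free guidance, a separate scale amplifies the term contributed by the cross-modal condition. The function name, argument names, and default scales are all illustrative.

```python
import torch

def modality_aware_cfg(eps_uncond, eps_cond, eps_cond_with_audio,
                       text_scale: float = 5.0, cross_modal_scale: float = 2.0):
    """Illustrative guidance combination for the video branch.

    eps_uncond:          predicted noise with all conditions dropped
    eps_cond:            predicted noise with text/image conditions only
    eps_cond_with_audio: predicted noise with the audio branch also attended
    cross_modal_scale separately amplifies the audio-video correlation term.
    """
    return (eps_uncond
            + text_scale * (eps_cond - eps_uncond)
            + cross_modal_scale * (eps_cond_with_audio - eps_cond))

# Toy usage with random noise predictions of matching shape.
shape = (1, 4, 8, 8)
e0, e1, e2 = torch.randn(shape), torch.randn(shape), torch.randn(shape)
print(modality_aware_cfg(e0, e1, e2).shape)  # torch.Size([1, 4, 8, 8])
```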

Multi-Task Capabilities

UniAVGen supports five main tasks within a single unified framework (a hypothetical dispatch sketch follows the list):

  1. Joint Audio-Video Generation: Creates synchronized audio and video from a reference image, a video caption, and speech content.
  2. Joint Generation with Reference Audio: Generates aligned audio-video, with the speech timbre matching a reference audio clip.
  3. Joint Audio-Video Continuation: Extends existing audio and video content while preserving temporal consistency.
  4. Video-to-Audio Dubbing: Generates audio that aligns with video emotions and expressions.
  5. Audio-Driven Video Synthesis: Creates video with expressions and motions aligned to the driving audio.
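
The list above maps naturally onto a single model with task-dependent conditioning. The sketch below is hypothetical, showing only how the five tasks could be expressed as modes that toggle which modality is denoised and which inputs are required; none of these names come from the UniAVGen codebase.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Task(Enum):
    JOINT_GENERATION = auto()        # reference image + caption + speech text
    JOINT_WITH_REF_AUDIO = auto()    # adds a reference audio clip for timbre
    JOINT_CONTINUATION = auto()      # extends existing audio-video content
    VIDEO_TO_AUDIO_DUBBING = auto()  # audio generated, video held fixed
    AUDIO_DRIVEN_VIDEO = auto()      # video generated, audio held fixed

@dataclass
class Conditions:
    generate_audio: bool
    generate_video: bool
    needs_reference_audio: bool = False
    needs_prefix_clip: bool = False

def conditions_for(task: Task) -> Conditions:
    """Map each task to which modality is denoised and which inputs it needs."""
    if task is Task.JOINT_GENERATION:
        return Conditions(generate_audio=True, generate_video=True)
    if task is Task.JOINT_WITH_REF_AUDIO:
        return Conditions(generate_audio=True, generate_video=True,
                          needs_reference_audio=True)
    if task is Task.JOINT_CONTINUATION:
        return Conditions(generate_audio=True, generate_video=True,
                          needs_prefix_clip=True)
    if task is Task.VIDEO_TO_AUDIO_DUBBING:
        return Conditions(generate_audio=True, generate_video=False)
    return Conditions(generate_audio=False, generate_video=True)

print(conditions_for(Task.VIDEO_TO_AUDIO_DUBBING))
```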

Technical Architecture

UniAVGen's architecture consists of two parallel Diffusion Transformers, one for audio and one for video. These transformers work together to create a unified representation space where both modalities can interact. The dual-branch design ensures that neither audio nor video dominates the generation process, maintaining balance throughout.
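
As a rough illustration of the dual-branch layout, the PyTorch sketch below runs two parallel transformer stacks, one per modality, over a shared hidden size, standing in for the paper's Diffusion Transformers. Timestep conditioning, patchification, and the actual DiT block design are omitted; all class names are illustrative.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One modality branch: a stack of standard transformer blocks,
    standing in for a Diffusion Transformer (DiT)."""

    def __init__(self, dim: int = 64, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)

class DualBranchModel(nn.Module):
    """Parallel audio and video branches over a shared hidden size, so both
    token streams live in one latent space and neither modality dominates."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.audio_branch = Branch(dim)
        self.video_branch = Branch(dim)

    def forward(self, audio_tokens, video_tokens):
        return self.audio_branch(audio_tokens), self.video_branch(video_tokens)

model = DualBranchModel()
audio = torch.randn(2, 100, 64)  # e.g., latent audio frames
video = torch.randn(2, 32, 64)   # e.g., latent video patches over time
a_out, v_out = model(audio, video)
print(a_out.shape, v_out.shape)
```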

The Asymmetric Cross-Modal Interaction mechanism is the heart of UniAVGen's synchronization capabilities. It lets the two modalities influence each other bidirectionally: audio can guide video generation, and video can guide audio generation. The interaction operates at multiple temporal scales, so both short-term and long-term synchronization are maintained.
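
One way to picture this interaction is bidirectional cross-attention with a temporal locality constraint, sketched below. The banded mask, the normalized-timeline alignment, and all names are assumptions made for illustration; the paper's actual asymmetric design may differ.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """Each modality queries the other. A banded attention mask limits each
    query to a local temporal window in the other stream, so tokens attend
    to roughly co-occurring content."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def window_mask(q_len: int, k_len: int, window: float) -> torch.Tensor:
        # True marks positions to BLOCK. Query/key indices are mapped onto a
        # common normalized timeline before comparing temporal distances.
        q_t = torch.arange(q_len).float() / max(q_len - 1, 1)
        k_t = torch.arange(k_len).float() / max(k_len - 1, 1)
        return (q_t[:, None] - k_t[None, :]).abs() > window

    def forward(self, audio, video, window: float = 0.1):
        a_mask = self.window_mask(audio.size(1), video.size(1), window)
        v_mask = self.window_mask(video.size(1), audio.size(1), window)
        # Both directions read from the pre-update streams for symmetry.
        a_upd = self.a2v(audio, video, video, attn_mask=a_mask)[0]
        v_upd = self.v2a(video, audio, audio, attn_mask=v_mask)[0]
        return audio + a_upd, video + v_upd

layer = BidirectionalCrossModalAttention()
a, v = torch.randn(2, 100, 64), torch.randn(2, 32, 64)
a2, v2 = layer(a, v)
print(a2.shape, v2.shape)  # torch.Size([2, 100, 64]) torch.Size([2, 32, 64])
```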

Note: This is an unofficial about page for UniAVGen. For the most accurate information, please refer to official documentation and research papers.