Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS), are conventionally addressed as separate tasks, with limited exploration of unifying them within a single framework. Existing attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS into a unified flow-matching framework. VSSFlow employs a novel condition aggregation mechanism to handle the distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. VSSFlow therefore leverages these inductive biases to handle the different representations effectively: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning of sound and speech generation without additional designs on training stages. Detailed analysis attributes this to the learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
- †Renmin University of China
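
For illustration, below is a minimal PyTorch sketch of the condition-aggregation idea described in the abstract: video features are injected through cross-attention, while transcript tokens are concatenated with the audio latents and mixed in by self-attention. All module names, dimensions, and exact injection points are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of a condition-aggregation block in a flow-matching backbone.
# Names, dimensions, and injection points are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionAggregationBlock(nn.Module):
    """One transformer block that takes two heterogeneous conditions:
    - video features (ambiguous condition) via cross-attention;
    - transcript tokens (deterministic condition) concatenated with the
      audio latents and processed by self-attention.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, audio_tokens, transcript_tokens, video_tokens):
        # Self-attention over the concatenation of audio latents and
        # transcript tokens: the transcript condition is mixed in-sequence.
        n_audio = audio_tokens.size(1)
        x = torch.cat([audio_tokens, transcript_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Keep only the audio positions; transcript tokens served as context.
        x = x[:, :n_audio]
        # Cross-attention pulls in the video condition.
        x = x + self.cross_attn(self.norm2(x), video_tokens, video_tokens)[0]
        return x + self.ffn(self.norm3(x))

# Usage with made-up shapes: audio latents (B, Ta, D),
# transcript embeddings (B, Tt, D), video features (B, Tv, D).
block = ConditionAggregationBlock()
audio = torch.randn(2, 100, 512)
text = torch.randn(2, 20, 512)
video = torch.randn(2, 32, 512)
out = block(audio, text, video)  # (2, 100, 512)
```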







