1 UNC Chapel Hill
2 Microsoft
* Work done during an internship at Microsoft
Key Challenges in Joint Audio-Video Editing. Existing methods primarily focus on zero-shot text-to-video or text-to-audio editing separately. Editing only the video or only the audio often leads to coherence and synchronization issues between the two modalities. As highlighted in the red circle, the motion or presence of sounding objects may not align with the corresponding audio. Additionally, edited content may exhibit audio artifacts along the temporal dimension (shown in the purple squares). These factors make the edited results feel less natural and cohesive. In contrast, our AvED jointly edits audio and video, leveraging cross-modal information as additional supervision to improve editing quality and alleviate synchronization issues.

All in one

Demo Videos

Source prompt:

machine gun

Target prompt:

laser gun

Input Video

ControlVideo+ZEUS

TokenFlow+ZEUS

RAVE+ZEUS

AvED (Ours)

Source prompt:

a dog is howling

Target prompt:

a lion is roaring

Source prompt:

a race car is driving on the road

Target prompt:

a police car is driving on the road

Source prompt:

firework lit up in the night sky

Target prompt:

a splash of water lit up in the night sky

BibTeX


@article{lin2025zeroshot,
  title={Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising},
  author={Yan-Bo Lin and Kevin Lin and Zhengyuan Yang and Linjie Li and Jianfeng Wang and Chung-Ching Lin and Xiaofei Wang and Gedas Bertasius and Lijuan Wang},
  year={2025},
  journal={arXiv preprint arXiv:2503.20782}
}