Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Yan-Bo Lin¹*

Kevin Lin²

Zhengyuan Yang²

Linjie Li²

Jianfeng Wang²

Chung-Ching Lin²

Xiaofei Wang²

Gedas Bertasius¹

Lijuan Wang²

¹ UNC Chapel Hill
² Microsoft
* Work done during an internship at Microsoft


**Key Challenges in Joint Audio-Video Editing.** Existing methods primarily address zero-shot text-to-video editing or text-to-audio editing separately. Editing only the video or only the audio often leads to coherence and synchronization issues between the two modalities. As highlighted in the red circle, the motion or presence of sounding objects may not align with the corresponding audio. Additionally, edited content may exhibit audio artifacts along the temporal dimension (shown in the purple squares). These factors make the edited results feel less natural and cohesive. In contrast, our AvED jointly edits audio and video, leveraging cross-modal information as additional supervision to improve editing quality and alleviate synchronization issues.

All in one

Demo Videos

Source prompt:

machine gun

↓

Target prompt:

laser gun

Input Video

ControlVideo+ZEUS

TokenFlow+ZEUS

RAVE+ZEUS

AvED (Ours)

Source prompt:

a dog is howling

↓

Target prompt:

a lion is roaring

Source prompt:

a race car is driving on the road

↓

Target prompt:

a police car is driving on the road

Source prompt:

firework lit up in the night sky

↓

Target prompt:

a splash of water lit up in the night sky

BibTeX


@article{lin2025zeroshot,
  title={Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising},
  author={Yan-Bo Lin and Kevin Lin and Zhengyuan Yang and Linjie Li and Jianfeng Wang and Chung-Ching Lin and Xiaofei Wang and Gedas Bertasius and Lijuan Wang},
  year={2025},
  journal={arXiv preprint arXiv:2503.20782},
}