Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
Google DeepMind
Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes that can delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., “enhance Door”) and a graphical representation of the event timing derived from an “event roll” transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic input/desired-output audio examples formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions – action, class, timing. We believe our work demonstrates that “recomposition” is an important and practical application.
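As a rough, purely illustrative sketch of the synthetic training data described above (the function and parameter names below are hypothetical, not the paper's actual pipeline), one input/desired-output pair for a "delete" edit could be assembled by mixing an isolated foreground event into a real-world background:

```python
import numpy as np

def make_delete_example(background: np.ndarray,
                        event: np.ndarray,
                        class_label: str,
                        onset_s: float,
                        sample_rate: int = 16000,
                        gain: float = 0.5):
    """Sketch of one synthetic training pair for a "delete" edit.

    The input scene is the background with an isolated event mixed in;
    the desired output is the background alone, so a model trained on
    such pairs learns to remove the event named in the edit instruction.
    All names and values here are illustrative assumptions.
    """
    onset = int(onset_s * sample_rate)
    scene = background.copy()
    end = min(len(scene), onset + len(event))
    scene[onset:end] += gain * event[: end - onset]

    offset_s = onset_s + len(event) / sample_rate
    instruction = f"delete {class_label}: {onset_s:.1f}s-{offset_s:.1f}s"
    # (input audio, desired output audio, textual edit description)
    return scene, background, instruction
```

"Insert" and "enhance" edits could be formed analogously by swapping which mixture serves as input and which as target, or by changing the event gain between input and target.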
The table below provides results from our full system and ablated systems for examples with backgrounds from YFCC100M and foreground events from FSD50K.
| | Example 1 | Example 2 | Example 3 | Example 4 | Example 5 | Example 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Edit Instructions | enhance Zipper (clothing): 8.7s-9.2s; delete Sound effect: 1.1s-1.5s | delete Laughter: 0.1s-1.7s | insert Cat: 8.7s-9.2s; insert Digestive: 0.1s-1.7s | delete Generic impact sounds: 5.7s-5.9s; delete Bird: 0.2s-2.0s | enhance Bird: 0.2s-2.0s | enhance Bird: 0.7s-1.6s; delete Breathing: 2.7s-4.4s |
| Input Audio | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Reference Output | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Full System (T/A/C) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Ablation T/A/- | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Ablation T/-/C | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Ablation T/-/- | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Ablation -/A/C | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Ablation -/-/- | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
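The edit instructions above follow a simple "action Class: start-end" textual pattern, and the T/A/C row labels appear to abbreviate the timing, action, and class parts of the edit description (a "-" marks the ablated part). As a small, purely illustrative sketch (the exact textual format and this parsing code are assumptions, not part of the paper's interface), such instructions could be split into those three fields as follows:

```python
import re

# Illustrative parser for instructions like "enhance Bird: 0.2s-2.0s".
# The pattern and field names are assumptions based on the examples above.
PATTERN = re.compile(r"(?P<action>delete|insert|enhance)\s+(?P<label>.+?):\s*"
                     r"(?P<start>[\d.]+)s-(?P<end>[\d.]+)s")

def parse_edit(text: str):
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized edit instruction: {text!r}")
    return {
        "action": m["action"],                            # A: delete / insert / enhance
        "class": m["label"],                              # C: sound class label
        "timing": (float(m["start"]), float(m["end"])),   # T: start/end in seconds
    }

print(parse_edit("enhance Zipper (clothing): 8.7s-9.2s"))
# {'action': 'enhance', 'class': 'Zipper (clothing)', 'timing': (8.7, 9.2)}
```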