GANSynth: Adversarial Neural Audio Synthesis

Online Supplement

Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, Adam Roberts

Google AI

Related Material

Colab Notebook
ICLR 2019 Paper
Github Code

Overview
Baseline Comparisons
Interpolations
Consistent Timbre
Model Comparisons

Overview

GANSynth learns to produce individual instrument notes like the NSynth Dataset. With pitch provided as a conditional attribute, the generator learns to use its latent space to represent different instrument timbres. This allows us to synthesize performances from MIDI files, either keeping the timbre constant, or interpolating between instruments over time.

Consistent Timbre	Interpolation

Bach's Prelude Suite No. 1 in G major MIDI

Bach's Prelude Suite No. 1 in G major — Constant-Q Transform (CQT) spectrograms of the two audio clips above. Single latent vector corresponds to holding the timbre constant, as shown by the consistent spectrogram shapes for each note, while interpolation changes the timbre over time. The arrows mark the continuation of the spectrogram.

Baseline Comparisons

We compare our best performing GANSynth models across a range of pitches with real samples and a pitch-conditional WaveNet and WaveGAN baselines. While the baselines are state-of-the-art, they have high bias and fail to capture the diversity of pitches and timbres in the dataset, while GANSynth produces high quality samples similar to the real data. To help qualitative evalution, we show models only trained on the subset of acoustic instruments. Samples were hand selected to try and best reflect the diversity and quality of samples from each model. Quantitative comparisons can be found in the paper.

Real Data	GANSynth	WaveNet	WaveGAN
Pitch 36



Pitch 48



Pitch 60



Pitch 72



Pitch 84

Interpolations

We compare interpolations for GANSynth with a WaveNet Autoencoder from the original NSynth paper. Inital and target timbres are chosen from GANSynth samples because it lacks an encoder. GANSynth conditions on a single global latent vector, while the WaveNet AE uses a temporally-distributed latent code. This leads the GANSynth interpolations to all sound like reasonable instruments, as interpolations were seen during training, while the WaveNet AE wanders off the data manifold by mixing in time, leading to unrealistic sounds.

Initial Timbre	Target Timbre	Initial Timbre	Target Timbre
Example 1		Example 2

GANSynth	WaveNet AE	GANSynth	WaveNet AE

Consistent Timbre

Since GANSynth uses global latent and pitch conditioning, it is possible to hold the latent vector fixed and maintain consistent timbre across a large range of pitches.

Example 1	Example 2	Example 3	Example 4

Model Comparisons

We generate audio using image-style GAN generators and discriminators. This approach works better for some audio representations than others. We experiment with mel scaling for spectrograms (Mel) instead of linear scaling, instantaneous frequency (IF) instead of raw phase (Phase), and increased frequency resolution (H) of the spectrograms. Quantiatively, each modification helps with the quality and diversity of genearted outputs, with instantaneous frequency helping the most for the highly periodic waveforms of musical instruments.

IF + Mel + H	IF + Mel	IF	Phase
Pitch 36



Pitch 48



Pitch 60



Pitch 72



Pitch 84