Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Online Supplement

Related Material

Main Paper on arXiv
Dataset Download
Blog Post

Synthesis Samples
Music Transformer Samples
Listening Test Samples
Alteration Samples
Year Conditioning Samples
Listening Study Results

Synthesis Samples

Samples from the MAESTRO test split (left) re-synthesized by the WaveNet model trained on MAESTRO (center) and basic MIDI synthesis (right).

Real Audio	WaveNet Synthesis	Other Synthesis
Domenico Scarlatti - Sonata in B Minor, K. 87

Franz Schubert - Moments Musicaux Op. 94 No. 3 in F-sharp Minor

Frédéric Chopin - Mazurka in D Major, Op. 33, No. 2

Music Transformer Samples

These are 1800-step samples from the Music Transformer model, synthesized by the WaveNet model trained on MAESTRO (left) and basic MIDI synthesis (right).

	WaveNet Synthesis	Other Synthesis
Sample 1
Sample 2
Sample 3
Sample 4

20-second Listening Test Samples (section 7)

Ground Truth Recordings

Random samples from MAESTRO.

WaveNet Unconditioned

Clips generated by the WaveNet model trained with audio from MAESTRO with no conditioning.

WaveNet Ground/Test

Clips generated by the WaveNet model trained with audio/MIDI pairs from the MAESTRO training and validation splits, conditioned on random 20-second MIDI subsequences from the MAESTRO test split.

WaveNet Transcribed/Test

Clips generated by the WaveNet model trained with audio and transcribed MIDI from MAESTRO-T (see section 4), conditioned on random 20-second subsequences from the MAESTRO test split.

WaveNet Transcribed/Transformer

Clips generated by the WaveNet model trained with audio and transcribed MIDI from MAESTRO-T (see section 4), conditioned on random 20-second subsequences from the Music Transformer model described in section 5 that was trained on MAESTRO-T.

Alteration Samples

As a fun side effect, we are also able to alter performances and resynthesize them with a different, more natural sound than traditional signal processing techniques produce. The audio alterations were performed with Ableton Live 10 in "Complex" mode. All samples are from Prelude and Fugue in A Minor, WTC I, BWV 865 by Johann Sebastian Bach.

Original Audio	MIDI alteration, WaveNet Synthesis	Audio alteration

Shift up by 1 octave

Shift down by 1 octave

Reduce tempo by 50%

Increase tempo by 100%
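
The MIDI-side alterations listed above amount to simple arithmetic on note events: an octave is 12 semitones, and a tempo change rescales every note time. The sketch below illustrates this under stated assumptions. The `Note` class and both helper functions are hypothetical names introduced here, not code from the paper, and the WaveNet resynthesis step is not shown.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Note:
    """Hypothetical note event; not a type from the paper's code."""
    pitch: int      # MIDI pitch number, 0-127
    start: float    # onset time in seconds
    end: float      # release time in seconds
    velocity: int   # MIDI velocity, 0-127


def transpose(notes: List[Note], semitones: int) -> List[Note]:
    """Shift every note by `semitones`, dropping notes that leave the MIDI range."""
    shifted = [replace(n, pitch=n.pitch + semitones) for n in notes]
    return [n for n in shifted if 0 <= n.pitch <= 127]


def scale_tempo(notes: List[Note], factor: float) -> List[Note]:
    """Play `factor` times as fast: 0.5 reduces tempo by 50%, 2.0 increases it by 100%."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [replace(n, start=n.start / factor, end=n.end / factor) for n in notes]


# Two made-up notes (A4 then C5), purely for illustration.
notes = [Note(69, 0.0, 0.5, 80), Note(72, 0.5, 1.0, 72)]

octave_up = transpose(notes, 12)       # shift up by 1 octave
octave_down = transpose(notes, -12)    # shift down by 1 octave
half_tempo = scale_tempo(notes, 0.5)   # reduce tempo by 50%
double_tempo = scale_tempo(notes, 2.0) # increase tempo by 100%
```

Because the alteration happens on symbolic notes rather than on the waveform, the synthesizer renders each altered note from scratch instead of stretching or pitch-shifting recorded audio.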

Year Conditioning Samples

We find that longer samples often have timbral shifts due to variation in recording settings in the ground truth data. By training with a conditioning signal for the year of the recording, we can force the model to generate with a single timbre over long time scales.

For example, this WaveNet synthesis of Prelude and Fugue in A Minor, WTC I, BWV 865 by Johann Sebastian Bach includes a timbral shift at time 0:34:

Here we synthesize this same score with different year conditionings:

Year	Audio
2004
2006
2008
2009
2011
2013
2014
2015
2017
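
This page does not describe how the year signal is fed to the model. As a rough illustration only, one common way to represent a categorical value such as the recording year is a one-hot vector over the years listed above. The function name and the one-hot choice are assumptions made for this sketch, not the paper's confirmed method.

```python
from typing import List

# The recording years listed in the table above.
YEARS = [2004, 2006, 2008, 2009, 2011, 2013, 2014, 2015, 2017]


def year_one_hot(year: int) -> List[float]:
    """Return a one-hot vector for `year` (assumed encoding, for illustration)."""
    if year not in YEARS:
        raise ValueError(f"no recordings listed for year {year}")
    return [1.0 if y == year else 0.0 for y in YEARS]


# Holding this vector fixed for a whole piece keeps the conditioning constant.
conditioning = year_one_hot(2008)
```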

Listening Study Results

The full anonymized listening study results are available in CSV form: listening_study_anon.csv.

Within this data, the models from the paper have the following identifiers:

Model Name	Identifier
WaveNet Unconditioned	unconditioned
WaveNet Transcribed/Transformer	transformer_xs
WaveNet Transcribed/Test	test_xs
WaveNet Ground/Test	test
Ground Truth Recordings	validation
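
When working with the CSV, it may help to translate identifiers back into the model names used in the paper. The sketch below does this with the mapping from the table above. The column layout of listening_study_anon.csv is not described here, so the column name is left as a parameter, and the toy data uses a made-up column called `model` purely for illustration.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable

# Identifier-to-name mapping, copied from the table above.
MODEL_NAMES = {
    "unconditioned": "WaveNet Unconditioned",
    "transformer_xs": "WaveNet Transcribed/Transformer",
    "test_xs": "WaveNet Transcribed/Test",
    "test": "WaveNet Ground/Test",
    "validation": "Ground Truth Recordings",
}


def count_models(rows: Iterable[Dict[str, str]], column: str) -> Counter:
    """Count how often each model appears in `column`, using paper names."""
    counts = Counter()
    for row in rows:
        identifier = row[column]
        # Unknown identifiers are kept as-is rather than dropped.
        counts[MODEL_NAMES.get(identifier, identifier)] += 1
    return counts


# Toy data with a hypothetical column name; NOT the real file layout.
toy_csv = "model\ntest\ntest_xs\ntest\nvalidation\n"
counts = count_models(csv.DictReader(io.StringIO(toy_csv)), column="model")
```

For the real file, open listening_study_anon.csv with `csv.DictReader`, inspect its header row, and pass whichever column holds the identifiers.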
