General-purpose, long-context autoregressive modeling with Perceiver AR

Online Supplement

Related Material

Main Paper on arXiv


Transcribed Piano Performances Dataset

To showcase the long-term coherence of Perceiver AR, we use a large dataset of symbolic piano performances. These were transcribed from 10,000+ hours of audio using a variation of the Onsets and Frames model. From this dataset, we used only pieces whose tokenization resulted in 1,024–32,768 tokens. Samples shorter than the lower limit are unlikely to contain song content; at the other end, there are only about 200 pieces longer than 32,768 tokens. The same tokenization as for MAESTRO is used. We train a model with 1024 latents and 24 self-attention layers on input contexts of 32,768 tokens, achieving a negative log-likelihood of 1.2418 on the test set.
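The length filter described above can be sketched as follows (a minimal illustration; the function and variable names are hypothetical, not from the released code, and pieces are assumed to be pre-tokenized lists of event IDs):

```python
# Hypothetical sketch of the dataset length filter described above.
MIN_TOKENS = 1024    # shorter pieces are unlikely to contain song content
MAX_TOKENS = 32768   # only ~200 pieces in the dataset exceed this length

def filter_pieces(pieces):
    """Keep only pieces whose token count falls in [MIN_TOKENS, MAX_TOKENS]."""
    return [p for p in pieces if MIN_TOKENS <= len(p) <= MAX_TOKENS]

# Example: three pieces of different lengths; only the middle one survives.
pieces = [[0] * 500, [0] * 2048, [0] * 40000]
kept = filter_pieces(pieces)
print([len(p) for p in kept])  # → [2048]
```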

The samples obtained from this large dataset exhibit stylistic and structural coherence that spans several minutes, containing repeating musical themes, chord patterns, arpeggios and even ritardandos.

Playlist of 8 synthesized example model outputs:

MAESTRO v3 Symbolic

Next we show results from models trained on the MAESTRO v3 dataset. The first model was trained on a symbolic representation of the MIDI data. As above, the symbolic music generated with the model was synthesized using Fluidsynth.

MAESTRO v3 Audio

We also trained models on the audio recordings within MAESTRO encoded with the SoundStream codec at 12, 18, and 22kbps.

32k context

First, we trained a model using a context of 32,768 tokens. Samples at 12kbps show coherence over 27 seconds of audio, while samples at 22kbps demonstrate 14 seconds of high-quality audio.


65k context

Next, we trained a model using a context of 65,536 tokens. Samples at 12kbps show coherence over 54 seconds of audio, while samples at 22kbps demonstrate 28 seconds of high-quality audio.
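These durations follow from a fixed token rate per bitrate: doubling the context from 32,768 to 65,536 tokens roughly doubles the audio length. A back-of-the-envelope check (the token rates below are inferred from the reported durations, not taken from the SoundStream specification):

```python
# Approximate token rates implied by the 32k-context results above
# (inferred, not official SoundStream figures).
rate_12kbps = 32768 / 27   # ≈ 1214 tokens/s
rate_22kbps = 32768 / 14   # ≈ 2341 tokens/s

def duration_seconds(context_tokens, tokens_per_second):
    """Audio length that fits in a given model context."""
    return context_tokens / tokens_per_second

# Doubling the context should roughly double the duration.
print(round(duration_seconds(65536, rate_12kbps)))  # → 54
print(round(duration_seconds(65536, rate_22kbps)))  # → 28
```

Higher bitrates spend more tokens per second of audio, so for a fixed context length they trade audio duration for fidelity.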

Because recordings within the MAESTRO dataset were made with varying microphone placement, some samples end up sounding like recordings made with a very close mic placement (e.g., 22kbps sample #2) or a very ambient mic placement (e.g., 22kbps sample #1).

This effect has been observed before in the Midi2Wave synthesis model. However, unlike Midi2Wave's WaveNet architecture, our model can attend to the full sequence, so once a "mic placement" has been selected, it remains consistent throughout the sequence without any additional conditioning.


Reconstruction Reference

For a sound quality reference, here are some examples from the MAESTRO training dataset reconstructed using the same SoundStream codec (no Perceiver AR inference).

12kbps (reconstruction)
18kbps (reconstruction)
22kbps (reconstruction)