MT3: Multi-Task Multitrack Music Transcription

Online Supplement

Main Paper

https://openreview.net/pdf?id=iMSjopcOn0p

Contents

Overview
Model Inputs and Outputs
Transcription Results
Baseline Comparison
In-the-Wild Transcriptions

Overview

In this paper, we propose a new framework for Multi-Task Multitrack Music Transcription, along with a model to achieve state-of-the-art performance on this task; we refer to both the task and the model as MT3. Here, we present detailed example results for our model. For a complete overview of the system, refer to the paper linked above.

Model Inputs and Outputs

The model operates on log Mel spectrogram inputs and generates output sequences in a MIDI-like vocabulary. This vocabulary can be deterministically decoded to a piano roll representation, or resynthesized to reconstruct the original audio.
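As a minimal sketch of the deterministic decoding step, consider a simplified event vocabulary with time-shift, note-on, and note-off events. This is an illustrative stand-in, not the exact MT3 vocabulary (which is specified in the paper); the frame count and event names below are assumptions made for the example:

```python
import numpy as np

# Hypothetical MIDI-like event vocabulary (illustration only):
#   ("shift", n)    advance the current time by n frames
#   ("on", pitch)   note onset at the current time
#   ("off", pitch)  note offset at the current time

def decode_to_piano_roll(events, n_pitches=128, n_frames=100):
    """Deterministically decode an event sequence into a binary piano roll."""
    roll = np.zeros((n_pitches, n_frames), dtype=np.int8)
    t = 0
    active = {}  # pitch -> onset frame of the currently sounding note
    for kind, value in events:
        if kind == "shift":
            t += value
        elif kind == "on":
            active[value] = t
        elif kind == "off" and value in active:
            roll[value, active.pop(value):t] = 1
    # close out any notes still sounding at the end of the clip
    for pitch, onset in active.items():
        roll[pitch, onset:] = 1
    return roll

events = [("on", 60), ("shift", 10), ("off", 60),
          ("on", 64), ("shift", 5), ("off", 64)]
roll = decode_to_piano_roll(events)  # C4 for 10 frames, then E4 for 5 frames
```

Because every event has a deterministic effect on the running time and the set of active notes, the same sequence always yields the same piano roll.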

In the results that follow, we translate our vocabulary into this piano roll representation, and then use FluidSynth to render the piano roll to audio.
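The step from piano roll back to discrete notes, before the notes are written to MIDI and rendered with FluidSynth, can be sketched as below. The frame rate and the `(pitch, onset, offset)` note format are illustrative assumptions, not the exact representation used by MT3:

```python
import numpy as np

def roll_to_notes(roll, frame_rate=100):
    """Extract (pitch, onset_sec, offset_sec) notes from a binary piano roll.

    The resulting note list can then be written to a MIDI file (e.g. with a
    library such as pretty_midi) and rendered to audio with FluidSynth.
    """
    notes = []
    for pitch in range(roll.shape[0]):
        # pad with zeros so every onset has a matching offset
        padded = np.concatenate([[0], roll[pitch], [0]])
        diff = np.diff(padded)
        onsets = np.flatnonzero(diff == 1)    # 0 -> 1 transitions
        offsets = np.flatnonzero(diff == -1)  # 1 -> 0 transitions
        for on, off in zip(onsets, offsets):
            notes.append((pitch, on / frame_rate, off / frame_rate))
    return sorted(notes, key=lambda n: n[1])

roll = np.zeros((128, 200), dtype=np.int8)
roll[60, 0:100] = 1   # middle C for one second at 100 frames/sec
roll[64, 50:150] = 1  # overlapping E4
notes = roll_to_notes(roll)
```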

Transcription Results

We evaluate our model across six datasets representing a diverse range of sizes, recording processes, genres, and instrumentations. Details on the datasets used are provided in the paper (see link above). Here we provide example (input, ground truth, MT3) audio triplets for each dataset. Piano rolls for the ground truth and MT3-predicted MIDI are shown, along with the input spectrogram of the original audio.

MAESTRO

[Audio: Original Audio · Ground Truth · MT3]

Slakh2100

[Audio: Original Audio · Ground Truth · MT3]

Cerberus4

[Audio: Original Audio · Ground Truth · MT3]

GuitarSet

[Audio: Original Audio · Ground Truth · MT3]

MusicNet

[Audio: Original Audio · Ground Truth · MT3]

URMP

[Audio: Original Audio · Ground Truth · MT3]

Baseline Comparison

For each dataset, we compare our model to the output of state-of-the-art DSP-based music transcription software, providing 30-second sample clips. Note that the baseline software does not provide instrument labels with its predictions and only predicts pitch; in order to render the audio, we render all baseline notes as piano. The explicit association of each predicted note with an instrument is an advantage of our approach over many existing transcription systems. In our paper, we also report metrics comparing these baselines with state-of-the-art transcription models trained on each dataset.
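A minimal sketch of how such a pitch-only rendering can be produced, assuming the baseline's notes arrive as `(onset, offset, pitch)` tuples (an illustrative format, not the baseline software's actual output schema):

```python
# Since the baseline emits no instrument labels, a like-for-like audio
# rendering assigns every baseline note to General MIDI program 0 (piano).
PIANO = 0  # General MIDI program number for Acoustic Grand Piano

def to_piano_only(notes):
    """Attach program 0 (piano) to instrument-agnostic (onset, offset, pitch) notes."""
    return [(onset, offset, pitch, PIANO) for onset, offset, pitch in notes]

baseline = [(0.0, 0.5, 60), (0.5, 1.0, 67)]
rendered = to_piano_only(baseline)
```

MT3, by contrast, predicts a program for each note directly, so its renderings can use the appropriate instrument per track.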

Cerberus4

[Audio: Original Audio · Ground Truth · Baseline · MT3]

GuitarSet

[Audio: Original Audio · Ground Truth · Baseline · MT3]

MusicNet

[Audio: Original Audio · Ground Truth · Baseline · MT3]

Slakh2100

[Audio: Original Audio · Ground Truth · Baseline · MT3]

URMP

[Audio: Original Audio · Ground Truth · Baseline · MT3]

In-the-Wild Transcriptions

This section provides examples of transcriptions of a diverse set of in-the-wild audio sources. We show the input audio (left) along with the transcription via MT3 and the corresponding piano roll. Note that ground-truth transcriptions for these sources are not available.
[Audio: Original Audio · MT3]