In this paper, we propose a new framework for Multi-Task Multitrack Music Transcription, along with a model to achieve state-of-the-art performance on this task; we refer to both the task and the model as MT3.
Here, we present detailed example results for our model. For a complete overview of the system, refer to the paper linked above.
The model operates on log Mel spectrogram inputs shown above, and generates output sequences in a MIDI-like vocabulary, shown below.
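To make the input representation concrete, here is a minimal numpy-only sketch of computing a log Mel spectrogram from raw audio. The frame size, hop length, and number of Mel bins below are illustrative placeholders, not the model's actual configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Log Mel spectrogram sketch; all parameter values are hypothetical."""
    # Frame and window the signal, then take the magnitude FFT of each frame.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)
    # Build a triangular Mel filterbank spanning 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    # Apply the filterbank and compress with a log (small offset for stability).
    return np.log(spec @ fbank.T + 1e-6)
```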
This vocabulary can be deterministically decoded to a piano roll representation, as shown, or resynthesized to reconstruct the original audio.
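As an illustration of how deterministic decoding can work, the sketch below converts a hypothetical MIDI-like event sequence into a binary piano roll. The event types (`time`, `note_on`, `note_off`) are stand-ins for exposition and are not the model's actual vocabulary.

```python
import numpy as np

def events_to_piano_roll(events, n_frames, n_pitches=128):
    """Decode a (type, value) event list into a (pitch, time) piano roll.

    Hypothetical vocabulary: "time" moves the cursor to an absolute frame,
    "note_on"/"note_off" open and close a note at the given MIDI pitch.
    """
    roll = np.zeros((n_pitches, n_frames), dtype=np.int8)
    t = 0
    active = {}  # pitch -> onset frame for currently sounding notes
    for kind, value in events:
        if kind == "time":
            t = value
        elif kind == "note_on":
            active[value] = t
        elif kind == "note_off" and value in active:
            roll[value, active.pop(value):t] = 1  # fill the note's duration
    return roll
```

Because every event has a single fixed meaning, the mapping from token sequence to piano roll involves no search or ambiguity.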
In the results that follow, we translate our vocabulary into this piano roll representation, and then use FluidSynth to render the piano roll to audio.
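The demo uses FluidSynth with a real soundfont for synthesis; as a self-contained stand-in, the sketch below renders a piano roll with bare sine tones. The sample rate and frames-per-hop values are assumptions for illustration only.

```python
import numpy as np

def render_piano_roll(roll, sr=16000, frame_hop=128):
    """Crude additive-sine rendering of a (128, n_frames) piano roll.

    A minimal stand-in sketch; a soundfont synthesizer such as FluidSynth
    would be used for realistic audio.
    """
    n_samples = roll.shape[1] * frame_hop
    t = np.arange(n_samples) / sr
    audio = np.zeros(n_samples)
    for pitch in np.nonzero(roll.any(axis=1))[0]:
        freq = 440.0 * 2 ** ((pitch - 69) / 12)  # MIDI pitch number -> Hz
        # Gate the sine on/off per frame according to the piano roll.
        gate = np.repeat(roll[pitch], frame_hop).astype(float)
        audio += gate * np.sin(2 * np.pi * freq * t)
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio  # normalize to [-1, 1]
```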
We evaluate our model across six datasets representing a diverse range of sizes, recording processes, genres, and instrumentations. Details on the datasets used are provided in the paper (see link above). Here we provide examples of sets of (input, ground truth, MT3) audio triplets for each dataset. Piano rolls for the ground truth and MT3-predicted MIDI are shown (hover over notes for details), along with the input spectrogram of the original audio.
For each dataset, we compare our model's output to that of a state-of-the-art DSP-based music transcription system, providing 30-second sample clips.
Note that the baseline software does not attach instrument labels to its predictions and only predicts pitch; to synthesize audio, we therefore render all baseline notes as piano. The explicit association of each predicted note with an instrument is an advantage of our approach over many existing transcription systems.
In our paper, we also report quantitative metrics comparing our model against state-of-the-art transcription models trained on each individual dataset.
This section provides examples of transcriptions taken from a diverse set of in-the-wild audio sources. We show the input audio (left) along with MT3's transcription and the corresponding piano roll. Note that ground-truth transcriptions for these sources are not available.