DSTC11: Speech-Aware Dialog Systems Technology Challenge

Overview

This challenge evaluates task-oriented dialog systems end-to-end, from users' spoken utterances to inferred slot values. For ease of comparison with existing literature, the challenge is built on the popular MultiWoz task (version 2.1). The challenge focuses on the dialog state tracking (DST) task, since DST is more affected by the switch from written to spoken input than response generation is.

Modified Dev (dev-dstc11) and Test (test-dstc11) Sets

The dev and test sets have been modified: the original slot values have been replaced with new values. One of the main reasons for this is that the distributions of slot values overlap substantially between these two sets and the training data. As a result, evaluation on the original dev and test sets overestimates system performance, especially for systems that tend to memorize the slot values seen in training. Introducing new slot values also adds an element of surprise, supporting a fairer evaluation on this benchmark.

The categorical slots such as hotel-name, restaurant-name, bus-departure, bus-destination, train-departure, and train-destination were replaced with new values. All time mentions were offset by a constant amount within each dialog.

This modification may impact the expected model performance. For example, on the updated dev-dstc11 set the performance (JGA) of the DST and D3ST models drops by about 38 and 17 points respectively compared to the original dev set, showing by how much the original set overestimated performance.

Model       dev (JGA)   dev-dstc11 (JGA)
DST         58.3        20.1
D3ST-XXL    57.5        40.1

Time format

In spoken conversations, the 24-hr time format is very unnatural. Since crowdworkers will be asked to record user responses as naturally as possible, it was imperative to switch all time references in MultiWoz from the 24-hr to the 12-hr format. For consistency, the training, dev, and test sets were all modified accordingly. This is not expected to impact model performance significantly.
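
For reference, a minimal sketch of this kind of conversion (assuming simple HH:MM strings; the actual pipeline used to prepare the data is not part of this release, and to_12hr is a hypothetical helper):

import re

def to_12hr(text):
    """Rewrite 24-hr times such as '18:30' into a 12-hr form such as '6:30 pm'."""
    def repl(match):
        hour, minute = int(match.group(1)), match.group(2)
        suffix = 'am' if hour < 12 else 'pm'
        hour = hour % 12 or 12
        return f'{hour}:{minute} {suffix}'
    return re.sub(r'\b([01]?\d|2[0-3]):([0-5]\d)\b', repl, text)

print(to_12hr('i need to arrive by 18:30'))  # i need to arrive by 6:30 pm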

Data: Raw PCM, Encoder Outputs, Transcripts and Alignments

One of the aims of this task is to facilitate research in novel models that combine text and speech inputs.

ASR outputs: Not all dialog modeling teams are likely to have easy access to ASR systems. To lower the barrier to entry, we are making a variety of intermediate ASR outputs available. For this purpose, and for ease of reproducibility, we chose to build a strong baseline ASR system on the 33k-hour People's Speech corpus, which is publicly available without any licensing restrictions.

With these considerations, we are making four types of data available.

  1. Raw audio in the standard PCM format, 2 bytes per sample, at a 16 kHz sampling rate.
  2. Audio encoder output from the ASR system, consisting of 512-dimension vectors at a rate of 75 vectors per second.
  3. Transcripts from the ASR system.
  4. Time alignments describe how the recognized words map to the encoder output sequence. Each word or word piece (w:) is followed by the frame (t:) at which it was emitted; the concatenation of word pieces is indicated by (w:_). For example, w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11 w:▁ t:15 w:i t:15 ...
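
As an illustration, the alignment string can be split into (token, frame) pairs with a few lines of Python (a minimal sketch; parse_alignment is a hypothetical helper, and merging word pieces back into words is left out):

def parse_alignment(align):
    """Split 'w:while t:2 w:in t:5 ...' into a list of (token, frame) pairs."""
    fields = align.split()
    tokens = [f[2:] for f in fields if f.startswith('w:')]
    frames = [int(f[2:]) for f in fields if f.startswith('t:')]
    return list(zip(tokens, frames))

print(parse_alignment('w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11'))
# [('while', 2), ('in', 5), ('cam', 8), ('bridge', 11)]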

Inserting user responses back into MultiWoz dialogs

Since the audio and related features are generated only for the user responses, they need to be stitched back into the original dialogs. For this purpose, each audio file is indexed with an explicit identifier of the form -- tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1 -- where tpe is the TTS speaker identifier, line_nr is the line number in an associated text file (see below), dialog_id is the original MultiWoz dialog identifier (the name of the json file), and turn_id is the n-th user response in the dialog.
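
Since these identifiers are plain strings, each one can be split back into its components, e.g. as follows (a minimal sketch assuming the exact 'key: value' layout shown above; parse_group_key is a hypothetical helper):

import re

def parse_group_key(key):
    """Parse 'tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1' into a dict."""
    match = re.match(r'(\w+)_line_nr: (\d+) dialog_id: (\S+) turn_id: (\d+)', key)
    speaker, line_nr, dialog_id, turn_id = match.groups()
    return {'speaker': speaker, 'line_nr': int(line_nr),
            'dialog_id': dialog_id, 'turn_id': int(turn_id)}

print(parse_group_key('tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1'))
# {'speaker': 'tpe', 'line_nr': 4519, 'dialog_id': 'mul0016.json', 'turn_id': 1}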

Challenge Constraints

The participating teams may use any type of model. They are not required to use the ASR outputs provided in this task and are free to use any ASR system at their disposal. However, for ease of comparison across submissions, we require that the dialog components be trained only on the MultiWoz training data.

Note that the final evaluation data will contain disfluencies (e.g., speech repair), reflecting natural conversations.

September 2022 Data Update

Training data (TTS):

Dev data (TTS):

Dev data (Human Verbatim):

Test data (TTS):

Data Format (HDF5)

For interoperability, the data is distributed in the popular HDF5 format, which can be read using the standard h5py Python module.

You can check the integrity of a file using the HDF5 command-line utilities, e.g. h5stat:

h5stat train/tpa/mul0016.hd5

Here is a code snippet for reading the data from an HDF5 file.

import h5py
import numpy as np
import scipy.io.wavfile

data = h5py.File('/tmp/mul0016.hd5', 'r')
group = list(data.keys())[0]  # Iterate over keys() for all user turns in the dialog
print(group)  # 'tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1'
# Extracting vectors for each user turn (group)
print(data[group]['audio'])  # <HDF5 dataset "audio": shape (72264,), type "<i2">
print(data[group]['feat'])  # <HDF5 dataset "feat": shape (60, 512), type "<f4">
audio_pcm = np.array(data[group]['audio'])
enc_output = np.array(data[group]['feat'])
print(data[group].attrs['hyp'])  # while in cambridge i need a hotel that ...
print(data[group].attrs['align'])  # w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11 w:▁ t:15 w:i t:15 ...
scipy.io.wavfile.write('/tmp/ex.wav', 16000, audio_pcm)

Evaluation Metrics

The performance of the submitted outputs will be evaluated using Joint Goal Accuracy (JGA) as the primary metric, computed with the standard MultiWoz evaluation script. In addition, Slot Error Rate (SER) will be used as a secondary metric to avoid excessive influence of the early turns in the dialog.
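
For intuition, JGA counts a turn as correct only if the full predicted dialog state matches the reference exactly; a minimal illustrative sketch (the official scoring uses the standard MultiWoz evaluation script, not this code):

def joint_goal_accuracy(predicted_states, reference_states):
    """Fraction of turns whose full predicted slot-value dict matches the reference."""
    correct = sum(pred == ref for pred, ref in zip(predicted_states, reference_states))
    return correct / len(reference_states)

refs = [{'hotel-area': 'north'}, {'hotel-area': 'north', 'hotel-stars': '4'}]
preds = [{'hotel-area': 'north'}, {'hotel-area': 'north', 'hotel-stars': '3'}]
print(joint_goal_accuracy(preds, refs))  # 0.5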

Baseline Performance

In the table below, we report the performance of a baseline D3ST (XXL) model trained on the original written text and evaluated on the dev sets described above.

Dev set          WER     JGA    SER
tts-verbatim     8.1%    26.3   27.5%
human-verbatim   11.9%   22.6   31.6%

The ASR word error rate degrades by about 4% absolute going from tts-verbatim to human-verbatim. A similar drop is observed in the dialog model, with about 3.5 and 4 point degradations in JGA and SER (Slot Error Rate) respectively.