DSTC11: Speech-Aware Dialog Systems Technology Challenge

Overview

This challenge evaluates task-oriented dialog systems end-to-end, from users' spoken utterances to inferred slot values. For ease of comparison with the existing literature, the challenge is built on the popular MultiWoz task (version 2.1). The challenge focuses on the dialog state tracking (DST) track, since DST is more affected by the switch from written to audio input than response generation.

Results of the Challenge

We received 11 system outputs from 5 teams. Performance was measured using the JGA and SER metrics. The team names are kept anonymous, and we leave it to the teams to identify themselves in the upcoming workshop publications.

Joint Goal Accuracy (JGA)

Systems TTS-Verbatim Human-Verbatim Human-Paraphrased
F-p 44.0 39.5 37.9
F-s 40.4 36.1 34.3
C-p 40.2 31.9 31.8
A-s 37.7 30.1 30.7
C-s 33.1 28.6 28.1
D-s 30.3 23.5 23.2
B-p 27.3 23.9 22.6
D-p 28.6 21.8 21.4
A-p 21.9 21.2 20.0
B-s 22.4 19.2 18.3
E-p 21.3 20.0 18.2

Slot Error Rate (SER)

Systems TTS-Verbatim Human-Verbatim Human-Paraphrased
F-p 17.1 20.0 20.4
F-s 19.2 21.9 22.4
A-s 20.3 26.9 26.2
C-p 20.9 28.1 27.2
C-s 25.0 28.7 29.5
B-p 26.2 30.0 30.6
B-s 28.7 32.2 32.6
A-p 32.8 33.5 33.8
D-s 26.6 36.5 35.1
E-p 35.1 35.5 35.3
D-p 28.0 36.7 36.0

Modified Dev (dev-dstc11) and Test (test-dstc11) Sets

The dev and test sets have been modified: the original slot values have been replaced with new values. One of the main reasons for this is that the distribution of slot values overlaps substantially between these two sets and the training data. As a result, evaluation on the original dev and test sets overestimates the performance of systems, especially those that tend to memorize the slot values in the training data. Introducing new slot values also adds an element of surprise, which helps keep the evaluation of this benchmark fair.

Categorical slots such as hotel-name, restaurant-name, bus-departure, bus-destination, train-departure, and train-destination were replaced with new values. All time mentions were offset by a constant amount within each dialog.
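
As an illustration only, here is a minimal sketch of how such a per-dialog time offset could be applied. The offset value, the helper name shift_times, and the assumption that times appear as 24-hr HH:MM strings are ours, not part of the data release.

import re
from datetime import datetime, timedelta

def shift_times(utterance, offset_minutes):
    """Offset every HH:MM time mention in an utterance by a fixed number of minutes."""
    def shift(match):
        t = datetime.strptime(match.group(0), "%H:%M")
        return (t + timedelta(minutes=offset_minutes)).strftime("%H:%M")
    # Match 24-hr times such as 08:15 or 17:45; the same offset is reused for every turn of a dialog.
    return re.sub(r"\b([01]?\d|2[0-3]):[0-5]\d\b", shift, utterance)

print(shift_times("i need a train leaving after 17:45", 30))  # ... after 18:15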

This modification affects the expected model performance. For example, on the updated dev-dstc11 set, the performance (JGA) of the DST and D3ST models turns out to have been overestimated by about 38 and 17 points (absolute), respectively, relative to the original dev set.

Models dev dev-dstc11
DST 58.3 20.1
D3ST-XXL 57.5 40.1

Time format

In spoken conversations, the 24-hr time format is unnatural. Since crowdworkers are asked to record user responses as naturally as possible, it was necessary to switch all time references in MultiWoz from the 24-hr to the 12-hr format. For consistency, we modified the training, dev, and test sets accordingly. This change is not expected to impact model performance significantly.
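
For illustration, a minimal sketch of the 24-hr to 12-hr conversion; the helper name to_12_hour and the exact output phrasing (e.g. "5:30 pm") are assumptions, and the released data may render times differently.

from datetime import datetime

def to_12_hour(time_24):
    """Convert a 24-hr 'HH:MM' string into a more natural 12-hr form, e.g. '17:30' -> '5:30 pm'."""
    t = datetime.strptime(time_24, "%H:%M")
    return t.strftime("%I:%M %p").lstrip("0").lower()

print(to_12_hour("17:30"))  # 5:30 pm
print(to_12_hour("09:05"))  # 9:05 am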

Data: Raw PCM, Encoder Outputs, Transcripts and Alignments

One of the aims of this task is to facilitate research in novel models that combine text and speech inputs.

ASR outputs: Not all dialog modeling teams are likely to have easy access to ASR systems. To lower the barrier to entering the challenge, we are making available a variety of intermediate ASR outputs. For this purpose, and for ease of reproducibility, we built a strong baseline ASR system on 33k hours of the PeopleSpeech corpus. This corpus is publicly available without any licensing restrictions.

With these considerations, we are making four types of data available.

  1. Raw audio in the standard PCM format, 2 bytes per sample, at a 16 kHz sampling rate.
  2. Audio encoder output from the ASR system, consisting of 512-dimension vectors at a rate of 75 vectors per second.
  3. Transcripts from the ASR system.
  4. Time alignments describe how the recognized words map to the encoder output sequence. Each word or word piece (w:) is followed by the frame index (t:) at which it was emitted. The concatenation of word pieces is indicated by (w:▁). For example, w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11 w:▁ t:15 w:i t:15 ... A sketch for parsing this format follows the list.
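
As an illustration, a minimal sketch of how the alignment string might be parsed into (token, frame) pairs; the function name parse_alignment is ours and not part of the release.

def parse_alignment(align):
    """Parse 'w:while t:2 w:in t:5 ...' into [('while', 2), ('in', 5), ...]."""
    tokens = align.split()
    pairs = []
    # Fields alternate between word pieces (w:) and emission frames (t:).
    for word_field, time_field in zip(tokens[0::2], tokens[1::2]):
        word = word_field[len("w:"):]
        frame = int(time_field[len("t:"):])
        pairs.append((word, frame))
    return pairs

print(parse_alignment("w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11"))
# [('while', 2), ('in', 5), ('cam', 8), ('bridge', 11)]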

Inserting user responses back into MultiWoz dialogs

Since the audio and related features are generated only for user responses, they need to be stitched back into the original dialogs. For this purpose, the audio files are indexed using an explicit identifier of the form -- tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1 -- where tpe is the TTS speaker identifier, line_nr is the line in an associated text file (see below), dialog_id is the original MultiWoz dialog identifier (the name of the json file), and turn_id is the index of the n-th user response in the dialog.
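
A minimal sketch of how such an identifier could be split into its fields; the helper name parse_group_id is ours.

def parse_group_id(group_id):
    """Split 'tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1' into a dict of fields."""
    parts = group_id.split()
    # Fields alternate between 'key:' tokens and their values.
    return {key.rstrip(":"): value for key, value in zip(parts[0::2], parts[1::2])}

info = parse_group_id("tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1")
print(info["dialog_id"], int(info["turn_id"]))  # mul0016.json 1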

Deadlines

Challenge Constraints

The participating teams can use any type of model. They are not required to use the ASR outputs provided in this task and are free to use any ASR system at their disposal. However, for ease of comparison across submissions, we require that the dialog components be trained only on the MultiWoz training data.

The Mapping file contains the mapping of the user utterances to the original dialogs. There are 1000 dialogs each in the Dev and Test sets, as in the original MultiWoz.

The TTS-Verbatim utterances were generated using speaker voices available via the Google Cloud Text-to-Speech API. Four speaker voices were used in the training set; the dev and test sets contain a voice from a held-out speaker.

The Human-Verbatim utterances were recorded by crowd-sourced workers who were instructed to speak verbatim versions of the written user responses.

The Augmented Training data was generated by replacing the slot values in the original dialogs with values sampled from lists of city names, hotel names, restaurant names, and different timestamps. This is similar to how dev-dstc11 and test-dstc11 were generated; however, there is no overlap in slot values between the training and the evaluation sets.

October 2022 Data Update

Training data:

Augmented Training data:

Dev data:

Test data:

Submission format

We request that participants submit their results in the format defined in MultiWOZ_Evaluation.

Output Format:

{
    "xxx0000" : [
        {
            "response": "Your generated delexicalized response.",
            "state": {
                "restaurant" : {
                    "food" : "eatable"
                }, ...
            }, 
            "active_domains": ["restaurant"]
        }, ...
    ], ...
}

The input to the evaluator should be a dictionary (or a .json file) with keys matching dialogue ids in the xxx0000 format (e.g. sng0073 instead of SNG0073.json), and values containing a list of turns. Each turn is a dictionary with the keys shown above: response, state, and active_domains.
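
For example, a hypothetical snippet that assembles predictions in this format and writes them to disk; the dialog id, slot values, and output file name below are placeholders.

import json

submission = {
    "sng0073": [
        {
            # The response can be left empty if only the DST track is attempted (an assumption).
            "response": "",
            "state": {"restaurant": {"food": "italian", "area": "centre"}},
            "active_domains": ["restaurant"],
        },
        # ... one entry per user turn, in dialog order
    ],
    # ... one key per dialog
}

with open("your_system_output.json", "w") as f:
    json.dump(submission, f, indent=2)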

Data Format (HDF5)

For interoperability, the data is distributed in the widely used HDF5 format, which can be read with the standard h5py Python module.

You can test the sanity of a file using the HDF5 command-line utilities, e.g. h5stat:

h5stat train/tpa/mul0016.hd5

Here is a code snippet for reading the data from an HDF5 file.

import h5py
import numpy as np
import scipy.io.wavfile

data = h5py.File('/tmp/mul0016.hd5', 'r')
group = list(data.keys())[0]  # Iterate over keys() for all user turns in the dialog
print(group)  # 'tpe_line_nr: 4519 dialog_id: mul0016.json turn_id: 1'
# Extracting vectors for each user turn (group)
print(data[group]['audio'])  # <HDF5 dataset "audio": shape (72264,), type "<i2">
print(data[group]['feat'])   # <HDF5 dataset "feat": shape (60, 512), type "<f4">
audio_pcm = np.array(data[group]['audio'])   # raw 16-bit PCM samples
enc_output = np.array(data[group]['feat'])   # ASR encoder outputs (frames x 512)
print(data[group].attrs['hyp'])    # while in cambridge i need a hotel that ...
print(data[group].attrs['align'])  # w:while t:2 w:in t:5 w:cam t:8 w:bridge t:11 w:▁ t:15 w:i t:15 ...
scipy.io.wavfile.write('/tmp/ex.wav', 16000, audio_pcm)  # write the samples as a 16 kHz wav file

Evaluation Metrics

The performance of the submitted outputs will be evaluated using Joint Goal Accuracy (JGA) as the primary metric, computed with the standard MultiWoz evaluation script. In addition, Slot Error Rate (SER) will be used as a secondary metric to avoid excessive influence of the early turns in the dialog.

The Slot Error Rate will be computed as defined in Equation 11 of Makhoul et al., 1999: the ratio of the total number of slot errors (substitutions + deletions + insertions) to the total number of slots in the reference, pooled across all the dialogs.
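
In code, the computation reduces to the following sketch; the function name slot_error_rate and the example counts are ours.

def slot_error_rate(substitutions, deletions, insertions, reference_slots):
    """SER = (S + D + I) / (number of slots in the reference), pooled over all dialogs."""
    return (substitutions + deletions + insertions) / reference_slots

# Example: 3 substitutions, 1 deletion and 2 insertions against 40 reference slots.
print(f"{slot_error_rate(3, 1, 2, 40):.1%}")  # 15.0%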

Baseline Performance

In the table below, we report the performance of a baseline D3ST (XXL) model trained on the original written text and evaluated on the dev sets described above.

Dev set WER JGA SER
tts-verbatim 8.1% 26.3 27.5%
human-verbatim 11.9% 22.6 31.6%

The ASR performance degrades by about 3.8% absolute (8.1% to 11.9% WER) going from tts-verbatim to human-verbatim. A similar drop is observed in the dialog model, with degradations of about 3.7 points in JGA and 4.1 points in SER (Slot Error Rate).

Data Augmentation Performance

In the table below, we report the benefits of the Augmented Training data. The DST and D3ST-XXL models were trained and evaluated on the written (Text) and TTS-Verbatim versions of the dev set.

Training condition Dev input type DST D3ST-XXL
Multiwoz 2.1 Text 20.1 43.9
TTS-Verbatim TTS-Verbatim 20.6 27.6
100x Text Text 41.7 57.1
100x TTS-Verbatim TTS-Verbatim 31.0 40.8

Both DST and D3ST-XXL models benefit substantially from the Augmented Training data provided above.

Scoring

The challenge is scored using a patched version of the standard MultiWOZ_Evaluation script. Download the patch (patch.2022-11-02.txt) and apply it as shown below.

git clone https://github.com/Tomiinek/MultiWOZ_Evaluation MultiWOZ_Evaluation
cd MultiWOZ_Evaluation
patch -p1 < ../patch.2022-11-02.txt
python3 ./evaluate.py --dst --golden ../test-dstc11.2022-1102.gold.json --input your_system_output.json