When Text and Speech are Not Enough: A Multimodal Dataset of Collaboration in a Situated Task

To adequately model information exchanged in real human-human interactions, considering speech or text alone leaves out many critical modalities. The channels contributing to the “making of sense” in human-human interactions include but are not limited to gesture, speech, user-interaction modeling, gaze, joint attention, and involvement/engagement, all of which need to be adequately modeled to automatically extract correct and meaningful information. In this paper, we present a multimodal dataset of a novel situated and shared collaborative task, with the above channels annotated to encode these different aspects of the situated and embodied involvement of the participants in the joint activity.

The Weights Task Dataset (WTD) is a novel dataset of a situated, shared collaborative task, originally collected to study multimodal indicators of collaborative problem solving (CPS). This dataset complements other datasets for human-human interaction such as Anderson et al. (1991); Liu, Cai, Ji, and Liu (2017); Van Gemeren, Poppe, and Veltkamp (2016); Wang et al. (2017); Yun, Honorio, Chattopadhyay, Berg, and Samaras (2012), which lack at least one of: multimodal data, physical object manipulation, or multiparty interaction. Our data is novel in the joint presence of speech, gestures, and actions in a collaborative multiparty task. Annotation encodes many cross-cutting aspects of the situated and embodied involvement of the participants in joint activity.

METHOD
The Weights Task is completed by triads at a round table. A webcam captures the task equipment and participants. Kinect Azure cameras capture RGBD video from different angles. Task equipment includes 6 blocks (of varying weight, size, and color), a balance scale, a worksheet, and a computer with a survey where participants submit their answers.

STEPS
Participants (English speakers, ≥18 years) were recruited from the student body of Colorado State University. Informed consent was obtained. Table 1 shows the breakdown of gender and ethnicity.
Participants are given a balance scale to determine the weights of five blocks. They are given the weight of one of the blocks (10g), and must determine the weights of the others. As the weight of each block is discovered, it is placed on the worksheet in the cell corresponding to the weight. Next, participants are given a new block and must identify its weight without the scale, by deducing it based on the pattern observed in the initial block weights. Finally, participants must infer the weight of the next hypothetical block in the set and explain how they determined it. After each stage, groups submit their answers in the survey form.
The dataset consists of 10 videos (~170 minutes). Table 2 provides descriptive statistics of the data. Figure 1 shows participants engaging with the objects on the table from the perspective of the main Kinect. Figure 2 shows different annotations (described below).

Utterance Segmentation and Transcription
Audio from all groups was segmented into utterances (a single person's continuous speech, delimited by silence) and transcribed. Segmentation and transcription were conducted by humans, by Google Cloud ASR (Velikovich et al., 2018), and by OpenAI's Whisper model (Radford et al., 2023). Human transcription was performed by listening and transcribing what was said by each participant during a given manually-segmented utterance. Google and Whisper transcriptions were conducted over the utterances segmented by the same system (which may conflate overlapping speech by multiple people). Transcriptions are presented in .csv files.
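Because the same audio is transcribed by humans and by two ASR systems, one common way to compare them is word error rate (WER). The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the example strings are invented, the normalization (lowercasing and whitespace tokenization) is an assumption, and how human and ASR segments are paired up is left open, since the two are segmented differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example strings, not taken from the dataset.
human = "the blue block is ten grams"
asr = "the blue block is tan grams"
print(word_error_rate(human, asr))  # one substitution over six reference words
```

In practice the reference and hypothesis would be read from the transcript .csv files; their column names are not specified here, so none are assumed.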

Gesture Abstract Meaning Representation (GAMR)
Participant gestures are annotated using the GAMR framework (Brutti, Donatelli, Lai, & Pustejovsky, 2022). Most WTD gestures are deictic, indicating reference to an object or a location. Iconic gestures represent attributes of an action or object. The meaning of emblematic gestures is set by cultural convention. GAMR was dual annotated by annotators trained by authors of the framework (SMATCH F1-score = 0.75). This data is presented in PENMAN notation in .eaf files.
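ELAN .eaf files are XML documents, so the time-aligned annotations can be read with a standard XML parser. The sketch below is illustrative only: the tier name `GAMR_P1` and the PENMAN-style string are invented placeholders, not annotations from the dataset, and the actual tier layout of the WTD files may differ.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an .eaf file. The tier name and annotation value
# are invented for illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2500"/>
  </TIME_ORDER>
  <TIER TIER_ID="GAMR_P1" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>(d / deixis-GA :ARG1 (b / block))</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_xml: str) -> dict:
    """Return {tier_id: [(start_ms, end_ms, value), ...]} for aligned annotations."""
    root = ET.fromstring(eaf_xml)
    # Map time-slot IDs to millisecond offsets; slots without a
    # TIME_VALUE are unaligned and are left out of the map.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            value = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((start, end, value))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


tiers = read_tiers(SAMPLE_EAF)
for tier_id, rows in tiers.items():
    print(tier_id, rows)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. The extracted PENMAN strings can then be passed to a PENMAN parser of the reader's choice.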

Nonverbal Indicators of Collaborative-Learning Environments (NICE)
The NICE coding scheme (Dey et al., 2023) captures nonverbal behaviors when people are working together in groups, such as the direction of gaze, posture (e.g., leaning toward or away from the activity area), and usage of tools (including pointing at or to the tool, as well as directly manipulating it). NICE was annotated by an author of the framework over Groups 1-3 and Group 5. This data is presented in .xlsx format.

Figure 2 Multichannel (GAMR, NICE, speech transcription, and CPS) annotation "score" using ELAN (Brugman & Russel, 2004).

REUSE POTENTIAL
This data was originally gathered to study multimodal indicators of CPS, but its rich multichannel nature also lends itself well to other lines of research. Researchers in education and learning sciences can use it to develop activities to support collaborative interaction and learning. Researchers in linguistics and psychology can use it to study interactive behavior and communication, including modeling the evolution of group common ground over time, à la Clark and Carlson (1981), and for natural language processing tasks such as assessing speech recognition fidelity (e.g., Terpstra et al., 2023, which compared the effects of different segmentation methods). The rich multimodality will be of use to researchers in AI. For example, the Kinect data can be used to develop and train gesture recognition algorithms (e.g., VanderHoeven, Blanchard, & Krishnaswamy, 2023) or object and action detectors. The different modalities can serve as signals to an interactive AI agent that assists facilitators and scales up collaborative group activities by interpreting key multimodal aspects of collaborative group interaction in context (cf. Bradford, Khebour, Blanchard, & Krishnaswamy, 2023).
The dataset will continue to be updated at the public repository as additional annotations are performed, including annotations of object positions, of actions taken with the different objects, and of the common ground constructed between participants as the task unfolds.
There are two potential limitations for reuse. First, in the Azure (skeleton) data, the body IDs in some frames do not align with participant IDs, because the Microsoft tracker assigns a new body ID whenever it loses and regains a participant. Second, the prosodic features, although useful in a number of applications, could introduce noise when more than one person is speaking during a single segmented utterance.
Updates will be noted at the dataset link. The data is freely available for research purposes, as indicated in the consent form (also available at the dataset link).
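One possible workaround for the body-ID limitation noted above is to re-derive participant identity from position in each frame, on the assumption that participants stay seated. The sketch below is a hypothetical illustration and not part of the dataset tooling: the seat coordinates, the participant labels `P1` to `P3`, and the choice of a single 2D joint position per body (for example, the pelvis) are all assumptions.

```python
from itertools import permutations
from math import dist

# Hypothetical seat centers (x, z) in meters in camera coordinates.
# These values are invented; real ones would be estimated from the data.
SEATS = {"P1": (-0.8, 1.5), "P2": (0.0, 1.9), "P3": (0.8, 1.5)}


def assign_participants(bodies: dict) -> dict:
    """Map tracker body IDs in one frame to participant IDs.

    `bodies` maps body ID -> (x, z) position of one joint. The mapping
    that minimizes the total body-to-seat distance is returned. If more
    bodies than seats are present, an empty dict is returned.
    """
    body_ids = list(bodies)
    best, best_cost = {}, float("inf")
    # With three seats, trying every assignment is cheap (at most 6 cases).
    for seats in permutations(SEATS, len(body_ids)):
        cost = sum(dist(bodies[b], SEATS[s]) for b, s in zip(body_ids, seats))
        if cost < best_cost:
            best, best_cost = dict(zip(body_ids, seats)), cost
    return best


# Invented frames: the tracker loses body 2 and later reports it as body 7.
frame_before = {0: (-0.75, 1.4), 1: (0.1, 2.0), 2: (0.7, 1.6)}
frame_after = {0: (-0.8, 1.5), 1: (0.05, 1.85), 7: (0.82, 1.45)}
print(assign_participants(frame_before))
print(assign_participants(frame_after))
```

Because the assignment depends only on position, body 2 and body 7 map to the same participant. This breaks down if participants change seats or lean far across the table, so the result should be spot-checked against the video.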

Figure 1
Figure 1 Three participants engaged in the Weights Task. Participant #3 (on the right) is taking a block off the scale to try another configuration while Participant #2 (in the middle) wants to clarify the weight of the block under it. Multimodal information is required to make such a judgment.

Table 1
Collaborative problem solving (CPS) annotation was performed at the utterance level using the framework of Sun et al. (2020). Annotators watched the video and coded each utterance with potentially multiple labels based on content, context, and position in the conversational sequence. Videos were annotated by two annotators (κ = 0.62) and adjudicated by an expert who underwent extensive training in the framework. CPS is presented in .csv files.
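For readers unfamiliar with the agreement statistic reported above, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance. The text does not state how κ was computed over utterances that carry multiple labels, so the sketch below makes a simplifying assumption: each item has exactly one label per annotator (for example, whether a single CPS code is present or absent). The label sequences are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented binary codes (1 = code present) for ten utterances.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # observed 0.8, chance 0.52
```

With multiple labels per utterance, one option is to compute κ separately for each code in this binary form; whether that matches the reported procedure is not stated in the text.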

Table 2
Dataset descriptive statistics.