Translated Wikipedia Biographies

Access Dataset

English -> Spanish (516 KB)

English -> German (517 KB)

Data Card

Link

Dataset type

Text

License type

CC-BY-SA 3.0

Authors

Anja Austermann

Michelle Linch

Romina Stella

Kellie Webster

Publication year

2021

Description

One research direction in machine translation uses context from surrounding sentences or passages to improve gender accuracy. Traditional neural machine translation (NMT) methods translate each sentence in isolation, but gender information is not always stated explicitly in every sentence. The difficulty arises when a choice in a translated sentence depends on context from earlier sentences: information that is explicit earlier in the source is needed to resolve gender that the target sentence must mark explicitly.
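To make the contrast concrete, the sketch below shows one simple way to build context-augmented inputs by prepending the preceding sentences to each source sentence. It is an illustration only: the function name `with_context`, the `window` parameter, and the toy passage are assumptions made for this example and are not part of the dataset or of any particular translation system.

```python
from typing import List


def with_context(sentences: List[str], window: int = 1, sep: str = " ") -> List[str]:
    """Return one input per sentence, prefixed by up to `window` earlier sentences."""
    inputs = []
    for i in range(len(sentences)):
        start = max(0, i - window)  # clamp at the start of the passage
        inputs.append(sep.join(sentences[start:i + 1]))
    return inputs


# Toy passage written for this example (not taken from the dataset).
# Translating "The physicist" into Spanish requires a gender choice
# ("La física" vs. "El físico") that only the first sentence resolves.
passage = [
    "Marie Curie was born in Warsaw.",
    "The physicist won two Nobel Prizes.",
]

isolated = with_context(passage, window=0)    # sentence-by-sentence inputs
contextual = with_context(passage, window=1)  # each input carries one prior sentence
```

With `window=0` the second input contains no cue about the referent's gender; with `window=1` it includes the sentence that names the person.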

The Translated Wikipedia Biographies dataset has been designed to analyze common gender errors in machine translation, such as incorrect gender choices in pro-drop, possessives, and gender agreement.

Each instance in the dataset represents a person (identified in the biographies as feminine or masculine), a rock band, or a sports team (both considered genderless). Each entity is represented by a long text translation (8 to 15 connected sentences referring to that central entity). Articles were written natively in English and have been professionally translated into Spanish and German. For Spanish, translations were optimized for pronoun drop (pro-drop), so the same set can be used to analyze pro-drop (Spanish → English) and gender agreement (English → Spanish).
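As a rough sketch of how this structure might be handled in code, the example below groups sentence-level rows into one passage per entity. The field names (`document_id`, `gender`, `source`, `target`), the label strings, and the sample rows are all hypothetical and chosen for illustration; check the downloaded files for the actual column names and values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Row:
    document_id: str  # hypothetical key identifying the central entity
    gender: str       # hypothetical labels: "feminine", "masculine", "neutral"
    source: str       # English sentence
    target: str       # professional translation


def group_by_document(rows: List[Row]) -> Dict[str, List[Row]]:
    """Collect rows into passages, preserving sentence order within each entity."""
    docs: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        docs[row.document_id].append(row)
    return dict(docs)


# Toy rows written for this example (not taken from the dataset).
# The Spanish targets drop the subject pronoun, as in pro-drop translation.
rows = [
    Row("doc-1", "feminine", "She was born in Warsaw.", "Nació en Varsovia."),
    Row("doc-1", "feminine", "She was a physicist.", "Fue física."),
    Row("doc-2", "neutral", "The band formed in 1990.", "La banda se formó en 1990."),
]

docs = group_by_document(rows)
sizes = {doc_id: len(sentences) for doc_id, sentences in docs.items()}
```

Grouping by entity keeps the connected sentences together, which is what allows an evaluation to check whether gender established early in a passage is carried through to later sentences.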