cointegrated.bsky.social
@cointegrated.bsky.social
The Seed training dataset also received a few submissions, including new translations into Spanish and Italian (from which it might be easier to translate into lower-resourced languages).
BTW, last year, as part of the previous shared task (aclanthology.org/2024.wmt-1.4), FLORES+ was extended with the languages Emakhuwa, Erzya, Tuvan, Karakalpak, Aragonese, Aranese, Asturian, Valencian, and Wu Chinese, and received a number of edits to other languages.
What to do now?
- Download the dataset and benchmark multilingual models: huggingface.co/datasets/ope...
- Subscribe to our newsletter: openlanguagedata.substack.com/about
- Participate in the WMT25 Open Data shared task to enrich open datasets with new languages: www2.statmt.org/wmt25/open-d...
We (oldi.org) recently released version 3.0 of the FLORES+ dataset: a benchmark for multilingual machine translation.

In this version, we added the Ladin language (now there are 222 language varieties in the dataset!), corrected the spelling for Chuvash and Dargwa, and fixed the sentence order in Aranese.