Author | Lightnews

@cointegrated.bsky.social

cointegrated.bsky.social @cointegrated.bsky.social · Jul 5

Adding a bunch of tags for discoverability: #machinetranslation #flores #seed #languages #multilinguality #ai #nlp #mt

cointegrated.bsky.social @cointegrated.bsky.social · Jul 5

The Seed training dataset also received a few submissions, including new translations into Spanish and Italian (from which it might be easier to translate into lower-resourced languages).

cointegrated.bsky.social @cointegrated.bsky.social · Jul 5

BTW, last year, as part of the previous shared task (aclanthology.org/2024.wmt-1.4), FLORES+ was extended with the languages Emakhuwa, Erzya, Tuvan, Karakalpak, Aragonese, Aranese, Asturian, Valencian, and Wu Chinese, and received a number of edits to other languages.

cointegrated.bsky.social @cointegrated.bsky.social · Jul 5

What to do now?
- Download the dataset and benchmark multilingual models: huggingface.co/datasets/ope...
- Subscribe to our newsletter: openlanguagedata.substack.com/about
- Participate in the WMT25 Open Data shared task to enrich open datasets with new languages: www2.statmt.org/wmt25/open-d...

cointegrated.bsky.social @cointegrated.bsky.social · Jul 5

We (oldi.org) recently released version 3.0 of the FLORES+ dataset: a benchmark for multilingual machine translation.

In this version, we added the Ladin language (there are now 222 language varieties in the dataset!), corrected the spelling for Chuvash and Dargwa, and fixed the sentence order in Aranese.
