Pedro Ortiz Suarez
@pjox.bsky.social
310 followers 440 following 18 posts
Principal Research Scientist at the Common Crawl Foundation. Weird coffee person ☕️, runner 🏃🏻‍♂️. (he/him) 🇫🇷🇪🇺🇨🇴
Reposted by Pedro Ortiz Suarez
nfel.bsky.social
We introduce the TableEval benchmark and investigate the effectiveness and robustness of text-based and multimodal LLMs on table understanding through a cross-domain & cross-modality evaluation.

Joint work by DFKI SLT incl. Fabio Barth, Raia Abu Ahmad, @malteos.bsky.social @pjox.bsky.social
pjox.bsky.social
If you want to help us improve language and cultural coverage and build an open source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰
Reposted by Pedro Ortiz Suarez
catherinearnett.bsky.social
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
catherinearnett.bsky.social
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
Reposted by Pedro Ortiz Suarez
commoncrawl.bsky.social
The deadline for paper submissions has been extended!

The new deadline is July 3, 2025 (AoE).

For more information, please visit: wmdqs.org
Reposted by Pedro Ortiz Suarez
commoncrawl.bsky.social
The Common Crawl Foundation, together with IBM, the AI Alliance, and BrightQuery will be hosting an "UN Conference" at IBM's new flagship NYC HQ at One Madison Avenue on Friday, June 20, from 12:30-5pm.

If you are in NYC, it would be great to see you there!

lu.ma/p0a1scde
AI Alliance @ IBM One Madison (UN Open Source Week 2025) · Luma
This year’s UN Open Source Week 2025 (June 16-20) will once again bring together a global “who is who” of Open Source leaders. As part of the official…
Reposted by Pedro Ortiz Suarez
commoncrawl.bsky.social
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org
pjox.bsky.social
I’ll be running the Paris Marathon this Sunday for cancer research and treatment 🏃🏻‍♂️

Please donate if you can! Every donation, no matter how small, helps immensely.

marathon-paris.dossards-solidaires.org/fundraisers/...
Reposted by Pedro Ortiz Suarez
netpreserve.bsky.social
We would like to welcome all of our attending members to Oslo, with a special welcome to two of our newest members, the Publications Office of the European Union and @commoncrawl.bsky.social!

@nettarkivet.bsky.social | #iipcGA25 | #webarchiving
pjox.bsky.social
The same is true for coffee: prices haven’t increased much in the last 60 years, but the cost of living for producers has skyrocketed in recent years 😢
Reposted by Pedro Ortiz Suarez
internetarchive.eu
Today is "I love Free Software Day".

Thank you to the @commoncrawl.bsky.social Foundation for all their hard work. Onwards! @pjox.bsky.social - So great to meet in person.
Two men Brewster and Pedro standing in front of a fireplace and mirror smiling and facing the camera.
pjox.bsky.social
I’ll be at the AI Action Summit in Paris today. If you’re attending and want to discuss @commoncrawl.bsky.social or open data, please DM me!
pjox.bsky.social
We're very happy to release cc-downloader, a new CLI tool to download Common Crawl data 📚🚀🧑‍💻

cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit its GitHub repository at github.com/commoncrawl/....
pjox.bsky.social
If you care about open data or anything related to crawling, the Common Crawl Foundation @commoncrawl.bsky.social is now on Bluesky 📊📈📚🥳
pjox.bsky.social
😂 No worries, I do mostly Rust and Python these days 🦀
pjox.bsky.social
Ran the Berlin Marathon yesterday, and while it was not my best marathon and I was recovering from an injury, I had an amazing time. I really hope I can do better next year in Paris, where I'll run for cancer research. If you can donate, please do so: marathon-paris.dossards-solidaires.org/fundraisers/...
Report of Pedro's Berlin Marathon splits, first half time was 2:06:57 and second half was 2:44:01. Final time was 4:50:58. Photo of Pedro holding his medal in front of the Brandenburg Gate. Photo of Pedro's Bib number (26445) and his medal.
pjox.bsky.social
If you can and want to give a donation to the Gustave Roussy Institute, however small, I'd be extremely grateful. If you cannot donate, resharing/boosting is always appreciated! Thank you! ❤️
pjox.bsky.social
Ran the Paris Marathon yesterday. It was an amazing experience. Getting into running was probably the best decision I’ve made recently. It has helped massively with both physical and mental health. I highly recommend any type of physical activity, especially for researchers 🏃🏻‍♂️
Pedro’s medal and bib (number 77156) for the Paris Marathon. Pedro’s splits for the Paris Marathon and final time of 4:47:33. Full readable results should be available at https://resultscui.active.com/participants/45607218
pjox.bsky.social
I still don’t know how, but I finished my first marathon in 5:03:04 🥹
Pedro at the Brandenburg Gate after finishing the Berlin Marathon. Pedro’s medal for finishing the Berlin Marathon, time reads 5:03:04
pjox.bsky.social
Very happy to announce this new release of @oscarproject.bsky.social 🥳. We're still working on documentation so please be patient, more details and features are coming soon! 👀

We're always open to feedback and collaboration, so please join our community: https://t.co/toLKAPje4E