Riccardo Cappuzzo
@riccardocappuzzo.com
Research engineer at Inria Saclay, working on the Skrub library. Python, data preparation, ML, tabular learning. ORCID: 0000-0002-4448-2959 Hoshiyomi ☄️ https://www.riccardocappuzzo.com https://github.com/rcap107
riccardocappuzzo.com
"ok the test run is done, let's see"

...

"this will be hard to debug"
Reposted by Riccardo Cappuzzo
emilienschultz.bsky.social
What a banger is skrub @skrub-data.bsky.social !

Big thumbs up for the sklearn team & the maintainer of this package
riccardocappuzzo.com
Thanks a lot for the compliments! I had a lot of fun giving the talk, and I'm happy to see people liked it
riccardocappuzzo.com
My first actual talk in front of a ton of people 🙃
skrub-data.bsky.social
📅 Less than a week away! The talk will be on Oct 1st at 10.05AM in room Louis Armand 1 - Est.

If you want to contribute to skrub, we will also have a sprint on Thursday.

See you there!
pydataparis.bsky.social
📢 Talk Announcement

"Skrub: machine learning for dataframes", by Guillaume Lemaitre, Jérôme Dockès and @riccardocappuzzo.com.
@skrub-data.bsky.social

📜 Talk info: pretalx.com/pydata-paris-2025/talk/T9KTPU
📅 Schedule: pydata.org/paris2025/schedule
🎟 Tickets: pydata.org/paris2025/tickets
Reposted by Riccardo Cappuzzo
skrub-data.bsky.social
Do you have to deal with numerical features that involve large outliers, and need to train linear models or neural networks?

Then you might want to try the skrub SquashingScaler. The SquashingScaler behaves like scikit-learn RobustScaler, but smoothly clips outliers to predefined boundaries.
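
To make the "smoothly clips outliers" idea concrete, here is a minimal NumPy sketch, not skrub's actual implementation. It assumes a median/IQR centering step (as RobustScaler does) followed by a soft clip of the form z / sqrt(1 + (z/B)^2); the function name `squash_scale` and its parameter names are hypothetical and chosen only for illustration.

```python
import numpy as np


def squash_scale(x, max_absolute_value=3.0, quantile_range=(25.0, 75.0)):
    """Robustly scale a 1-D array, then softly clip it into (-B, B).

    Illustrative sketch only: the name and parameters are assumptions,
    not the skrub API.
    """
    x = np.asarray(x, dtype=float)
    q_low, median, q_high = np.percentile(
        x, [quantile_range[0], 50.0, quantile_range[1]]
    )
    iqr = q_high - q_low
    # Robust centering and scaling: median and interquartile range
    # are barely affected by a few extreme values.
    z = (x - median) / iqr if iqr > 0 else x - median
    # Soft clipping: close to the identity near 0, saturating smoothly
    # towards +/- max_absolute_value for large |z|.
    return z / np.sqrt(1.0 + (z / max_absolute_value) ** 2)


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
scaled = squash_scale(values)
print(scaled)
```

The outlier (1000.0) ends up just below the boundary instead of dominating the feature range, while the ordering of the values is preserved.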
riccardocappuzzo.com
Working hard on the next @skrub-data.bsky.social slide deck...
Reposted by Riccardo Cappuzzo
ogrisel.bsky.social
Today at #EuroScipy2025, @glemaitre58.bsky.social and I presented a tutorial on pitfalls of machine learning for imbalanced classification problems.

We discussed what (not) to do when fitting a classifier and obtaining degenerate precision or recall values.

probabl-ai.github.io/calibration-...
Imbalanced classification: pitfalls and solutions — Probabilistic calibration of cost-sensitive learning
probabl-ai.github.io
Reposted by Riccardo Cappuzzo
ogrisel.bsky.social
Attending the @skrub-data.bsky.social tutorial by @riccardocappuzzo.com and @glemaitre58.bsky.social at #EuroScipy2025. They introduce the new DataOps feature released in skrub 0.6.

Here is the repo with the material for the tutorial: github.com/skrub-data/E...
Photo of Riccardo presenting skrub DataOps in a lecture room to an audience of ~50 people.
Reposted by Riccardo Cappuzzo
miketheman.com
Heads Up, #Python Developers!

There is an active phishing attack targeting PyPI users.

• Threat: Emails from [email protected] (with a 'j') link to a fake login page.
• Action: Do not click any links. If you already did, change your PyPI password ASAP.
• Note: PyPI itself has not been breached.
riccardocappuzzo.com
"My watch gained gemini power overnight"

One sentence horror
riccardocappuzzo.com
Huge release, and the first one where I felt like I actually contributed a lot to the final result.

I really think DataOps are a game changer, and I can't wait to see what people come up with with them.

I also ended up rewriting most of the user guide, hopefully improving it along the way 😂
skrub-data.bsky.social
⚡ Release 0.6.0 is now out! ⚡

🚀 Major update! Skrub DataOps, various improvements for the TableReport, new tools for applying transformers to the columns, and a new robust transformer for numerical features are only some of the features included in this release.
riccardocappuzzo.com
The walrus operator mystifies me
Reposted by Riccardo Cappuzzo
riccardocappuzzo.com
Really cool graffiti I spotted while walking around in the town where I live
riccardocappuzzo.com
Out of all the features in the expressions, this may be my personal favorite. I always end up adding too many configurations just because the syntax is so convenient.
skrub-data.bsky.social
👀 This week's post will be another sneak peek into skrub expressions, an upcoming feature that will ease the preparation and execution of machine learning pipelines on dataframes.

This time we will focus on how expressions can simplify the construction of complex hyperparameter grids.
Reposted by Riccardo Cappuzzo
skrub-data.bsky.social
📝 The skrub TextEncoder brings the power of HuggingFace language models to embed text features in tabular machine learning, for all those use cases that involve text-based columns.
Reposted by Riccardo Cappuzzo
skrub-data.bsky.social
The Skrub Cleaner is a lightweight transformer that performs consistency checks on a dataframe:

🔍 It gives a uniform representation of null values, converting those represented as strings (such as "N/A")
🗑️ It drops columns that contain too many null values (according to a user-defined threshold)
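
The two checks above can be sketched in plain pandas. This is a rough illustration of the behavior described in the post, not the skrub Cleaner itself; the helper `clean_nulls`, the `NULL_STRINGS` list, and the `drop_null_fraction` threshold are assumptions made for the example.

```python
import pandas as pd

# Hypothetical list of strings treated as missing values.
NULL_STRINGS = ["N/A", "NA", "null", "None", ""]


def clean_nulls(df, drop_null_fraction=0.5):
    """Unify null representations, then drop mostly-null columns.

    Illustrative sketch only, not the skrub implementation.
    """
    # Replace string placeholders such as "N/A" with real missing values.
    cleaned = df.mask(df.isin(NULL_STRINGS))
    # Fraction of missing values in each column.
    null_fraction = cleaned.isna().mean()
    # Keep only columns whose null fraction does not exceed the threshold.
    keep = null_fraction[null_fraction <= drop_null_fraction].index
    return cleaned[keep]


df = pd.DataFrame(
    {
        "city": ["Paris", "N/A", "Rome"],
        "notes": ["N/A", "", "N/A"],
        "age": [31, 45, 27],
    }
)
result = clean_nulls(df, drop_null_fraction=0.5)
print(result)
```

Here the "notes" column is dropped because every entry is a null placeholder, while "city" keeps its single missing value as a proper null.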
riccardocappuzzo.com
Now that the paper is out, I can finally share the totally-not-confusing script/plot/table map I made to track which scripts prepare which figures and tables and from what data.

If it wasn't clear, don't do this. If you *really* have to, I used the @obsidian.md canvas for this.
riccardocappuzzo.com
Tech issues happen 🙈 this is what it should look like bsky.app/profile/ricc...

And thanks!
riccardocappuzzo.com
A bit of a mess up with this figure! This is what it's supposed to look like 🙈