Olivier Grisel
@ogrisel.bsky.social
2.6K followers 1.1K following 100 posts
Software engineer at probabl, scikit-learn contributor. Also at: https://sigmoid.social/@ogrisel https://github.com/ogrisel
ogrisel.bsky.social
I will speak about probabilistic regressions, @skrub-data.bsky.social and skore contributors will also present their libraries. Come join us!
scikit-learn.org
A bunch of scikit-learn core contributors will attend or speak at @pydataparis.bsky.social 2025 on Tuesday and Wednesday next week.

Ticketing, practical info, and schedule at: pydata.org/paris2025
PyData Paris 2025
pydata.org
ogrisel.bsky.social
We set up some dedicated automated tests and discovered a bunch of thread-safety bugs, but they are now tracked by dedicated issues, and we have plans to fix them all, hopefully in time for 1.8.
Reposted by Olivier Grisel
pydataparis.bsky.social
We’re happy to announce our Social Event, taking place on Tuesday 30th September at 6pm at the Cité des sciences. A perfect opportunity to unwind and connect with fellow attendees after a day of interesting talks!

pydata.org/paris2025/so...
pydata.org/paris2025/ti...
ogrisel.bsky.social
Looking forward to attending PyData Paris 2025! I will give a talk about probabilistic predictions for regression problems (I need to start working on my slides ;)
pydataparis.bsky.social
📢 Talk Announcement

"Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them", by @ogrisel.bsky.social from @probabl.ai .

📜 pretalx.com/pydata-paris-2025/talk/DVMZBT
📅 pydata.org/paris2025/schedule
🎟 pydata.org/paris2025/tickets
Reposted by Olivier Grisel
jtp.io
👋 JupyterLab and Jupyter Notebook users:

What's one thing you'd love to see improved in JupyterLab, Jupyter Notebook, or JupyterLite?

The team is prepping the upcoming 4.5/7.5 releases and wants to tackle some usability issues.

Drop your feedback below, this will help prioritize what gets fixed!👇
ogrisel.bsky.social
However, the Elkan 2001 post-hoc prevalence correction can be used for any (well-specified) probabilistic classifier, including gradient boosting classifiers, assuming the training set is a uniform sample of the population conditionally on the class.
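A minimal sketch of this prior-shift correction (the function name and the toy prevalence values are illustrative, not from the original post), assuming the training set is a class-conditional uniform sample and both prevalences are known:

```python
import numpy as np

def elkan_prevalence_correction(p_train, prev_train, prev_target):
    """Map probabilities estimated under the training prevalence to the
    target (deployment) prevalence, assuming P(x | y) is unchanged."""
    p = np.asarray(p_train, dtype=float)
    b, b_new = prev_train, prev_target
    # Equivalent to rescaling the posterior odds by the ratio of prior odds:
    # odds_new = odds * (b_new / (1 - b_new)) / (b / (1 - b))
    num = b_new * (1.0 - b) * p
    den = b * (1.0 - p) * (1.0 - b_new) + num
    return num / den

# Example: model trained on 50/50 rebalanced data, deployed where the
# positive class is only 1% prevalent.
print(elkan_prevalence_correction([0.2, 0.5, 0.9], prev_train=0.5, prev_target=0.01))
```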
ogrisel.bsky.social
Interestingly, for logistic regression, this is equivalent to shifting the intercept by the difference between the logit of the positive-class prevalence in the deployment population and the logit of its prevalence in the training set.
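A hedged illustration of that intercept shift with scikit-learn's LogisticRegression and scipy's logit; the toy data and the 1% deployment prevalence are assumed values for the example:

```python
from scipy.special import logit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with roughly 50% positives at training time.
X, y = make_classification(n_samples=1_000, weights=[0.5, 0.5], random_state=0)
clf = LogisticRegression().fit(X, y)

prev_train = y.mean()  # prevalence observed in the training set
prev_target = 0.01     # assumed prevalence in the deployment population

# Shift the intercept by the difference of the prevalence logits so that
# predicted probabilities reflect the deployment prevalence.
clf.intercept_ = clf.intercept_ + (logit(prev_target) - logit(prev_train))
```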
ogrisel.bsky.social
Equivalently, we can append a monotonic post-hoc transformation to a naively trained classifier to obtain a prevalence-corrected classifier, as shown in Theorem 2 of cseweb.ucsd.edu/~elkan/resca...
cseweb.ucsd.edu
ogrisel.bsky.social
In this case, we can use weight-based training to correct the model's probabilistic predictions to stay well calibrated with respect to the target deployment setting.
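A possible sketch of such weight-based training using scikit-learn's sample_weight support; the prevalences and the choice of HistGradientBoostingClassifier are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy training set collected with ~50% positives (e.g. case-control sampling),
# while the assumed deployment prevalence is only 1%.
X_train, y_train = make_classification(
    n_samples=5_000, weights=[0.5, 0.5], random_state=0
)
prev_train, prev_target = y_train.mean(), 0.01

# Reweight each training point so the weighted class proportions match the
# deployment prevalence instead of the training-set prevalence.
sample_weight = np.where(
    y_train == 1,
    prev_target / prev_train,
    (1.0 - prev_target) / (1.0 - prev_train),
)
clf = HistGradientBoostingClassifier().fit(
    X_train, y_train, sample_weight=sample_weight
)
```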
ogrisel.bsky.social
This problem typically happens when the class of interest (positive class) is so rare (medical screening, predictive maintenance, fraud detection...) that collecting training features for the negative cases in the correct proportion would be too costly (or even illegal/unethical).
ogrisel.bsky.social
We then discussed another common related problem: how to deal with a prevalence shift between observed data and the deployment setting?

probabl-ai.github.io/calibration-...
Diagram explaining the general flow of data science operations to correct for prevalence shifts. See the linked notebook for an exhaustive description of the setting.
ogrisel.bsky.social
If you can, consider defining a business-specific cost function and use it to tune the decision threshold automatically for your deployment setting.

We covered that precise setting in an earlier workshop:

probabl-ai.github.io/calibration-...
Cost-sensitive learning to optimize business metrics — Probabilistic calibration of cost-sensitive learning
probabl-ai.github.io
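One way to do this with scikit-learn (1.5 or later) is TunedThresholdClassifierCV; the cost and gain values below are made-up numbers for illustration, not from the workshop material:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

def business_gain(y_true, y_pred):
    # Hypothetical gains/costs per decision outcome: each caught positive is
    # worth 100, each false alert costs 5, each missed positive costs 50.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 100 * tp - 5 * fp - 50 * fn

model = TunedThresholdClassifierCV(
    LogisticRegression(),
    scoring=make_scorer(business_gain),
    cv=5,
).fit(X, y)
print(model.best_threshold_)
```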
ogrisel.bsky.social
Instead, you should probably keep the well-calibrated model and look at the influence of the decision threshold on your precision-recall trade-off. The default cut-off is 0.5 in scikit-learn, but that value is not necessarily the meaningful one for turning predicted probabilities into operational decisions.
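A rough sketch of inspecting that trade-off and then pinning a chosen cut-off with FixedThresholdClassifier (scikit-learn 1.5 or later); the 0.2 threshold is an arbitrary example value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import FixedThresholdClassifier, train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Inspect how precision and recall move with the cut-off instead of
# resampling the training data.
precision, recall, thresholds = precision_recall_curve(
    y_test, clf.predict_proba(X_test)[:, 1]
)

# Once a suitable cut-off is chosen, wrap the calibrated model so that
# predict() applies it instead of the default 0.5.
hard_classifier = FixedThresholdClassifier(clf, threshold=0.2).fit(X_train, y_train)
```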
ogrisel.bsky.social
Spoiler: rebalancing the training data is rarely the correct fix. You will break probabilistic calibration and can no longer relate the predicted class probabilities to your deployment setting.
ogrisel.bsky.social
Today at #EuroScipy2025, @glemaitre58.bsky.social and I presented a tutorial on pitfalls of machine learning for imbalanced classification problems.

We discussed what (not) to do when fitting a classifier and obtaining degenerate precision or recall values.

probabl-ai.github.io/calibration-...
Imbalanced classification: pitfalls and solutions — Probabilistic calibration of cost-sensitive learning
probabl-ai.github.io
ogrisel.bsky.social
Attending the @skrub-data.bsky.social tutorial by @riccardocappuzzo.com and @glemaitre58.bsky.social at #EuroScipy2025. They introduce the new DataOps feature released in skrub 0.6.

Here is the repo with the material for the tutorial: github.com/skrub-data/E...
Photo of Riccardo presenting skrub DataOps in a lecture room to an audience of ~50 people.
ogrisel.bsky.social
It's an interesting new deep learning architecture that can be somewhat successfully trained to solve challenging reasoning tasks where other methods completely fail.
ogrisel.bsky.social
The paper gives no evidence that it's possible to pre-train HRM modules in an unsupervised way and then do transfer learning on other reasoning tasks.
ogrisel.bsky.social
But even the two inner recurrent modules are trained in a task-specific way. Even if you managed to train them on sudoku solving with a generic LLM as the input and output modules, I doubt the resulting model could correctly reason about anything other than sudoku problems.
ogrisel.bsky.social
Still, the fact that gradient-based learning can reach 40% and 5% success rates on ARC-AGI-1 and ARC-AGI-2 tasks, respectively, is quite impressive.
ogrisel.bsky.social
As a result, a pre-trained LLM is much more generic than a trained HRM model.