Alicia Curth
@aliciacurth.bsky.social
Machine Learner by day, 🦮 Statistician at ❤️
In search of statistical intuition for modern ML & simple explanations for complex things👀

Interested in the mysteries of modern ML, causality & all of stats. Opinions my own.
https://aliciacurth.github.io
Honestly hurts my feelings a little that I didn’t even make this list 🥲🥲
November 22, 2024 at 9:12 PM
This is what I came to this app for 🦮
November 21, 2024 at 4:56 PM
Thank you for sharing!! Sounds super interesting, so will definitely check it out :)
November 21, 2024 at 3:58 PM
Exactly this!! thank you 🤗
November 21, 2024 at 1:47 PM
Oh exciting! On which one? :)
November 21, 2024 at 10:21 AM
To be fair, it’s actually a really really good TLDR!! I’m honestly just a little scared this will end up on the wrong side of twitter now 😳
November 21, 2024 at 10:21 AM
Reposted by Alicia Curth
Now might be the worst possible point in time to admit that I don’t own a physical copy of the book myself (yet!! I’m actually building up a textbook bookshelf for myself) BUT because Hastie, Tibshirani & Friedman are the GOATs that they are, they made the pdf free: hastie.su.domains/ElemStatLearn/
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
hastie.su.domains
November 21, 2024 at 9:27 AM
Now continued below with case study 2: understanding performance differences of neural networks and gradient boosted trees on irregular tabular data!!
Part 2: Why do boosted trees outperform deep learning on tabular data??

@alanjeffares.bsky.social & I suspected that answers to this are obscured by the two being treated as very different algorithms 🤔

Instead we show they are more similar than you’d think — making their differences smaller, but predictive! 🧵1/n
November 20, 2024 at 9:12 PM
There’s one more case study & thoughts on the effect of design choices on function updates left— I’ll cover that in a final thread! (next week, giving us all a break😅)

Until then, find the paper here arxiv.org/abs/2411.00247

and/or recap part 1 of this thread below! 🤗 14/14
From double descent to grokking, deep learning sometimes works in unpredictable ways... or does it?

For NeurIPS (my final PhD paper!), @alanjeffares.bsky.social & I explored if & how smart linearisation can help us better understand & predict numerous odd deep learning phenomena — and learned a lot... 🧵1/n
November 20, 2024 at 5:02 PM
In conclusion, this 2nd case study showed that the telescoping approximation of a trained neural network can be a useful lens to investigate performance differences with other methods!

Here we used it to show how some performance diffs are predicted by specific model diffs (i.e. diffs in the implied kernels) 💡 13/n
November 20, 2024 at 5:02 PM
Importantly, this growth in performance gap is tracked by the behaviour of the models’ kernels:

while there is no difference in kernel weights for GBTs across different input irregularity levels, the neural net’s kernel weights for the most irregular examples grow more extreme! 12/n
November 20, 2024 at 5:02 PM
We test this hypothesis by varying the proportion of irregular inputs in the test set for fixed trained models.

We find that GBTs outperform NNs already in the absence of irregular examples; this speaks to a difference in baseline suitability.

The performance gap then indeed grows as we increase irregularity! 11/n
November 20, 2024 at 5:02 PM
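A minimal sketch of what such an evaluation loop could look like: fixed trained models, then test sets with a growing share of irregular inputs. The scikit-learn models, the synthetic data, and the distance-based irregularity criterion are illustrative stand-ins, not the paper's actual setup:

```python
# Sketch only: the data here is pure noise and the "irregularity" criterion is
# made up; the point is the evaluation loop (fixed models, varying test mixes).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 10)), rng.normal(size=2000)
X_test, y_test = rng.normal(size=(1000, 10)), rng.normal(size=1000)

# Hypothetical irregularity criterion: test points far from the training mean.
dist = np.linalg.norm(X_test - X_train.mean(axis=0), axis=1)
irregular = dist > np.quantile(dist, 0.8)

# Models are trained once and then held fixed across all test mixtures.
gbt = GradientBoostingRegressor().fit(X_train, y_train)
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X_train, y_train)

for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    n_irr = int(frac * 200)
    idx = np.concatenate([
        rng.choice(np.where(irregular)[0], n_irr, replace=False),
        rng.choice(np.where(~irregular)[0], 200 - n_irr, replace=False),
    ])
    mse_gbt = mean_squared_error(y_test[idx], gbt.predict(X_test[idx]))
    mse_nn = mean_squared_error(y_test[idx], nn.predict(X_test[idx]))
    print(f"irregular share {frac:.2f}: GBT MSE {mse_gbt:.3f} vs NN MSE {mse_nn:.3f}")
```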
This highlights a potential explanation for why GBTs outperform neural nets on tabular data in the presence of input irregularities:

The kernels implied by the neural network might behave much much more unpredictably for test inputs different to inputs observed at train time! 💡🤔10/n
November 20, 2024 at 5:02 PM
Trees issue preds that are proper averages: all kernel weights are between 0 & 1. That is: trees never “extrapolate” outside the convex hull of training observations 💡

Neural net tangent kernels OTOH are generally unbounded and could take on very different values for unseen test inputs! 😰 9/n
November 20, 2024 at 5:02 PM
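A quick sanity-check sketch of the "trees as adaptive kernel smoothers" view from the post above, using a single scikit-learn regression tree rather than the paper's code: the leaf co-membership weights are explicitly in [0, 1], sum to 1, and reproduce the tree's predictions exactly.

```python
# Sketch: a regression tree's prediction is a weighted average of training
# targets, with kernel weights in [0, 1] that sum to 1 per test point.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.normal(size=500)
X_test = 5.0 * rng.normal(size=(10, 5))   # deliberately "irregular" test points

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)

train_leaves = tree.apply(X)       # leaf id of each training point
test_leaves = tree.apply(X_test)   # leaf id of each test point

# Kernel weight of training point i for test point x:
# 1/|leaf(x)| if they land in the same leaf, 0 otherwise.
W = (test_leaves[:, None] == train_leaves[None, :]).astype(float)
W /= W.sum(axis=1, keepdims=True)

print(W.min() >= 0, W.max() <= 1)                 # weights stay in [0, 1]
print(np.allclose(W @ y, tree.predict(X_test)))   # True: preds are weighted averages
```

Even for the wildly out-of-range test points above, the weights cannot leave [0, 1]; the tree just routes each point to some existing leaf.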
One diff is obvious and purely architectural: either kernel might be able to better fit a particular underlying outcome generating process!

A second diff is a lot more subtle and relates to how regular (or: predictable) the two will likely behave on new data: … 8/n
November 20, 2024 at 5:02 PM
but WAIT A MINUTE — isn’t that literally the same formula as the kernel representation of the telescoping model of a trained neural network I showed you before?? Just with a different kernel??

Surely this diff in kernel must account for at least some of the observed performance differences… 🤔7/n
November 20, 2024 at 5:02 PM
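Schematically, the two kernel-smoother forms being compared here can be written side by side as below. This is a shorthand reconstruction (full-batch updates, a single learning rate η, per-example loss gradients g), so see the paper for the precise statement:

```latex
% Both: prediction = initial prediction minus a kernel-weighted sum of
% training loss gradients; only the kernel differs. (Schematic notation.)
\begin{align*}
  \text{GBT:} \qquad
    \hat{f}_B(x) &= \hat{f}_0(x) - \eta \sum_{b=1}^{B} \sum_{i=1}^{n}
      K_b(x, x_i)\, g_i^{(b)},
    &&g_i^{(b)} = \frac{\partial \ell\big(y_i, \hat{f}_{b-1}(x_i)\big)}{\partial \hat{f}_{b-1}(x_i)} \\
  \text{Telescoped NN:} \qquad
    \hat{f}_T(x) &\approx \hat{f}_0(x) - \eta \sum_{t=1}^{T} \sum_{i=1}^{n}
      K_t(x, x_i)\, g_i^{(t)},
    &&K_t(x, x_i) = \nabla_\theta \hat{f}_{t-1}(x)^{\top} \nabla_\theta \hat{f}_{t-1}(x_i)
\end{align*}
% K_b: the b-th tree's leaf co-membership kernel; K_t: the tangent kernel at step t.
```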
Gradient boosted trees (aka OG gradient boosting) simply implement this process using trees!

From our previous work on random forests (arxiv.org/abs/2402.01502) we know we can interpret trees as adaptive kernel smoothers, so we can rewrite the GBT preds as weighted avgs over training loss grads! 6/n
November 20, 2024 at 5:02 PM
Quick refresher: what is gradient boosting?

Not to be confused with other forms of boosting (e.g. AdaBoost), *Gradient* boosting fits a sequence of weak learners that execute steepest descent in function space directly, by learning to predict the (negative) loss gradients at the training examples! 5/n
November 20, 2024 at 5:02 PM
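For readers who want the refresher in code too: a toy from-scratch sketch of gradient boosting for squared loss (where the negative loss gradient is just the residual y - f), using scikit-learn trees as weak learners. Hyperparameters and data are illustrative, not the paper's:

```python
# Sketch: gradient boosting for squared loss. Each round fits a small tree to
# the negative loss gradients (= residuals) and takes a step of size lr in
# function space.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, lr=0.1, max_depth=3):
    f0 = y.mean()                           # constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - pred                 # -d/df of (1/2)(y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        trees.append(tree)
        pred = pred + lr * tree.predict(X)  # steepest-descent step in function space
    return f0, trees

def predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)
f0, trees = fit_gradient_boosting(X, y)
print(np.mean((predict(f0, trees, X) - y) ** 2))  # training MSE after 100 rounds
```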
In arxiv.org/abs/2411.00247 we ask: why? What distinguishes gradient boosted trees from deep learning that would explain this?

A first reaction might be “they are SO different idk where to start 😭” — BUT we show that through the telescoping lens (see part 1 of this 🧵⬇️) things become much clearer.. 4/n
We exploit that you can express preds of a trained network as a telescoping sum over all training steps💡

For a single step changes are small, so we can better linearly approximate individual steps sequentially (instead of the whole trajectory at once)

➡️ We refer to this as ✨telescoping approx✨! 6/n
November 20, 2024 at 5:02 PM
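In symbols, the telescoping idea from the quoted post reads roughly as follows (schematic notation with θ_t the network parameters after training step t; the paper states this more carefully):

```latex
% Trained prediction = initial prediction + sum of per-step changes; each
% small step is then linearised separately rather than the whole trajectory.
\begin{align*}
  \hat{f}_T(x)
    &= \hat{f}_0(x) + \sum_{t=1}^{T} \big( \hat{f}_t(x) - \hat{f}_{t-1}(x) \big) \\
    &\approx \hat{f}_0(x) + \sum_{t=1}^{T}
        \nabla_\theta \hat{f}_{t-1}(x)^{\top} \, (\theta_t - \theta_{t-1})
\end{align*}
```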
And you know who continues to rule the tabular benchmarks? Gradient boosted trees (GBTs)!! (or their descendants)

While the severity of the perf gap over neural nets is disputed, arxiv.org/abs/2305.02997 still found as recently as last year that GBTs esp outperform when data is irregular! 3/n
November 20, 2024 at 5:02 PM
First things first, why do we care about tabular?

Deep learning sometimes seems to forget that data comes in formats other than text or images (😉) BUT in data science applications — from medicine to marketing and econ — tabular data still rules big parts of the world!!
2/n
November 20, 2024 at 5:02 PM