@alanjeffares.bsky.social & I suspected that answers to this are obscured by the two being treated as very different algorithms 🤔
Instead, we show they are more similar than you'd think, which narrows their differences down to a few that are actually predictive of performance! 🧵1/n
Until then, find the paper here arxiv.org/abs/2411.00247
and/or recap part 1 of this thread below! 🤗 14/14
For NeurIPS (my final PhD paper!), @alanjeffares.bsky.social & I explored if & how smart linearisation can help us better understand & predict numerous odd deep learning phenomena, and we learned a lot... 🧵1/n
Here we used it to show how some performance differences are predicted by specific model differences (i.e. differences in the implied kernels) 💡13/n
While there is no difference in the GBTs' kernel weights across input-irregularity levels, the neural net's kernel weights for the most irregular examples grow more extreme! 12/n
We find that GBTs outperform NNs even in the absence of irregular examples; this speaks to a difference in baseline suitability.
The performance gap then indeed grows as we increase irregularity! 11/n
The kernels implied by the neural network might behave much more unpredictably for test inputs that differ from the inputs observed at train time! 💡🤔10/n
Tree kernel weights are bounded between 0 and 1 by construction; neural net tangent kernels, on the other hand, are generally unbounded and could take on very different values for unseen test inputs! 😰 9/n
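To see the contrast concretely (notation mine, not necessarily the paper's): in the linearised view, the neural net weighs training examples through a tangent kernel that is an unconstrained inner product of gradients, while a tree's kernel weight is a normalised leaf-membership indicator:

```latex
k^{\mathrm{NN}}_t(x, x_i) \;=\; \nabla_\theta f_{\theta_t}(x)^{\top} \nabla_\theta f_{\theta_t}(x_i) \;\in\; \mathbb{R},
\qquad
k^{\mathrm{tree}}_b(x, x_i) \;=\; \frac{\mathbb{1}\{x_i \in \mathrm{leaf}_b(x)\}}{|\mathrm{leaf}_b(x)|} \;\in\; [0, 1]
```

The former can blow up for a test input far from the training data; the latter cannot.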
A second difference is a lot more subtle and relates to how regularly (or: predictably) the two are likely to behave on new data: … 8/n
Surely this difference in kernels must account for at least some of the observed performance differences… 🤔7/n
From our previous work on random forests (arxiv.org/abs/2402.01502) we know we can interpret trees as adaptive kernel smoothers, so we can rewrite GBT predictions as weighted averages over the training examples' loss gradients! 6/n
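Roughly, and in my own notation rather than necessarily the paper's: if k_b(x, x_i) is the weight tree b assigns to training example x_i when predicting at x (e.g. 1/|leaf_b(x)| if x_i falls into the same leaf as x, and 0 otherwise), then a GBT with learning rate η after B rounds reads

```latex
\hat f_B(x) \;=\; \hat f_0(x) \;-\; \eta \sum_{b=1}^{B} \sum_{i=1}^{n} k_b(x, x_i)\, g_{b,i},
\qquad
g_{b,i} \;=\; \left.\frac{\partial \ell(y_i, f)}{\partial f}\right|_{f = \hat f_{b-1}(x_i)}
```

i.e. every boosting round adds a kernel-weighted average of the current training-loss gradients.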
Not to be confused with other forms of boosting (e.g. AdaBoost), *gradient* boosting fits a sequence of weak learners that executes steepest descent in function space directly, by learning to predict the loss gradients of the training examples! 5/n
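To make this concrete, here is a minimal squared-error gradient boosting sketch (my own toy code, not the paper's implementation); `fit_gbt`, `predict_gbt` and the hyperparameters are purely illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbt(X, y, n_rounds=100, lr=0.1, max_depth=3):
    """Toy gradient boosting for squared error: each round fits a small
    tree to the negative loss gradients (for squared error, the residuals)."""
    f0 = float(np.mean(y))              # f_0: constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - pred             # -dL/df for L = 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        pred = pred + lr * tree.predict(X)   # one steepest-descent step in function space
        trees.append(tree)
    return f0, trees


def predict_gbt(f0, trees, X, lr=0.1):
    """Prediction = initial guess + sum of (scaled) weak-learner outputs."""
    return f0 + lr * sum(tree.predict(X) for tree in trees)
```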
A first reaction might be “they are SO different, I don't know where to start 😭”, BUT we show that through the telescoping lens (see part 1 of this 🧵⬇️) things become much clearer... 4/n
For a single step, changes are small, so we can better linearly approximate individual steps sequentially (instead of the whole trajectory at once)
➡️ We refer to this as the ✨telescoping approximation✨! 6/n
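In symbols (my paraphrase of the idea, with θ_t denoting the parameters after training step t): rather than linearising the whole trajectory around θ_0 at once, linearise each step around the previous iterate and sum, i.e. telescope, the pieces:

```latex
f_{\theta_T}(x)
\;=\; f_{\theta_0}(x) + \sum_{t=1}^{T} \big( f_{\theta_t}(x) - f_{\theta_{t-1}}(x) \big)
\;\approx\; f_{\theta_0}(x) + \sum_{t=1}^{T} \nabla_\theta f_{\theta_{t-1}}(x)^{\top} (\theta_t - \theta_{t-1})
```

Each first-order term only has to be accurate over one small parameter update, which is what makes this better behaved than linearising everything around initialisation.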
While the severity of GBTs' performance gap over neural nets is disputed, arxiv.org/abs/2305.02997 still found, as recently as last year, that GBTs especially outperform when the data is irregular! 3/n
Deep learning sometimes seems to forget that we used to work with data formats other than text and images (😉), BUT in data science applications, from medicine to marketing and economics, tabular data still rules big parts of the world!!
2/n