scottjmaddox.bsky.social
@scottjmaddox.bsky.social
Gotcha. Yeah, if you're willing to spend the extra training $$$, ALiBi is supposed to length generalize much better, although I didn't verify that.
November 26, 2024 at 10:30 PM
With the new flex_attention baseline, ALiBi is actually slightly faster in wall-clock time, but the loss is considerably higher.
November 26, 2024 at 9:42 PM
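For context, a minimal sketch of how an ALiBi bias can be expressed as a flex_attention score_mod in PyTorch, assuming a causal block mask and the paper's per-head slopes; the shapes, head count, and dtype are illustrative assumptions, not the actual speed-run setup.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative shapes, not the actual speed-run config.
B, H, S, D = 4, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Per-head ALiBi slopes: the geometric sequence of m factors from the ALiBi paper.
slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / H) for h in range(H)], device="cuda")

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # Penalize each score in proportion to query/key distance, scaled per head.
    # Under the causal mask kv_idx <= q_idx, so the added term is <= 0.
    return score + slopes[h] * (kv_idx - q_idx)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, score_mod=alibi_score_mod, block_mask=block_mask)
```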
What's your plan for eliminating tokenization? A hierarchical architecture seems to be the bare minimum. And I suspect some form of latent loss is needed, so that rewording isn't heavily penalized.
November 26, 2024 at 9:39 PM
ALiBi definitely seems nicer than RoPE, but it doesn't perform as well. Not in the nanoGPT speed run, using the per-head m factors from the paper, anyway.
November 26, 2024 at 9:33 PM
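For reference, the per-head m factors from the ALiBi paper form a geometric sequence, 2^(-8h/n) for head h of n heads (so 1/2, 1/4, ..., 1/256 with 8 heads); a quick sketch, assuming a power-of-two head count as in the paper.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    # Geometric sequence from the ALiBi paper: m_h = 2 ** (-8 * h / num_heads),
    # for h = 1..num_heads. Assumes num_heads is a power of two, as in the paper.
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

print(alibi_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```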