scottjmaddox.bsky.social
@scottjmaddox.bsky.social
Gotcha. Yeah, if you're willing to spend the extra training $$$, ALiBi is supposed to length generalize much better, although I didn't verify that.
November 26, 2024 at 10:30 PM
With the new flex_attention baseline, ALiBi is actually slightly faster in wall-clock time, but the loss is considerably higher.
November 26, 2024 at 9:42 PM
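For context, a minimal sketch of how an ALiBi bias can be expressed as a flex_attention score_mod in PyTorch, assuming a causal block mask and the paper's per-head slopes; the shapes, head count, and dtype are illustrative assumptions, not the actual speed-run setup.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative shapes, not the actual speed-run config.
B, H, S, D = 4, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Per-head ALiBi slopes: the geometric sequence of m factors from the ALiBi paper.
slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / H) for h in range(H)], device="cuda")

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # Penalize each score in proportion to query/key distance, scaled per head.
    # Under the causal mask kv_idx <= q_idx, so the added term is <= 0.
    return score + slopes[h] * (kv_idx - q_idx)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, score_mod=alibi_score_mod, block_mask=block_mask)
```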
What's your plan for eliminating tokenization? A hierarchical architecture seems to be the bare minimum. And I suspect some form of latent loss is needed, so that rewording isn't heavily penalized.
November 26, 2024 at 9:39 PM
ALiBi definitely seems nicer than RoPE, but it doesn't perform as well. Not in the nanoGPT speed run, using the per-head m factors from the paper, anyway.
November 26, 2024 at 9:33 PM
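For reference, the per-head m factors from the ALiBi paper form a geometric sequence, 2^(-8h/n) for head h of n heads (so 1/2, 1/4, ..., 1/256 with 8 heads); a quick sketch, assuming a power-of-two head count as in the paper.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    # Geometric sequence from the ALiBi paper: m_h = 2 ** (-8 * h / num_heads),
    # for h = 1..num_heads. Assumes num_heads is a power of two, as in the paper.
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

print(alibi_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```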