https://scholar.google.com/citations?user=I80vy5cAAAAJ
Since the dawn of time, people have been messing with (or dropping entirely) these pesky time-dependent loss scaling terms, mostly because the models train better without them.
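For anyone curious what that looks like concretely, here is a minimal sketch, assuming the post refers to the per-timestep weights that diffusion-style objectives prescribe (as in the variational bound), where the popular "simple" loss of Ho et al. 2020 just sets the weight to 1. The function name and toy weight are illustrative, not from the thread:

```python
import torch

def diffusion_loss(eps_pred, eps_true, t, weight_fn=None):
    """Noise-prediction loss with an optional time-dependent scaling term.

    weight_fn=None drops the weighting entirely, reproducing the widely
    used "simple" objective the post alludes to.
    """
    # Per-sample squared error between predicted and true noise
    per_sample = ((eps_pred - eps_true) ** 2).flatten(1).mean(dim=1)
    if weight_fn is not None:
        # Time-dependent scaling term, e.g. an SNR-derived weight
        per_sample = per_sample * weight_fn(t)
    return per_sample.mean()

# toy usage: batch of 4, data dim 8
eps = torch.randn(4, 8)
pred = torch.randn(4, 8)
t = torch.randint(0, 1000, (4,))
print(diffusion_loss(pred, eps, t))                                      # unweighted
print(diffusion_loss(pred, eps, t, weight_fn=lambda t: 1.0 / (t + 1)))   # toy weight
```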
bsky.app/profile/benj...
This was a team effort from a few people in my lab, including @antonoresten.bsky.social and others (not sure who is on this app)
These large values are where RoPE has the slowest(?) effect. Why?
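One way to see which dimensions those are: in the standard RoPE schedule (Su et al.), pair i of each head rotates at angular frequency theta_i = base^(-2i/d), so the high-index pairs barely rotate at all. A minimal sketch, assuming "large values" means activations sitting in those low-frequency pairs:

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE: pair i rotates with angular frequency
    # theta_i = base^(-2i / head_dim); high-index pairs rotate slowest.
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return base ** (-i / head_dim)

freqs = rope_frequencies(128)
print(freqs[0].item())   # 1.0      -> fastest pair, one radian per position
print(freqs[-1].item())  # ~1.2e-4  -> slowest pair, barely moves per position
```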