We show that floating-point errors in the Softmax play a surprising role in grokking, explaining, among other things, why weight decay seems necessary for grokking in most cases!
🧵