Our experiments on Llama-3-70B show that FP8 significantly boosts training throughput (415 → 570 TFLOP/s) but induces loss spikes, leading to downstream performance drops. FP8 isn't always the best choice—it depends! (1/n)
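For context, here is a minimal sketch of how FP8 training is commonly enabled, using NVIDIA Transformer Engine. The library choice and recipe values are illustrative assumptions, not our exact setup:

```python
# Minimal sketch: FP8 matmuls via NVIDIA Transformer Engine.
# Recipe values below are illustrative assumptions, not our exact config.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID = E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,      # window of past amax values for scaling
    amax_compute_algo="max",  # take the max over the history window
)

layer = te.Linear(8192, 8192, bias=False).cuda()
x = torch.randn(4096, 8192, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)       # GEMM runs in FP8 with per-tensor scaling
y.sum().backward()     # backward GEMMs use FP8 as well (E5M2 grads)
```

The delayed-scaling recipe is one source of the loss spikes we mention: scale factors are estimated from a history window, so a sudden activation outlier can overflow before the scale catches up.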