Outperforms all models at similar GPU RAM usage and token throughput
Blog post: huggingface.co/blog/smolvlm
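For context, here is a minimal inference sketch using the transformers library. The checkpoint name HuggingFaceTB/SmolVLM-Instruct and the local image path are assumptions for illustration, not details taken from the post above; see the blog post for the canonical usage.

```python
# Hedged sketch: load SmolVLM and caption a local image with transformers.
# The checkpoint name and image path below are assumptions for illustration.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("example.jpg")  # placeholder local image

# Build a chat-style prompt with one image turn and one text turn.
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```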
Our experiments on Llama-3-70B show that FP8 significantly boosts training throughput (415 → 570 TFLOP/s) but induces loss spikes, leading to downstream performance drops. FP8 isn't always the best choice—it depends! (1/n)
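The thread does not say which FP8 stack was used in these runs; the sketch below uses NVIDIA Transformer Engine only to illustrate the kind of FP8 mixed-precision setup the throughput/stability trade-off refers to. Layer sizes, the optimizer, and the scaling-recipe settings are illustrative assumptions.

```python
# Hedged sketch: one FP8 forward/backward step with NVIDIA Transformer Engine.
# Illustrative only; not the training stack used in the experiments above.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID uses E4M3 for forward tensors and E5M2 for gradients, a common
# default intended to limit the kind of loss spikes mentioned above.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()   # assumed layer size
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# Matmuls inside this context run in FP8; master weights and optimizer
# state remain in higher precision.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

loss = y.float().pow(2).mean()
loss.backward()
optimizer.step()
```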