garreth
@garrethlee.bsky.social
🇮🇩 | Co-Founder at Mundo AI (YC W25) | ex-{Hugging Face, Cohere}
All this history is nice, but which method actually performs best for math?

Read our latest blog to find out:
huggingface.co/spaces/huggi...

[6/N]
Number Tokenization Blog - a Hugging Face Space by huggingface
December 16, 2024 at 5:31 PM
Rumor has it that earlier Claude models used a modified three-digit tokenization, processing numbers right-to-left instead of left-to-right.

This method mirrors how we often read and interpret numbers, like grouping digits with commas. Theoretically, this should help with math reasoning!
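
A minimal sketch of that idea (not Claude's actual tokenizer, which isn't public): chunk a digit string into groups of up to three starting from the right, so the group boundaries line up with where commas would go.

```python
# Sketch of right-to-left three-digit grouping (the rumored approach):
# chunk boundaries then match comma placement, e.g. 1,234,567.
def chunk_digits_r2l(digits: str) -> list[str]:
    chunks = []
    # Walk from the end of the number, peeling off up to three digits at a time.
    for end in range(len(digits), 0, -3):
        chunks.append(digits[max(0, end - 3):end])
    return chunks[::-1]

print(chunk_digits_r2l("1234567"))  # ['1', '234', '567'] -- matches 1,234,567
```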

[5/N]
December 16, 2024 at 5:31 PM
Alas, tokenizing numbers as digits was costly:

A 10-digit number now took 10 tokens instead of 3-4, roughly 2-3x more than before. That's a significant hit on training & inference costs!

LLaMA 3 fixed this by grouping digits into threes, balancing compression and consistency.
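
A rough sketch of the trade-off, assuming a simple left-to-right grouping of up to three digits per chunk (illustrative only, not the actual LLaMA 3 tokenizer):

```python
import re

# Group runs of digits into chunks of up to three, left to right,
# as a pre-tokenization step; non-digit text passes through untouched.
def chunk_digits_l2r(text: str) -> list[str]:
    return re.findall(r"\d{1,3}|\D+", text)

number = "9876543210"             # a 10-digit number
print(list(number))               # digit-level: 10 tokens
print(chunk_digits_l2r(number))   # ['987', '654', '321', '0'] -> 4 tokens, close to typical BPE counts
```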

[4/N]
December 16, 2024 at 5:31 PM
Then came LLaMA 1, which took a clever approach to fix number inconsistencies: it tokenized numbers into individual digits (0-9), meaning any number, however large, could be represented using just those 10 digit tokens.

The consistent representation of numbers made mathematical reasoning much better!
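
A toy illustration of digit-level pre-tokenization (the real LLaMA 1 tokenizer is SentencePiece-based; this only shows the splitting idea):

```python
import re

# Split text into runs of non-digits and single digits, so every digit
# becomes its own token and the ten symbols "0"-"9" cover all numbers.
def split_digits(text: str) -> list[str]:
    return re.findall(r"\d|\D+", text)

print(split_digits("Pi is 3.14159"))
# ['Pi is ', '3', '.', '1', '4', '1', '5', '9']
```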

[3/N]
December 16, 2024 at 5:31 PM
When GPT-2 came out in 2019, its tokenizer used byte-pair encoding (BPE), still common today:

• Merges frequent substrings into single tokens, yielding much shorter sequences than character-level input
• However, the vocabulary depends on the training data
• Common numbers (e.g., 1999) get single tokens; others are split into arbitrary pieces (see the sketch below)
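
You can inspect this yourself with the GPT-2 encoding in the tiktoken library; the exact splits depend on the learned vocabulary, so the counts below are illustrative rather than guaranteed.

```python
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["1999", "2024", "1234567890"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```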

[2/N]
December 16, 2024 at 5:31 PM