garreth
@garrethlee.bsky.social
🇮🇩 | Co-Founder at Mundo AI (YC W25) | ex-{Hugging Face, Cohere}
All this history is nice, but which method actually performs best for math?

Read our latest blog to find out:
huggingface.co/spaces/huggi...

[6/N]
Number Tokenization Blog - a Hugging Face Space by huggingface
December 16, 2024 at 5:31 PM
Rumor has it that earlier Claude models used a modified three-digit tokenization, processing numbers right-to-left instead of left-to-right.

This method mirrors how we often read and interpret numbers, like grouping digits with commas. Theoretically, this should help with math reasoning!
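
A minimal sketch of that idea (not Claude's actual tokenizer, which isn't public): chunk a digit string into groups of up to three starting from the right, so the group boundaries line up with where commas would go.

```python
# Sketch of right-to-left three-digit grouping (the rumored approach):
# chunk boundaries then match comma placement, e.g. 1,234,567.
def chunk_digits_r2l(digits: str) -> list[str]:
    chunks = []
    # Walk from the end of the number, peeling off up to three digits at a time.
    for end in range(len(digits), 0, -3):
        chunks.append(digits[max(0, end - 3):end])
    return chunks[::-1]

print(chunk_digits_r2l("1234567"))  # ['1', '234', '567'] -- matches 1,234,567
```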

[5/N]
December 16, 2024 at 5:31 PM
Alas, tokenizing numbers as digits was costly:

A 10-digit number now took 10 tokens instead of 3-4, roughly 2-3x more than before. That's a significant hit on training & inference costs!

LLaMA 3 fixed this by grouping digits into threes, balancing compression and consistency.
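
A rough sketch of the trade-off, assuming a simple left-to-right grouping of up to three digits per chunk (illustrative only, not the actual LLaMA 3 tokenizer):

```python
import re

# Group runs of digits into chunks of up to three, left to right,
# as a pre-tokenization step; non-digit text passes through untouched.
def chunk_digits_l2r(text: str) -> list[str]:
    return re.findall(r"\d{1,3}|\D+", text)

number = "9876543210"             # a 10-digit number
print(list(number))               # digit-level: 10 tokens
print(chunk_digits_l2r(number))   # ['987', '654', '321', '0'] -> 4 tokens, close to typical BPE counts
```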

[4/N]
December 16, 2024 at 5:31 PM
Then came LLaMA 1, which took a clever approach to fix number inconsistencies: it tokenized numbers into individual digits (0-9), meaning any number, however large, could be represented using just those 10 digit tokens.

The consistent representation of numbers made mathematical reasoning much better!
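
A toy illustration of digit-level pre-tokenization (the real LLaMA 1 tokenizer is SentencePiece-based; this only shows the splitting idea):

```python
import re

# Split text into runs of non-digits and single digits, so every digit
# becomes its own token and the ten symbols "0"-"9" cover all numbers.
def split_digits(text: str) -> list[str]:
    return re.findall(r"\d|\D+", text)

print(split_digits("Pi is 3.14159"))
# ['Pi is ', '3', '.', '1', '4', '1', '5', '9']
```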

[3/N]
December 16, 2024 at 5:31 PM
When GPT-2 came out in 2019, its tokenizer used byte-pair encoding (BPE), still common today:

• Merges frequent substrings into single tokens, yielding much shorter sequences than character-level input
• However, the vocabulary depends on the training data
• Common numbers (e.g., 1999) get single tokens; others are split into arbitrary pieces (see the sketch below)
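
You can inspect this yourself with the GPT-2 encoding in the tiktoken library; the exact splits depend on the learned vocabulary, so the counts below are illustrative rather than guaranteed.

```python
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["1999", "2024", "1234567890"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```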

[2/N]
December 16, 2024 at 5:31 PM