ikyle.bsky.social
@ikyle.bsky.social
Reposted
100,277 tokens – The vocabulary size of GPT-4’s tokenizer (cl100k_base), illustrating the scale of modern tokenizer vocabularies.
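
As a quick sanity check on that number, here is a minimal sketch using the tiktoken library (my own illustration, not from the post; it assumes tiktoken is installed and that cl100k_base is the encoding GPT-4 uses):

# Sketch: confirm the GPT-4 tokenizer vocabulary size with tiktoken.
# Assumes `pip install tiktoken`; cl100k_base is the encoding used by GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # 100277, counting regular and special tokens
print(enc.encode("Tokenization is harder than it looks."))  # token ids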

$40,000 (2019) → $100 (2024) – The drop in the cost of training a GPT-2-equivalent model, reflecting gains in hardware and training efficiency.
February 6, 2025 at 2:28 AM
Reposted
Paper: Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages, by one of the authors, @jannikbrinkmann.bsky.social (arxiv.org/abs/2501.06346)

Repo: github.com/jannik-brink...

Post on X: x.com/jannikbrinkm...
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...
arxiv.org
February 6, 2025 at 7:19 AM
Reposted
Llama2’s internal “lingua franca” is not English, but concepts – and, crucially, concepts that are biased toward English. Hence, English could still be seen as an “internal language”, but in a semantic rather than a purely lexical sense.
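
This kind of finding comes from decoding intermediate layers with a logit-lens-style probe; below is a minimal sketch of that idea (my own illustration, not the authors’ code; it assumes torch, transformers, and access to a Llama-2 checkpoint, and the checkpoint name and prompt are illustrative):

# Logit-lens sketch: decode each layer's hidden state through the final norm
# and unembedding to see which tokens the model favours mid-computation.
# Assumes: pip install torch transformers, plus access to meta-llama/Llama-2-7b-hf.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.eval()

prompt = 'Français: "fleur" – Deutsch: "'  # translation prompt, French to German
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1]))  # project to vocabulary
    print(layer, repr(tok.decode(logits.argmax(-1))))
# Middle layers tend to put high probability on the English word ("flower")
# before the last layers settle on the target-language token.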

Paper: arxiv.org/abs/2402.10588

Post on X: x.com/cervisiarius...
February 6, 2025 at 7:26 AM