ikyle.bsky.social
@ikyle.bsky.social
Reposted
100,277 tokens – The vocabulary size of GPT-4’s tokenizer (cl100k_base), illustrating the scale of modern tokenizer vocabularies.
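
As a quick sanity check on that number, here is a minimal sketch using the tiktoken library (my own illustration, not from the post; it assumes tiktoken is installed and that cl100k_base is the encoding GPT-4 uses):

# Sketch: confirm the GPT-4 tokenizer vocabulary size with tiktoken.
# Assumes `pip install tiktoken`; cl100k_base is the encoding used by GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # 100277, counting regular and special tokens
print(enc.encode("Tokenization is harder than it looks."))  # token ids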

$40,000 (2019) → $100 (2024) – The drop in the cost of training a GPT-2-equivalent model, reflecting gains in hardware and training efficiency.
February 6, 2025 at 2:28 AM
Reposted
Paper: Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages, by one of the authors, @jannikbrinkmann.bsky.social (arxiv.org/abs/2501.06346)

Repo: github.com/jannik-brink...

Post on X: x.com/jannikbrinkm...
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...
arxiv.org
February 6, 2025 at 7:19 AM
Reposted
Llama2’s internal “lingua franca” is not English, but concepts – and, crucially, concepts that are biased toward English. Hence, English could still be seen as an “internal language”, but in a semantic rather than a purely lexical sense.
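
This kind of finding comes from decoding intermediate layers with a logit-lens-style probe; below is a minimal sketch of that idea (my own illustration, not the authors’ code; it assumes torch, transformers, and access to a Llama-2 checkpoint, and the checkpoint name and prompt are illustrative):

# Logit-lens sketch: decode each layer's hidden state through the final norm
# and unembedding to see which tokens the model favours mid-computation.
# Assumes: pip install torch transformers, plus access to meta-llama/Llama-2-7b-hf.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.eval()

prompt = 'Français: "fleur" – Deutsch: "'  # translation prompt, French to German
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1]))  # project to vocabulary
    print(layer, repr(tok.decode(logits.argmax(-1))))
# Middle layers tend to put high probability on the English word ("flower")
# before the last layers settle on the target-language token.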

Paper: arxiv.org/abs/2402.10588

Post on X: x.com/cervisiarius...
February 6, 2025 at 7:26 AM