LLM360
banner
llm360.bsky.social
LLM360
@llm360.bsky.social
Working on fully open-source LLMs and training data. We believe in community-owned AI.

https://www.llm360.ai
Building on FineWeb’s global deduplication findings, we introduce a strategic upsampling recipe which outperforms FineWeb using TxT360. Full details are in the Upsampling Experiment section of the release blog.
November 19, 2024 at 10:51 PM
🪟🛠️LLM360 is committed to making open source AI accessible, transparent, and reproducible.

High-quality data is the first step toward better open source models...and we are excited to join the party contributing the first globally deduplicated dataset containing 5.7T tokens!
November 19, 2024 at 10:51 PM
📢📢 Check out:

TxT360: a globally deduplicated dataset for LLM pretraining

🌐 99 Common Crawls
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models

Dataset:
huggingface.co/datasets/LLM...

Blog:
llm360-txt360.hf.space
November 19, 2024 at 10:51 PM