- Using compute-optimal test-time scaling, a Llama 3 3B model outperforms a 70B model (22x larger) on mathematical reasoning tasks
- Different search strategies work better for different problem difficulties - beam search for harder problems, Best-of-N for simpler ones
- Explored Best-of-N sampling, beam search, and Diverse Verifier Tree Search (DVTS); a Best-of-N sketch follows this list
- Llama 3 1B achieved 55% accuracy on the MATH benchmark using optimal search strategies
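Best-of-N is simple to reproduce. Below is a minimal sketch assuming a transformers causal LM; the checkpoint id is illustrative, and `score_fn` is a trivial placeholder for the process reward model (PRM) that scores candidates in the actual experiments.

```python
# Minimal Best-of-N sketch: sample n candidates, keep the best-scoring one.
# Assumptions: the checkpoint id is illustrative, and score_fn is a trivial
# placeholder for the process reward model used in the real experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def best_of_n(prompt: str, n: int = 8, score_fn=None) -> str:
    """Sample n completions and return the one the scorer ranks highest."""
    inputs = tok(prompt, return_tensors="pt")
    out = lm.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=512,
        num_return_sequences=n,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [
        tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in out
    ]
    score_fn = score_fn or (lambda text: -len(text))  # placeholder scorer
    return max(candidates, key=score_fn)
```

Beam search and DVTS differ mainly in using the verifier to score partial solutions step by step rather than only complete answers.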
OpenAI trained a new Turbo model to make it easier and faster to use. With "storyboards", users get a CapCut/TikTok/Reels-like text-to-video editor that can be used to edit and create new short-form content! Social media will be flooded. 🌊
🔓 Released under Apache 2.0 on @huggingface.bsky.social
📱 Can run efficiently on laptops and edge devices (quick-start sketch after this list)
🛠️ Released in 3 variants: Base, Synthetic, and Instruct
💾 Requires only 5GB GPU RAM and achieves 38.8% on MMMU, 81.6% on DocVQA
⚡ 3.3-4.5x faster prefill and 7.5-16x faster generation vs Qwen2-VL
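For reference, a quick-start sketch assuming the standard transformers Vision2Seq API; the checkpoint id (inferred from the variant names above) and the image path are placeholders.

```python
# Quick-start sketch (assumed checkpoint id and image path).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed Instruct variant id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("photo.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```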
- ⚡ 1.4-2.1x better multi-query throughput
- 🌱 Pruned using 13B training tokens in 26 hours on 32 H100s
- 🔧 Optimized for NVIDIA Ampere GPUs and newer
- 🚀 30% higher throughput and 1.8x lower latency, with up to 5.0x total speedup when combined with quantization
- 💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels; see the vLLM sketch after this list
For now, it supports Llama. Which one would you want to see next?
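A hedged serving sketch with vLLM, which ships Sparse-Marlin kernels; the checkpoint id below is an assumption, not something confirmed by the post.

```python
# Serving sketch (assumed checkpoint id; vLLM picks sparse/quantized
# kernels such as Sparse-Marlin based on the checkpoint's metadata).
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")  # assumed id
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain 2:4 structured sparsity in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```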
Structured outputs can actually improve LLM performance when implemented correctly.
🔮 Examples in prompts should match the exact format expected in the actual tasks
🧰 Structured generation works best when implemented as "running our response parser as a generator"
📌 JSON generation requires careful prompt design, including specifying the desired schema (see the sketch after this list)
📝 Good prompts should give the model the same information a human would need to understand the task and the expected response format
📊 Structured outputs outperform unstructured ones on the test sets: GSM8K 0.78 vs 0.77, Last Letter 0.77 vs 0.73, Shuffled Objects 0.44 vs 0.41
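To make the prompt-design points concrete, here is a small schema-in-prompt sketch using only the standard library; the schema and few-shot example are illustrative, and any chat API can supply the raw response.

```python
# Sketch of the advice above: put the exact JSON schema in the prompt, keep
# the few-shot example in the same format, and validate what comes back.
import json

SCHEMA = {
    "type": "object",
    "properties": {"reasoning": {"type": "string"}, "answer": {"type": "string"}},
    "required": ["reasoning", "answer"],
}

def build_prompt(question: str) -> str:
    return (
        "Respond with JSON matching this schema exactly:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        'Example: {"reasoning": "2 + 2 = 4", "answer": "4"}\n\n'
        f"Question: {question}"
    )

def parse_response(raw: str) -> dict:
    """Fails loudly on malformed output, as a parser-as-verifier should."""
    obj = json.loads(raw)
    assert set(SCHEMA["required"]) <= obj.keys(), "missing required keys"
    return obj
```

Constrained-decoding libraries take the same idea further by compiling the parser into the sampling loop itself, which is what "running our response parser as a generator" refers to.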