We are releasing two new games, Poker and Werewolf, along with an updated Chess leaderboard next Monday, February 2, running daily from 9:30 AM PT to 11:30 AM PT through February 4.
As AI evolves at an unprecedented pace, measuring intelligence requires more than a few AI research labs alone – it requires the imagination and collective expertise of the global community. That’s why we’re launching Community Benchmarks.
We're excited to announce the top 12 teams who showcased exceptional creativity & technical skill using AI agents! Check out their innovative projects & learn more about their submissions here:
www.kaggle.com/competitions...
This benchmark focuses on complex web research tasks and tests agent comprehensiveness.
Check the leaderboard: www.kaggle.com/benchmarks/g...
Developed by Google DeepMind and Google Research, this suite measures LLM factuality across four dimensions: Parametric knowledge, Search, Multimodal understanding & Grounding.
Explore the leaderboard: www.kaggle.com/benchmarks/g...
Developed by Google DeepMind, this benchmark spans 29 Indic languages, including first-ever evaluation data for 18 Indic languages. It supports language tasks like summarization, translation and question answering.
You can now download Kaggle Benchmark leaderboard results!
Compare your favorite models with a simple curl command or download the full CSV directly for deeper analysis.
Get started: www.kaggle.com/benchmarks
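Once a leaderboard CSV is downloaded, ranking the models takes only a few lines. The sketch below is a rough illustration, not Kaggle's documented schema: the column names `model` and `score`, the helper `top_models`, and the placeholder rows are all assumptions, so adjust them to match the header of the file you actually download.

```python
import csv
import io


def top_models(csv_text, model_col="model", score_col="score", n=3):
    """Return the n highest-scoring (model, score) pairs from CSV text.

    The column names are hypothetical; pass the real header names
    from the downloaded leaderboard file.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort by numeric score, best first.
    rows.sort(key=lambda r: float(r[score_col]), reverse=True)
    return [(r[model_col], float(r[score_col])) for r in rows[:n]]


# Placeholder data for illustration only, not real leaderboard results.
sample = "model,score\nmodel-a,0.71\nmodel-b,0.84\nmodel-c,0.65\n"
print(top_models(sample, n=2))  # [('model-b', 0.84), ('model-a', 0.71)]
```

To use it on a real download, read the file's contents into a string and pass it in place of `sample`.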
A new benchmark that tests reasoning beyond memorization. Each game starts from one of 20 popular openings, pushing models to adapt and think strategically rather than rely on learned patterns.
www.kaggle.com/benchmarks/c...
We’ve partnered with Google DeepMind and Google Research to launch a curated 1,000-prompt benchmark designed to provide a more reliable and challenging evaluation of LLM short-form factuality.
Check out the leaderboard here: www.kaggle.com/benchmarks/d...
In the first #KaggleGameArena — Chess Text Input — AI models faced off using only text inputs (no tools, no move validation) in 40+ matches per pairing to build a robust Elo-like ranking ♟️
www.kaggle.com/benchmarks/k...
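For readers curious how many matches per pairing turn into a ranking, here is a minimal sketch of the classic Elo update. The post only says "Elo-like", so the exact method Kaggle uses is not stated; the K-factor of 32, the 400-point scale, and the function names below are conventional assumptions, not the Game Arena implementation.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed, conventional K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two equally rated models; A wins one game.
print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Repeating this update over every game in a pairing makes ratings less sensitive to any single result, which is why more matches per pairing give a more robust ranking.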
Big thanks to
@magnuscarlseny.bsky.social, @gmhikaru.bsky.social, @gothamchess.bsky.social and GM David Howell for the fantastic commentary and analysis on Chess.com and TakeTakeTakeApp.
The first round is complete, and we have our four semi-finalists! Congratulations to o4-mini, o3, Gemini 2.5 Pro & Grok 4!
Come back tomorrow! Semi-finals kick off August 6th at 10:30 am PT.
Tune in today at 10:30AM PT to watch 4 head-to-head AI matchups 🤖 in a single-elimination bracket
Reply to this post with your filled-out bracket to let us know who you think will take home the gold medal!
For the next 3 days, August 5-7, tune in daily at 10:30 am PT, and catch commentary from
@gmhikaru.bsky.social, @gothamchess.bsky.social and @magnuscarlseny.bsky.social ⬇️
Kaggle Benchmarks is the fastest, easiest way to test new models.
Let Kaggle handle infrastructure while you focus on AI breakthroughs and benefit from competition-grade rigor.
Sign up here: goo.gle/kaggle-benchmarks-waitlist
Meet our team, explore an interactive demo, & check out our new community platform for building and sharing top model evaluations.
➕ learn more about the Kaggle team's upcoming talk on GenAI evaluation! #ICML2025
Access Kaggle's powerful compute resources like GPUs, TPUs & large datasets from your preferred editor, like Colab or VS Code.
Try it now! 👇 www.kaggle.com/discussions/...
www.kaggle.com/competitions...