— ERNIE 4.5: beats GPT-4.5 at 1% of the price.
— Reasoning model X1: beats DeepSeek R1 at 50% of the price.
China continues to build intelligence too cheap to meter. The AI price war is on.
Google Gemini really cooked with this one.
This is next-gen photo editing.
"Make the steak vegetarian"
"Make the bridge go away"
"Make the keyboard more colorful"
And my favorite:
"Give the OpenAI logo more personality"
"Make the steak vegetarian"
"Make the bridge go away"
"Make the keyboard more colorful"
And my favorite
"Give the OpenAI logo more personality"
Nature reports that reasoning LLMs found errors in 1% of the 10,000 research papers analyzed, with a 35% false-positive rate, at a cost of $0.15-1 per paper.
The Anthropic founder’s vision of “a country of geniuses in a data center” is happening.
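A quick back-of-the-envelope on what those numbers imply, assuming the 1% flag rate and the 35% false-positive rate compose straightforwardly:

```python
# Back-of-the-envelope on the reported figures (assumes the 1% flag rate
# and the 35% false-positive rate compose straightforwardly).
papers = 10_000
flagged = papers * 0.01                # 100 papers flagged
likely_real = flagged * (1 - 0.35)     # ~65 errors that survive the FP rate
cost_low, cost_high = papers * 0.15, papers * 1.00

print(f"flagged: {flagged:.0f}, likely real errors: {likely_real:.0f}")
print(f"screening cost: ${cost_low:,.0f}-${cost_high:,.0f}")  # $1,500-$10,000
```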
LADDER:
— Generate variants of the problem
— Solve them, verify, and use GRPO (DeepSeek's RL algorithm) to learn
TTRL:
— Run those two steps when you see a new problem, at test time
A new form of test-time compute scaling! Rough sketch below.
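A minimal runnable sketch of the loop as I read it; every helper here is an illustrative stub of my own, not the paper's implementation:

```python
# Hypothetical sketch of the LADDER / TTRL loop; the helpers below are
# illustrative stubs, not the paper's actual code.
import random

def generate_variants(problem, n=4):
    # Stub: the model would rewrite the problem into simpler versions.
    return [f"{problem} (variant {i})" for i in range(n)]

def solve(problem):
    # Stub: sample a candidate solution from the model.
    return f"solution to {problem}"

def verify(problem, answer):
    # Stub: a cheap automatic checker (e.g. numerical verification).
    return random.random() > 0.5

def grpo_update(problem, attempts, rewards):
    # Stub: GRPO reinforces attempts that beat the group's mean reward.
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # ...gradient step weighted by `advantages` would go here...

def ladder_step(problem):
    """One LADDER iteration: generate variants, solve, verify, learn."""
    for v in generate_variants(problem):
        attempts = [solve(v) for _ in range(8)]
        rewards = [1.0 if verify(v, a) else 0.0 for a in attempts]
        grpo_update(v, attempts, rewards)

def ttrl_solve(new_problem, steps=3):
    """TTRL: run the same loop at test time on the problem just received."""
    for _ in range(steps):
        ladder_step(new_problem)
    return solve(new_problem)

print(ttrl_solve("integrate x*cos(x) dx"))
```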
SortBenchmark measures how fast, how cheaply, and how efficiently distributed systems can sort.
— How fast? 134s
— How cheap? $97
— How many in 1 minute? 370B numbers
— How much energy? ~59kJ, roughly a 15-minute walk
Every software engineer should know this.
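Rough unit conversions on those numbers, assuming the 134s figure is a 100TB GraySort run (my assumption, not stated above):

```python
# Unit conversions on the figures above. Assumes the 134 s run sorted
# 100 TB (the GraySort task size); treat that as an assumption.
sort_bytes = 100e12                # 100 TB
sort_seconds = 134
print(f"~{sort_bytes / sort_seconds / 1e9:.0f} GB/s sustained")     # ~746 GB/s

numbers_per_minute = 370e9         # "370B numbers in 1 minute"
print(f"~{numbers_per_minute / 60 / 1e9:.1f}B numbers per second")  # ~6.2B/s
```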
It has two categories:
— Daytona, for general-purpose sorting. The numbers above are Daytona.
— Indy, which can be specialized to the benchmark's 100-byte records with 10-byte keys. Not super useful in practice, though.
Link: sortbenchmark.org/
Google's experiments on it: sortbenchmark.org/
Revenue (/day): $562k
Cost (/day): $87k
Revenue (/yr): ~$205M
This is all while charging $2.19 per million tokens for R1, ~25x less than OpenAI's o1.
If this were in the US, it would be a >$10B company.
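The arithmetic checks out:

```python
# Sanity-checking the figures above.
revenue_per_day = 562_000
cost_per_day = 87_000

annual_revenue = revenue_per_day * 365                       # ~$205M/yr
margin = (revenue_per_day - cost_per_day) / revenue_per_day

print(f"annualized revenue: ~${annual_revenue / 1e6:.0f}M")  # ~$205M
print(f"implied gross margin: {margin:.0%}")                 # ~85%
```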
Fork a repo.
Select a folder.
Ask it anything.
It even shows you what percentage of the context window each folder takes.
Here it visualizes the flow of yt-dlp (the YouTube downloader):
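The per-folder context accounting is straightforward to replicate; a minimal sketch, assuming a tiktoken tokenizer and a 128k window (the tool's actual tokenizer and limits may differ):

```python
# Minimal sketch of per-folder context-window accounting. Assumes the
# tiktoken "cl100k_base" tokenizer and a 128k window; the actual tool's
# tokenizer and limits may differ.
import os
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 128_000  # assumed

def folder_tokens(root: str) -> dict[str, int]:
    """Token count per top-level folder under `root`."""
    counts: dict[str, int] = {}
    for dirpath, _, filenames in os.walk(root):
        top = os.path.relpath(dirpath, root).split(os.sep)[0]
        for name in filenames:
            try:
                with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            counts[top] = counts.get(top, 0) + len(enc.encode(text, disallowed_special=()))
    return counts

for folder, n in sorted(folder_tokens(".").items(), key=lambda kv: -kv[1]):
    print(f"{folder:30s} {n:>10,} tokens  {n / CONTEXT_WINDOW:6.1%} of window")
```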
OpenAI: chatgpt.com/share/67a41...
Gemini: docs.google.com/document/d/...
The winner was OpenAI.
It had the most detailed, highest-quality, and most accurate answer, but you do pay $200/mo for it.
Excellence is boring. It's making the same boring "correct" choice over and over again. You win by being consistent for longer.
Our short attention spans tend to forget that.
(Check out the detailed code submissions and scoring in the appendix)
The model was NOT contaminated with this data, and the 50-submission limit was enforced.
We will likely see superhuman coding models this year.
I'm surprised more people don't know about it. Brendan Bycroft made this beautiful interactive visualization showing exactly what every weight in an LLM does during inference.
Here's a link:
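For a taste of what it walks through: a GPT-style model is just a short inventory of weight tensors per layer. A sketch using nanoGPT-style names and GPT-2-small sizes (my choice of example, not tied to the visualization's exact configs):

```python
# Weight inventory of a GPT-style model, using nanoGPT-style names and
# GPT-2-small sizes (d_model=768, 12 layers, vocab 50257) as an example.
d, vocab, ctx, n_layers = 768, 50_257, 1_024, 12

per_layer = {
    "ln_1 (layernorm gain/bias)":    2 * d,
    "attn.c_attn (QKV projection)":  d * 3 * d + 3 * d,
    "attn.c_proj (attn output)":     d * d + d,
    "ln_2 (layernorm gain/bias)":    2 * d,
    "mlp.c_fc (up projection)":      d * 4 * d + 4 * d,
    "mlp.c_proj (down projection)":  4 * d * d + d,
}
embeddings = vocab * d + ctx * d    # token + position embeddings

total = embeddings + n_layers * sum(per_layer.values())
print(f"total parameters: ~{total / 1e6:.0f}M")  # ~124M, i.e. GPT-2 small
```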
Perfect needle-in-the-haystack scores are easy: the attention mechanism can simply match the word. Require even one hop of reasoning and performance degrades quickly.
This is why guaranteeing correctness for agents is hard.
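A concrete illustration of the difference, with hypothetical test strings (the benchmark's actual prompts differ):

```python
# Hypothetical direct-match vs 1-hop needle tests; the benchmark's
# actual prompts differ.
filler = "The sky was gray that day. " * 2_000   # long distractor haystack

# Direct match: attention can literally find the token "passcode".
direct_prompt = filler + "The passcode is 7421. " + filler
direct_question = "What is the passcode?"

# One hop: the model must first resolve "Osprey Protocol" -> passcode,
# then retrieve the value. Pure lexical matching no longer suffices.
hop_prompt = (filler
              + "We refer to the passcode as the Osprey Protocol. "
              + filler
              + "The Osprey Protocol value is 7421. "
              + filler)
hop_question = "What is the passcode?"
```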