I would like to share some work we've been doing at cascadetech.ai: Predicted Outputs in vLLM. If you aren't familiar with PO, it lets you dramatically speed up generation when you already know something about the contents of the output (think: code modification).
Think: speculative decoding, but instead of a draft model (slow, complicated, often wrong) you have a static text prediction of the output, plus a diff algorithm to keep it aligned when the generation diverges.
There is already support for this in the OpenAI API specification, and this change brings it to vLLM in a much better form. OpenAI is actually the only other provider I'm aware of offering this feature, and there it actually results in SLOWER generation, while ours is much faster.
Thanks Tim! I would also add that even if you are rewriting 90% of the code, you still get a 10% speed improvement just by using this feature. People slave away in the CUDA mines for a 10% speedup, and here it is, sitting right in front of you: 10% even in the WORST case.
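For intuition, here is a minimal, hypothetical sketch (not the actual vLLM or OpenAI implementation; all names are made up) of the two pieces described above: speculative-style acceptance of draft tokens, and a diff-style realignment of the static prediction after the output diverges from it.

```python
# Toy sketch of Predicted-Outputs-style drafting (illustrative only).
# Piece 1: accept draft tokens while they match the model's output.
# Piece 2: realign the static prediction after a divergence.

def accept_draft(draft_tokens, target_tokens):
    """Speculative-decoding acceptance rule: keep draft tokens while
    they match what the target model produces; at the first mismatch,
    take the target's token (the verify step yields it anyway)."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_tokens):
        if drafted != target:
            accepted.append(target)  # correction token
            break
        accepted.append(drafted)
    return accepted

def realign(prediction: str, emitted_tail: str) -> str:
    """After a divergence, look for recently emitted text inside the
    static prediction; if found, drafting resumes right after it."""
    idx = prediction.find(emitted_tail)
    if idx == -1:
        return ""  # prediction no longer lines up; stop drafting
    return prediction[idx + len(emitted_tail):]

# The model kept most of the predicted tokens but renamed one:
print(accept_draft(["def", "f", "(", "x", ")"],
                   ["def", "f", "(", "y", ")"]))
# After the model emits "b = 2\n", the prediction can resume at "c = 3\n":
print(realign("a = 1\nb = 2\nc = 3\n", "b = 2\n"))
```

Accepted tokens cost one verification pass instead of one decode step each, which is where the speedup comes from even when much of the prediction is wrong.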
As a former videogame developer, I can tell you that you can definitely build deterministic software on CPUs! One really efficient way to do multiplayer is to replicate input across all nodes and then run a fully deterministic simulation on each node.
These are INCREDIBLY complex simulations, including multithreading, physics, and many billions of floating-point operations that have to be deterministic down to the last bit of the mantissa over hours of play. And they are.
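As a rough illustration of that input-replication ("lockstep") approach — with made-up names, and assuming an integer-only state update for simplicity — every node feeds the same ordered input stream into the same pure step function, so all nodes compute bit-identical states without ever exchanging state:

```python
# Minimal lockstep sketch (illustrative, not from any real engine).
# Only inputs are broadcast; state is never synchronized, because a
# deterministic step function makes every node agree from inputs alone.

def step(state: int, frame_input: int) -> int:
    # Pure integer update: no floats, no wall-clock time, no RNG,
    # so identical (state, input) always yields an identical result.
    return (state * 31 + frame_input) & 0xFFFFFFFF

def run_node(inputs) -> int:
    """Simulate one node consuming the shared, ordered input stream."""
    state = 0
    for frame_input in inputs:
        state = step(state, frame_input)
    return state

inputs = [3, 1, 4, 1, 5, 9]  # broadcast to every node, one per frame
# Two "nodes" replaying the same inputs end in exactly the same state:
assert run_node(inputs) == run_node(inputs)
```

In a real engine the hard part is keeping the step function actually deterministic (fixed iteration order across threads, identical floating-point behavior on every machine); a checksum of the state is typically exchanged periodically to detect desyncs.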
Not to mention, the Linux tradition of everything being a text stream is very conducive to LLM integration. I just installed desktop Linux on my new computer and I'm pretty happy with it so far. Much smoother experience than the last time I tried.
Had a long discussion about this last week with my ethnomusicologist friend, who teaches the history of rock to 18-year-olds. Apparently not only are they not forming bands, they aren't even attending live music events at all.
There is no genre of music where you see a crowd under 30 at live shows. I just went to a music festival, La Route du Rock, which would have been full of 20-year-olds 20 years ago. Now it was mostly people over 40.
But there isn't some OTHER festival down the road for people in their 20s. ALL music festivals are for Gen X and older millennials. Rock and roll is dying, and even live music is dying with it.
or "Don't bother learning how to use AI effectively; we will keep employing you even when other prospective employees will work more efficiently for the same salary?"
or "Don't worry, even as our competitors' employees adopt AI to become more efficient, we'll just keep doing things the old way so that we don't have to lay anybody off?"
Have you tried using Gemini instead of Anthropic? In my experience you can get better quality for a TINY fraction of the price: Gemini 2.0 Flash Lite is 10x cheaper than Haiku, and Flash 2.0 is like 8x cheaper.
I have had such miserable results with anything cooking-related. We did a cocktail night where we drank LLM-invented cocktails and they were so very bad. I feel like LLMs are in letter-counting territory with recipes.