Rust is just a tool; don't worry too much about the views. The algorithm sucks and all platforms are filled with bots anyway.
Even in Rust, if I don't carefully hand-hold the LLM, it tends to spiral out of control and produce random crap.
But not unlike a junior, if you explain carefully what you want, it tends to get it right, or self-correct reasonably well. Just don't ask for too much at once.
Also, there are no runtime checks for the types. It has happened to me too many times that the culprit was not my codebase but my sanitization of browser data (which TS doesn't protect against).
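TypeScript's types are erased at compile time, so anything coming out of the browser (localStorage, query params, fetch responses) is untrusted data no matter what you cast it to. A minimal sketch of the kind of runtime guard that catches this; the `UserPrefs` shape and the `"prefs"` storage key are made up for illustration:

```typescript
// TypeScript checks nothing at runtime, so browser data has to be
// validated by hand (or with a validation library such as zod).
interface UserPrefs {
  theme: "light" | "dark";
  fontSize: number;
}

// Runtime type guard: checks the actual shape instead of trusting a cast.
function isUserPrefs(value: unknown): value is UserPrefs {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    (v.theme === "light" || v.theme === "dark") &&
    typeof v.fontSize === "number" &&
    Number.isFinite(v.fontSize)
  );
}

function loadPrefs(): UserPrefs {
  const fallback: UserPrefs = { theme: "light", fontSize: 14 };
  const raw = localStorage.getItem("prefs"); // untyped browser data
  if (raw === null) return fallback;
  try {
    const parsed: unknown = JSON.parse(raw);
    // `parsed as UserPrefs` would compile fine and blow up later;
    // the guard is the only thing actually protecting the rest of the code.
    return isUserPrefs(parsed) ? parsed : fallback;
  } catch {
    return fallback;
  }
}
```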
That’s it. Remove all the flags you’re using and you’re likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values to give the best performance. In production, we don’t have any flags anymore in our deployments.
On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM, while they take only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead is ~5µs. Thanks Daniel de Kok for the beast data structure.
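TGI's real prefix cache lives in the Rust router (the data structure credited above is a radix trie over token ids); the toy sketch below only illustrates the idea, with all names invented here: computed prefixes are looked up by longest matching token prefix, so a follow-up message only has to prefill the tokens that weren't seen before.

```typescript
// Toy prefix cache: a trie over token ids, one node per token.
// Real systems cache KV blocks and compress paths; here each node just
// records that the prefix ending at it has already been computed.
class PrefixNode {
  children = new Map<number, PrefixNode>();
}

class PrefixCache {
  private root = new PrefixNode();

  // Record that the KV state for this token sequence now exists.
  insert(tokens: number[]): void {
    let node = this.root;
    for (const t of tokens) {
      let child = node.children.get(t);
      if (!child) {
        child = new PrefixNode();
        node.children.set(t, child);
      }
      node = child;
    }
  }

  // Length of the longest previously computed prefix of `tokens`.
  // Everything up to that point can skip prefill entirely.
  match(tokens: number[]): number {
    let node = this.root;
    let matched = 0;
    for (const t of tokens) {
      const child = node.children.get(t);
      if (!child) break;
      node = child;
      matched++;
    }
    return matched;
  }
}

// After serving the first turn, insert its tokens; when the follow-up
// arrives, only `tokens.length - matched` new tokens need a prefill pass.
const cache = new PrefixCache();
cache.insert([1, 15, 2033, 9, 7]);                        // first turn (abridged)
const matched = cache.match([1, 15, 2033, 9, 7, 42, 8]);  // follow-up turn
console.log(matched); // 5 -> only 2 new tokens to prefill
```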
By reducing our memory footprint, we’re able to ingest many more tokens and more dynamically than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely gets 10k. A lot of work went into reducing the footprint of the runtime.
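Rough back-of-the-envelope arithmetic (my numbers, not from the comment above) for why 30k tokens can fit on a 24GB L4, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads of dimension 128) with fp16 weights and KV cache, and ignoring runtime overheads:

```typescript
// Back-of-the-envelope KV-cache sizing for Llama 3.1 8B in fp16.
// Architecture numbers come from the published model config; activations,
// CUDA graphs and fragmentation overhead are ignored here.
const layers = 32;
const kvHeads = 8;        // grouped-query attention
const headDim = 128;
const bytesPerValue = 2;  // fp16

// K and V per token, across all layers: 2 * 32 * 8 * 128 * 2 = 128 KiB.
const kvBytesPerToken = 2 * layers * kvHeads * headDim * bytesPerValue;

const tokens = 30_000;
const kvCacheGiB = (tokens * kvBytesPerToken) / 1024 ** 3; // ~3.7 GiB
const weightsGiB = (8e9 * bytesPerValue) / 1024 ** 3;      // ~14.9 GiB

// ~18.6 GiB total, which leaves headroom on a 24 GB (~22.4 GiB) card.
console.log({ kvCacheGiB, weightsGiB, totalGiB: kvCacheGiB + weightsGiB });
```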