arylwen.bsky.social
@arylwen.bsky.social
50,000-100,000 tokens? I am using Linux and Nvidia with LM Studio. I can fit about 64k tokens with a 14B model on a 3090. With the Q4_K_M quant, inference takes anywhere from a few seconds to about a minute. For longer contexts the model would offload to the CPU and the inference time balloons by an order of magnitude.
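The 64k ceiling on a 24 GB card checks out with a back-of-the-envelope KV-cache calculation. This is a rough sketch, not LM Studio's actual accounting; the layer/head numbers assume a Qwen2.5-14B-style config (48 layers, 8 KV heads via GQA, head dim 128) and an fp16 KV cache, all of which are illustrative assumptions:

```python
# Back-of-the-envelope KV-cache sizing for a ~14B model at 64k context.
# Config values below are assumptions (Qwen2.5-14B-like), not measured
# from LM Studio.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors per layer;
    # bytes_per_elem=2 assumes an fp16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

cache = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                       n_tokens=64 * 1024)
print(f"{cache / 2**30:.1f} GiB")  # prints "12.0 GiB"
```

Add roughly 8-9 GB for the Q4_K_M weights of a 14B model and the total lands just under the 3090's 24 GB, which is consistent with ~64k being the point where layers start spilling to the CPU.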
March 9, 2025 at 6:24 PM
Reposted
I just tried it on deepseek-r1:32b and the full 671B, and both stabilize at a consistent 85% confidence

Going bigger/smarter doesn't seem to make it any more or less confident beyond a certain point. The sweet spot seems to be 7B-8B
February 9, 2025 at 6:28 PM