Not just the final models/code/data, but also negative results, toy experiments, and even spontaneous discussions.
That's what we're trying @ marin.community
TL;DR: ⚠️ on speculative decoding in vanilla vLLM. Even with ~75% token acceptance, batch-4 generation dropped from 150 t/s (no speculation) -> 50 t/s (with speculation).
I'd get 400+ t/s max throughput without speculation at larger batch sizes.
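For context, a minimal sketch of the kind of before/after throughput check behind those numbers. The speculative_model / num_speculative_tokens kwargs match the vLLM ~0.4-0.6 era API (newer releases take a speculative_config dict instead), and the model and draft-model names here are illustrative, not the ones from this run.

    # Sketch: A/B throughput of batch-4 generation with and without
    # speculative decoding in vLLM. Kwargs and model names are assumptions
    # (see note above), not the exact setup from the post.
    import time
    from vllm import LLM, SamplingParams

    PROMPTS = ["Explain speculative decoding in one paragraph."] * 4  # batch of 4
    PARAMS = SamplingParams(temperature=0.0, max_tokens=512)

    def tokens_per_sec(llm):
        """Generate for the fixed batch and return output tokens per second."""
        start = time.perf_counter()
        outputs = llm.generate(PROMPTS, PARAMS)
        elapsed = time.perf_counter() - start
        n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        return n_tokens / elapsed

    baseline = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    print(f"no speculation:   {tokens_per_sec(baseline):.0f} t/s")
    # vLLM may not fully release GPU memory on del; running the two
    # configs in separate processes is safer in practice.
    del baseline

    speculative = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # hypothetical draft model
        num_speculative_tokens=5,
    )
    print(f"with speculation: {tokens_per_sec(speculative):.0f} t/s")

The drop is plausible at larger batch sizes: speculation spends extra compute on draft tokens that get rejected, and once the GPU is already saturated by the batch, that overhead costs throughput instead of hiding latency.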
Azure: "We don't have any Ampere GPUs to turn up even."
Me: 😠
Azure: "We don't have any Ampere GPUs to turn up even."
Me: 😠
continue
continue
continue
continue, and btw, I think you were interrupted and may have already written some stuff you can't see
continue
continue
ifykyk
chatgpt.com/share/67aa59...
Coolest thing was GPT suggesting a hybrid approach I'd already been thinking about before I even got to the bottom.