n0riskn0r3ward.bsky.social
@n0riskn0r3ward.bsky.social
To be clear, I don't mean the resulting loss is lower. I mean that after benchmarking models trained with different optimizers on the same training data, the model I got from schedule_free_adamw was the champ by enough of a margin that I think it's plausible it wasn't random chance.
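If anyone wants to try it, here's a rough sketch of what swapping it into a vanilla PyTorch loop looks like (the model, data, and lr are placeholders, not my actual setup) - the main gotcha is the extra optimizer.train()/optimizer.eval() calls that schedule-free optimizers expect:

```python
# Minimal sketch: schedule-free AdamW in a standard PyTorch training loop.
# Model, data, and lr below are stand-ins, not the benchmarked setup.
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(128, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

loader = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(10)]

model.train()
optimizer.train()   # schedule-free optimizers track train/eval mode themselves
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()
optimizer.eval()    # switch the optimizer too before validation/benchmarking
```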
December 1, 2024 at 4:05 AM
static.googleusercontent.com
November 27, 2024 at 3:47 PM
Have you read the Google paper about using an optimization algorithm to make the optimal cookie though... because I not only read it, I baked dem cookies and can confirm, algo work real good
November 27, 2024 at 3:25 PM
Yeah, giant pain but I really wanted to know...
November 27, 2024 at 12:46 PM
So while it's a bit task-specific, there's more than enough context provided to the LLMs in the prompt to understand the task + how the output will be evaluated. The rubric is pass/fail on each dimension, with room for the LLM to overweight failures, like leaving out key info, in the final judgement.
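Roughly how I think about rolling up the per-dimension verdicts (dimension names and the "critical" set here are made up for illustration, not the actual rubric):

```python
# Illustrative sketch of aggregating pass/fail rubric verdicts.
# Dimension names and CRITICAL are hypothetical, not the real 21-point rubric.
CRITICAL = {"includes_key_info", "follows_instructions"}

def score(verdicts: dict[str, bool]) -> dict:
    failed = [dim for dim, passed in verdicts.items() if not passed]
    return {
        "passed": len(verdicts) - len(failed),
        "failed": failed,
        # one critical miss can sink the summary regardless of everything else
        "overall_fail": any(dim in CRITICAL for dim in failed),
    }

print(score({"includes_key_info": False, "follows_instructions": True, "style": True}))
```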
November 27, 2024 at 12:46 PM
Call it a summary / instruction-following eval. The judge prompt has a 21-point custom rubric for grading the outputs, and the original prompt for producing the summaries has a similarly lengthy description of the kinds of things I want included vs. excluded, style guidelines, what to emphasize, etc.
November 27, 2024 at 12:44 PM
Nice! All about the custom eval. A lot of work, but so, so worth it. I recently built my own eval as well (not for code, and primarily to evaluate the performance of different fine-tuning ablations/ideas).

bsky.app/profile/n0ri...
Sonnet is still King 👑 for summarization:

Sonnet 3.6 vs 4o 11-20 (n=210):
Claude Sonnet 3.6: 54% (113 wins)
GPT-4o (11/20): 44% (92 wins)
Ties: 2% (5)

Sonnet 3.6 vs Gemini Exp 11-21 (n=202):
Claude Sonnet 3.6: 60% (122 wins)
Gemini-exp-1121: 38% (76 wins)
Ties: 2% (4)
November 27, 2024 at 12:25 PM
For context, these are o1-preview judgements from a custom LLM-as-a-judge prompt I spent an unreasonable amount of time crafting. Posted this to that other site, but going forward I will share more here.
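For the curious, the percentages above fall straight out of the raw win/tie tallies - a quick sketch using the Sonnet 3.6 vs 4o counts (in reality the verdict list comes from the judge, one verdict per comparison):

```python
from collections import Counter

# Verdicts come from the judge model, one per pairwise comparison.
# Hard-coded here from the Sonnet 3.6 vs GPT-4o (11/20) run above.
verdicts = ["Claude Sonnet 3.6"] * 113 + ["GPT-4o (11/20)"] * 92 + ["tie"] * 5

tally = Counter(verdicts)
n = sum(tally.values())          # n = 210
for name, count in tally.items():
    label = "Ties" if name == "tie" else name
    print(f"{label}: {100 * count / n:.0f}% ({count})")
```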
November 27, 2024 at 12:22 PM
Distillation is the way. Sample efficiency of the larger model when training + inference cost of the smaller distilled model + retain the option to quantize the smaller model to fp8 for a speed boost on some GPUs + option for better speculative decoding from an extra-small distilled model = really good practical option IMO
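For anyone who hasn't set it up, the core of distillation is just a soft-target loss on top of the usual hard-label loss - a generic sketch (temperature and mixing weight are typical defaults, not a specific recipe):

```python
# Generic knowledge-distillation loss: student mimics the teacher's softened
# logits, blended with the ordinary hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```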
November 26, 2024 at 1:34 PM
@myexplodingpen.bsky.social is there any way to read your article without subscribing to Medium? Haven't been able to find much from Microsoft on the topic or anyone else discussing this, but I can't read your piece...
November 24, 2024 at 4:38 PM
Instructions unclear, but just in case all I have to do to get into the Stanford NLP PhD program is reply to this thread, I figure I'd better go ahead and reply.
November 22, 2024 at 12:38 AM