n0riskn0r3ward.bsky.social
@n0riskn0r3ward.bsky.social
To be clear, I don't mean the resulting loss is lower. I mean that after benchmarking models trained with different optimizers on the same training data, the model I got from schedule_free_adamw was the champ by enough of a margin that I think it's plausible it wasn't random chance.
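If anyone wants to try it, here's a rough sketch of what swapping it into a vanilla PyTorch loop looks like (the model, data, and lr are placeholders, not my actual setup) - the main gotcha is the extra optimizer.train()/optimizer.eval() calls that schedule-free optimizers expect:

```python
# Minimal sketch: schedule-free AdamW in a standard PyTorch training loop.
# Model, data, and lr below are stand-ins, not the benchmarked setup.
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(128, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

loader = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(10)]

model.train()
optimizer.train()   # schedule-free optimizers track train/eval mode themselves
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()
optimizer.eval()    # switch the optimizer too before validation/benchmarking
```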
December 1, 2024 at 4:05 AM
static.googleusercontent.com
November 27, 2024 at 3:47 PM
Have you read the Google paper about using an optimization algorithm to make the optimal cookie though... because I not only read it, I baked dem cookies and can confirm, algo work real good
November 27, 2024 at 3:25 PM
Yeah, giant pain but I really wanted to know...
November 27, 2024 at 12:46 PM
So while it's a bit task-specific, there's more than enough context provided to the LLMs in the prompt to understand the task + how the output will be evaluated. The rubric is pass/fail on each dimension, with room for the LLM to overweight failures, like leaving out key info, in the final judgement.
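Roughly how I think about rolling up the per-dimension verdicts (dimension names and the "critical" set here are made up for illustration, not the actual rubric):

```python
# Illustrative sketch of aggregating pass/fail rubric verdicts.
# Dimension names and CRITICAL are hypothetical, not the real 21-point rubric.
CRITICAL = {"includes_key_info", "follows_instructions"}

def score(verdicts: dict[str, bool]) -> dict:
    failed = [dim for dim, passed in verdicts.items() if not passed]
    return {
        "passed": len(verdicts) - len(failed),
        "failed": failed,
        # one critical miss can sink the summary regardless of everything else
        "overall_fail": any(dim in CRITICAL for dim in failed),
    }

print(score({"includes_key_info": False, "follows_instructions": True, "style": True}))
```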
November 27, 2024 at 12:46 PM
Call it a summary / instruction-following eval. The judge prompt has a 21-point custom rubric for grading the outputs, and the original prompt for producing the summaries has a similarly lengthy description of the kinds of things I want included vs. excluded, style guidelines, what to emphasize, etc.
November 27, 2024 at 12:44 PM
Nice! All about the custom eval. A lot of work, but so, so worth it. I recently built my own eval as well (not for code, and primarily to evaluate the performance of different fine-tuning ablations/ideas).

bsky.app/profile/n0ri...
Sonnet is still King 👑 for summarization:

Sonnet 3.6 vs 4o 11-20 (n=210):
Claude Sonnet 3.6: 54% (113 wins)
GPT-4o (11/20): 44% (92 wins)
Ties: 2% (5)

Sonnet 3.6 vs Gemini Exp 11-21 (n=202):
Claude Sonnet 3.6: 60% (122 wins)
Gemini-exp-1121: 38% (76 wins)
Ties: 2% (4)
November 27, 2024 at 12:25 PM
For context, these are o1-preview judgements from a custom LLM-as-a-judge prompt I spent an unreasonable amount of time crafting. Posted this to that other site, but going forward I will share more here.
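For the curious, the percentages above fall straight out of the raw win/tie tallies - a quick sketch using the Sonnet 3.6 vs 4o counts (in reality the verdict list comes from the judge, one verdict per comparison):

```python
from collections import Counter

# Verdicts come from the judge model, one per pairwise comparison.
# Hard-coded here from the Sonnet 3.6 vs GPT-4o (11/20) run above.
verdicts = ["Claude Sonnet 3.6"] * 113 + ["GPT-4o (11/20)"] * 92 + ["tie"] * 5

tally = Counter(verdicts)
n = sum(tally.values())          # n = 210
for name, count in tally.items():
    label = "Ties" if name == "tie" else name
    print(f"{label}: {100 * count / n:.0f}% ({count})")
```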
November 27, 2024 at 12:22 PM
Distillation is the way. Sample efficiency of the larger model when training + inference cost of the smaller distilled model + retain the option to quantize the smaller model to fp8 for a speed boost on some GPUs + option for better speculative decoding from an extra-small distilled model = really good practical option IMO
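For anyone who hasn't set it up, the core of distillation is just a soft-target loss on top of the usual hard-label loss - a generic sketch (temperature and mixing weight are typical defaults, not a specific recipe):

```python
# Generic knowledge-distillation loss: student mimics the teacher's softened
# logits, blended with the ordinary hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```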
November 26, 2024 at 1:34 PM
@myexplodingpen.bsky.social is there any way to read your article without subscribing to Medium? Haven't been able to find much from Microsoft on the topic or anyone else discussing this, but I can't read your piece...
November 24, 2024 at 4:38 PM
Instructions unclear, but just in case all I have to do to get into the Stanford NLP PhD program is reply to this thread, I figure I'd better go ahead and reply.
November 22, 2024 at 12:38 AM