Evals, metrics, multilinguality, multiculturality, multimodality, and (dabbling in) reasoning
https://saxon.me/
Interestingly, this is true only for some multilingual models. Aya knows China best in Chinese, but LLaMA is always best in English.
a 🧵 1/n
Drain: arxiv.org/abs/2511.04820
Strain: direct.mit.edu/qss/article/...
Oligopoly: direct.mit.edu/qss/article/...
More than anything my PhD taught me this.
FYI the blog post for the updated policy is out. Our LLM future is dire :/
Also, check out the cool bsky comment integration I've added to the blog! Engagement with this post will go under the blogpost on my site as comments!
saxon.me/blog/2024/gr...
Turning the replies to a Bluesky post into the comment section for a blogpost is a small, concrete way to support the ecosystem: future visitors who want to add comments are incentivized to interact on the platform
Also, it's very easy to do:
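For anyone curious what that could look like, here is a minimal sketch in Python, not the actual code behind the blog. It assumes the public, unauthenticated `app.bsky.feed.getPostThread` endpoint and the usual thread JSON shape (`post`, `replies`); the helper names `fetch_thread` and `flatten_replies` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public AppView endpoint; assumed to be reachable without logging in.
API = "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread"


def fetch_thread(at_uri, depth=6):
    """Fetch the thread rooted at an at:// post URI and return its 'thread' object."""
    query = urllib.parse.urlencode({"uri": at_uri, "depth": depth})
    with urllib.request.urlopen(f"{API}?{query}") as resp:
        return json.load(resp)["thread"]


def flatten_replies(thread, level=0):
    """Walk nested replies depth-first and return a flat list of comment dicts."""
    comments = []
    for reply in thread.get("replies", []):
        post = reply.get("post")
        if post is None:  # blocked or missing posts have no "post" field
            continue
        comments.append({
            "author": post["author"]["handle"],
            "text": post["record"].get("text", ""),
            "level": level,  # nesting depth, handy for indenting replies
        })
        comments.extend(flatten_replies(reply, level + 1))
    return comments
```

A static site would typically do the same thing client-side in JavaScript; the point is only that one GET request plus a small recursive walk is enough to render the replies as comments.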
youtu.be/6i2I3dkZ5-M
Also, I am getting more and more indiewebpilled. Would any other NLPMLAI researcher-bloggers be interested in making a webring?
Personally, I think idealized human vs. average LM is the most germane comparison to use when thinking about capabilities
Gee. Who Could Have Foreseen. *stares directly into the camera like in The Office*
www.404media.co/a16z-backed-...
With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵
1️⃣Gender bias over-representation in AI bias research 👫
2️⃣Stable Diffusion's skin tone bias 🧑🏻🧑🏽🧑🏿
3️⃣Limitations of human oversight in AI hiring 👤🤖
Let's chat if you’re at AIES or read below/reach out for details!
#AIES25 #AcademicSky