Besmira Nushi
@besmiranushi.bsky.social
610 followers 140 following 87 posts
AI/ML, Responsible AI, Technology & Society @MicrosoftResearch
besmiranushi.bsky.social
Federal research funding works. It’s not an expense–it’s an investment. It’s not overhead–it’s a down payment on the future. - Eric Horvitz, Margaret Martonosi, Moshe Y. Vardi, and James Larus in CACM cacm.acm.org/opinion/keep... @erichorvitz.bsky.social
Keeping the Dream Alive: The Power and Promise of Federally Funded Research – Communications of the ACM
cacm.acm.org
besmiranushi.bsky.social
Our team in Zurich and EMEA is hiring Deep Learning Engineers for LLM Accuracy Evaluation and Analysis. Ideal candidates should have an inquisitive 🔬 approach to evaluation and apply best engineering practices for building reusable open source tools. www.linkedin.com/jobs/view/42...
NVIDIA hiring Deep Learning Engineer, LLM Accuracy Evaluation in Switzerland | LinkedIn
We are seeking senior engineers to pioneer new methodologies for accurately assessing the…
www.linkedin.com
Reposted by Besmira Nushi
warmonitor.net
The Diary of Anne Frank is among the hundreds of books banned in Florida this year. When I was in school, it was required reading. (Guardian)
besmiranushi.bsky.social
…the list continues, but the point is that a company that hires the best talent in the field definitely knows how to chart. The problem arises when marketing drives and dominates the science, and it is not a single-company problem today.
besmiranushi.bsky.social
…coloring new model releases boldly while leaving the older models as blank/white so newer models artificially stand out even if they’re not better, not providing worst case results, not standardizing the max value across charts presented at the same level horizontally…
besmiranushi.bsky.social
The problem with chart crimes is not just the distortion of the y axis. It is the erasure of all other competitors from charts (hence they don’t exist), lack of error bars, lack of transparency in tools and code being used for evals…
besmiranushi.bsky.social
I have a single question. Why doesn’t OpenAI compare with competitors in their evals? No Gemini, no Claude, no open source models…
Reposted by Besmira Nushi
jessica.bsky.social
hey wasn't this the same company that made a beautiful shiny "research" post about how AI evals should include error bars or something like that. or did they decide the CLT didn't apply here
Reposted by Besmira Nushi
hylandsl.bsky.social
New work from my team! arxiv.org/abs/2507.12950
Intersecting mechanistic interpretability and health AI 😎

We trained and interpreted sparse autoencoders on MAIRA-2, our radiology MLLM. We found a range of human-interpretable radiology reporting concepts, but also many uninterpretable SAE features.
Insights into a radiology-specialised multimodal large language model with sparse autoencoders
Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic...
arxiv.org
Reposted by Besmira Nushi
hannawallach.bsky.social
If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am--1:30pm PDT on Weds!!! icml.cc/virtual/2025...
ICML Poster: "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (ICML 2025)
icml.cc
Reposted by Besmira Nushi
feldera.bsky.social
📢 Webinar - 6/18 at 9am PST!
Stop re-running complex recursive queries when your graph data changes. Feldera incrementally evaluates recursive graph computations. Learn to easily build these mechanisms with #SQL, without the hassle of constant recomputation.
tinyurl.com/rb5my7d8
besmiranushi.bsky.social
I only got to listen to this today. A lot of people in my network including myself have felt exactly this, for years. The fear that for some obscure reason, your paperwork and you may not be enough for this country, even in “normal” times.

youtube.com/shorts/IF3bz...
let me explain what being on a student visa is actually like
YouTube video by Representative Pramila Jayapal
youtube.com
Reposted by Besmira Nushi
feldera.bsky.social
We’ll be at the #Databricks Data + AI Summit in SF next week (6/9–12).

If you’re around and want to chat about how incremental computing can make your #SparkSQL workloads go from hours to seconds — let’s connect.

Grab some time here: calendly.com/matt-feldera...

#DataAISummit #DataEngineering
besmiranushi.bsky.social
Ping us for questions on any of the above at [email protected].
besmiranushi.bsky.social
💡We hope this will help with advancing transparent practices in LLM evaluation and analysis. In addition, running extensive experimentation with frontier models can be expensive. Sharing end-to-end results, from code to actual experimentation logs, can make model analysis more accessible.
besmiranushi.bsky.social
🔍The logs include data provenance on data processing, raw model output, answer extraction, metric calculations, and aggregated reports. These are currently available for 10 conventional and reasoning models. For open-source reasoning models such as Phi-4 reasoning, the logs also include reasoning traces.
besmiranushi.bsky.social
📌You can now find all the evaluation logs from our inference-time scaling report and the Phi-4 reasoning technical report at huggingface.co/datasets/mic.... The evaluation code for the reasoning benchmarks can also be found in the main branch of Eureka ML Insights at github.com/microsoft/eu....
microsoft/Eureka-Bench-Logs · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
besmiranushi.bsky.social
Arindam Mitra, Besmira Nushi, @dimitrisp.bsky.social, Olli Saarikivi, @sytelus.bsky.social, Vaish Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing ZHENG
besmiranushi.bsky.social
Work done by an amazing group of people at @msftresearch.bsky.social: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, @vidhishab.bsky.social, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, @suriyag.bsky.social, Mojan Javaheripi, Ph.D., Neel J., Piero Kauffmann, Yash Lara, Caio Mendes
besmiranushi.bsky.social
➡️ Phi-4 reasoning models on Hugging Face: huggingface.co/microsoft/Ph... and huggingface.co/microsoft/Ph...

➡️ Phi-4 reasoning models on Azure AI Foundry: ai.azure.com/explore/mode...

➡️ Technical report: aka.ms/phi-reasoning/techreport

➡️ Announcement blog: azure.microsoft.com/en-us/blog/o...
besmiranushi.bsky.social
🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive with, and often outperform, much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.
Reposted by Besmira Nushi
dimitrisp.bsky.social
Re: The Chatbot Arena Illusion

Every eval chokes under hill climbing. If we're lucky, there’s an early phase where *real* learning (both model and community) can occur. I'd argue that a benchmark’s value lies entirely in that window. So the real question is what did we learn?