Lightnews — Scholar-powered news

Besmira Nushi @besmiranushi.bsky.social · 27d

Federal research funding works. It’s not an expense–it’s an investment. It’s not overhead–it’s a down payment on the future. - Eric Horvitz, Margaret Martonosi, Moshe Y. Vardi, and James Larus in CACM cacm.acm.org/opinion/keep... @erichorvitz.bsky.social

Keeping the Dream Alive: The Power and Promise of Federally Funded Research – Communications of the ACM

cacm.acm.org

Besmira Nushi @besmiranushi.bsky.social · Sep 5

Our team in Zurich and EMEA is hiring Deep Learning Engineers for LLM Accuracy Evaluation and Analysis. Ideal candidates should have an inquisitive 🔬 approach to evaluation and follow best engineering practices for building reusable open source tools. www.linkedin.com/jobs/view/42...

NVIDIA hiring Deep Learning Engineer, LLM Accuracy Evaluation in Switzerland | LinkedIn

Posted 5:30:48 AM. We are seeking senior engineers to pioneer new methodologies for accurately assessing the…See this and similar jobs on LinkedIn.

www.linkedin.com

Reposted by Besmira Nushi

The War Monitor @warmonitor.net · Aug 24

The Diary of Anne Frank is among the hundreds of books banned in Florida this year. When I was in school, it was required reading. (Guardian)

Besmira Nushi @besmiranushi.bsky.social · Aug 9

…the list continues, but the point is that a company that hires the best talent in the field definitely knows how to chart. The problem arises when marketing drives and dominates the science, and it is not a single-company problem today.

Besmira Nushi @besmiranushi.bsky.social · Aug 9

…coloring new model releases boldly while leaving the older models as blank/white so newer models artificially stand out even if they’re not better, not providing worst case results, not standardizing the max value across charts presented at the same level horizontally…

Besmira Nushi @besmiranushi.bsky.social · Aug 9

The problem with chart crimes is not just the distortion of the y axis. It is the erasure of all other competitors from charts (hence they don’t exist), lack of error bars, lack of transparency in tools and code being used for evals…

Besmira Nushi @besmiranushi.bsky.social · Aug 8

I have a single question. Why doesn’t OpenAI compare with competitors in their evals? No Gemini, no Claude, no open source models…

Reposted by Besmira Nushi

jessica dai @jessica.bsky.social · Aug 6

hey wasn't this the same company that made a beautiful shiny "research" post about how AI evals should include error bars or something like that. or did they decide the CLT didn't apply here
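
For readers unfamiliar with the error bars discussed in this thread, here is a minimal sketch of a CLT-based (normal approximation) 95% confidence interval for benchmark accuracy. The function name `accuracy_with_ci` and the per-question scores are hypothetical and made up for illustration; they are not taken from any report mentioned in these posts.

```python
import math


def accuracy_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) for per-item 0/1 scores.

    Uses the CLT normal approximation: mean +/- z * standard error.
    z=1.96 corresponds to a two-sided 95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scored items")
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)  # standard error of the mean
    half = z * sem
    # Clamp to [0, 1] because accuracy cannot leave that range.
    return mean, max(0.0, mean - half), min(1.0, mean + half)


# Hypothetical results: 1 = correct, 0 = incorrect, 100 questions each.
model_a = [1] * 74 + [0] * 26
model_b = [1] * 70 + [0] * 30

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    mean, lo, hi = accuracy_with_ci(scores)
    print(f"{name}: {mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

With only 100 questions per model, the two intervals in this made-up example overlap, which is the kind of information a bar chart without error bars hides. The normal approximation is a simplification; for small samples or accuracies near 0 or 1, other interval methods may be more appropriate.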

Reposted by Besmira Nushi

Stephanie Hyland @hylandsl.bsky.social · Jul 18

New work from my team! arxiv.org/abs/2507.12950
Intersecting mechanistic interpretability and health AI 😎

We trained and interpreted sparse autoencoders on MAIRA-2, our radiology MLLM. We found a range of human-interpretable radiology reporting concepts, but also many uninterpretable SAE features.

Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic...

arxiv.org

Reposted by Besmira Nushi

Hanna Wallach @hannawallach.bsky.social · Jul 15

If you're at @icmlconf.bsky.social this week, come check out our poster on "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" presented by the amazing @afedercooper.bsky.social from 11:30am--1:30pm PDT on Weds!!! icml.cc/virtual/2025...

ICML Poster: Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge (ICML 2025)

icml.cc

Reposted by Besmira Nushi

Feldera @feldera.bsky.social · Jun 11

📢 Webinar - 6/18 at 9am PST!
Stop re-running complex recursive queries when your graph data changes. Feldera incrementally evaluates recursive graph computations. Learn to easily build these mechanisms with #SQL, without the hassle of constant recomputation.
tinyurl.com/rb5my7d8

Besmira Nushi @besmiranushi.bsky.social · Jun 11

I only got to listen to this today. A lot of people in my network including myself have felt exactly this, for years. The fear that for some obscure reason, your paperwork and you may not be enough for this country, even in “normal” times.

youtube.com/shorts/IF3bz...

let me explain what being on a student visa is actually like

YouTube video by Representative Pramila Jayapal

youtube.com

Reposted by Besmira Nushi

Feldera @feldera.bsky.social · Jun 5

We’ll be at the #Databricks Data + AI Summit in SF next week (6/9–12).

If you’re around and want to chat about how incremental computing can make your #SparkSQL workloads go from hours to seconds — let’s connect.

Grab some time here: calendly.com/matt-feldera...

#DataAISummit #DataEngineering

Reposted by Besmira Nushi

Melanie Mitchell @melaniemitchell.bsky.social · May 30

Tired: "BS"

Wired: "Vibe citing"

www.nytimes.com/2025/05/29/w...

White House Health Report Included Fake Citations

www.nytimes.com

Besmira Nushi @besmiranushi.bsky.social · May 27

Ping us for questions on any of the above at [email protected].

Besmira Nushi @besmiranushi.bsky.social · May 27

💡We hope this will help with advancing transparent practices in LLM evaluation and analysis. In addition, running extensive experimentation with frontier models can be expensive. Sharing end-to-end results, from code to actual experimentation logs, can make model analysis more accessible.

Besmira Nushi @besmiranushi.bsky.social · May 27

🔍The logs include data provenance on data processing, raw model output, answer extraction, metric calculations, and aggregated reports. These are currently available for 10 conventional and reasoning models. For open-source reasoning models such as Phi-4 reasoning, the logs also include reasoning traces.

Besmira Nushi @besmiranushi.bsky.social · May 27

📌You can now find all the evaluation logs from our inference-time scaling report and the Phi-4 reasoning technical report at huggingface.co/datasets/mic.... The evaluation code for the reasoning benchmarks can also be found in the main branch of Eureka ML Insights at github.com/microsoft/eu....

microsoft/Eureka-Bench-Logs · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Besmira Nushi @besmiranushi.bsky.social · May 4

www.scientificamerican.com/article/unde... at this point one just needs to cross their fingers and hope for more sanity.

Under Trump, National Science Foundation Cuts Off All Funding to Scientists

National Science Foundation staff were told to freeze outgoing funding days after NSF leadership introduced a new policy that requires that grants be screened for “alignment with agency priorities”

www.scientificamerican.com

Besmira Nushi @besmiranushi.bsky.social · May 1

Arindam Mitra, Besmira Nushi, @dimitrisp.bsky.social, Olli Saarikivi, @sytelus.bsky.social, Vaish Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing ZHENG

Besmira Nushi @besmiranushi.bsky.social · May 1

Work done by an amazing group of people at @msftresearch.bsky.social : Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, @vidhishab.bsky.social, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, @suriyag.bsky.social, Mojan Javaheripi, Ph.D., Neel J., Piero Kauffmann, Yash Lara, Caio Mendes

Besmira Nushi @besmiranushi.bsky.social · May 1

➡️ Phi-4 reasoning models on Hugging Face: huggingface.co/microsoft/Ph... and huggingface.co/microsoft/Ph...

➡️ Phi-4 reasoning models on Azure AI Foundry: ai.azure.com/explore/mode...

➡️ Technical report: aka.ms/phi-reasoning/techreport

➡️ Announcement blog: azure.microsoft.com/en-us/blog/o...

Besmira Nushi @besmiranushi.bsky.social · May 1

🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive with, and often outperform, much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.

Reposted by Besmira Nushi

Dimitris Papailiopoulos @dimitrisp.bsky.social · Apr 30

Re: The Chatbot Arena Illusion

Every eval chokes under hill climbing. If we're lucky, there’s an early phase where *real* learning (both model and community) can occur. I'd argue that a benchmark’s value lies entirely in that window. So the real question is what did we learn?

Besmira Nushi @besmiranushi.bsky.social · Apr 29

All Eureka inference-time scaling insights are now available here: www.microsoft.com/en-us/resear... It was fun sharing these and more together with Vidhisha Balachandran @vidhishab.bsky.social and Vibhav Vineet at #ICLR2025.

Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead - Microsoft Research

Understanding and measuring the potential of inference-time scaling for reasoning. The new Eureka study tests nine state-of-the-art models on eight diverse reasoning tasks.

www.microsoft.com
