Dario
@dariogargas.bsky.social
660 followers 1.2K following 180 posts
Senior AI researcher at BSC. Random thinker at home.
dariogargas.bsky.social
Are you into Chip Design, EDA, or do you just write RTL code for fun? Check out the largest benchmarking of LLMs for Verilog generation: TuRTLe 🐢

It includes 40 open LLMs evaluated on 4 benchmarks, covering 5 tasks. And it's only growing!

huggingface.co/spaces/HPAI-...

arxiv.org/abs/2504.01986
dariogargas.bsky.social
So many healthcare LLMs, and yet so little information! Check out this table summarizing contributions, and find more details in our latest pre-print: arxiv.org/abs/2505.04388
dariogargas.bsky.social
Biology needs to be reduced to fundamental components so that mysticism and religion do not corrupt its might.
The same goes for LLMs. These are not thinking machines, under any definition of thinking we may agree on. And the reduction to next-token prediction (NTP) clearly shows that.
dariogargas.bsky.social
Mostly agree here, though I would rather use the word mimic than persuade. Persuasion entails a purpose, which I'm not sure LLMs have. That is, does a mathematical function have a purpose?
dariogargas.bsky.social
Exactly! The most effective control measure, RAG, is still a patch that provides no technical guarantee, just a strong bias that models may not follow.

The sooner we understand the limits of LLMs, the sooner we'll learn to deploy them properly.
dariogargas.bsky.social
The Aloe Beta preprint includes full details on data & training setup.
Plus four different evaluation methods (including medical experts).
Plus a risk assessment of healthcare LLMs.

Two years of work condensed into a few pages, figures and tables.

Love open research!
huggingface.co/papers/2505....
dariogargas.bsky.social
We just opened two MLOps Engineer positions at @bsc-cns.bsky.social

Our active and young research team needs someone to help sustain and improve our services, including HPC clusters, automated pipelines, artifact management and much more!

Are you up for the challenge?
www.bsc.es/join-us/job-...
Reference: 350_25_CS_AIR_RE2 · Job title: Research Engineer - AI Factory (RE2) · www.bsc.es
dariogargas.bsky.social
Last week our team presented this at NAACL. Check out the beautiful poster they put together 😍
dariogargas.bsky.social
Though both data sources have the same origin (visual inspection of embryo development), I'd expect features found by humans and features found by a neural net to be complementary.

I guess the intrinsic variance is what dominates here. We can only know so much about an embryo by just looking at it.
dariogargas.bsky.social
Working on a project for evaluating embryo quality using in-vitro fertilization data.

A random forest using expert-annotated morphokinetic features of embryo development, and a CNN working directly on static images, get similar performance. Separately AND together.

I find it surprising...
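To make "together" concrete: by that I mean late fusion, averaging the two models' predicted probabilities. A toy sketch of the comparison on synthetic data; the MLP stands in for the CNN, and every number here is made up, not our actual setup:

```python
# Toy late-fusion sketch: average predicted probabilities of a tabular
# model (RF on morphokinetic-style features) and an "image" model
# (an MLP stand-in for the CNN). Data is synthetic; numbers mean nothing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_tab, y = make_classification(n_samples=600, n_features=12, random_state=0)
# Hypothetical second view of the same embryos (noisy copy of the features).
X_img = X_tab + np.random.default_rng(0).normal(scale=2.0, size=X_tab.shape)

idx_train, idx_test = train_test_split(np.arange(len(y)), random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tab[idx_train], y[idx_train])
nn = MLPClassifier(max_iter=2000, random_state=0).fit(X_img[idx_train], y[idx_train])

p_rf = rf.predict_proba(X_tab[idx_test])[:, 1]
p_nn = nn.predict_proba(X_img[idx_test])[:, 1]
for name, p in [("RF", p_rf), ("NN", p_nn), ("fusion", (p_rf + p_nn) / 2)]:
    print(name, round(roc_auc_score(y[idx_test], p), 3))
```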
dariogargas.bsky.social
There are quite a lot of researchers who are so preoccupied with whether or not they could get the funding, they don't stop to think if they should.

Being chased by dinosaurs and writing grants. Same thing.
dariogargas.bsky.social
The recipe is simple 🧑‍🍳 :
1. A good open model 🍞
2. A properly tuned RAG pipeline 🍒

And you will be cooking a five star AI system ⭐ ⭐ ⭐ ⭐ ⭐

See you at the AIAI 2025 conference, where we will be presenting this work, done at @bsc-cns.bsky.social and @hpai.bsky.social (toy sketch of the recipe below).
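For the curious, a minimal sketch of step 2 of the recipe, assuming a toy TF-IDF retriever over a placeholder corpus; a real pipeline would use a proper vector store and feed the assembled prompt to the open model from step 1:

```python
# Minimal RAG sketch: TF-IDF retrieval + prompt assembly.
# The corpus and top_k are illustrative placeholders, not our actual setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Aloe is an open healthcare LLM family fine-tuned from Llama.",
    "RAG retrieves passages and prepends them to the model prompt.",
    "The MIR is Spain's medical residency entrance exam.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k corpus passages most similar to the query."""
    vec = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    ranked = sorted(zip(scores, corpus), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

def build_prompt(query: str) -> str:
    """Assemble the augmented prompt to be sent to the open model."""
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the MIR exam?"))
```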
dariogargas.bsky.social
How expensive 🫰 is it to get the best LLM performance? How much cash do you need to burn 💸 to get reliable responses? Pareto optimal plots answer these questions.

Our research shows it is economically feasible and scalable to achieve o1-level performance at a fraction of the cost.
buff.ly/ji1VHiV
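The computation behind those plots is simple: a model is Pareto optimal if no other model is both cheaper and more accurate. A sketch with made-up numbers (see the paper for the real data):

```python
# Pareto frontier sketch: keep only (cost, accuracy) points where no other
# point is both cheaper and more accurate. All numbers below are invented.
models = {
    "small-7b":   (0.05, 0.71),  # ($ per 1K answers, accuracy)
    "mid-70b":    (0.40, 0.83),
    "big-closed": (4.00, 0.86),
    "mid-rag":    (0.55, 0.88),  # open model + tuned RAG pipeline
}

def pareto_front(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return model names not dominated by any other model."""
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for c, a in points.values()
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(models))  # ['small-7b', 'mid-70b', 'mid-rag']
```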
dariogargas.bsky.social
Our LLM safety project, Egida, reached 2K downloads 😀
It includes +60K safety questions expanded with jailbreaking prompts.
The four models trained (and released) show strong signs of safety alignment and generalization capacity. Check out the 🤗 HF page and the paper for details!
buff.ly/kxFVyl2
dariogargas.bsky.social
Today we release the TuRTLe leaderboard! 🐢

Are you in the Chip Design or EDA business? Wanna know which LLMs are best for the task? By integrating 4 benchmarks, TuRTLe evaluates:

* Syntax
* Functionality
* Synthesizability
* Power, Performance and Area metrics

huggingface.co/spaces/HPAI-...
TuRTLe Leaderboard - a Hugging Face Space by HPAI-BSC: A Unified Evaluation of LLMs for RTL Generation.
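Methodological footnote: code-generation benchmarks like these typically score functional correctness with the unbiased pass@k estimator of Chen et al. (2021). A sketch for illustration, not necessarily TuRTLe's exact scoring code:

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n generated
# RTL samples per problem, of which c pass the testbench, estimate the
# probability that at least one of k drawn samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), handling the degenerate case."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 pass the simulator's testbench.
print(round(pass_at_k(n=20, c=5, k=1), 3))  # 0.25
print(round(pass_at_k(n=20, c=5, k=5), 3))  # 0.806
```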
dariogargas.bsky.social
Disclaimer: only text questions were used to evaluate the LLMs, unlike the students. Students' scores are computed under the assumption that all questions were answered, which may not be the case.
buff.ly/3Xa9gFc
dariogargas.bsky.social
MIR is Spain's medical residency entrance exam. The best students reach an estimated accuracy of +90. Two or three every year.
We took the MIR exams from '20 to '24 to test open LLMs. Llama 3.1 based models, like Aloe, reach +80 in accuracy.
DeepSeek R1 reaches +88. Boosted by a RAG system, 92.
buff.ly/4bbbXMw
buff.ly/4hLrhBV
HPAI-BSC/CareQA · Datasets at Hugging Face
dariogargas.bsky.social
After listening to the latest @fallofcivspod.bsky.social episode about the Mongolian Empire, by @paulcooper34.bsky.social, I realized the Mongols and the Fremen from Dune share remarkable similarities.
Skilled warriors adapted to harsh environments, taking over a society whose ways they don't want to adopt.
dariogargas.bsky.social
I like the cell one 👍

I'm an empiricist, so we attack metrics by developing adversarial benchmarks that expose model shortcuts. Plus, it's a lot of fun to show how fragile LLMs can be.
dariogargas.bsky.social
While writing a paper, I consistently come across insights that are too general or not tested enough to be sold as paper contributions, but are great for conversation :)
dariogargas.bsky.social
Human evaluation of LLMs is close to saturation. Models have been optimized so heavily for plausibility that we are unable to tell good from bad. Only experts in expert domains can see a meaningful difference.
dariogargas.bsky.social
After a year working on LLM evaluation, our benchmarking paper is finally out (to be presented at NAACL 2025). Main lessons:
* All LLM evals are wrong, some are slightly useful.
* Goodhart's law. All the time. Everywhere.
* Do lots of different evals and hope for the best.
Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
Current Large Language Models (LLMs) benchmarks are often based on open-ended or close-ended QA evaluations, avoiding the requirement of human labor. Close-ended measurements evaluate the factuality…
dariogargas.bsky.social
Evaluating LLMs is a bit like paleontology. Trying to understand the behavior of very complex entities by observing only noisy and partial evidence. How do paleontologists deal with the uncertainty and frustration? Do they also feel like doing alchemy instead of science?
dariogargas.bsky.social
Remarkable effort. Questionable motivation.
dariogargas.bsky.social
Wisdom from my 6y old daughter: "A king is a just person disguised as king."