Joe Stacey
@joestacey.bsky.social
2.5K followers 2.1K following 140 posts
NLP PhD student at Imperial College London and Apple AI/ML Scholar.
Pinned
joestacey.bsky.social
We have a fun new #NLProc paper on arXiv about improving the robustness of fine-tuned NLI models!

Have a look :)
arxiv.org/abs/2505.20209
Reposted by Joe Stacey
lisaalaz.bsky.social
We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐

LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔

🧵⬇️
joestacey.bsky.social
Congratulations!! Awesome that you’ll be in Europe!
joestacey.bsky.social
The bad:

- the chocolate here is terrible for no good reason
- hotel breakfasts never have any baked beans, which are way underappreciated here (they are delicious and add much-needed moisture to a cooked breakfast)
- the temperature in summer is inhumane

Think that covers the main stuff 😍
joestacey.bsky.social
Here’s my review of the US after a few days here. Did I miss anything? 🤔

The good:

- Americans are the most charming, friendly and hospitable people
- it’s super fun how the country is split into states that all have different laws and stuff, with different vibes state to state
joestacey.bsky.social
Any chance Keir Starmer can reshuffle himself in as foreign secretary, and shuffle in another prime minister who actually has some vague idea about what they want to achieve? 🙏🤦‍♂️
joestacey.bsky.social
Finally the heatwave has ended, and the UK is once again a bearable place to be 😍😍

If you have any UK-based collaborations, their productivity is about to increase like tenfold
joestacey.bsky.social
This work was really fun and a great last paper for my PhD. Check it out 🙂 Massive thanks to all my amazing collaborators!

arxiv.org/abs/2505.20209

P.S. if you know about a paper improving NLI model robustness not already in our related work appendix, I would love to hear about it 🥰
How to Improve the Robustness of Closed-Source Models on NLI
Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improv...
arxiv.org
joestacey.bsky.social
5) The best way to improve performance on the hardest OOD data was to choose more challenging training examples

Our best method (Uncertainty Sampling) picked examples with the most uncertain predictions. This identified challenging examples without introducing too much label noise
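
A minimal sketch of what this kind of uncertainty sampling could look like, assuming entropy over an initial model's predicted NLI label probabilities as the uncertainty score (the paper's exact scoring and selection details may differ):

```python
import numpy as np

def uncertainty_sample(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k candidate examples with the most
    uncertain predictions (highest entropy over NLI labels)."""
    # probs: (n_examples, 3) predicted probabilities over
    # entailment / neutral / contradiction from an initial model
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]

# Toy example: confident predictions are likely easy and get skipped,
# uncertain ones are selected as challenging training examples
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> skipped
    [0.40, 0.35, 0.25],  # uncertain -> selected
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],  # near-uniform -> selected
])
print(uncertainty_sample(probs, k=2))  # [1 3]
```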
joestacey.bsky.social
4) Creating more complex synthetic data avoids a loss in performance on harder OOD datasets

We find that generating more challenging synthetic data (Long & Complex Generation) helps retain performance on harder OOD datasets, while still achieving gains on easier OOD data
joestacey.bsky.social
3) Replacing some training examples with LLM-generated data proved very effective on less challenging OOD data

See Standard-OOD scores below (avg), where the simplest LLM-generated data (Short & Simple Generation) performed best, with substantial improvements
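
For illustration, here's a sketch of how the two generation styles (Short & Simple vs Long & Complex) might be prompted; the wording below is hypothetical, not the paper's actual prompts or pipeline:

```python
# Illustrative prompt templates only; the real generation prompts
# and filtering used in the paper may look quite different.
SHORT_SIMPLE = (
    "Write a one-sentence premise and a short hypothesis whose "
    "relation to the premise is: {label}.\n"
    "Format: PREMISE: ... HYPOTHESIS: ..."
)

LONG_COMPLEX = (
    "Write a 3-4 sentence premise with several clauses and named "
    "entities, then a hypothesis whose relation to the premise is: "
    "{label}. The hypothesis should require combining information "
    "from different parts of the premise.\n"
    "Format: PREMISE: ... HYPOTHESIS: ..."
)

for label in ("entailment", "neutral", "contradiction"):
    print(SHORT_SIMPLE.format(label=label))
    print(LONG_COMPLEX.format(label=label))
```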
joestacey.bsky.social
2) We experiment with 6+ ways to improve robustness:

These include sampling methods to choose more complex training examples, and generating new synthetic examples

Some methods were pretty fun, e.g. asking an LLM to assess the difficulty of training examples
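
A rough sketch of what that LLM-based difficulty scoring could look like; the prompt wording and the 1-5 scale here are assumptions, not necessarily what the paper used:

```python
def difficulty_prompt(premise: str, hypothesis: str, label: str) -> str:
    """Build a prompt asking an LLM to rate an NLI training example's
    difficulty; scale and wording are illustrative only."""
    return (
        "Rate the difficulty of this NLI training example from 1 "
        "(trivial) to 5 (requires multi-step reasoning).\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Label: {label}\n"
        "Difficulty:"
    )

print(difficulty_prompt(
    "All of the musicians left the stage after the encore.",
    "The stage was empty of musicians.",
    "entailment",
))
```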
joestacey.bsky.social
1) It's time to stop using fine-tuned encoder models:

We find that fine-tuned LLMs are substantially more robust than commonly used encoder models, despite being fine-tuned on 50x less data.

This is especially the case on challenging OOD datasets (see Challenge-OOD avg below)
joestacey.bsky.social
The paper tries to improve the robustness of closed-source LLMs fine-tuned on NLI, assuming a realistic training budget of 10k training examples.

Here's a 45-second rundown of what we found!
joestacey.bsky.social
I’d personally just love to see more negative results from nice ideas that didn’t quite work out. I feel like there’s probably a bunch of cool stuff people have tried out and discarded that could be made to work across multiple papers. Would be fun and interesting too
joestacey.bsky.social
Was worried it was just me hating on it so much 🤣
joestacey.bsky.social
I’d love to see more diversity in the field, what kind of things were you thinking?
joestacey.bsky.social
Should I use an LLM to help refine my paper writing for the ARR deadline? 🤔🤔

It will improve the paper for sure, but it will probably also make the tone a whole lot more annoying
Reposted by Joe Stacey
juand-r.bsky.social
If you're at #NAACL2025 and want to hear about similarity effects for property inheritance in LMs, please stop by!

I will be presenting this work on Wednesday at the 11-12:30 poster session on Interpretability & analysis for language models (Hall 3).

aclanthology.org/2025.naacl-l...
joestacey.bsky.social
Looks so cool! I’m insanely jealous
joestacey.bsky.social
I’m not a fan of Musk, but imo there’s some really nice work here 🙂

Interested in the Washington post article, would you mind sharing a link?
Reposted by Joe Stacey
imperial-nlp.bsky.social
Excited to share our ICLR and NAACL papers! Please come and say hi, we're super friendly :)
joestacey.bsky.social
That’s an awesome paper 👍👍
joestacey.bsky.social
Wow, the old ITV Agatha Christie’s Poirot is brilliant. Some TV for 1989…

Gonna go binge-watch the 13 seasons now 😍
joestacey.bsky.social
Congratulations! It’s definitely worth experimenting with more concise responses in the future to see what kind of reaction you get.

Best of luck with your meta-reviews! 🤞