joelniklaus.bsky.social
@joelniklaus.bsky.social
Whether you're training document classification models, conducting jurisprudential trend analysis, or developing legal information retrieval systems, this dataset provides a comprehensive foundation for French legal AI research without requiring direct API integration.
Each decision includes structured metadata like jurisdiction type, decision dates, chamber information, solution types, and case themes, alongside the full pseudonymized decision text.
This dataset democratizes access to French court decisions by converting data from the official Cour de Cassation API into analysis-ready formats.
If you're exploring computational legal research or building legal AI systems, take a look at Jurisprudence on the Hugging Face Hub by Antoine Jeannot.
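For a quick start, here's a minimal sketch of loading it with the `datasets` library. The dataset ID and column names below are my assumptions; check the dataset card on the Hub for the exact identifiers.

```python
# Minimal sketch: loading the Jurisprudence dataset with the `datasets` library.
# The dataset ID and column names are assumptions; verify them on the dataset card.
from datasets import load_dataset

ds = load_dataset("antoinejeannot/jurisprudence", split="train")  # hypothetical ID

# Inspect the structured metadata that accompanies each decision.
print(ds.column_names)

# Example: keep only Cour de cassation decisions
# (assuming a `jurisdiction` field exists).
subset = ds.filter(lambda row: row["jurisdiction"] == "Cour de cassation")
print(len(subset))
```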
- Related work on MathIF found the same tension, suggesting the problem is systemic to how we train reasoning models

Cool work by Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, and James Zou
- The task-difficulty correlation is concerning: when you most need structured reasoning (hard problems), models are least likely to follow the rules
- 27% after finetuning is still low. We might need architecture-level changes or fundamentally different training
Finetuning improved GPT-OSS-20B's instruction following from 11% to 27%.

Some thoughts:
- The multi-turn result is wild. Just telling the model it failed doubles compliance in some cases without any training
The best model scored 25% on instruction following during reasoning versus 79% in final responses. Harder problems worsened compliance. Simple multi-turn feedback ("you didn't follow the instruction, try again") boosted scores by 17% on average.
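That feedback protocol is trivial to implement. Here is a sketch using the OpenAI-compatible chat API; the model name and the toy word-limit verifier are placeholders, not the paper's harness.

```python
# Sketch of the multi-turn feedback loop described above: if the output
# violates the instruction, tell the model so and let it retry once.
# The model name and toy verifier are placeholders, not the paper's code.
from openai import OpenAI

client = OpenAI()

def follows_word_limit(text: str, limit: int = 200) -> bool:
    """Toy verifier: does the text respect the word limit?"""
    return len(text.split()) <= limit

messages = [{
    "role": "user",
    "content": "Solve the problem. Keep your reasoning under 200 words.",
}]
for _ in range(2):  # initial attempt plus one feedback retry
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    text = reply.choices[0].message.content
    if follows_word_limit(text):
        break
    messages += [
        {"role": "assistant", "content": text},
        {"role": "user", "content": "You didn't follow the instruction, try again."},
    ]
```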
Together AI researchers tested six reasoning models (GPT-OSS, Qwen3, DeepSeek-R1, and GLM-4.5) on 300 math and science problems with verifiable instructions like word limits, multilingual reasoning, and JSON formatting.
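"Verifiable" here means each constraint can be checked mechanically. A sketch of such checkers, assuming the reasoning trace arrives wrapped in <think> tags as in many open reasoning models:

```python
# Sketch of mechanical checkers for constraint types like those above.
# Assumes the reasoning trace is wrapped in <think>...</think> tags, as in
# many open reasoning models; adapt the extraction to your model's format.
import json
import re

def extract_reasoning(output: str) -> str:
    """Pull the reasoning trace out of the raw model output."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return match.group(1) if match else ""

def within_word_limit(reasoning: str, limit: int) -> bool:
    """Word-limit constraint on the reasoning itself."""
    return len(reasoning.split()) <= limit

def answer_is_valid_json(final_answer: str) -> bool:
    """JSON-formatting constraint on the final answer."""
    try:
        json.loads(final_answer)
        return True
    except json.JSONDecodeError:
        return False
```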
Users need control over format, language, and length for transparency, auditability, and cost management. Current reasoning models excel at correct answers but fail to respect constraints during their thinking process.
Reasoning models excel at math but struggle with simple requests like word limits during thinking

TLDR: Models ignore user instructions while reasoning despite following them in final outputs.
Check it out on the Hugging Face hub!
Each question was also tested on humans. Nobody got everything right on the first try, but once shown the actual answers, everyone agreed they made sense, confirming that solving these problems takes real reasoning skill. They also release the judge model and prompt.
The team behind it included researchers and over a dozen college students who built and checked the questions, making sure they were genuinely hard for current AI models to solve.
Cool long-context eval by Artificial Analysis!

AA-LCR is a set of 100 tough questions where you need to piece together answers from several real-world documents—sometimes really big ones—so you can’t just copy and paste the answers.
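Since the answers are free-form, grading relies on the released judge model and prompt mentioned above. A minimal LLM-as-judge sketch; the prompt wording and judge model here are placeholders, not the official ones:

```python
# Minimal LLM-as-judge sketch for grading free-form answers against a
# reference, in the spirit of the released judge setup. The prompt text and
# model name are placeholders, not the official AA-LCR judge.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate answer match the reference? Reply 'yes' or 'no'."
)

def judge(question: str, reference: str, candidate: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```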
They even prove that this comes at no cost to accuracy, just improved calibration.
In their method, Reinforcement Learning with Calibrated Rewards (RLCR), the model doesn't just reason and then answer: after the answer, it generates an analysis and then verbalizes its confidence.
They proposed a simple but elegant method that optimizes for correctness and calibration at the same time.
Very cool work by researchers from the Massachusetts Institute of Technology.
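The idea compresses into a tiny reward function: reward correctness and penalize miscalibrated verbalized confidence with a Brier term. This is my sketch of the spirit of RLCR; the paper's exact formula may differ.

```python
# Sketch of a calibrated reward in the RLCR spirit: correctness indicator
# minus a Brier penalty on the verbalized confidence. The exact combination
# in the paper may differ; this only illustrates the idea.
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """confidence is the model's verbalized probability that its answer is right."""
    correct = 1.0 if is_correct else 0.0
    brier = (confidence - correct) ** 2  # 0 when perfectly calibrated
    return correct - brier

# A confident correct answer scores high; a confident wrong answer is
# penalized far more than an unconfident wrong one.
print(calibrated_reward(True, 0.9))   # 0.99
print(calibrated_reward(False, 0.9))  # -0.81
print(calibrated_reward(False, 0.1))  # -0.01
```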
This is especially useful for models that aren't served by external inference providers like Together AI or Fireworks AI. And setup only takes 20 seconds!
For example, I'm using them to rapidly iterate on rephrasing prompts without having to worry about spinning up a local vLLM server. Whenever I'm not using it, it automatically scales down to zero and doesn't incur any costs.
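Assuming "them" refers to Hugging Face Inference Endpoints (the thread's root post isn't shown here), querying one looks roughly like this; the endpoint URL is a placeholder.

```python
# Hedged sketch: querying a dedicated Hugging Face Inference Endpoint (my
# reading of "them" above). The endpoint URL is a placeholder. A
# scale-to-zero endpoint can return a 503 while it cold-starts, so real
# code should retry on that status.
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")

result = client.chat_completion(
    messages=[{"role": "user", "content": "Rephrase: the cat sat on the mat."}],
    max_tokens=64,
)
print(result.choices[0].message.content)
```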