joelniklaus.bsky.social
@joelniklaus.bsky.social
Whether you're training document classification models, conducting jurisprudential trend analysis, or developing legal information retrieval systems, this dataset provides a comprehensive foundation for French legal AI research without requiring direct API integration.
Each decision includes structured metadata like jurisdiction type, decision dates, chamber information, solution types, and case themes, alongside the full pseudonymized decision text.
This dataset democratizes access to French court decisions by converting data from the official Cour de Cassation API into analysis-ready formats.
If you're exploring computational legal research or building legal AI systems, take a look at Jurisprudence on the Hugging Face Hub by Antoine Jeannot.
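For a quick start, here's a minimal sketch of loading it with the `datasets` library. The dataset ID and column names below are my assumptions; check the dataset card on the Hub for the exact identifiers.

```python
# Minimal sketch: loading the Jurisprudence dataset with the `datasets` library.
# The dataset ID and column names are assumptions; verify them on the dataset card.
from datasets import load_dataset

ds = load_dataset("antoinejeannot/jurisprudence", split="train")  # hypothetical ID

# Inspect the structured metadata that accompanies each decision.
print(ds.column_names)

# Example: keep only Cour de cassation decisions
# (assuming a `jurisdiction` field exists).
subset = ds.filter(lambda row: row["jurisdiction"] == "Cour de cassation")
print(len(subset))
```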
- Related work on MathIF found the same tension, suggesting the problem is systemic to how we train reasoning models

Cool work by Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, and James Zou
- The task-difficulty correlation is concerning: when you most need structured reasoning (hard problems), models are least likely to follow the rules
- 27% after finetuning is still low. We might need architecture-level changes or fundamentally different training
Finetuning improved GPT-OSS-20B's instruction following from 11% to 27%.

Some thoughts:
- The multi-turn result is wild. Just telling the model it failed doubles compliance in some cases without any training
The best model scored 25% on instruction following during reasoning versus 79% in final responses. Harder problems worsened compliance. Simple multi-turn feedback ("you didn't follow the instruction, try again") boosted scores by 17% on average.
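That feedback protocol is trivial to implement. Here is a sketch using the OpenAI-compatible chat API; the model name and the toy word-limit verifier are placeholders, not the paper's harness.

```python
# Sketch of the multi-turn feedback loop described above: if the output
# violates the instruction, tell the model so and let it retry once.
# The model name and toy verifier are placeholders, not the paper's code.
from openai import OpenAI

client = OpenAI()

def follows_word_limit(text: str, limit: int = 200) -> bool:
    """Toy verifier: does the text respect the word limit?"""
    return len(text.split()) <= limit

messages = [{
    "role": "user",
    "content": "Solve the problem. Keep your reasoning under 200 words.",
}]
for _ in range(2):  # initial attempt plus one feedback retry
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    text = reply.choices[0].message.content
    if follows_word_limit(text):
        break
    messages += [
        {"role": "assistant", "content": text},
        {"role": "user", "content": "You didn't follow the instruction, try again."},
    ]
```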
Together AI researchers tested six reasoning models (GPT-OSS, Qwen3, DeepSeek-R1, and GLM-4.5) on 300 math and science problems with verifiable instructions like word limits, multilingual reasoning, and JSON formatting.
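"Verifiable" here means each constraint can be checked mechanically. A sketch of such checkers, assuming the reasoning trace arrives wrapped in <think> tags as in many open reasoning models:

```python
# Sketch of mechanical checkers for constraint types like those above.
# Assumes the reasoning trace is wrapped in <think>...</think> tags, as in
# many open reasoning models; adapt the extraction to your model's format.
import json
import re

def extract_reasoning(output: str) -> str:
    """Pull the reasoning trace out of the raw model output."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return match.group(1) if match else ""

def within_word_limit(reasoning: str, limit: int) -> bool:
    """Word-limit constraint on the reasoning itself."""
    return len(reasoning.split()) <= limit

def answer_is_valid_json(final_answer: str) -> bool:
    """JSON-formatting constraint on the final answer."""
    try:
        json.loads(final_answer)
        return True
    except json.JSONDecodeError:
        return False
```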
Users need control over format, language, and length for transparency, auditability, and cost management. Current reasoning models excel at correct answers but fail to respect constraints during their thinking process.
Reasoning models excel at math but struggle with simple requests like word limits during thinking

TLDR: Models ignore user instructions while reasoning despite following them in final outputs.
Check it out on the Hugging Face hub!
Each question was also tested on humans. Nobody got everything right on the first try, but once shown the actual answers, everyone agreed they made sense, confirming that solving these problems takes real reasoning skill. They also release the judge model and prompt.
The team behind it included researchers and over a dozen college students who built and checked the questions, making sure they were genuinely hard for current AI models to solve.
Cool long-context eval by Artificial Analysis!

AA-LCR is a set of 100 tough questions where you need to piece together answers from several real-world documents—sometimes really big ones—so you can’t just copy and paste the answers.
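Since the answers are free-form, grading relies on the released judge model and prompt mentioned above. A minimal LLM-as-judge sketch; the prompt wording and judge model here are placeholders, not the official ones:

```python
# Minimal LLM-as-judge sketch for grading free-form answers against a
# reference, in the spirit of the released judge setup. The prompt text and
# model name are placeholders, not the official AA-LCR judge.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate answer match the reference? Reply 'yes' or 'no'."
)

def judge(question: str, reference: str, candidate: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```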
They even prove that this comes at no cost to accuracy, just improved calibration.
In their method, Reinforcement Learning with Calibrated Rewards (RLCR), the model doesn't just reason and then answer: after the answer, it generates an analysis and then verbalizes its confidence.
They proposed a simple but elegant method that optimizes for correctness and calibration at the same time.
Very cool work by researchers from the Massachusetts Institute of Technology.
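The idea compresses into a tiny reward function: reward correctness and penalize miscalibrated verbalized confidence with a Brier term. This is my sketch of the spirit of RLCR; the paper's exact formula may differ.

```python
# Sketch of a calibrated reward in the RLCR spirit: correctness indicator
# minus a Brier penalty on the verbalized confidence. The exact combination
# in the paper may differ; this only illustrates the idea.
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """confidence is the model's verbalized probability that its answer is right."""
    correct = 1.0 if is_correct else 0.0
    brier = (confidence - correct) ** 2  # 0 when perfectly calibrated
    return correct - brier

# A confident correct answer scores high; a confident wrong answer is
# penalized far more than an unconfident wrong one.
print(calibrated_reward(True, 0.9))   # 0.99
print(calibrated_reward(False, 0.9))  # -0.81
print(calibrated_reward(False, 0.1))  # -0.01
```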
This is especially useful for models that aren't served by external inference providers like Together AI or Fireworks AI. And setup only takes 20 seconds!
For example, I'm using them to rapidly iterate on rephrasing prompts without having to worry about spinning up a local vLLM server. Whenever I'm not using it, it automatically scales down to zero and doesn't incur any costs.
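Assuming "them" refers to Hugging Face Inference Endpoints (the thread's root post isn't shown here), querying one looks roughly like this; the endpoint URL is a placeholder.

```python
# Hedged sketch: querying a dedicated Hugging Face Inference Endpoint (my
# reading of "them" above). The endpoint URL is a placeholder. A
# scale-to-zero endpoint can return a 503 while it cold-starts, so real
# code should retry on that status.
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")

result = client.chat_completion(
    messages=[{"role": "user", "content": "Rephrase: the cat sat on the mat."}],
    max_tokens=64,
)
print(result.choices[0].message.content)
```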