Whether you're training document classification models, conducting jurisprudential trend analysis, or developing legal information retrieval systems, this dataset provides a comprehensive foundation for French legal AI research without requiring direct API integration.
Each decision includes structured metadata like jurisdiction type, decision dates, chamber information, solution types, and case themes, alongside the full pseudonymized decision text.
If you're exploring computational legal research or building legal AI systems, take a look at Jurisprudence on the Hugging Face Hub by Antoine Jeannot.
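A minimal sketch of how you might poke at it with the `datasets` library; the repository id, config, and column names below are my guesses, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Hypothetical repo id; the Hub dataset may also require picking a config
# (e.g. a specific jurisdiction), see the dataset card for what exists.
ds = load_dataset("antoinejeannot/jurisprudence", split="train", streaming=True)

# Column names are illustrative guesses at the structured metadata
# (jurisdiction, date, chamber, solution, themes) plus the pseudonymized text.
for decision in ds.take(3):
    print(decision.get("jurisdiction"), decision.get("decision_date"), decision.get("chamber"))
    print(decision.get("solution"), decision.get("themes"))
    print((decision.get("text") or "")[:200], "...")
```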
- The task-difficulty correlation is concerning: on hard problems, exactly where structured reasoning matters most, models are least likely to follow the rules, and 27% compliance after fine-tuning is still low. We may need architecture-level changes or a fundamentally different training approach.
The best model scored only 25% on instruction following during its reasoning, versus 79% in final responses. Harder problems worsened compliance. Simple multi-turn feedback ("you didn't follow the instruction, try again") boosted scores by 17% on average.
Together AI researchers tested six reasoning models (GPT-OSS, Qwen3, DeepSeek-R1, GLM-4.5) on 300 math and science problems with verifiable instructions like word limits, multilingual reasoning, and JSON formatting.
Users need control over format, language, and length for transparency, auditability, and cost management. Current reasoning models excel at producing correct answers but fail to respect such constraints during their thinking process.
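To make "verifiable instructions" concrete, here is a rough sketch (my own, not the authors' evaluation harness) of two programmatic checks on a reasoning trace plus the simple retry feedback described above; `generate` is a placeholder for whatever chat model call you use:

```python
import json

def within_word_limit(reasoning: str, max_words: int = 150) -> bool:
    # Verifiable constraint: the thinking trace stays under a word budget.
    return len(reasoning.split()) <= max_words

def is_json_object(reasoning: str) -> bool:
    # Verifiable constraint: the thinking trace is a single JSON object.
    try:
        return isinstance(json.loads(reasoning), dict)
    except json.JSONDecodeError:
        return False

def solve_with_feedback(generate, problem: str, check, max_turns: int = 2):
    # `generate(messages)` is a placeholder returning (reasoning, answer).
    messages = [{"role": "user", "content": problem}]
    reasoning, answer = generate(messages)
    for _ in range(max_turns):
        if check(reasoning):
            break
        # The multi-turn feedback from the study: flag the violation and retry.
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "You didn't follow the instruction in your reasoning. Try again."},
        ]
        reasoning, answer = generate(messages)
    return answer
```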
Each question was also tested on humans: nobody got everything right on the first try, but once shown the actual answers, everyone agreed they made sense, which confirms that solving these problems takes real reasoning skill. They also released the judge model and prompt.
The team behind it included researchers and over a dozen college students who built and checked the questions, making sure they were genuinely hard for current AI models to solve.
AA-LCR is a set of 100 hard questions that require piecing together answers from several real-world documents, some of them very long, so you can't simply copy and paste the answers out of the text.
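Since the judge model and prompt are released, scoring presumably follows the standard LLM-as-judge pattern; the sketch below is a generic version of that pattern with a placeholder model and prompt, not the released ones:

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

JUDGE_PROMPT = """You are grading a long-context reasoning benchmark.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str, model: str = "gpt-4o-mini") -> bool:
    # Placeholder judge model; AA-LCR ships its own judge model and prompt.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```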
Instead of just producing reasoning followed by an answer, their method, Reinforcement Learning with Calibrated Rewards (RLCR), has the model generate an analysis after the answer and then verbalize its confidence.
Reinforcement Learning with Verifiable Rewards (RLVR) makes models overconfident (the Gemini family in particular seems to suffer from this: www.linkedin.com/posts/joeln...). How can we fix this?
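My reading of the calibrated reward is that it combines answer correctness with a proper scoring rule (a Brier-style penalty) on the verbalized confidence; the sketch below illustrates that idea, and the exact formula is my assumption rather than a quote from the paper:

```python
def calibrated_reward(is_correct: bool, stated_confidence: float) -> float:
    """Correctness reward minus a Brier-score penalty on the verbalized confidence.

    An overconfident wrong answer (confidence ~1.0) is punished far harder than an
    honest "I'm not sure" (confidence ~0.5), which is what pushes calibration.
    """
    correctness = 1.0 if is_correct else 0.0
    brier_penalty = (stated_confidence - correctness) ** 2
    return correctness - brier_penalty

def verifiable_reward(is_correct: bool) -> float:
    # A plain RLVR-style reward ignores confidence entirely.
    return 1.0 if is_correct else 0.0

print(calibrated_reward(False, 0.95))  # -0.9025: confidently wrong is heavily penalized
print(calibrated_reward(False, 0.30))  # -0.09: a hedged wrong answer loses much less
print(calibrated_reward(True, 0.90))   # 0.99: confident and correct is near the maximum
```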
This is especially useful for models that aren't served by external inference providers like Together AI or Fireworks AI. And setup only takes 20 seconds!
For example, I'm using them to rapidly iterate on rephrasing prompts without having to spin up a local vLLM server. Whenever one isn't in use, it automatically scales down to zero and doesn't incur any costs.
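Assuming the endpoints in question are Hugging Face Inference Endpoints (the excerpt doesn't name them), calling a deployed one from `huggingface_hub` looks roughly like this, with the URL as a placeholder:

```python
from huggingface_hub import InferenceClient

# Point the client at your dedicated endpoint URL (placeholder below).
client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")

# The first call after an idle stretch may be slow while the scaled-to-zero
# endpoint wakes back up; after that, requests behave like any hosted model.
response = client.chat_completion(
    messages=[{"role": "user", "content": "Rephrase: 'The quick brown fox jumps over the lazy dog.'"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

The slower cold start after idle periods is the trade-off for not paying anything while the endpoint sits unused.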