@danielrbramblett.bsky.social
This is a step toward scalable, robust LLM evaluation — free from stale benchmarks, incomplete evaluation measures, and human bottlenecks.
Paper: aair-lab.github.io/Publications...
Catch us at poster session 4 on Friday April 25th at #ICLR (poster 318).
(3/3)
April 22, 2025 at 12:17 PM
Results show that truth-maintenance performance predicts performance on formal-language tasks such as reasoning. Testing both LLMs and LRMs reveals that SOTA models still struggle to maintain truth and are inaccurate semantic equivalence verifiers (a sketch of that task is below).
(2/3)
April 22, 2025 at 12:16 PM
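To make the second claim concrete: two propositional formulas are semantically equivalent when they agree under every truth assignment, so a brute-force verifier can simply enumerate all assignments. Below is a minimal Python sketch of that check; the formulas and helper names are illustrative stand-ins, not the paper's evaluation pipeline.

```python
# Minimal sketch of the semantic equivalence task described above:
# two propositional formulas are equivalent iff they agree on every
# truth assignment. Brute-force truth tables suffice for small formulas.
from itertools import product

def equivalent(f, g, variables):
    """Return True iff formulas f and g (callables over a dict of
    variable assignments) agree on all 2^n truth assignments."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if f(env) != g(env):
            return False
    return True

# Example: p -> q is equivalent to (not p) or q, but not to its converse q -> p.
implies  = lambda env: (not env["p"]) or env["q"]
disjunct = lambda env: (not env["p"]) or env["q"]
converse = lambda env: (not env["q"]) or env["p"]

print(equivalent(implies, disjunct, ["p", "q"]))  # True
print(equivalent(implies, converse, ["p", "q"]))  # False
```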
We are looking to broaden the scope to encode more diverse user preferences and constraints for a wider range of real-world problems.
Paper: aair-lab.github.io/Publications...
Catch us at the poster session on Wednesday evening (poster 6505)!
December 10, 2024 at 11:38 PM
Our theoretical analysis shows that the resulting problems are solvable within a finite number of evaluations, yielding an algorithm for finding an optimal user-aligned policy. Theoretically, it's probabilistically complete; empirically, it converges faster and more reliably than previous methods (toy illustration below).
December 10, 2024 at 11:37 PM
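For context on "probabilistically complete": a search is probabilistically complete if, whenever a solution exists, the probability of finding it approaches 1 as the number of evaluations grows. The Python sketch below illustrates only that definition, with a hypothetical finite policy space and alignment test; it is not the paper's algorithm.

```python
# Toy illustration of probabilistic completeness: if every candidate
# policy is sampled with nonzero probability, the chance of missing a
# user-aligned policy vanishes as the number of evaluations grows.
# The policy space and alignment test are hypothetical stand-ins.
import random

def sample_policy(policy_space):
    # Uniform sampling gives every candidate nonzero probability,
    # which is what makes this search probabilistically complete.
    return random.choice(policy_space)

def search(policy_space, is_aligned, max_evals=10_000):
    """Evaluate sampled policies until one satisfies the user's
    preferences; success probability -> 1 as max_evals grows."""
    for _ in range(max_evals):
        policy = sample_policy(policy_space)
        if is_aligned(policy):
            return policy
    return None  # no aligned policy found within the budget

# Hypothetical example: policies are integers, the user accepts multiples of 7.
print(search(list(range(100)), lambda p: p % 7 == 0))
```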