Raphaël Avalos
raphael.avalos.fr
Fine-tuning LLMs @Cohere | PhD Candidate on RL @VUB
I’m not sure about 1, but you could look into results on belief MDPs.
For 2, consider an environment with two rooms where the agent needs to press a different button in each room to get the optimal reward. If there's a cheap way to determine which room the agent is in, the optimal policy is to gather that information first and then press the right button :)
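To make the two-room example concrete, here's a minimal sketch comparing "press a button blindly" with "pay to find out the room, then press". All numbers and names (P_ROOM_A, REWARD, SENSE_COST, the function names) are made up for illustration; nothing here comes from a specific paper or library.

```python
# Hypothetical numbers, chosen only for illustration.
P_ROOM_A = 0.5    # prior probability that the agent starts in room A
REWARD = 1.0      # reward for pressing the button that matches the room
SENSE_COST = 0.1  # price of the action that reveals the current room


def value_blind(p_room_a: float = P_ROOM_A) -> float:
    """Best expected reward when committing to one button without looking."""
    return REWARD * max(p_room_a, 1.0 - p_room_a)


def value_sense_first(cost: float = SENSE_COST) -> float:
    """Expected reward when paying to observe the room, then pressing correctly."""
    return REWARD - cost


def best_policy(p_room_a: float = P_ROOM_A, cost: float = SENSE_COST) -> str:
    """Name of the better of the two policies under the given prior and cost."""
    if value_sense_first(cost) > value_blind(p_room_a):
        return "sense_first"
    return "blind"
```

With a uniform prior, sensing wins as long as it costs less than half the reward; once the cost goes above that, guessing blindly is better.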
November 25, 2024 at 11:00 PM
Bsky’s strength lies in being open-source and federated. This lets anyone host servers, set moderation policies, and create custom feeds, while removing the incentive to let bots survive. It’s a tough challenge, but there’s hope!
November 25, 2024 at 10:47 PM
IMO, UCB favors exploration, not information-seeking, as it adds an exploration bonus rather than aiming to reduce state uncertainty. However, effective exploration can uncover policies where gathering information leads to better outcomes.
Hope that helps!
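To show what I mean by "exploration bonus", here's a rough sketch of the standard UCB1 index for a bandit. The function names and the constant c=2.0 are my own choices for illustration, and it assumes every arm has been pulled at least once.

```python
import math


def ucb1_scores(means, counts, t, c=2.0):
    """UCB1 index per arm: empirical mean plus a count-based bonus.

    Assumes every count is > 0 and t >= 1.
    """
    return [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]


def select_arm(means, counts, t):
    """Index of the arm with the highest UCB1 score."""
    scores = ucb1_scores(means, counts, t)
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus only depends on how often each arm was tried. There's no belief over a hidden state anywhere in it, which is why I'd call it exploration rather than information-seeking.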
November 25, 2024 at 9:25 PM
I am down!
November 25, 2024 at 6:45 AM
Would love to be added!
November 22, 2024 at 11:16 AM