@yanmingwan.bsky.social
yanmingwan.bsky.social
Overall, we propose CURIO to enhance personalization of LLMs in multi-turn dialogs, encouraging the model to actively learn user traits and adapt its responses accordingly.
This work was done with my awesome collaborators:
Jiaxing Wu, Marwa Abdulhai, Lior Shani, and @natashajaques.bsky.social
yanmingwan.bsky.social
Baselines and entropy-based rewards lead to "controlling behavior": the model earns high rewards by convincing the user to adopt a preference that is easier to cater to, rather than adhering to the user's ground-truth preference. "Grounded" rewards stop this reward hacking. (8/9)
yanmingwan.bsky.social
With a proper reward choice, CURIO models achieve personalization without compromising coherence or overall quality. The baseline is trained to optimize conversation quality using exactly the same prompt as the evaluation. DiffLogAcc outperforms the baseline and all other models. (7/9)
yanmingwan.bsky.social
The second task has a more complicated reward, where personalization matters for the model's performance in conducting the dialog but is not the ultimate goal. CURIO models with accuracy-based intrinsic rewards remain effective and significantly improve personalization. (6/9)
yanmingwan.bsky.social
On the first task, the agent needs to elicit user information over a multi-turn dialog before making a choice at the end. CURIO effectively enhances personalization and reduces the generalization gap by "learning how to learn" rather than memorizing superficial user details. (5/9)
yanmingwan.bsky.social
We discuss the following intrinsic rewards. Intuitively, we encourage gains in prediction accuracy or reductions in entropy. For the rewards that are instances of potential-based reward shaping, policy optimality is unaffected, and we hypothesize that they can accelerate training. (4/9)
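One way to write such rewards, using notation I'm assuming for illustration (b_t is the user model's belief over user types after turn t, u* the ground-truth type; these are not necessarily the paper's exact definitions):

```latex
% Assumed notation: b_t = belief over user types after turn t, u^* = ground-truth type.
r_t^{\mathrm{acc}} = \log b_t(u^*) - \log b_{t-1}(u^*) \quad \text{(gain in prediction accuracy)}
r_t^{\mathrm{ent}} = H(b_{t-1}) - H(b_t)               \quad \text{(reduction in belief entropy)}
% Both take the potential-based shaping form r_t = \Phi(s_t) - \Phi(s_{t-1}) (with \gamma = 1),
% which by Ng et al. (1999) leaves the optimal policy under the extrinsic reward unchanged.
```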
yanmingwan.bsky.social
Intrinsic motivation is well studied in RL, but applying it to LLMs is non-trivial. The policy and environment models engage in a multi-turn dialog, and a reward model gives an extrinsic reward at the end. On each turn, a user model updates its belief over the user type and computes an intrinsic reward. (3/9)
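A hedged sketch of how such a rollout could be wired up. All the callable names (respond_fn, simulate_user_fn, score_fn, belief_fn) are placeholders I made up for the policy LLM, the environment/user model, the reward model, and the user model; the inline intrinsic reward is just one possible choice.

```python
from typing import Callable, Dict, List, Tuple

Dialog = List[Tuple[str, str]]   # (speaker, message) pairs
Belief = Dict[str, float]        # probability per candidate user type

def rollout(respond_fn: Callable[[Dialog], str],
            simulate_user_fn: Callable[[Dialog], str],
            score_fn: Callable[[Dialog], float],
            belief_fn: Callable[[Dialog], Belief],
            true_user_type: str,
            num_turns: int) -> Tuple[Dialog, List[float], float]:
    dialog: Dialog = []
    prev_belief = belief_fn(dialog)                        # prior belief over user types
    intrinsic_rewards: List[float] = []
    for _ in range(num_turns):
        dialog.append(("agent", respond_fn(dialog)))       # policy model acts
        dialog.append(("user", simulate_user_fn(dialog)))  # environment (user) model replies
        belief = belief_fn(dialog)                         # user model re-predicts the user type
        # One possible per-turn intrinsic reward: gain in probability of the true type.
        intrinsic_rewards.append(belief[true_user_type] - prev_belief[true_user_type])
        prev_belief = belief
    extrinsic = score_fn(dialog)                           # sparse end-of-conversation reward
    return dialog, intrinsic_rewards, extrinsic
```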
yanmingwan.bsky.social
We leverage a user model to incorporate a curiosity reward into standard multi-turn RLHF. Rather than training the LLM only with a sparse end-of-conversation reward, we add a turn-based reward given by the improvement in the belief over the user type after each action. (2/9)
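A minimal sketch of that turn-based reward in Python. The function name, the use of log-probabilities, and the example user types are my own illustrative choices, not taken from the paper.

```python
import math
from typing import Dict

def curiosity_reward(belief_before: Dict[str, float],
                     belief_after: Dict[str, float],
                     true_user_type: str) -> float:
    """Reward = improvement in the user model's (log) belief in the true user type."""
    return math.log(belief_after[true_user_type]) - math.log(belief_before[true_user_type])

# Example: a good clarifying question raises belief in the true type from 0.25 to 0.40.
r = curiosity_reward({"budget": 0.25, "luxury": 0.75},
                     {"budget": 0.40, "luxury": 0.60},
                     true_user_type="budget")
print(round(r, 2))  # ~0.47
```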
yanmingwan.bsky.social
Personalization methods for LLMs often rely on extensive user history. We introduce Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO) to encourage actively learning about the user within multi-turn dialogs.
📜 arxiv.org/abs/2504.03206
🌎 sites.google.com/cs.washingto...