Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Published in EWRL 2025

Recommended citation: Barna Pasztor, Thomas Kleine Buening, Andreas Krause. Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game. EWRL 2025. https://openreview.net/pdf?id=If5eE5hCB5

Abstract: We propose Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which conditions its response on the Leader’s action. This formulation departs from prior approaches such as Reinforcement Learning from Human Feedback (RLHF), which relies on real-valued reward models, and Nash Learning from Human Feedback (NLHF), which seeks to compute a Nash equilibrium. The sequential structure of SLHF naturally enables test-time improvement, as the Follower learns to best respond to the Leader’s action. We compare the solution concepts of SLHF, RLHF, and NLHF, and highlight key advantages of SLHF in consistency, data sensitivity, and robustness to intransitive preferences. Our experiments demonstrate that SLHF effectively aligns large language models with diverse, potentially intransitive, human preferences, and that its test-time improvement generalizes across models without further training.
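
For intuition, one plausible way to write the Stackelberg solution concept behind such a sequential preference game is sketched below. This is an illustrative formalization based on the abstract, not necessarily the paper’s exact objective; the preference probability $\mathcal{P}(y \succ y' \mid x)$ and the policy symbols $\pi_L, \pi_F$ are assumed notation.

$$
\pi_F^{*}(\cdot \mid x, y_L) \in \arg\max_{\pi_F} \; \mathbb{E}_{y_F \sim \pi_F(\cdot \mid x, y_L)}\!\big[\mathcal{P}(y_F \succ y_L \mid x)\big],
\qquad
\pi_L^{*} \in \arg\max_{\pi_L} \; \mathbb{E}_{y_L \sim \pi_L(\cdot \mid x),\; y_F \sim \pi_F^{*}(\cdot \mid x, y_L)}\!\big[\mathcal{P}(y_L \succ y_F \mid x)\big].
$$

Under this reading, the Follower best-responds to whatever action the Leader commits to, and the Leader optimizes while anticipating that best response; as the abstract notes, applying the Follower’s best response at inference time is what yields test-time improvement without further training.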