Deep Reinforcement Learning from Human Preferences
Authors: Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei
Main Contributions
The paper's main contributions can be divided into two parts:
- Define how to express human preferences between trajectory segments, and
- Fit a reward function to those human preferences
Human Preferences
The definition of trajectory segment \( \sigma \): a sequence of observations and actions
\[\sigma = \big((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})\big) \in \big(\mathcal{O} \times \mathcal{A}\big)^k\]
The human preference can be represented by
\[\sigma^1 \succ \sigma^2\]
which means that the human prefers trajectory segment \( \sigma^1 \) to trajectory segment \( \sigma^2 \)
This paper also provides
- A quantitative criterion, where preferences are generated by an underlying reward function \( r \):
\[\big((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)\big) \succ \big((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2)\big)\]
whenever
\[r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)\]
- A qualitative criterion, used when no reward function is available: evaluate behaviour by how well the agent satisfies the human's preferences
The preference relation is a partial order: for any pair of segments exactly one of the following holds
\[\sigma^1 = \sigma^2 \quad \text{or} \quad \sigma^1 \succ \sigma^2 \quad \text{or} \quad \sigma^2 \succ \sigma^1 \quad \text{or} \quad \sigma^1, \sigma^2 \ \text{not comparable}\]
Each human judgement is recorded in the database \( \mathcal{D} \) as a triple
\[\big( \ \sigma^1, \sigma^2, \mu \ \big) \quad \text{where} \quad \mu \in \{\ 1, 2 \ \}\]
- \( \sigma^1 = \sigma^2: \ \mu \ \text{is uniform over} \ \{1, 2\} \ ( \ \mathbb{P} \, [\mu=1] = \mathbb{P} \, [\mu=2] \ ) \)
- \( \sigma^1 \succ \sigma^2: \ \mu = 1 \)
- \( \sigma^2 \succ \sigma^1: \ \mu = 2 \)
- \( \sigma^1, \sigma^2 \ \text{not comparable}: \) the pair is not recorded in the database \( \mathcal{D} \)
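A minimal sketch of how entries of \( \mathcal{D} \) could be produced under the quantitative criterion above; the names (`Segment`, `Preference`, `label_pair`) are illustrative rather than taken from the paper:
```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = List[Tuple[object, object]]    # a length-k sequence of (observation, action) pairs

@dataclass
class Preference:
    sigma1: Segment
    sigma2: Segment
    mu: int                              # 1 or 2, as stored in the database D

def label_pair(sigma1: Segment, sigma2: Segment,
               r: Callable[[object, object], float]) -> Preference:
    """Quantitative labelling: prefer the segment with the larger summed reward r."""
    r1 = sum(r(o, a) for o, a in sigma1)
    r2 = sum(r(o, a) for o, a in sigma2)
    if r1 > r2:
        return Preference(sigma1, sigma2, mu=1)
    if r2 > r1:
        return Preference(sigma1, sigma2, mu=2)
    # Equally preferred: mu is uniform over {1, 2}
    return Preference(sigma1, sigma2, mu=random.choice([1, 2]))
```
With human labellers, the "not comparable" case simply produces no entry in \( \mathcal{D} \).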
Reward Learning
The training procedure runs three processes asynchronously:
- The policy \( \pi \) interacts with the environment to produce a set of trajectories \( \{\ \tau^1, \tau^2, \dots, \tau^i \ \} \). The parameters of \( \pi \) are updated by a traditional RL algorithm (e.g., A2C or TRPO) to maximise the sum of the predicted rewards \( r_t = \hat{r}(o_t, a_t) \)
- Pairs of segments \( (\sigma^1, \sigma^2) \) are selected from the trajectories \( \{\ \tau^1, \tau^2, \dots, \tau^i \ \} \) and sent to a human, who provides the preference label \( \mu \)
- The parameters of the mapping \( \hat{r} \) are optimised via supervised learning to fit the comparisons collected from the human (a simplified sketch of this loop follows the flow list below)
Flows
- Trajectories flow: \( 1 \rightarrow 2 \)
- Human comparisons flow: \( 2 \rightarrow 3 \)
- Parameters for \( \hat{r} \) flow: \( 3 \rightarrow 1 \)
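A simplified, synchronous sketch of how the three processes and the flows above fit together; in the paper the processes run asynchronously, and `policy`, `reward_model`, `ask_human`, and the segment length `k=25` are assumed interfaces and values, not the paper's actual code:
```python
import random

def sample_segment_pair(trajectories, k):
    """Placeholder: pick two random length-k segments from the rolled-out trajectories."""
    def random_segment():
        traj = random.choice(trajectories)
        start = random.randrange(max(1, len(traj) - k + 1))
        return traj[start:start + k]
    return random_segment(), random_segment()

def training_loop(env, policy, reward_model, ask_human, n_iterations=1000):
    preference_db = []                               # the database D of (sigma1, sigma2, mu)
    for _ in range(n_iterations):
        # (1) The policy interacts with the environment; rewards come from r_hat.
        trajectories = policy.rollout(env, reward_fn=reward_model.predict)
        policy.update(trajectories)                  # e.g. an A2C or TRPO step

        # (2) Pick a pair of length-k segments and ask the human for a label mu.
        sigma1, sigma2 = sample_segment_pair(trajectories, k=25)
        mu = ask_human(sigma1, sigma2)               # returns 1, 2, or None if not comparable
        if mu is not None:
            preference_db.append((sigma1, sigma2, mu))

        # (3) Fit r_hat to the collected comparisons via supervised learning.
        reward_model.fit(preference_db)
```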
Reward Function Fitting
Assume that the probability of the human preferring a segment depends exponentially on
the value of the latent reward summed over the length of the clip.
A reward function estimate \( \hat{r} \) then acts as a preference predictor
\[\hat{\mathbb{P}} \, [\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}\]
and \( \hat{r} \) is chosen to minimise the cross-entropy loss
between these predictions and the actual human labels
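A sketch of this predictor and loss, assuming `r_hat` is a small network mapping per-step (observation, action) features to a scalar reward; written with PyTorch for brevity, with names and tensor layout chosen for illustration only:
```python
import torch
import torch.nn.functional as F

def preference_loss(r_hat, seg1, seg2, mu):
    """Cross entropy between P_hat[sigma1 > sigma2] and the human label mu in {1, 2}.

    seg1, seg2: tensors of shape (k, feature_dim) holding the (o_t, a_t) features
    of the two segments; r_hat maps them to per-step scalar rewards.
    """
    sum1 = r_hat(seg1).sum()              # sum_t r_hat(o_t^1, a_t^1)
    sum2 = r_hat(seg2).sum()              # sum_t r_hat(o_t^2, a_t^2)
    logits = torch.stack([sum1, sum2])    # softmax over these gives P_hat above
    target = torch.tensor([mu - 1])       # map database labels {1, 2} to class indices {0, 1}
    return F.cross_entropy(logits.unsqueeze(0), target)
```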
Query Selection
Decide how to query preferences based on an approximation to the uncertainty in the reward function estimator
- Sample a large number of pairs of trajectory segments of length \( k \)
- Use each reward predictor in the ensemble of predictors to predict which segment will be preferred from each pair
- Select those trajectories for which the predictions have the highest variance across ensemble members
Ideally, queries would instead be chosen based on the expected value of information of each query
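A sketch of the variance heuristic from the list above, assuming `ensemble` is a list of independently trained reward predictors callable as `r_hat(o, a)`; the function name and interfaces are illustrative:
```python
import numpy as np

def select_queries(candidate_pairs, ensemble, n_queries):
    """Return the n_queries segment pairs whose predicted preference varies most across the ensemble."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        probs = []
        for r_hat in ensemble:
            s1 = sum(r_hat(o, a) for o, a in sigma1)
            s2 = sum(r_hat(o, a) for o, a in sigma2)
            probs.append(1.0 / (1.0 + np.exp(s2 - s1)))   # P_hat[sigma1 > sigma2], a sigmoid of the reward gap
        variances.append(np.var(probs))
    ranked = np.argsort(variances)[::-1]                  # highest disagreement first
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```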
Background: Cross Entropy Loss
Informational Value
The information value of one event depends on the degree to which this event is surprising:
- If a highly likely event occurs, the event carries very little information
- If a highly unlikely event occurs, the event is much more informative
Therefore, given a discrete random variable \( X \) and sample space \( \mathcal{X} = \{\ x_1, x_2, \dots, x_n \ \} \)
the information value of event \( X = x_i \) can be defined as
\[I(X = x_i) := -\log \, \mathbb{P}[X = x_i]\]
Entropy
Entropy is the expectation of the information value:
\[H(X) := \mathbb{E}\big[ -\log \, \mathbb{P}[X] \ \big] = - \sum_{i=1}^{n} \Big( \mathbb{P}[X = x_i] \ \cdot \ \log \, \mathbb{P}[X = x_i] \Big)\]
Note:
- Distributions that are close to deterministic have low entropy
- Distributions that are close to uniform have high entropy
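A quick numeric check of these two notes (using the natural logarithm):
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))     # uniform over two outcomes: ln 2 ≈ 0.693 (high)
print(entropy([0.99, 0.01]))   # nearly deterministic: ≈ 0.056 (low)
```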
Relative Entropy (Kullback-Leibler Divergence)
KL divergence is a statistical distance that measures
how one probability distribution \( P \) differs from a second probability distribution \( Q \)
where \( P \) and \( Q \) should be defined on the same sample space \( \mathcal{X} \)
This usually happens when using distribution \( Q \) as a model while the actual distribution is \( P \)
\[D_{KL}(P \ \parallel \ Q) := - \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \ - \ \Big( - \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, P[X = x_i] \Big)\]
\[= \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, \frac{P[X = x_i]}{Q[X = x_i]}\]
Note:
- \( D_{KL}(P \ \parallel \ Q) \neq D_{KL}(Q \ \parallel \ P) \)
- \( D_{KL}(P \ \parallel \ Q) \geq 0 \)
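A small numeric illustration of both notes:
```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl_divergence(p, q))   # ≈ 0.193, non-negative
print(kl_divergence(q, p))   # ≈ 0.223, a different value: KL is not symmetric
```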
Cross Entropy
The relative entropy can be rewritten as
\[D_{KL}(P \ \parallel \ Q) = -H(P) - \sum_{i=1}^{n} \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]
so that the cross entropy can be defined as
\[H(P, Q) := H(P) + D_{KL}(P \ \parallel \ Q) = - \sum_{i=1}^{n} \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]
Used as a loss over \( Q \), cross entropy differs from relative entropy only by the constant \( H(P) \), so minimising it is equivalent while avoiding the computation of \( H(P) \)
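A numeric check of the identity \( H(P, Q) = H(P) + D_{KL}(P \ \parallel \ Q) \):
```python
import numpy as np

p, q = np.array([0.8, 0.2]), np.array([0.5, 0.5])

cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
h_p           = -np.sum(p * np.log(p))      # H(P)
d_kl          =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

print(cross_entropy)                        # ≈ 0.693
print(h_p + d_kl)                           # ≈ 0.693, matching H(P) + D_KL
```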