Deep Reinforcement Learning from Human Preferences
Authors: Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei
Main Contributions
The paper's main contributions can be divided into two parts:
- Define how to express human preferences between trajectory segments, and
- Fit a reward function to those human preferences
Human Preferences
The definition of trajectory segment \( \sigma \): a sequence of observations and actions
\[\sigma = \big((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})\big) \in \big(\mathcal{O} \times \mathcal{A}\big)^k\]
The human preference can be represented by
\[\sigma^1 \succ \sigma^2\]
which means that the human prefers trajectory segment \( \sigma^1 \) to trajectory segment \( \sigma^2 \)
This paper also provides
- A quantitative criterion, where preferences are generated by an underlying reward function \( r \):
\[\big((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)\big) \succ \big((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2)\big)\]
whenever
\[r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)\]
- A qualitative criterion, used when no reward function is available: evaluate behaviour by how well the agent satisfies the human's preferences
The preference relation is a partial order: for any pair of segments exactly one of the following holds
\[\sigma^1 = \sigma^2 \quad \text{or} \quad \sigma^1 \succ \sigma^2 \quad \text{or} \quad \sigma^2 \succ \sigma^1 \quad \text{or} \quad \sigma^1, \sigma^2 \ \text{not comparable}\]
Each human judgement is recorded in the database \( \mathcal{D} \) as a triple
\[\big( \ \sigma^1, \sigma^2, \mu \ \big) \quad \text{where} \quad \mu \in \{\ 1, 2 \ \}\]
- \( \sigma^1 = \sigma^2: \ \mu \ \text{is uniform over} \ \{1, 2\} \ ( \ \mathbb{P} \, [\mu=1] = \mathbb{P} \, [\mu=2] \ ) \)
- \( \sigma^1 \succ \sigma^2: \ \mu = 1 \)
- \( \sigma^2 \succ \sigma^1: \ \mu = 2 \)
- \( \sigma^1, \sigma^2 \ \text{not comparable}: \) the pair is not recorded in the database \( \mathcal{D} \)
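A minimal sketch of how entries of \( \mathcal{D} \) could be produced under the quantitative criterion above; the names (`Segment`, `Preference`, `label_pair`) are illustrative rather than taken from the paper:
```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = List[Tuple[object, object]]    # a length-k sequence of (observation, action) pairs

@dataclass
class Preference:
    sigma1: Segment
    sigma2: Segment
    mu: int                              # 1 or 2, as stored in the database D

def label_pair(sigma1: Segment, sigma2: Segment,
               r: Callable[[object, object], float]) -> Preference:
    """Quantitative labelling: prefer the segment with the larger summed reward r."""
    r1 = sum(r(o, a) for o, a in sigma1)
    r2 = sum(r(o, a) for o, a in sigma2)
    if r1 > r2:
        return Preference(sigma1, sigma2, mu=1)
    if r2 > r1:
        return Preference(sigma1, sigma2, mu=2)
    # Equally preferred: mu is uniform over {1, 2}
    return Preference(sigma1, sigma2, mu=random.choice([1, 2]))
```
With human labellers, the "not comparable" case simply produces no entry in \( \mathcal{D} \).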
Reward Learning
The training procedure runs three processes asynchronously:
- The policy \( \pi \) interacts with the environment to produce a set of trajectories \( \{\ \tau^1, \tau^2, \dots, \tau^i \ \} \). The parameters of \( \pi \) are updated by a traditional RL algorithm (e.g., A2C or TRPO) to maximise the sum of the predicted rewards \( r_t = \hat{r}(o_t, a_t) \)
- Pairs of segments \( (\sigma^1, \sigma^2) \) are selected from the trajectories \( \{\ \tau^1, \tau^2, \dots, \tau^i \ \} \) and sent to a human, who provides the preference label \( \mu \)
- The parameters of the mapping \( \hat{r} \) are optimised via supervised learning to fit the comparisons collected from the human (a simplified sketch of this loop follows the flow list below)
Flows
- Trajectories flow: \( 1 \rightarrow 2 \)
- Human comparisons flow: \( 2 \rightarrow 3 \)
- Parameters for \( \hat{r} \) flow: \( 3 \rightarrow 1 \)
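A simplified, synchronous sketch of how the three processes and the flows above fit together; in the paper the processes run asynchronously, and `policy`, `reward_model`, `ask_human`, and the segment length `k=25` are assumed interfaces and values, not the paper's actual code:
```python
import random

def sample_segment_pair(trajectories, k):
    """Placeholder: pick two random length-k segments from the rolled-out trajectories."""
    def random_segment():
        traj = random.choice(trajectories)
        start = random.randrange(max(1, len(traj) - k + 1))
        return traj[start:start + k]
    return random_segment(), random_segment()

def training_loop(env, policy, reward_model, ask_human, n_iterations=1000):
    preference_db = []                               # the database D of (sigma1, sigma2, mu)
    for _ in range(n_iterations):
        # (1) The policy interacts with the environment; rewards come from r_hat.
        trajectories = policy.rollout(env, reward_fn=reward_model.predict)
        policy.update(trajectories)                  # e.g. an A2C or TRPO step

        # (2) Pick a pair of length-k segments and ask the human for a label mu.
        sigma1, sigma2 = sample_segment_pair(trajectories, k=25)
        mu = ask_human(sigma1, sigma2)               # returns 1, 2, or None if not comparable
        if mu is not None:
            preference_db.append((sigma1, sigma2, mu))

        # (3) Fit r_hat to the collected comparisons via supervised learning.
        reward_model.fit(preference_db)
```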
Reward Function Fitting
Assume that the probability of the human preferring a segment depends exponentially on
the value of the latent reward summed over the length of the clip.
A reward function estimate \( \hat{r} \) then acts as a preference predictor
\[\hat{\mathbb{P}} \, [\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}\]
and \( \hat{r} \) is chosen to minimise the cross-entropy loss
between these predictions and the actual human labels
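A sketch of this predictor and loss, assuming `r_hat` is a small network mapping per-step (observation, action) features to a scalar reward; written with PyTorch for brevity, with names and tensor layout chosen for illustration only:
```python
import torch
import torch.nn.functional as F

def preference_loss(r_hat, seg1, seg2, mu):
    """Cross entropy between P_hat[sigma1 > sigma2] and the human label mu in {1, 2}.

    seg1, seg2: tensors of shape (k, feature_dim) holding the (o_t, a_t) features
    of the two segments; r_hat maps them to per-step scalar rewards.
    """
    sum1 = r_hat(seg1).sum()              # sum_t r_hat(o_t^1, a_t^1)
    sum2 = r_hat(seg2).sum()              # sum_t r_hat(o_t^2, a_t^2)
    logits = torch.stack([sum1, sum2])    # softmax over these gives P_hat above
    target = torch.tensor([mu - 1])       # map database labels {1, 2} to class indices {0, 1}
    return F.cross_entropy(logits.unsqueeze(0), target)
```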
Query Selection
Decide how to query preferences based on an approximation to the uncertainty in the reward function estimator
- Sample a large number of pairs of trajectory segments of length \( k \)
- Use each reward predictor in the ensemble of predictors to predict which segment will be preferred from each pair
- Select those trajectories for which the predictions have the highest variance across ensemble members
Ideally, queries would instead be chosen based on the expected value of information of each query
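A sketch of the variance heuristic from the list above, assuming `ensemble` is a list of independently trained reward predictors callable as `r_hat(o, a)`; the function name and interfaces are illustrative:
```python
import numpy as np

def select_queries(candidate_pairs, ensemble, n_queries):
    """Return the n_queries segment pairs whose predicted preference varies most across the ensemble."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        probs = []
        for r_hat in ensemble:
            s1 = sum(r_hat(o, a) for o, a in sigma1)
            s2 = sum(r_hat(o, a) for o, a in sigma2)
            probs.append(1.0 / (1.0 + np.exp(s2 - s1)))   # P_hat[sigma1 > sigma2], a sigmoid of the reward gap
        variances.append(np.var(probs))
    ranked = np.argsort(variances)[::-1]                  # highest disagreement first
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```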
Background: Cross Entropy Loss
Informational Value
The information value of one event depends on the degree to which this event is surprising:
- If a highly likely event occurs, the event carries very little information
- If a highly unlikely event occurs, the event is much more informative
Therefore, given a discrete random variable \( X \) and sample space \( \mathcal{X} = \{\ x_1, x_2, \dots, x_n \ \} \)
the information value of event \( X = x_i \) can be defined as
\[I(X = x_i) := -\log \, \mathbb{P}[X = x_i]\]
Entropy
Entropy is the expectation of the information value:
\[H(X) := \mathbb{E}\big[ -\log \, \mathbb{P}[X] \ \big] = - \sum_{i=1}^{n} \Big( \mathbb{P}[X = x_i] \ \cdot \ \log \, \mathbb{P}[X = x_i] \Big)\]
Note:
- Distributions that are close to deterministic have low entropy
- Distributions that are close to uniform have high entropy
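A quick numeric check of these two notes (using the natural logarithm):
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))     # uniform over two outcomes: ln 2 ≈ 0.693 (high)
print(entropy([0.99, 0.01]))   # nearly deterministic: ≈ 0.056 (low)
```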
Relative Entropy (Kullback-Leibler Divergence)
KL divergence is a statistical distance that measures
how one probability distribution \( P \) differs from a second probability distribution \( Q \)
where \( P \) and \( Q \) should be defined on the same sample space \( \mathcal{X} \)
This usually happens when using distribution \( Q \) as a model while the actual distribution is \( P \)
\[D_{KL}(P \ \parallel \ Q) := - \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \ - \ \Big( - \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, P[X = x_i] \Big)\]
\[= \sum_{i=1}^{n} P[X = x_i] \ \cdot \ \log \, \frac{P[X = x_i]}{Q[X = x_i]}\]
Note:
- \( D_{KL}(P \ \parallel \ Q) \neq D_{KL}(Q \ \parallel \ P) \)
- \( D_{KL}(P \ \parallel \ Q) \geq 0 \)
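A small numeric illustration of both notes:
```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl_divergence(p, q))   # ≈ 0.193, non-negative
print(kl_divergence(q, p))   # ≈ 0.223, a different value: KL is not symmetric
```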
Cross Entropy
The relative entropy can be rewritten as
\[D_{KL}(P \ \parallel \ Q) = -H(P) - \sum_{i=1}^{n} \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]
so that the cross entropy can be defined as
\[H(P, Q) := H(P) + D_{KL}(P \ \parallel \ Q) = - \sum_{i=1}^{n} \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]
Used as a loss over \( Q \), cross entropy differs from relative entropy only by the constant \( H(P) \), so minimising it is equivalent while avoiding the computation of \( H(P) \)
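A numeric check of the identity \( H(P, Q) = H(P) + D_{KL}(P \ \parallel \ Q) \):
```python
import numpy as np

p, q = np.array([0.8, 0.2]), np.array([0.5, 0.5])

cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
h_p           = -np.sum(p * np.log(p))      # H(P)
d_kl          =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

print(cross_entropy)                        # ≈ 0.693
print(h_p + d_kl)                           # ≈ 0.693, matching H(P) + D_KL
```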