7 minute read

Authors: Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei


Main Contributions

The paper's main contributions can be divided into the following parts:

  • Define how to express human preferences between trajectory segments, and

  • Fit the reward function to human preferences


Human Preferences

A trajectory segment \( \sigma \) is defined as a sequence of observation-action pairs

\[\sigma = \big((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})\big) \in \big(\mathcal{O} \ \times \ \mathcal{A} \big)^k\]

The human preference can be represented by

\[\sigma^1 \succ \sigma^2\]

which means that the human prefers trajectory segment \( \sigma^1 \) to trajectory segment \( \sigma^2 \)

The paper also provides two ways to evaluate the agent's behaviour against these preferences:

  • A quantitative method, for preferences generated by a reward function \( r \) (a toy check of this criterion appears after the list):

    \[\big((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)\big) \succ \big((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2)\big)\]

    whenever

    \[r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)\]
  • A qualitative method, for tasks where no reward function is available:
    a human judges how well the agent's behaviour satisfies their preferences
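
As a toy illustration of the quantitative criterion (not from the paper), the check below compares two hand-made segments under an assumed reward function \( r \):

```python
# Toy check of the quantitative criterion (the reward function and segments
# below are made-up values, not from the paper).
def total_reward(segment, r):
    """Sum the reward over the (observation, action) pairs of a segment."""
    return sum(r(o, a) for o, a in segment)

r = lambda o, a: o + 0.1 * a                 # assumed toy reward function
sigma1 = [(1.0, 0), (2.0, 1), (1.5, 0)]      # (observation, action) pairs
sigma2 = [(0.5, 1), (1.0, 0), (0.5, 1)]

# sigma1 is preferred whenever its summed reward is larger
print(total_reward(sigma1, r) > total_reward(sigma2, r))  # True
```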

This preference relation is a partial order: for each pair of segments, exactly one of the following holds
(here \( \sigma^1 = \sigma^2 \) means the two segments are equally preferable)

\[\sigma^1 = \sigma^2 \quad \text{or} \quad \sigma^1 \succ \sigma^2 \quad \text{or} \quad \sigma^2 \succ \sigma^1 \quad \text{or} \quad \sigma^1, \sigma^2 \ \text{not comparable}\]

The human preferences are recorded in a database \( \mathcal{D} \) as triples

\[\big( \ \sigma^1, \sigma^2, \mu \ \big) \quad \text{where} \quad \mu \ \text{is a distribution over} \ \{\ 1, 2 \ \}\]
  • \( \sigma^1 = \sigma^2: \ \mu \ \text{is uniform over} \ \{1, 2\} \ ( \ \mathbb{P} \, [\mu=1] = \mathbb{P} \, [\mu=2] \ ) \)

  • \( \sigma^1 \succ \sigma^2: \ \mu = 1 \)

  • \( \sigma^2 \succ \sigma^1: \ \mu = 2 \)

  • \( \sigma^1, \sigma^2 \ \text{not comparable}: \ \text{not recorded in the database} \ \mathcal{D} \)
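
A minimal sketch of how such a database could be stored; the `Comparison` record and the `record` helper are hypothetical names introduced here for illustration, only the mapping from judgements to \( \mu \) follows the rules above:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record type for one comparison; "segment" is left abstract here
# (in the paper it is a short clip of (observation, action) pairs).
@dataclass
class Comparison:
    segment_1: list          # sigma^1
    segment_2: list          # sigma^2
    mu: Tuple[float, float]  # distribution over {1, 2}

database: List[Comparison] = []

def record(seg1, seg2, judgement: str):
    """Store a human judgement; incomparable pairs are simply not recorded."""
    if judgement == "prefer_1":
        database.append(Comparison(seg1, seg2, (1.0, 0.0)))
    elif judgement == "prefer_2":
        database.append(Comparison(seg1, seg2, (0.0, 1.0)))
    elif judgement == "equal":
        database.append(Comparison(seg1, seg2, (0.5, 0.5)))
    # "incomparable": do nothing

record(["s1 steps"], ["s2 steps"], "equal")
print(database)
```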


Reward Learning

The policy \( \pi \) and the reward estimate \( \hat{r} \) are both neural networks, updated by three asynchronous processes:

  1. The policy \( \pi \) interacts with the environment to generate a set of trajectories

    \[\{ \ \tau^1, \tau^2, \dots, \tau^i \ \}\]

    Note: The parameters of \( \pi \) are updated by a traditional RL algorithm (e.g., A2C or TRPO)
    to maximize the sum of the predicted rewards

    \[r_t = \hat{r}(o_t, a_t)\]
  2. Select pairs of segments \( ( \sigma^1, \sigma^2) \) from the trajectories \( \{ \ \tau^1, \tau^2, \dots, \tau^i \ \} \)
    and send them to a human, who provides the preference labels \( \mu \)

  3. The parameters of the mapping \( \hat{r} \) are optimized via Supervised Learning
    to fit the comparisons collected from the human

Flows

  • Trajectories flow: \( 1 \rightarrow 2 \)

  • Human comparisons flow: \( 2 \rightarrow 3 \)

  • Parameters for \( \hat{r} \) flow: \( 3 \rightarrow 1 \)
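
The sketch below ties the three processes together in a single synchronous loop (the paper runs them asynchronously). The environment, the human labeller, the linear reward model, and the `update_policy` step are all toy stand-ins introduced here for illustration, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, SEG_LEN = 4, 10

# --- toy stand-ins (assumptions for illustration, not the paper's components) ---
def run_policy(theta, n_segments=8):
    """Process 1: roll out the current policy and return trajectory segments.
    Here a segment is just a (SEG_LEN, OBS_DIM) array of fake observations."""
    return [rng.normal(size=(SEG_LEN, OBS_DIM)) for _ in range(n_segments)]

def human_label(seg1, seg2):
    """Process 2: stand-in for the human comparison; prefers the segment whose
    first feature is larger overall. Returns mu as a distribution over {1, 2}."""
    s1, s2 = seg1[:, 0].sum(), seg2[:, 0].sum()
    if np.isclose(s1, s2):
        return np.array([0.5, 0.5])
    return np.array([1.0, 0.0]) if s1 > s2 else np.array([0.0, 1.0])

def reward_hat(w, seg):
    """Toy linear reward model: r_hat(o) = w . o for each step of the segment."""
    return seg @ w  # per-step rewards, shape (SEG_LEN,)

def fit_reward(w, dataset, lr=0.1, steps=50):
    """Process 3: fit r_hat to the comparisons with the cross-entropy loss."""
    for _ in range(steps):
        grad = np.zeros_like(w)
        for seg1, seg2, mu in dataset:
            z = np.array([reward_hat(w, seg1).sum(), reward_hat(w, seg2).sum()])
            p = np.exp(z - z.max()); p /= p.sum()          # softmax over the two sums
            coeff = p - mu                                  # d(loss)/dz
            grad += coeff[0] * seg1.sum(axis=0) + coeff[1] * seg2.sum(axis=0)
        w -= lr * grad / len(dataset)
    return w

def update_policy(theta, w):
    """Stand-in for the RL step (A2C/TRPO in the paper): nudge the 'policy'
    toward what the learned reward likes."""
    return theta + 0.1 * w

# --- the loop: trajectories -> comparisons -> reward fit -> policy update ---
theta, w, dataset = np.zeros(OBS_DIM), np.zeros(OBS_DIM), []
for iteration in range(5):
    segments = run_policy(theta)                                   # flow 1 -> 2
    for _ in range(4):
        i, j = rng.choice(len(segments), size=2, replace=False)
        dataset.append((segments[i], segments[j], human_label(segments[i], segments[j])))
    w = fit_reward(w, dataset)                                     # flow 2 -> 3
    theta = update_policy(theta, w)                                # flow 3 -> 1
print("learned reward weights:", np.round(w, 2))
```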

Reward Function Fitting

Assume that the probability of a human preferring a segment depends exponentially on
the value of the latent reward summed over the length of the video clip.

The reward function estimate \( \hat{r} \) can then be used as a preference predictor:

\[\hat{\mathbb{P}} \, [\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}\]

We then choose \( \hat{r} \) to minimize the cross-entropy loss
between these predictions and the actual human labels

\[\mathrm{loss}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu) \ \in \ \mathcal{D}} \Big( \mu(1) \ \cdot \ \log \hat{\mathbb{P}} \, [\sigma^1 \succ \sigma^2] \ + \ \mu(2) \ \cdot \ \log \hat{\mathbb{P}} \, [\sigma^2 \succ \sigma^1] \Big)\]
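
A small sketch of this loss, assuming the per-step predicted rewards for each segment are already available and \( \mu \) is stored as a pair of probabilities; the log-softmax is computed in a numerically stable way:

```python
import numpy as np

def preference_logprob(r_hat_seg1, r_hat_seg2):
    """log P_hat[sigma1 > sigma2] under the exponential (softmax) preference model,
    applied to the two summed predicted rewards."""
    z = np.array([np.sum(r_hat_seg1), np.sum(r_hat_seg2)])
    log_p = z - (z.max() + np.log(np.exp(z - z.max()).sum()))  # stable log-softmax
    return log_p  # log_p[0] = log P[s1 > s2], log_p[1] = log P[s2 > s1]

def preference_loss(dataset):
    """Cross-entropy loss over a database D of (per-step r_hat on sigma1,
    per-step r_hat on sigma2, mu) triples; mu is a distribution over {1, 2}."""
    total = 0.0
    for r1, r2, mu in dataset:
        log_p = preference_logprob(r1, r2)
        total -= mu[0] * log_p[0] + mu[1] * log_p[1]
    return total

# toy example: the first segment has clearly higher predicted reward
dataset = [(np.array([1.0, 0.8, 1.2]), np.array([0.1, 0.0, 0.2]), np.array([1.0, 0.0]))]
print(preference_loss(dataset))  # small loss: the prediction agrees with the label
```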

Query Selection

Queries are selected based on an approximation to the uncertainty in the reward function estimator:

  • Sample a large number of pairs of trajectory segments of length \( k \)

  • Use each reward predictor in the ensemble of predictors to
    predict which segment will be preferred from each pair

  • Select those segment pairs for which the predictions
    have the highest variance across ensemble members

Ideally, queries would be selected based on the expected value of information of the query
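
A toy sketch of the variance-based selection above, assuming a hypothetical ensemble of linear reward models and randomly generated candidate segment pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def predicted_pref_prob(w, seg1, seg2):
    """P_hat[sigma1 > sigma2] for one ensemble member with a toy linear reward w."""
    z = np.array([(seg1 @ w).sum(), (seg2 @ w).sum()])
    p = np.exp(z - z.max()); p /= p.sum()
    return p[0]

# toy ensemble of reward predictors and a pool of candidate segment pairs (assumptions)
ensemble = [rng.normal(size=4) for _ in range(5)]
pairs = [(rng.normal(size=(10, 4)), rng.normal(size=(10, 4))) for _ in range(100)]

# variance of the ensemble's predictions for each pair; query the most disputed ones
variances = [np.var([predicted_pref_prob(w, s1, s2) for w in ensemble]) for s1, s2 in pairs]
queries = np.argsort(variances)[-5:]   # indices of the 5 highest-variance pairs
print("pairs to send to the human:", queries)
```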


Background: Cross Entropy Loss

Informational Value

The information value of one event depends on the degree to which this event is surprising:

  • If a highly likely event occurs, the event carries very little information

  • If a highly unlikely event occurs, the event is much more informative

Therefore, given a discrete random variable \( X \) and sample space \( \mathcal{X} = \{ \ x_1, x_2, \dots, x_n \ \} \)

the information value of event \( X = x_i \) can be defined as

\[I(X = x_i) := -\log \, \mathbb{P}[X = x_i]\]
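
A quick numerical check of this definition (values in nats):

```python
import numpy as np

# A likely event carries little information; a rare event carries a lot.
for p in (0.99, 0.5, 0.01):
    print(f"P[X = x_i] = {p}: I = {-np.log(p):.3f}")
```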

Entropy

Entropy is the expectation of the information value:

\[H(X) := \mathbb{E}\big[ -\log \, \mathbb{P}[X = x_i] \ \big] = - \sum_{i=1}^n \Big( \mathbb{P}[X = x_i] \ \cdot \ \log \, \mathbb{P}[X = x_i] \Big)\]

Note:

  • Distributions that are close to deterministic have low entropy

  • Distributions that are close to uniform have high entropy
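
A small numerical check of these two notes:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

print(entropy([0.98, 0.01, 0.005, 0.005]))  # near-deterministic: low entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))    # uniform: high entropy (= log 4)
```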

Relative Entropy (Kullback-Leibler Divergence)

KL divergence is a type of statistical distance that measures
how one probability distribution \( P \) differs from a second probability distribution \( Q \),
where \( P \) and \( Q \) are defined on the same sample space \( \mathcal{X} \)

This usually happens when distribution \( Q \) is used as a model while the actual distribution is \( P \)

\[D_{KL}(P \ \parallel \ Q) := - \sum_{i=1}^n \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big) \ - \ \Big( - \sum_{i=1}^n P[X = x_i] \ \cdot \ \log \, P[X = x_i] \Big)\]
\[= \sum_{i=1}^n \Big( P[X = x_i] \ \cdot \ \log \, \frac{P[X = x_i]}{Q[X = x_i]} \Big)\]

Note:

  • \( D_{KL}(P \ \parallel \ Q) \neq D_{KL}(Q \ \parallel \ P) \)

  • \( D_{KL}(P \ \parallel \ Q) \geq 0 \)
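
Both properties are easy to check numerically on a toy pair of distributions:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))   # assumes p and q are strictly positive here

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])
print(kl(P, Q), kl(Q, P))   # different values: KL is not symmetric
print(kl(P, P))             # 0: a distribution has zero divergence from itself
```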

Cross Entropy

The relative entropy can be written in another form

\[D_{KL}(P \ \parallel \ Q) = -H(P) - \sum_{i=1}^n \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]

such that the cross entropy can be represented as

\[H(P,Q) = H(P) + D_{KL}(P \ \parallel \ Q) = - \sum_{i=1}^n \Big( P[X = x_i] \ \cdot \ \log \, Q[X = x_i] \Big)\]

Compared with relative entropy, cross entropy drops the fixed term \( H(P) \), which does not depend on the model \( Q \); minimizing cross entropy is therefore equivalent to minimizing the KL divergence while saving computation
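
A quick check of the identity \( H(P,Q) = H(P) + D_{KL}(P \parallel Q) \) on a toy pair of distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(P * np.log(Q))
entropy_P     = -np.sum(P * np.log(P))
kl_PQ         =  np.sum(P * np.log(P / Q))

print(np.isclose(cross_entropy, entropy_P + kl_PQ))  # True: H(P,Q) = H(P) + D_KL(P || Q)
```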