

Large-scale language models are becoming increasingly capable on NLP tasks. These models are usually trained with the objective of next word prediction on a dataset of human-written text. But this objective doesn’t capture exactly what we want: usually, we don’t want our models to imitate humans, we want them to give high-quality answers. This mismatch is clear when a model is trained to imitate low-quality human-written text, but it can also happen in more subtle ways. For example, a model trained to predict what a human would say might make up facts when it is unsure, or generate sentences reflecting harmful social bias, both failure modes that have been well-documented.

As part of our work on safety, we want to develop techniques that align our models’ objectives with the end behavior we really care about. As our models become more powerful, we believe aligning them with our goals will be very important to ensure they are beneficial for humans.

In the short term, we wanted to test if human feedback techniques could help our models improve performance on useful tasks. We focused on English text summarization, as it’s a challenging problem where the notion of what makes a “good summary” is difficult to capture without human input. We apply our method primarily to an existing dataset of posts submitted to the social network Reddit together with human-written “TL;DRs,” which are short summaries written by the original poster.
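As a rough illustration of what one training example in such a dataset looks like, here is a minimal sketch of a record type and prompt template. The field names and the exact formatting are assumptions made for illustration, not the dataset’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class TLDRExample:
    """One Reddit post paired with the author's own TL;DR summary."""
    subreddit: str
    title: str
    post: str     # full body text of the Reddit post
    summary: str  # the human-written TL;DR used as the reference summary

def format_prompt(example: TLDRExample) -> str:
    """Render a post as the context a summarization model completes.

    The template is illustrative; the important point is that the model
    sees the post and is asked to continue the text after "TL;DR:".
    """
    return (
        f"SUBREDDIT: r/{example.subreddit}\n"
        f"TITLE: {example.title}\n"
        f"POST: {example.post}\n"
        f"TL;DR:"
    )
```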

We first train a reward model via supervised learning to predict which summaries humans will prefer. We then fine-tune a language model with reinforcement learning (RL) to produce summaries that score highly according to that reward model. We find that this significantly improves the quality of the summaries, as evaluated by humans, even on datasets very different from the one used for fine-tuning.

Our approach follows directly from our previous work on learning from human feedback. There has also been other work on using human feedback to train summarization models. We push the technique further by scaling to larger models, collecting more feedback data, closely monitoring researcher-labeler agreement, and providing frequent feedback to labelers. Human feedback has also been used to train models in several other domains, such as dialogue, semantic parsing, translation, story and review generation, evidence extraction, and more traditional RL tasks.

We evaluated several different summarization models: some pre-trained on a broad distribution of text from the internet, some fine-tuned via supervised learning to predict TL;DRs, and some fine-tuned using human feedback. To evaluate each model, we had it summarize posts from the validation set and asked humans to compare their summaries to the human-written TL;DR. We found that RL fine-tuning with human feedback had a very large effect on quality compared to both supervised fine-tuning and scaling up model size. In particular, our 1.3 billion parameter (1.3B) model trained with human feedback outperforms our 12B model trained only with supervised learning.
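To make the two-stage procedure above concrete, here is a minimal PyTorch sketch of the two training signals involved. The function names, tensor shapes, and the `kl_coef` value are illustrative assumptions rather than the exact implementation: the reward model learns from pairwise human comparisons, and the RL stage is rewarded by the reward model’s score, with a KL-style penalty toward the supervised baseline included as a common stabilizer in this line of work.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for the reward model.

    `score_preferred` and `score_rejected` are the reward model's scalar
    scores for the summary a labeler chose and the one they passed over.
    Minimizing -log sigmoid(difference) trains the model to rank the
    human-preferred summary higher.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def rl_reward(rm_score: torch.Tensor,
              logp_policy: torch.Tensor,
              logp_supervised: torch.Tensor,
              kl_coef: float = 0.05) -> torch.Tensor:
    """Scalar reward for the RL fine-tuning stage.

    The policy is rewarded by the learned reward model's score; the
    per-token KL-style penalty against the supervised baseline (kl_coef
    is an illustrative value) discourages the policy from drifting into
    outputs the reward model has never been trained to score.
    """
    kl_penalty = kl_coef * (logp_policy - logp_supervised).sum(dim=-1)
    return rm_score - kl_penalty
```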
