DPO even impressed AI luminary and Google Brain founder Andrew Ng

Ben Wodecki, Jr. Editor

January 17, 2024

2 Min Read

At a Glance

  • DPO is a new AI training technique that fine-tunes models better than reinforcement learning from human feedback.
  • DPO was developed by researchers from Stanford University and the Chan Zuckerberg Biohub Network.

AI researchers from Stanford have come up with a new technique that could simplify training large language models.

The technique, Direct Preference Optimization (DPO), is a much simpler alternative to reinforcement learning from human feedback (RLHF) for aligning a model with human preferences, according to the paper, which was co-authored by researchers from the Chan Zuckerberg Biohub Network.

“It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization,” tweeted Google Brain founder and Stanford professor Andrew Ng.

Traditionally, model builders would employ RLHF to create a reward model from human preference data, then use reinforcement learning to optimize a policy to maximize the learned reward.
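The first of those two RLHF stages fits a reward model so that human-preferred responses score higher than rejected ones. A minimal sketch of that per-pair objective (a Bradley-Terry-style loss; the function name and scalar inputs here are illustrative, not from the paper's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(r_chosen, r_rejected):
    """Per-pair loss for fitting an RLHF reward model: drive the reward of
    the human-preferred response above the reward of the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

The reinforcement learning stage then optimizes the language model against this learned reward, which is the part DPO removes.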

DPO, however, directly optimizes the policy to satisfy human preferences using a simple binary cross-entropy loss. In simple terms, DPO trains the model to make the reward function consistent with the human rankings – meaning developers do not need to separate the reward function aspect and can instead train the LLM directly to optimize the same objective.
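That binary cross-entropy objective can be sketched in a few lines. This is a simplified single-pair version of the loss described in the DPO paper, taking summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model (the function name and the `beta` default are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Binary cross-entropy DPO loss on one preference pair: increase the
    model's log-probability of the chosen response relative to the reference
    model, and decrease it for the rejected response. beta controls how far
    the model may drift from the reference."""
    chosen_ratio = logp_chosen - ref_logp_chosen      # implicit reward of chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward of rejected
    return -math.log(sigmoid(beta * (chosen_ratio - rejected_ratio)))
```

Because the "reward" is just the log-probability ratio against the reference model, the same gradient step that minimizes this loss trains the language model directly, with no separate reward model or reinforcement learning loop.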

DPO could save language model builders time but also money by reducing computing costs.


“Although it’s still too early to be sure, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years,” Ng said.

DPO trumps RLHF

RLHF can be a complex and even unstable process. It relies on the quality and consistency of human feedback, which is resource-intensive to gather and can introduce the biases of human judgment.

To combat this, researchers built an algorithm that is more stable and computationally lightweight.

DPO can fine-tune models far better than RLHF, with greater control over the sentiment of generations, according to the paper. Deploying it can lead to improved response quality in summarization and single-turn dialogue, the researchers contend.

More work still needs to be done to test DPO’s abilities. The researchers behind it recorded some impressive results, but only demonstrated the technique on models of up to six billion parameters.

DPO is already being used in models available today, including Mixtral from Mistral AI, a multilingual language model that outperforms Meta’s Llama 2 70B on most benchmarks.

Mixtral combines eight expert models, totaling 46.7 billion parameters, so the scale of models DPO can optimize effectively remains an open question.


“That we can replace such fundamental building blocks of LLMs is a sign that the field is still new and much innovation lies ahead,” Ng wrote in a blog called The Batch.

“While it is always nice to have massive numbers of Nvidia H100 or AMD MI300X GPUs, this work is another illustration — out of many, I want to emphasize — that deep thinking with only modest computational resources can carry you far.”


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
