Google DeepMind researchers discovered that using prompts similar to human interaction greatly improved math skills in large language models

Sascha Brodsky, Contributor

September 28, 2023

At a Glance

  • Google DeepMind researchers discovered that human-like prompts make large language models like ChatGPT better at math.
  • Encouraging the language model to do better also yielded better results.

Just as people like hearing kind words, AI might benefit from advice that sounds human.

Researchers at Google DeepMind developed a method that significantly boosted math abilities in language models by using prompts similar to human interaction, according to their recently published paper, "Large Language Models as Optimizers."

The DeepMind scientists proposed a method called Optimization by PROmpting (OPRO) to enhance the performance of large language models like OpenAI’s ChatGPT. The approach uses everyday human speech to help guide these models in solving problems.

Machine learning usually improves a model with formal optimization algorithms, step-by-step procedures that depend on a mathematical definition of the problem. OPRO instead describes the optimization task in plain language. The large language model proposes candidate solutions based on the problem’s description and on previously generated solutions and their scores, and each new candidate is evaluated and fed back into the prompt for the next round.
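To make the loop concrete, here is a minimal Python sketch of the idea described above. It is not the DeepMind implementation. The names `build_meta_prompt`, `opro_loop` and `make_toy_proposer` are hypothetical, the proposer is a scripted stand-in for a real LLM call, and the scorer simply looks up the accuracy figures reported in this article rather than running a benchmark. Listing past solutions from worst to best is also an assumption of this sketch.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_meta_prompt(task_description: str, history: History, top_k: int = 3) -> str:
    """Assemble a plain-language prompt: the task plus the best past solutions."""
    # Keep the top_k highest-scoring entries, listed from worst to best.
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    lines = [task_description, "", "Previous instructions and their scores:"]
    for text, score in best:
        lines.append(f"text: {text!r}  score: {score:.1f}")
    lines.append("")
    lines.append("Write a new instruction that scores higher.")
    return "\n".join(lines)


def opro_loop(
    task_description: str,
    propose: Callable[[str], str],    # stand-in for an LLM call (hypothetical)
    evaluate: Callable[[str], float],  # stand-in for a benchmark run (hypothetical)
    seed: str,
    steps: int,
) -> Tuple[str, float]:
    """Propose, score, and feed results back for a fixed number of steps."""
    history: History = [(seed, evaluate(seed))]
    for _ in range(steps):
        meta_prompt = build_meta_prompt(task_description, history)
        candidate = propose(meta_prompt)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda pair: pair[1])


def make_toy_proposer(candidates: List[str]) -> Callable[[str], str]:
    """Return a scripted proposer that ignores the prompt and replays candidates."""
    remaining = iter(candidates)
    return lambda meta_prompt: next(remaining, candidates[-1])


# Accuracy figures as reported in this article, used as a lookup table only.
REPORTED_SCORES = {
    "": 34.0,
    "Let's think step by step": 71.8,
    "Take a deep breath and work on this problem step by step": 80.2,
}

best_prompt, best_score = opro_loop(
    task_description="Find an instruction that helps solve grade-school math problems.",
    propose=make_toy_proposer(list(REPORTED_SCORES)[1:]),
    evaluate=lambda prompt: REPORTED_SCORES[prompt],
    seed="",
    steps=2,
)
print(best_prompt, best_score)
```

In a real setup, `propose` would send the meta-prompt to a language model and `evaluate` would measure the candidate on held-out problems. The sketch only shows how scored history flows back into the next prompt.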

“LLMs are trained on human-generated content, and the way it works, roughly speaking, is to finish your sentences the way a good couple would,” Tinglong Dai, a professor of Operations Management and Business Analytics at Johns Hopkins University, who was not involved in the research, said in an interview. “So it's not surprising that human-like prompts lead to good results.”

Phrasing can influence AI output

The DeepMind study also found that certain phrases influenced the AI's output. Prompts like "let's think step by step" led the AI models to yield more accurate results when evaluated against math problem datasets.

The researchers found that the prompt "Take a deep breath and work on this problem step by step" was the most effective with Google's PaLM 2 language model. It reached the highest accuracy, 80.2%, on GSM8K, a dataset of grade-school math word problems. By comparison, PaLM 2 without any special prompting achieved only 34% accuracy on GSM8K, while the classic prompt "Let's think step by step" reached 71.8%.
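For readers who want to see what those percentages mean, the short Python sketch below shows one common way such an accuracy score is computed and how far apart the reported figures are. The `accuracy` helper and the five sample answers are made up for illustration and are not GSM8K items. Exact-match scoring is an assumption here, not a detail taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose final answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up final answers for five hypothetical word problems.
toy_references = ["18", "3", "70", "540", "20"]
toy_predictions = ["18", "3", "70", "540", "25"]
toy_score = accuracy(toy_predictions, toy_references)

# Figures reported in the article for PaLM 2 on GSM8K.
no_prompt = 34.0
step_by_step = 71.8
deep_breath = 80.2

# Gains in percentage points, rounded to one decimal place.
gain_over_no_prompt = round(deep_breath - no_prompt, 1)
gain_over_step_by_step = round(deep_breath - step_by_step, 1)

print(toy_score, gain_over_no_prompt, gain_over_step_by_step)
```

On the reported numbers, the best prompt sits 46.2 percentage points above the unprompted model and 8.4 points above "Let's think step by step."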

LLMs respond well to human-like prompts because they are trained on conversational human-language data such as Reddit threads and movie scripts, Michael Kearns, a professor of Computer and Information Science at the University of Pennsylvania, who was not part of the DeepMind team, said in an interview.

“In this sense, LLMs are good at modifying their output in response to requests and encouragement, such as asking for output in a particular style or genre,” he added. “In terms of math skills, it is generally reported that encouraging an LLM to break a math or logic problem into steps is very effective, as is training on data that includes mathematical proofs, computer programs, and other examples of formal reasoning.”

Most LLMs have been trained and fine-tuned on a massive volume of data, so they have excellent natural language capabilities like paraphrasing or enriching a sentence, Chengrun Yang, one of the authors of the DeepMind paper, said in an interview.

“Further, people have been working on model alignment, which improves models' capability to understand and respond to human-like prompts just like a human, since, anyway, we define whether a model responds 'well' from a human's perspective,” he added.

Human-like prompts are often shaped as requests driving the AI model towards engaging in a dialog-style interaction, where the model is tasked with providing an accurate response based on familiar cues, said Olga Beregovaya, vice president of AI and Machine Translation at software translation company Smartling.

“LLMs perform best when given more context,” she added. “More verbose human-like prompts tend to give more context, descriptions, examples, making it easier for the model to perform the task, aligning its output with the context of the prompt.”

Encouraging Words

Sometimes, simple words of encouragement can push AI to do better. Dai said that LLMs can produce superior results when users respond to their output with "Come on, you can do better than that!" He noted that some users get better results when they ask an LLM to pretend to be a Nobel Prize winner in Economics and comment on inflation.

“In the case of medical diagnosis, asking LLMs to pretend to be a world-leading medical expert can sometimes produce more accurate and targeted results,” he added. “But I'm not aware of any solid evidence that such human-style encouragement leads to universal improvements across different types of tasks.”

Dai said it's important to note that LLMs can respond well to non-human prompts, depending on the task. “I've seen LLMs respond very effectively to prompts structured like computer code, e.g., If-Then-Else statements,” he added.

The new method could make engineering AI prompts easier, Yang said.

“The users can optimize prompts with their own metric: problem-solving accuracy in math reasoning, trigger rate in tool use, text vividness and length in creative writing, etc.,” he added. “Further, we hope our method can inspire new ways of using LLMs to optimize other types of tasks.”

About the Author(s)

Sascha Brodsky is a freelance technology writer based in New York City. His work has been published in The Atlantic, The Guardian, The Los Angeles Times, Reuters, and many other outlets. He graduated from Columbia University's Graduate School of Journalism and its School of International and Public Affairs. 
