Why Language Models Fail: Ways to Enhance AI for Effective Deployments

Researchers explore the potential for “irreversible damage” when AI models generate falsities, including issues around historical data and a lack of model evaluation

Ben Wodecki, Jr. Editor

October 19, 2023

At a Glance

Researchers propose ways to improve large language model deployments, including suggesting tools to evaluate AI systems.
They also examine multi-agent approaches, which use multiple existing models to power one system, and domain-specific training.

The reliability of large language models (LLMs) is under scrutiny as a new study explores the ability of models like ChatGPT to produce factual, trustworthy content.

A group of U.S. and Chinese researchers, who hail from Microsoft, Yale and other universities, evaluated large language models across domains ranging from health care to finance to determine their reliability. In a research survey titled Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity, they found that reasoning failures and models' misinterpretation of retrieved data are among the prime causes of factual errors.

Such errors could lead to a health care chatbot providing incorrect information to a patient, or to a finance-focused AI system producing false reporting on stocks and prompting bad investments. Such blunders could harm users and even cause reputational damage to the companies deploying these systems, as when Google launched Bard, only for the chatbot to produce a factual error in one of its earliest demos.

Another reliability issue outlined in the research is the use of outdated information. Certain LLMs are trained on data that ends at a cutoff date, which forces businesses to update them continually.

Evaluate before deployment

The researchers warn that factual mistakes generated by LLMs could cause “irreversible damage.” For businesses looking to deploy such systems, the authors stress the need for careful evaluation of a model’s factuality before deployment.

They wrote that evaluation techniques like FActScore would allow businesses to measure the factual accuracy of LLM-generated content. FActScore is an evaluation metric proposed by a group of researchers from Meta, the University of Washington and the Allen Institute for AI.

The researchers also referenced the idea of using benchmarks like TruthfulQA, C-EVAL and RealTimeQA that are capable of quantifying factuality. Such systems are largely open source and easily accessible via GitHub, meaning businesses can use free tools to check their models.
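As a rough illustration of what quantifying factuality can look like in practice, the sketch below scores a model against a small question-answer set by normalized exact match. It is a minimal sketch under stated assumptions, not the scoring method of TruthfulQA, C-EVAL or RealTimeQA, each of which defines its own protocol. The names `benchmark_accuracy`, `toy_model` and `TOY_BENCHMARK` are hypothetical, and the toy model stands in for a real LLM call.

```python
from __future__ import annotations

from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def benchmark_accuracy(
    model: Callable[[str], str],
    items: list[tuple[str, set[str]]],
) -> float:
    """Return the fraction of questions whose answer matches an accepted reference."""
    if not items:
        raise ValueError("benchmark is empty")
    correct = 0
    for question, accepted in items:
        answer = normalize(model(question))
        if answer in {normalize(a) for a in accepted}:
            correct += 1
    return correct / len(items)


# Hypothetical stand-in for a real model; it answers one question wrongly on purpose.
def toy_model(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris",
        "What is the boiling point of water at sea level in Celsius?": "90",
    }
    return canned.get(question, "I don't know")


# Toy items written for this example, not drawn from any published benchmark.
TOY_BENCHMARK = [
    ("What is the capital of France?", {"paris"}),
    ("What is the boiling point of water at sea level in Celsius?", {"100"}),
]

score = benchmark_accuracy(toy_model, TOY_BENCHMARK)
print(f"factual accuracy: {score:.2f}")
```

Exact match is the simplest possible scorer; free-form answers usually need a more forgiving comparison, which is part of what dedicated benchmarks provide.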

Other strategies discussed for improving an LLM’s factuality included continual training and retrieval augmentation, which enhance the learning of long-tail knowledge in LLMs.
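Retrieval augmentation can be pictured as a two-step process: find passages relevant to the question, then place them in the prompt so the model answers from supplied text rather than from memory alone. The sketch below is a minimal illustration that assumes a simple word-overlap retriever; production systems typically use a search index or embeddings instead. The function names and the sample documents are hypothetical.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return up to k documents ranked by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    # Drop documents that share no words with the query at all.
    return [doc for doc in ranked[:k] if query_tokens & tokenize(doc)]


def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Assemble a prompt that asks the model to answer from the retrieved context."""
    context = "\n".join(f"- {passage}" for passage in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Invented sample documents, for illustration only.
DOCUMENTS = [
    "The 2023 annual report lists revenue of 12 million dollars.",
    "The company was founded in 1998 in Oslo.",
    "Office plants should be watered weekly.",
]

prompt = build_prompt("What revenue does the 2023 annual report list?", DOCUMENTS)
print(prompt)
```

The resulting prompt would then be passed to whichever model is in use; keeping the document store current is what lets the system answer beyond the model's training cutoff.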

Multi-agent systems

The survey references the reliance on historical data used to train models. For a long time, the basic version of OpenAI’s ChatGPT was limited to data up until September 2021, though a recent update brought this up to January 2022 for the basic version.

Having an AI system that provides outputs based on outdated information could prove detrimental to users. For example, a system that cannot provide relevant information makes for an ineffective deployment. AI models rely on data to learn and make decisions; if the data they are trained on is largely outdated, they may not be able to predict outcomes accurately. Outdated information powering a system could also lead to historical biases in older data resurfacing in responses.

There are ways around this – like using API calls as seen in Meta’s Toolformer to improve a model's access to information. However, such systems don't produce real-time information.

The paper refers to the idea of using a multi-agent approach, where multiple AI systems are used to generate an output, instead of just one. A team from MIT and Google DeepMind recently proposed such a system, dubbing the concept a “Multiagent Society.”

The survey's authors were in favor of using a multi-agent approach to improve a system's factuality. They wrote that engaging multiple models collaboratively or competitively could enhance factuality “through their collective prowess and help address issues like reasoning failures or forgetting of facts.”

Several multi-agent system concepts were explored by the researchers to help improve LLMs. Among them was multi-debate, where different LLM agents debate answers and iteratively refine responses to converge on a correct factual consensus. Such an approach could improve mathematical and logical reasoning abilities.
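The debate idea can be sketched as a loop in which each agent answers, sees its peers' answers, and may revise until the group agrees or a round limit is reached. The code below is a minimal sketch under stated assumptions, not the procedure from the survey or from the MIT and Google DeepMind work. The `Agent` signature and `make_toy_agent` are hypothetical, and the toy agents simply adopt a strict majority of peer answers in place of real LLM reasoning.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, List

# An agent takes the question and its peers' latest answers and returns its own answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], max_rounds: int = 3) -> str:
    """Run rounds of answer revision and return the most common final answer."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        if len(set(answers)) == 1:
            break  # every agent agrees, so stop early
        # Each agent revises after seeing every answer except its own.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]


def make_toy_agent(initial: str) -> Agent:
    """Hypothetical agent: keeps its answer unless a strict majority of peers disagrees."""
    def agent(question: str, peer_answers: List[str]) -> str:
        if not peer_answers:
            return initial
        top, count = Counter(peer_answers).most_common(1)[0]
        return top if count > len(peer_answers) / 2 else initial
    return agent


agents = [make_toy_agent("56"), make_toy_agent("56"), make_toy_agent("54")]
result = debate("What is 7 * 8?", agents)
print(result)
```

Agreement among agents is not proof of correctness: if most agents share the same mistake, the loop converges on that mistake, which is one reason to pair debate with independent evaluation.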

There was also multi-role fact-checking – where separate LLM agents were tasked with generating statements or verifying outputs in a collaborative manner to detect potential factual inaccuracies.

Such an approach would prove model-agnostic, meaning users could use any existing LLM as one of the agents in their multi-faceted approach.
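Multi-role fact-checking can be pictured as two roles behind plain function interfaces: one generates statements, the other verifies each of them. Because the roles are just callables, any model could fill either one, which is the model-agnostic property described above. The sketch below is illustrative only; `toy_generator`, `toy_verifier` and the `KNOWN_FACTS` set are hypothetical stand-ins for LLM calls and a trusted reference source.

```python
from __future__ import annotations

from typing import Callable, List, Tuple

Generator = Callable[[str], List[str]]  # prompt -> list of statements
Verifier = Callable[[str], bool]        # statement -> is it supported?


def check_statements(
    generator: Generator, verifier: Verifier, prompt: str
) -> List[Tuple[str, bool]]:
    """Have one role generate statements and another role verify each of them."""
    return [(statement, verifier(statement)) for statement in generator(prompt)]


# Hypothetical reference facts; a real verifier might query a database or another model.
KNOWN_FACTS = {"water boils at 100 degrees celsius at sea level"}


def toy_generator(prompt: str) -> List[str]:
    """Stand-in for a generating model; the second statement is deliberately wrong."""
    return [
        "Water boils at 100 degrees Celsius at sea level.",
        "Water boils at 50 degrees Celsius at sea level.",
    ]


def toy_verifier(statement: str) -> bool:
    """Stand-in for a verifying model: exact lookup against the reference facts."""
    return statement.lower().rstrip(".") in KNOWN_FACTS


report = check_statements(toy_generator, toy_verifier, "Tell me about boiling water.")
unsupported = [statement for statement, ok in report if not ok]
print(unsupported)
```

Statements that fail verification could be dropped, regenerated or routed to a human reviewer, depending on how much risk the deployment can tolerate.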

Domain-specific training

Using a more general AI model for a hyper-specific use case or sector was another issue highlighted in the research. The authors suggest that while effective at general tasks, a model like ChatGPT lacks domain-specific factual knowledge for medicine, for example.

Domain-specific public models do exist. They include Harvey, which is designed to automate tasks in the legal sector; Owl, built with IT tasks in mind; and BloombergGPT, which was trained on the vast trove of information from the financial data giant.

The research states that domain-specific LLMs provide factual improvements in their outputs compared to more general LLMs. They contend that models trained using knowledge-rich data tend to be more factual.

They suggest that domain-specific training and evaluation could “transform” deployments, like in the case of HuatuoGPT, a medical language model that uses data from ChatGPT and doctors, for clinical decision-making.

The survey paper discussed several promising methods for domain-specific training and evaluation of large language models. Among them was continual pretraining, where fine-tuned models are routinely fed a stream of domain-specific data to keep them up to date. Another was supervised finetuning, where labeled domain-specific datasets are used to refine a model’s performance on specialized tasks like answering legal questions.

The paper also outlined domain-specific benchmarks that businesses can use to evaluate domain-specific models, like CMB for health care or LawBench for legal use cases.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
