Leaderboard: OpenAI’s GPT-4 Has Lowest Hallucination Rate
Vectara has also open-sourced the evaluation model, letting anyone check the hallucination rate of their own large language models
At a Glance
- GPT-4 ranks first for document summarization in a new open source model evaluation from Vectara.
- GPT-3.5 Turbo ranked second, Meta's Llama 2 was the highest-scoring non-OpenAI model and Google's Palm 2 ranked last.
OpenAI’s GPT-4 has the lowest hallucination rate of large language models when summarizing documents, a new leaderboard from Vectara suggests.
The Palo Alto-based company launched a leaderboard on GitHub that scores some of the biggest names in large language models using its Hallucination Evaluation Model, which gauges how often an LLM introduces hallucinations when summarizing a document.
GPT-4 and GPT-4 Turbo came out on top with the highest accuracy rate (97%) and lowest hallucination rate (3%) of any of the tested models.
Another OpenAI model came in second: GPT-3.5 Turbo, the newest iteration of the model that powers the base version of ChatGPT, with an accuracy rate of 96.5% and a hallucination rate of 3.5%.
The highest-scoring non-OpenAI model was the 70 billion parameter version of Llama 2 from Meta, with an accuracy rate of 94.9% and a hallucination rate of 5.1%.
The worst-performing models came from Google: Palm 2 had an accuracy rate of 87.9% and a hallucination rate of 12.1%, while the chat-refined version of Palm scored even lower, with an accuracy rate of just 72.8% and the highest hallucination rate of any model on the board at 27.2%.
Google Palm 2 Chat also generated the longest summaries, averaging 221 words each. In comparison, GPT-4 averaged just 81 words per summary.
How were the models evaluated?
Vectara trained a model to detect hallucinations in large language model outputs using open source datasets. The company fed 1,000 short documents to each of the models via their public APIs and asked each one to summarize every document using only the facts presented in it.
Of the 1,000 documents, only 831 were summarized by every model; the rest were rejected by at least one model due to content restrictions. Using the documents accepted by every system, Vectara then computed the overall accuracy and hallucination rate for each model.
None of the content sent to the models was illicit or 'not safe for work,' but the presence of trigger words was enough to trip some of the content filters.
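To make that aggregation step concrete, here is a minimal Python sketch of how per-model hallucination rates could be computed over only the documents that every model accepted. The data structures and function name are hypothetical and illustrate the counting described above, not Vectara's actual pipeline.

```python
from typing import Dict, Optional

def hallucination_rates(
    judgements: Dict[str, Dict[str, Optional[bool]]]
) -> Dict[str, float]:
    """Compute per-model hallucination rates over the documents that every
    model accepted (831 of the 1,000 in Vectara's run).

    judgements maps model name -> {doc id -> True (consistent summary),
    False (hallucinated summary), or None (model refused the document)}.
    """
    # Keep only the documents that every model actually summarized.
    common_docs = set.intersection(
        *({doc for doc, ok in per_model.items() if ok is not None}
          for per_model in judgements.values())
    )
    rates = {}
    for model, per_model in judgements.items():
        hallucinated = sum(1 for doc in common_docs if per_model[doc] is False)
        rates[model] = hallucinated / len(common_docs)
    return rates

# Toy example: d3 is dropped because model_b refused it.
toy = {
    "model_a": {"d1": True, "d2": False, "d3": True},
    "model_b": {"d1": True, "d2": True,  "d3": None},
}
print(hallucination_rates(toy))  # {'model_a': 0.5, 'model_b': 0.0}
```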
Test your own models
The risk of hallucinations has held back many businesses from adopting generative AI, Shane Connelly, head of product at Vectara, wrote in a blog post.
“Some attempts have been made in the past to quantify or at least qualify when/how much a generative model is hallucinating. However, many of these have been too abstract and based on subjects that are too controversial to be useful to most enterprises.”
The company’s Hallucination Evaluation Model is open source, meaning companies can use it to evaluate the trustworthiness of their large language models in retrieval-augmented generation (RAG) systems. It can be accessed via Hugging Face, and users can tune it for their specific needs.
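As a rough illustration of how a team might try it, the sketch below scores a single source/summary pair. The Hugging Face repo ID and the cross-encoder-style usage shown here are assumptions based on the model's public listing; treat the model card on Hugging Face as the authoritative reference.

```python
# Hedged sketch: score how factually consistent a summary is with its source.
# Assumes the repo id "vectara/hallucination_evaluation_model" and that it can
# be loaded as a sentence-transformers CrossEncoder returning a score between
# 0 (hallucinated) and 1 (consistent). Check the model card for exact usage.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "A man walks into a bar and buys a drink."
summary = "A bloke swigs alcohol at a pub."

score = model.predict([(source, summary)])[0]
print(f"factual consistency score: {score:.3f}")
```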
“Our idea is to empower enterprises with the information they need to have the confidence they need to enable generative systems through quantified analysis,” Connelly wrote.