Leaderboard: OpenAI’s GPT-4 Has Lowest Hallucination Rate

Vectara has also released an open source model that lets anyone check the hallucination rate of their own large language models

Ben Wodecki, Jr. Editor

November 21, 2023

At a Glance

  • GPT-4 ranks first for document summarization in a new open source model evaluation from Vectara.
  • GPT-3.5 Turbo ranked second, Meta's Llama 2 was the highest-scoring non-OpenAI model and Google's Palm 2 Chat ranked last.

OpenAI’s GPT-4 has the lowest hallucination rate of large language models when summarizing documents, a new leaderboard from Vectara suggests.

The Palo Alto-based company launched a leaderboard on GitHub that evaluates some of the biggest names in large language models using its Hallucination Evaluation Model, which gauges how often an LLM introduces hallucinations when summarizing a document.

GPT-4 and GPT-4 Turbo came out on top with the highest accuracy rate (97%) and lowest hallucination rate (3%) of any of the tested models.

Another OpenAI model scored the second highest: GPT-3.5 Turbo, the newest iteration of the model that powers the base version of ChatGPT. GPT-3.5 Turbo scored an accuracy rate of 96.5% and a hallucination rate of 3.5%.

The highest-scoring non-OpenAI model was the 70 billion parameter version of Llama 2 from Meta, with an accuracy rate of 94.9% and a hallucination rate of just 5.1%.

The worst-performing models came from Google. Palm 2 had an accuracy rate of 87.9% and a hallucination rate of 12.1%. The chat-refined version, Palm 2 Chat, scored even lower, with an accuracy rate of just 72.8% and the highest hallucination rate of any model on the board, at 27.2%.

Google Palm 2 Chat also generated the highest average number of words per summary, a whopping 221. In comparison, GPT-4 generated just 81 words per summary.

How were the models evaluated?

Vectara trained a model to detect hallucinations in large language model outputs using open source datasets. The company fed 1,000 short documents to each model via its public API and asked it to summarize each document using only the facts presented in it.
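
To make the setup concrete, the summarization pass might look something like the sketch below, which assumes an OpenAI-style chat API; the prompt wording and the summarize helper are illustrative stand-ins, not Vectara's published protocol.

    # Illustrative sketch of the summarization pass, assuming an OpenAI-style
    # chat API. The prompt wording here is hypothetical, not Vectara's own.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize(document: str, model: str = "gpt-4") -> str:
        """Ask a model to summarize a passage using only the facts it contains."""
        prompt = (
            "Summarize the following passage, using only facts stated in it. "
            "Do not add any information that is not in the passage.\n\n" + document
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output for a reproducible evaluation
        )
        return response.choices[0].message.content

    documents = ["Example passage to summarize."]  # stand-in for the 1,000 source texts
    summaries = [summarize(doc) for doc in documents]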

Of the 1,000 documents, only 831 were summarized by every model; the rest were rejected by at least one model due to content restrictions. Using the documents accepted by every model, Vectara then computed the overall accuracy and hallucination rate for each one.
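
Framed this way, accuracy and hallucination rate are two views of the same number. A minimal sketch of the computation, assuming each summary has already received a factual-consistency score between 0 and 1, might look like this (the 0.5 decision threshold is an assumption for illustration, not a published cutoff):

    # Turning per-summary factual-consistency scores (in [0, 1]) into the two
    # leaderboard metrics. The 0.5 threshold is an assumption for illustration.
    def leaderboard_metrics(scores, threshold=0.5):
        consistent = sum(s >= threshold for s in scores)
        accuracy = consistent / len(scores)   # share of faithful summaries
        hallucination_rate = 1.0 - accuracy   # the two rates sum to 100%
        return accuracy, hallucination_rate

    scores = [0.98, 0.91, 0.42, 0.87]  # toy values; a real run scores all 831 summaries
    accuracy, rate = leaderboard_metrics(scores)
    print(f"accuracy: {accuracy:.1%}, hallucination rate: {rate:.1%}")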

None of the material sent to the models was illicit or 'not safe for work,' but the presence of trigger words was enough to trip some of the content filters.

Test your own models

The risk of hallucinations has held back many businesses from adopting generative AI, Shane Connelly, head of product at Vectara, wrote in a blog post.

“Some attempts have been made in the past to quantify or at least qualify when/how much a generative model is hallucinating. However, many of these have been too abstract and based on subjects that are too controversial to be useful to most enterprises.”

The company’s Hallucination Evaluation Model is open source, meaning companies can use it to evaluate the trustworthiness of their large language models in Retrieval Augmented Generation (RAG) systems. It can be accessed via Hugging Face, and users can tune it for their specific needs.
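
As a rough sketch of what that looks like in practice, assuming the model loads as a sentence-transformers cross-encoder under the repo id vectara/hallucination_evaluation_model, scoring a source-and-summary pair takes only a few lines; the model card on Hugging Face documents the current interface.

    # Sketch of scoring (source, summary) pairs with Vectara's model, assuming
    # it loads as a sentence-transformers cross-encoder; check the model card
    # on Hugging Face for the current interface.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("vectara/hallucination_evaluation_model")

    pairs = [
        # (source document, candidate summary)
        ("The company reported revenue of $10M in Q3.",
         "Revenue was $10M in the third quarter."),         # faithful
        ("The company reported revenue of $10M in Q3.",
         "Revenue doubled to $20M in the third quarter."),  # hallucinated
    ]
    scores = model.predict(pairs)  # scores near 1 = consistent, near 0 = hallucinated
    print(scores)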

“Our idea is to empower enterprises with the information they need to have the confidence they need to enable generative systems through quantified analysis,” Connelly wrote.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
