Salesforce Launches AI Benchmark to Evaluate CRM Deployments

Salesforce's new benchmark evaluates AI models on accuracy, cost, speed and trust for sales and service use cases

Ben Wodecki, Jr. Editor

July 3, 2024

3 Min Read
A customer service operative at a computer with a phone headset (Getty Images)

Salesforce has introduced a language model benchmark tailored for businesses to evaluate their AI models on customer relationship management (CRM) tasks.

Benchmarks are standardized tests designed to evaluate the performance of a language model, giving model owners an assessment of their model's outputs on specific tasks. The MMLU benchmark, for example, evaluates general knowledge.

Salesforce’s new test focuses on CRM, enabling model owners to evaluate their AI system's performance on sales and service use cases across four key metrics: accuracy, cost, speed, and trust and safety.

The new benchmark was developed by Salesforce’s AI research team, which argued that prior model benchmarks lack business relevance, failing to evaluate metrics enterprises care about, such as running costs and trust considerations.

Salesforce said the new CRM test lets businesses make more strategic decisions on which AI systems to deploy for CRM use cases.

“Business organizations are looking to utilize AI to drive growth, cut costs and deliver personalized customer experiences, not to plan a kid’s birthday party or summarize Othello,” said Clara Shih, Salesforce AI’s CEO. “This benchmark is not just a measure; it’s a comprehensive, dynamically evolving framework that empowers companies to make informed decisions, balancing accuracy, cost, speed and trust.”


Model owners can compare their results from Salesforce’s benchmark on a public leaderboard.

Upon launch, Salesforce’s benchmark ranks OpenAI’s GPT-4 Turbo as the most accurate model for CRM, while Anthropic’s Claude 3 Haiku ranks among the cheapest to deploy.

Mixtral 8x7B from French AI startup Mistral was ranked as the fastest model. The speediest systems were all small language models; the highest-ranking large-scale system for speed was GPT-3.5 Turbo, some five places below the Mistral model.

The model with the highest trust and safety score was Google’s Gemini Pro 1.5, which achieved a score of 91%. Meta’s two new Llama 3 models, 8B and 70B, followed Gemini Pro with scores of 90%.

OpenAI’s GPT-4 Turbo and the new GPT-4o could only manage safety scores of 89% and 85%, respectively.

The least trustworthy model was an OpenAI model: GPT-3.5 Turbo, which achieved a safety score of just 60%, with poor results on tests to evaluate its privacy and truthfulness.

Salesforce plans to add new CRM use case scenarios to the benchmark and to add support for fine-tuned models.

“As AI continues to evolve, enterprise leaders are saying it’s important to find the right mix of performance, accuracy, responsibility and cost to unlock the full potential of generative AI to drive business growth,” said Silvio Savarese, Salesforce AI Research’s executive vice president and chief scientist. 


“Salesforce’s new LLM Benchmark for CRM is a significant step forward in the way businesses assess their AI strategy within the industry. It not only provides clarity on next-generation AI deployment but also can accelerate time to value for CRM-specific use cases.”

About the Author

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
