Salesforce Launches AI Benchmark to Evaluate CRM Deployments
Salesforce's new benchmark evaluates AI models on accuracy, cost, speed and trust for sales and service use cases
Salesforce has introduced a language model benchmark tailored for businesses to evaluate their AI models on customer relationship management (CRM) tasks.
Benchmarks are standardized tests designed to evaluate the performance of a language model. They give model owners an assessment of a model's outputs on specific tasks; the MMLU benchmark, for example, measures general knowledge.
Salesforce’s new test focuses on CRM, enabling model owners to evaluate an AI system’s performance on sales and service use cases across four key metrics: accuracy, cost, speed, and trust and safety.
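Conceptually, a harness like this runs a model over a fixed set of task prompts and aggregates a score per dimension. The Python sketch below is purely illustrative, not Salesforce's implementation: the `model_fn` callable, the toy CRM-style test cases, and the keyword-matching accuracy check are all hypothetical stand-ins, and it covers only two of the four dimensions (accuracy and speed).

```python
import time

def evaluate(model_fn, test_cases):
    """Score a model on toy CRM-style test cases across two benchmark
    dimensions: accuracy and speed. Illustrative only; model_fn and
    the test set are hypothetical stand-ins, not Salesforce's harness."""
    correct = 0
    latencies = []
    for prompt, expected_keyword in test_cases:
        start = time.perf_counter()
        answer = model_fn(prompt)  # call the model under test
        latencies.append(time.perf_counter() - start)
        # Naive accuracy check: does the response mention the expected term?
        correct += int(expected_keyword.lower() in answer.lower())
    return {
        "accuracy": correct / len(test_cases),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

# Toy usage with a stub "model" that echoes part of the prompt.
cases = [
    ("Summarize this support ticket: printer offline", "printer"),
    ("Draft a follow-up email about the renewal quote", "renewal"),
]
print(evaluate(lambda p: f"Response about {p.split(':')[-1]}", cases))
```

A real harness would add the remaining dimensions, for instance tracking per-call token usage for cost and running adversarial probes for trust and safety, but the aggregate-and-compare structure stays the same.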
The new benchmark was developed by Salesforce’s AI research team, which argued that prior model benchmarks lack business relevance, failing to evaluate factors enterprises care about, such as running costs and trust considerations.
Salesforce said the new CRM test lets businesses make more strategic decisions on which AI systems to deploy for CRM use cases.
“Business organizations are looking to utilize AI to drive growth, cut costs and deliver personalized customer experiences, not to plan a kid’s birthday party or summarize Othello,” said Clara Shih, Salesforce AI’s CEO. “This benchmark is not just a measure; it’s a comprehensive, dynamically evolving framework that empowers companies to make informed decisions, balancing accuracy, cost, speed and trust.”
Model owners can compare their results from Salesforce’s benchmark on a public leaderboard.
Upon launch, Salesforce’s benchmark ranks OpenAI’s GPT-4 Turbo as the most accurate model for CRM, while Anthropic’s Claude 3 Haiku ranks among the cheapest to deploy.
Mixtral 8x7B from French AI startup Mistral ranked as the fastest model. The fastest systems were all small language models; the highest-ranking large-scale system for speed was GPT-3.5 Turbo, some five places below the Mistral model.
The model with the highest trust and safety score was Google’s Gemini 1.5 Pro, which achieved a score of 91%. Meta’s two new Llama 3 models, 8B and 70B, followed with scores of 90% each.
OpenAI’s GPT-4 Turbo and the new GPT-4o managed trust and safety scores of only 89% and 85%, respectively.
The least trustworthy model was OpenAI’s GPT-3.5 Turbo, which achieved a trust and safety score of just 60%, with poor results on tests of privacy and truthfulness.
Salesforce plans to add new CRM use case scenarios to the benchmark and to support fine-tuned models.
“As AI continues to evolve, enterprise leaders are saying it’s important to find the right mix of performance, accuracy, responsibility and cost to unlock the full potential of generative AI to drive business growth,” said Silvio Savarese, Salesforce AI Research’s executive vice president and chief scientist.
“Salesforce’s new LLM Benchmark for CRM is a significant step forward in the way businesses assess their AI strategy within the industry. It not only provides clarity on next-generation AI deployment but also can accelerate time to value for CRM-specific use cases.”