Mistral AI’s New Language Model Aims for Open Source Supremacy

The French AI startup said its new Mixtral 8x7B, with open weights, outperforms Llama 2 and GPT-3.5 on most benchmarks

Sascha Brodsky, Contributor

December 19, 2023

At a Glance

  • French startup Mistral AI has unveiled Mixtral 8x7B, a language model that it says outperforms Llama 2 and GPT-3.5.
  • Mixtral 8x7B uses a Sparse Mixture of Experts technique, akin to decisions made by a committee rather than a sole decision-maker.
  • The model uses only a "fraction" of total parameters per token to control costs, the startup claims.

French AI startup Mistral AI has unveiled its latest language model, Mixtral 8x7B, which it claims sets new standards for open source performance.

Released with open weights, Mixtral 8x7B outperforms Llama 2’s 70 billion-parameter model on most benchmarks with six times faster inference, and also outpaces OpenAI’s GPT-3.5 on most metrics, according to the startup.

Mixtral 8x7B has a context length of 32k tokens (roughly 24,000 words) and is multilingual, supporting English, Spanish, French, Italian, and German. It also has code generation capabilities and answers queries coherently, scoring 8.3 on MT-Bench, comparable to GPT-3.5.

“Mixtral is an open-weight model, which is exciting compared to ‘black box’ models like the ChatGPT family of models,” Jignesh Patel, a computer science professor at Carnegie Mellon University and co-founder of DataChat, a no-code, generative AI platform, said in an interview.

“One can use an open-weight model in a broader range of applications, including ones in which packaging the model with a bigger system in a single environment is essential for privacy considerations, including protecting leaking data to the model when using it and not disclosing the access pattern of usage.”


Mixtral 8x7B was trained on data from the open internet. It is distributed under the Apache 2.0 license, meaning it can be used commercially for free. Developers are also allowed to alter, copy, or update the code and distribute it along with a copy of the license.

Access the model in beta.

‘Sparse Mixture of Experts’

The Mixtral 8x7B model employs an architectural approach that has been discussed for many decades yet is only now being implemented at scale in large language models, Patel said. Its internal architecture comprises a limited number of experts, each specialized in certain tasks.

Called Mixture of Experts (MoE), this blend of specialized experts produces smooth, human-like responses. The method contrasts with the conventional LLM approach, which typically relies on a single, comprehensive model. The analogy: decision-making by a well-informed, diversely skilled committee instead of a sole decision-maker in an organization.

“In Mixtral, this mixture-of-experts model allows for the selective use of a small subset of these experts (often just two out of eight) for individual decisions,” Patel added. “This approach has several technical benefits: It is more cost-effective in terms of computational resources to develop and deploy these models. Considering the currently high expenses associated with building and operating GenAI models, this cost reduction is crucial for the broad adoption of this technology.”

In more technical terms, Mixtral 8x7B employs a sparse MoE network and is a decoder-only model where the “feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (experts) to process the token and combine their output additively,” according to the startup.
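
To make that routing concrete, below is a minimal sketch of a sparse MoE feedforward block in PyTorch: a router scores eight experts for each token, the top two are selected, and their outputs are summed using the router's weights. The dimensions, names, and gating details are illustrative assumptions, not Mixtral's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoEBlock(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            # Each "expert" is an ordinary feedforward network.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            # The router scores every expert for every token.
            self.router = nn.Linear(d_model, n_experts)

        def forward(self, x):  # x: (n_tokens, d_model)
            scores = self.router(x)               # (n_tokens, n_experts)
            weights, chosen = torch.topk(scores, self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)  # renormalize over the two chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for i, expert in enumerate(self.experts):
                    mask = chosen[:, slot] == i   # tokens routed to expert i in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
            return out

    tokens = torch.randn(4, 512)
    print(SparseMoEBlock()(tokens).shape)  # torch.Size([4, 512])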

This increases the number of parameters of a model while still controlling cost and latency, the startup said, since it only uses a “fraction” of the total set of parameters per token. For instance, Mixtral 8x7B has 46.7 billion parameters but only uses 12.9 billion per token.
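
A rough back-of-envelope calculation shows how those two figures can coexist. The component sizes below are chosen to reproduce the quoted totals and are not Mixtral's real parameter breakdown; only the 46.7 billion and 12.9 billion numbers come from the startup.

    # Illustrative only: component sizes solved to match the quoted totals.
    shared = 1.633e9      # attention, embeddings, etc. -- used by every token (assumed)
    per_expert = 5.633e9  # one expert's feedforward weights (assumed)
    n_experts, top_k = 8, 2

    total = shared + n_experts * per_expert  # all eight experts are stored in memory
    active = shared + top_k * per_expert     # but each token passes through only two

    print(f"total  = {total / 1e9:.1f}B parameters")   # 46.7B
    print(f"active = {active / 1e9:.1f}B parameters")  # 12.9B per token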

Biases and accuracy

The French startup said Mixtral 8x7B is more truthful than Llama 2 (73.9% vs. 50.2% on the TruthfulQA benchmark) and less biased (on the BBQ/BOLD metrics). However, it asks developers to add system prompts that ban toxic outputs. Without these guardrails, the model will simply follow instructions.
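
In practice, that guardrail amounts to prepending a system message to every request. The prompt wording below is a placeholder to show the pattern, not Mistral AI's recommended text.

    # Hypothetical guardrail: wrap every user query with a safety system prompt.
    SAFETY_PROMPT = (
        "You are a helpful assistant. Refuse requests for harmful, hateful, "
        "or unsafe content, and briefly explain why."
    )

    def build_messages(user_query: str) -> list[dict]:
        """Return a chat message list with the guardrail prepended."""
        return [
            {"role": "system", "content": SAFETY_PROMPT},
            {"role": "user", "content": user_query},
        ]

    print(build_messages("Summarize the Apache 2.0 license in one sentence."))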

While Mixtral 8x7B does well compared with GPT-3.5, OpenAI’s GPT-4 continues to lead in most performance categories, noted Bob Brauer, the CEO of Interzoid, a data usability consultancy, in an interview. Both GPT models are closed source.

However, a significant advantage of Mixtral 8x7B’s approach is that it increases the model’s capacity without proportionally increasing computational requirements, yielding fast responses. “This is crucial for organizations running open-source models on their own infrastructure, as it offers a more resource-efficient way to handle large-scale AI tasks,” Brauer said.

Blending models

Mixtral 8x7B represents a blend of business models, Brauer said. It takes an open-source-like approach in the sense that the neural network weights that constitute the model are accessible, allowing it to be downloaded to one’s own hardware for experimentation and use, similar to Meta’s Llama models, which are also free to use.

However, Mistral AI also provides pay-as-you-go API access to Mixtral, catering to those who want to quickly and easily tap its capabilities without managing the infrastructure to support it, similar to OpenAI’s ChatGPT and Anthropic’s Claude models.
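
For illustration, pay-as-you-go access might look like the sketch below, which assumes Mistral AI's hosted chat-completions endpoint; the model identifier and response fields shown are assumptions and may differ from the live service.

    # Hedged sketch: calling Mixtral through Mistral AI's hosted API (details assumed).
    import os
    import requests

    response = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={
            "model": "open-mixtral-8x7b",  # assumed model identifier
            "messages": [{"role": "user", "content": "Explain a Mixture of Experts in two sentences."}],
        },
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])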

“The Mistral AI strategy here clearly aims to be a hybrid, ‘best of both worlds’ approach,” he added.

Open-source software has been a boon to the computing community for decades, Patel said. For instance, Linux, the operating system that powers most cloud-based machines, is also the backbone for training large language models (LLMs).

“Without that foundational building block being open-sourced, a lot of progress in the overall field of computer science would likely have been significantly slower,” Patel said. “With open source, there is a lot more competition, and the speed of innovation is higher. Also, the barrier to entry for someone to get into the field is much lower if there is high-quality open-source software available. That type of democratization in creation and learning for new entrants to the field is also critical.”


About the Author(s)

Sascha Brodsky

Contributor

Sascha Brodsky is a freelance technology writer based in New York City. His work has been published in The Atlantic, The Guardian, The Los Angeles Times, Reuters, and many other outlets. He graduated from Columbia University's Graduate School of Journalism and its School of International and Public Affairs. 
