February 12, 2024
There is a delectable new open source model for English and French workloads - and it is snackable enough in size to run on mobile devices.
The goal is to make French on par with English in AI models. “With CroissantLLM, we aim to train a model in which English is not the dominant language and go for a 1:1 ratio of English and French data!” wrote Manuel Faysse, the project’s lead researcher.
The model is just 1.3 billion parameters in size but was trained on three trillion tokens – more than the Llama 2 models – on a dataset that includes high-quality French content such as legal documents, business data, cultural content and scientific information. It uses the Llama model architecture.
For example, you can prompt the model to explain French terms; Croissant’s deep linguistic knowledge brings out the nuances of the language − et voilà!
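As a rough sketch of what such a prompt might look like in practice, here is how one could query the model with the Hugging Face transformers library. The checkpoint name is an assumption based on the authors’ public Hugging Face release and should be verified before use.

```python
def build_prompt(term: str) -> str:
    """Build a French prompt asking the model to explain a term's nuance."""
    return f"Explique la nuance du mot français « {term} »."

def explain(term: str, max_new_tokens: int = 120) -> str:
    """Generate an explanation with CroissantLLM (downloads weights on first use)."""
    # Imported lazily so the prompt helper works even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "croissantllm/CroissantLLMChat-v0.1"  # assumed Hugging Face model id
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(build_prompt(term), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

At 1.3 billion parameters, the full pipeline above should fit comfortably on a consumer GPU or even a CPU, which is the point of the model’s size.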
Credit: "CroissantLLM: A Truly Bilingual French-English Language Model" https://arxiv.org/pdf/2402.00786.pdf
The model and the underlying datasets were created by researchers from mainly French universities and businesses, including CentraleSupélec from the Université Paris-Saclay, Illuin Technology in Neuilly-sur-Seine, France, Sorbonne Université in Paris, and others.
Faysse said a big challenge was to get enough high-quality French content for the training dataset. The team collected, filtered and cleaned data from varied sources and modalities, including webpages, transcriptions, movie titles and more.
They collected more than 303 billion tokens of monolingual French data and 36 billion tokens of French-English high-quality translation data. “We craft our final 3 trillion token dataset such that we obtain equal amounts of French and English data after upsampling,” Faysse said.
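As a back-of-envelope illustration of what those numbers imply (assuming the three-trillion-token mix is split 1:1 between French and English, which glosses over the translation data and other sources):

```python
total_tokens = 3_000_000_000_000   # final training dataset size
french_target = total_tokens // 2  # 1:1 French/English split
unique_french = 303_000_000_000    # monolingual French tokens collected

# How many times the French data must be repeated to reach parity
upsample_factor = french_target / unique_french
print(f"French data repeated roughly {upsample_factor:.1f}x to reach parity")
```

So the 303 billion unique French tokens would be seen roughly five times over the course of training, which is what “upsampling” means here.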
He said the team purposely made CroissantLLM small after noticing that one of the biggest hurdles to widespread adoption of AI models is the difficulty in getting them to run on consumer-grade hardware.
Notably, the most downloaded models on Hugging Face were not the best performers, like Llama 2-70B or Mixtral 8x7B, but smaller models like Llama 2-7B or Mistral 7B, which are “easier and cheaper to serve and finetune,” he said.
CroissantLLM’s small size lets it run “extremely quickly on lower end GPU servers, enabling for high throughput and low latency,” as well as on CPUs and mobile devices at “decent speeds,” Faysse wrote.
The trade-off, he said, is that it is not as strong at generalist capabilities such as reasoning, math and coding as larger models. But the team behind CroissantLLM believes it will be “perfect” for specific industrial applications, translation and chat functionality, where larger models are not necessarily needed.
The researchers also introduced a new French benchmark to assess non-English language models: FrenchBench. FrenchBench Gen assesses tasks like title generation, summarization, question generation, and question answering − relying on the high-quality French Question Answering dataset, FQuaD. The Multiple Choice section of FrenchBench tests reasoning, factual knowledge, and linguistic capabilities of models.
When tested, CroissantLLM came out among the best-performing models of its size − in French − and was even competitive with Mistral 7B.
Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.