Hugging Face Launches New Code Generation Models

Nvidia was brought in to help train these compact yet powerful new coding models

Ben Wodecki, Jr. Editor

March 5, 2024

2 Min Read

At a Glance

  • StarCoder2 is here – and it now comes in three sizes, with the smallest as powerful as the original model.

Hugging Face has unveiled the latest version of its code generation model StarCoder – enlisting the help of Nvidia to bring it to life.

The original StarCoder, built in tandem with ServiceNow, launched last May. This new version, StarCoder2, can generate code across over 600 programming languages.

StarCoder2 comes in three sizes but is designed to be small – the largest version stands at 15 billion parameters – so developers can run it more efficiently on their PCs.

The new versions of StarCoder are more powerful too, with the smallest of the bunch matching the performance of the original StarCoder 15 billion parameter model. StarCoder2-15B is the best in its size class and matches models double its size. Read the technical paper.
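Because even the largest variant is only 15 billion parameters, the models can be run locally through the Hugging Face transformers library. Below is a minimal sketch, assuming the checkpoints are published under the BigCode organization on the Hub (e.g. bigcode/starcoder2-3b) and that your machine has enough memory for the size you pick; the prompt is purely illustrative.

```python
# Minimal sketch: code completion with a StarCoder2 checkpoint via transformers.
# Assumes the Hub ID bigcode/starcoder2-3b (swap in -7b or -15b if your
# hardware allows) and that the transformers and accelerate packages are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # smallest variant, easiest to run on a PC

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```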

Enter Nvidia

A new addition to the StarCoder project was Nvidia. The AI chipmaking giant’s infrastructure was used to train the 15 billion parameter version. ServiceNow trained the 3B model while Hugging Face took responsibility for the 7B version.

Nvidia also used its NeMo framework in the development of the largest StarCoder2 model. NeMo allows users to build custom generative AI models and services.

Jonathan Cohen, vice president of applied research at Nvidia, said its involvement in the StarCoder project “introduces secure, responsibly developed models and supports broader access to accountable generative AI that we believe will benefit the global community.”

Related: Hugging Face, ServiceNow Launch Open-Source Coding LLM

New underlying dataset

The three billion and seven billion parameter models were trained on three trillion tokens, while the 15 billion parameter model was trained on over four trillion tokens.

StarCoder2 was built using The Stack v2, a sizable new dataset to power code generation models.

The Stack v2 is larger than The Stack v1, standing at 67.5TB compared to just 6.4TB.

The Stack v2 is derived from the Software Heritage archive, a public archive of software source code. The new dataset boasts improved language and license detection procedures and better filtering heuristics. The data is also grouped by repository, which Hugging Face said allows for the training of models with repository context.

To access the dataset, head to Hugging Face. To download it in bulk, users need to get permission from Software Heritage and Inria.
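For those just exploring, the dataset can be sampled directly from the Hub. A minimal sketch is below, assuming the dataset is hosted at bigcode/the-stack-v2 and that you have accepted its access terms while logged in to Hugging Face; bulk download still requires the permissions mentioned above.

```python
# Minimal sketch: streaming a few rows of The Stack v2 from the Hugging Face Hub.
# Assumes the dataset ID bigcode/the-stack-v2 and an authenticated session
# (huggingface-cli login) that has accepted the dataset's terms.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

# Print a handful of records; streaming avoids downloading the dataset in bulk.
for row in islice(ds, 3):
    print(row)
```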

Since The Stack v2 is made up of code from many sources, there are various licenses to contend with, so it may not be clear whether the whole dataset can power commercial applications. Hugging Face has compiled a list of the relevant licenses to help ensure compliance.

Related: AI Code Generation Models: The Big List


About the Author

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

