Hugging Face Launches New Code Generation Models

Nvidia was brought in to help train these compact yet powerful new coding models

Ben Wodecki, Jr. Editor

March 5, 2024

2 Min Read

At a Glance

  • StarCoder2 is here – and it now comes in three sizes, with the smallest as powerful as the original model.

Hugging Face has unveiled the latest version of its code generation model StarCoder – enlisting the help of Nvidia to bring it to life.

The original StarCoder, built in tandem with ServiceNow, launched last May. This new version, StarCoder2, can generate code across over 600 programming languages.

StarCoder2 comes in three sizes but is designed to be small – the largest version stands at 15 billion parameters – so developers can run it more efficiently on their PCs.

The new versions of StarCoder are more powerful too, with the smallest of the bunch matching the performance of the original StarCoder 15 billion parameter model. StarCoder2-15B is the best in its size class and matches models double its size. Read the technical paper.
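Because even the largest variant is only 15 billion parameters, the models can be run locally through the Hugging Face transformers library. Below is a minimal sketch, assuming the checkpoints are published under the BigCode organization on the Hub (e.g. bigcode/starcoder2-3b) and that your machine has enough memory for the size you pick; the prompt is purely illustrative.

```python
# Minimal sketch: code completion with a StarCoder2 checkpoint via transformers.
# Assumes the Hub ID bigcode/starcoder2-3b (swap in -7b or -15b if your
# hardware allows) and that the transformers and accelerate packages are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # smallest variant, easiest to run on a PC

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```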

Enter Nvidia

A new addition to the StarCoder project was Nvidia. The AI chipmaking giant’s infrastructure was used to train the 15 billion parameter version. ServiceNow trained the 3B model while Hugging Face took responsibility for the 7B version.

Nvidia also used its NeMo framework in the development of the largest StarCoder2 model. NeMo allows users to build custom generative AI models and services.

Jonathan Cohen, vice president of applied research at Nvidia, said its involvement in the StarCoder project “introduces secure, responsibly developed models and supports broader access to accountable generative AI that we believe will benefit the global community.”

Related: Hugging Face, ServiceNow Launch Open-Source Coding LLM

New underlying dataset

The three billion and seven billion parameter models were trained on three trillion tokens, while the 15 billion parameter model was trained on over four trillion tokens.

StarCoder2 was built using The Stack v2, a sizable new dataset to power code generation models.

The Stack v2 is larger than The Stack v1, standing at 67.5TB compared to just 6.4TB.

The Stack v2 is derived from the Software Heritage archive, a public archive of software source code. The new dataset boasts improved language and license detection procedures and better filtering heuristics. The data is also grouped by repository, which Hugging Face said allows for the training of models with repository context.

To access the dataset, head to Hugging Face. To download it in bulk, users need to get permission from Software Heritage and Inria.
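For those just exploring, the dataset can be sampled directly from the Hub. A minimal sketch is below, assuming the dataset is hosted at bigcode/the-stack-v2 and that you have accepted its access terms while logged in to Hugging Face; bulk download still requires the permissions mentioned above.

```python
# Minimal sketch: streaming a few rows of The Stack v2 from the Hugging Face Hub.
# Assumes the dataset ID bigcode/the-stack-v2 and an authenticated session
# (huggingface-cli login) that has accepted the dataset's terms.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

# Print a handful of records; streaming avoids downloading the dataset in bulk.
for row in islice(ds, 3):
    print(row)
```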

Since The Stack v2 is made up of code from many sources, there are various licenses to contend with, so it may not be clear whether the whole dataset can power commercial applications. Hugging Face has compiled a list of the relevant licenses to help ensure compliance.

Related: AI Code Generation Models: The Big List


About the Author

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

