At a Glance
- StarCoder2 is here – and it now comes in three sizes, with the smallest as powerful as the original model.
Hugging Face has unveiled the latest version of its code generation model StarCoder – enlisting the help of Nvidia to bring it to life.
The original StarCoder, built in tandem with ServiceNow, launched last May. This new version, StarCoder2, can generate code across over 600 programming languages.
StarCoder2 comes in three sizes but is designed to be small – the largest version stands at 15 billion parameters – so developers can run it more efficiently on their PCs.
The new versions of StarCoder are more powerful too, with the smallest of the bunch matching the performance of the original StarCoder 15 billion parameter model. StarCoder2-15B is the best in its size class and matches models double its size. Read the technical paper.
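All three sizes are published on the Hugging Face Hub under the bigcode organization, so developers can try them with the standard transformers workflow. Below is a minimal sketch of code completion with the 3B checkpoint; the model ID, prompt, and generation settings are illustrative, and the larger checkpoints will need more GPU memory or quantization.

```python
# Minimal sketch: code completion with StarCoder2-3B via Hugging Face transformers.
# Assumes the published bigcode/starcoder2-3b checkpoint; swap in the 7B or 15B
# model IDs if your hardware allows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # reduce memory use on supported GPUs
    device_map="auto",            # place weights on available devices automatically
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```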
Enter Nvidia
A new addition to the StarCoder project was Nvidia. The AI chipmaking giant’s infrastructure was used to train the 15 billion parameter version. ServiceNow trained the 3B model while Hugging Face took responsibility for the 7B version.
Nvidia also used its NeMo framework in the development of the largest StarCoder2 model. NeMo allows users to build custom generative AI models and services.
Jonathan Cohen, vice president of applied research at Nvidia, said its involvement in the StarCoder project “introduces secure, responsibly developed models and supports broader access to accountable generative AI that we believe will benefit the global community.”
New underlying dataset
The three- and seven-billion parameter models were trained on three trillion tokens, while the 15 billion parameter model was trained on over four trillion tokens.
StarCoder2 was built using The Stack v2, a sizable new dataset to power code generation models.
The Stack v2 is far larger than The Stack v1, standing at 67.5 terabytes compared to just 6.4 terabytes.
The Stack v2 is derived from the Software Heritage archive, a public archive of software source code. The new dataset boasts improved language and license detection procedures and better filtering heuristics, and its training data is grouped by repository, which Hugging Face said allows for the training of models with repository context.
To access the dataset, head to Hugging Face. To download it in bulk, users need to get permission from Software Heritage and Inria.
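For a quick look at the data, the Hub listing can be streamed once the gating terms have been accepted. The sketch below uses the datasets library and assumes the bigcode/the-stack-v2 dataset ID; the exact configuration names and record fields may differ, and the Hub entries are largely metadata, with bulk file contents obtained from Software Heritage as noted above.

```python
# Rough sketch: streaming a few records from The Stack v2 on the Hugging Face Hub.
# Assumes you are logged in (huggingface-cli login) and have accepted the dataset's terms.
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record)   # records are metadata such as repository and file identifiers
    if i >= 2:      # peek at the first few records only
        break
```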
Since The Stack v2 is made up of source code released under many different licenses, it may not be immediately clear whether the whole dataset can power commercial applications. Hugging Face has compiled a list of the relevant licenses to help ensure compliance.