Microsoft Unveils Tiny AI Coding Model, Beats GPT-3.5

Phi-1 has 1.3 billion parameters and took just four days to train

Ben Wodecki, Jr. Editor

June 23, 2023

3 Min Read

At a Glance

  • Microsoft researchers showcase phi-1, a new code generation model with just 1.3 billion parameters.
  • Microsoft also unveils ZeRO++, an improved way for GPUs to communicate that will boost AI training and fine-tuning.

AI researchers from Microsoft have published a new code generation model, phi-1, that’s designed to be lightweight - and it outperforms GPT-3.5, the large language model behind ChatGPT, on a key coding benchmark.

The Transformer-based model boasts just 1.3 billion parameters – in comparison, Codex, the OpenAI model that formed the basis of what would become GitHub Copilot, had 12 billion parameters.

It took Microsoft’s researchers just four days to train phi-1 using eight A100 chips from Nvidia. The model was trained on six billion tokens from the web as well as a further one billion tokens generated using GPT-3.5, one of the underlying models used to build OpenAI’s ChatGPT.

In terms of performance, phi-1 scored a pass@1 accuracy of 50.6% on the HumanEval benchmark. The Microsoft model beat StarCoder from Hugging Face and ServiceNow (33.6%), OpenAI’s GPT-3.5 (47%) and Google’s PaLM 2-S (37.6%), despite being substantially smaller than all of them.

Phi-1 fared even better on the MBPP pass@1 test, achieving a 55.5% score. Many of the aforementioned models have yet to publish results on this benchmark, but WizardLM's WizardCoder scored 51.5% in a test conducted earlier this month. WizardCoder is a 15 billion parameter model versus 1.3 billion for phi-1.
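For readers unfamiliar with the metric, pass@1 is the probability that a problem is solved by the model's first generated attempt, as judged by the benchmark's unit tests; a score like 50.6% is that probability averaged over every problem in the suite. Below is a minimal sketch of the standard unbiased pass@k estimator popularized by OpenAI's HumanEval evaluation - the function and sample counts are illustrative and not taken from the phi-1 paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from the HumanEval methodology.

    n: number of solutions sampled per problem
    c: how many of those n solutions pass the unit tests
    k: evaluation budget (k=1 for pass@1)
    """
    if n - c < k:
        return 1.0  # any draw of k samples is guaranteed to include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: if 5 of 10 sampled solutions pass, the pass@1 estimate is 0.5.
print(pass_at_k(n=10, c=5, k=1))  # 0.5
```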

High-quality data makes the difference

Microsoft's researchers argue that the “power of high-quality data” is why phi-1 performs so well. To bring the point home, the researchers named their model’s paper ‘Textbooks Are All You Need.’

“Just as a comprehensive, well-crafted textbook can provide a student with the necessary knowledge to master a new subject, our work demonstrates the remarkable impact of high-quality data in honing a language model’s proficiency in code-generation tasks," they wrote.

“By crafting ‘textbook quality’ data we were able to train a model that surpasses almost all open-source models on coding benchmarks such as HumanEval and MBPP despite being 10x smaller in model size and 100x smaller in dataset size.”

Phi-1 is limited to Python, unlike many other coding models that support multiple programming languages. The researchers also noted that the model lacks the domain-specific knowledge of larger models, such as programming with specific APIs.

To expand on their work, Microsoft’s researchers have suggested using GPT-4 rather than GPT-3.5 to generate synthetic data for the model’s training.

The researchers also plan to improve the diversity and reduce the repetitiveness of the training dataset, although the team said they would have to find ways to “inject randomness and creativity into the data generation process, while still maintaining the quality and the coherence of the examples.”

ZeRO++: Accelerating large model fine-tuning

This week, Microsoft’s researchers also announced ZeRO++, a new method designed to improve large model pre-training and fine-tuning.

Large AI models like ChatGPT and GPT-4 require vast memory and computing resources to train and fine-tune.

When training is spread across a large number of GPUs relative to the batch size, each GPU ends up with a small per-GPU batch, which forces frequent communication between devices.
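As a rough illustration (the figures below are hypothetical, not from Microsoft's announcement), spreading a fixed global batch over many GPUs leaves each GPU with very little work per step, so gradient synchronization starts to dominate:

```python
# Hypothetical figures for illustration only, not from Microsoft's post.
global_batch_size = 2048
num_gpus = 512

per_gpu_batch = global_batch_size // num_gpus
print(per_gpu_batch)  # 4 samples per GPU per step

# With only 4 samples of local compute per step, each optimizer step must still
# synchronize gradients across all 512 GPUs, so communication time rather than
# computation becomes the bottleneck that ZeRO++ targets.
```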

To address this, Microsoft introduced ZeRO++, a system that leverages quantization - the process of mapping continuous infinite values to a smaller set of discrete finite values - combined with data and communication remapping to reduce total communication volume by 4x compared with ZeRO, without impacting model quality.
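To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization of a float32 parameter shard before it is sent over the network, cutting the bytes communicated by 4x. This illustrates the general technique only, not ZeRO++'s actual block-based implementation.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 plus a single scale factor (symmetric quantization)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately reconstruct the original float32 values on the receiving GPU."""
    return q.astype(np.float32) * scale

shard = np.random.randn(1024).astype(np.float32)  # stand-in for a weight shard
q, scale = quantize_int8(shard)                   # 1 byte per value instead of 4
recovered = dequantize_int8(q, scale)

print(shard.nbytes, q.nbytes)                  # 4096 vs. 1024 bytes: 4x less to send
print(float(np.abs(shard - recovered).max()))  # small quantization error
```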


Effectively, ZeRO++ is designed to reduce the communication overhead between the GPUs training a model when the hardware’s bandwidth or per-GPU workload is small relative to the model’s size.

According to Microsoft’s researchers, ZeRO++ enables low-bandwidth clusters to achieve similar throughput as those with 4x higher bandwidth.

The team behind the system claims it offers up to 2.2x higher throughput compared to ZeRO, Microsoft’s earlier training optimization system.

ZeRO++ is available for anyone in the AI community and can be accessed via GitHub. The researchers announced that a version for chat will be released “in the coming weeks.”
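For readers who want to experiment, ZeRO++'s options are exposed through DeepSpeed's JSON-style configuration. The sketch below uses the flag names documented in DeepSpeed's ZeRO++ tutorial at the time of writing; treat the exact keys and values as assumptions to be checked against the GitHub repository.

```python
# A hedged sketch of a DeepSpeed config enabling ZeRO++'s three techniques:
# quantized weight communication (qwZ), hierarchical partitioning (hpZ), and
# quantized gradient communication (qgZ). Key names follow the DeepSpeed
# ZeRO++ tutorial and may change between releases.
ds_config = {
    "train_batch_size": 256,          # illustrative value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # ZeRO++ builds on ZeRO stage 3
        "zero_quantized_weights": True,       # qwZ: quantize weights before all-gather
        "zero_hpz_partition_size": 8,         # hpZ: keep a secondary copy within each node
        "zero_quantized_gradients": True,     # qgZ: quantize gradients for reduce-scatter
    },
}

# Typical usage (model definition omitted for brevity):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```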


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
