April 14, 2023
At a Glance
- Databricks has launched Dolly 2.0, an open-source language model organizations can use to build applications.
- The dataset was built using responses from Databricks employees and is small, but of high quality, the company asserts.
Databricks has launched Dolly 2.0, an instruction-following large language model. It comes just two weeks after the company unveiled Dolly, an open-source version of ChatGPT trained for just $30.
Dolly 2.0, named after Dolly the sheep, the world’s first cloned mammal, boasts 12 billion parameters. It is built on the EleutherAI Pythia model family and was fine-tuned on a human-generated, instruction-following dataset crowdsourced among Databricks employees.
The new large language model is designed for research and commercial use, with Databricks open sourcing the entirety of Dolly 2.0, including the training code, the dataset and the model weights. Databricks showcased the model’s capabilities for summarizing internal documents and writing content for tweets.
By allowing open access to the model, Databricks said any organization could “create, own and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.”
According to a company blog post, Databricks opted to create a new dataset because it was inundated with requests to use the tech commercially; the data used to train the original Dolly was generated with OpenAI’s API, whose terms of service restrict using its output to build competing commercial models.
The team behind the model was reportedly inspired by InstructGPT, one of the models behind ChatGPT, which was trained on a dataset of some 13,000 demonstrations of instruction-following behavior.
The dataset for Dolly 2.0, named databricks-dolly-15k, contains 15,000 human-generated prompt and response pairs for instruction-tuning. Employees were incentivized to submit responses for the dataset, with the top 20 labelers given “a big award.”
Databricks gathered responses for the dataset from its own employees during March and April of 2023, covering a range of behaviors, from brainstorming and content generation to information extraction and summarization.
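To make the shape of such a dataset concrete, here is a minimal sketch of one prompt/response record. The field names (instruction, context, response, category) match the published schema of databricks-dolly-15k; the prompt template below is a simplified illustration for this article, not Databricks’ exact training format, and the sample record is invented.

```python
def format_prompt(record: dict) -> str:
    """Render one instruction-tuning record into a single prompt string.

    The section headers are an assumed, simplified template, not the
    exact one Databricks used for Dolly 2.0.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("context"):  # context is empty for open-ended tasks
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

# A hypothetical record in the databricks-dolly-15k schema.
record = {
    "instruction": "Summarize the following document in one sentence.",
    "context": "Databricks released Dolly 2.0, an open-source LLM.",
    "response": "Databricks open-sourced an instruction-following LLM.",
    "category": "summarization",
}

print(format_prompt(record))
```

During fine-tuning, thousands of records like this are rendered into prompts and used to teach the base model to follow instructions rather than merely continue text.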
The company asserts that the dataset used to train Dolly 2.0 is “substantially smaller” but of higher quality than the one used to train Alpaca, the open-source model from Stanford University researchers made for just $600.
In terms of shortcomings, the dataset reflects contributions from annotators who are not all native English speakers. Databricks also admits that some annotators drew reference passages from Wikipedia, meaning anomalies from that source may have crept into the data.
Dolly 1.0, a causal language model, cost just $30 to train. However, Databricks’ blog post announcing Dolly 2.0 did not say how much this latest version cost to train.
AI Business has contacted Databricks for comment.