Developer Together claims it is the largest public dataset built specifically for language model pre-training

Ben Wodecki, Jr. Editor

November 9, 2023

2 Min Read
Image: a llama in red pajamas (AI Business via DALL-E 3)

At a Glance

  • AI startup Together unveiled a 30-trillion-token open source dataset that aims to make large language model training easier.
  • The second version of RedPajama expedites model training by providing preprocessed data.
  • Together claims it is the largest public dataset yet for language model pre-training.

AI startup Together has unveiled a dataset with a whopping 30 trillion tokens – roughly 20 trillion words – and claims it is the largest public dataset yet for language model pre-training.

It is the latest version of RedPajama, which was first unveiled in April with a 1.2 trillion token dataset for building open source large language models.

The new RedPajama consists of trillions of filtered and deduplicated tokens from 84 CommonCrawl dumps covering five languages.

RedPajama v2 is the largest public dataset released specifically for large language model training, according to Together. It covers English, French, Spanish, German and Italian and contains over 40 pre-computed data quality annotations for further filtering and weighting.

Together claims that most publicly available data sources like CommonCrawl are “generally low quality” and “not ideal for direct use for large language model training due to artifacts arising from the conversion of HTML to plain text.”

The latest RedPajama is designed to reduce time-consuming and energy-intensive tasks like filtering raw data to make model development easier. The improved dataset contains annotations to make it simpler for developers to filter data to create their own pre-training dataset.
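The annotation-driven filtering described above can be sketched as follows. Note that the field names (`quality_signals`, `ccnet_perplexity`) and the threshold value are illustrative assumptions for this sketch, not the dataset's confirmed schema – developers should consult the RedPajama v2 documentation for the actual annotation names.

```python
# Hypothetical sketch: selecting documents using pre-computed quality
# annotations, rather than scoring raw text from scratch.
# Field names and the threshold below are illustrative assumptions.

def filter_by_quality(docs, max_perplexity=300.0):
    """Keep documents whose annotated perplexity is below a threshold
    (lower perplexity is typically treated as higher-quality text)."""
    return [
        doc for doc in docs
        if doc["quality_signals"]["ccnet_perplexity"] < max_perplexity
    ]

# A tiny in-memory stand-in for annotated web documents.
sample_docs = [
    {"text": "A well-formed article about language models.",
     "quality_signals": {"ccnet_perplexity": 120.5}},
    {"text": "nav | login | accept cookies | share",
     "quality_signals": {"ccnet_perplexity": 980.2}},
]

filtered = filter_by_quality(sample_docs)
```

Because the annotations ship with the data, different teams can apply different thresholds or combine multiple signals to derive their own pre-training mixtures without re-scoring the corpus.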

While there are other similar projects around, such as C4, RedPajama-1T, RefinedWeb (Falcon), Dolma (AI2) and SlimPajama, "many of them only cover a small portion of the CommonCrawl crawls," the startup said in a blog post. "Moreover, they represent a very specific way in which data are filtered."

With version 2 of RedPajama, "our goal is to lift this burden off the community and provide a pool of web data serving as a base from which high quality datasets for LLM training can be extracted and on which LLM training data can be thoroughly researched," the startup said.

Moreover, Together plans to add more quality annotations beyond the current 40 as part of a “living” project.

RedPajama is open source – the dataset is released under the Apache 2.0 license, meaning it is suitable for commercial use. The data processing scripts are available on GitHub, with all data hosted on Hugging Face.

The team behind it encouraged developers to enrich data mixtures with the Stack by BigCode for code generation and S2ORC by AI2 for scientific articles.

The original version of RedPajama was downloaded more than 190,000 times, Together said. There are over 500 examples of its implementation on Hugging Face, with the classic RedPajama used in projects including Alibaba’s Data-Juicer and chat concepts from the Analytics Club at ETH Zürich, a Swiss academic institution.


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
