November 9, 2023
At a Glance
- AI startup Together unveiled a 30-trillion-token open source dataset that aims to make large language model training easier.
- The second version of RedPajama expedites model training by providing preprocessed data.
- Together claims it is the largest public dataset yet for language model pre-training.
AI startup Together has unveiled a dataset with a whopping 30 trillion tokens (roughly 20 trillion words) and claims it is the largest public dataset yet for language model pre-training.
It is the latest version of RedPajama, which was first unveiled in April as a 1.2-trillion-token dataset for building open source large language models.
The new RedPajama consists of trillions of filtered and deduplicated tokens from 84 CommonCrawl dumps covering five languages.
RedPajama v2 is the largest public dataset released specifically for large language model training, according to Together. It covers English, French, Spanish, German and Italian and contains over 40 pre-computed data quality annotations for further filtering and weighting.
Together claims that most publicly available data sources like CommonCrawl are “generally low quality” and “not ideal for direct use for large language model training due to artifacts arising from the conversion of HTML to plain text.”
The latest RedPajama is designed to reduce time-consuming and energy-intensive tasks like filtering raw data, making model development easier. The dataset's pre-computed annotations make it simpler for developers to filter and weight the data when assembling their own pre-training datasets.
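As an illustration of how per-document quality annotations can drive this kind of filtering, the sketch below applies simple thresholds to a few mock documents. The signal names (`word_count`, `language_score`, `duplicate`) and the threshold values are illustrative assumptions for demonstration, not RedPajama's actual annotation schema.

```python
# Illustrative sketch: filtering web documents using pre-computed quality
# annotations, in the spirit of RedPajama v2. Signal names and thresholds
# are assumptions for demonstration, not the dataset's real schema.

def keep_document(doc: dict) -> bool:
    """Return True if a document passes all quality thresholds."""
    signals = doc["quality_signals"]
    return (
        signals["word_count"] >= 50           # drop very short pages
        and signals["language_score"] >= 0.8  # drop low-confidence language IDs
        and not signals["duplicate"]          # drop near-duplicate documents
    )

# Mock records standing in for annotated CommonCrawl documents.
documents = [
    {"text": "A long, clean article about model training...",
     "quality_signals": {"word_count": 1200, "language_score": 0.97, "duplicate": False}},
    {"text": "menu home login",
     "quality_signals": {"word_count": 3, "language_score": 0.55, "duplicate": False}},
    {"text": "A boilerplate page copied across many sites...",
     "quality_signals": {"word_count": 800, "language_score": 0.95, "duplicate": True}},
]

filtered = [d for d in documents if keep_document(d)]
print(len(filtered))  # only the first document survives filtering
```

Because the annotations ship alongside the raw text, each team can tune thresholds like these to its own needs rather than accepting one fixed, pre-filtered corpus.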
While there are other similar projects around, such as C4, RedPajama-1T, Refinedweb (Falcon), Dolma (AI2) and SlimPajama, "many of them only cover a small portion of the CommonCrawl crawls," the startup said in a blog post. "Moreover, they represent a very specific way in which data are filtered."
With version 2 of RedPajama, "our goal is to lift this burden off the community and provide a pool of web data serving as a base from which high quality datasets for LLM training can be extracted, and on which LLM training data can be thoroughly researched," the startup said.
Moreover, Together plans to add more quality annotations beyond the current 40 as part of a “living” project.
RedPajama is open source – the dataset is covered by an Apache License v2, meaning it is suitable for commercial use. The data processing scripts are available on GitHub with all data available on Hugging Face.
The original version of RedPajama was downloaded more than 190,000 times, Together said. There are over 500 examples of its implementation on Hugging Face, with the classic RedPajama used in projects including Alibaba’s Data-Juicer and chat concepts from the Analytics Club at ETH Zürich, a Swiss academic institution.