Image credit: Together
RedPajama, an ambitious project to create leading fully open-source foundation models, has recently completed the first step by reproducing the LLaMA training dataset, consisting of over 1.2 trillion tokens. Collaborating partners include Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.
Current foundation models like GPT-4 have driven rapid advancements in AI, but most powerful models are either closed commercial models or only partially open. RedPajama aims to bridge the quality gap between open and closed models, thereby removing limitations on research, customization, and usage with sensitive data.
The RedPajama project has three key components: pre-training data, base models, and instruction-tuning data and models. The first component, pre-training data, has now been released. The RedPajama base dataset is a fully open, 1.2 trillion token dataset created by following the recipe described in the LLaMA paper. The full dataset, along with a smaller random sample, can be downloaded from Hugging Face.
RedPajama-Data-1T includes seven data slices: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. All data pre-processing and quality filters are openly available on GitHub, allowing anyone to reproduce the dataset.
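As a quick sanity check on the headline figure, the approximate per-slice token counts reported with the RedPajama release can be tallied to confirm the roughly 1.2 trillion token total. The numbers below are the release's rounded counts in billions, so the result is approximate:

```python
# Approximate per-slice token counts (in billions of tokens), as reported
# with the RedPajama-Data-1T release; figures are rounded.
SLICE_TOKENS_B = {
    "CommonCrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "arXiv": 28,
    "Books": 26,
    "Wikipedia": 24,
    "StackExchange": 20,
}

# Sum the slices and express the total in trillions of tokens.
total_b = sum(SLICE_TOKENS_B.values())
print(f"Total: {total_b} billion tokens (~{total_b / 1000:.2f} trillion)")
```

CommonCrawl dominates the mix, accounting for roughly three quarters of the corpus, which mirrors the composition described in the LLaMA paper.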
In collaboration with the Meerkat project, the team is releasing a Meerkat dashboard and embeddings for exploring the GitHub subset of the corpus. With the dashboard, users can interactively explore the dataset and view matching records.
The next step in the RedPajama project is to train a strong base model, with the first models becoming available in the coming weeks, supported by the INCITE program and the Oak Ridge Leadership Computing Facility (OLCF). With a strong base model in hand, the team plans to use OpenChatKit's hundreds of thousands of high-quality natural user instructions to release instruction-tuned versions of the RedPajama models.