Common Corpus Goes Global

Feb 19, 2026

One year after its initial release, Common Corpus has become a critical component of AI infrastructure: more than a million downloads on HuggingFace, subsequent reuse by research and industry leaders such as Nvidia, Anthropic, StepFun, IBM and ElasticSearch, and an oral acceptance at ICLR.

We achieved our initial objectives, building the largest multilingual corpus of permissively licensed content with clear provenance. Yet Common Corpus remained heavily biased toward content from Europe (where Pleias is based) and the United States (where much of the existing data work had already been done). It was not yet the universal infrastructure we wanted to build.

For the Indian Impact Summit, we are releasing the first global update of Common Corpus: more than 270 billion new tokens, 53% of which come from non-Western countries. Common Corpus (2.5) now amounts to 2,267,302,720,836 tokens, with significant content from China, Japan, Korea and Brazil, as well as India, Africa and Southeast Asia.

  • HuggingFace Repo: https://huggingface.co/datasets/PleIAs/common_corpus
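
For quick inspection, the corpus can be streamed directly from the repository with the `datasets` library. A minimal sketch, assuming the default configuration and a `train` split (check the actual records for the available fields):

```python
# Minimal sketch: stream a few records from Common Corpus without downloading it all.
# Assumes the default configuration and a "train" split; inspect the records to see
# which fields (text, licensing, provenance metadata) are actually exposed.
from datasets import load_dataset

corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for i, record in enumerate(corpus):
    print(record.keys())  # look at the schema before relying on any field name
    if i >= 2:
        break
```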

Fixing the open data paradox

When we started building Common Corpus, the general assumption was that training on open data was not viable: only a small share of crawled web pages is under an open license. Our own perspective changed radically as we came to recognize a widespread open data paradox: the vast majority of openly available content is not included in the standard pretraining data mix. To date, the reference crawl-based dataset, FineWeb, has less than 2% of its content in common with Common Corpus.
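
As an aside, overlap between two corpora of this kind can be estimated by hashing a normalized form of each document and intersecting the resulting sets. The sketch below only illustrates the idea; it is not the methodology behind the figure above:

```python
# Illustrative overlap estimate between two document collections via normalized hashing.
# Generic sketch, not the actual FineWeb / Common Corpus comparison pipeline.
import hashlib
import re


def doc_key(text: str) -> str:
    """Hash a whitespace-normalized, lowercased version of a document."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_ratio(corpus_a, corpus_b) -> float:
    """Share of documents in corpus_a whose normalized hash also appears in corpus_b."""
    keys_a = {doc_key(t) for t in corpus_a}
    keys_b = {doc_key(t) for t in corpus_b}
    return len(keys_a & keys_b) / max(len(keys_a), 1)
```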

In short, rather than removing existing copyrighted data, we mostly added new data to the pretraining commons, simply because obsessing over provenance, traceability and releasability requires caring about what is missing: all the content that should be open and isn’t properly indexed. To currently be findable in crawled datasets, a piece of text has to meet all of these conditions:

  • Digitized. That already excludes a large amount of relevant content in developing countries; major efforts still have to be undertaken to collect oral data for very low-resource languages.

  • Accessible. Even in developed countries with advanced open policies, digital infrastructure is more fragile than is commonly thought. We had to give up on integrating most of the case law from Taiwan, as the single corresponding zipped file simply disappeared. Another major source we’re releasing today was only secured thanks to… two individual seeders.

  • Readable. Crawl infrastructure is mostly designed for web pages. Due to the added costs of processing and storage, PDFs have long been excluded. This is changing, but not retroactively.

  • Findable. This last part is very underrated: crawlers only index content they can find, preferably backlinked or listed on a sitemap. In contrast, open data is highly concentrated on large infrastructures with very sparse indexation. On multiple occasions we found that less than a handful of URLs were backed up on the Internet Archive, something that can be verified programmatically, as sketched after this list.
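
A minimal sketch of that verification, using the public Wayback Machine availability API (the example URL is a placeholder):

```python
# Check whether URLs have at least one snapshot on the Internet Archive,
# via the public Wayback Machine availability API. The example URL is a placeholder.
import requests


def wayback_snapshot(url: str):
    """Return the closest archived snapshot URL, or None if nothing is archived."""
    response = requests.get(
        "https://archive.org/wayback/available", params={"url": url}, timeout=30
    )
    response.raise_for_status()
    closest = response.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None


for url in ["https://example.org/some-open-collection/document-1"]:
    print(url, "->", wayback_snapshot(url) or "not archived")
```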

This is also where our work takes on an actual archiving dimension: pretraining corpora create lasting copies that may not perfectly substitute for the originals (especially when they come from PDF documents, which are much costlier to save), but they still provide an additional layer of preservation that existing web infrastructure cannot currently guarantee.

Building the infrastructure for open data research

The turn to synthetic data has made open data significantly more valuable for training, not simply more ethical. Synthetic pipelines and agent traces require seeds: high-quality texts with advanced contextualization, provenance, and structure. They in fact call for actual data research in the open, with reproducible experiments and releasable samples at every stage of synthetic data transformation.
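
To make the idea of a seed concrete, here is a hypothetical record structure carrying the text together with its provenance and context; the field names are ours, not a schema used by Common Corpus or SYNTH:

```python
# Hypothetical seed record for a synthetic pipeline: field names are illustrative,
# not a schema used by Common Corpus or SYNTH.
from dataclasses import dataclass, field


@dataclass
class SeedDocument:
    text: str                                     # source passage used to seed generation
    source_url: str                               # where the original document lives
    license: str                                  # explicit license identifier, e.g. "CC-BY-4.0"
    language: str                                 # ISO 639 language code
    collection: str                               # parent collection within the corpus
    metadata: dict = field(default_factory=dict)  # extra context: date, author, OCR details


seed = SeedDocument(
    text="An open-licensed passage…",
    source_url="https://example.org/doc",
    license="CC-BY-4.0",
    language="hi",
    collection="example-open-government",
)
```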

We tried to exemplify this new approach with SYNTH, already our most successful data release after Common Corpus. But this is also a wider transformation in the field: Wikipedia and Wikidata have already become leading data sources for synthetic pretraining pipelines (in Qwen, StepFun, and others). Open data with clear provenance is becoming a source of data innovation, not just data compliance.

The anticipation of synthetic transformation changed our selection criteria. Even small collections can now become highly valuable, as we have mechanisms for indefinite amplification. Open synthetic pipelines might prove especially critical for low-resource languages, as rare samples can now be combined with conceptual information about the language itself, similarly to the actual language learning process. And then, truly, no language would be left behind.

At this point the open data paradox is far from resolved. For this release, we intentionally selected easy sources with full text available in the open. Yet we are scaling up our internal processing capacities and plan to integrate, over the next months, a large amount of PDF documents, mostly coming from open government collections. The combination of improved OCR, a new generation of strong non-Western models (like Sarvam and others for Indic scripts) and our provenance-first approach should allow us to unlock a substantial new layer of open content in future releases.

With each release, Common Corpus demonstrates that the gap between what could be open and what is accessible keeps shrinking, and that building the infrastructure to close it has consequences far beyond training data.

CONTACT@PLEIAS.FR