
Nov 27, 2025
We’ve spent the last two years proving a simple idea: open, well-structured data beats brute force. First with Common Corpus, the largest fully open, permissively licensed multilingual pretraining set (≈2T tokens, provenance-clean). Then with Pleias 1.0, a family of small, efficient models trained only on open data—strong baselines for retrieval and reasoning without hyperscaler budgets. And most recently with SYNTH, our fully synthetic reasoning dataset and pipeline showing how curated knowledge + constrained generation + verification loops can rival trillion-token habits—and do it cleanly.
Today we’re taking all those lessons and shipping them as a product:
Pleias Stratum, the data layer your AI stack is missing
Stratum turns messy enterprise content into agent-ready datasets for training, RAG, and autonomous workflows. Think of it as a prep + delivery layer for data: chunking, privacy, enrichment, harmonization, and continuous indexing—opinionated and automated.
What Stratum does
Advanced Document Processing
Precision extraction from complex PDFs, scans, tables, and forms via our purpose-built solution for document workflows, with strong support for European formats. This is the same "documents-in-the-wild" muscle we built while assembling Common Corpus and training small models that actually understand legalese, not just score well on benchmarks.
Privacy-Preserving Pipelines
Automatic detection and pseudonymization of personal data, engineered for GDPR environments and real-world messiness. Last week we wrote about the approach and why many generic redactors fail on live corpora.
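To make the idea concrete, here is a minimal, purely illustrative sketch of utility-preserving pseudonymization. Stratum's actual detection models and API are not public; this toy version handles only email addresses with a regex, but it shows the key property: repeat occurrences map to the same stable placeholder (so co-reference survives for downstream retrieval), and the mapping doubles as an audit trail.

```python
import re

# Toy email matcher; a real pipeline would use trained PII detectors.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]*\w")

def pseudonymize(text: str, mapping: dict[str, str]) -> str:
    """Replace emails with stable placeholders, recording them in `mapping`."""
    def repl(match: re.Match) -> str:
        email = match.group(0)
        # Reuse the same placeholder for repeat occurrences: utility-preserving,
        # because co-reference survives while the identity does not.
        if email not in mapping:
            mapping[email] = f"<EMAIL_{len(mapping) + 1}>"
        return mapping[email]
    return EMAIL_RE.sub(repl, text)

audit_log: dict[str, str] = {}
doc = "Contact anna@example.com or bob@example.org; cc anna@example.com."
print(pseudonymize(doc, audit_log))
# → Contact <EMAIL_1> or <EMAIL_2>; cc <EMAIL_1>.
```

The `audit_log` mapping is what makes the transformation reviewable and, where policy allows, reversible.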
Intelligent Data Enhancement
Automated classification, context-aware augmentation, semantic linking, and smart indexing so agents retrieve the right fact at the right time. This borrows directly from SYNTH’s “synthetic playgrounds”: grounding → constraints → verification → iteration.
Data Harmonization
Normalize legacy and unstructured sources to shared schemas and ontologies; emit structured outputs that drop straight into your lakehouse, vector stores, and search indexes. Yes, we learned this the hard way while building a 2-trillion-token corpus.
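Harmonization at its simplest means mapping inconsistent legacy field names onto one target schema. The sketch below is an assumption-laden toy: the `Record` schema and the `ALIASES` table are invented for illustration, not Stratum's output format, but they show the pattern of declaring one canonical schema and folding every source into it.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Hypothetical target schema for harmonized documents.
    title: str
    date: str  # ISO 8601
    body: str

# Legacy sources name the same fields differently; map each canonical
# field to the aliases seen in the wild.
ALIASES = {
    "title": ["title", "subject", "doc_name"],
    "date": ["date", "created", "timestamp"],
    "body": ["body", "content", "text"],
}

def harmonize(raw: dict) -> Record:
    """Fold a raw legacy record into the shared Record schema."""
    values = {}
    for field, candidates in ALIASES.items():
        # Take the first alias present; default to empty when none match.
        values[field] = next((raw[c] for c in candidates if c in raw), "")
    return Record(**values)

legacy = {"subject": "Q3 report", "created": "2025-09-30", "content": "Summary."}
print(harmonize(legacy))
# → Record(title='Q3 report', date='2025-09-30', body='Summary.')
```

Real harmonization also involves type coercion, ontology mapping, and validation, but the declare-one-schema-and-fold-everything-in pattern stays the same.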
Book a call to learn about Stratum
Why this matters now
Most AI pilots stall before production. The problem isn't "the model". It's the data. Raw dumps in the context window won't give you resilient agents. Stratum gives you curated chunks, privacy, enrichment, and verification, so your data-science team works on the actual use cases you want to ship, not on mountains of messy, siloed data.
Under the hood: what we learned and shipped
Common Corpus proved you can train serious models on clean, permissively licensed data if you enforce standards and provenance from day one. Stratum brings that discipline to your internal corpora.
Pleias 1.0 performs because the input pipeline is ruthless about noise, formats, and retrieval. Stratum encodes that ruthlessness so your teams don’t have to reinvent it.
The small models we shipped alongside SYNTH are trained with constrained generation anchored in curated knowledge, with verification loops. Stratum applies the same pattern to enrich your real data.
Our PII work focuses on utility-preserving pseudonymization and auditability.
What you can build with Stratum
Search + RAG that doesn’t crumble when formats get weird and privacy rules get strict.
Agentic workflows that rely on verified, enriched chunks instead of brittle prompts.
Training sets for small, efficient models—faster iterations, lower bills, clearer audits.
If you read our post The Model Is the Product, you know where we’re headed: reasoning models + disciplined data is how you ship durable value in this cycle. Stratum is the missing half of that equation for most teams.
Stratum is our first boxed product. It's opinionated because reality is. And it exists so your AI doesn't just demo well, but ships, at scale, on budget, with a paper trail.
Book a call to learn about Stratum
CONTACT US