Synth Beta: Frontier Data Efficiency

May 5, 2026

What a language model knows is decided long before training starts: by what enters the corpus, and by how often. Web crawls settle that decision in advance, overweighting what is already overrepresented online and underweighting what your domain actually depends on, with no handle on either. Yet a model cannot reliably recall what it was never reliably trained to remember. This compounds through every step of deployment: poor memorization and poor calibration on specialized knowledge result in unnecessary friction, retry loops, and hallucinations. An efficient agent should simply know, most of the time.

The Synth methodology is experimental proof that this ceiling can be moved by engineering the corpus rather than scaling it. We recently showed that an entire pretraining environment can be built from a small collection of 50,000 Wikipedia articles through systematic synthetic amplification under constraints on query, persona, style, language, and reasoning trace. Across three dense parameter tiers and one Mixture-of-Experts variant, the resulting models calibrate epistemic uncertainty, abstain on questions outside their seeds, and match or exceed open baselines that saw 10× to 140× more pretraining tokens (technical report coming soon).

What we found is a scalable recipe, controlled memorization through engineered synthesis, that makes it possible to selectively retain, in the model weights themselves, the information worth knowing.
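
To make the mechanism concrete, here is a minimal sketch of constrained amplification in Python. Everything in it is illustrative: the constraint axes, function names, and prompt template are hypothetical stand-ins, not the Synth API. It only shows the shape of the idea: one seed document fans out into many generation requests, each pinned to a sampled point in the constraint grammar.

```python
import random

# Hypothetical constraint axes; the actual grammar covers query,
# persona, style, language, and reasoning trace.
QUERY_TYPES = ["factual_recall", "comparison", "abstention_probe"]
PERSONAS = ["curious student", "domain expert", "skeptical journalist"]
LANGUAGES = ["en", "fr", "de"]

def amplify(seed_passage: str, n_variants: int = 8) -> list[dict]:
    """Expand one seed document into n constrained generation requests.

    Each request pins the same underlying facts to a different point in
    the constraint grammar, so a teacher model rehearses them under many
    surface forms: the mechanism behind controlled memorization.
    """
    requests = []
    for _ in range(n_variants):
        constraints = {
            "query_type": random.choice(QUERY_TYPES),
            "persona": random.choice(PERSONAS),
            "language": random.choice(LANGUAGES),
        }
        requests.append({
            "seed": seed_passage,
            "constraints": constraints,
            # The generator prompt is assembled from the seed plus the
            # sampled constraints, grounding every variant in the source.
            "prompt": (
                f"[{constraints['language']}] As a {constraints['persona']}, "
                f"write a {constraints['query_type']} exchange grounded "
                f"strictly in the passage below.\n\n{seed_passage}"
            ),
        })
    return requests
```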


Meet Synth Beta

Synth, at synth.pleias.dev, is our early productised pipeline, a workspace for designing and running our back-translation logic, constraint grammar, reasoning-trace adapters, and multilingual amplification on your own seed material: PDFs, regulatory codes, internal documentation, structured knowledge bases. Same recipe; your axes of control. The same method covers pretraining, mid-training, and post-training, since the distinction dissolves once you are designing the corpus directly.


What the beta opens up

- Design and run synthetic pre- or mid-training corpora from your own seeds or our existing seed resources, with inspection at every stage of the pipeline.

- Compliant by design: building on Common Corpus, we provide a large collection of open seeds allowing for commercial reuse at every step.

- Configure the constraint grammar — task mix, persona distribution, reasoning syntax, target languages — for the capabilities you want amplified (a hypothetical configuration is sketched after this list).

- Multilingual and cultural amplification: following the success of French-Personas, co-released with Nvidia, we plan to gradually include seed resources reflecting localized cultural realities.

- On-prem deployment for data that cannot leave your infrastructure.
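
As referenced above, here is one way such a configuration could look, a hypothetical sketch in plain Python. The field names mirror the axes named in the list but are not the actual Synth schema.

```python
# Hypothetical constraint-grammar configuration. Field names follow the
# axes described above; the real Synth schema may differ.
constraint_grammar = {
    "task_mix": {                 # share of each synthetic task family
        "grounded_qa": 0.5,
        "summarization": 0.2,
        "reasoning_trace": 0.2,
        "abstention_probe": 0.1,  # out-of-seed questions, teaching calibrated refusal
    },
    "persona_distribution": "uniform",  # could be weighted toward target users
    "reasoning_syntax": "scratchpad",   # surface format of reasoning traces
    "target_languages": ["en", "fr"],
}

# Task-mix proportions should sum to one.
assert abs(sum(constraint_grammar["task_mix"].values()) - 1.0) < 1e-9
```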

The beta is open to teams, researchers, and AI practitioners. Apply at synth.pleias.dev.

Contact: contact@pleias.fr