What the Community Built with SYNTH

Apr 2, 2026

SYNTH Community Spotlight

Open data is only as valuable as what people do with it. We released SYNTH with a bet: fully synthetic training works for small models, and synthetic playgrounds beat standard pretraining sources on data efficiency, which makes them a credible primary training infrastructure built on open, attributable data.

Here's what happened when other people got their hands on it.

SYNTH as Research Infrastructure

Before we get to the community projects, a quick flex: three independent research papers adopted SYNTH as a core training component. No coordination between them. Each uses SYNTH differently. All three converge on the same finding: SYNTH works as a reliable base distribution for reasoning-oriented training.

Step-DeepResearch (StepFun, 2025) plugs SYNTH into its mid-training pipeline for report generation and information-seeking. The 32B model hits 61.4% on Scale AI Research Rubrics, trading blows with OpenAI DeepResearch and Gemini DeepResearch at a fraction of the inference cost. Think David vs. two Goliaths, but David brought a synthetic sling.

IMU-1 (Grigorev, 2026) feeds 5.8B SYNTH tokens across stages 2 and 3 of a three-stage pretraining recipe. The resulting 430M model approaches SmolLM2-360M performance on 56× fewer total training tokens. Efficiency zealots, rejoice.

Reasoning Core (Lacombe et al., Inria/Univ. Lille, 2026) uses SYNTH as one of two primary pretraining baselines in its experiments. The paper's Figure 3 explicitly shows "D = SYNTH (Pre-Training)" curves, demonstrating that mixing procedurally generated symbolic data into SYNTH consistently improves downstream reasoning while preserving language modeling quality.

Three labs. Three use cases. Zero coordination. Same underlying resource: open, copyright-free, fully attributable. This is what open data infrastructure actually looks like: a shared foundation people build on in directions you didn't plan for.

New Architectures Trained on SYNTH

The most technically ambitious work didn't start from fine-tuning. It started from scratch. (We didn't expect that either.)

Mariusz Kurman (@mkurman88) trained two fully custom architectures on SYNTH at scale, both deviating substantially from standard transformer designs. When someone trusts your dataset enough to bet an entire novel architecture on it, that's about the sincerest compliment a dataset can get.

NeuroBLAST V3 is a novel architecture that uses SYNTH as its primary pretraining corpus.

ConvGPT is a second architectural direction, trained on 250B tokens of SYNTH. It shows that SYNTH's distribution is general enough to support architectures beyond standard self-attention.

Fine-Tuning & Language Adaptation

Two contributors took Baguettotron (Pleias's 321M-parameter reasoning model) and stretched it in directions we hadn't anticipated.

@darrenangle (on X) fine-tuned Baguettotron on poetry, using reverse-engineered SYNTH reasoning traces in his SFT pipeline. Teaching a reasoning model to write poetry is either beautiful or cursed (possibly both). The experiment opened a longer research thread on preference learning and reasoning format stabilization.
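We don't have his exact pipeline, but the core move is easy to sketch: pair each prompt with a completion that keeps the reasoning trace inline, so the model learns to "think before it rhymes." A minimal illustration, where the <think> tags, field names, and helper are placeholders rather than Baguettotron's actual trace format (any real pipeline should mirror SYNTH's own trace layout):

```python
# Illustrative sketch of building SFT pairs with an inline reasoning trace.
# The <think>...</think> wrapper and field names are placeholders, NOT
# Baguettotron's actual trace format; mirror SYNTH's real layout in practice.
def make_sft_example(prompt: str, trace: str, poem: str) -> dict:
    """Pair an instruction with a target that keeps the reasoning visible."""
    target = f"<think>\n{trace}\n</think>\n{poem}"
    return {"prompt": prompt, "completion": target}

example = make_sft_example(
    prompt="Write a quatrain about open data.",
    trace="Theme: openness. Form: quatrain, ABAB rhyme. Register: plain.",
    poem=(
        "The ledger of the web lies open wide,\n"
        "each token traced to sources we can see;\n"
        "no shadowed scrape with licenses to hide,\n"
        "the commons trained on data that is free."
    ),
)
print(example["completion"])
```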

Pieter Delobelle (who has since joined Pleias as Lead AI Scientist) went linguistic: he fine-tuned Baguettotron on the Dutch subset of SYNTH and released a filtered Dutch dataset independently, contributing back to the infrastructure. This is how open ecosystems are supposed to work.

The Dutch experiment points at a bigger question: can synthetic pretraining data, built on open and attributable sources, serve as a base for language-specific adaptation, without relying on web-scraped national corpora whose licensing sits somewhere between "ambiguous" and "don't ask"?
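For anyone who wants to replicate the move for their own language, the filtering step is a few lines. A minimal sketch with the Hugging Face datasets library, assuming the corpus lives on the Hub as PleIAs/SYNTH and carries an ISO code in a "language" field ("nl" for Dutch); check the dataset card for the actual schema:

```python
# Hypothetical sketch: carving a language-specific subset out of SYNTH.
# Repo id and field names are assumptions; verify against the dataset card.
from datasets import load_dataset

# Stream to avoid downloading the full corpus just to filter one language.
synth = load_dataset("PleIAs/SYNTH", split="train", streaming=True)

dutch = synth.filter(lambda sample: sample.get("language") == "nl")

# Peek at a few examples to sanity-check the filter before training.
for i, sample in enumerate(dutch):
    print(sample["text"][:200])  # assumes a "text" field holds the sample body
    if i >= 2:
        break
```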

Learning from SYNTH

Three projects treated SYNTH as a substrate for structured learning about data, scale, and what it actually takes to train a reasoning model from scratch.

CodeLion published sampled SYNTH subsets at three scales (10M, 100M, and 1B tokens), specifically designed to lower the barrier to entry. The accompanying analysis of optimal dataset mixing adds real methodological value on top.

The 10M subset fits on a single GPU and runs in a few hours. If you're a student, an indie researcher, or someone whose cloud budget is "hopes and dreams," this is your on-ramp.
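If that's you, the first step looks something like this: pull the subset, tokenize it, and pack it into fixed-length blocks for causal-LM training. The repo id codelion/synth-10m and the "text" field below are illustrative guesses, not the published names; check the actual subsets before running:

```python
# Minimal single-GPU on-ramp sketch: tokenize a small SYNTH subset and pack
# it into fixed-length blocks. Repo id and field names are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK = 1024  # context length for a small from-scratch model

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works
subset = load_dataset("codelion/synth-10m", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

def pack(batch):
    # Concatenate all token ids, then slice into BLOCK-sized chunks.
    ids = [tok for seq in batch["input_ids"] for tok in seq]
    total = (len(ids) // BLOCK) * BLOCK
    chunks = [ids[i : i + BLOCK] for i in range(0, total, BLOCK)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

tokenized = subset.map(tokenize, batched=True, remove_columns=subset.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)
print(f"{len(packed)} training blocks of {BLOCK} tokens")
```

From there, any standard causal-LM trainer will take the packed blocks as-is.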

Brendan Hogan used filtered SYNTH as the primary data source for Day 20 of his Advent of Small ML series: training a compact reasoning model from scratch as a self-contained educational experiment. SYNTH made this possible because its provenance is transparent. No licensing roulette, no opaque filtering decisions, no data contamination surprises.

Shane Caldwell published "Twenty Billion Tokens of What, Exactly?", a detailed independent analysis in which he manually reviewed hundreds of samples from C4, FineWeb, and FineWeb-Edu. The verdict: web-scraped corpora, even the carefully filtered ones, remain heavily skewed toward ads, boilerplate, and incoherent fragments. The article ends with a dedicated section on SYNTH as the most promising alternative: coherent reasoning traces, no ads, grounded in verifiable sources. When someone with no stake in your project digs through the data and comes away impressed, that hits different from a citation.

What This Tells Us

The range of what's been built on SYNTH reflects a property that's paramount for us: breadth of applicability. A corpus that only works at scale, or only for fine-tuning, or only for transformers, is infrastructure for one use case. SYNTH appears to work across a genuinely diverse set of technical contexts.

The multilingual dimension deserves a spotlight. Dutch Baguettotron is the clearest example so far of language-specific adaptation from a fully open synthetic base, and it raises the question of whether models trained exclusively on open, copyright-free data can go toe-to-toe with models trained on data nobody's allowed to look at.

The lab-level adoption confirms the trajectory: structured synthetic data has graduated from "interesting experiment" to standard pretraining ingredient. This is also reflected in the Synthetic Pretraining Tracker maintained by Eric Tramel (Principal Research Scientist, NVIDIA), a running record of synthetic data usage in open-weight LLM pretraining since January 2024, last updated March 15, 2026. It covers flagship open-weight models across all major labs. The vast majority of entries report either undisclosed synthetic data usage or partial percentages: Phi-4 at 55%, Trinity Large at 47.1%, Nemotron-3 Nano at 10%. Monad and Baguettotron are the only two models in the entire tracker at 100% synthetic data. Everyone else is blending; we went all in. Full transparency is possible precisely because the data is built on open sources rather than scraped from the web.

Built something with SYNTH? We want to hear about it. Open a discussion or reach out directly; bragging rights are free and fully attributable.

CONTACT@PLEIAS.FR
