
Mar 17, 2026
Across Europe, a growing number of industries need AI trained on realistic data, but the data they actually need is too sensitive to use, too regulated to share, or simply doesn't exist in usable form. Synthetic data is seen as a promising way to simulate data we don't have, and since the release of SYNTH it has become a core focus of our ongoing applied R&D. Yet to be useful, synthetic data needs to be grounded in realistic social data, which includes people with plausible names, occupations, locations, and demographics.
In collaboration with Nvidia, we're releasing Nemotron-Personas-France, the first European dataset in the Nemotron Personas series: a comprehensive, statistically grounded resource for generating realistic French synthetic personas.
Grounded personas for regulated sectors and beyond
Nemotron-Personas-France addresses a practical data friction we've encountered across multiple industry projects in France and Europe. Most organizations in regulated sectors do have internal data, but using it for AI training is rarely straightforward. Personal identifiers are typically stripped or masked before any dataset can be shared, even between departments, resulting in partially redacted files that are difficult to train on effectively. In many cases, data simply cannot be communicated at all due to regulatory or contractual constraints, and the approval process to access it can take months.
Synthetic data generation offers a way to bypass these bottlenecks entirely: rather than working around redacted records, we can simulate realistic documents from scratch, with plausible and demographically consistent profiles built in from the start. We have already deployed this approach in production in several sectors, including healthcare (clinical conversations), transportation (traveler feedback), banking (filings and claims), and telecommunications (through our contribution to the recent GSMA data initiative for AI). In all of these projects, the generated data needs to refer to people. If those profiles are generic or implausible, the trained models inherit that lack of realism. Nemotron-Personas-France provides the demographic foundation to generate profiles that are statistically consistent with the actual population.
The same logic extends well beyond regulated industries. Use cases that require realistic, population-grounded personas - from model evaluations in realistic settings and red-teaming to conversational AI benchmarks and user simulation - will benefit from the same demographic consistency that makes synthetic data credible in production.
From census data to synthetic personas
The Nemotron-Personas-France dataset covers the full demographic profile of the country at the commune level: population, age-sex pyramids, occupational categories (the eight French CSP, or catégories socioprofessionnelles), education levels, household types, and median income.
This breadth was made possible by France's extensive open data program, whose datasets are reusable under the Licence Ouverte. Since 2004, INSEE, the national statistics agency, has operated a permanent rolling census covering all 35,000 communes, producing detailed data on age, sex, occupation, education, household structure, and income. For first names, we have annual records going back to 1900. For surnames, we have a département-level file covering births from 1891 to 2000.
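To make the idea of census-grounded seeding concrete, here is a minimal sketch of drawing a persona seed from commune-level tables. It is an illustration under stated assumptions, not the pipeline used to build the dataset: the commune names, the population counts, the age bands, and every probability below are made-up placeholders rather than INSEE figures, and `sample_persona` is a hypothetical helper.

```python
import random

# Made-up placeholder tables, NOT real INSEE figures.
# Each commune has a population and a joint age-sex distribution
# (a coarse stand-in for an age-sex pyramid).
COMMUNES = {
    "commune_a": {
        "population": 12000,
        "age_sex": {
            ("18-44", "F"): 0.26, ("18-44", "M"): 0.24,
            ("45+", "F"): 0.27, ("45+", "M"): 0.23,
        },
    },
    "commune_b": {
        "population": 3000,
        "age_sex": {
            ("18-44", "F"): 0.18, ("18-44", "M"): 0.17,
            ("45+", "F"): 0.34, ("45+", "M"): 0.31,
        },
    },
}


def sample_persona(communes, rng):
    """Draw a commune proportionally to population, then an age-sex cell."""
    names = list(communes)
    commune = rng.choices(
        names, weights=[communes[n]["population"] for n in names], k=1
    )[0]
    cells = list(communes[commune]["age_sex"])
    age_band, sex = rng.choices(
        cells, weights=[communes[commune]["age_sex"][c] for c in cells], k=1
    )[0]
    return {"commune": commune, "age_band": age_band, "sex": sex}


rng = random.Random(0)  # fixed seed so the draw is reproducible
personas = [sample_persona(COMMUNES, rng) for _ in range(5)]
for persona in personas:
    print(persona)
```

Sampling the commune first and the age-sex cell second keeps each draw consistent with the local pyramid. A fuller version would condition further attributes (occupation, education, household type) on the cells already drawn rather than sampling them independently.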
A particular challenge has been to account for people born abroad who later become French: about 10% of the population, or 7 million people, and yet mechanically absent from several official statistics, such as the surname and given-name files. Without correction, any persona generation based on these files would produce a skewed representation, significantly underestimating the real diversity of the French population. We cross-referenced multiple public sources (INSEE population data, INED's Trajectoires et Origines survey) to correct this bias and ensure that generated personas faithfully reflect the demographic composition of each département.
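One simple way to picture this kind of correction is as a mixture: the name distribution observed in birth records is blended with a separately estimated distribution for people born abroad, weighted by their share of the local population. The sketch below assumes that framing; it is not the method actually used for the dataset. The function `blend_name_distribution`, the placeholder names, and the probabilities are hypothetical, and the 0.10 share simply reuses the national figure quoted above.

```python
def blend_name_distribution(native, foreign_born, foreign_share):
    """Mix two name distributions, weighting by the foreign-born share.

    native / foreign_born: dicts mapping a name to its probability.
    foreign_share: fraction of the population born abroad (0.0 to 1.0).
    """
    if not 0.0 <= foreign_share <= 1.0:
        raise ValueError("foreign_share must be between 0 and 1")
    names = set(native) | set(foreign_born)
    blended = {
        name: (1.0 - foreign_share) * native.get(name, 0.0)
        + foreign_share * foreign_born.get(name, 0.0)
        for name in names
    }
    # Renormalize in case the inputs did not sum exactly to 1.
    total = sum(blended.values())
    return {name: weight / total for name, weight in blended.items()}


# Placeholder distributions for one département (illustrative only).
native = {"name_a": 0.7, "name_b": 0.3}        # from birth records
foreign_born = {"name_b": 0.4, "name_c": 0.6}  # estimated from survey data
corrected = blend_name_distribution(native, foreign_born, foreign_share=0.10)
print(corrected)
```

In this toy case `name_c`, which never appears in the birth records, ends up with a non-zero probability after blending, which is the effect the correction is meant to achieve.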
Seeding as data infrastructure
Nemotron-Personas-France deliberately mirrors the schema of the US Nemotron Personas, making it interoperable and comparable, but the challenges it solves are distinctly European. The commune-level granularity of French public statistics, the complexity of demographic dynamics, the need to reconstruct indirectly what the census doesn't measure: these required building something new, not porting an existing approach.
For Pleias, this work is part of a broader trajectory. Our synthetic pipelines, from SYNTH, our generalist pretraining dataset, to the specialized environments we build for industrial partners, increasingly rely on personas as a grounding mechanism. Not just for realism, but for diversity: ensuring that synthetic data reflects the actual range of people who will use, appear in, or be affected by the systems trained on it.
We believe this kind of demographic grounding will become standard practice in synthetic data generation. And we believe the open, auditable, EU AI Act-compatible approach we've taken here (building on public statistics, releasing methodology and data) is the right way to build it.
The dataset is available now on Hugging Face: https://huggingface.co/datasets/nvidia/Nemotron-Personas-France


