SYNTH, a practical pipeline for synthetic data in e-commerce
Nov 12, 2025

Customers don’t churn because they hate your brand. They churn because answers are slow, product data is messy, searches are failing, support is inconsistent across languages, or nobody trusts what the model says when it can’t show its work. This is a data problem first, not a model problem.
We are building Pleias Stratum, an AI-Ready Data Layer that brings automation to your biggest genAI application pain points. Stratum became possible thanks to our research on the synthetic playground and on generating all the needed data artifacts at scale. Benchmarking, data augmentation, indexing, retrieval: it can all be automated at scale.
As an initial demonstration of the internal tooling supporting Stratum, we released SYNTH, a large, license-clear synthetic dataset that addresses these typical data problems. This week we released three things:
SYNTH dataset. A grounded, license-clear synthetic corpus built from a curated open seed of more than 50,000 high-value pages. Expanded into millions of examples across tasks. With provenance attached to every row. Multilingual by design, with strong coverage in major European languages and room to grow.
Baguettotron (321M). A deep small reasoner tuned to follow evidence, explain itself, and play nicely with retrieval. Built to serve real business flows where quality and latency both matter.
Monad (56M). An ultra-small, deep architecture for tight budgets and strict latency and reliability requirements. Handy for edge, on-device, or high-throughput server paths where cost per request must be predictable.
Use any piece on its own. Combine them for synergies. Drop us a line if you want to test the full power of Pleias Stratum.
Now, let us show how the ideas we developed in SYNTH apply to real market use cases. Today, we’ll use e-commerce platforms as an example.
Recommendation systems. Modern rec-sys models often fail because product data is inconsistent across brands and languages. Our pipelines solve this by generating structured examples that connect product attributes, reviews, and usage context into a clean, multilingual training signal. Imagine a dataset where “color,” “fit,” “material,” and “occasion” fields are harmonized automatically, linked back to evidence from specs and verified descriptions. Your rec-sys learns relationships, e.g., “users who bought waterproof shoes also viewed breathable jackets” — grounded in facts, not click noise. The result: better personalization, less cold-start pain, and cleaner signals for new SKUs.
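As a rough illustration of what harmonized attribute fields could look like, here is a minimal Python sketch. It is not the SYNTH pipeline: the alias tables, the `Attribute` record, and the `harmonize` function are hypothetical names invented for this example, and a real system would curate or learn the mappings rather than hard-code them.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical alias tables (assumption): raw field names and values in
# several languages mapped onto one canonical vocabulary.
FIELD_ALIASES = {
    "color": "color", "colour": "color", "couleur": "color", "farbe": "color",
    "material": "material", "matière": "material", "stoff": "material",
    "fit": "fit", "coupe": "fit", "passform": "fit",
}
VALUE_ALIASES = {
    "color": {"rouge": "red", "rot": "red"},
    "material": {"coton": "cotton", "baumwolle": "cotton"},
}


@dataclass(frozen=True)
class Attribute:
    field: str     # canonical field name
    value: str     # canonical value
    evidence: str  # the raw spec entry the value was read from


def harmonize(raw_specs: dict[str, str]) -> list[Attribute]:
    """Map raw spec entries to canonical attributes, keeping the evidence."""
    result = []
    for raw_field, raw_value in raw_specs.items():
        field = FIELD_ALIASES.get(raw_field.strip().lower())
        if field is None:
            continue  # unknown field: skip it rather than guess
        value = raw_value.strip().lower()
        value = VALUE_ALIASES.get(field, {}).get(value, value)
        result.append(Attribute(field, value, f"{raw_field}: {raw_value}"))
    return result


if __name__ == "__main__":
    for attr in harmonize({"Couleur": "Rouge", "Stoff": "Baumwolle", "SKU": "123"}):
        print(attr)
```

Keeping the raw spec entry next to each canonical value is the point of the sketch: every harmonized field can still be traced back to the text it came from.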
Sales chatbots on landing pages. We can also generate conversation pairs that tie customer intents to verified business rules: returns, sizing, shipping, and warranty. Each chatbot answer traces back to real documentation or product metadata, so it’s safe to show and easy to audit. For example, when a user asks “Can I return this item after 30 days?”, the bot retrieves and cites the correct clause from your policy dataset. The result is explainable automation that drives higher conversion and reduces escalations, something you can deploy both in-market and on-device with our small reasoners, such as Baguettotron or Monad.
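The retrieve-and-cite behavior described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the policy clauses are invented for illustration, the `Clause` record and `answer` function are hypothetical names, and plain word overlap stands in for whatever retriever a production bot would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    clause_id: str
    text: str


# Illustrative policy clauses (assumption), not a real policy.
POLICY = [
    Clause("returns-1", "Items can be returned within 30 days of delivery."),
    Clause("shipping-1", "Standard shipping takes 3 to 5 business days."),
    Clause("warranty-1", "Electronics carry a 2 year warranty."),
]


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer(question, policy=POLICY):
    """Return the best-matching clause and its ID, or None if nothing matches."""
    q = tokens(question)
    best = max(policy, key=lambda clause: len(q & tokens(clause.text)))
    if not q & tokens(best.text):
        return {"answer": None, "citation": None}  # refuse rather than guess
    return {"answer": best.text, "citation": best.clause_id}


if __name__ == "__main__":
    print(answer("Can I return this item after 30 days?"))
```

The bot only ever returns clause text together with its identifier, so each answer can be audited against the policy dataset. Word overlap is deliberately naive here; it ignores stop words and paraphrase, which a real retriever would need to handle.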
Why this matters
Teams today don’t lack ideas; they lack clean, harmonized, attributable data. And that data keeps changing: feeds are messy, reviews are unreliable, policies vary by country, and privacy rules never stay still. Your team needs examples that teach the right behavior, in every category and in every language that matters, with citations and licenses included. That is what SYNTH supplies. It is a factory, not a notebook.
Every SYNTH answer ties back to a source paragraph. We generate queries, retrieve supporting evidence, and draft an answer that cites where it came from, keeping the license and provenance with the row. We publish versioned snapshots your data team can track. The result: grounding reduces hallucination, provenance reduces risk, and versioning makes audits real.
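To make the row-level idea concrete, here is a minimal Python sketch of a row that carries its source and license, plus a content hash used as a snapshot identifier. It is an assumption-laden toy, not the SYNTH implementation: the `Row` fields, `make_row`, and `snapshot_id` are hypothetical, the query is passed in rather than generated, retrieval is plain word overlap, and the "draft" step simply quotes the source where a real pipeline would call a generator.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Row:
    query: str
    answer: str
    source_id: str  # provenance: which seed paragraph backs the answer
    license: str    # license carried over from the seed paragraph


def words(text):
    """Lowercase words of a string with basic punctuation stripped."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def make_row(query, seed):
    """seed maps source_id -> {"text": ..., "license": ...} (assumed layout)."""
    source_id = max(
        sorted(seed),
        key=lambda sid: len(words(query) & words(seed[sid]["text"])),
    )
    source = seed[source_id]
    # Placeholder draft (assumption): quote the evidence and cite its ID.
    answer = f'{source["text"]} [{source_id}]'
    return Row(query, answer, source_id, source["license"])


def snapshot_id(rows):
    """Deterministic identifier for a set of rows, derived from their content."""
    payload = json.dumps([asdict(r) for r in rows], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    seed = {
        "p1": {"text": "Wool shrinks in hot water.", "license": "CC-BY-4.0"},
        "p2": {"text": "Leather needs conditioning.", "license": "CC0-1.0"},
    }
    row = make_row("Does wool shrink in hot water?", seed)
    print(row)
    print(snapshot_id([row]))
```

Because the snapshot identifier is computed from the rows themselves, any change to a query, answer, source, or license yields a different identifier, which is one simple way to make versions trackable.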
What the numbers mean for you
More than 50,000 curated seed pages: broad, dependable coverage of core facts and concepts. Your model learns from structured, important knowledge, not random web noise.
Millions of examples: enough breadth for complex workflows. Rare cases stop being rare. You can train specialized behaviors instead of hoping for them.
Multilingual split, with a focus on European languages: the same quality in Paris, Berlin, Madrid, and beyond. Not just English. Your policies, your tone, your markets.
Numbers are only useful when they change outcomes. SYNTH delivers measurable wins fast: from cleaner product attributes and explainable recommendations to policy-true returns and multilingual size guidance. It helps you fix marketplace quality, power grounded Q&A, and cut return rates in one go. Every use case ties to hard metrics like faster answers, better search, and improved conversion.
Why now, and why us
Open ecosystems need sovereign capabilities. The alternative is renting core knowledge from a black box and hoping it behaves. We have spent years building on open foundations and releasing work people actually use. That experience lives in SYNTH and will surface in Stratum: grounded data beats smart prompts, and small models punch above their size because the signal is right.
Use our models and data today. If you want a guided pilot, write to us. We’ll help you pick the first domain, ship a snapshot, and measure the lift.
SYNTH turns open knowledge and your policies into teachable moments your team can trust, your counsel can sign, and your customers can feel. That is how you grow.
We will share more on our approach to domain packs, evaluation, and deployment on top of SYNTH.