Seek No More: A European Answer to a Global Problem
Jan 28, 2025

DeepSeek has just shown the world that it is possible to train frontier AI models on a budget.
This did not surprise us at pleias.
pleias 1.0: AI can be scaled on a budget, with open data
Last month we released the first LLMs trained exclusively on free and non-copyrighted data. They were also likely some of the least expensive models in the world: less than 100,000 GPU hours for three models. A very small team of 4-5 people created an entirely new pretraining set of two trillion tokens, checked rights conformity, designed entirely new data pipelines from scratch, and trained state-of-the-art tokenizers for European languages.
Our first models were not conceived as chatbots, but as tools to process data at scale. They have all the features needed for local deployment and sensitive use cases:
trained on ethical data with full conformity to the AI Act,
full multilingual support of all the major European languages,
trained on a large variety of non-web sources (including noisy PDFs),
optimized for fast inference even on constrained infrastructure.
Over the past months we have been finalizing the first applied versions for major text-processing tasks: OCR correction, translation, text annotation, GDPR conformity, toxicity filtering. For all these tasks our models are unusually robust and verifiable, and always output structured data or texts.
Yet we have been surprised by the unexpected performance of our models on Retrieval-Augmented Generation (RAG). It turns out that including a reasoning step similar to o1 or R1 has a dramatic impact on the quality of very small models. Our 300-million-parameter SLM (the size of the original GPT-2 medium) manages to answer accurately, analyze sources, and provide citations across a wide variety of European languages.
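The idea can be illustrated with a minimal sketch: the model is prompted to reason over the retrieved sources before answering, and to emit a structured response whose citations can be checked against the source list. The prompt layout, the tag names, and the helper functions below are illustrative assumptions, not pleias's actual format.

```python
# Sketch of RAG prompting with an explicit reasoning step and
# verifiable citations. The <analysis>/<answer> tags and [n]
# citation markers are hypothetical conventions for illustration.
import re

def build_rag_prompt(question: str, sources: list[str]) -> str:
    """Assemble a prompt: numbered sources, then an instruction to
    reason over them before answering with [n]-style citations."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n\n"
        "First reason over the sources inside <analysis>...</analysis>, "
        "then give the final answer inside <answer>...</answer>, "
        "citing sources as [n]."
    )

def parse_response(text: str) -> dict:
    """Extract the reasoning, the answer, and the set of cited
    source numbers from a structured model response."""
    analysis = re.search(r"<analysis>(.*?)</analysis>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    answer_text = answer.group(1).strip() if answer else ""
    return {
        "analysis": analysis.group(1).strip() if analysis else "",
        "answer": answer_text,
        "citations": sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer_text)}),
    }
```

Because the output is structured, every citation number can be mechanically matched back to a retrieved source, which is what makes the answers verifiable rather than free-form.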
An industrial plan for Europe
We now feel a responsibility. We are currently one of the few AI labs that could contribute to a European DeepSeek. Back in January 2024, our first research project, unfortunately unfunded, aimed to create a Mixture-of-Experts model with enhanced reasoning capacities. This is basically the core recipe of V3. Currently we are one of a handful of European model trainers, without any US funds or involvement, to be audited by the AI Office.
There is still time to create a competitive European ecosystem for AI.
In Europe, we have the expertise. We have the talent. We increasingly have the required compute, thanks to the investments of our public scientific infrastructure. We just need an actual industrial policy at the European scale.