WE DESIGN AND PRETRAIN THE WORLD’S MOST EFFICIENT LLMS TO PROVE THAT AI CAN BE SCALED ON A BUDGET - WITH OPEN DATA

OUR FOUNDATION MODELS

PROFICIENT AT KEY ENTERPRISE TASKS

LEAN & FAST

COMPLIANT & TRANSPARENT

TEST PLEIAS 1.0 TODAY

Get Started

FOUNDING TEAM

PIERRE-CARL LANGLAIS

PhD in Information Science

Associate Researcher at the Sorbonne Center for Artificial Intelligence and Sciences Po Médialab

IVAN YAMSHCHIKOV

PhD in Financial Mathematics

Research Professor for Artificial Intelligence at CAIRO, Technical University of Applied Sciences Würzburg

Ex-Yandex (R&D Lead in Human-Machine Interaction), Ex-ABBYY (AI Evangelist)

ANASTASIA STASENKO

PhD in Philosophy, ENS Ulm

Associate Senior Lecturer at Sorbonne-Nouvelle University

Ex-Hachette Publishing

TEAM

PAVEL CHIZHOV

LLM SCIENTIST

PhD, University of Würzburg / University of Tartu

Ex-Yandex

HANNA SHCHARBAKOVA

AI ENGINEER

M.S. University of Groningen/University of Lorraine

B.S. Higher School of Economics

GABRIEL ABENHAIM

AI ENGINEER

M.Eng. CentraleSupélec

B.S. Paris-Saclay

YANNICK DETROIS

LLM SCIENTIST

M.Eng. EPFL

ANTON CHANGALIDI

LEAD AI ENGINEER

M.Eng. Maastricht University

CARLOS ROSAS

SENIOR DATA SCIENTIST

PhD, ENS Ulm

M.S. Sorbonne Université

IAROSLAV NEVEROV

FULL-STACK AI ENGINEER

M.Eng. Ecole 42

Ex-Sberbank

MOHAMED HADJ RABAH

COMMUNICATIONS LEAD

M.A. Sorbonne-Nouvelle

PANDORA LANGLAIS

PROJECT MANAGER

M.A. Ecole du Louvre

OUR PARTNERS

Open Trusted Data Initiative Lead

Member of the Inception Program

Local AI Builders

RESEARCH

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Language models benefit greatly from efficient tokenization, yet most still rely on the classical BPE algorithm, a simple and reliable method that has nonetheless been shown to cause issues such as under-trained tokens and sub-optimal compression, which can hurt downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that it does not reduce downstream performance and in several cases improves it.
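
To make the idea of in-training vocabulary refinement concrete, here is a minimal, self-contained Python sketch of BPE training with an extra pruning step after each merge. The pruning rule (dropping merged tokens that have become nearly unused in the segmented corpus) and the keep_threshold parameter are illustrative assumptions for this sketch, not the exact scoring criterion used by Picky BPE.

```python
# Toy BPE training with an in-training vocabulary refinement step.
# The pruning rule (drop merged tokens whose corpus usage falls under
# `keep_threshold` of the most-used token) is an assumption made for
# this sketch, not the paper's criterion.
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    out = []
    for symbols, freq in words:
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        out.append((new_symbols, freq))
    return out


def train_refined_bpe(corpus, num_merges=50, keep_threshold=0.05):
    word_freqs = Counter(corpus.split())
    words = [(list(w), f) for w, f in word_freqs.items()]
    vocab = {ch for w in word_freqs for ch in w}

    for _ in range(num_merges):
        pairs = pair_counts(words)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        words = merge_pair(words, best)
        vocab.add(best[0] + best[1])

        # Refinement: measure how often each token still appears after the
        # merge and drop multi-character tokens that have become (nearly)
        # unused, i.e. intermediate merges absorbed by longer tokens.
        # A full implementation would re-segment any leftover occurrences.
        usage = Counter()
        for symbols, freq in words:
            for s in symbols:
                usage[s] += freq
        top = max(usage.values())
        vocab = {t for t in vocab if len(t) == 1 or usage[t] / top >= keep_threshold}

    return vocab


if __name__ == "__main__":
    text = "low low lower lowest new new newer newest wide wider widest"
    print(sorted(train_refined_bpe(text, num_merges=20)))
```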

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content: many are historical documents produced by Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, composed of texts classified across five dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.
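
As a deliberately simplified illustration of the filtering stage, the Python sketch below scores each document along the five dimensions named in the abstract and applies a balanced keep/drop policy. The score_fn argument is a stand-in for a trained classifier such as Celadon (whose real interface may differ), and the threshold values are assumptions for this sketch, not the paper's calibrated settings.

```python
# Minimal sketch of multi-dimension toxicity filtering. The dimension
# names follow the abstract; the scorer and thresholds are placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = [
    "race_origin",   # racial / origin-based discrimination
    "gender_sex",    # gender / sex-based discrimination
    "religion",      # religious discrimination
    "ability",       # ability-based discrimination
    "violence",      # violence and abuse
]


@dataclass
class FilterDecision:
    keep: bool                  # include the document in the pretraining set
    scores: Dict[str, float]    # per-dimension toxicity scores in [0, 1]
    reason: str


def filter_document(
    text: str,
    score_fn: Callable[[str], Dict[str, float]],
    hard_threshold: float = 0.8,   # drop outright above this score
    soft_threshold: float = 0.5,   # keep but flag (e.g. for review/rewriting)
) -> FilterDecision:
    """Balanced filtering: drop clearly harmful documents while keeping
    borderline ones, so rare historical/OCR sources are not over-filtered."""
    scores = score_fn(text)
    worst_dim = max(scores, key=scores.get)
    worst = scores[worst_dim]
    if worst >= hard_threshold:
        return FilterDecision(False, scores, f"dropped ({worst_dim}={worst:.2f})")
    if worst >= soft_threshold:
        return FilterDecision(True, scores, f"flagged ({worst_dim}={worst:.2f})")
    return FilterDecision(True, scores, "kept")


def filter_corpus(docs: List[str], score_fn) -> List[str]:
    return [doc for doc in docs if filter_document(doc, score_fn).keep]


if __name__ == "__main__":
    # Dummy scorer standing in for a real classifier.
    def dummy_scores(text: str) -> Dict[str, float]:
        flagged = "attack" in text.lower()
        return {dim: (0.9 if flagged else 0.05) for dim in DIMENSIONS}

    docs = ["A digitised 19th-century gazetteer entry about canals.",
            "A call to attack a group of people."]
    print(filter_corpus(docs, dummy_scores))
```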

CONTACT@PLEIAS.FR

CONTACT US