(01)
We develop unique multilingual synthetic data capacity
Through novel LLM-driven approaches to rephrasing, refining and re-documenting original content, we build massive, high-quality synthetic datasets and routinely expand them for customer use cases.
(02)
We build and open corpus mining pipelines
Numerous untapped training data sources exist beyond the typical web archives and copyrighted material. We develop innovative pipelines for corpus preparation, along with models capable of recognizing various layouts, allowing for the integration of overlooked open data, open science and cultural heritage resources, particularly those in PDF format.
(03)
We integrate and support semantic data
We build an extensive collection of semantic web data for pretraining and alignment, covering a wide range of standards that match customer use cases: XML, XBRL, RDF.
PIERRE-CARL LANGLAIS
/01
Associate Researcher at Sorbonne Center for Artificial Intelligence and Sciences Po Médialab
/02
Previously at opsci.ai - developed pioneering LLM assistants for French Public Services (Albert) and for the Ministry of Education (Cassandre)
/03
Co-author of OA Diamond Study
ANASTASIA STASENKO
/01
Associate Researcher at Sorbonne Center for Artificial Intelligence and Sciences Po Médialab
/02
Previously at opsci.ai - developed pioneering LLM assistants for French Public Services (Albert) and for the Ministry of Education (Cassandre)
/03
Co-author of OA Diamond Study
/04
Formerly at Hachette Livre - digital learning product manager, transforming editorial heritage into digital products
CONTACT US