French-Science-Commons: The Largest Open Corpus for Sciences in French

Mar 19, 2026

To mark the Week of the French Language and the Francophonie, we are releasing French-Science-Commons on Hugging Face: the largest open, structured corpus entirely dedicated to scientific production in French.

Developed by Pleias in collaboration with OPERAS and the Quebec Research Chair on the Discoverability of Scientific Content in French, with support from the General Delegation for the French Language and the Languages of France (DGLFLF), French-Science-Commons brings together 1,248,860 scientific documents published between 2007 and 2026 under permissive licences and referenced in OpenAlex, HAL and theses.fr.

Why a French Scientific Corpus?

Scientific production today is overwhelmingly dominated by the English language. This creates a structural problem: francophone research becomes harder to find, harder to cite, and harder to build on. The concept of discoverability captures exactly this challenge: how do you make content that already exists genuinely findable?

French-Science-Commons is a direct response. By curating a large-scale, permissively licensed corpus of French scientific texts, we provide material for tools and systems that can surface francophone research where it matters: in search engines, in language models, and in the workflows of researchers, students, and institutions.

What's in the Corpus

The corpus contains 1,189,628 articles and 59,232 theses across all major disciplinary fields: natural sciences, engineering and technology, medical and health sciences, agricultural and veterinary sciences, social sciences, humanities and arts. This balanced multidisciplinarity makes it relevant far beyond any single field.

Built for Downstream AI Use

Scientific documents — especially theses and journal articles — are typically locked in PDF format, where the visual layout carries meaning: headings, sections, footnotes, tables, and formulas are all structural elements that a flat text extraction pipeline would lose or mangle. To preserve this structure, the documents in French-Science-Commons were processed using vision-language models (VLMs) that interpret the visual layout of each page and produce clean, structured text that retains headings, formatting hierarchies, and document organisation. This goes well beyond conventional OCR or text extraction: the result is a corpus where document structure is a first-class citizen, not an afterthought.

Each document is accompanied by rich, normalised metadata — disciplinary classification across 689 categories and 7 high-level fields, DOIs, licence types, source provenance, language tags, and word counts — all drawn from OpenAlex, HAL, and theses.fr and harmonised into a consistent schema. The corpus is released in Parquet format under a CC-BY licence, making it immediately consumable by standard data tooling and ready for integration into training, retrieval, and analysis pipelines without additional preprocessing.
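As a rough illustration of how this metadata can drive selection, here is a minimal sketch that filters records by field, licence, and length. The column names (`field`, `licence`, `word_count`) and the sample rows are hypothetical placeholders rather than the published schema, so check the dataset card for the actual names before reusing it.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# Hypothetical column names: the released schema may name these differently.
FIELD_KEY = "field"
LICENCE_KEY = "licence"
WORDS_KEY = "word_count"

Record = Dict[str, Any]


def select_documents(
    records: Iterable[Record],
    field: Optional[str] = None,
    licence: Optional[str] = None,
    min_words: int = 0,
) -> Iterator[Record]:
    """Yield records matching the given field and licence with at least min_words words."""
    for record in records:
        if field is not None and record.get(FIELD_KEY) != field:
            continue
        if licence is not None and record.get(LICENCE_KEY) != licence:
            continue
        # Treat a missing or null word count as zero.
        if (record.get(WORDS_KEY) or 0) < min_words:
            continue
        yield record


# Made-up rows for illustration only; they do not come from the corpus.
sample = [
    {"id": "doc-1", "field": "Social sciences", "licence": "cc-by", "word_count": 8200},
    {"id": "doc-2", "field": "Social sciences", "licence": "cc-by", "word_count": 450},
    {"id": "doc-3", "field": "Natural sciences", "licence": "cc-by", "word_count": 12000},
]

long_social_science = list(
    select_documents(sample, field="Social sciences", min_words=1000)
)
```

Because the function accepts any iterable of dictionaries, the same filter should work on rows read from a Parquet file or streamed from the Hub, once the keys are adjusted to the real schema.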

Use Cases

The corpus was built and structured with multiple use cases in mind:

Training and fine-tuning language models. French-Science-Commons can serve as a high-quality pre-training or fine-tuning resource for specialised language models in French, filling a gap that general-purpose corpora rarely address.

RAG and agentic search systems. The full-text content and rich metadata make it well-suited for retrieval-augmented generation pipelines and agentic search workflows that need to ground answers in verified scientific sources.

Semantic exploration. The disciplinary annotations and full-text availability open the door to interactive semantic mapping, enabling researchers to navigate thematic landscapes across French-language science.

Indexing and classification. The structured fields support automated indexing, topic modelling, and cross-disciplinary classification tasks.

A Step Toward Francophone Digital Commons

French-Science-Commons is also an initial result of a wider ambition: building digital commons within the Francophonie. With a view to supporting linguistic and cultural sovereignty, as well as traceability, transparency and scientific integrity, the objective is to create shared specialist resources curated by language and disciplinary experts.

This opens new avenues for the discoverability of francophone scientific content across multiple use cases (indexing and classification, writing and translation, training, public dissemination of research findings) and lays the groundwork for a more equitable representation of French-language knowledge in the AI ecosystem.

Access the Corpus

The corpus is available now on Hugging Face: PleIAs/French-Science-Commons
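For readers who want to inspect a few records without downloading the whole corpus, the sketch below streams it with the Hugging Face `datasets` library. It assumes the library is installed, network access is available, and the repository exposes a split named `train`; the split name is an assumption, so confirm it on the dataset card.

```python
from itertools import islice
from typing import Any, Iterable, List

# Repository name as given in the announcement.
REPO_ID = "PleIAs/French-Science-Commons"


def open_corpus_stream(split: str = "train"):
    """Open the corpus lazily from the Hub.

    Requires `pip install datasets` and network access.
    The default split name is an assumption, not a documented fact.
    """
    # Imported here so that `take` below stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def take(records: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


# Example usage (not run here because it needs network access):
#     for record in take(open_corpus_stream(), 3):
#         print(sorted(record.keys()))
```

Streaming avoids materialising more than a million documents on disk just to look at the schema; printing the keys of the first few records is a quick way to discover the actual column names.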

Find out more about the corpus, its composition and its technical specifications on our dedicated page (in French).

CONTACT@PLEIAS.FR