
Apr 28, 2026
An Open Language Identification Model for Multilingual Corpus Curation
We release CommonLingua, a SOTA 2-million-parameter, byte-level language identification model covering 334 languages. Built for the first step of any serious multilingual data pipeline, and released as the opening component of our partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative.
CommonLingua achieves 82.9% equivalence accuracy on the CommonLID benchmark.
Why language identification is the bottleneck
Every multilingual corpus project begins with the same deceptively simple question: what language is this text written in? Getting that answer wrong at scale is the single most common way a training dataset gets quietly corrupted. Documents end up in the wrong bucket. Minority-language content is discarded as noise. Evaluation suites inherit the same upstream errors and end up measuring the wrong thing.
Language identification (LID) is the first step of the data pipeline, before filtering, deduplication, augmentation, or training. When LID is reliable, everything downstream becomes easier. When it is not - which is the current state of affairs for a large portion of the world's written languages - the rest of the pipeline is quietly degraded.
This matters most acutely for the languages that have historically been least well served by the field. It is why CommonLingua is being released as the opening component of our partnership with the GSMA's AI Language Models in Africa, by Africa, for Africa initiative: African languages are where the existing toolchain breaks down most severely, and where a correct LID layer unlocks the most downstream work.
But the problem generalises. A corpus curator working on South and Southeast Asian scripts or low-resource European languages runs into the same failure modes.
CommonLingua is built for that entire class of data problems.
What CommonLingua is
CommonLingua is a 2-million-parameter language identification model. It classifies text into 334 language labels and runs at roughly 20 texts per second on a CPU and 3,000 per second on a single GPU. It is trained exclusively on open-licensed and public domain content from Common Corpus - Wikipedia, VOA Africa, WaxalNLP ASR transcripts, Cultural Heritage collections, and Pralekha - and released under Apache 2.0 with no restrictions on commercial or governmental use.
The model is designed around the four requirements that corpus curation work demands in practice:
Accuracy across a long tail of languages, not just the top forty.
Script-agnostic behaviour, so Latin, Arabic, Ethiopic, N'Ko, and Tifinagh scripts receive equal treatment.
Small enough to run anywhere, including on the infrastructure that research groups and public-sector institutions already own.
Fully open, so the training corpus can be audited and the model can be extended.
Architecture
CommonLingua operates directly on raw UTF-8 bytes. There is no tokeniser, and therefore no vocabulary bias toward scripts that happen to be well represented in a pretrained tokeniser. The pipeline is:
Raw UTF-8 bytes → 3× Causal Conv1D (k=15) → Attention + RoPE → SwiGLU FFN → 334 logits
Model dimension is 256. Maximum input is 512 bytes. Total parameter count is approximately 2 million. The convolutional front-end captures local byte patterns - script, orthographic regularities, character n-gram signatures - and the attention block with rotary positional embeddings handles longer-range dependencies. The output head produces a distribution over 334 language labels.
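To make the pipeline concrete, here is a minimal PyTorch sketch of the stack described above. Only the published figures come from this post (model dimension 256, three causal k=15 convolutions, attention with RoPE, a SwiGLU FFN, 512-byte inputs, 334 logits); the head count, the depthwise convolution grouping, the activation placement, the FFN hidden size, and the mean pooling are our assumptions, and the real implementation will differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, MAX_BYTES, N_LANGS = 256, 512, 334   # figures stated in the post

class CausalConv1d(nn.Module):
    """Causal Conv1D (k=15): each position only sees past bytes.
    Depthwise grouping + pointwise mix is our assumption, chosen to
    stay near the stated ~2M-parameter budget."""
    def __init__(self, dim: int, k: int = 15):
        super().__init__()
        self.pad = k - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size=k, groups=dim)
        self.mix = nn.Linear(dim, dim)
    def forward(self, x):                # x: (B, T, D)
        y = self.conv(F.pad(x.transpose(1, 2), (self.pad, 0))).transpose(1, 2)
        return x + self.mix(F.silu(y))   # residual; activation is an assumption

def rope(x):
    """Rotary positional embedding, rotate-half form. x: (B, H, T, Dh)."""
    B, H, T, Dh = x.shape
    half = Dh // 2
    inv_freq = 10000.0 ** (-torch.arange(half, device=x.device) / half)
    theta = torch.arange(T, device=x.device)[:, None] * inv_freq  # (T, half)
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class Attention(nn.Module):
    def __init__(self, dim: int = D, n_heads: int = 4):  # head count: assumption
        super().__init__()
        self.h, self.dh = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(rope(q), rope(k), v)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

class SwiGLU(nn.Module):
    def __init__(self, dim: int = D, hidden: int = 512):  # expansion: assumption
        super().__init__()
        self.w1, self.w2 = nn.Linear(dim, hidden), nn.Linear(dim, hidden)
        self.w3 = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class CommonLinguaSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(256, D)               # one row per byte value
        self.convs = nn.Sequential(*[CausalConv1d(D) for _ in range(3)])
        self.attn, self.ffn = Attention(), SwiGLU()
        self.head = nn.Linear(D, N_LANGS)
    def forward(self, byte_ids):                        # (B, T), T <= MAX_BYTES
        x = self.convs(self.embed(byte_ids))
        x = x + self.attn(x)
        x = x + self.ffn(x)
        return self.head(x.mean(dim=1))                 # pooled -> 334 logits

text = "Habari ya asubuhi"
ids = torch.tensor([list(text.encode("utf-8"))[:MAX_BYTES]])
logits = CommonLinguaSketch()(ids)                      # shape (1, 334)
```

Note how little preprocessing the byte-level front-end requires: `text.encode("utf-8")` is the entire input pipeline, and it is identical for every script.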
The design is deliberate. A byte-level front-end means the model does not need separate preprocessing per script. A small model means it can be deployed by anyone. A tokeniser-free pipeline means that adding new languages does not require re-training a vocabulary.
Benchmark results
The CommonLID benchmark (Ortiz Suarez et al., 2026) is the first large-scale LID evaluation across linguistically diverse web text. Every model is evaluated under identical conditions: same preprocessing, same ISO 639-3 normalisation, same equivalence-class collapsing.
| Model | Params | Labels | Strict Acc. | Equiv. Acc. | Macro F1 |
|---|---|---|---|---|---|
| OpenLID v2 | ~600M | 200 | 55.77% | 70.19% | 0.6390 |
| fastText-218 (NLLB) | ~600M | 218 | 59.53% | 71.64% | 0.6590 |
| GlotLID v3 | ~600M | 2,102 | 57.69% | 71.26% | 0.6729 |
| CommonLingua | 2M | 334 | 77.63% | 82.92% | 0.7879 |
CommonLingua improves on the next best model by more than ten points of equivalence accuracy with roughly one three-hundredth of the parameter count. Validation accuracy on held-out data is 97%. The gap is consistent across language families but widens on lower-resource languages, which is where LID has historically been weakest and where accurate classification matters most for downstream curation.
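The distinction between strict and equivalence accuracy follows from the equivalence-class collapsing mentioned above: a prediction that differs from the gold label only within a class of mutually acceptable labels still counts as correct. A minimal sketch, with hypothetical ISO 639-3 macrolanguage pairs standing in for the benchmark's actual classes:

```python
# Hypothetical equivalence classes for illustration; the real ones are
# defined by the CommonLID benchmark, not by this sketch.
EQUIV = {"nob": "nor", "nno": "nor",   # Bokmål / Nynorsk -> Norwegian
         "arb": "ara"}                 # Standard Arabic -> Arabic macrolanguage

def collapse(label: str) -> str:
    return EQUIV.get(label, label)

def lid_accuracies(preds: list[str], golds: list[str]) -> tuple[float, float]:
    """Strict accuracy requires exact label matches; equivalence accuracy
    counts a prediction as correct if it collapses to the same class."""
    n = len(golds)
    strict = sum(p == g for p, g in zip(preds, golds)) / n
    equiv = sum(collapse(p) == collapse(g) for p, g in zip(preds, golds)) / n
    return strict, equiv

# Predicting "nno" where the gold label is "nob" is wrong under strict
# accuracy but correct under equivalence accuracy.
print(lid_accuracies(["nno", "arb"], ["nob", "ara"]))  # (0.0, 1.0)
```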
Language coverage
CommonLingua covers 334 languages across every writing system in widespread contemporary use. The language inventory was built to match the practical coverage needs of open multilingual corpus work, rather than to maximise the headline label count.
Within that inventory, 61 African languages are supported across eight groupings - Bantu (21), Niger-Congo / West Africa (18), Semitic and Afro-Asiatic (7), Cushitic and Chadic (4), Berber (3), Nilo-Saharan (3), Pidgins and Creoles (3), and two additional languages spoken in Africa that fall outside these families (Malagasy, Afrikaans). Several of these languages have had no meaningful support in production LID systems until now. The full list is available in the model card.
Open data, open weights, open benchmark
CommonLingua is trained exclusively on open-licensed and public domain content aggregated through the Common Corpus project. We are releasing the training dataset alongside the model so that the community can reproduce the results, extend the language inventory, and use the same methodology for adjacent tasks - script detection, code-mixing identification, dialect classification.
CommonLingua is a model that sits at the front of the corpus curation pipeline. On its own it does one thing well. Combined with the rest of the Common Corpus stack - deduplication, quality filtering, synthetic augmentation - it enables us, and anyone else, to produce training corpora that include the long tail of the world's languages at a quality level that is competitive with English-only pipelines.
Our partnership with the GSMA's AI Language Models in Africa, by Africa, for Africa initiative is where CommonLingua first goes to work at scale. The initiative's agenda spans data, compute, and talent; the data workstream depends on being able to partition text by language correctly, and CommonLingua is the first brick in that foundation. The next releases will extend coverage, improve performance on short and code-mixed inputs, and integrate with the rest of the end-to-end curation stack.
Try it
CommonLingua is available today on Hugging Face: https://huggingface.co/PleIAs/CommonLingua (Apache 2.0).
The training dataset is released alongside the model: https://huggingface.co/datasets/PleIAs/CommonLingua-Train
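A minimal usage sketch, assuming the repository is loadable through the standard transformers text-classification pipeline; the model card is the authoritative source for the actual loading code, and the example texts and labels here are only illustrative.

```python
from transformers import pipeline

# Assumes PleIAs/CommonLingua exposes a transformers-compatible
# text-classification head; consult the model card if it does not.
lid = pipeline("text-classification", model="PleIAs/CommonLingua")

texts = [
    "Habari ya asubuhi, karibu sana.",   # Swahili
    "Bonjou, kijan ou ye jodi a?",       # Haitian Creole
]
for text, pred in zip(texts, lid(texts)):
    print(f"{pred['label']:>8}  {pred['score']:.3f}  {text}")
```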


