What is open source AI?

Feb 13, 2025

The Paris AI Summit has in many ways been the summit of actually open AI. Common Corpus is one of the main initiatives officially endorsed at the end of the summit to create a healthy and ethical sharing ecosystem.

Despite these important clarifications, there are still uncertainties about what is really open in AI. Commercial players take advantage of these ambiguities to position themselves as "open source" leaders without committing meaningfully to open research.

Open source AI is now commonly differentiated from open weights models, where only the model parameters are released, albeit under an uncompromisingly open license. During the Summit, an Open Weight Definition was proposed by the Open Source Alliance. Initiatives like this carry much more weight (no pun intended) in the wake of DeepSeek: when one of the top frontier models is under an uncompromising MIT license, it instantly discredits the more restrictive schemes of lesser models, like the "open" Llama 3.2 being unavailable in Europe.

The official definition of "Open Source AI" from the Open Source Initiative requires fully open weights with sufficiently detailed information about the training data. This is definitely a marked improvement over the opacity of research information from major labs. Yet I see few examples of models that even barely fit the definition, which keeps me wondering whether it is that relevant. Commercial labs will simply not consider disclosing even broad details about their training data.

Fully open models with available training data, regardless of licensing. This is the most common solution for "open everything" models, including those from EleutherAI, Allen AI, HuggingFace and OpenLLMFrance, all trained with substantial amounts of data coming from copyrighted web archives. I personally consider any of these models to be fully open source, as they provide enough of the recipe to meet the four liberties of free software. Substantial legal uncertainties have impeded the development of open training datasets: EleutherAI has already had to drop several parts of The Pile. Not long ago, a Dutch continued pretrain of Llama had to be completely deleted for using data under license. Licensing uncertainty is a continuous sword of Damocles hanging over committed open science projects.

Fully open models with releasable training data, with accessible, documented training data that is either freely licensed or uncopyrighted. This is the ecosystem we have been trying to build with Common Corpus, in accordance with emerging global standards of traceability from the EU AI Code of Conduct and the Trusted Open Data Initiative under the AI Act. I would not claim these models are more open than any model trained on released data, just that they bring actual legal security to the entire process and meet the requirements of the EU text & data mining exception: lawful sources, releasable data and no opted-out content. On top of the Pleias models, multiple models are currently being trained on Common Corpus. In the US, EleutherAI is coordinating a similar initiative with the Common Pile.

Finally, some projects call for license continuation between the data and the model. Basically, if you train a model on Wikipedia, the model should stick to the same license. This approach goes beyond the requirements of existing case law in the US and the EU and makes it very hard to combine data from different sources. Even the best lawyer in the world would not know how to license a composite work with license continuation from CC-BY-SA, GFDL and the French "licence ouverte". In practice, license continuation is only feasible if you limit yourself to public domain works or happen to live in a country with a long-standing distrust of copyright (US federal public domain).

Overall, the most impactful action at this moment is to make a stable open ecosystem happen and to provide enough contextual information to let different actors decide according to their legal or market constraints. This is what we tried to build with Common Corpus 2.

We also hope to see more legal clarification about existing open science and open data programs, to expand this initial pool of training data commons beyond the currently available 2 trillion tokens. There are currently more than 100 million scientific publications in open access, but only a bit more than 10% are under clear licenses. The commons are not limited to what currently exists: they can still be built.