
It Turns Out You Can Train AI Models Without Copyrighted Material
A recent study has revealed a surprising finding in the world of artificial intelligence (AI): it is possible to train AI models without using copyrighted material. This discovery could have significant implications for the industry, as many companies claim that their tools cannot exist without training on copyrighted content.
The researchers behind this project collaborated with 14 institutions, including universities and nonprofits, to create an unprecedentedly large dataset of public domain and openly licensed materials. This 8 TB dataset was used to train a seven-billion-parameter large language model (LLM).
According to the study, the resulting LLM performed similarly to Meta’s Llama 2-7B model from 2023, which is a significant achievement given the limitations of the dataset. However, the team did not release benchmarks comparing their results to those of current top models.
While the performance of this less powerful LLM may not be comparable to today’s industry standards, it does demonstrate that it is possible to create AI models without relying on copyrighted material. This finding directly contradicts claims made by OpenAI and other companies, which have stated that training AI models without copyrighted content would be “impossible.”
The study’s authors acknowledge that such a model is not only less powerful but also significantly more difficult to create, because preparing the dataset required manual annotation. The researchers had to sift through large amounts of data by hand, often determining the licensing terms of each individual piece of content.
It will be interesting to see how AI companies respond to this finding and adapt their claims.
Source: www.engadget.com