A new study suggests OpenAI models memorized copyrighted content.



A new study appears to lend credence to allegations that OpenAI has trained at least some of its AI models on copyrighted content.

OpenAI is embroiled in lawsuits brought by authors, programmers, and other rights holders who accuse the company of using their works (books, codebases, and so on) to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there is no carve-out in U.S. copyright law for training data.

The study, which was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models served behind an API, such as OpenAI's.

Models are prediction engines. Trained on a lot of data, they learn patterns, which is how they are able to generate essays, images, and more. Most outputs are not verbatim copies of the training data, but because of the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from films they were trained on, while language models have been observed effectively plagiarizing news articles.

The study's method relies on words the co-authors call "high-surprisal," that is, words that stand out as uncommon in the context of a larger body of text. For example, the word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it is statistically less likely than words such as "engine" or "radio" to appear before "humming."
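To make the idea concrete, here is a minimal sketch of how a word's "surprisal" could be scored with an off-the-shelf causal language model. This is not the paper's actual code; the reference model (gpt2) and the scoring details are illustrative assumptions.

```python
# Sketch: score -log p(word | context) with a small causal LM (assumed setup).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisal(context: str, word: str) -> float:
    """Return the surprisal (in nats) of `word` given `context`, summed over its tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = ctx_ids.shape[1] - 1
    targets = input_ids[0, ctx_ids.shape[1]:]
    token_lps = log_probs[start:start + len(targets)].gather(1, targets.unsqueeze(1))
    return -token_lps.sum().item()

# A high-surprisal word should score as more surprising than a mundane alternative.
print(word_surprisal("Jack and I sat perfectly still with the", "radar"))
print(word_surprisal("Jack and I sat perfectly still with the", "engine"))
```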

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If the models guessed correctly, they likely memorized the snippet during training, the co-authors concluded.
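A rough sketch of that probe is shown below. The prompt wording and exact-match check are assumptions for illustration, not the study's actual protocol.

```python
# Sketch: hide a high-surprisal word and ask a model to fill it in (assumed prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def guess_masked_word(snippet_with_blank: str, model: str = "gpt-4") -> str:
    """Ask the model to supply the single word hidden behind [MASK]."""
    prompt = (
        "The following passage has one word replaced with [MASK]. "
        "Reply with only the missing word.\n\n" + snippet_with_blank
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

snippet = "Jack and I sat perfectly still with the [MASK] humming"
guess = guess_masked_word(snippet)
# If guesses match the removed high-surprisal words far more often than chance
# across many snippets, that is read as a sign of memorization.
print(guess, guess.lower() == "radar")
```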

An example of a model "guessing" a high-surprisal word. Image Credits: OpenAI

According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted ebook samples called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models may have been trained on.

"In order to have large language models that are trustworthy, we need to have models that we can probe, audit, and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."

OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has some content licensing deals in place and offers opt-out mechanisms that let copyright owners flag content they would prefer not be used for training, it has lobbied several governments to codify "fair use" rules around AI training.


