OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now, a new paper from an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn't license to train its more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on a lot of data, such as books, movies, and TV shows, they learn patterns and novel ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" an image in a distinctive style, it is simply pulling from its vast store of knowledge to approximate. It isn't arriving at anything new.
While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train models as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, such as degraded model performance.
The new paper, from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, concludes that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
In ChatGPT, GPT-4o is the default model. The paper says that O'Reilly does not have a licensing agreement with OpenAI.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content (…) compared to OpenAI's earlier model GPT-3.5 Turbo," the paper's co-authors wrote. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from AI-generated, paraphrased versions of the same text. If it can, this suggests that the model may have prior knowledge of the text from its training data.
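As a rough illustration of that idea (not the authors' actual code; the quiz construction and helper names here are a simplified assumption), a membership test of this kind can be framed as a multiple-choice quiz: shuffle the verbatim excerpt in with several paraphrases, ask the model to pick the original, and check whether its hit rate beats random chance.

```python
import random

def build_quiz(original: str, paraphrases: list[str], rng: random.Random):
    """Shuffle the verbatim excerpt in with its paraphrases and
    record which option index holds the original."""
    options = [original] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(original)

def guess_rate(picks: list[int], answers: list[int]) -> float:
    """Fraction of quizzes where the verbatim text was picked."""
    correct = sum(p == a for p, a in zip(picks, answers))
    return correct / len(answers)

# Toy run: a "model" guessing uniformly among 4 options should land
# near the 25% chance baseline; a model that memorized the excerpts
# during training would score well above it.
rng = random.Random(0)
answers, picks = [], []
for _ in range(1000):
    _, answer = build_quiz("verbatim text", ["p1", "p2", "p3"], rng)
    answers.append(answer)
    picks.append(rng.randrange(4))  # stand-in for a real model's choice
print(guess_rate(picks, answers))
```

A hit rate significantly above 1/(number of options) is the signal the method looks for, since paraphrases preserve the meaning and only the exact wording could have been memorized.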
The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblatt, say they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
According to the paper's results, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That held even after accounting for potential confounding factors, the authors said, such as improvements in newer models' ability to tell whether a text was human-authored.
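One common way to summarize this kind of separability, assuming a setup like the one the paper describes (this sketch is illustrative, not the authors' implementation, and the toy scores are invented), is an AUROC-style statistic: the probability that a randomly chosen excerpt from a likely training-set book scores higher than one from a book the model could not have seen.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Probability that a random 'member' excerpt outscores a random
    'non-member' excerpt; ties count as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical per-excerpt recognition scores: books published before
# a model's training cutoff vs. books published after it.
pre_cutoff = [0.9, 0.8, 0.7, 0.65]
post_cutoff = [0.6, 0.5, 0.4, 0.3]
print(auroc(pre_cutoff, post_cutoff))  # 1.0: perfect separation
```

A value near 0.5 means the model can't tell the two groups apart; values approaching 1.0 indicate the model recognizes pre-cutoff text far more reliably, which is the pattern the paper attributes to GPT-4o on paywalled O'Reilly content.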
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published before its training cutoff date," the co-authors wrote.
It isn't a smoking gun, as the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might have collected the paywalled book excerpts from users who copied and pasted them into ChatGPT.
Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible these models weren't trained on the paywalled O'Reilly book data, or were trained on a smaller amount of it than GPT-4o.
That said, it's no secret that OpenAI, which has advocated for looser restrictions on developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. It's part of a broader industry trend: AI companies recruit experts in fields such as science and physics to effectively feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they'd prefer the company not use for training purposes.
Still, as OpenAI battles several lawsuits over its training data practices and its treatment of copyright law in U.S. courts, the O'Reilly paper isn't a flattering look.
Openai did not respond to a request for comment.