Harvard University is launching a massive free dataset for AI training with funding from OpenAI and Microsoft




Harvard University announced Thursday that it will release a high-quality dataset of nearly 1 million public domain books that anyone can use to train large language models and other artificial intelligence tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

About five times the size of the notorious Books3 dataset that was used to train AI models such as Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades and languages, with classics by Shakespeare, Charles Dickens and Dante included alongside obscure Czech mathematics books and Welsh pocket dictionaries. The project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the kind of highly refined and curated content repositories that typically only established tech giants have the resources to assemble, says Greg Lippert, executive director of the Institutional Data Initiative. “It underwent rigorous review,” he says.

Lippert believes the new public domain database could be used alongside other licensed materials to build AI models. “I think it’s a bit like the way Linux has become the primary operating system for a large part of the world,” he says, noting that companies will still need to use additional training data to differentiate their models from those of their competitors.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, confirmed that the company’s support for the project was in line with its broader beliefs about the value of creating “accessible data sets” for use by AI startups that are “managed in the public interest.” In other words, Microsoft does not necessarily plan to replace all the AI training data it has used in its own models with public domain alternatives like the books in the new Harvard database. “We use publicly available data for the purposes of training our models,” Davis says.

As dozens of lawsuits over the use of copyrighted data to train artificial intelligence wind their way through the courts, the future of how AI tools are built is at stake. If AI companies win their cases, they will be able to keep scraping the internet without having to enter into licensing agreements with copyright holders. But if they lose, AI companies may have to overhaul how they make their models. A wave of projects, such as the Harvard database, is moving forward on the assumption that, no matter what happens, there will be an appetite for public domain datasets.

In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers now in the public domain, and says it is open to forming similar collaborations in the future. Exactly how the book dataset will be released has not been settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant has not publicly agreed to host it yet, though Harvard says it is optimistic it will. (Google did not respond to WIRED’s requests for comment.)


