New York Times suit may test copyright law’s constraints on AI

Meghan McCarty Carino Dec 28, 2023

Most AI models are trained on data sets scraped from the internet. OpenAI trained its chatbot, ChatGPT, with data that included Times content, the lawsuit says. Sebastien Bozon/AFP via Getty Images

The New York Times has sued OpenAI and Microsoft for allegedly copying and regurgitating millions of articles via ChatGPT, a product the paper says is now acting as a competitor. It’s the latest in a slew of lawsuits from content creators and publishers — like comedian Sarah Silverman and “Game of Thrones” author George R.R. Martin — who say tech companies are violating copyright law by training AI models on their work without permission or compensation.

At the heart of these cases is the practice of “scraping”: ingesting vast amounts of data from the internet to train artificial intelligence models like ChatGPT.

The internet is crawling with bots — scanning website by website, downloading and indexing everything. Many of these web crawlers help retrieve information for search engines from that sea of data, but they increasingly feed generative AI models, which are hungry for human content to learn from.

“You know, GPT is an overeager college intern who’s read the entire internet,” said Cecilia Ziniti, an AI entrepreneur and former counsel for tech companies who has read the entire New York Times lawsuit.

We’ll save you the technicalities. Basically, the Times alleges that its content was heavily weighted in a data set — known as Common Crawl — that OpenAI has acknowledged using to train earlier versions of ChatGPT.

“Common Crawl was a nonprofit organization put together to basically make a copy of the internet,” Ziniti said.

Professor Daniel Gervais, the Milton R. Underwood chair in law at Vanderbilt University, said using that copy of the internet could fall under the doctrine of fair use, which allows for the reproduction of copyrighted works in some circumstances. But there are limits.

“One of the factors to consider is whether use is commercial,” Gervais said.

He said many AI companies start out as nonprofits when developing their technology, then go on to build for-profit products.

Meanwhile, an increasing number of websites have begun blocking web crawlers from copying their content, said Arvind Narayanan, a computer science professor at Princeton.

“That just means putting an instruction on your websites that every web crawler or robot is supposed to honor,” he said.

But it’s a little bit of a gentleman’s — or gentlebot’s — agreement.
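
For readers curious what that instruction looks like in practice, here is a minimal sketch. It assumes the instruction Narayanan describes is the familiar robots.txt file, and it uses Python's standard-library parser to show how a well-behaved crawler would check the rules before fetching a page. The crawler names and the example.com address are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/2023/12/28/some-article"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Nothing in this check is enforced by the website itself. The crawler runs it voluntarily, which is why the arrangement depends on good faith.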

“Other ways of blocking bots are kind of more adversarial. They would use technology to recognize the ways in which bots behave suddenly differently from regular people,” Narayanan said, noting that might include actions like flitting from page to page very quickly. Then sites might kick those users off.
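
As a rough illustration of that adversarial approach, the sketch below flags any visitor who requests more pages within a short window than a person plausibly would. The class name, the threshold of five requests and the one-second window are all assumptions made for this example, not a description of how any particular site works. Real systems weigh many more signals.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flags clients that request pages faster than a human plausibly would."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0):
        # Both limits are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def record(self, client_id: str, timestamp: float) -> bool:
        """Log one request; return True if the client now looks like a bot."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard requests that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    flagger = RateFlagger(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3):
        print("fast visitor at", t, "flagged:", flagger.record("fast", t))
    for t in (0.0, 5.0):
        print("slow visitor at", t, "flagged:", flagger.record("slow", t))
```

A site using a check like this might then block or challenge the flagged visitor, which is the "kick those users off" step described above.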

An internet less accessible to crawling and scraping, he added, could help ensure that content creators are fairly compensated for their work, but it could also impede other uses of crawling — like research.

Narayanan said he has used scraped data to track companies’ privacy policies over time and analyze whether they’re keeping their word.

Associate professor Kayla Harris, director of the Marian Library at the University of Dayton, said she and other archivists rely more and more on scraped data. “Web crawlers are used to archive websites as kind of a snapshot in time,” Harris said.

Libraries and archives typically would save things like letters and journals and diaries, she explained. When such primary source information is posted online, “we use web crawling to capture it and preserve history in the same way that we would have done with physical materials.”
