Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, says AI crawlers have been scraping its sites and collecting content and data, straining the organization's internet infrastructure.
When President Jimmy Carter died late last year, the foundation that runs Wikipedia noticed something unusual: the flood of interest in the late president created a content bottleneck, slowing load times for about an hour.
Wikipedia is built to handle spikes in traffic like this, according to the Wikimedia Foundation, but it's also dealing with a surge of bots that scrape the site to train AI models, clogging up its servers in the process, the organization's chief product and technology officer, Selena Deckelmann, told Marketplace's Meghan McCarty Carino. The following is an edited transcript of their conversation:
Selena Deckelmann: This is traffic that doesn't necessarily follow what other people are looking at. We have a very sophisticated caching system, so that when someone passes away, or there's some important event happening in the world that many people want to learn about, we cache that result. But what the crawlers do is they just look at everything. They're not as interested in what other human beings are interested in at that moment. And so what that does is it causes all of the systems to load a lot more data than they normally would.
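The dynamic Deckelmann describes can be sketched with a toy cache: traffic driven by human interest keeps hitting the same few hot entries, while a crawler that walks every page misses the cache on nearly every request and pushes the work back onto the origin servers. The sketch below is a minimal illustration in Python; the cache size, page titles and fetch function are assumptions made for the example, not Wikimedia's actual caching layer.

```python
# Minimal sketch of why exhaustive crawling defeats a cache built for
# popularity-driven traffic. Page names, sizes and capacity are illustrative
# assumptions; this is not Wikimedia's caching code.
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: popular pages stay hot, rare ones evict them."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_page(self, title, fetch_from_origin):
        if title in self.store:
            self.store.move_to_end(title)       # refresh recency on a hit
            self.hits += 1
            return self.store[title]
        self.misses += 1                         # origin servers do real work here
        page = fetch_from_origin(title)
        self.store[title] = page
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)       # evict the least recently used entry
        return page

def fetch_from_origin(title):
    return f"<html>{title}</html>"               # stand-in for an expensive page render

cache = LRUCache(capacity=100)

# Human traffic: a few popular articles requested over and over -> mostly hits.
for _ in range(10_000):
    cache.get_page("Jimmy_Carter", fetch_from_origin)

# Crawler traffic: every article requested exactly once -> almost every request misses.
for i in range(10_000):
    cache.get_page(f"Article_{i}", fetch_from_origin)

print(f"hits={cache.hits} misses={cache.misses}")  # misses are dominated by the crawl
```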
Meghan McCarty Carino: Web crawlers have been a part of the internet since the web came online. That's how we find web pages when we search Google. But how has AI really changed the volume and intensity of traffic from bots?
Deckelmann: What we've seen is just a massive increase in interest in crawling the entire internet and creating a treasure trove of everything that's on it. What that's for is teaching large language models about what is on the internet and giving them the ability to answer questions that people might have. If you've interacted with ChatGPT or some other kind of chatbot, you ask it a question, and its ability to respond is based on its exposure to all this training data. Over time, as these models have become more popular, they've been deployed at all of the major websites that someone might encounter. Those bots need training data. They need to be taught about the world through this collection of data. That's largely what we think is driving it. And as part of that, what we've noticed is that the data that comes from Wikipedia, and from the other projects that we support, is even more valuable to train on because it is generated by human beings, and it is very good at answering the kinds of questions that human beings ask.
McCarty Carino: What are the implications of all this for the Foundation's infrastructure?
Deckelmann: The most important thing for us right now is to communicate with the people who are operating scrapers and to ask them to collaborate with us. We actually believe that the data that has been collected by all of these incredible volunteers should be part of the global information ecosystem, and that training on it is within the licenses that we have. They're called Creative Commons licenses; it's openly licensed content. But our request is that the companies relying on this information make every effort to support its continued existence, which means supporting these editors, and also following a few other aspects of these licenses, such as including some kind of attribution. We think that responsible product design choices, like properly attributing Wikimedia content and other openly licensed content, will help with sharing back and ensure that other people consider participating in these Commons projects. We also ask that companies think about paying to support the future of Wikipedia. Commercial companies can use something like Wikimedia Enterprise, which is our paid product that enables them to reuse the content and supports the infrastructure in more effective ways.
McCarty Carino: Yeah, because ultimately, what does this increased strain mean for the usability of Wikimedia sites?
Deckelmann: I think the main effect for us, when we exceed our capacity, is that it impacts human access to knowledge. People rely on these information sources every day. So our job is to find ways to make sure people still have access to them, even as commercial companies and researchers access this data, and to find ways for us to coexist. One of those ways is learning about more responsible ways of accessing the data, other than just massively scraping the sites.
McCarty Carino: As you noted, Wikipedia has this really unique model: a lot of the content is volunteer-generated or volunteer-edited, and then you have these web crawlers, often in service of for-profit companies training their AI, which also comes with some ethical concerns of its own. Are there tensions there?
Deckelmann: Like I said, I believe that both the commercial and noncommercial uses of this support our mission, which is to distribute free knowledge in perpetuity, worldwide. We think that the internet is a place for exploration, for connecting with other people and sharing knowledge, and it's also a place for commerce. So the license was designed from the beginning to support those use cases. And I can't deny that there is tension there. But as we support whatever evolution is coming for the internet with all of these changes from AI, I think where we can best collaborate is in being clear that our content is free, but our infrastructure is not. These large-scale commercial reusers really need to recognize that the value of their products depends on this human-generated knowledge, which in turn supports a wider information ecosystem, a system that ultimately we think can be used for bettering humanity and for helping people know more. And that fits, I think, well within our mission.
McCarty Carino: So what does the Foundation need in order to be able to scale up to meet this demand?
Deckelmann: The primary thing that we need right now is for the folks who are writing scrapers to think a little bit about how they're doing that, to use our best practices, and to communicate with us and identify themselves. Sometimes when a scraper goes haywire, it might just be a mistake, so giving us a way of contacting them is really important. And then, like I said, finding ways of supporting the future of Wikipedia, through attribution and through working with us on Wikimedia Enterprise, those are the best ways right now.
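In practice, "use our best practices and identify yourselves" maps to things like sending a descriptive User-Agent with contact details, honoring robots.txt and pacing requests. Here is a minimal sketch of that pattern in Python; the bot name, contact address and one-second delay are illustrative assumptions, not Wikimedia's specific requirements.

```python
# A minimal sketch of a scraper that identifies its operator and follows
# common crawling etiquette. The User-Agent string, contact address and
# crawl delay are assumptions for the example.
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleResearchBot/1.0 (https://example.org/bot; bot-admin@example.org)"

# Load the site's crawl rules once, up front.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://en.wikipedia.org/robots.txt")
robots.read()

def polite_fetch(url, delay_seconds=1.0):
    """Fetch a page only if robots.txt allows it, identifying the operator."""
    if not robots.can_fetch(USER_AGENT, url):
        return None                              # respect the site's crawl rules
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        body = response.read()
    time.sleep(delay_seconds)                    # pace requests instead of hammering
    return body

page = polite_fetch("https://en.wikipedia.org/wiki/Jimmy_Carter")
print("fetched" if page else "disallowed by robots.txt")
```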