The challenges of archiving the internet
May 25, 2023

Cultural and historical content on the internet can be preserved, but technical barriers and getting people to care can get in the way.

The internet is where so much of what happens in our world gets archived. But where does the internet get archived?

There are projects around the world, like the Internet Archive, to try to preserve some content online.

Marketplace’s Meghan McCarty Carino spoke with Kayla Harris, a professor and director of the Marian Library at the University of Dayton, about whether the current archiving work is enough.

The following is an edited transcript of their conversation.

Kayla Harris: I would say no, because it’s a mix of both technical challenges to archiving this massive quantity, but also, I think, it’s about the human side of things and getting people to care about why we would want to preserve that stuff in the first place. And I think there’s this very widespread misconception that like, well, if it’s on the internet, it’s there forever. And so there’s not the understanding that no, it’s not necessarily there forever and someone or something has to save it if you want it to be there forever.

Meghan McCarty Carino: Can you think of any glaring examples of something not being there forever on the internet that you kind of wish were still there?

Harris: I think part of it is, even though sometimes a website might still be there, websites often have dynamic content. So on, say, a news website that’s constantly changing with the latest headlines, the site itself might still be there, but maybe the kind of flash-in-the-pan news story isn’t. And, you know, especially during COVID for example, a lot of institutions, a lot of organizations, you know, they’d put things up on their website, like, “now closed,” “we’re closed indefinitely.” And then when things would open up, they would update that, right, because you want people to have on your website the most up-to-date information. But if the page wasn’t archived when it said something else, then it’s gone. And that part’s a little bit harder, I think, for people to understand. Web pages and websites are so dynamic, the whole thing might still be there, but not the individual pieces.

McCarty Carino: I’ve actually thought about that a lot. When I think about kind of, you know, historically documenting the pandemic and watching, you know, Ken Burns documentaries, where there’s so much written material about these different passages of history, and so much of our pandemic documentation is just digital stuff that might not always be around.

Harris: Yes, I mean, archivists, cultural heritage professionals, especially during the pandemic, a lot of them were making comparisons to the 1918 pandemic and the sorts of materials and records and even personal accounts that we had then. But how is that stuff being communicated today? And if it’s online, we have to actively save it or else it will not be there for the people of the future to be able to compare to the 2020 pandemic.

McCarty Carino: What are some of the technical barriers that you mentioned, to archiving internet content?

Harris: One just kind of simple one is websites are meant to be dynamic. Unlike archiving or collecting other material (cultural heritage material, whether that’s books, artifacts, etc., which are stable and have a kind of “fixity” to them), websites are constantly changing. That can be the content on the homepage, it could be the design style. You know, early websites were using really great, flashy HTML GIFs and that sort of thing. And then we update and now we make our websites more accessible and adaptive. But also things like the URL. Part of the trickiness is there’s not really a clear consensus of what constitutes a website. Is it its content? Or is it the URL, the domain that it lives at? And on the Internet Archive’s Wayback Machine, which people can use to navigate and see these previous iterations of websites that are captured, it’s by domain. So sometimes those change. So something like, you know, might have always been CNN. But if there was another URL that used to be owned by someone else, then that content is going to be there, and it’s harder to trace that history.
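
As a rough illustration of the URL-keyed lookup Harris describes, the sketch below builds a query for the Wayback Machine's public availability endpoint and the address of a dated capture. The function names are hypothetical, and the endpoint and response shape reflect the publicly documented availability API as best understood here, not anything stated in the interview. No network request is made; the code only constructs URLs and reads a response that has already been parsed from JSON.

```python
from urllib.parse import urlencode

# Assumed public endpoints of the Internet Archive's Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SNAPSHOT_PREFIX = "https://web.archive.org/web"


def availability_query(page_url: str, timestamp: str = "") -> str:
    """Build a query asking for the capture of page_url closest to timestamp.

    timestamp is a string of digits such as "20200401" (YYYYMMDD).
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def snapshot_url(page_url: str, timestamp: str) -> str:
    """Build the address of a capture; the lookup key is the URL, not the content."""
    return f"{SNAPSHOT_PREFIX}/{timestamp}/{page_url}"


def closest_capture(response: dict):
    """Return the closest capture's address from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

Because the key is the address, a page that moved to a new domain, or a domain that changed owners, shows up as separate or mixed histories, which is the tracing problem described above.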

McCarty Carino: I can also see sort of a challenge of, you know, whose domain is this? Because one of the things that makes the internet what it is, is it’s just sort of this open-ended network that no one is in charge of. So who’s in charge of archiving it?

Harris: Exactly. And that’s, I think, where again that human side comes in, and unfortunately, or maybe fortunately, the human side also presents biases. So just like in a physical archive, there is an archivist or many archivists who select which materials they believe to be historically valuable, that they preserve cultural heritage in some way. And so there’s going to be bias inherent in that, because what I think is important for future generations may not be the exact same thing that someone else thinks is important. There’s not really the ability to just archive everything, so what is being archived is often selective. And that could be selected by a person, by groups of individuals. But there’s going to be some biases introduced about whose cultural heritage is worth saving on the internet.

McCarty Carino: What kinds of ethical concerns does this all raise?

Harris: I think because of the way we think of the web as being this dynamic, changing thing, not everyone who creates web content (whether that’s a website or social media, for example) expects it to be permanent. And so, you know, it gets into some ethically dicey situations when thinking about things that people intend to post online as sort of ephemeral, and then someone chooses to archive it without permission. That’s a whole other thing of, you know, finding the creator, asking for permission, etc. Is it really right for that person to save someone else’s content or that organization to save another community’s content? I think this comes out in some protest movements. Sometimes archivists, and others included, get caught up in this idea of, “Well, we have to document, we have to preserve.” But for things like protests or rallies, people who are there physically in person aren’t necessarily counting on their picture being taken and put online and then archived forever.

McCarty Carino: What sort of content are you most worried about losing forever?

Harris: There’s a collective right now called Saving Ukrainian Cultural Heritage Online, or SUCHO. And when we think of cultural heritage destruction, sometimes it’s easy to kind of understand like, well, if someone is bombing another country and they’re destroying these World Heritage Sites, that’s tangible and easy to see. But what also happens when websites are taken offline, along with the things online that make up their cultural heritage? Another one that also kind of stands out to me is local journalism. And there was a study done by the Tow Center for Digital Journalism at Columbia University a few years ago, and they called it “The Dire State of News Archiving in the Digital Age,” because they talked to these smaller news organizations about what they had in place for archiving their content. And most of them, a) didn’t know what that meant, or b) thought that if they had a Google Doc and they backed it up somewhere, that that was archiving. And so, I think that brings in a lot of concerns. Local news is really important to understand in a community. And again, it goes back to the human issue: People have to want to care to preserve that content.

McCarty Carino: So what do we lose if we lose that kind of content?

Harris: I think, you know, our cultural heritage, our humanity. I’m sure everyone’s written a social media post or a tweet or something that they’re, like, “Yeah, this doesn’t need to live on in perpetuity to document what it was like living in this time and age.” But there’s lots of other things on the internet that are worth saving. There’s a quote from someone, an academic, Megan Sapnar Ankerson. And she said, “It is far easier to find an example of a film from 1924 than a website from 1994.” And so that physical media, you know, the arts, the humanities, that’s all captured in this physical, in books and movies, in plays, operas and that sort of stuff. That cultural heritage that happens online now, if we don’t save it, what are the future generations going to look to?

Speaking of losing content, Google recently announced updates to its inactive accounts policy.

Basically, if an account has been inactive for two years or more, there’s a chance it could get deleted entirely.

That got people worried that the same would happen to inactive accounts on YouTube, which is owned by Google, meaning videos more than a decade old — some of which were defining content of YouTube’s early days — would also be deleted.

A Google representative later clarified that this policy wouldn’t apply to YouTube accounts, so for now, it looks like those early genre-defining videos like “Zombie Kid Likes Turtles” or “Keyboard Cat” won’t be lost to posterity just yet.

The team

Daisy Palacios Senior Producer
Daniel Shin Producer
Jesús Alvarado Associate Producer