The cost of losing government webpages and public data

404: Page Not Found. That error message has become a more common sight on government websites. Many — reportedly thousands — of federal government webpages were recently taken down, ranging from Census Bureau research on depression among LGBT adults to Food and Drug Administration guidance for making clinical trials more diverse.
These erasures come after President Donald Trump signed executive orders cracking down on diversity, equity and inclusion initiatives and what he calls gender ideology.
Marketplace’s Stephanie Hughes spoke with Jack Cushman, director of the Harvard Library Innovation Lab and a contributor to the End of Term archive project, which works to preserve government sites before a new administration takes over. They discussed his recent work archiving those sites and data sets and what’s lost when these digital artifacts are not properly archived.
The following is an edited transcript of their conversation.
Jack Cushman: I think one way to put it is that a great deal has been archived, and especially [by] the End of Term archive. Another way to put it is that a great deal has not been captured. We’re talking about a government of 2 million people who generate data in the course of doing their work, and whatever their work is, they’re going to report out to us as their ultimate employers. Here’s what I saw, here’s what I recorded, here’s how you can get access to a copy of it. So we’re talking about an infrastructure that is much larger than there’s any ability for the external community to make a copy of, so if we, in a serious way, if we shut off access to all of the data that the government has created for us, that we paid for, then we’re going to end up not having a copy of most of it at the end of the day. This isn’t something that we can fix from the outside, but it is something where we can get the copies of the things that are most important or most often used and help our patrons that way.
Stephanie Hughes: A federal judge recently ordered that at least some of these government websites be restored. Do you have a sense of how easy or how hard it is to do that?
Cushman: Sure. So I think it could be very easy to restore a website, or it could be very hard, depending how it was taken down in the first place. If you imagine that a website is a car driving along the road, and then you make the car stop, it could be, well, you just pulled over and you left the engine idling, and you just need to, like, hit the gas again. It could be that you turn off the keys. It could be that you threw the keys out the window. It could be that you removed the engine. And it’s very hard to tell in a given instance which one it is. I do think that if you shut down big running software projects and you let the people go who know how they work, and you let the running copies go and you’re trying to restore from backups, you can easily end up in a situation where it is expensive or prohibitive to get something running again.
Hughes: This taking down of web pages, it’s happening not just with the government. It’s happening with corporate websites. It’s happening with prestigious universities. What’s the cost of losing this information?
Cushman: So there’s a focus on government data right now, but the same issue applies to any other kind of information that you care about. So if there are videos that you remember seeing and caring about on YouTube, there’s probably only one account that controls those, and either a person or a Google policy could make that video vanish. If you have stories that you’re proud of publishing, especially if you’re publishing behind a paywall, there’s probably only one copy that most people access, and a change in policy or an accident or mistake could change the content of that forever, in a way that you would have trouble recovering from. The reason this happens is that when we’re working with digital materials, the incentive is to invest everything in the one best copy. So you get the one video-sharing website that everyone likes to use, and you invest more in that instead of having more copies. And when we get down to one copy of everything, we’re vulnerable to policy change, but we’re also vulnerable to accident and we’re vulnerable to cyberattacks, or we’re just vulnerable to losing our cultural memory. I think the impact of having a vulnerable cultural memory is that anything that we care about is harder to plan, and as we get on thinner and thinner ice, as we have fewer and fewer copies of each piece of data, there’s more and more vulnerability to just forgetting where we came from or important information that we had, and therefore making mistakes about what we do next.
Hughes: You said, economically, it makes sense to not have multiple copies of digital information. Can you tell me a little bit more about that? Like, it just actually costs money to have multiple copies of things on the internet?
Cushman: Sure. Imagine that you were a nonprofit funder, and you were trying to decide whether to give me money to make a new website that shares access to important public information. I could tell you, “I’m going to make a second copy of public information that everyone can already get from the government website directly.” Or I could tell you, “I’m going to make a copy of information that no one else has, that only we have at this library. That’s going to be the first time you can get it online.” Well, your funders are going to be a lot more likely to fund the story that is, I’m going to put information online for the first time. It’s also true at a business. If you’re asking for investment capital, if you say, “For the first time, people will be able to share their story,” that’s a lot more fundable than “I’m going to make a copy of YouTube,” even though YouTube already exists. No one wants to hear, “I’m going to make this for the fourth time.” They’re going to ask, “What’s better about yours than the one that’s currently the most popular?” And most of the time, the answer is, well, the one that’s most popular has the most money coming in, it has the most users, it has the most features. The current popular one is the best one, and we’re all going to go use that one instead.
So on the internet, you end up with this concentration where everyone is using the most popular version of the thing, and that’s fine as long as it works, and then if there’s any threat to the most popular thing, you don’t have that resiliency that you used to have. The other thing I like to compare this to is the issue of supply chains, and just-in-time supply chains, a phrase that a lot of us learned for the first time in the [COVID] pandemic, when all of a sudden you couldn’t find toilet paper, and it turned out that our supply chains had a lot less redundancy or resiliency built into them. I think all of our data infrastructure online has that same just-in-time quality, where we’ve paid only the minimum amount we have to to keep it working today and not the amount to survive any kind of shock or challenge.
Hughes: I mean, I’ve had the experience as a reporter where I’ve saved a link to data that I was referencing in a story and then I’d gone back to it and it wasn’t there, and I felt like part of my brain was missing. What advice do you have for people who stumble across some data or some information on the web that they want to reference and hold on to, how can they do that?
Cushman: Well, the No. 1 advice for any kind of data preservation is lots of copies keep stuff safe. And what that means for you is don’t trust remote things that you don’t control to be part of your memory. If you notice, oh, this is part of my memory, then say, I better figure out how to have a copy of this. And there are great tools to do that with. One of my favorites is something called ArchiveWeb.page by the Webrecorder project, and that’ll give you either a browser extension or an application you can download where you can visit parts of the web and click Record, and everything that you see will be recorded into a file, which you can then save and keep wherever you want. I think the next step is to start thinking about how you can group together to invest in things like your local library or the local resources that you use, how to not rely on that one point of failure for everything that you do.
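For readers comfortable with a little scripting, the idea of keeping your own copy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how the recording tool mentioned above works: the function names `fetch` and `save_copy` and the file layout are hypothetical, and the sketch saves only the raw bytes of a single URL, not the images, scripts and styles a full web recording would capture.

```python
import hashlib
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes at a URL.

    urllib raises urllib.error.HTTPError if the server answers 404.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_copy(url: str, content: bytes, dest_dir: str) -> Path:
    """Store content in dest_dir next to a small metadata record (hypothetical layout)."""
    folder = Path(dest_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # A checksum lets you confirm later that the copy has not changed.
    digest = hashlib.sha256(content).hexdigest()

    # Naming the file after its checksum means identical captures are stored once.
    saved = folder / f"{digest[:16]}.bin"
    saved.write_bytes(content)

    record = {
        "url": url,
        "sha256": digest,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "bytes": len(content),
    }
    saved.with_suffix(".json").write_text(json.dumps(record, indent=2))
    return saved
```

A call such as `save_copy(url, fetch(url), "my_archive")` would leave a copy on your own disk, along with a record of where it came from and when. Repeating that in a second location, such as an external drive, is what turns one copy into several.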
On a slightly lighter note, while working on this story, I wondered where the “404” error message came from, and I found a 2017 Wired story on its origins. It says that in the early days of the web, writing long messages took up valuable time and memory for coders, so a numerical designation was given to certain errors. 404 was assigned to “not found.”
I guess it’s more polite than the internet blowing you a giant raspberry, which is kinda what that 404 feels like when you come across it, followed by a sinking feeling in the pit of your stomach because the information you thought you could rely on isn’t all that reliable.