Revision — From the December 2018 issue

Preservation Acts

Toward an ethical archive of the web

After eighteen-year-old Michael Brown was shot and killed by a police officer in Ferguson, Missouri, Bergis Jules found himself worrying not only over the horrors of the present, but also over how little of the present was likely to be preserved for the future. The best reporting on the aftermath in Ferguson was being produced by activists on Twitter, a notoriously ephemeral medium. Jules, then an archivist at the University of California, Riverside, had the impulse to start saving tweets, but wasn’t sure how. “That whole weekend, watching things unfold, I thought, ‘This is a really amazing historical moment; we should think about capturing it,’ but I was just talking to myself,” he says. The following week, attending a Society of American Archivists conference in Washington, D.C., he voiced his fears en route to drinks at the hotel bar. He caught the ear of Ed Summers, a developer who just so happened to be the author of a Twitter archiving tool—and who promptly programmed it to vacuum up #Ferguson tweets. Within two weeks, he had amassed more than 13 million.

Illustration by Hanna Barczyk

Three weeks after the shooting, Summers blogged about the archive, which he and Jules were considering making public. Shortly thereafter, they received an inquiry from a data-mining company. When they pulled up the firm’s website, they read that its clients included the Department of Defense and, ominously, “the intelligence community.” What did the company want with the data? And what were the ethical implications of handing it over—perhaps indirectly to law enforcement—when the protesters’ tweets would otherwise evade collection? Using Twitter’s Application Programming Interface (API), the code that developers use to call up Twitter data, anyone can sift through tweets that were posted in the past week, but older posts disappear from the API’s search function, even if they still exist out on the web. The data-mining company was too late to nab a swath of the #Ferguson tweets. (Twitter has since unveiled a “premium” API that allows access to older data, for a substantial fee.) Newly mindful of the risks, Jules and Summers waited almost a year to publish their cache.

This moment marked a shift in how the two thought about archiving. It’s easy to argue that they had simply copied public data to another forum. But they began to wonder what it meant to take an ephemeral object—destined, after days and weeks, to sink to the bottom of an ever-shifting pile—and render it permanent. It wasn’t hard to see how an archive of civil disobedience could become a tool of government surveillance. Even in less perilous cases, they were perhaps deciding how the future would remember the author of a tweet, inscribing a legacy from what was meant to disappear.

Moreover, the posts they had collected were not only political, but personal. In a traditional archive, it would be standard practice to contact an author—especially with regard to material as sensitive as the Black Lives Matter movement—and ask whether they wanted their words preserved for future generations. But at this scale, such a consideration was impossible. How would one begin to contact and secure permission from so many tweeters? And even if a researcher could, how would they decide what was most important, what worth saving? How would they explain to the future why they chose what they did? How would they explain what was missing?

In the early days of internet archiving, preservationists envisioned hoarding the web in its entirety. This maximal approach still holds sway at the Internet Archive, by far the biggest and best-known steward of our vast online world. Since 2001, its Wayback Machine has crawled the web thousands of times every day, saving as much as possible, as fast as it can.

In its quest to document everything, the Internet Archive adheres to no established archival protocol. In the paper archives of yore, each item went through a process of appraisal to ascertain its value—Is it worth saving?—its authenticity and completeness—What qualifying information should be saved along with it?—and the ethical hurdles to acquiring it—Who created or owned it, and should they get a say? Eschewing these questions, the Wayback Machine aspires to be more of an automatic backup than archivist. As the founder, Brewster Kahle, told me, “If we’re really successful, it just becomes plumbing.”

On the internet of Kahle’s dreams, nothing decays. No matter how deeply buried, the past waits to be found if you know where to look. On the internet we live with, however, things dissolve all the time. Digital objects decompose faster than physical ones—technology evolves, and files in old formats become unreadable hieroglyphs. Web pages migrate, disappear, or change without warning, on average every hundred days, and no person, or program, can be everywhere at all times. The largely automated Wayback Machine comes the closest—it has crawled more than 339 billion web pages at last count—but its contents remain a sliver of the gargantuan whole.

Even so, quixotic faith that the internet can be preserved wholesale persists. In 2010, the Library of Congress attempted to become the Internet Archive of the social web, announcing that it would preserve every tweet, from 2006 forward. Among journalists, the project generated mild derision of the who-cares-what-Joe-Schmo-ate-for-breakfast variety, but also enormous excitement among researchers, roughly four hundred of whom wrote during the project’s first three years with requests to use the data. So far, not one of them has gotten a hand on it. Instead, the project has come to symbolize just how ill-prepared libraries are for the internet era. The LOC struggled for years to index its tweets; at one point in 2013, a simple keyword search of the archive took twenty-four hours. When I spoke with one data scientist, Kalev Leetaru, he likened the project to “an elementary school library trying to take in the Library of Congress’s entire archive.”

The project also exposed the ethical problems with vacuum-hose archiving, which are especially evident in the semi-diaristic space of social media. In the beginning, the library maintained that Twitter was public, and therefore it didn’t need users’ permission—but this elided the uncomfortable fact that most tweets aren’t written with a broad audience—let alone posterity—in mind. Its archival methods were incapable of filtering out home addresses, credit card numbers, terminal diagnoses, and other personal information. If it was debatable whether the library should have been acquiring that information in the present, it was in many ways worse to save it for the future, when its uses couldn’t be known. As the internet ethics scholar Michael Zimmer has written, the library’s approach “presumes a false dichotomy that information is either strictly public or private, ignoring any contextual norms that might have guided the initial release of information through Twitter or how a person expects that tweet to flow.”

The LOC maintains that the archive will eventually be accessible, but it still won’t be a truly comprehensive collection. Many tweets link to articles, videos, and other pages that go unpreserved by the library. Changes to the Twitter platform, the demographics of its users, and the unmasking of thousands of handles as covert Russian bots are so far unacknowledged. In late 2017 the library switched to collecting tweets selectively, as it does with all other materials. The larger archive remains closed.

If wholesale collection is both technologically infeasible and ethically suspect, then what might a new model of digital archiving look like? Following Jules and Summers’s initial efforts at chronicling the online response to Michael Brown’s killing, the two launched a project called Documenting the Now (DocNow), a rare alliance between library studies and computer science. Jules and Summers realized that collecting the web—where communities of color have carved out space, and queer and marginalized voices have flourished—was an opportunity to democratize the historical record. Documenting the Now attempts to afford its subjects the same level of agency that has always been granted to the powerful.

I met them in March, at a conference they helped organize called the National Forum on Ethics and Archiving the Web, held in the bowels of New York’s New Museum. Summers, tall and bespectacled, sat quietly taking notes toward the back of the room; Jules, the archivist, with a round face, ready smile, and a voice inflected by his Afro-Caribbean origins, ushered speakers onto the stage. He seemed to know everyone, and to be hailed by a colleague at every break between panels. For the last four years, Summers had been working on a suite of tools designed to facilitate ethical archiving, among which is an appraisal program: a dashboard that displays the top Twitter users, hashtags, and keywords trending at any given moment, allowing archivists to monitor what they might want to save, and then to explore the data in bite-size samplings of a thousand tweets. In addition to cataloging activity related to the Black Lives Matter movement, this had enabled Jules and Summers to document the protests at Donald Trump’s inauguration, the Unite the Right rally in Charlottesville, Virginia, and the aftermath of Hurricane Maria in Puerto Rico. While the project is currently limited to Twitter (which is easier to harvest than the more labyrinthine Facebook or Instagram), Summers and Jules hope to eventually incorporate other social media platforms.

Jules and Summers had been sharing their tools, and their ethical concerns, with a growing interdisciplinary community of archivists, developers, and other interested parties via a Slack channel. They emphasized the importance of what Summers calls “the documentation that comes with the documentation.” Sometimes this is as simple as annotating a collection with information about when it was gathered, and the broader historical or social context. At other points, Jules has documented the choices he’s made in maintaining the archive. For example, Twitter’s terms of service require that any time a tweet is deleted, archivists “make all reasonable efforts” to erase their copies, too, and Jules has written about cases they’ve faced. (Intended to give creators control over their content, this policy has proved unenforceable. Jules and Summers, however, have found a way to comply. They delete tweets in their data set, and publish only the tweet ID numbers, with which a tweet can be “rehydrated” to its original form, but only if it still exists out on the live web.) In one case, Jules’s training as an archivist convinced him to disobey the rule. Twitter deleted posts by @Blacktivists—an account linked to the Russian government that posed as a Black Lives Matter activist to stoke racial tension in 2016—but convinced of their historical value, Jules decided to not only preserve but publish ninety of its tweets.

Offline, Jules also began cultivating relationships with the Ferguson activists. He emphasizes to them the fact that anyone—academic researchers, intelligence officers—can collect your social media presence for any number of purposes. In light of that knowledge, he prompts them to think about whether and how they’d like their voices preserved. Over the years, these conversations have added layers to the Ferguson archive, for example through a series of recorded interviews with activists in 2017. Many used their remarks to fill in the gaps that the Twitter record leaves out. “Online . . . we have to elevate ourselves to be these perfect people, and these perfect leaders, and these perfect voices for black liberation,” Ferguson activist Alexis Templeton said. In Templeton’s view, social media misses the mess, the mistakes, the “humanness and growth”: “I’m not just a bunch of retweets and favorites. I’m a whole-ass human, and Twitter misses that.”

It takes time to treat the people behind Twitter feeds with the same deference as a Pulitzer Prize–winner donating his papers. For many web preservationists, it takes too much time, and leaves too much data on the table. Justin Littman, a software developer and the creator of an archival tool called Social Feed Manager, told me he thinks DocNow’s emphasis on appraisal makes sense only in ethically fraught cases—when dealing with activists facing government surveillance, or perhaps with the #MeToo movement. He remains committed to collecting other topics en masse. “Every decision not to collect results in a gap in the historical record,” he says. It’s self-evident that some tweets, Donald Trump’s for instance, are matters of public record.

Jules, for his part, dispenses with the notion that more is better, arguing that a smaller archive could prove more enlightening to those who inherit it. “Maybe we don’t need one million tweets about Ferguson—we just need a good five hundred or one thousand,” Jules told me. This could make for a more legible and usable record—more sound and less noise, more representative and less random. A healthy archive illuminates its own contents through careful indexing and organization; we know exactly what’s in it, and also what’s not, which helps us make sense of the fragments we have. In contrast, the Wayback Machine has so far proved too vast for anyone to take complete inventory. The names of billions of homepages are indexed and searchable, but the contents of the web pages remain uncharted territory. Likewise, the more than half trillion tweets in the LOC’s possession, the library says, have only been partially catalogued. When no one is likely to lay eyes on a particular post or web page ever again, can it really be considered preserved?

A few months back, my fiancé decided to unearth his first ever email account. He was surprised and crushed to learn that Hotmail had deleted it over a decade ago. It got me thinking about my physical relics, which live in a plastic bin that he and I have hauled through half a dozen moves: a CD bearing saved photos, though neither of our computers contains a disc drive; a beloved mug that now leaks through a crack; a champagne cork from the day we got engaged. Pack rat though I am, I’ve been appraising my life all along. When we left Washington, D.C., for Boston a year ago, I threw out a decade’s worth of birthday cards and the notes from a recent writing workshop I remember as useless—but kept the fervent birthday letters my mother always writes, and the college syllabi of philosophy books that I keep telling myself I’ll revisit someday. I’ve been wondering which of my digital records are worth carrying like that overfull box—not as heavy, but no less consciously accounted for. I’ll never reassemble every scrap of myself I’ve scattered across Facebook, but I’ve started downloading my favorite photos and saving them to the cloud. I don’t expect to make the historical record, but if the archivists ever came knocking, I’d want to have saved my own annals, and decided for myself what to throw away.

I discussed this feeling with the digital preservationist Adam Kriesberg, a lecturer at the University of Maryland. “I never would have thought it would be cool to see what I wrote on my AIM profile,” he said with a laugh. “But now it’s gone, and I’ll never remember—I’ll only know I used to have all these handpicked quotes and inside jokes with my friends. That represents an evolution of the social web that we’ll never have.” If more people saved their own personal histories, “it would be easier in the future to re-create an understanding of how the internet worked.”

The bigger obstacle to deciding the fate of our own archives is, of course, that our data don’t really belong to us. At the Ethics and Archiving the Web conference—which took place only days after the New York Times reported that Cambridge Analytica had used private details from millions of Facebook profiles to aid Donald Trump’s campaign—the UT Austin information scientist Amelia Acker delivered an electrifying eschatological sermon on the subject of our authorial rights. The business model of social media companies “depends on keeping collections out of the hands of the creators as users,” warned Acker. “As a condition of their creation, we cede the power to collect them on our own, to own them, to govern them, to exert rights over whether and if they should be archived, and to select what should be remembered and what should not remain.” The issue of who owns our digital histories is entangled with our increasingly glaring lack of control over the use of our data in the present.

“What does it mean that most of the photos of grandkids are up on these platforms?” Acker asked when I called her on the phone. I thought about my father, and the summer he spent compiling family photo albums, identifying each long-dead relative in the yellowed portraits that came to him when his own father died. This kind of history, once capable of surviving generations in a box under the bed, now persists for the benefit of Mark Zuckerberg’s bottom line. So far, preservation isn’t part of the ongoing discussion about Facebook’s many failures. But if that changes someday, you can bet that the company will offer a self-serving solution. As Acker points out, “You could imagine a future where they make money off personal digital archives.”

Preservationists also wonder what will happen when Facebook and Twitter inevitably fold: what will become of their data, and also of their algorithms, the architecture of the spaces where we spend so much time. In a recent paper, Clifford Lynch, the director of the Coalition for Networked Information, a nonprofit that promotes the use of information technology in scholarship, described social platforms as strings of “unique, personalized, non-repeatable performances,” produced by a puppet master the user never sees. Only the algorithm can say how it chooses the posts that appear at the top of my timeline. Lynch argued that we need to stop focusing on artifacts—e.g., individual tweets—and begin collecting experiences: the streams we see and interact with every time we log on. The plain text of a tweet tells you little about Twitter. The way the platform bombards you with hundreds of tweets in a minute comes closer to conveying what the future needs to know.

To relay these invisible forces to our descendants, some scholars are falling back on time-tested strategies. Katrin Weller, a social media researcher in Mannheim, Germany, explained to me the history of changes to Twitter’s basic platform: when it introduced the “retweet” button, for example, or changed the “like” button from a star to a heart. She compared her work to the study of old coins. “You always need to know a lot about the surroundings,” she says. “You need to know how it was printed, what the symbols on the coins mean.” She imagines a future historian stumbling on a cache of tweets about the 2016 election decades after Twitter goes defunct. “To me, it seems almost impossible to make sense of that without some additional information about what Twitter was, what it looked like,” she says. She’s found screenshots of Twitter’s original platform in books—“things like Twitter for Dummies, ‘How to set up your profile’”—and is looking for videos on YouTube that might capture the site at different stages in time.

Digital archivists, it seems, have much to learn from their paper and vellum predecessors. But so too must the orthodoxies of archival practice bend to accommodate protean new media. “Historically, the role of the archivist, the steward, has most often been to preserve documentation, recordings created by others,” Lynch observed in his paper.

But it’s increasingly obvious that those creating the documentation are also essential and crucial participants in the enterprise of stewardship. If archivists will not create, capture, curate the “Age of Algorithms,” then we must quickly figure out who will undertake this task, and how to get the fruits of their work into the custody and safety of our memory organizations for long-term preservation.

One proposal that Lynch has suggested to fill this lacuna is a digital version of Nielsen families, the anonymous households whose responses to TV shows are used as raw data for determining ratings. He envisions assembling “an appropriately diverse collection of users” who would consent to have preservationists “look over their shoulders, recording their interactions.” This idea strikes me as a not-so-distant cousin of Jules’s ambitions to teach members of the communities he documents to use DocNow. He envisions a world where activists and ordinary people collect their own tweets alongside historians and professional archivists. These creators could obviate the ethical question of consent by deciding for themselves what to save.

So far, Lynch’s vision is only a fantasy. But I like the idea of joining his army of Nielsens—of preserving our collective memory by following the instinct to hoard my own past. The conference at the New Museum was co-hosted by the digital art organization Rhizome, which has created a tool called the Webrecorder. If you turn it on and surf the web, it captures pages as you open them, creating a simulacrum to which you can return. It’s a sort of virtual reality, a camera that records your house as you walk through it, the re-creation growing by a room every time you open a door. I tried it out recently, capturing the homepage of my personal website, clicking through links to some of my work. I skirted the sinkhole of my email inbox—a problem for another day—and typed “” into the Webrecorder search bar. What here could possibly be worth remembering? I wondered, as I entered the slipstream of in-jokes and bad news. My timeline, which is 99 percent other journalists? My own tweets, which intermittently record what I read? A number in the upper-left-hand corner of the screen measured the megabytes the recorder had captured: 8.5, 9.3. I considered deleting what I’d collected, but ended up saving it to Rhizome’s online vault. Maybe, a decade or more from now, I’ll want to remember how it felt to waste time on the internet circa 2018. It’s one of my life’s core experiences—a thing I reliably do every day.

More likely, I’ll never go back and look at this, just as I’ve never reread most of my paper collection, though I derive comfort from knowing it’s there. The truth is that saving everything is no guarantee against forgetting most of it. Memory, like history, is never exhaustive, and we all make choices about what to include. Still, it’s frightening to let go of anything, not knowing what you’ll wish you’d saved.

lives in Boston.
