March 2021 Issue [Readings]

Theory of Data Transformation

From Living in Data, by Jer Thorp, which will be published in May by MCD, an imprint of Farrar, Straus and Giroux. Thorp is a data artist and an instructor at New York University.

“Data” has always been a restless word.

It first appeared in the English language on loan from Latin, where it meant “a thing given, a gift delivered or sent.” It spent its early years in the shared custody of theology and mathematics. The clergyman Thomas Tuke wrote this in 1614 about the difference between mystery and sacrament: “Every Sacrament is a Mysterie, but every Mysterie is not a Sacrament. Sacraments are not Nata, but Data, Not Naturall but by Divine appointment.” By 1704, data had found a hold in mathematics beyond geometry. Another clergyman, John Harris, defined “data” in his Lexicon Technicum as follows: “such things or quantities as are supposed to be given or known, in order to find out thereby other things or quantities which are unknown.” Data as truths like gravity and π and the Holy Ghost.

For a century or two more, the linguistic neighbors of “data”—that is, those words that most often appear in close proximity to it in text—remained consistent. “Math,” “numbers,” “quantities,” “evidence,” “unknowns.” Some new words arrived as mathematicians and philosophers worked to order their universe: “qualitative,” “quantitative,” “ordinal,” “cardinal,” “ratio.” At the turn of the twentieth century, with the birth of modern statistics, came a new way for data to be thought of: as the contents of a table. Fifty years after that, “data” became bound to a word that would change the way in which it is commonly understood: “computer.” Between 1970 and the end of the millennium, it changed from being a thing of God and mathematics to a collection of bits and bytes.

More recently, “data” has found its way to the mess of human lives. It’s there now with “social” and “genetic” and “sentiment,” with “migrant” and “gender” and “identity.” And as “data” settles in with its new neighbors, we must change the way we think about it.

Though the definition of “data” has changed—from mathematical givens, to pieces of evidence, to assemblages of electronic bits and bytes—it has always been thought of only as a thing, a noun. What if, along with a change in meaning, “data” were to undergo a shift in usage? What if “data” were also a verb? I data you; you data me. They data us; we data them.

In case this seems too outlandish, consider two synonymic neighbors of “data”: “record” and “measure.” Both of these words exist as nouns (I made a record), as verbs (We measured the temperature of the room), and indeed as verbal nouns (They found a list of measurements and recordings). The verbal forms of “record” and “measure” make communication about the act of making records or taking measurements much easier. If we made “data” a verb, rather than having to say that the National Security Agency was collecting data on our every interaction, movement, and metabolic function, we could simply say, “They data us.”

Data is not inert, yet its perceived passivity is one of its most dangerous properties. This is why when citizens are warned that a government or corporation is collecting data about them, so many are underwhelmed. The act of collection seems so harmless, so indifferent, so objective. But of course data is not collected and then left alone: it is used as a substrate for decision-making and as an instrument for differentiation, discrimination, and damage. The systems of data collection and use are humming with the capacity for bias, influence, action, and violence. This is evident in the linguistic neighborhood that “data” has begun to occupy in the past ten years. The words moving away from “data” are the ones that it has lived closely with for much of the past century: “information,” “digital,” “software,” “network.” Among the words moving toward “data” are some that seem to summarize recent events: “scandal,” “privacy,” “politicians,” “misinformation,” “Facebook.”

But at the same time, there are now also words that we might not previously have expected to find in the same sentence as “data”: “lives,” “deserve,” “place,” “ethics,” “friends,” “play.” “Data,” it seems, is being pulled by strong currents. One is drawing it toward a dystopian future. The other, more hopeful, might bring data to a more utopian place.

Is it possible, then, that we might give it a push in the right direction? To do this, we must view data not just as a thing but as a system. Then we might begin to imagine a way toward that better technological future, one where we all data together.

I created a map of the linguistic neighbors of English words by gathering a corpus of three hundred million of them from Google News and processing them with a program called word2vec. What this program does is look at the position of every word in every sentence and keep a running tally of the relationships between them. Each word gets a position—a vector—in relation to every other word. For words that often appear close to “religion”—“God” or “church” or “pew”—this position will be given a number close to zero. For words that almost never sit in the same sentence as “religion”—“squid” or “pappardelle”—this number will be close to one. The number of vectors in the map that I’m using is huge—remember that every word gets a position in relation to every other word. Out of this comes a data set of nearly a billion vectors.
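To make the idea of a number "close to zero" or "close to one" concrete, here is a minimal sketch that uses cosine distance, one common way of comparing word vectors. Whether this is the exact measure behind the map described above is an assumption, and the three-dimensional vectors are invented for illustration; they are not output from word2vec, whose vectors typically have hundreds of dimensions.

```python
import math

# Toy vectors, made up for this example only (NOT real word2vec output).
toy_vectors = {
    "religion":    [0.9, 0.8, 0.1],
    "church":      [0.8, 0.9, 0.2],
    "pappardelle": [0.1, 0.0, 0.9],
}

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Words pointing in nearly the same direction score near zero.
near = cosine_distance(toy_vectors["religion"], toy_vectors["church"])
# Words pointing in very different directions score closer to one.
far = cosine_distance(toy_vectors["religion"], toy_vectors["pappardelle"])

print(f"religion vs. church:      {near:.3f}")
print(f"religion vs. pappardelle: {far:.3f}")
```

With these invented values, "church" lands much nearer to "religion" than "pappardelle" does. The numbers come entirely from the vectors fed in, and in a real model those vectors come entirely from the text someone chose to collect.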

A word map this vast and multidimensional allows us to examine the ways that language is interconnected. For example, what word is connected to “woman” in the same way that “king” is to “man”? The network dutifully offers up an answer: “queen.” In 2016, Tolga Bolukbasi, then a machine-learning student, exposed troubling gender bias in the program’s results. When queried, for example, as to which word is connected to “woman” in the same way that “doctor” is to “man,” the system answers “nurse.” When asked about “computer programmer” in the same context, word2vec offers up “homemaker.” Gendered relations are evident even indirectly: “receptionist” is closer to “softball” than it is to “football.”
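The analogy queries above are usually answered with simple vector arithmetic: subtract "man" from "king," add "woman," and look for the nearest remaining word. The sketch below shows that arithmetic on a handful of invented three-dimensional vectors. The vocabulary, the values, and the `analogy` helper are all hypothetical stand-ins for illustration, not the program's actual data or interface.

```python
import math

# Toy vectors, made up for this example only (NOT real word2vec output).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "squid": [0.2, 0.3, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word nearest b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

answer = analogy("man", "king", "woman", toy_vectors)
print(answer)
```

The arithmetic itself has no opinion about gender. It returns whatever word the vectors place nearest the target, which means that any bias in the answers was already in the vectors, and before that in the text they were built from.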

One might argue that the program is simply offering a neutral analysis of the underlying data. To understand the danger here, we need to consider why software like word2vec exists. It’s a tool built to make decisions such as which job candidates to hire. In October 2018, a software system developed internally for Amazon’s HR department was scrapped when it was shown to be dramatically biased against women. The system rated résumés lower if they contained the word “women’s” and higher if they used words that have been shown to be more common on male résumés, such as “executed” and “captured.”

This may seem like a modern problem, but it stems from the seventeenth century, when the word “data” drifted into English from Latin. We are still stuck with the idea that data is static, given to us, if not from God, from somewhere similarly divine. There seems to be a common belief that we can use data to investigate the world, but the result would be a very particular model of the world, gathered by particular humans in a particular culture and time. Data about anything—a sentence, a bird, the temperature of a room, the age of the universe, the sentiment of a tweet, the flow of a river—is an artifact of one fleeting moment of measurement and is as much a record of the human doing the measuring as it is of the thing being measured.

