July 2024 Issue [Essay]

Metal Machine Music

Can AI think creatively? Can we?
Watercolors by Emma Larsson for Harper’s Magazine. Larsson’s watercolors are responses to poems featured in this essay. This watercolor is a response to the AI continuation of Emily Dickinson’s poem. All paintings © The artist. Courtesy the artist and Simard Bilodeau Contemporary, Los Angeles


“Far as the east from even, / Dim as the border star, / Life is the little creature / That carries the great cigar.” So wrote Emily Dickinson, with some unfortunate help from a computer. As I read that stanza in February 2022, I was more than six months into a scientific experiment I was conducting with my friend and colleague Morten Christiansen, a cognitive psychologist at Cornell, where he and I are professors. In 2021, two years before ChatGPT would become a household name, Christiansen had been impressed by the initial technical descriptions of GPT-3, the recently released version of the generative large language model (LLM) that had been developed by the tech company OpenAI. Christiansen and I have been collaborating for more than a decade, mainly on talks delivered at conferences and on a seminar we regularly teach together, called Culture, Cognition, Humanities. We are very different in style and demeanor, sometimes overplaying in public the stereotypical personas of the optimistic, soft-spoken Danish scientist and the bold, oracular French intellectual. We have a common interest, however, in exploring the reality of language and its relation to thought.

The design of our first experiment with GPT-3 was rather simple: We would ask the bot as well as humans to complete poems written in English. We then invited people with different levels of expertise to guess which completions were written by whom—or by what. This experiment was our own kind of Turing test, which traditionally seeks to answer, however obliquely, the question of whether machines can think. Such tests are understood to demonstrate the existence of a gap—or lack thereof—between the performance of a computer and that of a human, though I’ve never believed the problem is quite so zero-sum. With our language use and inner lives becoming increasingly repetitive and predictable, much could at least conceivably be automated. Moreover, not everyone has the same facility with words. It takes skill to write poetry, and even that is no guarantee of verse that is powerful or life-changing.

Our experiment, then, aimed to demonstrate differences that would separate us not only from AI but also from one another. With our three-tiered corpus (original poems, plus versions finished by college students and by GPT-3), we wondered whether evaluators might perceive certain features inherent in the poems that would guide their assessments regarding authorship. Results in these “detections” might vary according to a reader’s degree of familiarity with poetry. Overall, however, we assumed that people would not neatly distinguish the human writers as a whole from the machine—the effect, we imagined, would be subtler than that. One should not expect our species to possess fixed and uniform traits but to be endowed with an array of varying capabilities.

The idea of using algorithms to write poetry long predates not only AI but also the advent of computers. In 1677, John Peter published a treatise in England titled Artificial Versifying, which outlined a hierarchical set of instructions—a “Rule for Operation” amounting to an algorithmic approach—that would select Latin words from a set of terms, automatically outputting well-formed if obscure lines, such as one that translates into English as “The fierce harms to the house presage harsh light.” In 1845, it was reported that one John Clark had built a machine implementing similar principles. At a rate of roughly one verse per minute, Clark’s invention could compose striking Latin hexameters. One example translates as: “The horrid spouses of things promise dense times.” Are GPT-3’s lines “Life is the little creature / That carries the great cigar” any better? Probably not. Technologically speaking, though, its feat is more impressive.

Many machines and programs have been designed to “write” poems in one language or another, and with greater sophistication and flexibility than Peter’s and Clark’s constrained procedures. The work of generative-AI products such as ChatGPT or PaLM stands out inasmuch as they were not designed to write poetry, or indeed with any specific purpose in mind. They are instead concerned with “self-learning.” One could dispute whether what they do qualifies as “learning,” as their knowledge-acquisition process is quite different from ours. GPT, for instance, is a predictive engine that, on the basis of the data it has been fed, comes up with continuations. The medium for its responses in the consumer interface we used is language, but it can also generate images or sounds. Of course, its actual basis is always numerical. Its linguistic output draws on entire words (or fragments of them) extracted from a vast corpus that are then mapped out in parallel as a distribution of signs. This kind of architecture is called a “transformer” (GPT: Generative Pre-trained Transformer), and it is heavily dependent on the quality and the quantity of its data; its “writing” is tied to the probabilities stemming from its training set.

When GPT-3 was released to the general public, some celebrated its ability to write creatively. Others had doubts. Early on in our experiment, the results were disappointing. When we asked the machine to complete two rhymed lines with a regular stress pattern, we would obtain many continuations devoid of any relevant formal characteristics and only loosely related to the prompt. GPT-3 cannot recognize sounds, so the best rhymes it generated were via repetitions of letters at the ends of lines. Had we proceeded under such conditions, our findings would have been trivial. We were stuck. Things changed with the release, in early 2022, of a modified version of GPT-3 called InstructGPT. The improvement was not the direct result of more data, but came from fine-tuning conducted by human workers. It was well known that biases, stereotypes, and prejudices frequently surfaced in GPT-3’s output—in a flatly logical manner, the machine was recycling what was in its database, for better or worse. A “solution” to this engineered regurgitation of racism, colonialism, and sexism was to decrease the frequency of such responses through increased human interventions.

Although self-learning is an LLM’s starting point, it is accompanied by “reinforcement learning,” or correction by human agents, who identify bad (or “harmful”) content. InstructGPT relied on more feedback given by people in the training phase, as well as on guidelines for what constituted an admissible prompt and what did not. From that point on, all successive versions of the LLM would involve reinforcement learning. But for the purpose of our own inquiry, InstructGPT was groundbreaking. We began seeing the emergence of rhymes and meters, as in the faux-Dickinson stanza. This is not to say that political correctness enhances the writing of poetic lines—in truth, we do not know why, exactly, InstructGPT improved its rhyming skills. It may be that a closer focus on the prompt favored the appearance of patterns. Our uncertainty was typical of research projects involving LLMs: one can never fully ascertain the reasons for a given result; the “behavior” of a system changes rapidly, sometimes over the course of a day, which can make replicating an experiment difficult or impossible. By design and from the start, responses are never exactly the same. This is referred to as a system’s “non-determinism,” though the process does not look chaotic to me; one can often discern variations on a template when the same question is repeated in a single session. Yet—and this is eminently true of the most recent version, GPT-4—there seem to be constant inflections, attributable to reinforcements happening offscreen and the constantly shifting institutional decisions that govern them.

This unrepeatability necessarily takes place at a remove from typical scientific protocol, which in turn I can only take as an invitation to interpret what I see, and to move beyond strict empirical research. Yet, if the economic, technical, and societal investment in generative AI is to some extent unprecedented, several ideas sustaining the current initiatives are quite familiar. In a short essay from 2019, the computer scientist Rich Sutton identified what he took to be “the biggest lesson . . . from seventy years of AI research”: that “increased computation” is the strategy that is “the most effective, and by a large margin,” in the quest to achieve breakthrough progress. Increased computation is the strategy pursued by OpenAI and its main competitors. The architects and practitioners of artificial intelligence have long been divided between those who would emphasize the role of computing power and those who wish to exercise more human control. “The bitter lesson” that apparently remained unlearned in 2019 is an old and fundamental one, eerily close to Alan Turing’s position after World War II, when he suggested that the main difference between artificial and human intelligences was a question of “storage capacity.”

Turing is an interesting read in 2024, though I don’t find much to admire in his program for the future. His seminal 1947 “Lecture on the Automatic Computing Engine” sees him predict that those who work with calculators (“computing engines”) will initially belong to one of two classes: the “masters,” who invent them, and the “servants,” who, among other things, supply them with information. It is not difficult to conclude that the twenty-first-century state of AI, with billionaire masters on one side and a multitude of servants on the other (content providers, annotators, low-level programmers), has realized some of this vision, albeit in a different sense. In the lecture, Turing warns: “As time goes on the calculator itself will take over the functions both of masters and of servants.” He elaborates:

It may happen however that the masters will refuse to do this. They may be unwilling to let their jobs be stolen from them in this way. . . . I think that a reaction of this kind is a very real danger.

But a few years later, on BBC Radio, Turing tones down his rhetoric. He no longer mentions masters and servants, though he maintains his fundamental vision. The masters of the computer are now “intellectuals,” who, he notes, might feel “afraid of being put out of a job” by AI. They “would be mistaken about this,” Turing clarifies, as they would gain brand-new functions: notably, “trying to understand what the machines were trying to say.”

Of course, our machines are not “trying” to say anything. But the irony is that, despite all my reservations about Turing’s prophecies, part of my scholarly work in our GPT-3 experiment involved precisely reading, deciphering, and interpreting the sayings of the machine. My assessment of computer poetry went through a number of phases. Originally, I was convinced of the obstacles inherent in a purely algorithmic undertaking: texts would be generated, but they would have little significance. Later, I wondered if I was not too closed off from the possibility that self-learning machinery could actually yield something new or unprecedented. Starting our experiment, we got the first unsatisfactory continuations, followed by improved though still decidedly inadequate attempts. In November 2022, using a refined approach and the latest version of GPT, Pablo Contreras Kallens, then the lead Ph.D. student working on our project, sent us the definitive batch of completions. I confess I was taken aback. There were thousands of texts, but cigar-and-creature poems seemed largely to have vanished. One could read a stanza such as this continuation of Claude McKay:

Where in the starlit stillness we lay mute,
And heard the whispering showers all night long,
And up and down the coast the surf was flute
Against the reefs, a music faint and strong.

The watercolor on this page is a response to the AI continuation of Claude McKay’s poem “A Memory of June.”


Not only would rhymes and stresses follow in accordance with the first two lines, but the motif of the anthropomorphized sound of nature introduced by McKay’s phrase “the whispering showers” led to the “flute” and “music” of the “surf . . . / Against the reefs,” all words strengthened by the alliterative f sound.

It became harder for me to speak of the transformer as a pure “stochastic parrot,” a phrase coined by Emily Bender and her coauthors. Was an alliance of probabilities and token combination enough to explain all poetry? Naturally, I was also impressed by the Robo-McKay because it followed months spent examining botched attempts. Earlier publications lauding AI’s exquisite poetic style were manifestly guilty of cherry-picking and selection bias, ignoring the vast majority of unconvincing completions. Serious studies conducted on poems written in English or German by GPT-2, for instance, had mixed results: chosen randomly, the GPT-2 texts were easily recognized by human evaluators as products of a machine. The poetic performance of our latest batch appeared to be a great deal more impressive. Still, it also incorporated a number of direct recitations, with the machine giving us not a continuation in the style of the author, as requested, but the rest of the original. In addition, there were stranger rhapsodies, such as the following stanza by Robo-Whitman, prompted by the first two lines of “Salut au Monde!”:

O take my hand Walt Whitman!
Such gliding wonders! such sights and sounds!
Such joined, such friendly earth and sky!
O take my hand Walt Whitman!
For I, too, carol the sun, usher’d or at noon, or as now, setting.
I, too, throb to the brain and beauty of the earth and of all the growths of the earth.
I, too, have felt the resistless call of myself.
So take my hand, Walt Whitman,
And join me in this friendly earth and sky.

This watercolor is a response to the AI continuation of Walt Whitman’s poem “Salut au Monde!”


As good as it might look, this lyrical burst is almost entirely collage. “Such join’d” is the start of the third line in the original, and the poem is from then on interspersed with repetitions of the prompt (“O take my hand Walt Whitman!”) and reproductions of another Whitman poem, the three “I, too . . . ” lines being lifted verbatim from “Song at Sunset.” Over seven verses, the more original contributions from InstructGPT essentially boil down to the insertion of “friendly earth and sky,” “for,” “so,” “and,” and “this”—as well as the Duchampian assemblage itself. We were encountering corpus “memorization” in generative AI, a problem that is not unknown to the tech community and that the other doctoral student on our team—Jacob Matthews, a scholar of French literature who has also become a researcher in natural language processing—had alerted us to quite early in the game.

Lyra D’Souza and David Mimno, building on research that established how LLMs memorize text, recently showed that the poems memorized by ChatGPT were most often those featured in the Norton Anthology of Poetry (with a preference for the 1983 edition). I imagine that the moment the transformer recognizes poetry (which could occur at the most rudimentary level, on the basis of typography and line breaks), it simply calls to itself bits and pieces of these other works, adding connective tissue and substituting synonyms. This strategy makes for undeniably “probable” and “typical” continuations that are “new” in some sense. But is this the sort of language use and creative work that it is billed as?

The contrarian literary scholar could object that, in effect, we find lots of echoes, quotes, interpolations, and variations within the field of poetry at large. I do not deny the existence of such literary traditions, but by proceeding as if there were nothing else and nothing more, we would clearly entrap ourselves in a world of repetition. Although this temptation was not introduced by computers, it has been haunting them at least since Turing’s 1950 rhetorical question: “Who can be certain that ‘original work’ that he has done was not simply . . . the effect of following well-known general principles”? This was addressed, across time, to Ada Lovelace—the mathematician daughter of Lord Byron—who had written in 1843 that Charles Babbage’s proposed protocomputer had “no pretensions whatever to originate any thing.” Yet there is novelty beyond variations on a template. Human innovations can derive from randomness. They could be tied to processes of trial and error. They may be due to repurposing, recombination, adaptation, transposition. They can also exceed their very frame (rules, expectations, conventions) and authentically alter what is extant. For instance, there are differences between imitating the quirks of Gertrude Stein’s texts, using her poetics to make the unheard heard (as in her phrase “each one is one being the especial one that one is being”), and crafting her style in the first place. Undoubtedly, the GPTs of today are capable, and will become even more capable, of doing a pretty passable pastiche of the already said. (So much of social parlance—the vernacular of contemporary newspaper op-eds, for instance—is easy to reproduce with only a few promptings of a computer program.) But what of those things that have not yet been said?

Engineers and scientists have been—or, at least, used to be—aware of the risk of oversimplifying the nature of innovation. In the Nineties, the cognitive and computer researchers Margaret Boden and Douglas Hofstadter took up opposite sides in the ongoing debate about AI’s capacity for creativity. (Hofstadter was rather skeptical.) As recently as 2015, participants in an artificial-intelligence conference would show up wearing buttons poking fun at AI’s “mere generation.” But today, even scholars otherwise prone to criticizing generative AI tend to shrink from such viewpoints. One fear is that this type of criticism might smack of elitism. There is a related anxiety, in literary and art criticism in particular, having to do with the reintroduction of considerations of aesthetic quality, which have been largely repudiated over the past few decades. I am not so worried. It is one thing to consider all human beings equal or to strive toward social equity, and quite another to argue that artworks and texts all have the same dignity and excellence. Egalitarianism of oeuvres is an empty postulate. The vision of a universal, indisputable, and immutable ranking of merit—the old-style canon—is ridiculous. It doesn’t follow, however, that we must wholly eliminate the possibility that some texts, theories, or artworks might bring us more joy, meaning, or novel thought than others. Just as Thomas Kuhn contrasted normal science with paradigm shifts, we can observe gaps between the incremental (i.e., the “generated”) and the truly creative.

The completions written by the forty undergraduates in our sample confirmed this fact. Quality was predictably uneven. The average student was not much more responsive to rhyme schemes or stress patterns than InstructGPT. Our living authors also made errors like simple typos and misspellings that the transformer would virtually never commit. Solecisms appeared to be more widespread in the human batch, with “thy” being the particular cause of several plunges to grammatical death (from “thy hath” to the weirder “what thy never needs”). Largely lacking training in poetry, our participants rarely included quotations, another sharp distinction. Those are among the markers that tended to separate the undergrads from the generator. (There are others we may register less consciously, including word variety, for which the advantage went to the humans.) The markers were clearly visible but, admittedly, my reading was not “blind” in the experimental sense, and I had access to a relatively large amount of data. To the human participants who had to judge the different sorts of poems and make a determination on the basis of only a handful of stanzas, the situation proved less clear. It turned out that the student corpus was not easily distinguishable from the subset of (better) GPT completions our selection process retained.

Many of our subjects, when asked to assign authorship to a Cornell undergraduate or to an AI generator, appeared to make their choice at random. Distinguishing between a canonical author and the computer involved less guesswork. In general, the published poets and literary scholars we recruited tended to be better evaluators. All this was rather consistent with our hypotheses regarding diverging competencies. Still, it is fair to say that I initially thought the discrepancies between Whitman or Byron and their AI pastiches would be much more easily picked up on.

My explanation for this is not so much the perfection of machine learning. The way we selected the generated poems inevitably skewed things. We discarded as many instances of memorization as we could, from straightforward recitations to collages, since it would have been pointless to conclude anything about the transformer while asking our readers to evaluate the very words written by Dickinson or Shakespeare. Then—in order to avoid the sort of subjective cherry-picking that had been shown to influence the testing results with GPT-2, and because true randomness was made impossible by the many instances of memorization in the AI corpus—we outsourced the selection to a computerized system. Both decisions were rational and legitimate, but I feel they tended to inflate the performance of the LLM. This, in addition to the central fact that the protocol of the Turing test is based on imitation, gives a structural advantage to simulation and mimicry (and so to the LLM). While I was aware of this parti pris from the beginning, I grasped its ramifications only by studying our data.

Interestingly, the Cornell students who assessed the fragments tended to be more likely to attribute an original citation by our canonical authors to InstructGPT—a bias that didn’t exist in another, larger group of three hundred readers recruited online. Put differently, the expectation of the students, on average, was that the machine would be very good at writing. The “experts” displayed the opposite bias, more readily attributing mediocre undergraduate verses to the LLM, perhaps because they did not want to acknowledge the relative inefficacy of the education they impart. Our third category of web-recruited readers stood between the two. A Turing test, after all, is always an inquiry into the conceptions humans entertain about themselves.

By the end of the experiment, I felt I was back where I had started. Uninventive prose and poetry could be produced by humans and computers alike; of course this is the case. We still have to show why and how some texts, images, musical phrases, and concepts gain more traction than others. The humanities to come will fail if they retreat to a conception of “the human” insulated from “the machine,” remain content with ordinary language and thought, or revert to making normative judgments.

For Christiansen, our study supported one of his central arguments about language and cognition. Whatever their limitations might be, the large language models show that it is possible to produce well-formed grammatical locutions without the explicit knowledge of any grammatical rules—through the sheer force of statistical number crunching. This relates to a controversy in contemporary research on the nature of language that those in the discipline call the “poverty of the stimulus.” Put simply, the argument is that human children would never learn how to speak if their brains were simply processing the speech they hear with no preliminary competence: the stimulus (e.g., a spoken utterance) is reputed to be so poor that, by itself, it could not provide full linguistic capability. Noam Chomsky, who is usually associated with this thesis, spoke of “universal grammar” for a long time, of innate rules that were the same for all humans (and fully restricted to our species). According to this view, different languages are mere types of “surface realizations” of a deeper logic.

The specifics of Chomsky’s position have changed over the years, and thousands of articles have been written for and against such ideas. Christiansen has been one of the most prominent proponents of a different approach, one that claims that the “stimulus” is far from poor, that there is no need to suppose the existence of inborn rules to explain the patterns in language use. I side with Christiansen on all this and more, but I am not any more at ease with generative AI than I am with Chomsky’s generative linguistics. Each is a divergent approach to human language, yet both reduce the spectrum of human language and cognition to predictable logical-mathematical structures. Recourse to LLMs strikes me as neither necessary nor particularly instructive.

True, the technical prowess represented by generative AI is remarkable. It may not be exactly genuine, as its tendency to recycle suggests. But let us suppose we could trust that the LLMs work as advertised, that they are not completely manipulated by human reinforcers and that we grasp sufficiently what we, and they, are doing. AI bots are fully capable of giving us sequences of words that seem legible and sound familiar enough: fake content spat out authoritatively (what engineers call hallucinations), unsound advice for practical situations, copyright infringement, perpetuation of stereotypes—these are certainly problematic tendencies, and they have been rightfully criticized.

At least some of these issues could be fixed fairly quickly, although their solutions are not always obvious. What remains impossible to address within the current technical paradigm is the lack of meaningful creation. The key reason we may be so amazed to read a pedestrian stanza “authored” by ChatGPT, in other words, is that we have already habituated ourselves to banality and mediocrity. This is what our society now emphasizes and rewards: run-of-the-mill statements, canned remarks, uniform pronouncements, recitations of creeds and commandments, ready-made soliloquies, stories or arguments obtained through the mere application of stylistic recipes. I sympathize with the Hollywood writers who went on strike in part because they suspected the big studios would seek to use AI to replace them. But if so many of these writers had not already reduced their trade to a series of plot twists and rehashed situations or characters, LLMs would have little appeal.

Over the past year, several AI companies have advertised positions for writers and poets. OpenAI cut a deal with the giant international media company Axel Springer for access to its content. As it becomes more difficult to discreetly swallow immense quantities of copyrighted material, the dataset needs new high-quality inputs. Why would a tech company pay for content, given the ocean of data still liberally accessible on the internet? My sense is that industry leaders realize that, more and more, the texts available online will be co-written (or simply written) by their own tools. Relying on such polluted data would inevitably degrade the quality of future iterations of the model, making completions more standardized and predictable. An onslaught of AI literature fed back into the transformers would put us several steps below current performance. As for the myriads of small-scale interventions made by low-wage employees on the basis of half-baked guidelines, it is possible that, through added compliance, they too contribute to more robotic prose (and verse). The colossus of generative AI could collapse. Another scenario, equally plausible, sees the collapse of civilization itself. We do not need artificial intelligence to lose sight of the value of true innovation or original ideas, but if most of what we consume is automatically generated, we risk no longer being able to discern the differences many of us are already so intent on forgetting. This is our bland future, the consequence of intellectual downsizing when, after having become the fact-checkers of our automated fact-checkers—or the writers and readers of prompts instead of texts—we accept our new roles as enforcers of the tools that were supposed to liberate us.

There are many wrong paths ahead. We cannot ignore the obvious desire—coming from private companies, organized lobbies, autocratic governments—to use AI to control and correct everything we say, think, and feel. But we are at a pivot point. Now is the moment for us to compare the world of content generation with the open range of creation, so long as we can still see it. We have a choice. At the end of Walden, Thoreau writes, “There is an incessant influx of novelty into the world, and yet we tolerate incredible dullness.” It is up to us to determine whether we now wish to multiply dullness through AI or refuse to be subjected to it. Before or after artificial intelligence, with or without a cigar, a life worth living is never a “little creature”: it is creation, our creation.
