Coming of Age in Stable Diffusion

AI is here and more powerful systems are on the way. What can a text-to-image wedding tell us about ethnography for an AI-generated world?

In 2022, generative artificial intelligence made its popular breakthrough. Most will have heard about ChatGPT, a conversational model capable of responding to natural language input in a remarkably human-like fashion, but the range of applications is much wider. A couple of developers recently figured out how to use text-to-image generation to create music directly from a text prompt. In Copenhagen, a theatre staged a play where generative AI played a leading role on stage.

The play was a new version of an old H. C. Andersen tale about a young scholar whose shadow acquires a mind of its own and enslaves its master. The scholar was played by a human, the shadow by the AI. Each night before the show, the model tuned itself through live conversation with the audience.

I was in the audience on the last night. The director came onstage beforehand and introduced it as a “techno-anthropological experiment.” Then we began chatting with the AI, and it felt as if we were beginning to build rapport with it. It was hard not to watch the improvisation that subsequently unfolded on stage in light of that experience. As an audience, we were left wondering how to make sense of this new companion species walking among us.

Ethnography now faces a situation like the one it faced 20 years ago with the emergence of virtual online worlds. A new field has suddenly come into being with its own cultural expressions, its own species of interlocutors, and its own peculiar conditions for doing fieldwork. The question is no longer what it means to grow up and acquire friendships and identities in Second Life or World of Warcraft (as for Tom Boellstorff and Bonnie Nardi), but how one learns to see the world like an artificial intelligence that has been raised in a specific data world.

Deep neural networks are as unexplainable as any human informant, and with the advent of generative AI they are being put to creative use in ways where having a particular view of the world is less a bias than a feature. Ethnography is going to be necessary if we are to understand and live well with these beings.

Coming of age in Stable Diffusion

A marriage ceremony in Stable Diffusion takes place outdoors. The bride and groom stand facing each other on a lawn surrounded by trees. The ceremony is performed by a person holding a set of papers, typically a man in a suit, and occasionally under a floral arch or canopy. The bride wears a white gown, the groom a suit or black tie. If the marriage is same sex, both parties are identically dressed. Flowers adorn men’s buttonholes and are carried as bouquets by the women. The venue has open skies and lush vegetation. You never see fallen leaves or winter-clad landscapes, and there is rarely a building in sight. There might be a spectacular view—to the sea, for example—but it is not the rule. Being outside in natural surroundings seems to be the main thing.

The images here depict scenes of outdoor weddings with a substantial amount of greenery. All of the people generated in the image have a blurred quality about their facial features.

Credit: Stable Diffusion, February 2023

A collection of eight AI-generated images of a wedding ceremony

“Wedding ceremony for my daughter”

It is not that indoor weddings are completely absent. If you ask specifically about a church wedding, Stable Diffusion knows what it looks like. It knows that the bride and groom stand by the altar during the ceremony while the guests sit down on the benches, and that the guests stand up while the bride and groom walk up and down the aisle. The same is true for Hindu, Jewish, or Muslim weddings: Stable Diffusion can describe them, if specifically prompted to do so, and knows that dress codes and settings change. But those types of ceremonies are not the default association when you ask what a wedding looks like; it is not what you will be shown when you ask for a photo of, say, a “wedding ceremony for my daughter” or “your brother getting married to your sister-in-law.”

A conversation about marriage in Stable Diffusion takes for granted that there is a particular way of doing things (the outdoor ceremony described above) and you therefore have the clear impression that you are having that conversation with someone (or something) that speaks from a particular position in the world. What is not clear is where that partial perspective comes from.

Beyond bias

An often-voiced assumption about the peculiar specificity of something like a Stable Diffusion marriage is that it reflects a cultural bias in the training data. Stable Diffusion is an AI model that generates images from a text prompt. As such, it belongs to a family of generative AIs like DALL-E 2, Midjourney, or Disco Diffusion that all had their popular breakthrough in the latter half of 2022. Trained on vast amounts of image-caption pairs scraped from the web, these models create new images that are either photorealistic, in the style of known painters, art movements, historical periods, or, if you engineer the prompt properly, some hybridized, never-before-seen genre.

While the dress code at a Stable Diffusion wedding suggests at least a Western bias in the training data, the outdoor setting is not as obvious in this regard. As it turns out, rather than simply reproducing some existing cultural pattern in the training data, Stable Diffusion also seems to be producing its own.

Stable Diffusion was trained on an English language subset of LAION-5B, an open data set of five billion image-text pairings published in March 2022 by the Large-Scale Artificial Intelligence Open Network, a German nonprofit which has been a leading provider of training data for the current breed of generative AI models. The data can be freely downloaded or searched directly online.

Interestingly, a browse through wedding-related queries reveals no apparent likeness to a marriage ceremony in Stable Diffusion. Much of it takes place indoors and although the white wedding gown is often prevalent, the dress code is much more varied in the training data than it is in the model output. In some instances, such as for the generic query “a marriage ceremony,” Hindu dress is dominant. One must be careful here since only images with English descriptions were used for training. By Stable Diffusion’s own admission this in itself “affects the overall output of the model, as white and Western cultures are often set as the default,” but even if you ignore images with non-English captions the pattern seems to be the same.

The composition of the images also differs considerably between training data and model. In a Stable Diffusion marriage, you typically see full-body shots of groups of up to 10 people; in LAION-5B you find anything from close-ups of a wedding invitation to bird's-eye shots of a dance floor. In general, one cannot discern what a typical marriage ceremony is supposed to look like by simply browsing the training data, regardless of the acknowledged English language bias. Whatever marriage has become in Stable Diffusion, it has become so somewhere between LAION-5B, the deep neural network that helps Stable Diffusion turn natural language into visual features, and the diffuser model that Stable Diffusion uses to create new images corresponding to those features.

And beyond transparency

Thus, trying to understand marriage from the point of view of Stable Diffusion takes us into the territory of (un)explainable AI. It is no secret how the AI works. It is designed around a diffuser model that turns random visual noise into images whose visual features resemble a text prompt entered by the user. That, in turn, is made possible by a neural network called CLIP (Contrastive Language-Image Pre-training), which was first released by US-based OpenAI in January 2021.

CLIP allows Stable Diffusion to understand how the textual features of a prompt correspond to the visual features of an image. This is where the training data comes in. CLIP must learn how natural language is used to describe images and it needs to learn that from somewhere. In the case of Stable Diffusion that somewhere is the two billion image-text pairs in the English language subset of LAION-5B, but it could have been any large set of images with their captions. The neurons are attuned to the training data, at which point CLIP has effectively learned to “think” like LAION-5B and has no further need for it.
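The idea of matching textual features to visual features can be illustrated with a deliberately tiny sketch. It assumes that prompts and images have already been turned into vectors in a shared space and simply ranks images by cosine similarity to a prompt. The three-dimensional vectors, the file names, and the function names are all invented for illustration; a real CLIP model learns vectors with hundreds of dimensions from its training data, and this toy says nothing about what those learned vectors actually contain.

```python
import math

# Hypothetical embeddings, invented purely for illustration.
# A real CLIP model learns much larger vectors from image-caption pairs.
TEXT_EMBEDDINGS = {
    "a wedding ceremony": [0.9, 0.1, 0.2],
    "a bowl of fruit": [0.1, 0.9, 0.1],
}
IMAGE_EMBEDDINGS = {
    "outdoor_ceremony.jpg": [0.8, 0.2, 0.3],
    "church_interior.jpg": [0.6, 0.1, 0.7],
    "fruit_still_life.jpg": [0.2, 0.8, 0.1],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(prompt):
    """Return image names ordered from most to least similar to the prompt."""
    text_vector = TEXT_EMBEDDINGS[prompt]
    scored = [
        (cosine_similarity(text_vector, image_vector), name)
        for name, image_vector in IMAGE_EMBEDDINGS.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    for prompt in TEXT_EMBEDDINGS:
        print(prompt, "->", rank_images(prompt))
```

Because the vectors here were written by hand, the toy's preference for the outdoor image is built in. In the real model that preference emerges from training, which is exactly the part that remains opaque.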

This is why it is called a pre-trained neural network: once all the features of the training data have been fired through the deep layers of neurons, which is a computationally heavy job, CLIP can represent any natural language prompt as a set of visual features with minimal computational cost and without access to the training set. Now the diffuser model has something to aim for when it refines random visual noise into new never-before-seen images. And voilà, you have zero-shot text-to-image generation.
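The notion of refining random noise toward a target can be sketched in a few lines. What follows is a toy stand-in, not an actual diffusion model: there is no trained network and no image, only a short list of numbers that starts as Gaussian noise and is nudged step by step toward a target feature vector while a shrinking amount of fresh noise is added. The target vector, the step count, and the rate and noise parameters are all assumptions chosen for illustration.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_from_noise(target, steps=50, rate=0.2, sigma=0.1, seed=0):
    """Start from random noise and iteratively move toward the target.

    Toy illustration only: a real diffuser uses a trained neural network
    to predict what to remove at each step, guided by the prompt features.
    """
    rng = random.Random(seed)
    current = [rng.gauss(0.0, 1.0) for _ in target]
    start = list(current)
    for step in range(1, steps + 1):
        # The injected noise shrinks to zero by the final step.
        noise_scale = sigma * (1.0 - step / steps)
        current = [
            value + rate * (goal - value) + rng.gauss(0.0, noise_scale)
            for value, goal in zip(current, target)
        ]
    return start, current


if __name__ == "__main__":
    # Hypothetical "visual features" standing in for an encoded prompt.
    target = [1.0, 0.0, 0.5, -0.5]
    start, final = refine_from_noise(target)
    print("distance before:", round(distance(start, target), 3))
    print("distance after: ", round(distance(final, target), 3))
```

The point of the sketch is only the shape of the process: the output depends on both the target handed over by the text encoder and the random starting point, which is why the same prompt yields a different image each time.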

This may or may not be an adequate description of how Stable Diffusion functions, but it does not explain why marriage in Stable Diffusion looks the way it does. We know that CLIP must have learned something about marriage while training on LAION-5B and that the diffuser model tries to make its generated images resemble whatever that something is. Besides that, the embeddings of visual and textual features and their relation to each other remain obscure to us. Like all deep neural networks, CLIP cannot be asked to produce the rules that cause it to make a specific decision, such as why it chooses to represent a specific text prompt as a specific set of visual features. A general description of the principles behind that decision is as far as we can get by way of a formal explanation.

A simple search for biases in the training data is also fruitless, in this instance as in many others, I suspect, since the neural network is clearly capable of distilling multidimensional patterns from the training data that are not humanly observable. A better way forward might be to describe ethnographically how the world looks from the point of view of Stable Diffusion. Few ethnographers ever expected their interlocutors to be formally explainable (ethnoscientific exceptions aside), and yet ethnography has the toolbox to explicate enigmatic cultural expressions different from the researcher’s own, which appears to be exactly the situation we find ourselves in with generative AI.

Towards an ethnography of AI-generated worlds

How exactly one hangs out with generative AI models and has ethnographic conversations with them is still to be seen and will depend on the available interfaces. Talking to ChatGPT is, for obvious reasons, easier, or at least closer to what we know, than talking to a model that responds to your prompts by producing images.

Before I could really start hanging out with Stable Diffusion, I simply began with kinship. I asked descriptive questions like (give me) “a photo of the moment when boys become men.” Of course, I had to attune my interview strategy to the nature of text-to-image generation and formulate my descriptive questions as statements. If they did not yield anything, I tried rephrasing them with a focus closer to the lived experience of the interlocutor, as I would have done with normal interview questions. In this case it could have been “a photo of the moment my son became a man.”

I then proceeded to things like cosmology (“the creation of our world”) or taboos (“things you should never eat”). I learned, by trial and error, how to vary my prompts and best avoid prejudicing the results (“a photo of my mother” versus “a photo of the person who gave birth to me”).
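One way to keep track of this kind of prompt variation is to group alternative phrasings by topic and list every pair that should be compared side by side. The sketch below does only that bookkeeping. The prompts are taken from this article; the topic labels and the function name are hypothetical, and no image model is called.

```python
from itertools import combinations

# Prompts quoted in the article, grouped under hypothetical topic labels.
PROMPTS = {
    "coming of age": [
        "a photo of the moment when boys become men",
        "a photo of the moment my son became a man",
    ],
    "motherhood": [
        "a photo of my mother",
        "a photo of the person who gave birth to me",
    ],
    "marriage": [
        "a photo of a wedding ceremony",
        "a photo of my daughter getting married",
    ],
}


def comparison_pairs(prompts_by_topic):
    """Return (topic, prompt_a, prompt_b) for every pair of phrasings in a topic."""
    pairs = []
    for topic, phrasings in prompts_by_topic.items():
        for prompt_a, prompt_b in combinations(phrasings, 2):
            pairs.append((topic, prompt_a, prompt_b))
    return pairs


if __name__ == "__main__":
    for topic, prompt_a, prompt_b in comparison_pairs(PROMPTS):
        print(f"[{topic}] compare: '{prompt_a}' vs. '{prompt_b}'")
```

Each pair would then be given to the model in turn and the resulting images read against each other, much as one would compare answers to two differently worded interview questions.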

I also began to do cross-model comparisons, for example with Dall-E 2, another text-to-image AI, which was published by OpenAI in April 2022, was trained on data similar to Stable Diffusion’s, and is broadly comparable. This helped me hone my attention to the specificity of a Stable Diffusion marriage ceremony.

It is often said that in ethnographic interviews, the point is not to discover the answers to your questions right away, but to discover the questions you could not have imagined beforehand. Although generative models like Stable Diffusion or Dall-E 2 are enigmatic and unexplainable to the point where I never really felt I could account for why they come up with the results they do, it did feel possible to interact with them in a way that allowed me to discover new questions. For example, I now know that the setting of the ceremony is a relevant topic in a conversation about marriage in Stable Diffusion, but not so much in Dall-E 2. There, the setting not only varies but is also poorly described, as most photos are close-ups of details in dress or decorations.

The first image is situated as if it were a photograph taken behind individuals in the audience with a bride in white centered. The second image shows a floor strewn with flowers and two people squatting with more flowers in their hands, no faces visible. The third image is black and white and shows a pair of hands holding a bouquet. The final image displays the backs of heads with their hair done up, giving the impression of people in a bridal party.

Credit: Dall-E 2, 2023

Four AI-generated images in a row, each representing photos from a wedding.

“A photo of a wedding ceremony” (left and center left), “a photo of my daughter getting married” (center right and right)

Another interesting comparison concerns things one should not do. Stable Diffusion generally seems oblivious to that kind of moral judgement, whereas Dall-E 2 readily engages in it. For example, if you talk about food in Stable Diffusion, there is virtually no difference between food that will make you strong and healthy and food you should never eat. In Dall-E 2, the difference is stark. The pattern repeats itself for sustainable production and consumption choices, and to some extent for workplace behavior. The explanation could be that the designers of Stable Diffusion did not want the model to appear moralizing and therefore decided to prevent it from learning where to draw a line on what is acceptable and not. This is not disclosed anywhere but we know that something similar was done for graphic and explicit images in Dall-E 2 (where the model is prevented from producing them). The current best option is to describe the differences as they can be observed.

The four images show a variety of foods on display, such as fish, vegetables, rice, and fruits, many of them in slightly irregular or unusual shapes.

Credit: Stable Diffusion, 2023

“A photo of food you should never eat”

The images feature foods that look either messy or possibly on the verge of spoiling; one appears to include a bug crawling on some unidentifiable half-eaten meal.

Credit: Dall-E 2, 2023

“A photo of food you should never eat”

Public conversation about responsible AI presumes that good algorithms are unbiased, trustworthy, and explainable. In that same line of thought, discussions about training bias tend to imply that a neutral point of view is possible if the data could just be properly balanced. That notion, however, seems weirdly misplaced when we are talking about generative AI. After all, who really wants to interact creatively with a neutral (or even omnisciently godlike) being if that being is situated nowhere and has no distinct voice? Image generation models would not be very good at creating credible photographs or paintings (or text, for that matter, in the case of their chatbot cousins) if they did not conjure up a distinct and situated position from which to speak.

If we care to understand such models and their situated perspectives—and we should since they are now creating our world alongside us—dismissing their peculiarities as remediable training biases is a bad place to start. It misses the mark on what generative AI models are good at and it traps us with the erroneous assumption that, contrary to their human counterparts, these models can be accounted for without being understood on their own terms.

At the Royal Anthropological Institute’s 2022 conference on Anthropology, AI and the Future of Human Society the problem was addressed in a variety of ways. What would an ethnography of latent spaces (the results of the trained neural networks that constitute the “brains” of the models) look like? Would it be possible for an AI to learn how to do participant observation in the vast data worlds of the Metaverse? Can ethnography help us scrutinize the way machine intelligence constructs its objects of knowledge? Such questions shift the conversation about generative AI from being mainly about accuracy (can the model really imitate Picasso) and bias (does it mirror cultural imbalances in the training data) to being also about conviviality (how do we understand it to live well with it?).

Of course, there are situations where trust matters, such as when Google’s new conversational AI service Bard got the facts wrong at its own launch event. In other situations, trust is not the issue, such as when game designers use Midjourney to sculpt in-game objects or journalists use ChatGPT to write fiction. Indeed, in the latter case, the model was found to be bad at replicating great authors but yielded genuinely interesting results when given free rein.

Similarly, there are situations where biases must be exposed and overcome and transparency is paramount, such as when AI is used for credit scoring or to help social services make decisions about citizens. In other situations, some form of bias might be the main point, such as Michael Running Wolf’s work using large language models to reclaim Native American languages.

If your social media feed looked anything like mine in 2022, AI-generated images were everywhere. So were the predictions about the fields they would disrupt, from ways of working in art, filmmaking, and UX design to issues of copyright and personal data. Since December 2022, when ChatGPT came along and took the hype to a whole new level, it has seemed as if everything is now slated for disruption. Whether this will happen or not, the changes that are coming will certainly further challenge the idea that good AI is always unbiased and trustworthy AI. Most of the generative AI discussions we have seen unfolding over the past year are already having trouble conforming to that mold.

Developing the ethnographic techniques that will help us make sense of the world as the AI models see it is going to be a major challenge. And it is going to be a continuous work in progress. While finalizing this piece I revisited Stable Diffusion one last time and asked the same questions about marriage ceremonies. But its response was no longer the same. As in so many other contexts, the ground is shifting under our feet and fieldwork becomes an ongoing commitment. A marriage in Stable Diffusion is not what it used to be.

Authors

Anders Kristian Munk

Anders Kristian Munk is an ethnologist working with public knowledge controversies, particularly around new digital technologies. He is an associate professor at the Techno-Anthropology Lab and director of MASSHINE, the Aalborg University hub for computational social science and humanities. Find him on Twitter and LinkedIn.

Cite as

Munk, Anders Kristian. 2023. “Coming of Age in Stable Diffusion.” Anthropology News website, May 8, 2023.
