
Language work defines the globalized new economy, and with it come strategies for managing how we do language. Emerging voice technologies in contexts like call centers represent the latest attempt to create the perfect language worker.

Who Am I Speaking With?

Grace is, by all accounts, the perfect call center operator. She picks up calls without hesitation and, not one to interrupt, listens patiently to the customer’s every need. When it’s time to speak, Grace responds with both reassurances and practical solutions. All the while, she reminds the customer of her attention through subtle cues and personalized comments. By the time the call ends, her success is judged not only on her ability to deliver customer service but also on customer satisfaction. Each call can last no more than five minutes, meaning Grace handles hundreds of calls per day. Call centers have a long history of regulating, surveilling, and assessing operators’ conversations in pursuit of the same levels of performance as Grace. In these communication factories, management has the task of reformulating how we do language, eliminating human error while preserving what makes it “human.” For most operators, such skill takes hours of training on top of years of practice, and even then, no one is immune to human error. Except Grace, a conversational AI purpose-built for customer service.

Voice Technology and Language Workers: A Natural Evolution

The language worker, whose job is defined by talking with others, is the emblematic figure of the globalized new economy. Language in its own right has become an important resource. A globalized economy means that the flow of goods and services is no longer bound by the nation-state, a process that requires ever more communicative work to maintain. At the same time, the human capacity for conversation has become the central activity for workers in rising industries like customer service, retail, tourism and hospitality, and call centers. Language now rivals traditional physical and cognitive skills in importance on the labor market.

On the one hand, this can be read as a source of economic empowerment. People can now convert their human capacity to talk with others into capital. On the other hand, though, the introduction of language onto the economic market leaves it subject to new, managerial conceptions of how we do conversation. Where language was previously thought of as a source of pride and identity and as a deeply improvisational practice, language on the market is reconceptualized as a source of profit. Consequently, conversation, like other economic resources, has become vulnerable to operational logics that reduce it to a set of commodifiable “skills” to be bought and sold on the market. As the business of conversation has become about efficiency, a second transformation is taking place: the attempt to replicate conversation using AI voice technologies.

Voice technologies represent the latest epoch of AI development in public discourse and include programs designed to (a) recognize and interpret a user’s speech and (b) respond with speech of their own. In other words, the current generation of voice technologies is designed to be conversational. The designation “voice technology” represents a progression of the operational logics of language that came before. Like conceptualizing conversation as a commodity or skill, thinking of language as a “technology” prioritizes its instrumental value. Such a perspective, in turn, assumes that human conversation can be broken down and artificially (re)constructed.

Voice technologies (especially those that talk back as well as listen) present a particularly attractive option to industries that employ language workers. If the goal of a language worker is to reproduce the same, manufactured conversations all day, then deploying a programmable and infallible AI seems preferable. The mass replacement of human language workers with voice technologies has thus far been dismissed by anthropologists: voice technology, it has been argued, is too expensive, too primitive, and would surely face social resistance. The advent of programs like Grace, however, provides an updated lens, suggesting that affordability and sophistication are no longer obstacles for employers. We are left, then, with the question of social resistance. How “human” are voice technologies, really? Can they seamlessly replace the work of human language workers, or is there something social about conversation that AI can’t quite replicate?

Credit: Generated by ChatGPT’s DALL-E image generation model
This image was generated by DALL-E based on the prompt to “create a photorealistic header image for an article about call centers staffed entirely by robots.” It shows several rows of white humanoid robots sitting in front of blue computer monitors, wearing headphones and microphones.

Becoming Human

Advancements in voice technology have been motivated by both technological and economic factors. Technologically, developments in artificial intelligence and machine learning have enabled the widespread use of voice technologies that listen (e.g., OpenAI’s Whisper), speak (e.g., sophisticated text-to-speech, voice modification, and AI voice replication tools like Eleven Labs), and write (e.g., ChatGPT, automatic translation services) for us. Economically, voice technologies are often sold as affordable, “efficient” alternatives to human labor. This is especially true in contexts of language work, where voice technologies make language tasks easier, faster, and cheaper for those involved. Call center AIs like Grace, for example, can manage the work of multiple operators at once, without error, in a fraction of the time.

Increasingly, however, developers are directing their attention to making voice technologies that not only do the work of human language workers but also sound human. Grace, made by developers at Gridspace, is one such example. Grace is a conversational AI, intended to replicate the free-flowing, spontaneous conversation we associate with real human beings. The goal, according to Gridspace, is to create voice technology that replicates the “tone and skill of a top-performing contact center agent,” making for “natural [i.e., naturally human] sounding calls.” In pursuit of these goals, Grace is equipped with a repertoire of linguistic features designed to index human conversation.

For one, she is emotionally responsive, tracking the customer’s experience and emotional state and adjusting her own tone of voice to match. Further, management can manually adjust her tone of voice to appear more or less empathetic, talkative, or formal. Most apparent, though, is that Grace shows linguistic signs of improvisation. Being able to speak “off the cuff” and adapt to new communicative contexts is a defining feature of everyday human conversation. Anthropologists of conversation have long attempted to identify the linguistic features that set improvised and scripted speech apart. Improvised speech is, for example, more likely to be receptive to audience participation and changes in genre. Famously, Erving Goffman identified multiple features that indicate someone is engaged in improvised speech, or “fresh talk” in his own words, including backchanneling (e.g., “mhm,” “uh-huh”), parentheticals (e.g., going on tangents), and filler words (e.g., “um,” “uh”).

Becoming a Language Worker: Human-ish

The style of conversation expected of language workers differs significantly from that of everyday conversation. In fact, it is fair to say that the conversations language workers engage in have much in common with the programming of voice technologies: both are subject to operational logics of language. Whereas voice technologies are literally coded—designed and pieced together from individual components of speech to create a speaking whole—language workers’ speech is coded as a collection of scripted commodities and/or linguistic “skills.” The result is a standardized style of conversation that is far less free-flowing and spontaneous than everyday conversation.

Call center workers, for example, are subject to what Deborah Cameron calls “top-down talk”: highly standardized scripts created by management, specifically designed to regulate worker-customer interactions. Other language workers, like those in retail or hospitality, are less likely to follow tight scripts but are still subject to processes of styling. Thus, while it may seem that talking to an actual human is somehow more organic or unscripted, human language workers may be required to perform the characteristics of improvisational, everyday conversation so that they still appear personable, friendly, and so on. Like Grace, language workers’ speech is designed and regulated by management in pursuit of customer service. (You likely recognize this if you’ve ever had a particularly “robotic” sounding customer service call.) This strategy of mass-produced personable service is what Norman Fairclough aptly calls “synthetic personalization.”

The challenge is that when language workers are asked to converse in ways that are less human, they experience dehumanization in the workplace. The intensity with which conversations are regulated—through scripting and styling, but also the looming presence of management—means language workers rarely get the opportunity to go “off-script.” In some cases, doing so can be grounds for dismissal. In extreme cases, this regulation manifests in communication-based worker abuse: constraints on what language workers can and cannot say mean bearing the brunt of customer dissatisfaction, expectations of servility (“the customer is always right”), and physical, emotional, and sexual abuse, with little to no room for resistance.

Given the negative effects of regulating language workers’ speech, there may be space for voice technologies to improve worker wellbeing. Human language workers are, in many cases, already treated like vocal robots, and attempting to meet the expectations of management and customers can cause genuine harm. Deploying voice technologies to take over these practices could relieve workers of these stresses, whilst still adhering to—maximizing, even—the operational logics of upper management. Implementing voice technology in situations where language workers are asked to adhere to strict scripting regardless, such as incoming calls in delivery-dispatch centers, could alleviate the pressures of time and communication and perhaps even improve productivity in other tasks (e.g., outgoing calls).

This places voice technologies like Grace in a curious position. On the one hand, Grace and her kin are an attempt to replicate the free-flowing, spontaneous style of everyday human conversation using operational logics of language—a contradiction that leads to limited success. On the other hand, human-sounding voice technologies are being applied to contexts of language work, where conversation is markedly dehumanized to begin with. A vocal robot, trying to sound human, in a context where humans are expected to talk like vocal robots. The social implications of this, contrary to the positive reactions of test users, include a sense that the performances of voice technologies like Grace are inauthentic, leading to a degradation of trust between user and technology.

Performance, Authenticity, and Trust

To understand the social implications of voice technologies performing human-ness, consider the ways humans attempt to perform particular social roles. As socially and historically situated individuals, we associate certain behaviors—ways of talking, doing, being—with different social roles and practices. 

Successful presentation of the self requires that a person’s performance (1) be “good” enough according to the expectations of others and (2) appear authentic. Authenticity does not necessarily have to derive from genuine self-investment, or what Goffman calls a “sincere” performance—but it helps. The stylized speech of language workers, for example, makes their performances far from sincere.

When a performance is done well, and in contexts where synthetic personalization is expected, there is no issue. When the insincerity—even cynicism—of one’s performance slips through the gaps, however, the speaker loses authenticity in their role (as a language worker, customer-service person, etc.) and trust is broken. This loss can be exacerbated in contexts where personalization is considered inappropriate: what might be considered a friendly, personable voice among language workers in one culture, for example, might lead to confusion (or even irritation) in another. Insincerity may also lead customers to feel as if they are being “duped” when engaging in these kinds of conversations, or that there are ulterior, malicious purposes at play.

These same concepts can be translated to voice technologies. Trust is often given as a justification for, rather than a barrier to, developing more human-sounding voice technologies. Gridspace’s tagline for Grace is “Hire a voicebot you can trust,” later associated with her “natural[ly human]” sounding speech. What is missing, however, is a consideration of how authenticity and performance affect trust between voice technologies and users.

Voice developer and writer Tobias Dengel takes a more holistic view, noting that performance, authenticity, and trust in voice technologies still make uneasy bedfellows. Despite technological leaps, “human-sounding” voice technologies still widely undermine trust among users, both in the short and long term. In the short term, Dengel found that human-like voice technologies evoke feelings of an “uncanny valley” in users—almost but not quite human. This, in turn, has affected long-term trust in voice technologies to do what they say they are able to do. If voice technologies aren’t as human as developers claim they sound, then what else are they claiming to do that they can’t?

From a social perspective, then, not-quite-human technologies undermine their own ability to engender trust by producing inauthentic performances of humanness. In the short term, “uncanny valley” performances do not meet the social expectations we have of human conversation, and, in the long term, this can lead voice technologies to lose any sense of authenticity in their roles as language workers. The solution, it appears, is not to develop voice technologies that sound ever more human; developers may find more social success in performances that are authentically operations-oriented, complementing the work of human language workers. In other words, rather than shying away from what they are, voice technologies might find more success as authentic vocal robots.

Authors

Tom Parkerson

Tom Parkerson is a Ph.D. student in social anthropology in the School of Global Development at the University of East Anglia (Norwich, UK). His current research explores the relationship between language, work, and precarity.

Cite as

Parkerson, Tom. 2024. “Vocal Robots.” Anthropology News website, June 18, 2024.