Social Interaction
Video-Based Studies of Human Sociality
Situations of Group-Robot Interaction:
The Collaborative Practice of "Robot Speak"
Salla Jarske, Sanna Raudaskoski, Kirsikka Kaipainen & Kaisa Väänänen
Tampere University
Abstract
Social robots are designed to mimic embodied human interaction capabilities and are envisioned as social interaction partners for individuals or groups of people. Interacting with such robots requires human sensemaking and active practical effort. In this microanalytic study, we examine video-recorded multi-party interactions with the social robot Pepper to illustrate some of the interactional and collaborative practices that humans engage in to achieve interactions with the robot. We examine situations where a group of university students engages in embodied practices while trying to get responses from Pepper. We coin two concepts to describe these encounters: "robot speak" refers to embodied, spoken utterances directed at the robot with the aim of prompting a response. It is a specific, situated way of speaking shaped by assumptions about the robot's interactional competence. "Framing talk" describes the participants' collaborative commentary used to make sense of the situation, co-constructing the human-robot interaction as a meaningful social event. Additionally, in reference to our previous work, we illustrate a practical-ethical dimension of social robots by examining how even a minor gaze shift by a robot can be immediately recognized as contextually significant for a participant, even when such movements are not explicitly designed as invitations to interact.
Keywords: human-robot interaction, multi-party interaction, gaze, socially interactive robot
1. Introduction
Many so-called "social" robots are designed to mimic embodied human interaction capabilities and are envisioned as social interaction partners in various contexts, such as elderly care and education. Interaction with such robots requires human sensemaking and active practical effort, as outlined by several studies that investigate human-robot interaction (HRI) from a conversation analytic (CA) perspective (Majlesi et al., 2023; Stommel et al., 2022; Tuncer et al., 2023). Much of the research on HRI has focused on dyadic interactions between one human and one robot. However, robots are also envisioned for uses that place them in public settings, such as shopping malls or hospital lobbies. In public spaces, people may engage with robots in groups rather than individually.
This microanalytic study explores multi-party interactions with the social robot Pepper (Figure 1). The data consist of two video recordings of university students exploring the Pepper robot's capabilities. The robot in these situations is in "idling" mode, meaning that it is responsive to some human behaviors but does not attempt to advance the interaction based on a specific script (compared to, e.g., Gillet et al., 2021; Krummheuer & Rehm, 2024; Stommel et al., 2022). In the situation, Pepper is surrounded by people and occasionally turns its head toward different participants. This study offers insights into the nature of group-robot interactions (Sebo et al., 2020) by examining the interactional phenomena that arise with socially interactive robots and proposing more specific terminology to aid in their conceptualization. The analysis focuses on the collaborative, embodied practices through which humans make sense of Pepper's interactional capabilities and achieve "interaction" with it. Broadly, this paper contributes to the understanding of HRI as a situated practice of human effort and is located within the growing research efforts to apply Ethnomethodology and Conversation Analysis (EMCA) to AI and robotics contexts.
Previous ethnomethodological research has established that robots are machines that people can treat interchangeably as both "agents" and "things" (Alač, 2016), and that robots' agency and sociality are contextually achieved (Pelikan et al., 2022; Rudaz & Licoppe, 2024). In this work, we illustrate how a contextually emerging sociality of the robot characterizes human-robot interaction (HRI) as a practice, and approach HRI as a social, collaborative activity. By adopting this perspective, the study contributes to both the field of HRI and EMCA. The aim is not merely to name aspects of interaction, but to render visible the broader social and interactional phenomena at play. In doing so, we move beyond the conception of HRI as a series of back-and-forth exchanges, and toward a more nuanced understanding of what actually happens when robots are embedded in human groups.
Figure 1. The robot featured in this study is the Pepper robot developed by Aldebaran (formerly SoftBank Robotics).
2. Background
Human interaction is an inherently situated phenomenon that relies on interactional competence to interpret others' actions within the immediate context (Heritage, 1984). In face-to-face interaction, people interact with their bodies and use the various resources available to them (Goodwin, 2000). From engineering and design perspectives, robots designed to interact in social ways, specifically "socially interactive robots" (Fong et al., 2003) like Pepper (Figure 1), may face challenges in navigating the highly context-dependent and embodied nature of human interactions. The movements and behaviors of robots can become temporarily part of ongoing scenes of interaction when humans actively assign meaning to them (Pelikan et al., 2020), and people work to maintain the robot's status as a competent participant (Rudaz & Licoppe, 2024).
This study adopts an EMCA perspective to examine the methods people use to achieve mutual intelligibility in situated actions (Heritage, 1984; Schegloff, 2007). Originally developed in the early 1970s by Harvey Sacks, Emanuel Schegloff, and Gail Jefferson (1974), conversation analysis (CA) examines how participants respond to each other in their conversational turns, i.e., "talk-in-interaction." Grounded in ethnomethodology, this approach emphasizes the immediate production of conversational action. While EMCA has primarily centered on verbal communication, the broader "embodied turn" in conversation analysis (e.g., Nevile, 2015) has led to an expansion of CA to include bodily expressions, resulting in the development of multimodal CA (e.g., Mondada, 2016). This multimodal approach has become the dominant perspective in analyzing video-recorded interactions, as it integrates various modes of communication beyond speech alone. In our analysis, the robot's gaze and changes in its gaze direction are identified as key interpretive (multimodal) resources for participants in interaction with the robot. Experimental eye-tracking studies in HRI reveal that participants interpret a robot's gaze in much the same way as human gaze, and that this interpretation depends less on anthropomorphic features as such than on movement patterns that suggest "seeing" (Staudte & Crocker, 2011). The robot's gaze behavior has also been found to affect the persuasiveness of its utterances (Fischer et al., 2020).
2.1 Prior EMCA research on human-robot interaction
People design their interactional turns to be recognizable by specific recipients, a phenomenon known as recipient design. According to Sacks et al. (1974), recipient design encompasses a multitude of aspects in which a speaker's communication is constructed with an evident awareness of, and sensitivity to, the specific others who are participating in the conversation. This means that the speaker takes into account the listener's knowledge, background, situated context, and so on, shaping their words accordingly to facilitate mutual understanding. The way in which human-robot interaction (HRI) is practically accomplished involves robot-specific recipient design (Pelikan & Broth, 2016; Tuncer et al., 2023). In studying how people use gestures with social robots, Tuncer et al. (2023) showed that people design and re-design their gestures to make them more recognizable to the robot. In doing "conversation" with a robot, people have also been found to use shorter expressions (Pelikan & Broth, 2016). Robot-directed speech also involves distinctive phonological and prosodic features that differentiate it from ordinary talk (Fischer, 2016). The practices involved in robot-recipient design exhibit people's assessment of the robot's interactional competence (Tuncer et al., 2023).
Much like in human conversation, people also engage in "conversational repair" with robots when a specific turn appears to have failed in communicating something. By analyzing the conversational trouble emerging in health survey interactions with the Pepper robot, Stommel et al. (2022) found that when a robot does not correctly register what a person has said, people engage in repair by reintroducing their turn in a different form.
Humans bring normative expectations from ordinary conversational practices to interactions with robots, for example in organizing turn-taking (Majlesi et al., 2023). With robots, people adjust their behaviors and turn design to meet the situation (Pelikan & Broth, 2016; Stommel et al., 2022; Tuncer et al., 2023) and, for example, find ways to manage various normative violations of turn-taking (Majlesi et al., 2023). Interaction experiments with the socially interactive robots Furhat (ibid.) and Nao (Pelikan & Broth, 2016) show that participants struggled in particular with early turn transitions, where actions are recognized before being completed, but also adjusted their turn design to this context as they learned what is and is not possible.
While interactional phenomena such as turn-taking and repair can be studied in experimental settings through dyadic interactions between one human participant and one robot, multi-party interactions with robots can reveal interesting social phenomena, particularly concerning the practices that people engage in together with a robot. For example, Pitsch (2020) studied encounters between a child, an adult, and the Nao robot in a museum exhibition, showing that adults play an important role in helping children co-participate in the production of actions in HRI, even instructing the children what to say to the robot. Rudaz and Licoppe (2024), who also investigated HRI in a museum setting, show the collaborative work of bystanders in framing the robot's actions as social. Furthermore, Krummheuer (2015b) has pointed to various participation roles: those who "take the stage" and bystanders who form an audience, observing and commenting on the situation, sometimes becoming helpers and co-users.
3. Data and Methodology
Video data was gathered in 2022 by recruiting university students from a robotics course to take part in a video recording session for extra credit. These participants, who had taken the robotics course during the Covid pandemic, had not physically interacted with these robots before and were given an opportunity to test the robots and interact with them. As part of the data gathering process, participants signed consent forms, and privacy statements about the data handling were made available.
In the room where Pepper was located, there were two cameras and an additional voice recorder. One researcher was stationed in the room for the entire duration as technical support, and another researcher was present part of the time. The session lasted about an hour, with a total of eight students visiting Pepper: first a group of five, then a group of three. The participants spoke English. Participants were free to explore the robot in any way they wished, including trying out a city survey quiz app with Pepper.
The footage from the two cameras and the audio recording were combined, and the full video was divided into six parts to ease transcription; thus, the extracts below refer to both the video clips and their corresponding timestamps.
The research employs an ethnomethodological conversation analysis approach (Sidnell & Stivers, 2013). In this study of human-robot interaction, the focus is shifted from participants' perceptions of the interactions to the interactions themselves, as they naturally occur and are enacted. Interactions are studied through the sensemaking processes of the participants, emphasizing how individuals in those situations accomplish actions. The analysis incorporates multimodal analysis (Mondada, 2019), which includes markings for the multiple relevant modes of interaction in the transcripts. When presenting data extracts in this article, we also combine Jeffersonian speech transcription conventions (Jefferson, 2004) with still images (as drawings) extracted from the video data.
We initially compiled interactional sequences from the data that were in some way related to shifts in gaze by either the participants or the robot, resulting in a collection of 10 instances. At this stage, we did not yet define which of these would be subjected to detailed analysis. As we engaged more closely with the data, our interest began to shift from gaze to the various types of social actions that appeared to be occurring in these excerpts. We then moved toward a more systematic examination of these excerpts. In the following section, we present our analysis.
4. Analysis
The following four examples present group interactions with the Pepper robot. In the first example (4.1), the group treats the situation as if it were an "experiment" to test Pepper's capabilities, and the example illustrates the role of the robot's gaze in speaker selection and how participants adjust their actions to align with Pepper's view. In the second example (4.2), Pepper does not adhere to human interactional norms, yet the group acts as if it does, adapting embodied efforts to maintain a coherent conversation. In the third example (4.3), we identify another collaborative aspect of HRI, as a participant encourages and instructs another person in what to say to the robot. The fourth example (4.4) identifies a moment in which Pepper's head movement, and thus its gaze direction, results in an immediate reinterpretation of earlier events, instantly altering the participant's interpretation of the ongoing situation.
Throughout these examples we coin descriptive terminology to refer to the phenomena that are observed. First, we use "framing talk" for the talk that participants engage in to co-construct an ongoing scene of HRI. By "framing talk," we refer to talk-in-interaction where individuals produce actions that are recognizable within a normative order and that contribute to the emergence and realization of the current HRI situation as it is understood by everyone involved. Second, the spoken utterances directed specifically at the Pepper robot to get it to do something are what we refer to as "robot speak" (not to be confused with speech that sounds robotic). We use the term "speak" here as a noun, referring to a specific way of speaking used by a particular group in a particular context. The notion of "robot speak" is based on our prior theoretical analysis, which establishes a distinction between "talk-in-interaction" and the practice of "using speech" when interacting with a robot (Jarske et al., 2020). There, "using speech" refers to situations lacking a reciprocal attitude of social interaction, making it more about performing actions (i.e., speak) than about genuine social interaction (i.e., talk). Robot speak can alternatively be understood as giving spoken commands to the robot, but the concept aims to cover more broadly the embodied practice of enacting spoken utterances to the robot, with robot-recipient design and the continuous interpretation of the robot's interactional competence (e.g., Tuncer et al., 2023).
4.1 The collective "experiment"
To provide context for the situations examined in the following extracts, the participants had initially interacted with Pepper using a specific application designed to quiz them about famous places in the local city. During this quiz, participants would select their responses by touching the tablet interface after Pepper posed a question. Once the quiz was completed, Pepper experienced a technical malfunction and had to be reset. After the reset, a standard "waking up" animation was triggered, with Pepper appearing to observe its environment. At this point, the participants began exploring the robot's capabilities while Pepper was in idle mode, leading to the interaction transcribed in the first extract.
Extract 1. Clip 3, 01:05-02:02
Prior to the beginning of this extract, the participants were discussing the robot's voice and waiting for Pepper to complete a reboot. During the reboot, Pepper's head was down, facing the floor (Image 1.1). Pepper produces a sound, "ou," before lifting its head, and Zoe reacts to the sound with quiet laughter. Pepper's head rises and simultaneously turns to the left, toward Amy (Image 1.2), and Amy reacts with a "wow." Amy adjusts her position so that she is directly in front of Pepper and begins a greeting sequence, "hi," waving her hand directly in front of the robot's face (Image 1.3). Pepper's "hi," along with a small, jerky movement of its arms, follows immediately. Amy responds with the vocalization "oh" and begins to laugh with Zoe.
Pepper produces another "hi," to which Amy responds with "hi how are you," readjusting her standing position and putting her hands behind her back (Image 1.4). This is followed by a 2-second silence while waiting for a response, during which Amy leans slightly to the left in front of Pepper's "view" (Image 1.5). Zoe, who is standing behind Amy to the left (not visible in the images), joins in with a "hello," which is followed by a "hello" from Pepper, who also nods its head. This is met with quiet, brief laughs from Mia and Zoe.
Then, Amy greets Pepper with the Finnish word "moi" ("hi"), which, after 1.5 seconds, is followed by Pepper's head starting to turn away from Amy in the direction of the door (Image 1.6). The participants treat the head movement as confirmation that the robot does not know the language. The researcher, who has engaged in the situation as both a bystander and an expert (Krummheuer, 2015b), confirms this with "it's in English mode." Amy now steps back to her position prior to the greetings (Image 1.7).
As the situation unfolds, the initial practice of greeting leads to multiple attempts to discover what Pepper is actually capable of, turning the interaction into a collaborative, collective "experiment" by the group to test Pepper's interactional competence. During this brief episode, in which the Pepper robot is engaged in multiple greeting sequences with a group, the participants engage collaboratively in conversational repair: when Pepper does not respond to Amy's second greeting attempt, Zoe repairs Amy's turn by producing another, simpler attempt ("hello") and receives a response. The extract also indicates that the direction and movements of Pepper's head appear as meaningful interpretive resources for the group. Pepper's head turning directly toward Amy at the beginning was interpreted as availability for interaction as well as speaker selection. What is additionally relevant to this HRI scene is how Amy actively attempts to position herself in Pepper's direct view (Image 1.3, Image 1.5). Her waving gesture is also lowered to the level of Pepper's view (Image 1.3), indicating a specific recipient design (Tuncer et al., 2023). Behind and to the right of Amy, Mia also moves to align herself with Pepper's potential view (Image 1.6). Pepper's head turning away from Amy, who had attempted to greet it in Finnish, is treated as a response to the use of a non-English word by multiple participants, including the researcher, who confirms that the robot is in English-only mode. The head turning, in this moment, resulted in the participants talking to each other instead of trying to re-engage or repair: on lines 26-31, the participants stopped talking to the robot and began to talk about the robot's capabilities with each other as an "intermission" before re-engaging it again (see Extract 2). As the researcher comments on the language mode, this provides conclusions for the group about what they have now learned: what works and what does not.
Note that Pepper's head movements will play a significant role in the following extracts of group HRI as well, and we return to this phenomenon in the Discussion.
4.2 "Robot speak" and "framing talk"
Extract 2 continues shortly after Extract 1. Up to this point, the participants have come to understand that while the robot does not initiate conversation in idle mode, it can respond when spoken to. They continue the collective "experiment". Here, we focus on the notions of "robot speak" and "framing talk," defined earlier.
Extract 2. Clip 3, 01:33-02:02
During the entire extract, Pepper's head is directed toward Zoe and Tom. Zoe asks for Pepper's name, and Pepper produces a fitting second pair part, "My name is Pepper." To this, Zoe vocalizes empathetically, "aaawh." Then, Amy asks where Pepper is from. When no response is provided to this question, Mia quietly chuckles, "it's like, 'I don't know'". Mia's utterance here accounts for the missing response by explaining to the group that Pepper does not know where it is from. On line 12, Zoe asks Pepper "what do you like," which is followed by "sorry, I can't for the moment" from Pepper. The group jointly laughs at the incongruent response (similar to Due, 2019). Mia, gazing at Zoe, who had asked the question, laughingly accounts for the robot's response: "it doesn't like anything at the moment," with Zoe also making brief mutual eye contact with Mia (Image 2.2). Zoe then utters in the direction of Pepper, "it's okay. I can't either," followed by bursts of laughter from the group. Then, Tom self-selects and reintroduces Amy's earlier attempt with "where are you from."
Participants' turns that project an expectation of a conditionally relevant response from a robot are henceforth referred to in our analysis as "robot speak." Examples of robot speak in this extract are attempts at getting a response from Pepper by trying different questions (highlighted lines 1, 3, 8, 12, and 22). Robot speak is characterized by a unique, robot-specific recipient design based on an assessment of the robot's competence (e.g., Pelikan & Broth, 2016; Tuncer et al., 2023), and it also practically renders the robot's subsequent actions accountable. This extract illustrates two methods for dealing with a missing response together in the context of "robot speak." One is introducing a different question by a different participant: after Tom's attempt "how are you," only 1.6 seconds later Zoe took a turn with "what's your name." The other is introducing the same question by a different participant (lines 8 and 22).
Additionally, this extract illustrates participants' accounting for a missing response on behalf of the robot. After Amy's attempt "where are you from," a significant pause of 4.6 seconds occurred, and Mia accounted for the lack of response on behalf of the robot, as if the robot did not know where it is from (line 10). This does not indicate what Mia truly believes (cf. Pelikan et al., 2022) but served as a method of providing (with laughs) a sense to the missing response. Although the question was posed by Amy, Mia was gazing at Zoe for the duration of her turn (Image 2.2), highlighting that what is happening concerns everyone; it highlights the co-constructed scene (cf. Alač et al., 2020). Moreover, Mia does not attempt "robot speak." Rather than being designed for the robot, Mia's turn is conducted in the background of the HRI activity.
Prior research has shown that side-sequences (Jefferson, 1972), where a main speaker establishes a momentary focus with bystanders, excluding the robot as a recipient (Krummheuer, 2015a), are a common practice in multi-party HRI in public spaces (Rudaz & Licoppe, 2024). Explaining or accounting for the robot's behaviors or internal states is also something previous work has recognized that people do with robots (e.g., Parviainen et al., 2019). As Mia is not "on the stage" as the main speaker with Pepper but rather standing on the side, providing commentary on the unfolding HRI scene, we refer to her participation in this moment as "framing talk." With "framing talk," which can also incorporate side-sequences, we emphasize its role in maintaining a coherent, shared sense of the situational appearances (Garfinkel, 1963) and of "what is happening" (cf. Goffman, 1986) for all participants in the scene.
The notion of "framing talk" incorporates more than side-sequences, which we shall illustrate with Zoe's response to the robot ("I can't either"). Pepper's response "sorry, I can't for the moment" is contextually incongruent (Due, 2019), but the participants treat it as if it corresponds with the question, i.e., Mia's "it doesn't like anything at the moment." In this context, Zoe's response to Pepper is not a command uttered for the robot to recognize ("robot speak"), but rather a normative closing sequence as well as a confirmation of (and alignment with) Mia's interpretation, "it does not like anything at the moment." We consider Zoe's "response" to the robot an embodied production within the framework of "framing talk," although it appears to be an instance of "robot speak." The reason Zoe's utterance is not, in our definition, a clear instance of "robot speak" is that it does not project an expectation of a corresponding response from the robot but is rather an action placed within a sequence of events that manages Pepper's incongruent response with others. However, rather than actively excluding the robot as a participant (as in Mia's case), Zoe playfully enacts Pepper's participation by uttering a response to it, and importantly, this prompts laughter from the group. Thus, in this moment, 'responding to the robot', although it looks like interacting with a robot, is actually a meaningful, co-constructed event-in-an-order, designed to allow the situation to move on.
4.3 Instructing others to do "robot speak"
This episode features different participants than Extracts 1 and 2 and includes eight instances of "robot speak," highlighted in the transcript (see Extract 2 for our definition of "robot speak"). We examine how the phenomenon that Garfinkel (2002) refers to as "instructed actions" is constructed, specifically when instructing others to do "robot speak."
In this extract, Ben, who had previously interacted with the robot for a few moments alongside the previous group (who are no longer present), is trying to get responses from Pepper with Lee and Ash. While interacting with Pepper alongside the previous group (as the sixth, new participant), Ben had asked Pepper "are you alive," to which Pepper had responded with "not the same way as you are." Ben here reintroduces this question with the new participants present, presumably knowing that this command could lead to a response from Pepper.
Extract 3. Clip 6, 02:45-03:20
At first, Ben tries to elicit a response from Pepper with "are you alive," but the robot remains silent. He also tries "hi," "what is your name," and "who are you," each receiving no response from Pepper (lines 1-7). When asking Pepper "who are you," he leans slightly forward (Image 3.2). Then, Pepper's head turns left toward Ash (Image 3.4), who immediately takes the turn with "how old are you," receiving, again, no response. While having "eye contact" with Pepper, Ash greets them with "hi" and a hand wave. When Pepper does not respond, Ben quietly comments, "they're thinking." Pepper provides a response, "hi," which immediately prompts Ash to reintroduce the earlier question ("how old are you"), leaning slightly forward (Image 3.6), but she receives no response. A pause of 3 seconds occurs, after which Ben gazes at Ash and quietly instructs her, "ask if they're alive," also pointing to Pepper with his head (Image 3.8). Ash briefly looks at Ben and then leans forward, uttering the instructed question (Image 3.9). She remains in this position for a moment, waiting for a response, before returning to stand upright. Pepper responds with "not the same way as you are," leading to laughter from the group (Image 3.10). Lee engages in "framing talk" to say that what happened was cool.
In this instance, Pepper's gaze direction again signals speaker selection, demonstrating how participants interpret the robot's gaze as an invitation or cue regarding who should speak to the robot (lines 9-10). Once Pepper's gaze shifted from Ben to Ash, Ash came to be treated as the main speaker in the scene. This is confirmed as Ben instructs Ash in what to say to the robot rather than taking the turn himself, treating Ash as "having the stage" (Krummheuer, 2015b). According to Garfinkel (2002), instructed actions are achieved together through shared engagement with the task and involve not only verbal instruction but also an embodied manner of interacting with the task. Ben's quietly delivered instruction to Ash ("ask if they're alive"), achieved through a joint focus (cf. Goodwin, 2006) involving head pointing and a brief mutual gaze with Ash, is a side-sequence that instructs Ash to reattempt Ben's earlier question. Upon the delivery of Ash's "are you alive" question, Ben's gaze shifts to Pepper at the same time as Ash's, but turns to Ash and Lee when they laugh at Pepper's response. Ben's role was to 'facilitate' Ash's participation (Pitsch, 2020) but also to instruct her based on what he had previously learned was possible, and to show what he had initially been trying to achieve with Ash and Lee present (line 1). It is interesting to note that the participants in this episode frequently lean forward when speaking to the robot, a behavior that was not as prevalent in the previous group. This suggests that the way people engage with the robot can evolve as participants, together, adapt and develop a local culture of doing robot speak.
4.4 "Oh god sorry"
In this final extract, we investigate a specific event caused by a small shift in Pepper's gaze direction, which, in the situation, has meaningful consequences for how the situation appears and is subsequently managed. The extract illustrates the constant attentiveness of a person navigating a social setting and anticipating the meaning of the situation (Rawls, 2002, pp. 35-36). It demonstrates how interpretive frameworks shift seamlessly through performed actions. Socially interactive robots present people with familiar appearances from human interaction that become recognizable possibilities for action, also involving an obligatory, accountable nature due to the normative expectations of, for example, greetings (Jarske et al., 2020).
Mia has begun exploring the potential location of Pepper's head camera, and she leans close to Pepper's face to see whether a camera is also located in the mouth, in addition to the forehead (Image 4.1).
Extract 4. Clip 3, 05:27-05:51
Mia utters the question "does it have a camera on the mouth" (line 2) while standing upright from the position in which she was exploring the camera (Image 4.2). Pepper's gaze shifts from Amy to Mia at this exact moment. Mia immediately takes a step back (Image 4.3) and apologizes to the robot with "oh god sorry" (Image 4.4). There is no pause between the question and the apology; one immediately follows the other.
During subsequent overlapping speech, the researcher (R1) confirms the camera location (lines 5 and 8), and Eva takes interest, asking "oh, is the camera on the mouth." Mia also confirms the location of the camera by pointing to Pepper's head (Image 4.6). Pepper's gaze is fixed on Mia and remains there for the remainder of this extract. Now the recipient of Pepper's stare, Mia positions herself close to Eva (Image 4.7) and produces a chuckle. She then turns her gaze to Eva, saying "when I found that out now it's looking at me," and laughs (Image 4.8). Zoe attempts to catch Pepper's attention by calling its name twice and asking it "can you tell the time." However, the robot does not respond to Zoe's question, and its gaze stays fixated on Mia, who accounts for the lack of response with "we're having eye contact right now it can't," and laughs loudly. Zoe confirms this, commenting that "she's only focused on you," and Amy instructs Mia to say something to the robot as the gaze is directed at her (giving instructions, similar to Extract 3).
When Mia is exploring the camera on the robot's face, she is not orienting to the robot as a social interaction partner but treating it as a machine. A parallel exists in human interaction: surgeons or doctors, for example, can treat humans as "bodies," orienting to the person as an object to be studied, examined, and so on (Guo et al., 2020). In this particular moment, Pepper's gaze shift toward Mia results in an immediate reframing of the situation. The apology indicates that Mia interprets the robot's head movement as motivated by, and relevant to, her earlier activity, retroactively treating that activity as "invasive" toward the robot. The discussion about the cameras continues despite the ongoing stare from Pepper. However, as the stare continues, Mia engages in "framing talk" (as she has done in earlier examples), interpreting the robot's gaze as indicating that the robot has now become aware of what she was doing. The shift in interpretation enacted in the apology is not just about treating the robot as an agent (or the robot becoming an agent), but also about rendering her own prior action morally accountable. Here, a small movement by the robot retroactively framed the person's previous behavior as morally accountable (see also Rudaz & Licoppe, 2024). Moreover, the shift of Pepper's gaze was a witnessable, embodied resource that turned Mia, previously a bystander, into the one "on stage" (cf. Krummheuer, 2015b).
Human interaction is underpinned by what has been described as an "interaction engine" (Levinson, 2006). Both embodiment and normative practices develop early in human life, guiding how we orient ourselves in the world. The sudden "oh god sorry" is not the result of deliberate cognitive processing but rather a fundamental and spontaneous outcome of our bodily engagement in the world. It is the result of "the work that the body of practices do," as Garfinkel (2002, p. 210) phrases it. This emphasizes the immediate, embodied, and contextually responsive character of social action. Jarske (2025) has referred to this as "fundamental sociality," in contrast to "feature sociality," where a robot is designed with appearances and functions that are thought to invite people to interact with it.
5. Discussion and Conclusions
In this study, we have demonstrated two phenomena of human-robot interaction: "robot speak" and "framing talk." Extracts 1-4 illustrate some collaborative practices of group HRI. We have observed that doing interaction with a socially interactive robot in a group setting is characterized by the coordination of "robot speak": the embodied practice of enacting spoken utterances to the robot, involving robot-recipient design and continuous interpretation of the robot's interactional competence. Robot speak is primarily an embodied action that renders the robot's subsequent actions accountable. By attempting to get a response from the robot, the participants treat the robot as a recipient: they design their own actions to get something to happen from the robot. They are not simply interacting with a robot but actively, through embodied efforts, making the robot a recipient. We emphasize that when the participants speak to a robot, they enact themselves as "persons who speak with robots"; they do "robot speak," the embodied deployment of human interaction practices with the machine. This practice is not to be confused with the actions people perform with other humans in ordinary talk: the background expectations for robot speak involve normative elements of how one ought to engage in it. This article has not focused on the various details of recipient design (explored thoroughly in Fischer, 2016; Tuncer et al., 2023), but rather on understanding the nature of the event. Moments of "robot speak" are embedded within a broader normative order of events and human interaction, methodically shaped through various embodied practices of recipient design. In our data, speaking to a robot became the act of issuing commands, akin to pulling the lever of a slot machine, with the expectation of a response.
These speech commands often resemble small performances or cues intended to trigger the machine's actions, leading to events that reshape the situation as it unfolds. In this data, through "robot speak," people seek to collaboratively "interact" with the robot, learning as the situation evolves and even instructing others.
We also identified "framing talk" as the talk that participants engage in, together, to co-construct a sense of the situation. Framing talk occurred alongside and in the background of an ongoing HRI scene as a running commentary, constructing a sense of "what is happening" (cf. Goffman, 1986) and maintaining the robot as a participant (see also Rudaz & Licoppe, 2024). Moreover, the activities proceeded along a shared goal of "experimenting" with what is interactionally possible with Pepper, making the actual HRI a unique practice lodged within a normative social order, with the aim of establishing and maintaining a recognizable environment. In all extracts, participants found the situations more or less amusing, engaging in joint chuckles and laughter (Glenn, 2003), even in instances when Pepper provided a correct and timely response to a question (see also Due, 2019; Pelikan et al., 2020). Laughter had a particularly strong role in the unfolding of events, with multiple participants engaging, concertedly, in chuckles as responses to the robot's behaviors. Future work could explore in more detail how laughter is used to make sense of the situation and how it structures group HRI as a practice.
In addition to "robot speak" and "framing talk," we have demonstrated how even a minor movement of a robot can become immediately recognized as a contextually significant event for a participant. Prior research has established that the sociality of the robot is a phenomenon that emerges within situated contexts through collaborative human effort (Alač et al., 2011; Rudaz & Licoppe, 2024). Robots can be treated as technical machines (Alač, 2016), and they are not full-time participants in interaction (Krummheuer, 2015a; Pelikan et al., 2022). In other words, their participation in an ongoing scene occurs through human effort and within specific sequences of actions. In Extract 4, we illustrated how a simple gaze shift of a robot became a social event that prompted an immediate social response ("oh god sorry"). The excerpt illustrates the constant attentiveness of a person and demonstrates how interpretive frameworks shift seamlessly in the moment, with reactions occurring instantaneously. Prior to the apology, the person observably treats the robot as an object to be explored; upon "reading" the 'mere appearances' (Garfinkel, 1963; Jarske et al., 2020) of the robot's head movement as a socially meaningful, recognizable event, she apologizes to Pepper. Jarske (2025) refers to this kind of phenomenon with the concept of fundamental sociality. The ability of robots to create situations that carry moral obligations for participants is a challenge for social robotics (cf. Jarske et al., 2020), particularly in everyday practices where the encounters between humans and social robots cannot be fully anticipated in the design process. Even though Large Language Models may in the future enhance the "conversational skills" of social robots, phenomena related to robots' physical embodiment, such as the changes in gaze direction in our data, ultimately play a strong role in how events unfold and are interpreted (cf. Pelikan et al., 2022).
References
Alač, M. (2016). Social robots: Things or agents? AI & SOCIETY, 31(4), 519-535. https://doi.org/10.1007/s00146-015-0631-6
Alač, M., Gluzman, Y., Aflatoun, T., Bari, A., Jing, B., & Mozqueda, G. (2020). Talking to a Toaster: How Everyday Interactions with Digital Voice Assistants Resist a Return to the Individual. Aesthetic Intersections, 9(1), 3-53.
Alač, M., Movellan, J., & Tanaka, F. (2011). When a robot is social: Spatial arrangements and multimodal semiotic engagement in the practice of social robotics. Social Studies of Science, 41(6), 893-926. https://doi.org/10.1177/0306312711420565
Due, B. L. (2019, September 10). Laughing at the robot: Incongruent robot actions as laughables. https://doi.org/10.18420/muc2019-ws-640
Fischer, K. (2016). Designing Speech for a Recipient: The roles of partner modeling, alignment and feedback in so-called "simplified registers" (1st ed., Vol. 270). John Benjamins Publishing Company.
Fischer, K., Langedijk, R. M., Nissen, L. D., Ramirez, E. R., & Palinko, O. (2020). Gaze-Speech Coordination Influences the Persuasiveness of Human-Robot Dialog in the Wild. In A. R. Wagner, D. Feil-Seifer, K. S. Haring, S. Rossi, T. Williams, H. He, & S. Sam Ge (Eds.), Social Robotics, 12483, (pp. 157-169). Springer International Publishing. https://doi.org/10.1007/978-3-030-62056-1_14
Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3), 143-166. https://doi.org/10.1016/S0921-8890(02)00372-X
Garfinkel, H. (1963). A Conception of, and Experiments with, "Trust" as a Condition of Stable Concerted actions. In O. J. Harvey (Ed.), Motivation and Social Interaction: Cognitive approaches (pp. 187-238). Ronald Press.
Garfinkel, H. (2002). Ethnomethodology's Program: Working Out Durkheim's Aphorism. Edited and Introduced by Anne W. Rawls. Rowman & Littlefield.
Gillet, S., Cumbal, R., Pereira, A., Lopes, J., Engwall, O., & Leite, I. (2021). Robot Gaze Can Mediate Participation Imbalance in Groups with Different Skill Levels. Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 303-311. https://doi.org/10.1145/3434073.3444670
Glenn, P. (2003). Laughter in Interaction. Cambridge University Press.
Goffman, E. (1986). Frame analysis: An essay on the organization of experience. Northeastern University Press.
Goodwin, C. (2000). Action and embodiment within situated human interaction. Journal of Pragmatics, 32(10), 1489-1522. https://doi.org/10.1016/S0378-2166(99)00096-X
Goodwin, M. H. (2006). Participation, affect, and trajectory in family directive/response sequences. Text & Talk, 26(4-5), 515-543. https://doi.org/10.1515/TEXT.2006.021
Guo, E., Katila, J., & Streeck, J. (2020). Touch and the Fluctuation of Agency and Motor Control in Pediatric Dentistry. Social Interaction. Video-Based Studies of Human Sociality, 3(1). https://doi.org/10.7146/si.v3i1.120249
Heritage, J. (1984). Garfinkel and ethnomethodology. Polity Press.
Jarske, S. (2025). Strange Machines: Robot sociality as a challenge for human-centred design. Tampere University. https://trepo.tuni.fi/handle/10024/163543
Jarske, S., Raudaskoski, S., & Kaipainen, K. (2020). The "Social" of the Socially Interactive Robot: Rethinking Human-Robot Interaction Through Ethnomethodology. Culturally Sustainable Social Robotics, 335, 194-203. https://doi.org/10.3233/FAIA200915
Jefferson, G. (1972). Side sequences. In D. N. Sudnow (Ed.), Studies in social interaction (pp. 294-338). Free Press.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation Analysis: Studies from the first generation (pp. 13-31). John Benjamins Publishing Company. https://doi.org/10.1075/pbns.125.02jef
Krummheuer, A. (2015a). Technical Agency in Practice: The enactment of artefacts as conversation partners, actants and opponents. PsychNology Journal, 13(2), 179-201.
Krummheuer, A. L. (2015b). Users, Bystanders and Agents: Participation Roles in Human-Agent Interaction. In J. Abascal, S. Barbosa, M. Fetter, T. Gross, P. Palanque, & M. Winckler (Eds.), Human-Computer Interaction - INTERACT 2015 (pp. 240-247). Springer International Publishing. https://doi.org/10.1007/978-3-319-22723-8_19
Krummheuer, A., & Rehm, M. (2024). Dealing with Moral Assistance in Robot-Supported Decision Making for Sustainable Consumption. 2024 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO), 188-193. https://doi.org/10.1109/ARSO60199.2024.10557753
Levinson, S. (2006). On the Human "Interaction Engine". In N. J. Enfield & S. Levinson (Eds.), Roots of Human Sociality: Culture, cognition and interaction. Berg.
Majlesi, A. R., Cumbal, R., Engwall, O., Gillet, S., Kunitz, S., Lymer, G., Norrby, C., & Tuncer, S. (2023). Managing Turn-Taking in Human-Robot Interactions: The Case of Projections and Overlaps, and the Anticipation of Turn Design by Human Participants. Social Interaction. Video-Based Studies of Human Sociality, 6(1), Article 1. https://doi.org/10.7146/si.v6i1.137380
Mondada, L. (2016). Challenges of multimodality: Language and the body in social interaction. Journal of Sociolinguistics, 20(3), 336-366. https://doi.org/10.1111/josl.1_12177
Mondada, L. (2019). Contemporary issues in conversation analysis: Embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47-62. https://doi.org/10.1016/j.pragma.2019.01.016
Nevile, M. (2015). The Embodied Turn in Research on Language and Social Interaction. Research on Language and Social Interaction, 48(2), 121-151. https://doi.org/10.1080/08351813.2015.1025499
Parviainen, J., van Aerschot, L., Särkikoski, T., Pekkarinen, S., Melkas, H., & Hennala, L. (2019). Motions with Emotions?: A Phenomenological Approach to Understanding the Simulated Aliveness of a Robot Body. Techné, 23(3), 318-341. https://doi.org/10.5840/techne20191126106
Pelikan, H., & Broth, M. (2016). Why That Nao? How Humans Adapt to a Conventional Humanoid Robot in Taking Turns-at-Talk. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 4921-4932. https://doi.org/10.1145/2858036.2858478
Pelikan, H., Broth, M., & Keevallik, L. (2020). "Are You Sad, Cozmo?": How Humans Make Sense of a Home Robot's Emotion Displays. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (pp. 461-470). Association for Computing Machinery. https://doi.org/10.1145/3319502.3374814
Pelikan, H., Broth, M., & Keevallik, L. (2022). When a Robot Comes to Life: The Interactional Achievement of Agency as a Transient Phenomenon. Social Interaction. Video-Based Studies of Human Sociality, 5(3), Article 3. https://doi.org/10.7146/si.v5i3.129915
Pitsch, K. (2020). Answering a robot's questions: Participation dynamics of adult-child-groups in encounters with a museum guide robot. Réseaux, 220-221(2), 113-150.
Rawls, A. W. (2002). Editor's Introduction. In A. W. Rawls (Ed.), Harold Garfinkel. Ethnomethodology's Program: Working Out Durkheim's Aphorism (pp. 1-64). Rowman & Littlefield.
Rudaz, D., & Licoppe, C. (2024). ‘Playing the robot's advocate': Bystanders' descriptions of a robot's conduct in public settings. Discourse & Communication, 18(6), 869-881. https://doi.org/10.1177/17504813241271481
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language, 50(4), 696-735. https://doi.org/10.2307/412243
Schegloff, E. A. (2007). Sequence organization in interaction: A primer in conversation analysis. Cambridge University Press.
Sebo, S., Stoll, B., Scassellati, B., & Jung, M. F. (2020). Robots in Groups and Teams: A Literature Review. Proc. ACM Hum.-Comput. Interact., 4(CSCW2), 176:1-176:36. https://doi.org/10.1145/3415247
Sidnell, J., & Stivers, T. (2013). The handbook of conversation analysis (1st edition). Wiley-Blackwell.
Staudte, M., & Crocker, M. W. (2011). Investigating joint attention mechanisms through spoken human-robot interaction. Cognition, 120(2), 268-291. https://doi.org/10.1016/j.cognition.2011.05.005
Stommel, W., de Rijk, L., & Boumans, R. (2022). "Pepper, what do you mean?" Miscommunication and repair in robot-led survey interaction. 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 385-392. https://doi.org/10.1109/RO-MAN53752.2022.9900528
Tuncer, S., Licoppe, C., Luff, P., & Heath, C. (2023). Recipient design in human-robot interaction: The emergent assessment of a robot's competence. AI & SOCIETY, 39. https://doi.org/10.1007/s00146-022-01608-7