
Vision-sharing system for android avatars to enable remote eye contact

Abstract

Maintaining eye contact is a fundamental aspect of non-verbal communication, yet current remote communication tools, such as video calls, struggle to replicate the natural eye contact experienced in face-to-face interactions. This study presents a cutting-edge vision-sharing system for android avatars that enables remote eye contact by synchronizing the eye movements of the operator and the avatar. Our innovative system features an eyeball integrated with a wide-angle lens camera, meticulously designed to mimic human eyes in both appearance and functionality. This technology is seamlessly integrated into android avatars, enabling operators to perceive the avatar’s surroundings as if physically present. It provides a stable and immersive visual experience through a head-mounted display synchronized with the avatar’s gaze direction. Through rigorous experimental evaluations, we demonstrated the system’s ability to faithfully replicate the operator’s perspective through the avatar, making eye contact more intuitive and effortless compared with conventional methods. Subjective assessments further validated the system’s capacity to reduce operational workload and enhance user experience. These compelling findings underscore the potential of our vision-sharing system to elevate the realism and efficacy of remote communication using android avatars.

Introduction

The increasing demand for remote communication tools has raised expectations for technologies that enable more natural face-to-face interactions. Video calls have become the predominant method of communication with individuals who are geographically distant. However, the challenge lies in creating a sense of physical presence during these calls, often leading to feelings of fatigue. One significant factor contributing to this is the inability to establish natural eye contact through video calls [1].

Eye contact is a significant aspect of nonverbal communication in human interactions. It occurs approximately 61% of the time during casual conversations between two people, with about half of these instances being mutual [2]. Humans rely on eye contact for various purposes, such as seeking information, signaling openness, concealment and exhibitionism, and establishing and recognizing social relationships, which are essential for effective communication [3]. Eye contact also possesses both approach and avoidance forces. When we sense that a certain social factor, such as physical proximity, is not suitable for the relationship we have with someone, we instinctively adjust our eye contact to maintain the appropriate social distance [3]. However, current remote communication tools do not allow for the natural exchange of eye contact that occurs during in-person interactions.

Consequently, researchers have turned to robotic avatars as a potential solution. Tanaka et al. [4] have reported that physical embodiment of interaction partners influences the perceived presence of the partner. This was demonstrated through a comparison of dialogues via robot avatars and various existing communication media. The utilization of robot avatars is anticipated to facilitate remote communication, creating a sense of being physically present with the other individual. Furthermore, Sacino et al. [5] suggested that a high level of anthropomorphism in appearance is effective in improving the presence of a robot. Considerable research has been conducted on androids, which are robots designed to have human-like appearances [6,7,8,9,10]. These robots exhibit human-like behavior and, in recent years, have even been equipped with the ability to perform eye movements.

Based on these advancements, we hypothesized that utilizing androids as avatars can enhance remote communication by enabling eye contact across long distances. To achieve this, avatars and teleoperation systems should be developed to replicate human-like appearance and eye movements while providing operators with a face-to-face view of the remote environment. Advancements in teleoperation technology have focused on enhancing the operator’s sense of presence by incorporating haptic feedback and highly immersive visual experiences in VR headsets [11]. Furthermore, androids that mimic human behavior more closely are being developed, such that the person interacting with the avatar can feel the presence of the other person. Some of these robots can move their faces with multiple degrees of freedom, enabling them to not only replicate the movements of the operator’s head and arms but also convey various facial expressions.

However, in remote communication, instances in which both the operator and individual interacting with the robot avatar effectively communicate are limited. This is because the appearance of the avatar often deviates from that of a human when equipped with cameras and sensors to relay information about the avatar’s environment to the operator, which enhances functionality. Moreover, research and development on androids that aim to mimic humans often focus on achieving autonomous or pre-programmed behavior, rather than prioritizing operability when utilized as avatars.

Therefore, we propose a teleoperation system that focuses on visual interaction, enabling eye contact through an android avatar while providing a natural user experience, both in terms of operation and natural appearance. The main contributions of this study are as follows.

  • Proposition of a vision-sharing system concept that enables eye contact through an android avatar.

  • Development of an eyeball integrated with a wide-angle lens camera for android avatars, designed to closely resemble human eyeballs in appearance.

  • Implementation of the proposed system and the developed eyeball on an android avatar.

  • Evaluation of the system’s accuracy through experiments in which participants operate the android avatar and gaze at specified points.

  • Assessment of the impact of the proposed system on the ease of establishing eye contact through experiments involving pairs of individuals engaging in mutual eye contact through an android avatar.

We have presented our vision-sharing system concept and the development of the eyeball integrated with a wide-angle lens camera for android avatars [12]. Additionally, a preliminary report on the implementation of this system on an android avatar has been documented [13]. However, evaluating the proposed system’s ability to facilitate remote eye contact necessitates two specific investigations.

  • To determine whether the operator can perceive their surroundings from the avatar’s perspective.

  • To assess whether the individual interacting with the avatar experiences a sense of eye contact when the operator gazes at them through the avatar.

This study presents the findings of two experiments conducted to evaluate the accuracy and effect of the system on eye contact between two individuals through an avatar. These experiments delve into the aforementioned aspects and provide an in-depth discussion on the efficacy of the proposed system.

Concept of the vision-sharing system

We propose a vision-sharing system that enables synchronization of eye movements and sharing of the same visual perspective between operators and avatars. An overview is shown in Fig. 1. Notably, this system requires avatars to be equipped with cameras in both their left and right eyeballs, enabling them to mimic human-like eye movements in all directions. To achieve immersive avatar operation, the system was designed for use with a head-mounted display (HMD). During operation, the avatar’s eye movements were synchronized with those of the operator based on the operator’s gaze as detected by the HMD. Simultaneously, images captured by the avatar’s eye cameras were presented to the operator. Images from the left-eye camera of the avatar were displayed on the left side of the HMD, whereas those from the right-eye camera were displayed on the right. Each image was projected onto a hemispherical virtual screen centered on the operator’s viewpoint and displayed on the HMD. This provided a stereoscopic view of the avatar’s surroundings.

Fig. 1 Concept of the vision-sharing system for remote eye contact

Our previous study [14] partially validated the effects of eye movement synchronization. However, when eye movements are synchronized, shifting the focal point of the image displayed on the HMD to a desired position also shifts the camera’s capture area, producing a discrepancy in which the desired visual content no longer aligns with the focal point (see Fig. 2). This discrepancy prevents eye contact with the individual interacting with the avatar. To address this, a system was developed to rotate the hemispherical screens based on the gaze direction of the avatar (see Fig. 3). Screen rotation was based on the rotation angle of the avatar’s eyeballs, with the screen rotating in the opposite direction to match the estimated rotation angle of the avatar’s gaze at the time of projection. This ensures a stable, seamless view for the operator, aligning the avatar’s gaze with that of the operator.
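To make the compensation concrete, the following numerical sketch (Python, using assumed axis conventions: x right, y up, z forward, gaze expressed as yaw then pitch) verifies that rotating the screen by the avatar's eye rotation cancels the image shift that the eye rotation would otherwise introduce. It illustrates the idea only and is not the implemented Unity code.

```python
import numpy as np

def rot_yaw_pitch(yaw_deg, pitch_deg):
    """Rotation for a gaze given as yaw about the vertical (y) axis followed by
    pitch about the lateral (x) axis; x right, y up, z forward (assumed convention)."""
    y, p = np.radians([yaw_deg, pitch_deg])
    ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    return ry @ rx

# A scene point that lies straight ahead of the avatar's head (+z direction).
point_head_frame = np.array([0.0, 0.0, 1.0])

for yaw, pitch in [(0, 0), (20, 0), (30, 10)]:
    r_eye = rot_yaw_pitch(yaw, pitch)
    # Direction of the point as seen by the rotated eye camera.
    point_in_camera = r_eye.T @ point_head_frame
    # Fixed screen: the pixel is drawn along point_in_camera, so it drifts as the eye moves.
    # Sync condition: the screen is rotated by the same eye rotation, which maps the pixel
    # back to its head-frame direction, so the operator sees it stay put.
    restored = r_eye @ point_in_camera
    assert np.allclose(restored, point_head_frame)
print("view stabilised under screen rotation")
```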

Fig. 2 Problem caused by eye movement synchronization when the virtual screens are fixed

Fig. 3 Hemispherical screen rotation

The proposed system was implemented and evaluated. To assess its effectiveness in enabling remote eye contact through an avatar, two key aspects needed to be evaluated:

  1. Accuracy of the implemented system: Determining whether the operator can perceive the avatar’s surroundings as if they were physically present at the avatar’s location, as intended by the system’s design.

  2. Impact on the subjective difficulty of establishing eye contact: Assessing whether the system facilitates easier eye contact through the avatar.

We conducted evaluation experiments, referred to as Experiments I and II, to validate the aforementioned aspects. In each evaluation experiment, we compared the following two conditions:

  • Fixed screen condition (“Fixed condition”): A condition in which the virtual hemispherical screens are fixed (conventional system).

  • Synchronized screen condition (“Sync condition”): A condition in which the virtual hemispherical screens are rotated based on the avatar’s gaze direction (proposed system).

Implementation of the vision-sharing system on an android avatar

We developed both hardware and software components necessary for implementing the proposed vision-sharing system on an android avatar.

Hardware

The implementation of the proposed system necessitates the integration of a camera within each eyeball of the avatar to provide the android avatar’s operator with a visual field equivalent to that of a human. Moreover, the eyeballs must perform human-like movements and have a human-like appearance. To satisfy these requirements, we developed an eyeball integrated with a wide-angle lens camera specifically designed for android avatars.

The structure of the developed eyeball is shown in Fig. 4, with the specifications for each component shown in Table 1. A camera module featuring a 200° wide-angle lens was employed to capture images from the android’s perspective. The design of the eyeball ensured that the wide-angle lens resembled the pupil and iris of a human eyeball. The wide-angle lens camera was connected to the control PC through a USB cable and an interface board, enabling the acquisition of captured images. It was enclosed in a resin component that mimics the appearance of the sclera of the human eye. The iris was designed with a slightly blurred edge to replicate the natural appearance of human eyes, achieving a highly human-like appearance.

Fig. 4 Structure of the designed eyeball integrated with a wide-angle lens camera

Table 1 Component specifications

Software

Our vision-sharing system was implemented on the humanoid cybernetic avatar Yui [6], which served as the android avatar (Fig. 5). The mechanism surrounding the eyeballs is shown in Fig. 6, with a photograph of the system shown in Fig. 7. Three motors enable yaw-axis rotation for each eyeball and pitch-axis rotation for both eyeballs, so that each eyeball performs independent horizontal movements and linked vertical movements. The range of motion of each eyeball, with the lens-forward direction defined as \(0^\circ\), is \(\pm {35}^\circ\) horizontally and \(14^\circ\) to \(8^\circ\) vertically. The eyeballs embedded within the android avatar are shown in Fig. 8; the developed eyeball closely resembles the human eye.

Fig. 5 Android avatar Yui

Fig. 6 Mechanical structure around the eyeballs

Fig. 7 Photograph of the mechanism around the eyeballs. The tape is attached to the eyeballs as the photograph was obtained during manufacturing

Fig. 8 Close-up views of the android avatar with eyeballs integrated with wide-angle lens cameras

In this system, an HMD (Meta Quest Pro, Meta) and Unity were utilized to enable eye tracking, allowing the simultaneous tracking of the operator’s gaze and presentation of images from the avatar’s perspective. ROS 2 was employed to construct the system and facilitate data exchange between the avatar and operator. The configuration of the communication system is shown in Fig. 9. To ensure stable communication, all evaluation experiments were conducted with the operating-interface PC and the avatar PC connected through a wired LAN.

Fig. 9 Communication system configuration

The yaw and pitch eye angles of each operator were obtained from the HMD eye tracker, and the values corresponding to the respective motors were input as target values. For the pitch direction, the average pitch angle of each eye was used as the target value for both eyeballs.
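As an illustration of this mapping, the sketch below (Python) shows one way the three motor targets could be derived from the tracked angles; the function name and argument layout are our own, and only the \(\pm 35^\circ\) clamp is taken from the stated horizontal range of motion.

```python
import numpy as np

def hmd_to_motor_targets(left_yaw, right_yaw, left_pitch, right_pitch, yaw_limit=35.0):
    """Map HMD eye-tracker angles (degrees) to the three eye motors: one yaw
    target per eye, and a single pitch target (the mean of both eyes' pitch)
    for the motor that drives both eyeballs vertically, as described in the text.
    The +/-35 degree clamp reflects the stated horizontal range of motion."""
    left_yaw_cmd = float(np.clip(left_yaw, -yaw_limit, yaw_limit))
    right_yaw_cmd = float(np.clip(right_yaw, -yaw_limit, yaw_limit))
    pitch_cmd = (left_pitch + right_pitch) / 2.0
    return left_yaw_cmd, right_yaw_cmd, pitch_cmd

print(hmd_to_motor_targets(12.0, 10.5, -3.0, -2.0))  # -> (12.0, 10.5, -2.5)
```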

During operation, the left and right displays of the HMD provided a virtual space. A camera was installed at the origin of each virtual space and the HMD displayed the image captured by the camera. The position and direction of the camera in each Unity space remained constant and the image displayed on the HMD was independent of the position and direction of the HMD. The hemispherical screens were strategically positioned such that the center of each hemisphere aligned with the origin in the virtual space, specifically with the camera position serving as the viewpoint. The eye camera captured 200° range images, with the 180° range image being projected onto a hemispherical screen. This projection enables the operator to observe the avatar’s environment on a scale equivalent to that of the real environment. Figure 10 shows the actual display of the image.
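One way such a projection can be computed is sketched below (Python), assuming an equidistant (f-theta) fisheye model and an image circle spanning the shorter image side; the paper does not specify the lens model, so these are illustrative assumptions rather than the actual Unity implementation.

```python
import numpy as np

def fisheye_pixel_to_hemisphere(u, v, width, height, fov_deg=200.0, radius=1.0):
    """Map a pixel of an equidistant ("f-theta") fisheye image to a point on a
    hemispherical screen of the given radius centred on the viewpoint.

    Assumptions (not from the paper): equidistant lens model and an image circle
    whose radius equals half the shorter image side. Pixels more than 90 degrees
    off the optical axis (outside the 180-degree hemisphere) return None."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)                      # assumed image-circle radius in pixels
    dx, dy = (u - cx) / r_max, (cy - v) / r_max
    r = np.hypot(dx, dy)
    theta = r * np.radians(fov_deg / 2.0)    # off-axis angle under the f-theta model
    if theta > np.pi / 2:
        return None
    phi = np.arctan2(dy, dx)
    # Point on the hemisphere (x right, y up, z along the optical axis).
    return radius * np.array([np.sin(theta) * np.cos(phi),
                              np.sin(theta) * np.sin(phi),
                              np.cos(theta)])

# Example: the image centre maps to the pole of the hemisphere, straight ahead.
print(fisheye_pixel_to_hemisphere(399.5, 299.5, 800, 600))  # -> [0. 0. 1.]
```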

Fig. 10 Captured images displayed on the screens

Throughout the operation, camera images and the actual rotation angles of each of the avatar’s eyeballs were recorded along with timestamps. By using linear interpolation based on these timestamps, the system determined the direction of the avatar’s gaze at the time of image capture and adjusted the screens projecting the images accordingly. Each image was displayed on a screen that rotated to align the normal vector of the cross-section through the center of the screen hemisphere with the avatar’s gaze vector (see Fig. 3). This allowed the avatar’s eyeballs to rotate in sync with the operator’s movements, creating a lifelike interaction as if the operator were physically present at the avatar’s location. Figure 11 shows the avatar’s appearance, captured by one of the authors, while the operator looked at a camera positioned slightly to the right of the avatar’s front; this was done in both the Fixed and Sync conditions. In contrast to the conventional system with fixed screens (Fig. 11a), the proposed system (Fig. 11b) produces more pronounced eye rotations and a direct gaze toward the camera.
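A minimal sketch of the timestamp-based interpolation (Python; the logged angles below are placeholders rather than recorded data):

```python
import numpy as np

# Recorded avatar eyeball angles (degrees) with their timestamps (seconds); placeholder values.
t_log = np.array([0.00, 0.10, 0.20, 0.30])
yaw_log = np.array([0.0, 5.0, 9.0, 12.0])
pitch_log = np.array([0.0, 1.0, 2.0, 2.5])

def gaze_at(t_image):
    """Linearly interpolate the logged eye angles to the capture time of an image,
    so the screen can be rotated by the gaze the avatar actually had for that frame."""
    return (float(np.interp(t_image, t_log, yaw_log)),
            float(np.interp(t_image, t_log, pitch_log)))

print(gaze_at(0.15))  # -> (7.0, 1.5)
```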

Fig. 11 Appearance of the avatar when the operator is looking at the camera in each condition

For stable control, the target update and data acquisition frequencies of the motors were set to 10 Hz during the evaluation experiments. To ensure stable communication, the acquisition frequency of the camera images was set to 30 Hz, with images captured at a resolution of 800 × 600. A median filter was applied to the input values related to the avatar’s eyes, reducing unintended oscillations of the view caused by communication delays and by inaccuracies in avatar control and HMD eye tracking. Furthermore, to reduce the possibility of VR sickness caused by frequent view changes, the weighted-average filter shown in Algorithm 1 was applied to the input values of the avatar eye and screen rotation angles, preventing view changes when the difference between the avatar eye and screen rotation angles was minimal. The avatar eye camera occasionally experienced noise owing to interface board problems; however, thorough checks confirmed that this noise did not significantly affect the results.

Algorithm 1 Algorithm of the weighted-average filter
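Algorithm 1 is not reproduced here, so the following Python sketch is only a plausible reconstruction of the described filtering: a median over recent samples to suppress spikes, followed by a weighted average toward the previous output with a dead band that freezes the view when the change is small. The window size, dead band, and weight are assumed values, not the parameters used in the system.

```python
import numpy as np
from collections import deque

class GazeInputFilter:
    """Plausible reconstruction of the input filtering described in the text;
    all parameters below are assumptions."""

    def __init__(self, window=5, deadband_deg=1.0, weight=0.3):
        self.buf = deque(maxlen=window)   # recent angle samples for the median filter
        self.deadband = deadband_deg      # ignore changes smaller than this (degrees)
        self.weight = weight              # blend factor toward the new median
        self.out = 0.0                    # previous filtered output (assumed initial value)

    def update(self, angle_deg):
        self.buf.append(angle_deg)
        med = float(np.median(self.buf))
        if abs(med - self.out) < self.deadband:
            return self.out               # small difference: keep the view unchanged
        self.out = (1 - self.weight) * self.out + self.weight * med
        return self.out

f = GazeInputFilter()
print([round(f.update(a), 2) for a in [0.2, 0.4, 5.0, 5.2, 5.1]])
```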

Experiment I: accuracy evaluation

An initial experiment was conducted to evaluate the accuracy of the system, ensuring that operators could effectively view their surroundings from the avatar’s perspective within the developed system. The experimental hypotheses are as follows:

  • The operator’s gaze direction when focusing on a target point through the avatar closely aligns with the gaze direction from the avatar’s position toward that point in the Sync condition.

  • The operator experiences a lower workload in the Sync condition owing to improved view stability.

Method

In this experiment, participants were instructed to gaze at a specific point using an avatar, and the researchers examined how closely their gaze direction matched the avatar’s viewpoint when looking at the target point. The participants, equipment, materials utilized, and the experimental procedure are examined below.

Participants

Eleven students from the Nakata Laboratory at the University of Electro-Communications participated in the study, each receiving 1000 JPY in cash vouchers. Data from 10 participants (nine males and one n/a, aged 21–24) were included in the analysis, as one participant was unable to complete the task properly owing to improper use of the HMD.

Materials

The object shown in Fig. 12 was used to display the target position for the participants to gaze at. This object, a section of a hollow sphere with an inner radius of 0.5 m, featured holes of 12 mm diameter at \(10^\circ\) intervals from \(\theta = 0^\circ\) to \(30^\circ\) and at \(30^\circ\) intervals from \(\phi =0^\circ\) to \(330^\circ\), measured from the center of its surface as shown in the figure. The object was placed in front of the avatar such that the center of the sphere aligned with the midpoint between the avatar’s two eyeballs (see Fig. 13). LEDs were placed in each hole and illuminated to indicate the target point to the participant.
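For reference, the 3D positions of the target points follow directly from the stated geometry; a short sketch (Python), with the forward direction taken as +z as an assumed convention:

```python
import numpy as np

R = 0.5  # inner radius of the spherical section [m], centred between the avatar's eyes

def target_point(theta_deg, phi_deg):
    """3D position of the LED hole at (theta, phi): theta is measured from the
    forward axis (+z here) and phi is the angle around it (x right, y up).
    The axis convention is an assumption; the paper defines the angles in its figure."""
    th, ph = np.radians([theta_deg, phi_deg])
    return R * np.array([np.sin(th) * np.cos(ph),
                         np.sin(th) * np.sin(ph),
                         np.cos(th)])

# All target points used in the task: the centre plus three rings of twelve holes.
targets = [(0, 0)] + [(t, p) for t in (10, 20, 30) for p in range(0, 360, 30)]
positions = {tp: target_point(*tp) for tp in targets}
print(len(positions), positions[(30, 90)])
```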

Fig. 12 Experimental object to display target points

Fig. 13 Experimental setup for the avatar side

The workload was evaluated using the Japanese version [15] of the NASA-TLX [16], a subjective workload assessment method. The participants rated each of the six NASA-TLX subscales on a scale of 0–100. To omit the weighting process, the Raw TLX (RTLX), the simple average of the six NASA-TLX scores, was used to estimate the overall workload [17].
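Computing the RTLX score is straightforward; a minimal sketch with illustrative ratings:

```python
def rtlx(scores):
    """Raw TLX: the unweighted mean of the six NASA-TLX subscale ratings (0-100)."""
    assert len(scores) == 6
    return sum(scores) / 6.0

# Illustrative ratings: mental, physical, temporal demand, performance, effort, frustration.
print(rtlx([55, 20, 40, 35, 60, 50]))  # -> 43.33...
```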

Procedure

The design, task details, and experimental procedures used in this study are outlined below.

Design

The experiment was conducted using a between-subjects design with the Fixed and Sync conditions. Six participants performed the task in the Fixed condition, whereas the remaining five completed the task in the Sync condition. Data from five participants in the Fixed condition (five males, aged 22–23) and five participants in the Sync condition (four males, one n/a, aged 21–24) were analyzed; one participant in the Fixed condition was excluded owing to the issue described above.

Task

We designed a task in which the participants gazed at an indicated position on an object placed in front of the android avatar. The participants wore the HMD and manipulated only the avatar’s eyes by moving their own eyes. Each participant held a button in each hand for use during the task.

During the experiment, an LED illuminated one of the holes on the object in front of the avatar (Fig. 13). Participants focused on the illuminated point for approximately 5 s and then pressed the button in their right hand. They then pressed the button in their left hand to move the illuminated point and repeated the process. The LEDs were activated starting from the center (\(\theta =0^\circ\)) and then ring by ring from \(\theta =10^\circ\) to \(\theta =30^\circ\), counterclockwise from \(\phi =0^\circ\) within each ring. The task ended when these trials had been completed for all target points. Participants were instructed to press the right-hand button again if they blinked or if noise disrupted the camera image while they were gazing at the target point, before pressing the button in their left hand.

The direction of the participant’s gaze was recorded as a measure of the accuracy of the system when the participant pressed the button while gazing at each target point.

Experimental Steps

Participants were informed that the study aimed to explore the functionality of a teleoperated robot and gather their feedback on it.

First, the participants were instructed to perform eye calibration while wearing the HMD. A tool officially provided by Meta, the distributor of the HMD, was used for eye calibration.

Before starting the task, an operational test was conducted on the android avatar. With the avatar’s eye movements synchronized, the participants were instructed to move their eyes in different directions to ensure that the avatar’s movements and camera image were functioning correctly.

Once the operation test was completed, the experimental task described above was conducted.

All images displayed on the HMD when the participant pressed the buttons were recorded to ensure that noise in the camera images did not significantly influence the experimental results.

Ethical considerations

The study protocol was approved by the ethics committee of the University of Electro-Communications (No. H23034), and all participants provided written informed consent before the experiments.

Results

All of the camera images recorded during the task were noise-free, indicating proper data acquisition.

Using the gaze direction data, the focal point on the object when viewed from the avatar’s perspective (referred to as “actual focal point” hereafter) was calculated. Furthermore, the target gaze direction was determined by the position of each target point in relation to the avatar’s position. The deviation angle was then calculated as the angle between the target and actual gaze direction vectors.
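A minimal sketch of this deviation-angle computation (Python):

```python
import numpy as np

def deviation_angle(target_dir, actual_dir):
    """Angle in degrees between the target gaze direction (toward the target point
    from the avatar) and the actual gaze direction measured for the participant."""
    t = np.asarray(target_dir, dtype=float)
    a = np.asarray(actual_dir, dtype=float)
    cos_angle = np.dot(t, a) / (np.linalg.norm(t) * np.linalg.norm(a))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

print(deviation_angle([0.0, 0.0, 1.0], [0.17, 0.0, 0.98]))  # ~9.8 degrees
```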

The participants’ actual focal points under each condition are shown in Fig. 14 as the median of the five participants’ data at each target point, together with the deviation angle from the target gaze direction. Data for all participants are shown in Figs. 23, 24, 25 and 26 in Appendix A.

Fig. 14 Median of the participants’ actual focal point data and deviation angles between target and actual focal points at each target point (\({\dag }p<0.1\), *\(p<0.05\), **\(p<0.01\))

Additionally, the deviation angles between the target and gaze directions at each target point are shown in Fig. 14 as box plots. The results of the Mann–Whitney U test (\(\alpha =0.05\), one-sided) revealed that the deviation angle was significantly smaller for the Sync condition than for the Fixed condition (\(p < 0.05\)) for all target points except \((\theta , \phi ) = (0^\circ , 0^\circ ), (10^\circ , 0^\circ ), (30^\circ , 90^\circ ), (30^\circ , 210^\circ )\). Moreover, for \((\theta , \phi ) = (30^\circ , 210^\circ )\), the Sync condition exhibited a significant trend for the deviation angle to be smaller than that of the Fixed condition (\(p<0.1\)). The medians and test results are shown in Table 3 in Appendix A.

The workload results for each condition are shown in Fig. 15. The result of the Mann–Whitney U test (\(\alpha =0.05\), one-sided) indicated no significant difference in the RTLX scores between the Fixed condition (\(Mdn=43.3\)) and Sync condition (\(Mdn=40.5\)) (\(U=10\), \(p=0.345\)).
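For reproducibility, the one-sided comparison can be run as follows (Python with SciPy; the deviation angles shown are illustrative placeholders, not the measured data reported in Appendix A):

```python
from scipy.stats import mannwhitneyu

# Illustrative deviation angles (degrees) for one target point; the measured
# per-participant values are reported in Appendix A, not here.
fixed = [14.2, 12.8, 15.1, 13.4, 14.9]
sync = [3.1, 2.4, 4.0, 2.9, 3.6]

# One-sided test of whether the Sync condition yields smaller deviation angles.
u_stat, p_value = mannwhitneyu(sync, fixed, alternative="less")
print(u_stat, p_value)
```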

Fig. 15 RTLX scores

Experiment II: subjective evaluation

A second experiment was conducted for subjective evaluation, to investigate whether the developed system effectively enhanced eye contact through an android avatar. The hypotheses of this experiment are as follows:

  • When the operator gazes at the observer through the avatar, the observer perceives a stronger mutual eye contact with the avatar in the Sync condition.

  • The disparity between the Fixed and Sync conditions intensifies as the observer’s position shifts further away from directly facing the avatar.

  • The operator’s workload is lower in the Sync condition.

Method

In this experiment, we recruited participants in pairs. During the experiment, one participant in each pair was designated as the operator, whereas the other was assigned the role of observer. Our study aimed to determine whether the observer perceived the avatar as making eye contact with them, as well as to assess the workload experienced by both the operator and the observer, when the operator directed the avatar’s gaze at the observer at a specific location. The participants, equipment, materials used, and the experimental procedure are outlined below.

Participants

Twenty-two students from the University of Electro-Communications participated in the experiment in pairs, with each participant receiving cash vouchers worth 1,000 JPY as a token of appreciation. Data from nine of the 11 pairs of participants (12 males and six females, aged 19–24) were analyzed, excluding two pairs who encountered technical difficulties or issues with the experimental equipment.

Materials

The experimental setup is shown in Fig. 16. Eight height-adjustable chairs were placed in front of the avatar as seats for the observer, as shown in Fig. 17a. Because the distance from the avatar could influence the perception of the gaze direction, the chairs were placed in two rows at distances of approximately 1.2 m and 2.1 m from the avatar (see Fig. 17b). These distances were chosen to represent the personal distance (used mainly in conversations between individuals with close relationships) and the social distance (used primarily in formal conversations, such as in business settings) defined by Hall [18]. The seating arrangement was planned so that the avatar’s eye movements remained within a natural range when making eye contact with the seated participants. Additionally, the positioning of the chairs was designed to prevent any interference from the participants during the experiment.

Fig. 16 Layout of the experiment site: (1) android avatar “Yui”; (2) display; (3) height-adjustable chairs; (4) observer; (5) operator; (6) partitions

Fig. 17 Layout of eight height-adjustable chairs

A display was positioned next to the avatar to guide the observer to their designated seat. The display showed an image similar to that in Fig. 18, which visually indicated the sitting position of the participant (Fig. 19).

Fig. 18 Display indicating sitting positions

Fig. 19 Photograph of the reproduction of participants during the experiment. These photographs show an author wearing the HMD in place of the participants to recreate the scene during the experiment

To assess the subjective difficulty of maintaining eye contact, a 5-point Likert scale questionnaire was used. Participants rated the level of difficulty they experienced when making eye contact with the other person through the robot, with 1 indicating “not difficult” and 5 indicating “very difficult.” As in Experiment I, the RTLX score of the Japanese version of the NASA-TLX was used to evaluate the workload.

Procedure

The design, task content, and experimental procedures employed in this study are outlined below.

Design

The experiment was conducted using a within-subjects design, comparing the Fixed and Sync conditions. Six pairs completed the tasks in the Fixed-to-Sync order, whereas the remaining five pairs completed them in the Sync-to-Fixed order. Data from five pairs in the Fixed-to-Sync order were included in the analysis, with one pair excluded owing to an issue. The gender and age distributions of the participants, categorized according to the order of conditions and roles, are listed in Table 2.

Table 2 Details of participants in Experiment 2

Task

Two participants, an operator and an observer, were instructed to make eye contact through the android avatar. The operator wore the HMD and manipulated only the avatar’s eyes by moving their own eyes. Participants performed the task using a button held in each hand. Throughout the experiment, participants wore headphones and listened to white noise.

The operator was instructed to make eye contact with the observer and press the button in their hand once the observer was seated in the designated position. The operator maintained eye contact with the observer until the final sound was played through the headphones. Following the conclusion of the sound, the observer was instructed to relocate to a different seat, and the process was repeated.

The observer was directed to move to the designated seat as indicated on the display, facing the avatar. Once seated, a 3-s countdown sound was played through headphones worn by the observer. Between the end of the countdown and the final sound, the observer was prompted to press a button if they felt the avatar was gazing at them. If they did not sense the avatar’s gaze, they refrained from pressing the button and awaited the final sound. After the final sound was heard, participants checked the display, moved to the next seat, and repeated the same process.

A countdown sound was played through the observer’s headphones when the operator pressed a button. The final sound was played simultaneously on both the operator’s and observer’s sides, 7 s after the countdown ended in the observer’s headphones. This interval was established to allow ample time for the observer to press the button again if necessary, ensuring the accuracy of their intuitive judgment. A click sound would indicate when the button was successfully pressed. Observers were instructed to press the button upon feeling the avatar’s gaze, to press it again if the click sound was not heard, and to understand that the outcome would not be influenced by the number of button presses until the final sound was heard. The displayed image was switched simultaneously with the final sound.

The operator and observer repeated the aforementioned trial twice for each seat, totaling 16 trials per condition. Regarding the order of seats designated for the observers, four sequences were created in advance by randomly arranging the eight locations. The seat order for each task was determined by concatenating two of these sequences, taking care that the same seat did not appear consecutively where the sequences were joined. Regardless of which condition was conducted first, the seat orders for the first and second tasks were the same for all pairs.

To evaluate the ease of eye contact with the avatar, we recorded whether the observer pressed the button in each trial. The results of gaze perception were analyzed by seat, and for each condition, eye contact was considered possible if the button was pressed both times for the same seat.

Experimental Steps

Participants were informed that the study aimed to investigate the operation of the teleoperated robot and the impressions of the operator. Before the experiment, the experimenter randomly assigned the roles of operator and observer. If one of the two participants wore glasses, that individual was designated as the observer; this was to accommodate participants whose glasses might hinder wearing the HMD. The roles of operator and observer remained the same for both tasks.

First, the observer was instructed to adjust the chair heights to align their eye level with that of the avatar. For height adjustment, a portable object with a viewing hole at the avatar’s eye level was used. The observer adjusted all eight chairs so that the avatar’s eyes were visible through the viewing hole while seated.

The operator was instructed to perform eye calibration while wearing the HMD. Following calibration, an operation test was conducted under the condition to be performed first. With the avatar’s eye movements synchronized, the operator moved their eyes in all directions to verify that the avatar moved properly and that the camera image was displayed appropriately. Additionally, the operator was informed that the avatar’s head remained fixed and that they would not be able to alter their view by moving their own head.

Following the operation test, the operator and observer were both provided with buttons and headphones playing white noise, and a test trial of the experimental task was conducted once. The operator was instructed to press the button in their hand once the observer took their seat. The observer was instructed to sit in seat B in front of the avatar (Fig. 17b) and press the button following the countdown sound. Upon completion of the final sound, the operator confirmed that it had played successfully. Simultaneously, the observer confirmed that the countdown, click, and final sounds all played correctly.

The main experiment commenced after the test trial was completed. Once the initial seat was displayed on the screen, the observer moved to it on the experimenter’s signal, marking the start of the task. Upon task completion, both the operator and observer completed a questionnaire regarding the difficulty of maintaining eye contact and the workload.

Following the questionnaire, the process from the test trial to the questionnaire was repeated under all other conditions.

During the task, the images displayed on the operator’s HMD were saved from the end of the countdown until the final sound was played, that is, while the observer judged the gaze direction. This process was crucial to ensure that any noise present in the camera images did not have a significant impact on the results. Considering the PC load, images were saved every 0.5 s. Additionally, to allow for proper analysis in case of any discrepancies, the behaviors of the avatar and observers were recorded with a video camera.

Ethical considerations

The research protocol employed in this study was approved by the ethics committee of the University of Electro-Communications (No. H23034(2)). All participants provided written informed consent prior to the experiment.

Results

Among the images displayed on the HMD, instances of noise were limited to one or fewer per trial, leading to the conclusion that noise did not significantly influence the experimental outcomes.

The results of the observer’s gaze direction judgments for each condition are shown in Fig. 20. The success rate for establishing eye contact was calculated for each seat as the percentage of observers who pressed the button in both trials. The outcomes of the McNemar test (\(\alpha =0.05\), one-sided) indicated no significant differences in success rates between the Fixed and Sync conditions for any seat. The contingency tables of the numbers of successful and failed trials for each seat and the test results are shown in Tables 4 and 5 in Appendix B.
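An exact one-sided McNemar test can be computed from the discordant pairs; the sketch below (Python with SciPy) uses hypothetical paired outcomes, not the counts reported in Appendix B.

```python
from scipy.stats import binomtest

# Hypothetical paired outcomes for one seat (1 = eye contact reported in both
# trials, 0 = not); the real counts are given in Appendix B.
fixed = [1, 0, 0, 1, 0, 1, 0, 0, 1]
sync = [1, 1, 0, 1, 1, 1, 0, 1, 1]

b = sum(f == 1 and s == 0 for f, s in zip(fixed, sync))  # success only in Fixed
c = sum(f == 0 and s == 1 for f, s in zip(fixed, sync))  # success only in Sync

# Exact one-sided McNemar test: under H0, the discordant pairs split 50/50,
# so the p-value is a binomial tail probability.
result = binomtest(c, n=b + c, p=0.5, alternative="greater")
print(b, c, result.pvalue)
```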

Fig. 20 Success rate of eye contact at each seat

The eye contact difficulty ratings for both the operator and observer under each condition are shown in Fig. 21. The outcomes of the Wilcoxon signed-rank test (\(\alpha =0.05\), one-sided) indicated no significant difference in the difficulty ratings for the operator between the Fixed (\(Mdn=3\)) and Sync conditions (\(Mdn=2\)) (\(W=8.5\), \(p=0.609\)). However, for the observers, a significant trend was observed toward a decrease in difficulty ratings from the Fixed (\(Mdn=4\)) to the Sync condition (\(Mdn=1\)) (\(W=3.5\), \({\dag }p<0.1\)).

Fig. 21 Difficulty of eye contact (\({\dag }p<0.1\))

To assess the workload, the mean value of each NASA-TLX item was calculated to determine the RTLX score. The RTLX scores for each condition were then compared, as shown in Fig. 22. The results of the Wilcoxon signed-rank test (\(\alpha =0.05\), one-sided) showed no significant differences in the operator’s RTLX scores between the Fixed (\(Mdn=44.8\)) and Sync conditions (\(Mdn=28.5\)) (\(W=18.0\), \(p=0.326\)). Similarly, no significant differences were observed in the observer’s workload between the Fixed (\(Mdn=25.7\)) and Sync conditions (\(Mdn=9.5\)) (\(W=22.0\), \(p=0.5\)).
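A sketch of the paired comparison (Python with SciPy; the RTLX values below are illustrative placeholders, not the measured scores):

```python
from scipy.stats import wilcoxon

# Illustrative paired RTLX scores (one value per operator in each condition).
fixed_rtlx = [50.0, 41.7, 44.8, 62.5, 30.2, 47.3, 39.0, 55.1, 28.4]
sync_rtlx = [45.2, 25.0, 28.5, 60.0, 33.3, 30.8, 41.2, 37.5, 22.1]

# One-sided test of whether workload is lower in the Sync condition.
stat, p_value = wilcoxon(fixed_rtlx, sync_rtlx, alternative="greater")
print(stat, p_value)
```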

Fig. 22 RTLX scores

Discussion

Findings from experiment I

The findings of Experiment I indicated that in the Sync condition, the operator’s gaze direction when viewing an object through the avatar more closely matched the gaze direction when viewing the object from the avatar’s location than in the Fixed condition. In the Fixed condition, the deviation angle was approximately half of the \(\theta\) of the target point, whereas in the Sync condition all deviation angles remained below \(5^\circ\), except for the three target points at \((\theta , \phi )=(30^\circ , 60^\circ ), (30^\circ , 90^\circ ), (30^\circ , 120^\circ )\). One possible explanation for the remaining deviation is the accuracy of mapping the scale of the camera images. In the implemented system, of the captured images with a horizontal angle of view of \(200^\circ\), the \(180^\circ\) range was projected onto a hemispherical screen. However, uncertainty in the horizontal angle of view of the camera may have affected the mapping accuracy, so that the images were not displayed at a scale equal to that of the real world. The weighted-average filter applied to the screen rotation angle inputs may also be a contributing factor.

From \(\phi =60^\circ\) to \(120^\circ\) at \(\theta =30^\circ\), an increase in the deviation angle was observed in the Sync condition. This could be attributed to the positioning of the eyeball slightly above the center of the display when wearing the HMD utilized in this study. Consequently, for some participants, these target points may have been situated near the upper edge of the HMD’s field of view, making them difficult to focus on in the Sync condition, where the view remained static with eye movement. Free responses in the questionnaire included comments from participants in the Sync condition indicating that “looking upwards was difficult and tiring.” Additionally, for the aforementioned reasons, eye calibration was not accurately performed in some participants.

Moreover, the absence of significant differences in workload evaluation between conditions could be attributed to the psychological burden experienced by participants in the Sync condition, stemming from the challenge of focusing on upper target points. Conversely, in the free responses, three out of five participants in the Fixed condition expressed frustration caused by screen vibrations: “It was difficult to gaze continually at a single point because the screen was shaking,” “I felt some stress when my gaze did not go where I wanted it to,” and “While keeping my gaze fixed, the camera moved and the light shifted from my line of sight.” Under the Fixed condition, the view changed with eye movements, causing the screen to vibrate when focusing on a single point. In contrast, the Sync condition compensated for changes in view resulting from eye movements, enabling a stable focus on a single point. This underscores the superiority of the proposed system over the conventional system for utilizing avatars as communication tools.

Findings from experiment II

From the results of Experiment II, although no significant differences were observed in eye contact success rates between conditions for any seat, the Sync condition achieved higher success rates for all seats in the back row, except for the central seat F. In the front row, over 50% of the participants in the Fixed condition believed they could make eye contact with the avatar. The seating arrangement of this experiment positioned the seats at the ends of the front row approximately \(\pm {22}^\circ\) from the forward direction of the avatar. Because the deviation angle from the target gaze direction in the Fixed condition was approximately half of the target angle, the theoretical difference in the gaze angle between the Fixed and Sync conditions was approximately \(11^\circ\). This difference may not have been sufficiently large to influence the perception of eye contact in some participants.

Additionally, a bias in the success rate was observed between the left and right side seats in both conditions in the back row. Analysis of the avatar’s gaze angles revealed a consistent bias of approximately \(3^\circ\) toward seat H from the central seats for all participants in both the front and back rows. Furthermore, the amplitude of the gaze angle at seat C in the front row was approximately \(4^\circ\) larger than that at seat A, and the amplitude at seat H in the back row was approximately \(4^\circ\) larger than that at seat D. This suggests that the avatar’s head was likely oriented approximately \(2^\circ\) to \(3^\circ\) toward seats A and D from the central seats, leading to lower success rates for seats G and H than for seats D and E. Conversely, at seat H, which had the largest amplitude of gaze angle, the Sync condition demonstrated a more pronounced increase in success rate compared to the Fixed condition, highlighting the effectiveness of the proposed system in areas with larger gaze angle amplitudes.

Furthermore, a significant trend toward lower subjective difficulty in making eye contact was observed in the Sync condition. In the free responses from the questionnaire completed immediately after performing the task in the Fixed condition, eight out of nine observers encountered challenges in maintaining eye contact through the avatar, with comments such as “it was difficult to align my gaze.” No positive response was observed. In contrast, in the free responses from the questionnaire completed after the Sync condition, while four out of nine observers similarly mentioned challenges in establishing eye contact, three others left positive remarks, such as “I was surprised that my gaze hardly deviated,” and “It was much easier to make eye contact than I expected, it felt almost human.” The subjective difficulty in making eye contact varied among participants, with five participants stating it was “not difficult at all,” whereas four participants found it “very difficult” or “somewhat difficult,” indicating considerable individual differences. However, for some participants, the Sync condition likely contributed significantly to reducing the difficulty of making eye contact.

Although the results from Experiment I suggested a potential reduction in the operator’s workload, the present experiment similarly showed no significant difference in the operator’s workload. Among the operators who commenced the experimental task with the Sync condition, four out of five rated the workload higher in the Fixed condition during the second task. In contrast, for those who started with the Fixed condition, only two out of four operators rated the workload lower in the Sync condition during the second task compared with the Fixed condition. One possible explanation is that in the Fixed condition, the ability to change the view with eye movements may have caused operators who started with this condition to perceive a loss of control over the view when performing the Sync condition, thereby experiencing an increase in workload. However, in the questionnaires completed immediately after performing the task in the Sync condition by one of the operators who began with the Fixed condition, a statement was made that “my eyes were tired after the first task, but they were less tired after the second task,” indicating that the proposed system is effective in reducing workload during operation.

Additionally, in the free responses from the questionnaire completed after performing the task in the Sync condition, one comment read, “I usually find it somewhat uncomfortable to make eye contact when facing someone, but I did not feel that way much during this experiment.” The avatar’s ability to maintain a greater physical distance from the counterpart than in face-to-face interactions may reduce the discomfort associated with engaging in close eye contact. The use of an android avatar to establish eye contact could therefore make communication involving eye contact easier than in a face-to-face environment.

Limitations

The results of Experiment II showed that the success rate for eye contact was below 80% for all seats except the central seats, regardless of the condition. In this experiment, participants judged whether they felt as though they were making eye contact with the avatar solely based on the direction of its gaze. However, factors beyond gaze direction can affect the perception of being observed. Previous research indicates that head orientation has an important influence on gaze direction perception [19]. Eyebrow movements and the degree of eyelid opening and closing may also influence the perception of eye contact. Furthermore, free-response comments included statements such as “I lost the sense of what it feels like to make eye contact.” Eye contact is often perceived intuitively; however, in this experiment, participants were given an extended judgment time of 7 s, which may have led to overthinking and diminished their intuitive sense of eye contact. An experiment that evaluates the perception of eye contact as it occurs naturally in interactions should therefore be conducted.

Conclusion

We developed an eyeball integrated with a wide-angle lens camera designed for android avatars. This technology, along with a vision-sharing system, aims to enhance the operation and appearance of the avatar to closely mimic human gaze behavior. We conducted experiments to thoroughly evaluate the effectiveness of this system. Experiments demonstrated that the proposed vision-sharing system enabled operators to perceive their surroundings as if they were actually in the avatar’s location. Additionally, the proposed system demonstrated significant potential in facilitating eye contact through the android avatar.

In the experiments conducted for this study, issues arising from communication constraints, such as the low resolution of the images presented to the operator and the need to apply filters to the input rotation angles of the avatar’s eyeballs and the screens, likely influenced the participants’ subjective evaluations. Addressing these challenges through alternative communication methods could provide a clearer assessment of our system’s effectiveness.

Future research will involve experiments in which communication is performed through the avatar to evaluate the impact of the proposed system on eye contact during communication. Additionally, we will explore facial expression elements beyond eye movements to develop systems that facilitate natural non-verbal communication between the operator and the observer.

Availability of data and materials

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

HMD:

Head-mounted display

References

  1. Bailenson JN (2021) Nonverbal overload: a theoretical argument for the causes of Zoom fatigue. Technol Mind Behav. https://doi.org/10.1037/tmb0000030

  2. Argyle M, Cook M (1976) Gaze and mutual gaze. Cambridge University Press, Cambridge, p 210

  3. Argyle M, Dean J (1965) Eye-contact, distance and affiliation. Sociometry 28(3):289–304. https://doi.org/10.2307/2786027

  4. Tanaka K, Nakanishi H, Ishiguro H (2014) Comparing video, avatar, and robot mediated communication: pros and cons of embodiment. In: Yuizono T, Zurita G, Baloian N, Inoue T, Ogata H (eds) Collaboration technologies and social computing. CollabTech 2014. Communications in Computer and Information Science, vol 460. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44651-5_9

  5. Sacino A, Cocchella F, De Vita G, Bracco F, Rea F, Sciutti A, Andrighetto L (2022) Human- or object-like? Cognitive anthropomorphism of humanoid robots. PLOS ONE 17(7):1–19. https://doi.org/10.1371/journal.pone.0270787

  6. Nakajima M, Shinkawa K, Nakata Y (2024) Development of the lifelike head unit for a humanoid cybernetic avatar ‘Yui’ and its operation interface. IEEE Access 12:23930–23942. https://doi.org/10.1109/ACCESS.2024.3365723

  7. Glas DF, Minato T, Ishi CT, Kawahara T, Ishiguro H (2016) ERICA: the ERATO intelligent conversational android. In: 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), New York, NY, USA, pp 22–29. https://doi.org/10.1109/ROMAN.2016.7745086

  8. Ishihara H, Yoshikawa Y, Asada M (2011) Realistic child robot “Affetto” for understanding the caregiver-child attachment relationship that guides the child development. In: 2011 IEEE International Conference on Development and Learning (ICDL), vol 2, pp 1–5. https://doi.org/10.1109/DEVLRN.2011.6037346

  9. Engineered Arts (2024) Ameca. https://www.engineeredarts.co.uk/robot/ameca/. Retrieved June 22

  10. Nakata Y, Yagi S, Yu S, Wang Y, Ise N, Nakamura Y, Ishiguro H (2022) Development of ‘ibuki’ an electrically actuated childlike android with mobility and its potential in the future society. Robotica 40(4):933–950. https://doi.org/10.1017/S0263574721000898

  11. Darvish K, Penco L, Ramos J, Cisneros R, Pratt J, Yoshida E, Ivaldi S, Pucci D (2023) Teleoperation of humanoid robots: a survey. IEEE Trans Robot. https://doi.org/10.1109/TRO.2023.3236952

  12. Shinkawa K, Hirayama T, Nakajima M, Nakata Y (2023) Development of an eyeball integrated with a wide-angle lens camera and vision-sharing system for android avatars. In: ROBOMECH, 1A2-C17 (in Japanese)

  13. Shinkawa K, Nakajima M, Nakata Y (2024) Vision-sharing system for android avatars: enabling operator eye movement synchronization and immersive presentation of avatar sight. In: ICRA 2024 Workshop on “Society of Avatar-Symbiosis through Social Field Experiments”

  14. Shinkawa K, Nakata Y (2023) Gaze movement operability and sense of spatial presence assessment while operating a robot avatar. In: IEEE/SICE International Symposium on System Integration (SII), Atlanta, GA, USA, pp 1–7. https://doi.org/10.1109/SII55687.2023.10039342

  15. Haga S, Mizukami N (1996) Japanese version of NASA Task Load Index: sensitivity of its workload score to difficulty of three different laboratory tasks. Jpn J Ergonom 32(2):71–79. https://doi.org/10.5100/jje.32.71

  16. Hart SG, Staveland LE (1988) Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Adv Psychol 52:139–183. https://doi.org/10.1016/S0166-4115(08)62386-9

  17. Hart SG (2006) NASA-Task Load Index (NASA-TLX); 20 years later. Proc Hum Factors Ergonom Soc Annu Meet 50(9):904–908. https://doi.org/10.1177/154193120605000909

  18. Hall ET (1966) The hidden dimension. Doubleday & Company Inc., New York, NY, USA

  19. Gamer M, Hecht H (2007) Are you looking at me? Measuring the cone of gaze. J Exp Psychol Hum Percept Perform 33(3):705–715. https://doi.org/10.1037/0096-1523.33.3.705


Acknowledgements

This work was supported by JST Moonshot R&D Grant Number JPMJMS2011.

Funding

JST Moonshot R&D Grant Number JPMJMS2011

Author information


Contributions

Conceptualization: K.S., M.N., Y.N.; methodology: K.S., M.N., Y.N.; software: K.S., M.N.; hardware: M.N., Y.N.; data curation: K.S.; writing–original draft preparation: K.S.; formal analysis: K.S.; project administration: Y.N.; supervision: Y.N. All authors: writing–review & editing.

Corresponding author

Correspondence to Yoshihiro Nakata.

Ethics declarations

Ethics approval and consent to participate

The research protocols employed in this study were approved by the ethics committee of the University of Electro-Communications (No. H23034 and No. H23034(2)). All participants provided written informed consent prior to the experiments.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Experiment I

See Figs. 23, 24, 25, 26 and Table 3.

Fig. 23 Actual focal point data at \(\theta =0^\circ\). The black and gray dots indicate target points and experimental data, respectively

Fig. 24 Actual focal point data at \(\theta =10^\circ\). The black and gray dots indicate target points and experimental data, respectively

Fig. 25 Actual focal point data at \(\theta =20^\circ\). The black dots indicate target points, whereas the gray dots indicate experimental data

Fig. 26 Actual focal point data at \(\theta =30^\circ\). The black dots indicate target points, whereas gray dots indicate experimental data

Table 3 Results of the Mann–Whitney U-test for median deviation angle at each target point (\({\dag }p<0.1\), *\(p<0.05\), **\(p<0.01\))

Appendix B Experiment II

See Tables 4 and 5.

Table 4 Contingency table of the number of successful and failed trials in each seat
Table 5 Results of McNemar’s test for success rates in each seat

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article


Cite this article

Shinkawa, K., Nakajima, M. & Nakata, Y. Vision-sharing system for android avatars to enable remote eye contact. Robomech J 11, 16 (2024). https://doi.org/10.1186/s40648-024-00284-0
