
Multimodal reaching-position prediction for ADL support using neural networks

Abstract

This study aimed to develop daily living support robots for patients with hemiplegia and the elderly. To support daily living activities with robots in ordinary households without imposing physical or mental burdens on users, the system must detect the actions of the user and move appropriately according to their motions. We propose a reaching-position prediction scheme that targets the motion of lifting the upper arm, which is burdensome for patients with hemiplegia and the elderly in daily living activities. For this motion, it is difficult to obtain effective features for building a prediction model in environments where large-scale sensor systems cannot be installed and the motion time is short. We conducted motion-collection experiments, identified the features of the target motion, and built a prediction model using multimodal motion features and deep learning. The proposed model achieved an accuracy of 93% and a macro-average F1-score of 0.69 for 9-class classification at 35% of motion completion.

Introduction

With an aging society, the demand for intelligent robots that support activities of daily living (ADL) for elderly and disabled people living alone is increasing. These robots are required to work close to human users in environments that are relatively narrow and difficult to sense, in contrast with industrial robots, which work in the well-controlled environments of factories or warehouses. The influence of the support provided by an autonomous robot on the mental health of the user must also be considered. For example, it is not always appropriate for an assisting robot to fully support a user who intends to reach a distant object by picking up the object and bringing it to the user.

In such a case, although the user intends to move on their own, the support ends up undermining that intention. Therefore, it is essential for an ADL support system not to inhibit the active motivation and self-efficacy (SE) of users [2, 18]. In [18], the authors reported a correlation between SE evaluation scores and the speed and accuracy of reaching motions in patients with residual impairment from stroke.

This study aims to develop an autonomous ADL support robot that considers the intentions of the user. We focus on cooperative task completion with users to maintain and improve their self-efficacy. In the previous example, the robot could support the arm of the user reaching out to pick up the object, or move the target object into an easier-to-pick position. Such a support system not only maintains the self-efficacy of users but also improves it through the experience of accomplishing tasks that would be difficult to accomplish alone.

To achieve the goals of the study, in this paper, we propose a novel scheme to predict the reaching position of reaching motions involving upper-arm lifting. Although lifting the upper arm is an essential part of ADL, such as taking an object from a high place, putting an object back up, or hanging laundry, it is difficult for elderly or disabled people because of the need to keep the arm at a high position, which may unbalance the torso. There are many possible actions that an assisting robot can perform to support such motions, such as directly supporting the arm or torso of the user, or acting on the objects that are the target of the user's movements.

Therefore, establishing a prediction scheme for this motion, along with an analysis of its features, will be useful for both the hardware and software development of support robots.

The main contributions are summarized as follows:

  • We collected motion data of upper-arm lifting, imitating object grabbing, using multiple sensors, and analyzed the motion features.

  • We created a multimodal neural-network model that predicts the reaching position as a classification problem and can be adapted to real-time robot control.

The remainder of this paper is structured as follows: In the next section, we introduce related work. In “Research questions and our approaches”, we define our research problems and approaches. In “Analysis of reaching motions”, we describe the motion data collection and analysis. We then propose our multimodal reaching-position prediction network in “Prediction model”. The results and discussion are presented in “Evaluation and result” and “Limitations and discussions”. Finally, “Conclusions” concludes the paper.

Related work

There are various approaches to developing cooperative robots. Research themes in this area are shifting from proposing robot motion generation methods to developing human-motion estimation frameworks. For tasks where robots and humans work in proximity, as in this study, there are systems aimed at sharing a workspace while avoiding mutual interference, and systems aimed at cooperating on a single task, such as handover or load-sharing tasks.

Assembly tasks in a factory are typical scenes of cooperative tasks between robots and humans (e.g., [1, 12]). When a robot and humans share a workspace but work individually, the robot must predict the humans' motion trajectories to avoid collisions. In [1], the authors addressed the estimation of the reaching motion of human workers as a multi-class classification problem. They reported that the accuracy of their method, which used 3D point-cloud data, was around 80% after 50% of the motion had elapsed. In [12], the authors collected data on the reaching motions of workers using a motion capture system and used the data to predict the arm trajectory of the human worker. Constructing such advanced sensing environments is feasible in factory and laboratory settings.

In research on handover tasks, which require robots and humans to be positioned close to each other, sensors such as voice and electromyography sensors are used to predict the trajectory of a worker's arm [22, 23].

In the area of human-robot interaction, many studies have constructed models that predict user activities and intentions from various features, as well as user movements, to achieve natural interaction between humans and robots similar to human-human interaction. For example, in [20, 26], the authors proposed emotion estimation models using the facial expressions and verbal features of the user. In [22], the authors proposed a deep-neural-network model that estimates the user's commands to the robot using both verbal and nonverbal features. In [25], the user's intention toward a service robot was estimated from face direction. Although natural human cues such as face direction seem effective for conveying the intention or purpose of a motion to a system, the system we aim for, as described above, supports daily living activities according to the actions of the user. Therefore, it is not appropriate to build complex sensor systems for trajectory tracking in the home or to require voice instructions to robots such as “I want to get the book on the upper right shelf.”

In this study, we address these problems by using simple sensor systems. In addition, we treat the reaching-position prediction problem for upper-arm lifting motions as a multi-class classification task and create a novel model that uses multiple nonverbal features. The proposed method could also be applied in the future to load-sharing tasks [5, 17]. Current studies focus on control methods and algorithms for after the human and robot have grasped the load; this study proposes one approach to the important problem of how a robot can grasp an object together with a human according to the human's intention.

In the following section, we present our research questions and approaches.

Research questions and our approaches

Research questions

This study addresses two research questions: first, we investigate the practical features of the upper-arm lifting motion for constructing a reaching-position prediction model; second, we build a neural-network model using those features. Considering our goal, we assumed the following use environment and scenario: the system is used in an everyday household environment, the users are patients with hemiplegia or older adults with weakened muscles, and the support system operates autonomously while avoiding compromising the self-efficacy of the user by not providing full support.

As a specific task, assume that the user takes an object from a shelf with their healthy arm. The system recognizes the reaching position of the motion and interacts with the user's arm and the object to be grabbed. This means that the system supports the task by, for example, supporting the arm or torso or moving the object to a position that is easier to grasp.

Based on these assumptions, the proposed method has the following requirements.

  • It deals only with available information without installing or attaching large sensor systems to the user or environment.

  • The proposed method assumes that the support system works autonomously, without requiring active manipulation by the user.

  • It provides an environment in which the user does not have to wait for support or adjust their operating speed.

Approach

Under the conditions described above, our approach to investigating the research questions is as follows. First, we collect target motion data from multiple subjects in the assumed environment. Next, from the collected data, we select motion features that are considered adequate for constructing a prediction model. Finally, as in [1], we construct a prediction model of the reaching position as a multi-class classification problem using deep learning and evaluate its performance.

The next section describes the data collection method, the characteristics of the collected motions, and the features that can be used to predict the reaching position.

Analysis of reaching motions

Motion collection

Figure 1 shows the environment settings for the data collection. The motion data collection procedure was as follows. As illustrated in the figure, the participant sat on a chair in front of a shelf divided into nine regions. The participant performed the motion of grabbing things from the region randomly indicated by the experimenter. Each indication was presented visually on a display set in front of the participant after a 3-second countdown, with the target region determined at random at the moment of presentation. One set of trials consisted of four randomly ordered motions to each region, and all participants sequentially performed seven sets of trials. The participants were instructed to place their right hands on their knees and face the display in front of them during the countdown. The sensors used were an RGBD camera (Microsoft, Azure Kinect) installed in front of the participant and an inertial measurement unit (IMU) sensor (MicroStrain, 3DM-GX5-45) attached to the right arm of the participant. Color images (resolution: 1280 \(\times\) 720, field of view: 90\(^{\circ }\) \(\times\) 59\(^{\circ }\)) and depth images (resolution: 640 \(\times\) 576, field of view: 75\(^{\circ }\) \(\times\) 65\(^{\circ }\)) were acquired from the RGBD sensor at 15 frames per second (FPS), and magnetometer, angular velocity, and acceleration data were obtained from the IMU sensor at 100 Hz.

Fig. 1

Overview of data collection environment

Six able-bodied male participants (aged 22–25, all right-handed) were recruited from our laboratory. Excluding data recording failures, the effective number of data samples was 1538. Figure 3 shows an example of the collected data: a sequence of a reaching motion to the center-left region. The numbers indicate the elapsed frames from the start of the motion.

Motion analysis

Table 1 shows the descriptive statistics of the reaching times to each region derived from the collected data. Here, the reaching time was measured from the video data, from the moment the participant started the movement after the target region was indicated on the display to when the extended right arm became stationary. The time taken for the visual reaction is therefore not included. A one-way ANOVA and a post hoc Tukey HSD test (\(p <0.05\) was considered significant) indicated significant differences between the regions (F(8, 1529) = 19.87, \(p <0.001\)). All the post hoc test results are shown in Fig. 5, which is discussed later.

Reaching the uppermost regions, which are the farthest from the right hand's initial position, top-left (TL), top-center (TC), and top-right (TR), required approximately 1.56, 1.56, and 1.58 s, respectively, and no significant difference was observed among them. Similarly, no significant difference was observed among the middle regions, center-left (CL), center (C), and center-right (CR). These results suggest that, within this experimental setup, participants unconsciously adjusted their movement speed to reach regions at the same height. This adjustment is equivalent to modulating the waiting time until the next target position is presented and might therefore be influenced by the experimental conditions. For the bottom regions, on the other hand, reaching the bottom-center (BC), which is located directly in front of the body, was the fastest, with an average time of about 1.33 s. According to the post hoc test, there was no significant difference between BC and bottom-right (BR). However, a significant difference (\(p <0.05\)) was observed between BC and bottom-left (BL), which requires extending the right arm toward the front-left; such a movement appears to be particularly difficult, even for healthy individuals. Additionally, the maximum reaching time to BR was relatively large compared with the other bottom regions. Reviewing the video data of these motions showed that, after quickly approaching the target location, participants continued a slow approach movement without coming to a complete stop. These data were not excluded because they were not considered to significantly affect subsequent analyses or system development.

Thus, although a tendency to unconsciously adjust speed was observed in simple reaching movements, the presence of locations that are significantly more difficult to reach shows that reaching speed and motion are not determined solely by the distance between the arm and the destination.
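The statistical tests above can be reproduced with standard tools. The following is a minimal sketch, assuming the per-trial reaching times are stored in a hypothetical CSV file with columns time_s and region; it is not the exact analysis script used in this study.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("reaching_times.csv")                       # hypothetical file: one row per trial

# One-way ANOVA across the nine target regions
groups = [g["time_s"].values for _, g in df.groupby("region")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Post hoc Tukey HSD test (alpha = 0.05) for all pairwise region comparisons
tukey = pairwise_tukeyhsd(endog=df["time_s"], groups=df["region"], alpha=0.05)
print(tukey.summary())
```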

Table 1 Descriptive statistics of reaching motion times (s) by target region

This study uses the average value of 1.47 s from all data as a guideline for developing a system to support this task. The proposed system must perform user motion recognition, predict the reaching position, and provide support actions all within this time frame.

Additionally, Fig. 2 shows differential images created with the sum of absolute differences (SAD) method from the color images over the first 10 frames after the start of the motion (a minimal sketch of this computation is given after the figure caption below). Figure 2A and B are from specific motions extracted from the collected dataset, while Fig. 2C is generated from all data. Bright areas in the images indicate regions of significant movement within the frames. The collected video data and the differential images also revealed the following features.

  1. (1)

    The target motion consisted of movements of specific body parts, namely the upper body (torso), right arm, and face, rather than the entire body. In particular, the differential image shows that the torso movements are not significant in the initial phase of the motion, making it difficult to use this information to predict the reaching position. On the other hand, the image shows significant movement near the face and around the right arm. Therefore, we considered it effective to capture the right-arm motion with visual features, together with changes in face direction, from the early stage of the motion.

  2. (2)

    There was no preparatory motion that could be used for prediction. Consequently, the time spent on prediction after the motion starts directly reduces the time available for the support action, and the timing of the motion start must also be recognized. However, obtaining the exact start timing is challenging because there are no prior motions. This point is discussed in “Limitations and discussions” as a future issue.

Fig. 2

Visualization of variations within 10 frames after the start of the motion using the sum of absolute differences (SAD). A, B Examples from individual collected motions. C Result based on all collected motions
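The SAD visualization in Fig. 2 can be computed as follows. This is a minimal sketch assuming the first 10 frames after motion onset are available as a list of BGR images; the function and variable names are illustrative.

```python
import cv2
import numpy as np

def sad_image(frames):
    """Accumulate per-pixel absolute differences over consecutive grayscale frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32) for f in frames]
    acc = np.zeros_like(grays[0])
    for prev, curr in zip(grays[:-1], grays[1:]):
        acc += np.abs(curr - prev)
    # Normalize to 0-255 for display; bright pixels indicate large movement
    return cv2.normalize(acc, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Example usage with the first 10 frames of one motion (hypothetical list `frames`):
# cv2.imwrite("sad_example.png", sad_image(frames[:10]))
```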

Modals for prediction

Based on the observations stated above, we selected face, visual (depth), and motion features to construct the reaching-position prediction model. The observations and [25] suggest that face direction and facial features are essential cues for estimating subsequent motions. Additionally, to make the system broadly applicable, it is more appropriate to estimate the reaching position using depth information as the visual cue rather than color images, which contain redundant external information about the user and the environment. Furthermore, we employed motion features acquired from the IMU sensor attached to the wrist of the user's healthy side, because future robots that support the user's arm or grasp objects will need to understand the three-dimensional movement and posture of the arm. Although there are many technologies for estimating human posture from color images alone [3], in a typical Japanese house it is difficult to obtain a camera field of view sufficient to estimate the arm posture when reaching in an unspecified direction toward a shelf placed in front of the user. Therefore, estimated posture data were not used. Eye-tracking devices were also not used, to avoid complicating the system.

Motion data extraction

In this study, automatic recognition of the motion start and end was not performed. Therefore, each motion had to be extracted from the collected data based on defined criteria. We manually annotated the motion start and end timings according to the following definitions: the motion-start timing was the frame in which the right hand, initially positioned on the knee, began to move, and the motion-end timing was the frame in which the extended arm started to retract at the reaching position. The collected data were divided into individual motions according to the annotated timings.

In the next section, we organize the features discussed in this section into specific feature data and discuss the construction of the reaching-position prediction model.

Fig. 3

Example of the collected motion data: a sequence of color images of a participant reaching to the center-left region. The numbers indicate elapsed frames from the start of the motion

Prediction model

Features

This section describes the features used to build our prediction model.

Face features

The face features were chosen with the expectation of capturing the direction of the face, its variation, and the characteristics of gaze transition. In studies that create prediction models of human behavior, movements of the head and gaze are often used as features [9, 24]. We attempted to capture such features without attaching sensors to the users. In this study, face mesh data were employed as the face features. We used Google MediaPipe [11] to obtain 468 3D face landmark positions from the color images. The time elapsed from the start of the motion was appended to each frame, resulting in 1405-dimensional (468 \(\times\) 3 + 1) data per frame.
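The face-feature extraction could be implemented as in the following minimal sketch using MediaPipe Face Mesh [11]. The function and variable names are illustrative, and the zero-padding on detection failure follows the handling described in “Input frames”.

```python
import numpy as np
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def face_feature(rgb_frame, elapsed_s):
    """Return a 1405-dim vector (468 landmarks x 3 + elapsed time); zeros if no face is detected."""
    result = face_mesh.process(rgb_frame)                   # rgb_frame: HxWx3 uint8 RGB image
    if not result.multi_face_landmarks:
        landmarks = np.zeros(468 * 3, dtype=np.float32)     # zero padding, as described in "Input frames"
    else:
        lm = result.multi_face_landmarks[0].landmark
        landmarks = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32).ravel()
    return np.append(landmarks, np.float32(elapsed_s))
```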

Depth features

The depth features were chosen with the expectation of extracting the three-dimensional characteristics of the movements while reducing dependency on clothing and the experimental environment. These features are also valuable for understanding the user's position and posture in future support scenarios. The depth image data were acquired at 15 FPS and cropped around the user. The resolution was reduced to 256 \(\times\) 188. The elapsed time was also added to each frame, as described below.
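A minimal sketch of this depth preprocessing is shown below. The crop window and names are assumptions; the elapsed time is kept alongside each frame and is concatenated with the CNN latent vector later (see “Depth layers”).

```python
import cv2
import numpy as np

def preprocess_depth(depth_frame, elapsed_s, crop=(0, 64, 576, 640)):
    """Crop (top, left, bottom, right) around the user and resize to 256 x 188."""
    top, left, bottom, right = crop
    cropped = depth_frame[top:bottom, left:right].astype(np.float32)
    resized = cv2.resize(cropped, (256, 188), interpolation=cv2.INTER_NEAREST)
    return resized, np.float32(elapsed_s)
```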

Motion features

The motion features were chosen to capture the characteristics of the rapid three-dimensional movements of the arm. As motion features, the data from the IMU sensor attached to the right wrist provided ten dimensions of information (geomagnetism, acceleration, and angular velocity). The elapsed time was also added, giving 11 dimensions.
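For completeness, the per-sample motion feature can be assembled as in this minimal sketch; the function name is illustrative, and the IMU reading is simply treated as a 10-element vector as described above.

```python
import numpy as np

def motion_feature(imu_values, elapsed_s):
    """Append the elapsed time to the 10-dimensional IMU reading, giving an 11-dimensional vector."""
    imu_values = np.asarray(imu_values, dtype=np.float32)   # magnetometer, acceleration, angular velocity
    assert imu_values.shape == (10,)
    return np.append(imu_values, np.float32(elapsed_s))
```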

Network structure

We constructed a multimodal 9-class classification neural-network model that predicts the reaching positions defined in “Analysis of reaching motions”. Long short-term memory (LSTM) [8] and a local attention mechanism [21] were used to construct the machine-learning model. The network structure is shown in Fig. 4. As shown in the figure, the outputs from all modal layers are combined through late fusion [6, 19], a common method for modeling multimodal information. The composition of each unimodal network is as follows.

Face layers

A bi-directional LSTM layer with 1405-dimensional input and 2048-dimensional output was employed to train the face modality. The final output passed through a self-attention layer and was output as 2048-dimensional data. The number of parameters in this network was 44,554,241.

Fig. 4

Multimodal late fusion model

Motion layers

The same structure as the face model was used to train the motion model. The number of input dimensions was 11, and the number of output dimensions was 512. It had 2,139,137 network parameters.

Depth layers

Depth features were learned by combining latent representations of the depth images obtained with a convolutional neural network (CNN) and time-series learning with an LSTM [13]. The CNN parameters are shown in Table 2. The CNN + LSTM network had a total of 204,509,720 parameters; the latent representation of each frame from the CNN was combined with the elapsed time described earlier and input to the LSTM layer.

Table 2 CNN model parameters for depth network

Classification layers

The three output vectors from the unimodal layers were simply concatenated into a 1 \(\times\) 4097-dimensional vector, which was then input to fully connected layers with dropout at each layer. The dimension of the last output layer was nine, the number of classes. The dropout rates were set to 0.6, 0.4, and 0.2, respectively, and the ReLU function was used as the activation function.
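The overall architecture could be sketched in PyTorch as follows. Layer sizes follow the text where given (face: 1405 to 2048, motion: 11 to 512, classifier dropout rates 0.6/0.4/0.2, nine classes); the attention pooling, the depth-branch CNN, its output width (chosen so that the fused vector is 4097-dimensional), and the classifier hidden sizes are assumptions, so this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Score each time step with a learned weight and return the weighted sum (assumed pooling)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):                                 # seq: (batch, time, dim)
        weights = torch.softmax(self.score(seq), dim=1)
        return (weights * seq).sum(dim=1)                   # (batch, dim)

class SequenceBranch(nn.Module):
    """Bi-directional LSTM followed by attention pooling (face and motion branches)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, out_dim // 2, batch_first=True, bidirectional=True)
        self.pool = AttentionPool(out_dim)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.pool(out)

class LateFusionClassifier(nn.Module):
    def __init__(self, depth_dim=1537, num_classes=9):      # depth_dim assumed so 2048+512+1537=4097
        super().__init__()
        self.face = SequenceBranch(1405, 2048)
        self.motion = SequenceBranch(11, 512)
        # Depth branch: a stand-in per-frame CNN encoder followed by an LSTM over frame embeddings
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.depth_lstm = nn.LSTM(32 + 1, depth_dim, batch_first=True)   # +1 for the elapsed time
        self.classifier = nn.Sequential(                     # dropout rates 0.6, 0.4, 0.2 as in the text
            nn.Dropout(0.6), nn.Linear(2048 + 512 + depth_dim, 1024), nn.ReLU(),
            nn.Dropout(0.4), nn.Linear(1024, 256), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(256, num_classes))

    def forward(self, face_seq, motion_seq, depth_seq, depth_elapsed):
        # face_seq: (B, 7, 1405), motion_seq: (B, 50, 11),
        # depth_seq: (B, 7, 1, H, W), depth_elapsed: (B, 7, 1)
        b, t = depth_seq.shape[:2]
        frames = self.depth_cnn(depth_seq.flatten(0, 1)).view(b, t, -1)
        depth_out, _ = self.depth_lstm(torch.cat([frames, depth_elapsed], dim=-1))
        fused = torch.cat([self.face(face_seq), self.motion(motion_seq), depth_out[:, -1]], dim=-1)
        return self.classifier(fused)
```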

Input frames

As shown in Table 1, the target motion takes a minimum of approximately 1.33 s to complete. Even disregarding the movement time of the support robot, the prediction must be made within a much shorter time to leave time for the robot's movement. The time used for prediction was set to 0.5 s (7 frames for the face and depth models and 50 frames for the motion model). This means the model uses information from only 32 to 36% of the motion time, which is a short prediction time compared with the previous study [1]. Interpolating missing data could improve the performance of the prediction model; however, considering real-time use, the raw sensor data should be input to the predictor with as little processing as possible. Therefore, data shaping was kept to a minimum; for example, frames in which the face mesh could not be recognized were padded with zeros.
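A minimal sketch of how the 0.5 s input window could be assembled is shown below; buffer and function names are illustrative, and frames missing a face mesh remain zero vectors as described above.

```python
import numpy as np

FACE_DEPTH_FRAMES, MOTION_FRAMES = 7, 50          # 0.5 s at 15 FPS and 100 Hz, respectively

def build_window(face_buf, depth_buf, motion_buf):
    """Stack buffered samples into fixed-length model inputs, zero-padding short or missing frames."""
    def take(buf, n, shape):
        arr = np.zeros((n,) + shape, dtype=np.float32)
        for i, sample in enumerate(buf[:n]):
            arr[i] = sample
        return arr
    return (take(face_buf, FACE_DEPTH_FRAMES, (1405,)),
            take(depth_buf, FACE_DEPTH_FRAMES, (188, 256)),
            take(motion_buf, MOTION_FRAMES, (11,)))
```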

In the next section, we discuss the result of the training and the features of our prediction model.

Evaluation and result

Model performance

To train and evaluate the proposed late fusion model, the whole dataset was first randomly divided into a training set and a test set of 1341 and 197 motions, respectively. The model was then trained using a 10-fold cross-validation strategy. Table 3 shows the accuracy and the macro-averaged precision, recall, and F1-score, which are commonly used to evaluate multi-class classification problems, for the fusion model and each unimodal model. These values are obtained by calculating the scores for each class in a one-vs-rest manner and then averaging the results across all classes. Each unimodal model was trained using only its LSTM (or CNN-LSTM) branch and the classifier structure shown in Fig. 4. The results show that the proposed fusion model performs as well as or better than the unimodal models. In particular, the fusion model has the highest F1-score of 0.69, indicating that it is the most balanced model. In a previous study addressing a similar 9-class reaching-position prediction problem [1], an accuracy of 80% was reported at approximately 50% of the motion completion time. In contrast, our method achieved higher accuracy (93%) at an earlier stage of the motion (32%).

Figure 5 shows the confusion matrix obtained from the 197 test data. The rows represent the actual classes and the columns represent the classes predicted by the fusion model. The percentages represent the precision of the classification results for each class. For example, data classified as TL are correct with a probability of 88.0%; however, CL, BL, and BC motions are each incorrectly classified as TL with a probability of 4.0%. The results show that the precision increases from the right side, near the starting point of the motion, towards the upper left. As a trend across the classes, there is also some confusion within the same column. In particular, the precision for the middle row was relatively low, with motions often confused with those to the same column. This is an interesting result, suggesting that even in the initial few frames, movements towards the top and bottom regions are distinctively characteristic, whereas distinguishing whether the arm stops in the middle row or continues moving up or down is difficult for the current model.

However, the results of the post hoc test conducted in “Analysis of reaching motions” also show potential for classification. The asterisks in the figure indicate pairs for which a \(p<0.05\) significant difference in motion speed was observed in the post hoc Tukey HSD test. Significant differences were observed in the final reaching time between CR and TR and between CR and BR, indicating that there are differences in the arm motions between them. Even if such differences appear in the movement trajectories within the current input frames, this model, which predicts reaching positions by combining arm, face, and depth features, may still misclassify because of factors other than the motion features. To reveal the factors causing the differences in reaching times, arm trajectories and their features must be analyzed, and neural networks capable of extracting them must be constructed; these tasks remain for future research. Additionally, the possibility of misclassification cannot be completely eliminated. Based on these results, the operation of future support systems is discussed in “Limitations and discussions”.
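The reported metrics and the confusion matrix can be computed as in the following sketch, assuming the test-set labels and predictions are available as integer arrays (hypothetical names).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

REGIONS = ["TL", "TC", "TR", "CL", "C", "CR", "BL", "BC", "BR"]   # nine target classes

def evaluate(y_true, y_pred):
    """Report accuracy, macro-averaged (one-vs-rest) precision/recall/F1, and the confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(REGIONS))))
    print(f"accuracy={acc:.2f}  macro precision={prec:.2f}  recall={rec:.2f}  F1={f1:.2f}")
    return cm
```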

Table 3 Single-modal vs. multimodal model performance comparison
Fig. 5

Confusion matrix of the proposed fusion model. The percentages represent the classification rate. The asterisks indicate pairs for which a significant difference in motion speed was found in the post hoc Tukey HSD test described in “Motion analysis”

Estimation speed

Finally, we measured the estimation speed of the proposed model, which has 282,930,211 parameters. The computer used for inference ran Ubuntu 20.04.6 LTS with an Intel(R) Xeon(R) W-2225 @ 4.10 GHz CPU, an NVIDIA RTX A5000 GPU, and 24 GB of RAM. The input data were the motion data collected in “Analysis of reaching motions”. These data were stored using the functionality of the Robot Operating System [15], which allows the reception of camera images and IMU sensor data to be simulated while maintaining timestamp information. However, delays such as the camera's image acquisition time and data transmission between the sensor and the computer are not considered. The measurement program measured the time from when the required number of input frames of sensor data had been collected, through their conversion and trimming into a format suitable for the prediction model, to when the prediction results were obtained. The prediction model was trained using PyTorch (Footnote 1) and optimized with NVIDIA TensorRT (Footnote 2). The average prediction time over 100 data inputs was 0.0086 s, with a standard deviation of 0.0036 s. The maximum value was 0.022 s; even if this worst case is adopted, the time required for estimation is approximately 1.5% of the motion time of the collected data, which is considered sufficiently small. This result indicates that the proposed method requires approximately 0.5 s (for collecting motion data) + 0.0086 s (for estimating the target position) for a reaching motion that takes approximately 1.47 s on average. Therefore, the proposed system leaves approximately 0.96 s of grace time for the support robot.
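The latency measurement could be reproduced with a sketch such as the following, assuming a PyTorch model on a CUDA device; the TensorRT optimization step is omitted, and the inputs are random stand-ins shaped as in “Input frames”.

```python
import statistics
import time
import torch

def time_inference(model, inputs, n_runs=100, warmup=10):
    """Measure wall-clock inference time per forward pass (seconds)."""
    model.eval()
    times = []
    with torch.no_grad():
        for _ in range(warmup):                    # warm-up runs to exclude one-time setup costs
            model(*inputs)
        torch.cuda.synchronize()
        for _ in range(n_runs):
            start = time.perf_counter()
            model(*inputs)
            torch.cuda.synchronize()               # wait for the GPU before stopping the clock
            times.append(time.perf_counter() - start)
    print(f"mean={statistics.mean(times):.4f}s "
          f"std={statistics.stdev(times):.4f}s max={max(times):.4f}s")

# Example with random stand-in inputs (batch size 1), reusing the sketch from "Network structure":
# model = LateFusionClassifier().cuda()
# inputs = (torch.randn(1, 7, 1405).cuda(), torch.randn(1, 50, 11).cuda(),
#           torch.randn(1, 7, 1, 188, 256).cuda(), torch.randn(1, 7, 1).cuda())
# time_inference(model, inputs)
```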

Limitations and discussions

In this section, we elaborate on insights gained while conducting the data collection, creating the prediction model, and analyzing the results. Due to the COVID-19 pandemic, it was not possible to recruit a sufficient number and variety of participants. Whether the motions of people who have had a stroke or are of advanced age are the same as those of our participants should be considered in further investigations. In [4, 16], the authors conducted reaching-task experiments with mostly right-handed patients with hemiplegia and reported no difference in the motor function of the unaffected arm between left- and right-affected patients. However, compared with elderly individuals without paralysis, patients with paralysis exhibited inferior motor function even in the unaffected arm. Furthermore, [7] showed that arm motor function in healthy elderly individuals differs depending on arm dominance. Therefore, although it may be difficult to directly apply the proposed model or the data collected from healthy individuals in this study to a support system intended for elderly or hemiplegic users, the requirements for assistive systems identified through this study, along with the series of methods for model creation, remain useful.

Using the proposed model, it would be possible to realize a system in which support robots operate autonomously, triggered by the user's active movements, to support the completion of user tasks. In [14], the authors reported a correlation between physical activity and self-efficacy or life satisfaction in the elderly. On the other hand, it has also been reported that elderly individuals have psychological barriers to engaging in physical activity in the first place [10]. ADL support systems can encourage active movements in familiar activities, as in this task, reduce psychological barriers to physical activity, and promote more extensive social activities. To achieve this, the support system should appropriately assist users' activities in daily life, enhancing their self-efficacy and motivation for active engagement. For this purpose, one future challenge is to enable support at multiple levels based on the user's physical condition, from simple arm support to higher-level assistance such as directly retrieving objects and handing them to the user's extended hand.

As mentioned in “Evaluation and result”, it is impossible to eliminate the possibility of misclassification when the classification model is used in the wild. While it is important to improve model performance, it is equally crucial to build and operate a system that is robust against misclassification. The proposed model is expected to improve in accuracy as the number of input frames increases (i.e., as the motion progresses). Additionally, the results in Fig. 5 suggest that the proposed model achieves very high accuracy in the 3-class classification of columns (Left, Center, Right), with respective accuracies of 97.15%, 96.60%, and 93.47%. Hence, in the actual operation of support robots, the robot might, for instance, perform lateral shifts at the beginning of a movement and execute more detailed movements as time progresses, based on the prediction results. For interactions with people or objects, it would be practical to use proximity sensors mounted on the robot for precise positional adjustment.

However, it is difficult to say that the current model ensures sufficient time for actual support operations. As seen in “Estimation speed”, the time available for assistance is only about 0.96 s, which is clearly insufficient for a stationary robot to approach a user or target object and provide support. Therefore, to achieve our goal, we must solve problems from many directions, such as establishing a fast support method, developing a soft robot that accounts for possible collisions with humans or surrounding objects, and integrating these technologies, including those of this study. These are challenges for future work.

Conclusions

We proposed a novel scheme for constructing a reaching-position prediction model for reaching motions involving upper-arm lifting, which are part of activities of daily living (ADL), toward the development of a support robot. Based on the results of the motion collection experiment and its analysis, we developed a reaching-position prediction model using time-series face, motion, and depth features. The proposed model, designed for a support system that operates autonomously, triggered by the user's movements and using data from simple sensors, achieved an accuracy of 93% at 35% of the motion completion time. Using only 0.5 s of data, the model made predictions in approximately 0.0086 s of computation time. However, it is difficult to say that sufficient time has been secured for the operation of support robots, and the issue of misclassification needs to be resolved to deploy the classification model in the wild. In the future, along with improving prediction accuracy, we aim to develop support methods that are robust against misclassification and robots that support ADL in close contact with users, striving to realize the proposed system.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. PyTorch: https://pytorch.org.

  2. Nvidia TensorRT: https://developer.nvidia.com/tensorrt.

References

  1. Arai S, Pettersson AL, Hashimoto K (2020) Fast prediction of a worker’s reaching motion without a skeleton model (F-PREMO). IEEE Access 8:90340–90350


  2. Bandura A (1978) Self-efficacy: toward a unifying theory of behavioral change. Adv Behav Res Ther 1(4):139–161


  3. Cao Z, Hidalgo G, Simon T et al (2021) OpenPose: realtime Multi-Person 2D pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1):172–186


  4. Coderre AM, Zeid AA, Dukelow SP et al (2010) Assessment of upper-limb sensorimotor function of subacute stroke patients using visually guided reaching. Neurorehabil Neural Repair 24(6):528–541


  5. DelPreto J, Rus D (2019) Sharing the load: human-Robot team lifting using muscle activity. In: 2019 International Conference on Robotics and Automation (ICRA). IEEE, pp 7906–7912

  6. Gadzicki K, Khamsehashari R, Zetzsche C (2020) Early vs late fusion in multimodal convolutional neural networks. In: 2020 IEEE 23rd International Conference on Information Fusion (FUSION). IEEE, pp 1–6

  7. Heller A, Wade DT, Wood VA et al (1987) Arm function after stroke: measurement and recovery over the first three months. J Neurol Neurosurg Psychiatry 50(6):714–719


  8. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  9. Holman B, Anwar A, Singh A, et al (2021) Watch where you’re going! gaze and head orientation as predictors for social robot navigation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp 3553–3559

  10. Lee LL, Arthur A, Avis M (2008) Using self-efficacy theory to develop interventions that help older people overcome psychological barriers to physical activity: a discussion paper. Int J Nurs Stud 45(11):1690–1699


  11. Lugaresi C, Tang J, Nash H, et al (2019) MediaPipe: a framework for perceiving and processing reality. In: Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019

  12. Mainprice J, Hayne R, Berenson D (2015) Predicting human reaching motion in collaborative tasks using inverse optimal control and iterative re-planning. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), vol 2015-June. IEEE, pp 885–892

  13. Mou L, Zhou C, Zhao P et al (2021) Driver stress detection via multimodal fusion using attention-based CNN-LSTM. Expert Syst Appl 173:114693


  14. Phillips SM, Wójcicki TR, McAuley E (2013) Physical activity and quality of life in older adults: an 18-month panel analysis. Qual Life Res 22(7):1647–1654


  15. Quigley M, Conley K, Gerkey B, et al (2009) ROS: an open-source robot operating system. In: ICRA workshop on open source software

  16. Scott SH, Dukelow SP (2011) Potential of robots as next-generation technology for clinical assessment of neurological disorders and upper-limb therapy. J Rehabil Res Dev 48(4):335–354


  17. Sirintuna D, Giammarino A, Ajoudani A (2022) Human-Robot collaborative carrying of objects with unknown deformation characteristics. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp 10681–10687

  18. Stewart JC, Lewthwaite R, Rocktashel J et al (2019) Self-efficacy and reach performance in individuals with mild motor impairment due to stroke. Neurorehabil Neural Repair 33(4):319–328


  19. Sun L, Xu M, Lian Z, et al (2021) Multimodal emotion recognition and sentiment analysis via attention enhanced recurrent model. In: Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge. Association for Computing Machinery, New York, NY, USA, MuSe ’21, pp 15–20

  20. Tan ZX, Goel A, Nguyen TS, et al (2019) A multimodal LSTM for predicting listener empathic responses over time. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, pp 1–4

  21. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S et al (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc


  22. Wang H, Li X, Zhang X (2021) Multimodal human-robot interaction on service robot. IEEE Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) pp 2290–2295

  23. Wu M, Taetz B, Dickel Saraiva E, et al (2019) On-line motion prediction and adaptive control in Human-Robot handover tasks. In: 2019 IEEE International Conference on Advanced Robotics and its Social Impacts (ARSO), pp 1–6

  24. Yang B, Huang J, Chen X et al (2023) Natural grasp intention recognition based on gaze in Human-Robot interaction. IEEE J Biomed Health Inform 27(4):2059–2070


  25. Yuguchi A, Inoue T, Ricardez GAG, et al (2019) Real-Time gazed object identification with a variable point of view using a mobile service robot. In: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, pp 1–6

  26. Zadeh A, Zellers R, Pincus E et al (2016) Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell Syst 31(6):82–88



Acknowledgements

Not applicable.

Funding

This work was supported by JST [Moonshot R&D] [Grant Number JPMJMS2034].

Author information


Contributions

Y.T. developed the research concept, participated in the design and development of the method, and drafted the manuscript. K.Y. participated in the research design. All authors reviewed the results and approved the final version of the manuscript.

Corresponding author

Correspondence to Yutaka Takase.

Ethics declarations

Ethics approval and consent to participate

Ethical approval was not required as per institutional guidelines. All participants were informed about the purpose of the study, the anonymity and confidentiality of their results, and provided informed consent prior to participation.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Takase, Y., Yamazaki, K. Multimodal reaching-position prediction for ADL support using neural networks. Robomech J 11, 14 (2024). https://doi.org/10.1186/s40648-024-00282-2


Keywords