Open Access

Learning motion primitives and annotative texts from crowd-sourcing

ROBOMECH Journal 2015, 2:1

DOI: 10.1186/s40648-014-0022-7

Received: 18 August 2014

Accepted: 15 November 2014

Published: 20 January 2015


Humanoid robots are expected to be integrated into daily life, where a large variety of human actions and language expressions are observed. They need to learn the referential relations between actions and language, and to understand actions in the form of language, in order to communicate with human partners or to make inferences using language. Intensive research on imitation learning of human motions has been performed for robots that can recognize human activity and synthesize human-like motions, and this research has subsequently been extended to the integration of motions and language. This research aims at developing robots that understand human actions in the form of natural language. One difficulty comes from handling the large variety of words and sentences used in daily life, because it is too time-consuming for researchers to annotate human actions with various expressions. Recent developments in information and communication technology provide an efficient process of crowd-sourcing, in which many users are available to complete a large number of simple tasks. This paper proposes a novel concept of collecting a large training dataset of motions and their descriptive sentences, and of developing an intelligent framework that learns the relations between the motions and sentences. This framework enables humanoid robots to understand human actions in various forms of sentences. We tested it on recognition of human daily full-body motions and demonstrated its validity.


Motion primitives, Natural language, Crowd-sourcing


Robots are able to understand their surroundings by relying on senses supplied by their body, which they can then move to act on the environment. For some time, research has been conducted on imitation learning [1,2], where the bodily motions of humans are projected onto the bodily motions of humanoid robots and recorded as parameters of dynamical systems [3-6] or statistical models [7-10] while compressing the information. By using these models, it has become possible for robots to recognize human bodily motions and to generate their own natural human-like motions. However, in the motion recognition phase, the motion is classified into its specific model, and in the motion generation phase, a command specifying the model is given to a robot. More specifically, indices of the motion models, which are not understood by human partners, intervene in motion recognition and generation. Intermediate codes that can be intuitively understood by human partners are therefore required. Natural language can be a solution, and can facilitate intuitive interaction between humans and robots. Several approaches extend the motion models to language expressions, where robots understand human motions as text and can then generate bodily motions from text input [11,12]. Several models for a robot manipulating objects via linguistic instructions have been developed using neural networks [13,14], but the variation of the objects and actions is small. Our daily lives are overflowing with a huge variety of possible motions and expressions for describing them. Therefore, there is a need for humanoid robots to be able to adapt to this diversity.

In this study, I created a training dataset of motions and corresponding texts describing those motions by assigning a variety of text phrases to human bodily motions via crowdsourcing [15]. I then built an intelligent framework that can understand language expressing movement by learning the correspondence between bodily motions and language expressions via a statistical model. This technology for collecting and utilizing a massive amount of text expressions as training data is expected to form the foundation for intelligence that can adapt to a diversity of language expressions.


Motion annotations

The full-body motions of humans were measured by optical motion capture or wearable motion sensors. Position data at each point on the body were converted into motions of a computer-generated model character using inverse kinematic calculations. Videos of these motions were made viewable on the Internet. Figure 1 shows examples of frames from the videos.
Figure 1

The movies of human full-body motions are viewable on the internet.

The task of manually assigning a descriptive annotation to each motion video was carried out via crowdsourcing. In the annotation task, a video, a playback time, and a word representing the subject are presented. The user inputs descriptive text in English corresponding to the motion initiated by the given subject at the specified time. Using this task, a training dataset of motions and corresponding descriptive texts can thus be collected. In this study, the annotation task was openly available from our research laboratory’s website, as shown in Figure 2. Students and researchers from my department were allowed to annotate the motions so that appropriately assigned descriptive texts could be collected efficiently.
Figure 2

A video containing human activity is openly available. Any user can assign the descriptive annotation to the video on the Internet.

The task described above provides descriptive sentences and their corresponding times. This task does not provide a start point and an end point of a motion segment to which the descriptive sentence is assigned. I manually detected the start point and end point for each motion segment after the annotation task, and consequently obtained datasets of the motion segments and their descriptive sentences.

Learning motions and annotations

A human full-body motion is represented by a sequence of angles of all the joints. Each sequence is encoded into an HMM λ. An HMM is a statistical model used to classify input data into an appropriate category. An HMM is defined by the compact notation \(\lambda = \{Q, A, B, \Pi\}\), where \(Q = \{q_{1}, q_{2}, \cdots, q_{n}\}\) is the set of nodes, \(A = \{a_{ij}\}\) is the matrix whose entries \(a_{ij}\) are the probabilities of transitioning from the i-th node to the j-th node, B is the set of output probability density functions at the nodes, and \(\Pi = \{\pi_{1}, \pi_{2}, \cdots, \pi_{n}\}\) is the initial node distribution. In this study, the parameters of each HMM are optimized by the Baum–Welch algorithm using the corresponding sequence of joint angles. The Baum–Welch algorithm is one of the expectation maximization (EM) algorithms [16]. A motion can be classified into the HMM that is the most likely to generate it. The motion is thus expressed in the discrete form of the index of the HMM, and the HMM is hereinafter referred to as a “motion symbol”.
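The classification step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes that the HMM parameters have already been trained, that each node has a single diagonal-covariance Gaussian output, and it uses toy one-dimensional observations. The function names `left_to_right_hmm`, `log_likelihood`, and `classify` are hypothetical.

```python
import numpy as np

def left_to_right_hmm(means, var=1.0):
    """Build a toy left-to-right HMM (hypothetical helper, not a trained model)."""
    means = np.asarray(means, dtype=float)        # shape (n_nodes, dim)
    n = len(means)
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i] = A[i, i + 1] = 0.5               # stay, or move to the next node
    A[n - 1, n - 1] = 1.0
    pi = np.zeros(n)
    pi[0] = 1.0                                   # always start at the first node
    with np.errstate(divide="ignore"):            # log(0) = -inf is intended here
        return np.log(pi), np.log(A), means, np.full_like(means, var)

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_likelihood(seq, log_pi, log_A, means, variances):
    """Forward algorithm in log space: log P(seq | HMM)."""
    n = len(log_pi)
    log_b = np.array([[log_gaussian(x, means[q], variances[q]) for q in range(n)]
                      for x in seq])
    alpha = log_pi + log_b[0]
    for t in range(1, len(seq)):
        # alpha_new[j] = logsumexp_i(alpha[i] + log_A[i, j]) + log_b[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def classify(seq, hmms):
    """Return the index of the HMM that is most likely to generate seq."""
    return int(np.argmax([log_likelihood(seq, *h) for h in hmms]))

# Two toy motion symbols with different output means.
hmms = [left_to_right_hmm([[0.0], [1.0]]), left_to_right_hmm([[5.0], [6.0]])]
symbol = classify(np.array([[0.1], [0.9]]), hmms)
```

The returned index plays the role of the motion symbol; in the study each HMM has 30 nodes and the observations are joint-angle or body-position vectors rather than scalars.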

In the annotation task, a descriptive annotation is assigned to each motion symbol. Consequently, a training dataset of motion symbols and descriptive texts is collected. More specifically, each training sample is a pair of a motion symbol λ k and a descriptive sentence ω k , where the descriptive sentence is expressed by a sequence of l k words, \(\boldmath {\omega }_{k} = \left \{ {\omega ^{k}_{1}}, {\omega ^{k}_{2}}, \cdots, \omega ^{k}_{l_{k}} \right \}\). This paper proposes a statistical model that converts a motion symbol to descriptive sentences, as shown in Figure 3 [12]. This conversion results in understanding human full-body motion in the form of sentences. The statistical model consists of two modules. One module learns the probabilistic relations between a motion symbol λ and a word ω. This module is hereinafter referred to as the “motion language module”. The other module learns the probabilistic relations of transitions between two words in a sentence. This module is referred to as the “natural language module”.
Figure 3

Human annotators give descriptive sentences to the motions. Training datasets of the motions and sentences are used to model the referential structure between the motions and sentences, and the sentence structures are modeled using the sentences only. The models are applied to the interpretation of motions in the form of natural language.

Figure 4 shows an overview of the motion language module, which consists of three layers. The top layer includes motion symbols, the middle layer includes latent states, and the bottom layer includes words. A motion symbol generates a latent state, and a latent state generates a word.
Figure 4

A motion language module learns the probability of a word being generated from a motion symbol. This probability is calculated using the probability of a latent state being generated from the motion symbol and the probability of the word being generated from the latent state.

The association between the motion symbols and the words is represented by a generative model. The probabilistic relation between a motion symbol and a word is represented using the probability P(s|λ) that the motion symbol λ generates the latent state s, and the probability P(ω|s) that the latent state s generates the word ω. These probabilities are optimized such that the total probability that the motion symbols generate the words in the descriptive sentences in the training dataset is maximized. The logarithm of the total probability is written as
$$\begin{array}{*{20}l} \Phi(\theta) &= \log{\prod_{k} P\left({\omega^{k}_{1}}, {\omega^{k}_{2}}, \cdots, \omega^{k}_{l_{k}} | \lambda_{k}\right)} \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k}{\log{P\left({\omega^{k}_{1}}, {\omega^{k}_{2}}, \cdots, \omega^{k}_{l_{k}} | \lambda_{k}\right) }} \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k,i}{\log{P\left({\omega^{k}_{i}} | \lambda_{k}\right) }} \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k,i} \log{\sum_{j} {P\left({\omega^{k}_{i}}, s_{j} | \lambda_{k}\right)}} \end{array} $$
where θ is the set of the probabilities P(s|λ) and P(ω|s). I assume that the words are independent of each other and depend only on the motion symbol in the motion language module. Equation (2) can consequently be rewritten as Equation (3). The dependence relationship between two words is learned by the natural language module. The optimal θ is derived by iterative computation. Let \(\theta^{[t]}\) be the set θ derived at the t-th iteration. The probabilities P(ω,s|λ), P(s|λ), and P(ω|s) derived at the t-th iteration are rewritten as \(P(\omega, s | \lambda, \theta^{[t]})\), \(P(s | \lambda, \theta^{[t]})\), and \(P(\omega | s, \theta^{[t]})\), respectively. Equation (4) at the t-th iteration is rewritten as
$$\begin{array}{*{20}l} \Phi(\theta^{[t]}) &= \sum_{k,i} \log{\sum_{j} {P\left({\omega^{k}_{i}}, s_{j} | \lambda_{k}, \theta^{[t]}\right)} } \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k,i} \log{\sum_{j} {P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right) \frac{P\left({\omega^{k}_{i}}, s_{j} | \lambda_{k}, \theta^{[t]}\right)}{P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right)} } } \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k,i} \log{E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \frac{P\left({\omega^{k}_{i}}, s | \lambda_{k}, \theta^{[t]}\right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \right] }, \end{array} $$
where \(E_{P}[R]\) denotes the expected value of R under the distribution P. According to Jensen’s inequality, Equation (7) satisfies the following relation.
$$\begin{array}{*{20}l} \Phi\left(\theta^{[t]}\right) \ge \sum_{k,i} E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \log{\frac{P\left({\omega^{k}_{i}}, s | \lambda_{k}, \theta^{[t]}\right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)}} \right] \end{array} $$
Using Equation (3) and Equation (8), the following equations can be derived.
$$\begin{array}{*{20}l} & \log{P\left({\omega^{k}_{i}} | \lambda_{k}\right)} - E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \log{\frac{P\left({\omega^{k}_{i}}, s | \lambda_{k}, \theta^{[t]}\right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)}} \right] \end{array} $$
$$\begin{array}{*{20}l} &= E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \log{\frac{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}}, \theta^{[t]}\right)}} \right] \end{array} $$
$$\begin{array}{*{20}l} &= KL\left(P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right) || P\left(s | \lambda_{k}, {\omega^{k}_{i}}, \theta^{[t]}\right) \right). \end{array} $$
Equation (11) represents the Kullback–Leibler information, which measures the dissimilarity between the distributions \(P(s | \lambda _{k}, {\omega ^{k}_{i}}) \) and \( P(s | \lambda _{k}, {\omega ^{k}_{i}}, \theta ^{[t]})\). The Kullback–Leibler information becomes zero only when these two distributions are exactly the same, and takes a positive value otherwise. The difference between \(\Phi(\theta^{[t+1]})\) and \(\Phi(\theta^{[t]})\) is subsequently written as follows:
$$\begin{array}{*{20}l} \Delta \Phi &= \Phi\left(\theta^{[t+1]}\right) - \Phi\left(\theta^{[t]}\right)\\ &= \sum_{k,i} E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \log \frac{P\left({\omega^{k}_{i}}, s | \lambda_{k}, \theta^{[t+1]}\right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \right]\\ &- \sum_{k,i} E_{P\left(s | \lambda_{k}, {\omega^{k}_{i}} \right)} \left[ \log \frac{P\left({\omega^{k}_{i}}, s | \lambda_{k}, \theta^{[t]}\right)}{P\left(s | \lambda_{k}, {\omega^{k}_{i}}\right)} \right]\\ &+ \sum_{k,i} KL\left(P\left(s | \lambda_{k}, {\omega^{k}_{i}}\right) || P\left(s | \lambda_{k}, {\omega^{k}_{i}}, \theta^{[t+1]}\right) \right) \\ &- \sum_{k,i} KL\left(P\left(s | \lambda_{k}, {\omega^{k}_{i}}\right) || P\left(s | \lambda_{k}, {\omega^{k}_{i}}, \theta^{[t]}\right) \right) \end{array} $$
The distribution \( P(s | \lambda _{k}, {\omega ^{k}_{i}})\) is assumed to be estimated as \(P(s | \lambda _{k}, {\omega ^{k}_{i}}, \theta ^{[t]})\) based on the motion language module derived at the t-th iteration, so the third and fourth terms in Equation (12) take a non-negative value and zero, respectively. Hence, I only have to search for \(\theta^{[t+1]}\) such that the first term in Equation (12) becomes greater than the second term, because such an update increases the total probability Φ of the training data. More specifically, the first term only has to be maximized with respect to \(\theta^{[t+1]}\). Using the probabilities \(P(s | \lambda, \theta^{[t+1]})\) and \(P(\omega | s, \theta^{[t+1]})\), this maximization can be reduced as follows.
$$\begin{array}{*{20}l}{\kern15pt} &\arg \max_{\theta^{[t+1]}} \sum_{k,i,j} P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right) \left[ \log{P\left({\omega^{k}_{i}}, s_{j} | \lambda_{k}, \theta^{[t+1]}\right)}\right. \\ &\left.\qquad\qquad\qquad\qquad\qquad- \log{P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right)} \right] \end{array} $$
$$\begin{array}{*{20}l} &=\arg \max_{\theta^{[t+1]}} \sum_{k,i,j} P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right) \log{P\left({\omega^{k}_{i}}, s_{j} | \lambda_{k}, \theta^{[t+1]}\right)}\\ &=\arg \max_{\theta^{[t+1]}} \sum_{k,i,j} P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}} \right) \left[ \log{P\left({\omega^{k}_{i}} | s_{j}, \theta^{[t+1]}\right)}\right.\\ &\left.\qquad\qquad\qquad\qquad\qquad\qquad+\; \log{P\left(s_{j} | \lambda_{k}, \theta^{[t+1]}\right)} \right] \end{array} $$
where the terms independent of \(\theta^{[t+1]}\) are eliminated. The probabilities \(P(s | \lambda, \theta^{[t+1]})\) and \(P(\omega | s, \theta^{[t+1]})\) are constrained as follows:
$$\begin{array}{*{20}l} \sum_{j} P\left(s_{j} | \lambda, \theta^{[t+1]}\right) = 1 \end{array} $$
$$\begin{array}{*{20}l} \sum_{i} P\left(\omega_{i} | s, \theta^{[t+1]}\right) = 1 \end{array} $$
By applying the method of Lagrange multipliers to Equation (13), the probabilities \(P(s | \lambda, \theta^{[t+1]})\) and \(P(\omega | s, \theta^{[t+1]})\) at the (t+1)-th iteration can be analytically derived.
$$\begin{array}{*{20}l} P\left(s | \lambda_{k}, \theta^{[t+1]}\right) &= \frac{\displaystyle{\sum_{i}} P\left(s | \lambda_{k}, {\omega^{k}_{i}}\right) }{\displaystyle{\sum_{i,j}} P\left(s_{j} | \lambda_{k}, {\omega^{k}_{i}}\right)} \end{array} $$
$$\begin{array}{*{20}l} P\left(\omega_{i} | s, \theta^{[t+1]}\right) &= \frac{\displaystyle{\sum_{k}} n_{k,i} P\left(s | \lambda_{k}, \omega_{i}\right)} {\displaystyle{\sum_{k,i}} n_{k,i} P\left(s | \lambda_{k}, \omega_{i}\right)} \end{array} $$

where \(n_{k,i}\) is the number of times that the word \(\omega_{i}\) appears in the sentence \(\boldmath{\omega}_{k}\) assigned to the motion symbol \(\lambda_{k}\). Note that \(\omega_{i}\) denotes the i-th word in the set of words, and \({\omega ^{k}_{i}}\) denotes the word at the i-th position in the sentence assigned to the k-th motion symbol. The processes described above are iterated, and consequently the optimal probabilities P(s|λ) and P(ω|s) can be derived.
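The iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code used in the study: the posterior \(P(s_{j} | \lambda_{k}, \omega_{i})\) is computed from the current parameters by Bayes' rule, the training data are summarized by a hypothetical count matrix `counts[k, i]` holding \(n_{k,i}\), every motion symbol is assumed to have at least one word, and the function name `em_motion_language` is made up for this example.

```python
import numpy as np

def em_motion_language(counts, n_states, n_iter=50, seed=0):
    """EM for P(s | lambda) and P(omega | s) from word counts n[k, i]."""
    rng = np.random.default_rng(seed)
    n_motions, n_words = counts.shape
    # Random positive initialization, normalized to satisfy the sum-to-one constraints.
    p_s_l = rng.random((n_motions, n_states))
    p_s_l /= p_s_l.sum(axis=1, keepdims=True)
    p_w_s = rng.random((n_states, n_words))
    p_w_s /= p_w_s.sum(axis=1, keepdims=True)
    tiny = 1e-300                                   # guards against 0/0 and log(0)
    history = []
    for _ in range(n_iter):
        # Joint P(omega_i, s_j | lambda_k) with shape (motions, states, words).
        joint = p_s_l[:, :, None] * p_w_s[None, :, :]
        p_w_l = joint.sum(axis=1)                   # P(omega_i | lambda_k)
        history.append(float(np.sum(counts * np.log(p_w_l + tiny))))
        # E-step: posterior P(s_j | lambda_k, omega_i) under the current parameters.
        post = joint / (p_w_l[:, None, :] + tiny)
        # M-step: count-weighted posteriors, renormalized.
        weighted = counts[:, None, :] * post
        p_s_l = weighted.sum(axis=2)
        p_s_l /= p_s_l.sum(axis=1, keepdims=True)
        p_w_s = weighted.sum(axis=0)
        p_w_s /= p_w_s.sum(axis=1, keepdims=True)
    return p_s_l, p_w_s, history

# Toy data: 3 motion symbols, 4 vocabulary words (illustrative counts only).
counts = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0],
                   [2.0, 2.0, 0.0, 0.0]])
p_s_l, p_w_s, history = em_motion_language(counts, n_states=2, n_iter=30)
```

The list `history` records the log-probability Φ before each update, which should not decrease from one iteration to the next; the number of latent states is a free choice here because the text does not state the value used.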

Figure 5 shows an overview of the natural language module. This module extracts the probability π(ω) of starting at the word ω and the probability P(ω j |ω i ) of transitioning from the word ω i to the word ω j using a training dataset of sentences assigned to the motion symbols.
Figure 5

A natural language module learns the probability of transitioning between words in sentences.

The probabilities π(ω i ) and P(ω j |ω i ) are optimized such that the probability that the natural language module generates the training sentences is maximized. The logarithm of this probability is expressed by
$$\begin{array}{*{20}l} \Psi\left(\vartheta\right) &= \sum_{k} \log{P\left(\boldmath{\omega}_{k}\right)} \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{k} \log{\pi\left({\omega^{k}_{1}}\right) } + \sum_{k,i} \log{P\left(\omega^{k}_{i+1} | {\omega^{k}_{i}}\right)}, \end{array} $$
where 𝜗 is a set of probabilities π(ω) and P(ω j |ω i ). The optimal 𝜗 can be analytically derived as follows.
$$\begin{array}{*{20}l} \pi(\omega) &= \frac{c\left(\omega\right)}{\displaystyle{\sum_{i}} c\left(\omega_{i}\right)} \end{array} $$
$$\begin{array}{*{20}l} P\left(\omega_{j} | \omega_{i}\right) &= \frac{c\left(\omega_{i}, \omega_{j}\right)}{\displaystyle{\sum_{j}} c\left(\omega_{i}, \omega_{j}\right)} \end{array} $$

where c(ω) is the frequency of the sentence starting at the word ω, and c(ω i ,ω j ) is the frequency of transitions from the word ω i to the word ω j .
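The two count-based estimates above can be sketched as follows. The function name `train_bigram` and the whitespace tokenization are assumptions made for this example, and the four training sentences are taken from the samples in Table 1.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate pi(omega) and P(omega_j | omega_i) by relative frequencies."""
    start_counts = Counter()                 # c(omega): sentences starting at omega
    trans_counts = defaultdict(Counter)      # c(omega_i, omega_j): observed transitions
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        start_counts[words[0]] += 1
        for prev, nxt in zip(words, words[1:]):
            trans_counts[prev][nxt] += 1
    total = sum(start_counts.values())
    pi = {w: c / total for w, c in start_counts.items()}
    trans = {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
             for prev, nxts in trans_counts.items()}
    return pi, trans

sentences = ["a person is sitting", "a person sits back",
             "he is drinking", "he is reading"]
pi, trans = train_bigram(sentences)
# pi["a"] and pi["he"] are both 0.5; trans["person"] splits evenly between "is" and "sits".
```

Because the estimates are plain relative frequencies, any transition that never occurs in the training sentences receives probability zero; no smoothing is applied in this sketch.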

The conversion from the motion symbol \(\lambda _{\mathcal {R}}\) to its descriptive sentences \(\boldmath {\omega }_{\mathcal {R}}\) can be treated as the problem of searching for the sentences that are most likely to be generated by the motion symbols. This problem is expressed as follows:
$$\begin{array}{*{20}l} {\boldmath \omega}_{\mathcal{R}} &= \arg \max_{\hat{{\boldmath \omega }}} P\left(\hat{{\boldmath \omega }} | \lambda_{\mathcal {R}}\right) \end{array} $$
$$\begin{array}{*{20}l} &= \arg \max_{\hat{{\boldmath \omega }}} P\left(\hat{\omega}_{1}, \hat{\omega}_{2}, \cdots, \hat{\omega}_{l} | \lambda_{\mathcal{R}}\right) P\left(\hat{{\boldmath \omega}} | \hat{\omega}_{1}, \hat{\omega}_{2}, \cdots, \hat{\omega}_{l}\right) \end{array} $$
where \(P(\hat {\omega }_{1}, \hat {\omega }_{2}, \cdots, \hat {\omega }_{l} | \lambda _{\mathcal {R}})\) is the probability that the motion language module generates a set of words \(\left \{ \hat {\omega }_{1}, \hat {\omega }_{2}, \cdots, \hat {\omega }_{l} \right \}\) from the motion symbol \(\lambda _{\mathcal {R}}\), and \(P(\hat {\boldmath {\omega }} | \hat {\omega }_{1}, \hat {\omega }_{2}, \cdots, \hat {\omega }_{l})\) is the probability that the natural language module arranges the set of words \(\left \{ \hat {\omega }_{1}, \hat {\omega }_{2}, \cdots, \hat {\omega }_{l} \right \}\) into the sentence \(\hat {\boldmath {\omega }}\). Therefore, these two probabilities can be written using the probabilities defining the motion language module and the natural language module.
$$\begin{array}{*{20}l} P\left(\hat{\omega}_{1}, \hat{\omega}_{2}, \cdots, \hat{\omega}_{l} | \lambda_{\mathcal{R}}\right) &= \prod_{i} P\left(\hat{\omega_{i}}| \lambda_{\mathcal{R}}\right) \end{array} $$
$$\begin{array}{*{20}l} P\left(\hat{{\boldmath \omega }} | \hat{\omega}_{1}, \hat{\omega}_{2}, \cdots, \hat{\omega}_{l}\right) &=\pi\left(\hat{\omega}_{1}\right)\prod_{i} P\left(\hat{\omega}_{i+1} | \hat{\omega}_{i}\right) \end{array} $$
where \( P(\hat {\omega _{i}}| \lambda _{\mathcal {R}})\) can be calculated as \(\sum _{j} P(\hat {\omega _{i}}|s_{j})P(s_{j}| \lambda _{\mathcal {R}})\). Substituting Equation (24) and Equation (25) into Equation (23) and taking the logarithm of it, Equation (23) can be reduced to the following equation.
$$ \begin{aligned} {\boldmath \omega }_{\mathcal{R}} = \arg \max_{\hat{{\boldmath \omega }}} &\left[ \sum_{i} \log{P\left(\hat{\omega_{i}}| \lambda_{\mathcal{R}}\right)} + \log{\pi\left(\hat{\omega}_{1}\right)}\right.\\ &\left. + \sum_{i} \log{P\left(\hat{\omega}_{i+1} | \hat{\omega}_{i}\right)} \right] \end{aligned} $$

Equation (26) can be efficiently solved using Dijkstra’s algorithm.
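One way to set up this search is sketched below. It is only an illustration under several assumptions that the text does not specify: the sentence length l is fixed in advance, a search state is the pair (number of words so far, last word), each edge cost is the negated logarithm of the corresponding probabilities (so all costs are non-negative, as Dijkstra's algorithm requires), and only the single best sentence is returned. The function name `best_sentence` and the toy probabilities are hypothetical.

```python
import heapq
import math

def best_sentence(p_word_given_motion, pi, trans, length):
    """Find the fixed-length word sequence maximizing the log objective."""
    def cost(p):
        return -math.log(p)                  # non-negative because 0 < p <= 1

    heap = []
    for word, p_start in pi.items():
        p_word = p_word_given_motion.get(word, 0.0)
        if p_start > 0 and p_word > 0:
            heapq.heappush(heap, (cost(p_start) + cost(p_word), (word,)))
    settled = set()
    while heap:
        total, words = heapq.heappop(heap)
        state = (len(words), words[-1])
        if state in settled:                 # a cheaper path to this state was expanded
            continue
        settled.add(state)
        if len(words) == length:             # first goal state popped is optimal
            return list(words), -total
        for nxt, p_trans in trans.get(words[-1], {}).items():
            p_word = p_word_given_motion.get(nxt, 0.0)
            if p_trans > 0 and p_word > 0:
                heapq.heappush(heap, (total + cost(p_trans) + cost(p_word),
                                      words + (nxt,)))
    return None, float("-inf")

# Toy probabilities (illustrative values, not results from the study).
p_word_given_motion = {"he": 0.3, "is": 0.3, "drinking": 0.3, "reading": 0.05, "a": 0.05}
pi = {"he": 0.5, "a": 0.5}
trans = {"he": {"is": 1.0}, "is": {"drinking": 0.5, "reading": 0.5}, "a": {"person": 1.0}}
words, log_prob = best_sentence(p_word_given_motion, pi, trans, length=3)
```

With these toy values the search returns the words "he", "is", "drinking". Producing several ranked candidates, as in the experiment, would require continuing the search beyond the first goal state or using a k-best variant, which this sketch does not attempt.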

Results and discussion


An experiment on the conversion from human full-body motions to descriptive sentences was conducted using the proposed statistical framework. The full-body motions were measured using an inertial motion capture system in which 17 IMU sensors were attached to a human performer. This measurement was conducted with the approval of the ethical committee of the University of Tokyo. The positions of 34 selected body parts in the trunk coordinate system were derived via kinematic computation using a human figure model with 34 degrees of freedom. Each measured motion segment was encoded into an HMM. Each HMM consists of 30 nodes, each of which has one Gaussian distribution, and the type of node connection is left-to-right. A descriptive sentence was manually assigned to each HMM via crowdsourcing. In this study, the full-body motions of one performer were measured while working at the office or giving a lecture, and 621 motion symbols, to each of which a sentence was assigned by five users, were subsequently collected. The number of different words used in the descriptive sentences was 419. Table 1 shows sample parts of the training dataset of motions and their descriptive sentences.
Table 1

Motions λ and Annotations ω in the training dataset

a person is sitting
a person is sitting
a performer is sitting
a person is working at his desk
a performer is working at his desk
a person is sitting in a chair
a person is reaching out a hand
a person sits back
a performer sits back
a person crosses his right leg over the left
he crosses his right leg over the left
a performer crosses his right leg over the left
a person crosses his right leg over the left
a person is operating a computer with his legs crossed
a person is sitting in a chair
a performer is sitting with his legs crossed
a person sits down
a professor sits down
a person is sitting in a chair
a performer is sitting in a chair
he scratches his shoulder
he is reading
he is relaxed
he concentrates on reading
he concentrates
he puts down his book
he puts down
he is crossing his left leg
he is reading
a person is sittiing down
he is writing on a blackboard
he is checking
he is walking
he is checking his notebook
he is writing on a blackboard
he is looking at students
he is teaching
he is writing on a blackboard
he is pointing out
he is explaining
he plants his arm on his chin
he plants his arm on a table
he is drinking
he is drinking
he puts down something
he is resting
he puts his hands on a table
he drinks
he is studying
he is crossing his arms

After learning the motion language module and the natural language module using the training dataset shown in Table 1, the proposed framework was tested on 100 different human full-body motions. Each motion was converted to the five descriptive sentences that are most likely to be generated by both the motion language module and the natural language module. Figure 6 shows the experimental results of the conversion from full-body motions to sentences, where sentences containing fewer than three words were removed from the candidates. The motion “sitting” was converted into the sentences “a person sits”, “a person sits down”, “a person sits back” and “he sits down”. The motion “drinking” was converted into the sentences “a person is drinking” and “he is drinking”. These sentences were confirmed to correctly represent the full-body motions. The motion “sitting with legs crossed” was correctly converted into the sentence “he is sitting”, but it was wrongly converted into another sentence, “he is sitting with his legs”. Additionally, it was correctly converted into the long sentence “he is sitting with his legs crossed”, which was ranked lower than the wrong sentence “he is sitting with his legs”. The motion “writing on a blackboard” was converted into the correct sentence “he is writing”, and into the wrong sentences “he is writing on” and “he is writing a blackboard”, which are close to the correct sentence “he is writing on a blackboard”. Several wrong sentences are terminated at inappropriate words, and longer sentences are unlikely to be generated. The natural language module needs to be extended to word trigrams so that it represents the relations among words that are distant from each other in the sentences, and the conversion from the motion to the sentence, expressed by Equation (26), should be modified to take into account the length of sentences.
Figure 6

A human full-body motion is converted to multiple descriptive sentences. This conversion makes it possible for a humanoid robot to understand its observation of human motion in the form of sentences.

I also quantitatively evaluated the conversion from motions to sentences. Five users assigned a descriptive sentence to each test motion. The performer and users in this test phase were the same as those in the learning phase. A test motion was counted as correct if one of its candidate sentences was exactly the same as the sentence assigned to that test motion. The accuracy of the conversion was computed as the ratio of the correct motions to the test motions. The number of candidate sentences was varied. When the number of candidate sentences was set to 1, the accuracy of the conversion was 0.34. When it was set to 2, the accuracy reached 0.59. Three, four, and five candidate sentences resulted in accuracies of 0.64, 0.68 and 0.71, respectively.
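The accuracy measure described above can be sketched as follows. The function name `top_n_accuracy` and the toy inputs are hypothetical, and the sketch assumes that a motion counts as correct when any of its first n candidates exactly matches any sentence in the set of sentences assigned to that motion.

```python
def top_n_accuracy(candidates, references, n):
    """Ratio of test motions whose top-n candidates contain an assigned sentence.

    candidates: one ranked list of generated sentences per test motion.
    references: one set of user-assigned sentences per test motion.
    """
    correct = 0
    for ranked, assigned in zip(candidates, references):
        if any(sentence in assigned for sentence in ranked[:n]):
            correct += 1
    return correct / len(candidates)

# Toy example with two test motions (illustrative data, not the experimental data).
candidates = [["he is drinking", "a person is drinking"],
              ["he is sitting with his legs", "he is sitting"]]
references = [{"he is drinking"}, {"he is sitting"}]
accuracy_at_1 = top_n_accuracy(candidates, references, n=1)
accuracy_at_2 = top_n_accuracy(candidates, references, n=2)
```

In this toy case the second motion is only matched by its second-ranked candidate, so the accuracy rises when n goes from 1 to 2, which mirrors the trend reported for the experiment.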


The contributions of this paper are summarized as follows.
  1. This paper proposes a novel scheme of collecting a training dataset of human full-body motions and their descriptive sentences via crowdsourcing. Videos containing human activity are made viewable on the Internet. The task of assigning descriptive annotations to the videos is designed. The task is openly available, and can be carried out by any user. Through this simple task, a training dataset of motions and corresponding descriptive sentences can be collected. In this study, there are 621 motions and descriptive sentences with 419 different words in the training dataset.

  2. This paper proposes a statistical framework to convert a full-body motion to multiple descriptive sentences. This framework consists of two modules: the motion language module and the natural language module. The motion language module statistically learns the association between motions and words, and the natural language module learns the transitions between two words in the sentences. The integration of these two modules enables a humanoid robot to convert a human full-body motion to its descriptive sentences.

  3. The experiment on the conversion from human full-body motions to sentences was conducted using the dataset of motions and descriptive annotations derived via crowdsourcing. I varied the number of candidate sentences converted from each motion. Conversion accuracies of 0.34, 0.59, 0.64, 0.68 and 0.71 were obtained from one, two, three, four and five candidate sentences, respectively. The experiment shows that the full-body motions are converted to correct descriptive sentences, and demonstrates the validity of the proposed statistical framework for the conversion of motions to sentences. Additionally, I found two limitations: a long sentence is unlikely to be generated, and many sentences are terminated at the wrong words.




This research was supported by Grant-in-Aid for Young Scientists (A) (26700021), Japan Society for the Promotion of Science.

Authors’ Affiliations

The Univ. of Tokyo, Bunkyoku Hongo


  1. Breazeal C, Scassellati B (2002) Robots that imitate humans. Trends Cognitive Sci 6(11): 481–487.
  2. Argall B, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Autonomous Syst 57(5): 469–483.
  3. Okada M, Tatani K, Nakamura Y (2002) Polynomial design of the nonlinear dynamics for the brain-like information processing of whole body motion In: Proceedings of the IEEE International Conference on Robotics and Automation, 1410–1415.
  4. Ijspeert AJ, Nakanishi J, Schaal S (2003) Learning control policies for movement imitation and movement recognition. Neural Inf Process Syst 15: 1547–1554.
  5. Kadone H, Nakamura Y (2005) Symbolic memory for humanoid robots using hierarchical bifurcations of attractors in nonmonotonic neural networks In: Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2900–2905.
  6. Ito M, Noda K, Hoshino Y, Tani J (2006) Dynamic and interactive generation of object handing behaviors by a small humanoid robot using a dynamic neural network model. Neural Netw 19(3): 323–337.
  7. Inamura T, Toshima I, Tanie H, Nakamura Y (2004) Embodied symbol emergence based on mimesis theory. Intl J Robot Res 23(4): 363–377.
  8. Asfour T, Gyarfas F, Azad P, Dillmann R (2006) Imitation learning of dual-arm manipulation task in humanoid robots In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots, 40–47.
  9. Billard A, Calinon S, Guenter F (2006) Discriminative and adaptive imitation in unimanual and bi-manual tasks. Robot Autonomous Syst 54: 370–384.
  10. Kulic D, Takano W, Nakamura Y (2008) Incremental learning, clustering and hierarchy formation of whole body motion patterns using adaptive hidden Markov chains. Intl J Robot Res 27(7): 761–784.
  11. Takano W, Yamane K, Nakamura Y (2007) Capture database through symbolization, recognition and generation of motion patterns In: Proceedings of the IEEE International Conference on Robotics and Automation, 3092–3097.
  12. Takano W, Nakamura Y (2008) Integrating whole body motion primitives and natural language for humanoid robots In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots, 708–713.
  13. Tuci E, Ferrauto T, Zeschel A, Massera G, Nolfi S (2011) An experiment on behavior generalization and the emergence of linguistic compositionality in evolving robots. IEEE Trans Autonomous Mental Dev 2(2): 176–189.
  14. Tuci E, Ferrauto T, Zeschel A, Massera G, Nolfi S (2010) The facilitatory role of linguistic instructions on developing manipulation skills. IEEE Comput Intell Mag 5(3): 33–42.
  15. Howe J (2006) The Rise of Crowdsourcing. Wired Magazine 14(6).
  16. Rabiner L, Juang BH (1993) Fundamentals of speech recognition In: Prentice Hall Signal Processing Series.


© Takano; licensee Springer. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.