Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures


Background and problem statement

Model-free or learning-based control, in particular reinforcement learning (RL), is expected to be applied to complex robotic tasks. Traditional RL requires that the policy to be optimized is state-dependent, which means that the policy is a kind of feedback (FB) controller. Because such a FB controller requires correct state observation, it is sensitive to sensing failures. To alleviate this drawback of FB controllers, feedback error learning integrates one of them with a feedforward (FF) controller. RL can be improved by dealing with the FB/FF policies, but to the best of our knowledge, a methodology for learning them in a unified manner has not been developed.


In this paper, we propose a new optimization problem for optimizing both the FB/FF policies simultaneously. Inspired by control as inference, the proposed optimization problem considers minimization/maximization of divergences between trajectories: one is predicted by the composed policy and a stochastic dynamics model, and the others are inferred as optimal/non-optimal ones. By approximating the stochastic dynamics model using a variational method, we naturally derive a regularization between the FB/FF policies. In numerical simulations and a robot experiment, we verified that the proposed method can stably optimize the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that the FF policy is robust to sensing failures and can hold the optimal motion.


In the last decade, the tasks (or objects) required of robots have become steadily more complex. For such next-generation robot control problems, traditional model-based control like [1] seems to be reaching its limit due to the difficulty of modeling complex systems. In recent years, model-free or learning-based control like [2] has been expected to resolve these problems. In particular, reinforcement learning (RL) [3] is one of the most promising approaches to this end, and indeed, RL integrated with deep neural networks [4], so-called deep RL [5], has achieved several complex tasks: e.g. human–robot interaction [6]; manipulation of deformable objects [7]; and manipulation of various general objects from scratch [8].

In principle, RL makes an agent optimize a policy (a.k.a. controller) that stochastically samples an action (a.k.a. control input) depending on the state, which results from the interaction between the agent and the environment [3]. Generally speaking, therefore, the policy to be optimized can be regarded as one of the feedback (FB) controllers. Of course, the policy is more conceptual and general than traditional FB controllers such as those for regulation and tracking, but it is still a mapping from state to action.

Such a FB policy inherits the drawbacks of the traditional FB controllers, i.e. the sensitivity to sensing failures [9]. For example, if the robot has a camera to detect an object, the pose of which is given as the state in RL, the FB policy would sample an erroneous action according to a wrong pose caused by occlusion. Alternatively, if the robot system is connected to a wireless TCP/IP network to sense data from IoT devices, communication loss or delay due to poor signal conditions will occur at irregular intervals, causing erroneous actions.

To alleviate this fundamental problem of the FB policy, filtering techniques have often been integrated with the FB controllers. Famous examples (e.g. in aircraft) use redundant sensor and/or communication systems to select the normal signals and ignore the wrong signals in order to be robust to the sensing failures [10, 11]. In addition, the Kalman filter, the most popular filtering methodology, relies on a state-space model that can predict the next observation and can replace the sensed values with the predicted ones during sensing failures [12, 13]. Although the state-space model is not given in RL, recent developments in deep learning technology would make it possible to acquire it in a data-driven manner [14].

In contrast to the above input processing, previous studies have developed the policies that do not depend only on state. In a straightforward way, time-dependent policy has been proposed by directly adding the elapsed time to state [15] or by utilizing recurrent neural networks (RNNs) [16, 17] for approximation of that policy [18]. If the policy is computed according to the phase and spectrum information of the system, instantaneous sensing failures would be ignored [19, 20]. In an extreme case, if the robot learns to episodically generate the trajectory, the adaptive behavior to state is completely lost, but it is never affected by the sensing failures. We focus on these approaches as the output processing.

From the perspective of traditional control theory and biology, it has been suggested that this problem of the FB policy can be resolved by a feedforward (FF) policy with feedback error learning (FEL) [9, 21,22,23], which can also be regarded as output processing. FEL is a framework in which the FF controller is updated based on the error signal of the FB controller, and finally the control objective is achieved only by the FF controller. In other words, instead of designing only a single policy as in the previous studies above, FEL has both the FB/FF policies in the system and composes their outputs appropriately to complement each other’s shortcomings: the sensitivity to the sensing failures in the FB policy; and the lack of adaptability to the change of state in the FF policy. The two separated policies are more compact than the integrated one. In addition, although the composition of the outputs in the previous studies is a simple summation, this structure leaves room for designing different composition rules, which makes it easier for designers to adjust which of the FB/FF policies is preferred.

The purpose of this study is to bring the benefits of FEL into the RL framework, as shown in Fig. 1. To this end, we have to solve the following two challenges.

  1. Since RL is not only for the tracking problem, which is the target of FEL, we need to design how to compose the FB/FF policies.

  2. Since the FB policy is not fixed, unlike in FEL, both of the FB/FF policies are required to be optimized simultaneously.

Fig. 1

Proposed RL framework: it contains both the FB/FF policies in parallel; policies outputted from them are composed to sample action; according to reward, both the FB/FF policies are optimized in a unified manner; with the appropriate combination of the FB/FF policies, this framework is expected to achieve both robustness to sensing failures and adaptiveness to changes of state

For the first challenge, we assume that the composed policy is designed as a mixture distribution of the FB/FF policies, since an RL policy is stochastically defined. A similar approach is to weight each policy according to its corresponding value function, as in the literature [24]. However, in the proposed framework, this method cannot be adopted because the FB/FF policies are learned with a common value function. Therefore, we heuristically design the mixture ratio depending on the confidences of the respective FB/FF policies so that the more confident policy is prioritized. As a specific implementation of the confidence, this paper uses the negative entropy of each probability distribution.

For the second challenge, inspired by control as inference [25, 26], we derive a new optimization problem to minimize/maximize the divergences between trajectories: one is predicted by the composed policy and a stochastic dynamics model, and the others are inferred as optimal/non-optimal ones. Furthermore, by designing the stochastic dynamics model with variational approximation [27], we heuristically find that a regularization between the FB/FF policies is obtained. With this regularization, we expect that the skill of the FB policy, which can be optimized faster than the FF policy, will be transferred into the FF policy.

To verify that the proposed method can optimize the FB/FF policies in a unified manner, we conduct numerical simulations for statistical evaluation and a robot experiment as a demonstration. Through the numerical simulations, we show the capability of the proposed method, namely, stable optimization of the composed policy even with a learning law different from that of traditional RL. However, the proposed method occasionally fails to learn the optimal policy. We attribute this failure to extreme updates of the FF policy (or RNNs) in the wrong direction. In addition, after training in the robot experiment, we clarify the value of the proposed method: the optimized FF policy robustly samples valuable actions under sensing failures even when the FB policy fails to achieve the optimal behavior.


Reinforcement learning

In RL [3], a Markov decision process (MDP) is assumed, as shown on the left of Fig. 2. Specifically, an agent interacts with an unknown environment using action \(a \in {\mathcal {A}}\) sampled from policy \(\pi\). The environment returns the result of the interaction as state \(s \in {\mathcal {S}}\) (or the next state \(s^\prime\)) and evaluates it according to the reward function \(r(s, a) \in {\mathbb {R}}\), which represents the degree of accomplishment of the desired task. Here, \(s^\prime\) is sampled from the black-box state transition probability of the environment, \(s^\prime \sim p_e(s^\prime \mid s, a)\) (and \(s \leftarrow s^\prime\)). In this case, the policy \(\pi\) can theoretically be given as a probability conditioned only on s (i.e. a stochastic FB controller), \(\pi (a \mid s)\). The optimization problem of RL is to find the optimal policy \(\pi ^*\) that maximizes the sum of future rewards from the current time t (called the return), defined as \(R_t = \sum _{k=0}^\infty \gamma ^k r_{t+k}\) with discount factor \(\gamma \in [0, 1)\).
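For illustration, the return defined above can be computed as in the following minimal Python sketch. It assumes a finite reward sequence (i.e. a truncation of the infinite sum), and the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_k gamma^k * r_{t+k} for a finite list of rewards.

    Assumption: the infinite sum in the text is truncated at len(rewards).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so that each step is R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `rewards = [1.0, 1.0, 1.0]` and `gamma = 0.5`, the sum is 1 + 0.5 + 0.25.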

Fig. 2

Loop of RL with sensing failures: in general RL (left), an agent interacts with environment by action sampled from policy depending on the current state; according to state transition probability, the new state is observed with related reward; however, in practice (right), state observation is probably with risk of sensing failures like occlusion and packet loss

However, in practical use, the state from the environment must be observed using internal/external sensors, and the measurement of state suffers from delay (e.g. due to overload in the communication networks) and/or loss (e.g. occlusion in camera sensors), as suggested on the right of Fig. 2. With these sensing failures, \(\pi (a \mid s)\) is no longer sufficient to acquire the task represented by the reward function because the measured (and lost/delayed) state does not satisfy the MDP assumption. To solve this problem, this paper proposes a new method to optimize the FB/FF policies in a unified manner by formulating them without necessarily requiring MDP.

In the conventional RL under MDP, the expected value of R is functionalized as the (state) value function V(s) and the (state-)action value function Q(s, a), and V can be learned by the following equations.

$$\begin{aligned} \delta&= Q(s,a) - V(s) \simeq r(s, a) + \gamma V(s^\prime ) - V(s) \end{aligned}$$
$$\begin{aligned} {\mathcal {L}}_{\mathrm{value}}&= \frac{1}{2}\delta ^2 \end{aligned}$$

Note that Q can also be learned with the similar equation, although we do not use Q directly in this paper.

Based on \(\delta\), an actor-critic algorithm [28] updates \(\pi\) according to the following policy gradient.

$$\begin{aligned} \nabla {\mathcal {L}}_\pi = - {\mathbb {E}}_{p_e \pi } [\delta \nabla \ln \pi (a \mid s)] \end{aligned}$$

where \({\mathbb {E}}_{p_e \pi } [\cdot ]\) is approximated by Monte Carlo method.
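The quantities in Eqs. (1)–(3) can be sketched in plain Python as follows. This is only an illustrative sketch: it assumes a one-dimensional Gaussian policy with fixed standard deviation, for which the score with respect to the mean is \((a - \mu )/\sigma ^2\), and all function names are hypothetical.

```python
def td_error(r, v_s, v_next, gamma):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s


def value_loss(delta):
    """Squared TD-error loss: 0.5 * delta^2."""
    return 0.5 * delta ** 2


def policy_loss_grad_mean(samples, mu, sigma):
    """Monte Carlo estimate of d L_pi / d mu.

    Assumption: 1-D Gaussian policy N(mu, sigma^2) with fixed sigma, so
    d ln pi(a) / d mu = (a - mu) / sigma^2.
    `samples` is a list of (delta, a) pairs collected from interaction.
    """
    scores = [delta * (a - mu) / sigma ** 2 for delta, a in samples]
    # The leading minus sign follows the loss gradient in the text.
    return -sum(scores) / len(scores)
```

A negative gradient with respect to the mean indicates that gradient descent moves the mean toward actions with positive \(\delta\).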

Introduction of optimality variable in control as inference

Recently, RL has come to be regarded as an inference problem, so-called control as inference [25]. This extended interpretation introduces an optimality variable, \(o \in \{0, 1\}\), which represents whether a pair of s and a is optimal (\(o = 1\)) or not (\(o = 0\)). Since it is defined as a random variable, the probability of \(o = 1\), \(p(o=1 \mid s, a)\), is parameterized by reward r to connect the conventional RL with this interpretation.

$$\begin{aligned} p(o = 1 \mid s, a) = \exp \left( \frac{r(s, a) - c} \tau \right) \end{aligned}$$

where \(c = \max (r)\) to satisfy \(e^{r(s, a) - c} \le 1\), and \(\tau\) denotes the hyperparameter to clarify uncertainty, and can be adaptively tuned.
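As a small numerical illustration of Eq. (4), the following sketch maps a reward to the probability of optimality. The function name and the argument `r_max` (standing for c) are hypothetical, and the maximum reward is assumed to be known.

```python
import math


def optimality_prob(r, r_max, tau):
    """p(o = 1 | s, a) = exp((r - c) / tau) with c = r_max."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    if r > r_max:
        raise ValueError("r must not exceed r_max, otherwise p > 1")
    return math.exp((r - r_max) / tau)
```

The maximum reward yields probability one, and a larger \(\tau\) flattens the probabilities, i.e. expresses more uncertainty about optimality.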

Furthermore, supposing the optimality in the future as \(O \in \{0, 1\}\), the following formulations can be defined with the value functions.

$$\begin{aligned} p(O = 1 \mid s)&= \exp \left( \frac{V(s) - C}{\tau } \right) \end{aligned}$$
$$\begin{aligned} p(O = 1 \mid s, a)&= \exp \left( \frac{Q(s, a) - C}{\tau } \right) \end{aligned}$$

where \(C = \max (V) = \max (Q)\) theoretically, although its specific value is generally unknown.

In this way, the optimality can be treated in probabilistic inference problems, facilitating integration with Bayesian inference and other methods. This paper utilizes this property to derive a new optimization problem, as described later.

Inference of optimal/non-optimal policies

With the optimality variable O, we can infer the optimal policy and the non-optimal policy (details are in [26]). With Eqs. (5) and (6), the policy conditioned on O, \(\pi ^*(a \mid s, O)\), can be derived through Bayes theorem.

$$\begin{aligned} \pi ^*(a \mid s, O) = \frac{p(O \mid s, a) b(a \mid s)}{p(O \mid s)} \end{aligned}$$

where \(b(a \mid s)\) denotes the sampler distribution (e.g. the composed policy with old parameters or one approximated by target networks [29]).

By substituting \(\{0,1\}\) for O, the inference of the optimal policy, \(\pi ^+\), and the non-optimal policy, \(\pi ^-\), is given as follows:

$$\begin{aligned} \pi ^+(a \mid s)&= \pi ^*(a \mid s, O=1) = \frac{\exp \left( \frac{Q(s,a) - C}{\tau } \right) }{\exp \left( \frac{V(s) - C}{\tau } \right) } b(a \mid s) \end{aligned}$$
$$\begin{aligned} \pi ^-(a \mid s)&= \pi ^*(a \mid s, O=0) = \frac{1 - \exp \left( \frac{Q(s,a) - C}{\tau } \right) }{1 - \exp \left( \frac{V(s) - C}{\tau } \right) } b(a \mid s) \end{aligned}$$

Although it is difficult to sample action from these policies directly, they can be utilized for analysis later.
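To make Eqs. (8) and (9) concrete, the following sketch evaluates the density ratios \(\pi ^+ / b\) and \(\pi ^- / b\) for given scalar values. It is an illustrative sketch only: the function name is hypothetical and C is assumed to be known here, although it is generally unknown in practice.

```python
import math


def policy_ratios(q, v, c, tau):
    """Return (pi_plus / b, pi_minus / b) for scalars Q(s, a), V(s), C, tau."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    if q > c or v >= c:
        raise ValueError("requires Q <= C and V < C")
    # For the optimal policy, C cancels: exp((Q - C)/tau) / exp((V - C)/tau).
    plus = math.exp((q - v) / tau)
    # 1 - exp(x) = -expm1(x); the two minus signs cancel in the ratio.
    minus = math.expm1((q - c) / tau) / math.expm1((v - c) / tau)
    return plus, minus
```

When Q equals V, both ratios are one and neither inferred policy differs from the sampler b; when Q reaches C, the non-optimal ratio vanishes.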

Variational recurrent neural network

To model the state transition probability (i.e. \(p_e\)) as a stochastic dynamics model, we derive a method to learn it based on the variational recurrent neural network (VRNN) [27]. Therefore, in this section, we briefly introduce the VRNN.

The VRNN considers the maximization problem of log-likelihood of a prediction model of observation (s in the context of RL), \(p_m\). s is assumed to be stochastically decoded from lower-dimensional latent variable z, and z is also sampled according to the history of s, \(h^s\), as time-dependent prior \(p(z \mid h^s)\). Here, \(h^s\) is generally approximated by recurrent neural networks, and this paper employs deep echo state networks [30] for this purpose. Using Jensen’s inequality, a variational lower bound is derived as follows:

$$\begin{aligned} \ln p_m(s \mid h^s)&= \ln \int p(s \mid z) p(z \mid h^s) dz\nonumber \\&= \ln \int q(z \mid s, h^s) p(s \mid z) \frac{p(z \mid h^s)}{q(z \mid s, h^s)} dz\nonumber \\&\ge {\mathbb {E}}_{q(z \mid s, h^s)}[\ln p(s \mid z)]\nonumber \\&\quad - \mathrm{KL}(q(z \mid s, h^s) \Vert p(z \mid h^s))\nonumber \\&= - {\mathcal {L}}_{\mathrm{vrnn}} \end{aligned}$$

where \(p(s \mid z)\) and \(q(z \mid s, h^s)\) denote the decoder and encoder, respectively. \(\mathrm{KL}(\cdot \Vert \cdot )\) is the term for Kullback-Leibler (KL) divergence between two probabilities. \({\mathcal {L}}_{\mathrm{vrnn}}\) is minimized via the optimization of \(p_m\), which consists of \(p(s \mid z)\), \(q(z \mid s, h^s)\), and \(p(z \mid h^s)\).

Note that, in the original implementation [27], the decoder also depends on \(h^s\), but this dependence is omitted in the above derivation for simplicity and for aggregating time information into z, as in the literature [31]. In addition, the strength of regularization by the KL term can be controlled by following \(\beta\)-VAE [32] with a hyperparameter \(\beta \ge 0\).
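The structure of \({\mathcal {L}}_{\mathrm{vrnn}}\) with the \(\beta\) weighting can be sketched for scalar variables as below. This sketch assumes one-dimensional Gaussian encoder, prior, and decoder so that the KL term has a closed form, and it uses a single-sample estimate of the reconstruction term; all function and argument names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mu, std):
    """Log density of N(mu, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mu) ** 2 / (2.0 * std ** 2)


def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """Closed-form KL(N(mu_q, std_q^2) || N(mu_p, std_p^2))."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


def vrnn_loss(s, dec_mu, dec_std, mu_q, std_q, mu_p, std_p, beta=1.0):
    """Negative lower bound: -ln p(s | z) + beta * KL(q || p).

    dec_mu, dec_std: decoder output for one latent sample z ~ q.
    (mu_q, std_q): encoder q(z | s, h^s); (mu_p, std_p): prior p(z | h^s).
    """
    reconstruction = gaussian_log_pdf(s, dec_mu, dec_std)
    return -reconstruction + beta * gaussian_kl(mu_q, std_q, mu_p, std_p)
```

When the encoder equals the prior, the KL term vanishes and only the reconstruction error remains.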

Derivation of proposed method


The outputs of the FB/FF policies should eventually coincide, but it is unclear how they will be updated if we directly optimize the composed policy according to the conventional RL. In other words, if the composed policy is trained using a policy-gradient method, the gradients for the FB/FF policies would differ from each other, so that the FB/FF policies would not coincide. In this paper, we propose a unified optimization problem in which the FB/FF policies naturally coincide and the composed one is properly optimized. To this end, we heuristically find that both FB/FF policies are required to be able to generate similar trajectories, which is achieved by extending the optimization problem from the optimization of the composed policy alone to the optimization of the trajectory generated by the policy, as shown in Fig. 3. This requirement leads to simultaneous learning of the FB/FF policies and matching of their outputs. In other words, the key points in the proposed method are twofold:

  1. The trajectory predicted with the stochastic dynamics model and the composed policy is expected to be close to/away from optimal/non-optimal trajectories inferred with the optimality variable.

  2. The stochastic dynamics model is trained via its variational lower bound, which naturally generates a soft constraint between the FB/FF policies.

However, please keep in mind that this approach was obtained heuristically, and therefore a more straightforward method may exist, although it is not easily found.

Fig. 3

Trajectory optimization problem: the trajectory can be predicted with the composed policy and the stochastic dynamics model; the optimal/non-optimal trajectories can be inferred with the optimal/non-optimal policies and the true state transition probability; the predicted trajectory is desired to be close to the optimal trajectory, while to be away from the non-optimal trajectory; the divergence between trajectories can be represented by the KL divergence

Here, as an additional preliminary preparation, we define the FB, FF, and composed policies mathematically: \(\pi _{\mathrm{FB}}(a \mid s)\); \(\pi _{\mathrm{FF}}(a \mid h^a)\); and the following mixture distribution, respectively.

$$\begin{aligned} \pi (a \mid s, h^a) = w \pi _{\mathrm{FB}}(a \mid s) + (1 - w) \pi _{\mathrm{FF}}(a \mid h^a) \end{aligned}$$

where \(w \in [0, 1]\) denotes the mixture ratio of the FB/FF policies. That is, for generality, the outputs of the FB/FF policies are composed by a stochastic switching mechanism, rather than a simple summation as in FEL [21]. Note that since the history of action, \(h^a\), can be updated without s, the FF policy is naturally robust to sensing failures.
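The stochastic switching in Eq. (11) can be sketched as follows. This is a minimal illustration in which the FB/FF policies are represented by arbitrary callables; the function names and the returned source label are hypothetical.

```python
import random


def sample_composed(w, sample_fb, sample_ff, rng):
    """Sample an action from pi = w * pi_FB + (1 - w) * pi_FF.

    sample_fb / sample_ff: zero-argument callables returning an action.
    Returns (action, source), where source is "FB" or "FF".
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    # Stochastic switching: the FB policy is chosen with probability w.
    if rng.random() < w:
        return sample_fb(), "FB"
    return sample_ff(), "FF"


def composed_density(a, w, pdf_fb, pdf_ff):
    """Density of the mixture at action a."""
    return w * pdf_fb(a) + (1.0 - w) * pdf_ff(a)
```

For example, `sample_composed(0.7, fb, ff, random.Random(0))` selects the FB sample about 70% of the time over many draws, which differs from summing the two outputs as in FEL.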

Optimization problem for optimal/non-optimal trajectories

With the composed policy, \(\pi\), and the stochastic dynamics model, given as \(p_m(s^\prime \mid s, a, h^s, h^a)\), a fragment of trajectory is predicted as \(p_m \pi\). As a reference, we can consider the fragment of optimal/non-optimal trajectory with \(\pi ^*\) in Eq. (7) and the real environment, \(p_e\), as \(p_e \pi ^*\). Note that the original derivation of \(\pi ^*\) has only the state s (and the optimality variable O) as its conditions, but as described above, we need to treat the history of action \(h^a\) explicitly, so we consider \(\pi ^* = \pi (a \mid s, h^a, O)\). The degree of divergence between the two can be evaluated by KL divergence as follows:

$$\begin{aligned} \mathrm{KL}(p_e \pi ^* \Vert p_m \pi )&= {\mathbb {E}}_{p_e \pi ^*} [(\ln p_e + \ln \pi ^*) - (\ln p_m + \ln \pi )]\nonumber \\&= {\mathbb {E}}_{p_e b} \left[ \frac{p(O \mid s, a)}{p(O \mid s)} \{(\ln p_e + \ln \pi ^*) - (\ln p_m + \ln \pi )\} \right] \nonumber \\&\propto - {\mathbb {E}}_{p_e b} \left[ \frac{p(O \mid s, a)}{p(O \mid s)} (\ln p_m + \ln \pi ) \right] \end{aligned}$$

where the term \(\ln p_e \pi ^*\) inside the expectation operation is excluded since it is not related to the learnable \(p_m\) and \(\pi\). The expectation operation with \(p_e\) and b can be approximated by Monte Carlo method, namely, we can optimize \(p_m\) and \(\pi\) using the above KL divergence with the appropriate conditions of O.

As the conditions, our optimization problem considers that \(p_m \pi\) is expected to be close to \(p_e \pi ^+\) (i.e. the optimal trajectory) and be away from \(p_e \pi ^-\) (i.e. the non-optimal trajectory), as shown in Fig. 3. Therefore, the specific loss function to be minimized is given as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{traj}}&= \mathrm{KL}(p_e \pi ^+ \Vert p_m \pi ) -\mathrm{KL}(p_e \pi ^- \Vert p_m \pi )\nonumber \\&\propto - {\mathbb {E}}_{p_e b}\left[ \left\{ \frac{\exp \left( \frac{Q - C}{\tau } \right) }{\exp \left( \frac{V - C}{\tau } \right) } -\frac{1 - \exp \left( \frac{Q - C}{\tau } \right) }{1 - \exp \left( \frac{V - C}{\tau } \right) } \right\} (\ln p_m + \ln \pi ) \right] \nonumber \\&= - {\mathbb {E}}_{p_e b}\left[ \frac{\exp \left( \frac{Q - V}{\tau } \right) - 1}{1 - \exp \left( \frac{V - C}{\tau } \right) } (\ln p_m + \ln \pi ) \right] \nonumber \\&\propto - {\mathbb {E}}_{p_e b}\left[ \tau \left\{ \exp \left( \frac{\delta }{\tau } \right) - 1 \right\} (\ln p_m + \ln \pi ) \right] \end{aligned}$$

where \(1 - \exp \{(V - C)\tau ^{-1}\}\) and \(\tau\) are multiplied to eliminate unknown C and to scale the gradient at \(\delta = 0\) to be one, respectively. Note that the derived result is similar to Eq. (3), but with a different coefficient from \(\delta\) and a different sampler from \(\pi\).
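The behavior of the coefficient \(\tau \{\exp (\delta / \tau ) - 1\}\) can be checked numerically with the following sketch (the function name is hypothetical). It confirms that the coefficient is zero with unit slope at \(\delta = 0\) and that it is bounded below by \(-\tau\).

```python
import math


def traj_coefficient(delta, tau):
    """Coefficient tau * (exp(delta / tau) - 1) of the trajectory loss.

    math.expm1 keeps precision for small |delta / tau|.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return tau * math.expm1(delta / tau)
```

Unlike the linear coefficient \(\delta\) in the conventional policy gradient, this coefficient grows exponentially for positive \(\delta\) and saturates for negative \(\delta\).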

Stochastic dynamics model with variational lower bound

In Eq. (13), \(\ln p_m\), i.e. the stochastic dynamics model, is included and it should be modeled. Indeed, inspired by the literature [31], we found that the model based on the VRNN [27] shown in Eq. (10) can naturally yield an additional regularization between the FB/FF policies. In addition, such a method is regarded as one for extracting latent Markovian dynamics in problems for which MDP is not established in the observed state, and is similar to the latest model-based RL [33, 34].

Specifically, we consider the dynamics of the latent variable z as \(z^\prime = f(z, a)\) with a learnable function f, where a can be sampled from a time-dependent prior (i.e. the FF policy). In this case, Eq. (10) is modified through the following derivation.

$$\begin{aligned} \ln p_m(s^\prime \mid h^s, h^a)&= \ln \iint p(s^\prime \mid z^\prime ) p(z \mid h^s) \pi _{\mathrm{FF}}(a \mid h^a) dz da\nonumber \\&= \ln \iint q(z \mid s, h^s) \pi (a \mid s, h^a) p(s^\prime \mid z^\prime )\nonumber \\&\times \frac{p(z \mid h^s)}{q(z \mid s, h^s)} \frac{\pi _{\mathrm{FF}} (a \mid h^a)}{\pi (a \mid s, h^a)} dz da\nonumber \\&\ge {\mathbb {E}}_{q(z \mid s, h^s) \pi (a \mid s, h^a)} [\ln p(s^\prime \mid z^\prime ) ]\nonumber \\&- \mathrm{KL}(q(z \mid s, h^s) \Vert p(z \mid h^s)) - \mathrm{KL}(\pi (a \mid s, h^a) \Vert \pi _{\mathrm{FF}}(a \mid h^a))\nonumber \\&= - {\mathcal {L}}_{\mathrm{model}} \end{aligned}$$

Since we know that the composed policy \(\pi\) is a mixture of the FB/FF policies defined in Eq. (11), the KL term between \(\pi\) and \(\pi _{\mathrm{FF}}\) can be decomposed using variational approximation [35] and Jensen’s inequality.

$$\begin{aligned} \mathrm{KL}(\pi \Vert \pi _{\mathrm{FF}})&\ge w \ln \frac{w e^{-\mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FF}})} + (1 - w) e^{-\mathrm{KL}(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}})}}{e^{-\mathrm{KL}(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}})}}\nonumber \\&\quad + (1 - w) \ln \frac{w e^{-\mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FB}})} + (1 - w) e^{-\mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FF}})}}{e^{-\mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FF}})}}\nonumber \\&= w \ln \{ w e^{\mathrm{KL}(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}})} + (1 - w) \}\nonumber \\&\quad + (1 - w) \ln \{ w e^{-\mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FB}})} + (1 - w) \}\nonumber \\&\ge w^2 \mathrm{KL}(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}}) - (1 - w) w \mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FB}})\nonumber \\&= w^2 \{H(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}}) - H(\pi _{\mathrm{FB}})\} - (1 - w) w \mathrm{KL}(\pi _{\mathrm{FF}} \Vert \pi _{\mathrm{FB}})\nonumber \\&\propto w^2 H(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}}) \end{aligned}$$

where we use the fact that \(\mathrm{KL}(p \Vert q) = H(p \Vert q) - H(p)\) with \(H(\cdot \Vert \cdot )\) cross entropy and \(H(\cdot )\) (differential) entropy. By eliminating the negative KL term and the negative entropy term, which are unnecessary for regularization, only the cross entropy remains.
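The Jensen step in Eq. (15), from the logarithmic expression to the expression weighted by \(w^2\) and \((1 - w) w\), can be checked numerically with the following sketch. The two KL divergences are treated as given non-negative scalars, and the function names are hypothetical.

```python
import math


def mixture_kl_approx(w, kl_fb_ff, kl_ff_fb):
    """Variational approximation (third line of the derivation)."""
    first = w * math.log(w * math.exp(kl_fb_ff) + (1.0 - w))
    second = (1.0 - w) * math.log(w * math.exp(-kl_ff_fb) + (1.0 - w))
    return first + second


def mixture_kl_jensen_bound(w, kl_fb_ff, kl_ff_fb):
    """Lower bound after Jensen's inequality (fourth line of the derivation)."""
    return w ** 2 * kl_fb_ff - (1.0 - w) * w * kl_ff_fb
```

Because the logarithm is concave, the approximation is never smaller than the bound, and the two coincide at \(w = 0\) and \(w = 1\).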

The general case of VAE omits the expectation operation by sampling only one z (and a in the above case) according to s. In addition, as explained before, the strength of regularization can be controlled by adding \(\beta\) [32]. With this fact, we can modify \({\mathcal {L}}_{\mathrm{model}}\) as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{model}} = - \ln p(s^\prime \mid z^\prime ) + \beta _z \mathrm{KL}(q(z \mid s, h^s) \Vert p(z \mid h^s)) + \beta _a w^2 H(\pi _{\mathrm{FB}} \Vert \pi _{\mathrm{FF}}) \end{aligned}$$

where \(z \sim q(z \mid s, h^s), a \sim \pi (a \mid s, h^a), z^\prime = f(z, a)\), and \(\beta _{z,a}\) denote the strength of regularization for each. Finally, the above \({\mathcal {L}}_{\mathrm{model}}\) can be substituted into Eq. (13) as \(- \ln p_m\).

$$\begin{aligned} {\mathcal {L}}_{\mathrm{traj}} = - {\mathbb {E}}_{p_e b}\left[ \tau \left\{ \exp \left( \frac{\delta }{\tau } \right) - 1 \right\} (- {\mathcal {L}}_{\mathrm{model}}+ \ln \pi ) \right] \end{aligned}$$

As can be seen in Eq. (16), the regularization between the FB/FF policies is naturally added. Its strength depends on \(w^2\), that is, as the FB policy is prioritized (i.e. w is increased), this regularization is reinforced. In addition, since \({\mathcal {L}}_{\mathrm{model}}\) is now inside \({\mathcal {L}}_{\mathrm{traj}}\), the regularization becomes strong only when \(\delta\) is sufficiently positive, that is, when the agent knows the optimal direction for updating \(\pi\). Usually, at the beginning of RL, the policy generates random actions, which makes optimization of the FF policy difficult; in contrast, the FB policy can be optimized under weak regularization (if the state is sufficiently observed). Afterwards, if w is adaptively given (as introduced in the next section), the FB policy will be strongly connected with the FF policy. In summary, with this formulation, we can expect that the FB policy will be optimized first while the regularization is weak, and that its skill will gradually be transferred to the FF policy as in FEL [21].

Additional design for implementation

Design of mixture ratio based on policy entropy

For the practical implementation, we first design the mixture ratio \(w \in [0, 1]\) heuristically. Its requirements are as follows. First, the composed policy should prioritize whichever of the FB/FF policies has higher confidence. Second, if the FB/FF policies are similar to each other, either can be selected. Finally, w must be computable for arbitrary distribution models of the FB/FF policies. Note that a similar study has proposed a method of weighting the policies according to the value function corresponding to each policy [24], but this method cannot be used in this framework because there is only a single value function.

As one of the solutions for these requirements, we design the following w with the entropies for the FB/FF policies, \(H_{\mathrm{FB}}, H_{\mathrm{FF}}\), and the L2 norm between the means of these policies, \(d = \Vert \mu _{\mathrm{FB}} - \mu _{\mathrm{FF}} \Vert _2\).

$$\begin{aligned} w = \frac{\exp (-H_{\mathrm{FB}} d \beta _T)}{\exp (-H_{\mathrm{FB}} d \beta _T) +\exp (-H_{\mathrm{FF}} d \beta _T)} \end{aligned}$$

where \(\beta _T > 0\) denotes the inverse temperature parameter, i.e. w tends to be deterministic at 0 or 1 with higher \(\beta _T\), and vice versa. Note that since lower entropy means higher confidence, the negative entropies are fed into the softmax function.

If the entropy of the FB policy is sufficiently smaller than that of the FF policy, w will converge to 1 and the FB policy is prioritized; in the opposite case, w will converge to 0 and the FF policy is prioritized. However, if these policies output similar values on average, the robot can select action from either policy, so the inverse temperature is adaptively lowered by d to make w converge to \(w \simeq 0.5\).
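Eq. (18) can be sketched as below. Since a two-class softmax reduces to a logistic function, the sketch evaluates w as a sigmoid of \((H_{\mathrm{FF}} - H_{\mathrm{FB}}) d \beta _T\) in a numerically stable form; the function and argument names are hypothetical.

```python
import math


def mixture_ratio(h_fb, h_ff, d, beta_t):
    """Mixture ratio w from policy entropies and distance d between means.

    Equivalent to exp(-h_fb*d*beta_t) / (exp(-h_fb*d*beta_t) + exp(-h_ff*d*beta_t)).
    """
    if d < 0.0 or beta_t <= 0.0:
        raise ValueError("d must be non-negative and beta_t positive")
    x = (h_ff - h_fb) * d * beta_t
    # Branch on the sign of x to avoid overflow in exp.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)
```

For instance, with \(H_{\mathrm{FB}} = 0\), \(H_{\mathrm{FF}} = 1\), and \(d \beta _T = 1\), the FB policy is preferred with \(w = 1 / (1 + e^{-1})\), whereas \(d = 0\) gives \(w = 0.5\) regardless of the entropies.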

Partial cut of computational graph

In general, VAE-based architecture holds the computational graph, which gives paths for backpropagation, of latent variable z by reparameterization trick. If this trick is applied to a in our dynamics model as it is, the policy \(\pi\) will be updated toward one for improving the prediction accuracy, not for maximizing the return, which is the original purpose of policy optimization in RL.

To mitigate the wrong updates of \(\pi\) while preserving the capability to backpropagate the gradients to the whole network as in VAE, we partially cut the computational graph as follows:

$$\begin{aligned} a \leftarrow \eta a + (1 - \eta ) \hat{a} \end{aligned}$$

where \(\eta\) denotes a hyperparameter and \(\hat{\cdot }\) cuts the computational graph, so that \(\hat{a}\) is treated merely as a value.
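As a brief worked note under our reading of Eq. (19): since \(\hat{a}\) has the same value as a but is treated as a constant during backpropagation, the forward value is unchanged while the gradient passing through a is scaled by \(\eta\). Here, \(\theta\) is a hypothetical symbol for the policy parameters, introduced only for this illustration.

$$\begin{aligned} \eta a + (1 - \eta ) \hat{a}&= a \quad \text{(forward value)} \\ \frac{\partial }{\partial \theta } \{ \eta a + (1 - \eta ) \hat{a} \}&= \eta \frac{\partial a}{\partial \theta } \quad \text{(backward gradient)} \end{aligned}$$

Hence, \(\eta = 1\) would recover the ordinary reparameterization trick, and \(\eta = 0\) would block the gradient from the dynamics model to the policy entirely.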

Auxiliary loss functions

As can be seen in Eq. (17), if \(\delta < 0\), \(- {\mathcal {L}}_{\mathrm{model}}\) will be minimized, reducing the prediction accuracy of dynamics. As for the policy, it is desirable to have a sign reversal of its loss according to \(\delta\) to determine whether the update direction is good or bad. On the other hand, since the dynamics model should ideally have a high prediction accuracy for any state, this update rule may cause the failure of optimization.

In order not to reduce the prediction accuracy, we add an auxiliary loss function. We focus on the fact that the lower bound of the coefficient in Eq. (17), \(\tau (\exp (\delta \tau ^{-1}) - 1)\), is bounded and can be found analytically to be \(-\tau\) when \(\delta \rightarrow - \infty\). That is, by adding \(\tau {\mathcal {L}}_{\mathrm{model}}\) as the auxiliary loss function, the dynamics model should be updated toward one with higher prediction accuracy, while its update amount is still weighted by \(\tau \exp (\delta \tau ^{-1})\).

To update the value function, V, the conventional RL uses Eq. (2). Instead of it, the minimization problem of the KL divergence between \(p(O \mid s, a)\) and \(p(O \mid s)\) is derived in the literature [26] as the following loss function similar to Eq. (17).

$$\begin{aligned} {\mathcal {L}}_{\mathrm{value}} = - {\mathbb {E}}_{p_e b} \left[ \tau \left\{ \exp \left( \frac{\delta }{\tau } \right) - 1 \right\} V \right] \end{aligned}$$

Note that, in this formula (and Eq. (17)), \(\delta\) has no computational graph for backpropagation, i.e. it is merely a coefficient.

Finally, the loss function to be minimized for updating \(\pi\) (i.e. \(\pi _{\mathrm{FB}}\) and \(\pi _{\mathrm{FF}}\)), V, and \(p_m\) can be summarized as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{all}} = {\mathcal {L}}_{\mathrm{traj}} + {\mathcal {L}}_{\mathrm{value}} +\tau {\mathcal {L}}_{\mathrm{model}} \end{aligned}$$

where \({\mathcal {L}}_{\mathrm{traj}}\), \({\mathcal {L}}_{\mathrm{value}}\), and \({\mathcal {L}}_{\mathrm{model}}\) are given in Eqs. (17), (20), and (16), respectively. This loss function can be minimized by one of the stochastic gradient descent (SGD) methods like [36].
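How the three terms of Eq. (21) combine for a single sample can be sketched as follows. The sketch takes scalar values of \({\mathcal {L}}_{\mathrm{model}}\), \(\ln \pi\), and V as inputs, treats \(\delta\) as a constant coefficient, and uses hypothetical function names; it also shows that the net weight on \({\mathcal {L}}_{\mathrm{model}}\) becomes \(\tau \exp (\delta / \tau ) \ge 0\).

```python
import math


def total_loss(delta, tau, model_loss, log_pi, value):
    """Single-sample L_all = L_traj + L_value + tau * L_model.

    delta is used only as a coefficient (no gradient flows through it).
    """
    coef = tau * math.expm1(delta / tau)
    l_traj = -coef * (-model_loss + log_pi)
    l_value = -coef * value
    return l_traj + l_value + tau * model_loss


def model_loss_weight(delta, tau):
    """Net weight on L_model in L_all: coefficient plus auxiliary tau."""
    return tau * math.expm1(delta / tau) + tau
```

Because the net weight is non-negative for any \(\delta\), the dynamics model is never pushed toward lower prediction accuracy, which is the purpose of the auxiliary term.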

Results and discussion


We verify the validity of the proposed method derived in this paper. This verification is done through a numerical simulation of a cart-pole inverted pendulum and an experiment of a snake robot forward locomotion, which is driven by central pattern generators (CPGs) [37].

The four specific objectives are listed below.

  1. Through the simulation and the robot experiment, we verify that the proposed method can optimize the composed policy, and we also reveal its optimization process.

  2. By comparing the successful and failed cases in the simulation, we clarify an open issue of the proposed method.

  3. We compare the two behaviors obtained with the decomposed FB/FF policies to confirm that there is little difference between them.

  4. By intentionally causing sensing failures in the robot experiment, we illustrate the sensitivity of the FB policy and the robustness of the FF policy to them.

Note that the purpose of this paper is to analyze the learnability of the proposed method and its characteristics, since it is difficult to make a fair comparison with similar studies that are robust to sensing failures [15, 18,19,20] due to differences in their inputs, network architectures, and so on.

Setup of proposed method

The network architecture for the proposed method is implemented in PyTorch [38], as illustrated in Fig. 4. All the modules (i.e. the encoder \(q(z \mid s, h^s)\), decoder \(p(s^\prime \mid z^\prime )\), time-dependent prior \(q(z \mid h^s)\), dynamics f(z, a), value function V(s), and the FB/FF policies \(\pi _{\mathrm{FB}}(a \mid s)\), \(\pi _{\mathrm{FF}}(a \mid h^a)\)) are each represented by three fully connected layers with 100 neurons each. As the nonlinear activation for them, we apply layer normalization [39] and the Swish function [40]. To represent the histories, \(h^s\) and \(h^a\), as mentioned before, we employ deep echo state networks [30] (three layers with 100 neurons each). The probability density function output by every stochastic model is given as a Student-t distribution, with reference to [41,42,43].
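As a rough illustration of the activation described above, the following pure-Python sketch applies layer normalization and then the Swish function to one feature vector. It is not the authors' PyTorch code: the ordering (normalization first, then Swish), the absence of learned gain and bias, and the example values are all assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def swish(v):
    """Swish (sigmoid-weighted linear unit): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def activation(x):
    """Assumed ordering: layer normalization first, then Swish element-wise."""
    return [swish(v) for v in layer_norm(x)]


features = [0.5, -1.0, 2.0, 0.0]  # hypothetical pre-activation values
print(activation(features))
```

Because the normalized values are bounded, the exponential inside `swish` stays well within floating-point range here.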

Fig. 4

Network architecture of the proposed method: it contains seven modules for the encoder \(q(z \mid s, h^s)\), decoder \(p(s^\prime \mid z^\prime )\), time-dependent prior \(q(z \mid h^s)\), dynamics f(z, a), value function V(s), and the FB/FF policies \(\pi _{\mathrm{FB}}(a \mid s)\), \(\pi _{\mathrm{FF}}(a \mid h^a)\), with two RNN features, \(h^s\) and \(h^a\); \(\pi _{\mathrm{FB}}\) and \(\pi _{\mathrm{FF}}\) are composed as \(\pi\), while being regularized toward each other

To optimize the above network architecture, a robust SGD, i.e., LaProp [36] with t-momentum [44] and d-AmsGrad [45] (so-called td-AmsProp), is employed with their default parameters except the learning rate. In addition, optimization of V and \(\pi\) can be accelerated by using adaptive eligibility traces [46], and stabilized by using t-soft target network [29].

The parameters for the above implementation, including those unique to the proposed method, are summarized in Table 1. Many of these were empirically adjusted based on values from previous studies. Because of the large number of parameters involved, the influence of these parameters on the behavior of the proposed method is not examined in this paper. However, it should be remarked that a meta-optimization of them can be easily performed with packages such as Optuna [47], although such a meta-optimization requires a great deal of time.

Table 1 Parameter configuration

Simulation for statistical evaluation

For the simulation, we employ the PyBullet dynamics engine wrapped by OpenAI Gym [48, 49]. The task (a.k.a. environment) to be solved is InvertedPendulumBullet-v0, in which a cart tries to keep a pole standing on it. With different random seeds, 30 trials of 300 episodes each are performed.

First of all, we depict the learning curves of the score (i.e. the sum of rewards) and the mixture ratio in Fig. 5. Since five trials were obvious failures, for further analysis we plot them separately as Failure (5) and the remaining successful trials as Success (25). In the successful trials, the agent could solve this balancing task stably after about 150 episodes, even with stochastic actions. Afterwards, the composed policy was further stabilized and became closer to deterministic, and in the end the task was almost certainly accomplished by the proposed method in the 25 successful trials.

Fig. 5

Simulation results: 30 trials were divided into 5 failure and 25 successful cases; by around 150 episodes, the proposed method mostly succeeded in balancing the pole on the cart, mainly using the FB policy, as shown by the mixture ratio close to 1; afterwards, the composed policy became more deterministic with further stabilization; during that time, the skill of the FB policy was probably transferred to the FF policy, as can be seen in the decrease of the mixture ratio

Focusing on the mixture ratio, the FB policy was dominant in the early stages of learning, as expected. Then, as the episodes passed, the FF policy was optimized toward the FB policy, and the mixture ratio gradually approached 0.5. Finally, it seems to have converged to around 0.7, suggesting that the proposed method is basically dominated by the FB policy under stable observation.

Although all the trials followed almost the same curves until 50 episodes in both plots, the scores of the failure trials then suddenly decreased. In addition, probably due to the failed optimization of the FF policy, the mixture ratio in the failure trials stayed fixed at almost 1. It is necessary to clarify the cause of this apparent difference from the successful trials, i.e. the open issue of the proposed method.

To this end, we decompose the mixture ratio into the distance between the FB/FF policies, d, and the entropies of the respective policies, \(H_{\mathrm{FB}}\) and \(H_{\mathrm{FF}}\), in Fig. 6. Extreme behavior can be observed around the 80th episode in d and \(H_{\mathrm{FF}}\). This suggests that the FF policy (or its base RNNs) was updated sharply in the wrong direction and could not recover from there. As a consequence, the FB policy was also constantly regularized toward the FF policy, i.e. in the wrong direction, causing the failures of the balancing task. Indeed, \(H_{\mathrm{FB}}\) gradually increased toward \(H_{\mathrm{FF}}\). In summary, the proposed method lacks a mechanism to stabilize the learning of the FF policy (or its base RNNs). This is, however, expected to be improved by suppressing the amount of policy updates as in recent RL methods [50], by regularizing the RNNs [51], and/or by improving the initialization of the FF policy.

Fig. 6

Decomposition of mixture ratio: 30 trials were divided into 5 failure and 25 successful cases; around the 80th episode in the five failure cases, d and \(H_{\mathrm{FF}}\) suddenly jumped to higher values; this suggests wrong updates of the FF policy (or its base RNNs); following this erroneous behavior, \(H_{\mathrm{FB}}\) was pulled in the wrong direction by the FF policy, thereby resulting in the failures of the balancing task

Robot experiment

The following robot experiment is conducted to illustrate the practical value of the proposed method. Since the statistical properties of the proposed method are verified via the above simulation, we demonstrate one successful case here.

Setup of robot and task

The snake robot used in this experiment is shown in Fig. 7. This robot has eight Qbmove actuators developed by QbRobotics, each of which is a variable stiffness actuator (VSA) [52] that can control its stiffness at the hardware level. As can be seen in the figure, all the actuators are serially connected and mounted on casters so that the robot can easily be driven by snaking locomotion. An AR marker is attached to the head of the robot so that its coordinates can be detected with a camera (ZED2 developed by Stereolabs).

Fig. 7

Snake robot with eight VSAs serially connected: as its actuator, we use Qbmove developed by QbRobotics, which can control its stiffness; this robot is mounted on casters to easily drive forward by snaking locomotion, the base of which is generated by CPGs

To generate the primitive snaking locomotion, we employ CPGs [37] as mentioned before. Each CPG follows Cohen’s model with sine function as follows:

$$\begin{aligned} \zeta _i&= \zeta _i + \left\{ u_i^r + \sum _{ij} \alpha (\zeta _j + \zeta _i - u_i^\eta ) \right\} dt \end{aligned}$$
$$\begin{aligned} \theta _i&= u_i^A \sin (\zeta _i) \end{aligned}$$

where \(\zeta _i\) denotes the internal state, and \(\theta _i\) is used as the reference angle of the i-th actuator. \(\alpha\), \(u_i^r\), \(u_i^\eta\), and \(u_i^A\) denote the internal parameters of this CPG model. For all the CPGs (i.e. for all the actuators), we set the same parameters: \(\alpha = 2\), \(u_i^r = 10\), \(u_i^\eta = 1\), and \(u_i^A = \pi / 4\). dt is the discrete time step, set to 0.02 s.

Even with this CPG model, the robot has room for optimization of the stiffness of each actuator, \(k_i\). Therefore, the proposed method is applied to the optimization of \(k_i \in [0, 1]\) (\(i = 1, 2, \ldots , 8\)). Let us introduce the state and action spaces of the robot.

As for the state of the robot, s, the robot observes the internal state of each actuator: the angle \(\theta _i\); the angular velocity \(\dot{\theta }_i\); the torque \(\tau _i\); and the stiffness \(k_i\) (which differs from the command value due to limited control accuracy). To evaluate its locomotion, the coordinates of its head, x and y, are additionally observed (see Fig. 8). In addition, as mentioned before, the action of the robot, a, is set to be the stiffness command \(k_i\). In summary, the 34-dimensional s and the 8-dimensional a are given as follows:

$$\begin{aligned} s&= [\theta _1, \dot{\theta }_1, \tau _1, k_1; \theta _2, \dot{\theta }_2, \tau _2, k_2; \ldots ; \theta _8, \dot{\theta }_8, \tau _8, k_8; x, y]^\top \end{aligned}$$
$$\begin{aligned} a&= [k_1, k_2, \ldots , k_8]^\top \end{aligned}$$
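The layout of s defined above (four readings per actuator for eight actuators, followed by the head coordinates) can be sketched in Python as follows. The class and function names are hypothetical and only illustrate the ordering and the resulting dimensions; they are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List

NUM_ACTUATORS = 8


@dataclass
class ActuatorReading:
    theta: float      # angle
    theta_dot: float  # angular velocity
    torque: float     # torque
    stiffness: float  # measured stiffness k_i (may differ from the command)


def build_state(readings: List[ActuatorReading], x: float, y: float) -> List[float]:
    """Concatenate per-actuator readings and the head coordinates into s."""
    if len(readings) != NUM_ACTUATORS:
        raise ValueError(f"expected {NUM_ACTUATORS} actuators, got {len(readings)}")
    state: List[float] = []
    for r in readings:
        state.extend([r.theta, r.theta_dot, r.torque, r.stiffness])
    state.extend([x, y])
    return state


def build_action(stiffness_commands: List[float]) -> List[float]:
    """Action a: one stiffness command per actuator, each within [0, 1]."""
    if len(stiffness_commands) != NUM_ACTUATORS:
        raise ValueError(f"expected {NUM_ACTUATORS} commands")
    if any(not 0.0 <= k <= 1.0 for k in stiffness_commands):
        raise ValueError("stiffness commands must lie in [0, 1]")
    return list(stiffness_commands)


# Hypothetical readings at the start of an episode (theta = 0, k = 0.5).
initial = [ActuatorReading(0.0, 0.0, 0.0, 0.5) for _ in range(NUM_ACTUATORS)]
s = build_state(initial, x=0.0, y=0.0)
a = build_action([0.5] * NUM_ACTUATORS)
print(len(s), len(a))
```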
Fig. 8

Experimental field: on the top of this field, a camera to detect the robot head by the AR marker is placed; by controlling the stiffness of each actuator, the robot tries to move forward, i.e. x-direction

For the definition of task, i.e. the design of reward function, we consider forward locomotion. Since the primitive motion is already generated by the CPG model, this task can be accomplished only by restraining the sideward deviation. Therefore, we define the reward function as follows:

$$\begin{aligned} r(s, a) = - |y| \end{aligned}$$

The proposed method learns the composed policy for the above task. At the beginning of each episode, the robot is initialized at the same place with \(\theta _i = 0\) and \(k_i = 0.5\). Afterwards, the robot starts to move forward, and the episode is terminated if the robot goes outside the observable area (including the goal) or 2000 time steps have elapsed. We ran 100 episodes in total.
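The reward and the termination rule described above can be written down as a short Python sketch. The input format (a sequence of head deviations paired with an observability flag) and the choice to give no reward on the step where the robot leaves the observable area are assumptions made for illustration.

```python
from typing import Iterable, Tuple

MAX_STEPS = 2000  # step limit per episode, as stated in the text


def reward(y: float) -> float:
    """Reward r(s, a) = -|y|: penalize sideward deviation of the head."""
    return -abs(y)


def episode_score(samples: Iterable[Tuple[float, bool]]) -> float:
    """Sum the rewards over one episode.

    Each sample is (y, observable). Assumption: the episode ends, without a
    reward for that step, as soon as the head is no longer observable.
    """
    score = 0.0
    for step, (y, observable) in enumerate(samples):
        if step >= MAX_STEPS or not observable:
            break
        score += reward(y)
    return score


# Hypothetical log: two observed steps, then the head leaves the camera view.
print(episode_score([(0.1, True), (-0.2, True), (0.5, False)]))
```

The score is never positive, and a perfectly straight run along \(y = 0\) would give a score of zero.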

Learning results

We depict the learning curves of the score (i.e. the sum of rewards) and the mixture ratio in Fig. 9. Note that a moving average with a window size of 5 is applied to make the learning trends easier to see. From the score, we can say that the proposed method improved the straightness of the snaking locomotion. Indeed, Fig. 10, which shows snapshots of the experiment before and after learning, clearly indicates that the robot succeeded in forward locomotion only after learning.
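A moving average of this kind can be sketched in Python as follows. The text does not state whether the window is trailing or centered, so a trailing window (with shorter windows for the first few episodes) is assumed here; the score values are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Trailing moving average; early entries average over the values seen so far."""
    buffer = deque(maxlen=window)  # keeps only the most recent `window` values
    smoothed = []
    for v in values:
        buffer.append(v)
        smoothed.append(sum(buffer) / len(buffer))
    return smoothed


scores = [-40.0, -35.0, -50.0, -20.0, -15.0, -10.0]  # hypothetical episode scores
print(moving_average(scores))
```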

Fig. 9

Experimental results: for visibility of the learning trends, a moving average with a window size of 5 is applied; the proposed method successfully improved the straightness of the snaking motion by optimizing the stiffness; we found the skill transfer from the FB policy to the FF policy, as can be seen in the mixture ratio, as in Fig. 5; remarkably, during this transfer (10–30 episodes), the score temporarily decreased, probably due to the increased frequency of use of the non-optimal FF policy

Fig. 10

Snapshots before and after learning: the yellow horizontal dashed lines represent the target where \(y=0\); before learning, the initial policy failed to produce forward snaking locomotion; in contrast, the proposed method yielded the forward locomotion using the optimized composed policy

As in the successful trials in Fig. 5, the mixture ratio in this experiment first increased, and afterwards the FF policy was optimized, reducing the mixture ratio toward 0.5 (although it converged to around 0.7). We found an additional feature: during episodes 10–30, probably when the transfer of skill from the FB policy to the FF policy was active, the score temporarily decreased. This would be due to the increased frequency of use of the non-optimal FF policy, resulting in erroneous behaviors. After that period, however, the score became stably high, and we expect that the above skill transfer was almost complete and that the optimal actions could be sampled even from the FF policy.

Demonstration with learned policies

To confirm that the skill transfer was accomplished, after the above learning we apply the decomposed FB/FF policies individually to the robot. The top of Fig. 11 shows the overlapped snapshots (the red/blue robots correspond to the FB/FF policies, respectively). With the FF policy, of course, the effect of randomness in the initial state gradually accumulates, so the two results can never be completely consistent. However, the difference at the goal was only a few centimeters. This result suggests that the skill transfer from the FB policy to the FF policy has been achieved as expected, although there is room for further performance improvement.

Fig. 11

Snapshots with/without the sensing failures: the yellow horizontal dashed lines represent the target where \(y=0\); the robot was controlled by the decomposed FB (red) or FF (blue) policy; without the sensing failures, both policies generated almost the same forward locomotion, which indicates proper skill transfer; with the sensing failures in detecting the AR marker, indicated by the shaded area, the FB policy made the robot drift to the side due to the wrong signal; in contrast, the FF policy could achieve the forward locomotion since, in principle, it ignores the wrong signal

Finally, we emulate occlusion as a sensing failure in detecting the AR marker on the head. When the robot is in the left side of the video frame, the detection of the AR marker is forced to fail and returns wrong (and constant) x and y. In that case, the FB policy would collapse, while the FF policy is never affected by this emulated sensing failure. The bottom of Fig. 11 shows the overlapped snapshots, where the left side with the sensing failure is shaded. Until the robot escaped the left side, the locomotion obtained by the FB policy drifted toward the bottom of the video frame, and it was apparent that the robot could not recover before reaching the goal (Additional file 1).
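The emulated failure can be pictured with a small Python sketch. It is only an illustration: the function name, the assumption that the occluded left side corresponds to x below some boundary, and all numerical values are hypothetical and not taken from the experiment.

```python
from typing import Tuple

Coordinates = Tuple[float, float]


def emulate_occlusion(true_xy: Coordinates, x_boundary: float,
                      frozen_xy: Coordinates) -> Coordinates:
    """Return the marker coordinates passed on to the policy.

    Assumption: the occluded "left side" is the region x < x_boundary, and the
    failed detector reports one fixed (wrong) pair of coordinates there.
    """
    x, _ = true_xy
    if x < x_boundary:
        return frozen_xy
    return true_xy


# Hypothetical numbers: the head advances along x while the left part is occluded.
for x in (0.2, 0.8, 1.4):
    print(x, emulate_occlusion((x, 0.05), x_boundary=1.0, frozen_xy=(0.0, 0.3)))
```

Only a policy that takes the state as input, such as \(\pi _{\mathrm{FB}}(a \mid s)\), reads these corrupted coordinates; \(\pi _{\mathrm{FF}}(a \mid h^a)\) does not take them as input.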

In detail, Fig. 12 illustrates the stiffness during this test. Note that the vertical axis is the unbounded version of \(k_i\), which can be mapped to the original \(k_i\) through the sigmoid function. As can be seen in the figure, the sensing failure clearly affected the outputs of the FB policy, while the FF policy ignored it and kept producing periodic outputs. Although this test is a proof-of-concept, it clearly shows the sensitivity/robustness of the FB/FF policies to sensing failures that may occur in real environments. We then conclude that a framework that can learn both the FB/FF policies in a unified manner, such as the proposed method, is useful in practice.
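For reference, the mapping from the unbounded value to a stiffness in [0, 1] through the sigmoid function can be sketched in Python as follows. The function name is hypothetical, and the branch on the sign of the input is only a common way to avoid floating-point overflow, not something stated in the text.

```python
import math


def stiffness_from_unbounded(u: float) -> float:
    """Map an unbounded policy output u to a stiffness via the sigmoid function."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)  # safe for strongly negative u, where exp(-u) would overflow
    return e / (1.0 + e)


for u in (-4.0, 0.0, 4.0):  # hypothetical unbounded outputs
    print(u, stiffness_from_unbounded(u))
```

An unbounded value of 0 corresponds to \(k_i = 0.5\), the stiffness used at the beginning of each episode.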

Fig. 12

Stiffness of each actuator when the sensing failures were intentionally caused: the vertical axis depicts the unbounded version of \(k_i\), which can be mapped to \(k_i\) by the sigmoid function; during the sensing failures, the FB policy output obviously erroneous stiffness; in contrast, the FF policy could maintain the periodic outputs; note that the phase and amplitude deviations in the area without the sensing failures can be attributed to incomplete skill transfer and recovery attempts from lateral deviation


In this paper, we derived a new optimization problem for both the FB/FF policies in a unified manner. Its key point is to consider the minimization/maximization of the KL divergences between trajectories: one is predicted by the composed policy and the stochastic dynamics model, and the others are inferred as the optimal/non-optimal ones based on control as inference. With the composed policy given as a mixture distribution, approximating the stochastic dynamics model by the variational method yields a soft regularization, i.e. the cross entropy between the FB/FF policies. In addition, by designing the mixture ratio to prioritize the policy with higher confidence, we can expect that the FB policy is optimized first, since its state dependency can easily be found, and that its skill is then transferred to the FF policy via the regularization. Indeed, the numerical simulation and the robot experiment verified that the proposed method can stably solve the given tasks, that is, it is capable of optimizing the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that, using our method, the FF policy can be appropriately optimized to generate behavior similar to that of the FB policy. As a proof-of-concept, we finally illustrated the robustness of the FF policy to sensing failures when the AR marker could not be detected.

However, we also found that the FF policy (or its base RNNs) occasionally failed to be optimized because of extreme updates in the wrong direction. To alleviate this problem, in the near future we need to make the updates of the FF policy more conservative, for example by using a soft regularization toward its prior. Alternatively, we will seek other formulations for the simultaneous learning of the FB/FF policies that can avoid this problem. With more stable learning capability, the proposed method will be applied to various robotic tasks in which sensing failures may occur. In particular, since the demonstration in this paper focused only on the sensing failure caused by occlusion, we need to investigate the robustness to other types of sensing failures (e.g. packet loss). As part of this evaluation, we will also consider combining the system with conventional techniques such as filtering, and show that they can complement each other.

Availability of data and materials

The data that support the findings of this study are available from the corresponding author, TK, upon reasonable request.


  1. Kobayashi T, Sekiyama K, Hasegawa Y, Aoyama T, Fukuda T (2018) Unified bipedal gait for autonomous transition between walking and running in pursuit of energy minimization. Robot Auton Syst 103:27–41

  2. Itadera S, Kobayashi T, Nakanishi J, Aoyama T, Hasegawa Y (2021) Towards physical interaction-based sequential mobility assistance using latent generative model of movement state. Adv Robot 35(1):64–79

  3. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge

  4. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436

  5. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  6. Modares H, Ranatunga I, Lewis FL, Popa DO (2015) Optimized assistive human-robot interaction using reinforcement learning. IEEE Trans Cybern 46(3):655–667

  7. Tsurumine Y, Cui Y, Uchibe E, Matsubara T (2019) Deep reinforcement learning with smooth policy update: application to robotic cloth manipulation. Robot Auton Syst 112:72–83

  8. Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E, Quillen D, Holly E, Kalakrishnan M, Vanhoucke V, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673

  9. Sugimoto K, Imahayashi W, Arimoto R (2020) Relaxation of strictly positive real condition for tuning feedforward control. In: IEEE Conference on Decision and Control, pp. 1441–1447. IEEE

  10. Kerr T (1987) Decentralized filtering and redundancy management for multisensor navigation. IEEE Trans Aerospace Elect Syst (1):83–119

  11. Zhang L, Ning Z, Wang Z (2015) Distributed filtering for fuzzy time-delay systems with packet dropouts and redundant channels. IEEE Trans Syst Man Cybern Syst 46(4):559–572

  12. Kalman RE, Bucy RS (1961) New results in linear filtering and prediction theory. J Basic Eng 83(1):95–108

  13. Mu H-Q, Yuen K-V (2015) Novel outlier-resistant extended Kalman filter for robust online structural identification. J Eng Mech 141(1):04014100

  14. Kloss A, Martius G, Bohg J (2021) How to train your differentiable filter. Auton Robots 45(4):561–578

  15. Musial M, Lemke F (2007) Feed-forward learning: Fast reinforcement learning of controllers. In: International Work-Conference on the Interplay Between Natural and Artificial Computation, pp. 277–286. Springer

  16. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  17. Murata S, Namikawa J, Arie H, Sugano S, Tani J (2013) Learning to reproduce fluctuating time series by inferring their time-dependent stochastic properties: application in robot learning via tutoring. IEEE Trans Auton Mental Dev 5(4):298–310

  18. Lee A, Nagabandi A, Abbeel P, Levine S (2020) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. Adv Neural Inf Process Syst. 33:741–52

  19. Sharma A, Kitani KM (2018) Phase-parametric policies for reinforcement learning in cyclic environments. In: AAAI Conference on Artificial Intelligence, pp. 6540–6547

  20. Azizzadenesheli K, Lazaric A, Anandkumar A (2016) Reinforcement learning of pomdps using spectral methods. In: Conference on Learning Theory, pp. 193–256

  21. Miyamoto H, Kawato M, Setoyama T, Suzuki R (1988) Feedback-error-learning neural network for trajectory control of a robotic manipulator. Neural Netw 1(3):251–265

  22. Nakanishi J, Schaal S (2004) Feedback error learning and nonlinear adaptive control. Neural Netw 17(10):1453–1465

  23. Sugimoto K, Alali B, Hirata K (2008) Feedback error learning with insufficient excitation. In: IEEE Conference on Decision and Control, pp. 714–719. IEEE

  24. Uchibe E (2018) Cooperative and competitive reinforcement and imitation learning for a mixture of heterogeneous learning modules. Front Neurorobot. 12:61

  25. Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909

  26. Kobayashi T (2022) Optimistic reinforcement learning by forward kullback-leibler divergence optimization. Neural Netw 152:169–180

  27. Chung J, Kastner K, Dinh L, Goel K, Courville AC, Bengio Y (2015) A recurrent latent variable model for sequential data. In: Advances in Neural Information Processing Systems, pp. 2980–2988

  28. Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014. Citeseer

  29. Kobayashi T, Ilboudo WEL (2021) t-soft update of target network for deep reinforcement learning. Neural Netw 136:63–71

  30. Gallicchio C, Micheli A, Pedrelli L (2018) Design of deep echo state networks. Neural Netw 108:33–47

  31. Kobayashi T, Murata S, Inamura T (2021) Latent representation in human-robot interaction with explicit consideration of periodic dynamics. arXiv preprint arXiv:2106.08531

  32. Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) beta-vae: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations

  33. Chua K, Calandra R, McAllister R, Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Advances in Neural Information Processing Systems, pp. 4754–4765

  34. Clavera I, Fu Y, Abbeel P (2020) Model-augmented actor-critic: Backpropagating through paths. In: International Conference on Learning Representations

  35. Hershey JR, Olsen PA (2007) Approximating the kullback leibler divergence between gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 317–320. IEEE

  36. Ziyin L, Wang ZT, Ueda M (2020) Laprop: a better way to combine momentum with adaptive gradient. arXiv preprint arXiv:2002.04839

  37. Cohen AH, Holmes PJ, Rand RH (1982) The nature of the coupling between segmental oscillators of the lamprey spinal generator for locomotion: a mathematical model. J Math Biol 13(3):345–369

  38. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch. In: Advances in Neural Information Processing Systems Workshop

  39. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

  40. Elfwing S, Uchibe E, Doya K (2018) Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw 107:3–11

  41. Takahashi H, Iwata T, Yamanaka Y, Yamada M, Yagi S (2018) Student-t variational autoencoder for robust density estimation. In: International Joint Conference on Artificial Intelligence, pp. 2696–2702

  42. Kobayashi T (2019) Variational deep embedding with regularized student-t mixture model. In: International Conference on Artificial Neural Networks, pp. 443–455. Springer

  43. Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49(12):4335–4347

  44. Ilboudo WEL, Kobayashi T, Sugimoto K (2020) Robust stochastic gradient descent with student-t distribution based first-order momentum. IEEE Transactions on Neural Networks and Learning Systems

  45. Kobayashi T (2021) Towards deep robot learning with optimizer applicable to non-stationary problems. In: 2021 IEEE/SICE International Symposium on System Integration (SII), pp. 190–194. IEEE

  46. Kobayashi T (2020) Adaptive and multiple time-scale eligibility traces for online deep reinforcement learning. arXiv preprint arXiv:2008.10040

  47. Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: A next-generation hyperparameter optimization framework. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631

  48. Coumans E, Bai Y (2016) Pybullet, a python module for physics simulation for games. Robot Mach Learn. GitHub repository

  49. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) Openai gym. arXiv preprint arXiv:1606.01540

  50. Kobayashi T (2020) Proximal policy optimization with relative pearson divergence. arXiv preprint arXiv:2010.03290

  51. Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329

  52. Catalano MG, Grioli G, Garabini M, Bonomo F, Mancini M, Tsagarakis N, Bicchi A (2011) Vsa-cubebot: A modular variable stiffness platform for multiple degrees of freedom robots. In: IEEE International Conference on Robotics and Automation, pp. 5090–5095. IEEE



Not applicable.


This work was supported by Telecommunications Advancement Foundation Research Grant and JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.

Author information



TK proposed the algorithm and wrote this manuscript. KY developed the hardware and performed the experiments. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Taisuke Kobayashi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Supplementary Information

Additional file 1. Experimental video. This video summarizes all the experiments using the snake robot for forward snaking locomotion. At first, we confirmed that constant (more specifically, maximum) stiffness failed to achieve the forward locomotion, which clarifies the necessity of optimizing the stiffness. At the beginning of learning, the robot naturally could not keep the forward locomotion. By learning with the proposed method, the robot could achieve the forward locomotion by using the composed policy. Even with the decomposed FB (red) or FF (blue) policy alone, we found almost the same motion. However, when the detection failure was intentionally applied, the FB policy failed to keep the locomotion forward, while the FF policy could do so.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Kobayashi, T., Yoshizawa, K. Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures. Robomech J 9, 18 (2022).


  • Feedback-feedforward policies
  • Control as inference
  • Variational lower bound of stochastic dynamics
  • Sensing failures