 Research Article
 Open Access
Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures
ROBOMECH Journal volume 9, Article number: 18 (2022)
Abstract
Background and problem statement
Model-free or learning-based control, in particular reinforcement learning (RL), is expected to be applied to complex robotic tasks. Traditional RL requires that the policy to be optimized is state-dependent, which means that the policy is a kind of feedback (FB) controller. Because such an FB controller requires correct state observation, it is sensitive to sensing failures. To alleviate this drawback of FB controllers, feedback error learning integrates one of them with a feedforward (FF) controller. RL can be improved by dealing with the FB/FF policies, but to the best of our knowledge, a methodology for learning them in a unified manner has not been developed.
Contribution
In this paper, we propose a new optimization problem for optimizing both the FB/FF policies simultaneously. Inspired by control as inference, the proposed optimization problem considers minimization/maximization of divergences between trajectories: one is predicted by the composed policy and a stochastic dynamics model, and the others are inferred as optimal/non-optimal ones. By approximating the stochastic dynamics model using a variational method, we naturally derive a regularization between the FB/FF policies. In numerical simulations and a robot experiment, we verified that the proposed method can stably optimize the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that the FF policy is robust to sensing failures and can hold the optimal motion.
Introduction
In the last decade, the tasks (or objects) required of robots have become steadily more complex. For such next-generation robot control problems, traditional model-based control like [1] seems to be reaching its limit due to the difficulty of modeling complex systems. Model-free or learning-based control like [2] has been expected to resolve these problems in recent years. In particular, reinforcement learning (RL) [3] is one of the most promising approaches to this end, and indeed, RL integrated with deep neural networks [4], so-called deep RL [5], has achieved several complex tasks: e.g. human-robot interaction [6]; manipulation of deformable objects [7]; and manipulation of various general objects from scratch [8].
In principle, RL makes an agent optimize a policy (a.k.a. controller) that stochastically samples an action (a.k.a. control input) depending on the state, which is the result of the interaction between the agent and the environment [3]. Generally speaking, therefore, the policy to be optimized can be regarded as one of the feedback (FB) controllers. Of course, the policy is more conceptual and general than traditional FB controllers such as those for regulation and tracking, but it is still a mapping from state to action.
Such an FB policy inherits the drawback of traditional FB controllers, i.e. sensitivity to sensing failures [9]. For example, if the robot has a camera to detect an object whose pose is given as the RL state, the FB policy would sample an erroneous action according to a wrong pose caused by occlusion. Alternatively, if the robot system is connected to a wireless TCP/IP network to sense data from IoT devices, communication loss or delay due to poor signal conditions will occur at irregular intervals, causing erroneous actions.
To alleviate this fundamental problem of the FB policy, filtering techniques have often been integrated with FB controllers. Famous examples (e.g. in aircraft) use redundant sensor and/or communication systems to select the normal signals and ignore the wrong ones in order to be robust to sensing failures [10, 11]. In addition, the Kalman filter, the most popular filtering methodology, relies on a state-space model that can predict the next observation, so that sensed values can be replaced with predicted ones when sensing fails [12, 13]. Although the state-space model is not given in RL, recent developments in deep learning would make it possible to acquire it in a data-driven manner [14].
In contrast to the above input processing, previous studies have developed policies that do not depend only on the state. In a straightforward way, time-dependent policies have been proposed by directly adding the elapsed time to the state [15] or by utilizing recurrent neural networks (RNNs) [16, 17] to approximate the policy [18]. If the policy is computed according to the phase and spectrum information of the system, instantaneous sensing failures would be ignored [19, 20]. In an extreme case, if the robot learns to generate the trajectory episodically, the adaptive behavior to the state is completely lost, but the robot is never affected by sensing failures. We refer to these approaches as output processing.
From the perspective of traditional control theory and biology, it has been suggested that this problem of the FB policy can be resolved by a feedforward (FF) policy with feedback error learning (FEL) [9, 21,22,23], which can also be regarded as output processing. FEL is a framework in which the FF controller is updated based on the error signal of the FB controller, so that the control objective is finally achieved by the FF controller alone. In other words, instead of designing a single policy as in the previous studies above, FEL has both the FB/FF policies in the system and composes their outputs appropriately to complement each other's shortcomings: the sensitivity to sensing failures in the FB policy; and the lack of adaptability to changes of state in the FF policy. The two separate policies are more compact than the integrated one. In addition, although the composition of the outputs in the previous studies is a simple summation, this structure leaves room for designing different composition rules, which makes it easier for designers to adjust which of the FB/FF policies is preferred.
The purpose of this study is to bring the benefits of FEL into the RL framework, as shown in Fig. 1. To this end, we have to solve the two challenges below.

1. Since RL is not limited to the tracking problem that FEL targets, we need to design how to compose the FB/FF policies.
2. Since the FB policy is not fixed, unlike in FEL, both the FB and FF policies must be optimized simultaneously.
For the first challenge, we assume that the composed policy is designed as a mixture distribution of the FB/FF policies, since an RL policy is stochastically defined. A similar approach is to weight each policy according to its corresponding value function, as in the literature [24]. However, in the proposed framework, this method cannot be adopted because the FB/FF policies are learned with a common value function. Therefore, we heuristically design the mixture ratio depending on the confidences of the respective FB/FF policies so that the more confident policy is prioritized. As a specific implementation of the confidence, this paper uses the negative entropy of each probability distribution.
For the second challenge, inspired by control as inference [25, 26], we derive a new optimization problem to minimize/maximize the divergences between trajectories: one is predicted by the composed policy and a stochastic dynamics model, and the others are inferred as optimal/non-optimal ones. Furthermore, by designing the stochastic dynamics model with variational approximation [27], we heuristically find that a regularization between the FB/FF policies is obtained. With this regularization, we expect that the skill of the FB policy, which can be optimized faster than the FF policy, will be transferred to the FF policy.
To verify that the proposed method can optimize the FB/FF policies in a unified manner, we conduct numerical simulations for statistical evaluation and a robot experiment as a demonstration. Through the numerical simulations, we show the capability of the proposed method, namely, stable optimization of the composed policy even with a learning law different from that of traditional RL. However, the proposed method occasionally fails to learn the optimal policy. We attribute this to extreme updates of the FF policy (or its RNNs) in a wrong direction. In addition, after training in the robot experiment, we clarify the value of the proposed method: the optimized FF policy robustly samples valuable actions under sensing failures even when the FB policy fails to achieve the optimal behavior.
Preliminaries
Reinforcement learning
In RL [3], a Markov decision process (MDP) is assumed, as shown on the left of Fig. 2. Specifically, an agent interacts with an unknown environment using action \(a \in {\mathcal {A}}\) sampled from policy \(\pi\). The environment returns the result of the interaction as state \(s \in {\mathcal {S}}\) (or the next state \(s^\prime\)) and evaluates it according to a reward function, \(r(s, a) \in {\mathbb {R}}\), which represents the degree of accomplishment of the desired task. Here, \(s^\prime\) is sampled from the black-box state transition probability of the environment, \(s^\prime \sim p_e(s^\prime \mid s, a)\) (and \(s \leftarrow s^\prime\)). In this setting, the policy \(\pi\) can theoretically be given as a probability conditioned only on s (i.e. a stochastic FB controller), \(\pi (a \mid s)\). The optimization problem of RL is to find the optimal policy \(\pi ^*\) that maximizes the sum of future rewards from the current time t (called the return), defined as \(R_t = \sum _{k=0}^\infty \gamma ^k r_{t+k}\) with discount factor \(\gamma \in [0, 1)\).
However, in practical use, the state from the environment must be observed using internal/external sensors, and measurement of the state suffers from delay (e.g. due to overload in the communication networks) and/or loss (e.g. occlusion in camera sensors), as suggested on the right of Fig. 2. With these sensing failures, \(\pi (a \mid s)\) is no longer sufficient to acquire the task represented by the reward function, because the measured (and lost/delayed) state does not satisfy the MDP assumption. To solve this problem, this paper therefore proposes a new method to optimize the FB/FF policies in a unified manner by formulating them without necessarily requiring an MDP.
In conventional RL under an MDP, the expected value of R is represented by the (state) value function V(s) and the (state-)action value function Q(s, a), and V can be learned by the following equation.
Note that Q can also be learned with the similar equation, although we do not use Q directly in this paper.
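As a reading aid, the return and the one-step temporal-difference (TD) error that a critic typically minimizes can be sketched as follows. This is the textbook form, assumed here to be what \(\delta\) denotes in this section; the function names `discounted_return`, `td_error`, and `value_loss` are illustrative and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    ret = 0.0
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def td_error(reward, v_s, v_next, gamma):
    """One-step TD error: delta = r + gamma * V(s') - V(s) (textbook form, assumed)."""
    return reward + gamma * v_next - v_s


def value_loss(delta):
    """Squared TD error, a common critic loss (an assumption, not confirmed by the source)."""
    return 0.5 * delta ** 2
```

A positive \(\delta\) indicates that the outcome was better than the current estimate V(s) predicted, which is the signal an actor-critic update uses to reinforce the sampled action.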
Based on \(\delta\), an actor-critic algorithm [28] updates \(\pi\) according to the following policy gradient.
where \({\mathbb {E}}_{p_e \pi } [\cdot ]\) is approximated by Monte Carlo method.
Introduction of optimality variable in control as inference
Recently, RL has come to be regarded as an inference problem, so-called control as inference [25]. This reinterpretation introduces an optimality variable, \(o \in \{0, 1\}\), which represents whether a pair of s and a is optimal (\(o = 1\)) or not (\(o = 0\)). Since it is defined as a random variable, the probability of \(o = 1\), \(p(o=1 \mid s, a)\), is parameterized by the reward r to connect conventional RL with this interpretation.
where \(c = \max (r)\) to satisfy \(e^{r(s, a) - c} \le 1\), and \(\tau\) denotes the hyperparameter that expresses uncertainty and can be adaptively tuned.
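A minimal sketch of this parameterization, assuming the exponential form \(p(o=1 \mid s, a) = \exp \{(r(s, a) - c)\tau ^{-1}\}\), which is consistent with the condition on c stated above. The function name `optimality_probability` is hypothetical.

```python
import math


def optimality_probability(reward, max_reward, tau):
    """p(o=1 | s, a) = exp((r - c) / tau), with c = max(r) so that the value is <= 1.

    The exponential form is an assumption based on the surrounding text.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return math.exp((reward - max_reward) / tau)
```

Under this form, a larger \(\tau\) flattens the probability toward 1 for every reward, i.e. the agent is less certain about which pairs of s and a are non-optimal.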
Furthermore, supposing the optimality in the future as \(O \in \{0, 1\}\), the following formulations can be defined with the value functions.
where \(C = \max (V) = \max (Q)\) theoretically, although its specific value is generally unknown.
In this way, optimality can be treated in probabilistic inference problems, which facilitates integration with Bayesian inference and other methods. This paper utilizes this property to derive a new optimization problem, as described later.
Inference of optimal/non-optimal policies
With the optimality variable O, we can infer the optimal policy and the non-optimal policy (details are in [26]). With Eqs. (5) and (6), the policy conditioned on O, \(\pi ^*(a \mid s, O)\), can be derived through Bayes' theorem.
where \(b(a \mid s)\) denotes the sampler distribution (e.g. the composed policy with old parameters or one approximated by target networks [29]).
By substituting \(O = 1\) and \(O = 0\), the optimal policy, \(\pi ^+\), and the non-optimal policy, \(\pi ^-\), are inferred as follows:
Although it is difficult to sample action from these policies directly, they can be utilized for analysis later.
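For reference, the Bayes step can be written out as below. This sketch assumes that the future optimality takes the exponential forms \(p(O=1 \mid s) = \exp \{(V(s) - C)\tau ^{-1}\}\) and \(p(O=1 \mid s, a) = \exp \{(Q(s, a) - C)\tau ^{-1}\}\), by analogy with the reward-based definition; it is a reconstruction under that assumption, not a quotation of the original equations.

```latex
\begin{align}
\pi^*(a \mid s, O) &= \frac{p(O \mid s, a)\, b(a \mid s)}{p(O \mid s)} \\
\pi^+(a \mid s) &= \pi^*(a \mid s, O = 1)
  = \frac{\exp\{(Q(s,a) - C)\tau^{-1}\}}{\exp\{(V(s) - C)\tau^{-1}\}}\, b(a \mid s)
  = \exp\{(Q(s,a) - V(s))\tau^{-1}\}\, b(a \mid s) \\
\pi^-(a \mid s) &= \pi^*(a \mid s, O = 0)
  = \frac{1 - \exp\{(Q(s,a) - C)\tau^{-1}\}}{1 - \exp\{(V(s) - C)\tau^{-1}\}}\, b(a \mid s)
\end{align}
```

Under these assumptions the unknown constant C cancels in \(\pi ^+\) but remains in the denominator of \(\pi ^-\).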
Variational recurrent neural network
To model the state transition probability (i.e. \(p_e\)) as a stochastic dynamics model, we derive a method to learn it based on the variational recurrent neural network (VRNN) [27]. Therefore, in this section, we briefly introduce the VRNN.
The VRNN considers the maximization problem of the log-likelihood of a prediction model of the observation (s in the context of RL), \(p_m\). s is assumed to be stochastically decoded from a lower-dimensional latent variable z, and z is sampled according to the history of s, \(h^s\), through the time-dependent prior \(p(z \mid h^s)\). Here, \(h^s\) is generally approximated by recurrent neural networks, and this paper employs deep echo state networks [30] for this purpose. Using Jensen's inequality, a variational lower bound is derived as follows:
where \(p(s \mid z)\) and \(q(z \mid s, h^s)\) denote the decoder and encoder, respectively. \(\mathrm{KL}(\cdot \Vert \cdot )\) is the Kullback-Leibler (KL) divergence between two probability distributions. \({\mathcal {L}}_{\mathrm{vrnn}}\) is minimized via the optimization of \(p_m\), which consists of \(p(s \mid z)\), \(q(z \mid s, h^s)\), and \(p(z \mid h^s)\).
Note that, in the original implementation [27], the decoder also depends on \(h^s\), but this dependency is omitted in the above derivation for simplicity and for aggregating time information into z, as in the literature [31]. In addition, the strength of regularization by the KL term can be controlled by following \(\beta\)-VAE [32] with a hyperparameter \(\beta \ge 0\).
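Written out for the simplified decoder described above, the bound has the usual variational form. The following is a sketch in the notation of this section; the exact arrangement of terms in the original equation may differ.

```latex
\begin{align}
-\ln p_m(s \mid h^s)
  &= -\ln \int p(s \mid z)\, p(z \mid h^s)\, dz \\
  &\le \mathbb{E}_{q(z \mid s, h^s)}\left[ -\ln p(s \mid z) \right]
     + \mathrm{KL}\left( q(z \mid s, h^s) \,\Vert\, p(z \mid h^s) \right) \\
\mathcal{L}_{\mathrm{vrnn}}
  &= \mathbb{E}_{q(z \mid s, h^s)}\left[ -\ln p(s \mid z) \right]
     + \beta\, \mathrm{KL}\left( q(z \mid s, h^s) \,\Vert\, p(z \mid h^s) \right)
\end{align}
```

With \(\beta = 1\) the loss equals the bound on the negative log-likelihood; other values of \(\beta\) only re-weight the KL regularizer between the encoder and the time-dependent prior.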
Derivation of proposed method
Overview
The outputs of the FB/FF policies should eventually coincide, but it is unclear how they will be updated if we directly optimize the composed policy according to conventional RL. In other words, if the composed policy is trained using a policy-gradient method, the gradients for the FB/FF policies would differ from each other, so the FB/FF policies would not coincide. In this paper, we propose a unified optimization problem in which the FB/FF policies naturally coincide and the composed one is properly optimized. To this end, we heuristically find that both FB/FF policies must be able to generate similar trajectories, which is achieved by extending the optimization problem from the optimization of the composed policy alone to the optimization of the trajectory generated by the policy, as shown in Fig. 3. This requirement leads to simultaneous learning of the FB/FF policies and matching of their outputs. In other words, the key points of the proposed method are twofold:

1. The trajectory predicted with the stochastic dynamics model and the composed policy is expected to be close to/away from optimal/non-optimal trajectories inferred with the optimality variable.
2. The stochastic dynamics model is trained via its variational lower bound, which naturally generates a soft constraint between the FB/FF policies.
However, please keep in mind that this approach was obtained heuristically, and therefore a more straightforward method may exist, although it is not easily found.
Here, as an additional preliminary preparation, we define the FB, FF, and composed policies mathematically: \(\pi _{\mathrm{FB}}(a \mid s)\); \(\pi _{\mathrm{FF}}(a \mid h^a)\); and the following mixture distribution, respectively.
where \(w \in [0, 1]\) denotes the mixture ratio of the FB/FF policies. That is, for generality, the outputs of the FB/FF policies are composed by a stochastic switching mechanism, rather than a simple summation as in FEL [21]. Note that since the history of action, \(h^a\), can be updated without s, the FF policy is naturally robust to sensing failures.
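The mixture and its stochastic switching can be sketched as follows, assuming that w weights the FB policy and \(1 - w\) the FF policy (consistent with the later statement that increasing w prioritizes the FB policy). The function names and the callable samplers `sample_fb` and `sample_ff` are hypothetical.

```python
import random


def composed_density(p_fb, p_ff, w):
    """Mixture density pi(a) = w * pi_FB(a | s) + (1 - w) * pi_FF(a | h^a)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_fb + (1.0 - w) * p_ff


def sample_composed(sample_fb, sample_ff, w, rng=random):
    """Stochastic switching: draw from the FB sampler with probability w, else from FF."""
    if rng.random() < w:
        return sample_fb(), "FB"
    return sample_ff(), "FF"
```

Sampling from a mixture never averages the two actions; each step executes an action drawn from exactly one of the two policies, which is what distinguishes this composition from the summation used in FEL.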
Optimization problem for optimal/non-optimal trajectories
With the composed policy, \(\pi\), and the stochastic dynamics model, given as \(p_m(s^\prime \mid s, a, h^s, h^a)\), a fragment of trajectory is predicted as \(p_m \pi\). As a reference, we can consider the fragment of the optimal/non-optimal trajectory with \(\pi ^*\) in Eq. (7) and the real environment, \(p_e\), as \(p_e \pi ^*\). Note that the original derivation of \(\pi ^*\) has only the state s (and the optimality variable O) as its conditions, but as described above, we need to treat the history of action \(h^a\) explicitly, so we consider \(\pi ^* = \pi (a \mid s, h^a, O)\). The degree of divergence between the two can be evaluated by the KL divergence as follows:
where the term \(\ln p_e \pi ^*\) inside the expectation operation is excluded since it is not related to the learnable \(p_m\) and \(\pi\). The expectation operation with \(p_e\) and b can be approximated by Monte Carlo method, namely, we can optimize \(p_m\) and \(\pi\) using the above KL divergence with the appropriate conditions of O.
As the conditions, our optimization problem requires that \(p_m \pi\) be close to \(p_e \pi ^+\) (i.e. the optimal trajectory) and away from \(p_e \pi ^-\) (i.e. the non-optimal trajectory), as shown in Fig. 3. Therefore, the specific loss function to be minimized is given as follows:
where \(1 - \exp \{(V - C)\tau ^{-1}\}\) and \(\tau\) are multiplied to eliminate the unknown C and to scale the gradient at \(\delta = 0\) to be one, respectively. Note that the derived result is similar to Eq. (3), but with a different coefficient from \(\delta\) and a different sampler from \(\pi\).
Stochastic dynamics model with variational lower bound
In Eq. (13), \(\ln p_m\), i.e. the stochastic dynamics model, is included and must be modeled. Inspired by the literature [31], we found that a model based on the VRNN [27] shown in Eq. (10) can naturally yield an additional regularization between the FB/FF policies. In addition, such a method can be regarded as one for extracting latent Markovian dynamics in problems for which the MDP assumption does not hold in the observed state, and is similar to the latest model-based RL [33, 34].
Specifically, we consider the dynamics of the latent variable z as \(z^\prime = f(z, a)\) with f a learnable function, where a can be sampled from the time-dependent prior (i.e. the FF policy). In this case, Eq. (10) is modified through the following derivation.
Since we know that the composed policy \(\pi\) is a mixture of the FB/FF policies defined in Eq. (11), the KL term between \(\pi\) and \(\pi _{\mathrm{FF}}\) can be decomposed using variational approximation [35] and Jensen's inequality.
where we use the fact that \(\mathrm{KL}(p \Vert q) = H(p \Vert q) - H(p)\), with \(H(\cdot \Vert \cdot )\) the cross entropy and \(H(\cdot )\) the (differential) entropy. By eliminating the negative KL term and the negative entropy term, which are unnecessary for regularization, only the cross entropy remains.
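The identity between KL divergence, cross entropy, and entropy can be checked numerically. The sketch below uses small discrete distributions for illustration only (the paper works with continuous densities), and assumes that q is positive wherever p is.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """H(p || q) = -sum_i p_i ln q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

Because the entropy term does not depend on q, minimizing the cross entropy with respect to q moves q toward p in the same way as minimizing the KL divergence, which is why the cross entropy alone can serve as the regularizer.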
The general case of VAE omits the expectation operation by sampling only one z (and a in the above case) according to s. In addition, as explained before, the strength of regularization can be controlled by adding \(\beta\) [32]. With this fact, we can modify \({\mathcal {L}}_{\mathrm{model}}\) as follows:
where \(z \sim q(z \mid s, h^s), a \sim \pi (a \mid s, h^a), z^\prime = f(z, a)\), and \(\beta _{z,a}\) denote the respective strengths of regularization. Finally, the above \({\mathcal {L}}_{\mathrm{model}}\) can be substituted into Eq. (13) as \(-\ln p_m\).
As can be seen in Eq. (16), the regularization between the FB/FF policies is naturally added. Its strength depends on \(w^2\), that is, as the FB policy is prioritized (i.e. w is increased), this regularization is reinforced. In addition, since \({\mathcal {L}}_{\mathrm{model}}\) is now inside \({\mathcal {L}}_{\mathrm{traj}}\), the regularization becomes strong only when \(\delta\) is sufficiently positive, that is, when the agent knows the optimal direction for updating \(\pi\). Usually, at the beginning of RL, the policy generates random actions, which makes optimization of the FF policy difficult; in contrast, the FB policy can be optimized under weak regularization (provided that observation is sufficient). Afterwards, if w is adaptively given (as introduced in the next section), the FB policy will be strongly connected with the FF policy. In summary, with this formulation, we can expect that the FB policy will be optimized first while the regularization is weak, and that its skill will gradually be transferred to the FF policy as in FEL [21].
Additional design for implementation
Design of mixture ratio based on policy entropy
For the practical implementation, we first design the mixture ratio \(w \in [0, 1]\) heuristically. There are three requirements. First, the composed policy should prioritize whichever of the FB/FF policies has higher confidence. Second, if the FB/FF policies are similar to each other, either can be selected. Finally, w must be computable for an arbitrary distribution model of the FB/FF policies. Note that a similar study has proposed a method of weighting the policies according to the value function corresponding to each policy [24], but this method cannot be used in this framework because there is only a single value function.
As one of the solutions for these requirements, we design the following w with the entropies of the FB/FF policies, \(H_{\mathrm{FB}}, H_{\mathrm{FF}}\), and the L2 norm between the means of these policies, \(d = \Vert \mu _{\mathrm{FB}} - \mu _{\mathrm{FF}} \Vert _2\).
where \(\beta _T > 0\) denotes the inverse temperature parameter, i.e. w tends to be deterministic at 0 or 1 with higher \(\beta _T\), and vice versa. Note that since lower entropy means higher confidence, the negative entropies are fed into the softmax function.
If the entropy of one policy is sufficiently smaller than that of the other, w will converge to 1 or 0, prioritizing the FB or FF policy, respectively. However, if these policies output similar values on average, the robot can select an action from either policy, so the inverse temperature is adaptively lowered by d to make w converge to \(w \simeq 0.5\).
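The behavior described here can be sketched with a two-way softmax over the negative entropies. The sketch assumes that "adaptively lowered by d" means the inverse temperature is multiplied by d, i.e. the effective inverse temperature is \(\beta _T d\); that scaling and the function name `mixture_ratio` are assumptions, not the paper's confirmed equation.

```python
import math


def mixture_ratio(h_fb, h_ff, d, beta_t):
    """FB weight w from a softmax over negative entropies (assumed form).

    w = exp(-b * H_FB) / (exp(-b * H_FB) + exp(-b * H_FF)) with b = beta_t * d,
    which equals sigmoid(b * (H_FF - H_FB)).
    """
    x = beta_t * d * (h_ff - h_fb)
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

Under this assumed form, a confident FB policy (low \(H_{\mathrm{FB}}\)) pushes w toward 1, a confident FF policy pushes it toward 0, and \(d \rightarrow 0\) yields \(w = 0.5\) regardless of the entropies.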
Partial cut of computational graph
In general, a VAE-based architecture keeps the computational graph of the latent variable z, which gives paths for backpropagation, by means of the reparameterization trick. If this trick is applied as it is to a in our dynamics model, the policy \(\pi\) will be updated toward improving the prediction accuracy, not toward maximizing the return, which is the original purpose of policy optimization in RL.
To mitigate the wrong updates of \(\pi\) while preserving the capability to backpropagate the gradients to the whole network as in VAE, we partially cut the computational graph as follows:
where \(\eta\) denotes a hyperparameter, and \(\hat{\cdot }\) cuts the computational graph so that the term is treated merely as a value.
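One form that matches this description, offered as an assumption rather than as the paper's confirmed equation, blends the differentiable action with a copy whose graph is cut:

```latex
\begin{align}
a \leftarrow \eta\, a + (1 - \eta)\, \hat{a}, \qquad
\frac{\partial}{\partial a}\left( \eta\, a + (1 - \eta)\, \hat{a} \right) = \eta,
\qquad \eta \in [0, 1]
\end{align}
```

The value of a is unchanged by this blend, while the gradient flowing back through a into \(\pi\) is scaled by \(\eta\). In PyTorch-style code, \(\hat{a}\) would correspond to `a.detach()`.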
Auxiliary loss functions
As can be seen in Eq. (17), if \(\delta < 0\), \(-{\mathcal {L}}_{\mathrm{model}}\) will be minimized, reducing the prediction accuracy of the dynamics. As for the policy, it is desirable for the sign of its loss to reverse according to \(\delta\), which indicates whether the update direction is good or bad. On the other hand, since the dynamics model should ideally have high prediction accuracy for any state, this update rule may cause the optimization to fail.
In order not to reduce the prediction accuracy, we add an auxiliary loss function. We focus on the fact that the coefficient in Eq. (17), \(\tau (\exp (\delta \tau ^{-1}) - 1)\), is bounded from below, and its lower bound can be found analytically to be \(-\tau\) when \(\delta \rightarrow -\infty\). That is, by adding \(\tau {\mathcal {L}}_{\mathrm{model}}\) as the auxiliary loss function, the dynamics model is always updated toward higher prediction accuracy, while its update amount is still weighted by \(\tau \exp (\delta \tau ^{-1})\).
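The arithmetic behind this auxiliary term can be checked with a short sketch. The function names are hypothetical; the two coefficients are exactly those stated in the text.

```python
import math


def traj_coefficient(delta, tau):
    """tau * (exp(delta / tau) - 1): bounded below by -tau, with slope 1 at delta = 0."""
    return tau * math.expm1(delta / tau)


def model_coefficient(delta, tau):
    """Coefficient of L_model after adding tau * L_model: tau * exp(delta / tau) >= 0."""
    return tau * math.exp(delta / tau)
```

Since `model_coefficient` equals `traj_coefficient` plus \(\tau\), the weight on \({\mathcal {L}}_{\mathrm{model}}\) can no longer become negative, so the dynamics model is never pushed toward lower prediction accuracy.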
To update the value function, V, conventional RL uses Eq. (2). Instead, the minimization problem of the KL divergence between \(p(O \mid s, a)\) and \(p(O \mid s)\) is derived in the literature [26] as the following loss function, which is similar to Eq. (17).
Note that, in this formula (and Eq. (17)), \(\delta\) has no computational graph for backpropagation, i.e. it is merely a coefficient.
Finally, the loss function to be minimized for updating \(\pi\) (i.e. \(\pi _{\mathrm{FB}}\) and \(\pi _{\mathrm{FF}}\)), V, and \(p_m\) can be summarized as follows:
where \({\mathcal {L}}_{\mathrm{traj}}\), \({\mathcal {L}}_{\mathrm{value}}\), and \({\mathcal {L}}_{\mathrm{model}}\) are given in Eqs. (17), (20), and (16), respectively. This loss function can be minimized by one of the stochastic gradient descent (SGD) methods like [36].
Results and discussion
Objective
We verify the validity of the proposed method derived in this paper. This verification is done through a numerical simulation of a cart-pole inverted pendulum and an experiment on forward locomotion of a snake robot, which is driven by central pattern generators (CPGs) [37].
Four specific objectives are listed as below.

1. Through the simulation and the robot experiment, we verify that the proposed method can optimize the composed policy, and we also reveal its optimization process.
2. By comparing the successful and failing cases in the simulation, we clarify an open issue of the proposed method.
3. We compare the two behaviors obtained with the decomposed FB/FF policies to make sure there is little difference between them.
4. By intentionally causing sensing failures in the robot experiment, we illustrate the sensitivity of the FB policy and the robustness of the FF policy to them.
Note that the purpose of this paper is to analyze the learnability of the proposed method and its characteristics, since it is difficult to make a fair comparison with similar studies that are robust to sensing failures [15, 18,19,20] due to differences in their inputs, network architectures, and so on.
Setup of proposed method
The network architecture for the proposed method is implemented using PyTorch [38], as illustrated in Fig. 4. All the modules (i.e. the encoder \(q(z \mid s, h^s)\), decoder \(p(s^\prime \mid z^\prime )\), time-dependent prior \(p(z \mid h^s)\), dynamics f(z, a), value function V(s), and the FB/FF policies \(\pi _{\mathrm{FB}}(a \mid s)\), \(\pi _{\mathrm{FF}}(a \mid h^a)\)) are represented by three fully connected layers with 100 neurons each. As nonlinear activation functions for them, we apply layer normalization [39] and the Swish function [40]. To represent the histories, \(h^s\) and \(h^a\), as mentioned before, we employ deep echo state networks [30] (three layers with 100 neurons each). The probability density function output by every stochastic model is a Student's t-distribution, with reference to [41,42,43].
To optimize the above network architecture, a robust SGD, i.e., LaProp [36] with t-momentum [44] and d-AmsGrad [45] (so-called td-AmsProp), is employed with its default parameters except for the learning rate. In addition, optimization of V and \(\pi\) can be accelerated by using adaptive eligibility traces [46], and stabilized by using a t-soft target network [29].
The parameters for the above implementation, including those unique to the proposed method, are summarized in Table 1. Many of these were empirically adjusted based on values from previous studies. Because of the large number of parameters involved, the influence of these parameters on the behavior of the proposed method is not examined in this paper. However, it should be remarked that a meta-optimization of them can be easily performed with packages such as Optuna [47], although such a meta-optimization requires a great deal of time.
Simulation for statistical evaluation
For the simulation, we employ the PyBullet dynamics engine wrapped by OpenAI Gym [48, 49]. We attempt to solve the task (a.k.a. environment) InvertedPendulumBullet-v0, in which a cart tries to keep a pole standing on it. With different random seeds, 30 trials of 300 episodes each are performed.
First of all, we depict the learning curves of the score (a.k.a. the sum of rewards) and the mixture ratio in Fig. 5. Since five trials were obvious failures, for further analysis we plot them separately as Failure (5) and the remaining successful trials as Success (25). We can see in the successful trials that the agent could solve this balancing task stably after 150 episodes, even with stochastic actions. Afterwards, further stabilization progressed and the composed policy became more deterministic, and in the end, the task was almost certainly accomplished by the proposed method in the 25 successful trials.
Focusing on the mixture ratio, the FB policy was dominant in the early stages of learning, as expected. Then, as the episodes passed, the FF policy was optimized toward the FB policy, and the mixture ratio gradually approached 0.5. Finally, it seems to have converged to around 0.7, suggesting that the proposed method is basically dominated by the FB policy under stable observation.
Although all the trials obtained almost the same curves until 50 episodes in both figures, the scores of the failure trials then suddenly decreased. In addition, probably due to the failure of optimization of the FF policy, the mixture ratio in the failure trials stayed at almost 1. It is necessary to clarify the cause of this apparent difference from the successful trials, i.e. the open issue of the proposed method.
To this end, we decompose the mixture ratio into the distance between the FB/FF policies, d, and the entropies of the respective policies, \(H_{\mathrm{FB}}\) and \(H_{\mathrm{FF}}\), in Fig. 6. Extreme behavior can be observed around the 80th episode in d and \(H_{\mathrm{FF}}\). This suggests that the FF policy (or its base RNNs) was updated in an extremely wrong direction and could not recover from there. As a consequence, the FB policy was also constantly regularized toward the FF policy, i.e. in the wrong direction, causing the failures of the balancing task. Indeed, \(H_{\mathrm{FB}}\) gradually increased toward \(H_{\mathrm{FF}}\). In summary, the proposed method lacks stabilization of the learning of the FF policy (or its base RNNs). This is, however, expected to be improved by suppressing the amount of policy updates as in the latest RL [50], by regularization of RNNs [51], and/or by improving the initialization of the FF policy.
Robot experiment
The following robot experiment is conducted to illustrate the practical value of the proposed method. Since the statistical properties of the proposed method are verified via the above simulation, we demonstrate one successful case here.
Setup of robot and task
The snake robot used in this experiment is shown in Fig. 7. This robot has eight Qbmove actuators developed by QbRobotics, each of which can control its stiffness at the hardware level, i.e. it is a variable stiffness actuator (VSA) [52]. As can be seen in the figure, all the actuators are serially connected and mounted on casters so that the robot can easily be driven by snaking locomotion. On the head of the robot, an AR marker is attached to detect its coordinates using a camera (ZED2 developed by Stereolabs).
To generate the primitive snaking locomotion, we employ CPGs [37] as mentioned before. Each CPG follows Cohen's model with a sine function as follows:
where \(\zeta _i\) denotes the internal state, and \(\theta _i\) corresponds to the reference angle of the i-th actuator. \(\alpha\), \(u_i^r\), \(u_i^\eta\), and \(u_i^A\) denote the internal parameters of this CPG model. For all the CPGs (a.k.a. actuators), we set the same parameters, \(\alpha = 2\), \(u_i^r = 10\), \(u_i^\eta = 1\), and \(u_i^A = \pi / 4\), respectively. dt is the discrete time step, set to 0.02 s.
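To make the roles of these parameters concrete, the following is a generic chain of phase oscillators with sine coupling, integrated with the Euler method. It is only a sketch: the coupling topology (each joint coupled to the previous one), and the interpretation of \(u^r\) as intrinsic frequency, \(u^\eta\) as phase lag, and \(u^A\) as amplitude are all assumptions made for illustration, not the paper's confirmed CPG equation. Only the numerical values come from the text.

```python
import math

# Parameter values from the text; their roles below are ASSUMED for illustration.
ALPHA, U_R, U_ETA, U_A, DT = 2.0, 10.0, 1.0, math.pi / 4, 0.02


def cpg_step(zeta, alpha=ALPHA, u_r=U_R, u_eta=U_ETA, dt=DT):
    """One Euler step of a chain of phase oscillators (assumed coupling form)."""
    new_zeta = []
    for i, z in enumerate(zeta):
        coupling = 0.0
        if i > 0:
            # Pull toward the previous joint's phase, lagged by u_eta (assumption).
            coupling = alpha * math.sin(zeta[i - 1] - z - u_eta)
        new_zeta.append(z + (u_r + coupling) * dt)
    return new_zeta


def reference_angles(zeta, u_a=U_A):
    """Reference joint angles theta_i = u_a * sin(zeta_i) (assumed output map)."""
    return [u_a * math.sin(z) for z in zeta]
```

In such a chain, the phase lag between neighboring joints produces a travelling wave along the body, which is the primitive snaking motion on top of which the stiffness is optimized.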
Even with this CPG model, the robot has room for optimization of the stiffness of each actuator, \(k_i\). Therefore, the proposed method is applied to the optimization of \(k_i \in [0, 1]\) (\(i = 1, 2, \ldots , 8\)). Let us introduce the state and action spaces of the robot.
As for the state of the robot s, the robot observes the internal state of each actuator: the angle \(\theta _i\); the angular velocity \(\dot{\theta }_i\); the torque \(\tau _i\); and the stiffness \(k_i\) (which differs from the command value due to limited control accuracy). To evaluate its locomotion, the coordinates of its head, x and y, are additionally observed (see Fig. 8). In addition, as mentioned before, the action of the robot a is set to be \(k_i\). In summary, the 34-dimensional s and the 8-dimensional a are defined as follows:
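The dimensions can be checked with a small sketch that assembles the two vectors: four quantities for each of the eight actuators plus the two head coordinates give 8 × 4 + 2 = 34. The ordering of the elements and the function name `build_state` are assumptions made for this example only.

```python
import numpy as np

N_ACTUATORS = 8


def build_state(theta, theta_dot, tau, k_measured, x, y):
    """Stack per-actuator observations and head coordinates (assumed order)."""
    per_joint = np.stack([theta, theta_dot, tau, k_measured], axis=1)  # (8, 4)
    return np.concatenate([per_joint.ravel(), [x, y]])                 # (34,)


# Example values matching the initialization described in the text:
# theta_i = 0 and k_i = 0.5; the remaining entries are placeholders.
state = build_state(
    theta=np.zeros(N_ACTUATORS),
    theta_dot=np.zeros(N_ACTUATORS),
    tau=np.zeros(N_ACTUATORS),
    k_measured=np.full(N_ACTUATORS, 0.5),
    x=0.0,
    y=0.0,
)
action = np.full(N_ACTUATORS, 0.5)  # one stiffness command k_i per actuator
```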
For the definition of task, i.e. the design of reward function, we consider forward locomotion. Since the primitive motion is already generated by the CPG model, this task can be accomplished only by restraining the sideward deviation. Therefore, we define the reward function as follows:
The proposed method learns the composed policy for the above task. At the beginning of each episode, the robot is initialized at the same place with \(\theta _i = 0\) and \(k_i = 0.5\). Afterwards, the robot starts to move forward, and the episode is terminated if the robot leaves the observable area (which includes the goal) or after 2000 time steps. We ran 100 episodes in total.
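The termination rule can be written as a single predicate, sketched below. The rectangular shape of the observable area, its bounds, and the function name `episode_done` are hypothetical; only the limit of 2000 time steps is taken from the text.

```python
MAX_STEPS = 2000  # maximum number of time steps per episode (from the text)


def episode_done(step, x, y, x_range, y_range):
    """Return True if the head left the observable area or time ran out.

    x_range and y_range are (min, max) bounds of the area in which the
    AR marker can be observed; their values are placeholders here.
    """
    inside = x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
    return (not inside) or step >= MAX_STEPS
```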
Learning results
We depict the learning curves of the score (i.e. the sum of rewards) and the mixture ratio in Fig. 9. Note that a moving average with a window size of 5 is applied to make the learning trends easier to see. From the score, we can say that the proposed method improved the straightness of the snaking locomotion. Indeed, Fig. 10, which shows snapshots of the experiment before and after learning, clearly indicates that the robot succeeded in forward locomotion only after learning.
As in the successful trials in Fig. 5, the mixture ratio first increased in this experiment, and afterwards the FF policy was optimized, reducing the mixture ratio toward 0.5 (although it converged at around 0.7). We additionally found that during episodes 10–30, probably when the transfer of skill from the FB to the FF policy was active, the score temporarily decreased. This would be due to the increased frequency of use of the not yet optimal FF policy, resulting in erroneous behaviors. After that period, however, the score became stably high, and we expect that the above skill transfer was almost complete and the optimal actions could be sampled even from the FF policy.
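For reference, a moving average with a window size of 5 can be computed as in the following sketch. Whether the original plots use a trailing, centered, or padded window is not stated, so the "valid" convolution used here (which shortens the curve by four points) is an assumption.

```python
import numpy as np


def moving_average(values, window=5):
    """Average each run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the window fits completely.
    return np.convolve(values, kernel, mode="valid")


smoothed = moving_average([1, 2, 3, 4, 5, 6])
```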
Demonstration with learned policies
To confirm that the skill transfer was accomplished, after the above learning, we apply the decomposed FB/FF policies individually to the robot. At the top of Fig. 11, we show the overlapped snapshots (the red/blue robots correspond to the FB/FF policies, respectively). With the FF policy, of course, the effect of randomness in the initial state gradually increased and accumulated, so the two results can never be completely consistent. However, the difference at the goal was only a few centimeters. This result suggests that the skill transfer from the FB to the FF policy has been achieved as expected, although there is room for further performance improvement.
Finally, we emulate occlusion as a sensing failure in detecting the AR marker on the head. When the robot is in the left side of the video frame, the detection of the AR marker is forced to fail and returns wrong (and constant) x and y. In that case, the FB policy would collapse, while the FF policy is never affected by this emulated sensing failure. At the bottom of Fig. 11, we show the overlapped snapshots, where the left side with the sensing failure is shaded. Until the robot escaped the left side, the locomotion obtained by the FB policy drifted to the bottom of the video frame, and it was apparent that the robot could not recover before reaching the goal (Additional file 1).
In detail, Fig. 12 shows the stiffness during this test. Note that the vertical axis is the unbounded version of \(k_i\), which is mapped to the original \(k_i\) through a sigmoid function. As can be seen in the figure, the sensing failure clearly affected the outputs of the FB policy, while the FF policy ignored it and produced periodic outputs. Although this test is a proof of concept, it clearly shows the sensitivity/robustness of the FB/FF policies to sensing failures that may occur in real environments. We therefore conclude that a framework that can learn both the FB/FF policies in a unified manner, such as the proposed method, is useful in practice.
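The emulated failure can be sketched as a thin wrapper around the marker detection. The boundary of the occluded region, the frozen coordinates, and the assumption that the left side of the frame corresponds to small x are all hypothetical choices for this example.

```python
def emulate_occlusion(x, y, x_boundary, frozen_xy):
    """Return the head coordinates as the policy would observe them.

    While the true x lies in the occluded region (assumed: x < x_boundary),
    the detection "fails" and a wrong, constant pair is returned instead.
    """
    if x < x_boundary:
        return frozen_xy
    return (x, y)


# Inside the occluded region the observation is constant.
observed_a = emulate_occlusion(0.1, 0.3, x_boundary=0.5, frozen_xy=(0.0, 0.0))
# Outside it, the true coordinates pass through unchanged.
observed_b = emulate_occlusion(0.8, 0.3, x_boundary=0.5, frozen_xy=(0.0, 0.0))
```

Because only x and y are replaced, a policy that takes the state as input sees a frozen head position, whereas a policy that does not depend on the observed state is unaffected by construction.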
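The mapping from the unbounded value on the vertical axis to the stiffness \(k_i \in [0, 1]\) can be sketched as follows. The function name `to_stiffness` is hypothetical, and the tanh form is used only because it is a numerically stable way of writing the standard logistic sigmoid.

```python
import numpy as np


def to_stiffness(u):
    """Logistic sigmoid 1 / (1 + exp(-u)), written via tanh for stability."""
    u = np.asarray(u, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * u))


k = to_stiffness(np.array([-50.0, 0.0, 50.0]))
```

An unbounded value of zero corresponds to \(k_i = 0.5\), the stiffness used at initialization.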
Conclusion
In this paper, we derived a new optimization problem for both the FB/FF policies in a unified manner. Its key point is to consider minimization/maximization of the KL divergences between trajectories: one is predicted by the composed policy and the stochastic dynamics model, and the others are inferred as the optimal/non-optimal ones based on control as inference. With the composed policy as a mixture distribution, the stochastic dynamics model approximated by the variational method yields a soft regularization, i.e. the cross entropy between the FB/FF policies. In addition, by designing the mixture ratio to prioritize the policy with higher confidence, we can expect that the FB policy is optimized first, since its state dependency can easily be found, and that its skill is then transferred to the FF policy via the regularization. Indeed, the numerical simulation and the robot experiment verified that the proposed method can stably solve the given tasks, that is, it is capable of optimizing the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that, using our method, the FF policy can be appropriately optimized to generate behavior similar to that of the FB policy. As a proof of concept, we finally illustrated the robustness of the FF policy to sensing failures when the AR marker could not be detected.
However, we also found that the FF policy (or its base RNNs) occasionally failed to be optimized because of extreme updates in a wrong direction. To alleviate this problem, in the near future, we need to make the FF policy update conservatively, for example, by using a soft regularization toward its prior. Alternatively, we will seek other formulations for the simultaneous learning of the FB/FF policies that can avoid this problem. With a more stable learning capability, the proposed method will be applied to various robotic tasks in which sensing failures may occur. In particular, since the demonstration in this paper focused only on the sensing failure caused by occlusion, we need to investigate the robustness to other types of sensing failures (e.g. packet loss). As part of this evaluation, we will also consider combining the system with conventional techniques such as filtering, and show that they can complement each other.
Availability of data and materials
The data that support the findings of this study are available from the corresponding author, TK, upon reasonable request.
References
Kobayashi T, Sekiyama K, Hasegawa Y, Aoyama T, Fukuda T (2018) Unified bipedal gait for autonomous transition between walking and running in pursuit of energy minimization. Robot Auton Syst 103:27–41
Itadera S, Kobayashi T, Nakanishi J, Aoyama T, Hasegawa Y (2021) Towards physical interaction-based sequential mobility assistance using latent generative model of movement state. Adv Robot 35(1):64–79
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Modares H, Ranatunga I, Lewis FL, Popa DO (2015) Optimized assistive human-robot interaction using reinforcement learning. IEEE Trans Cybern 46(3):655–667
Tsurumine Y, Cui Y, Uchibe E, Matsubara T (2019) Deep reinforcement learning with smooth policy update: application to robotic cloth manipulation. Robot Auton Syst 112:72–83
Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E, Quillen D, Holly E, Kalakrishnan M, Vanhoucke V, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673
Sugimoto K, Imahayashi W, Arimoto R (2020) Relaxation of strictly positive real condition for tuning feedforward control. In: IEEE Conference on Decision and Control, pp. 1441–1447. IEEE
Kerr T (1987) Decentralized filtering and redundancy management for multisensor navigation. IEEE Trans Aerospace Elect Syst (1):83–119
Zhang L, Ning Z, Wang Z (2015) Distributed filtering for fuzzy time-delay systems with packet dropouts and redundant channels. IEEE Trans Syst Man Cybern Syst 46(4):559–572
Kalman RE, Bucy RS (1961) New results in linear filtering and prediction theory. J Basic Eng 83(1):95–108
Mu HQ, Yuen KV (2015) Novel outlier-resistant extended Kalman filter for robust online structural identification. J Eng Mech 141(1):04014100
Kloss A, Martius G, Bohg J (2021) How to train your differentiable filter. Auton Robots 45(4):561–578
Musial M, Lemke F (2007) Feedforward learning: Fast reinforcement learning of controllers. In: International Work-Conference on the Interplay Between Natural and Artificial Computation, pp. 277–286. Springer
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Murata S, Namikawa J, Arie H, Sugano S, Tani J (2013) Learning to reproduce fluctuating time series by inferring their time-dependent stochastic properties: application in robot learning via tutoring. IEEE Trans Auton Mental Dev 5(4):298–310
Lee A, Nagabandi A, Abbeel P, Levine S (2020) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. Adv Neural Inf Process Syst. 33:741–52
Sharma A, Kitani KM (2018) Phase-parametric policies for reinforcement learning in cyclic environments. In: AAAI Conference on Artificial Intelligence, pp. 6540–6547
Azizzadenesheli K, Lazaric A, Anandkumar A (2016) Reinforcement learning of POMDPs using spectral methods. In: Conference on Learning Theory, pp. 193–256
Miyamoto H, Kawato M, Setoyama T, Suzuki R (1988) Feedback-error-learning neural network for trajectory control of a robotic manipulator. Neural Netw 1(3):251–265
Nakanishi J, Schaal S (2004) Feedback error learning and nonlinear adaptive control. Neural Netw 17(10):1453–1465
Sugimoto K, Alali B, Hirata K (2008) Feedback error learning with insufficient excitation. In: IEEE Conference on Decision and Control, pp. 714–719. IEEE
Uchibe E (2018) Cooperative and competitive reinforcement and imitation learning for a mixture of heterogeneous learning modules. Front Neurorobot. 12:61
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Kobayashi T (2022) Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization. Neural Netw 152:169–180
Chung J, Kastner K, Dinh L, Goel K, Courville AC, Bengio Y (2015) A recurrent latent variable model for sequential data. In: Advances in Neural Information Processing Systems, pp. 2980–2988
Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014. Citeseer
Kobayashi T, Ilboudo WEL (2021) t-soft update of target network for deep reinforcement learning. Neural Netw 136:63–71
Gallicchio C, Micheli A, Pedrelli L (2018) Design of deep echo state networks. Neural Netw 108:33–47
Kobayashi T, Murata S, Inamura T (2021) Latent representation in human-robot interaction with explicit consideration of periodic dynamics. arXiv preprint arXiv:2106.08531
Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations
Chua K, Calandra R, McAllister R, Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Advances in Neural Information Processing Systems, pp. 4754–4765
Clavera I, Fu Y, Abbeel P (2020) Model-augmented actor-critic: Backpropagating through paths. In: International Conference on Learning Representations
Hershey JR, Olsen PA (2007) Approximating the Kullback Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 317–320. IEEE
Ziyin L, Wang ZT, Ueda M (2020) Laprop: a better way to combine momentum with adaptive gradient. arXiv preprint arXiv:2002.04839
Cohen AH, Holmes PJ, Rand RH (1982) The nature of the coupling between segmental oscillators of the lamprey spinal generator for locomotion: a mathematical model. J Math Biol 13(3):345–369
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch. In: Advances in Neural Information Processing Systems Workshop
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
Elfwing S, Uchibe E, Doya K (2018) Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw 107:3–11
Takahashi H, Iwata T, Yamanaka Y, Yamada M, Yagi S (2018) Student-t variational autoencoder for robust density estimation. In: International Joint Conference on Artificial Intelligence, pp. 2696–2702
Kobayashi T (2019) Variational deep embedding with regularized Student-t mixture model. In: International Conference on Artificial Neural Networks, pp. 443–455. Springer
Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49(12):4335–4347
Ilboudo WEL, Kobayashi T, Sugimoto K (2020) Robust stochastic gradient descent with Student-t distribution based first-order momentum. IEEE Transactions on Neural Networks and Learning Systems
Kobayashi T (2021) Towards deep robot learning with optimizer applicable to nonstationary problems. In: 2021 IEEE/SICE International Symposium on System Integration (SII), pp. 190–194. IEEE
Kobayashi T (2020) Adaptive and multiple timescale eligibility traces for online deep reinforcement learning. arXiv preprint arXiv:2008.10040
Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: A next-generation hyperparameter optimization framework. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631
Coumans E, Bai Y (2016) Pybullet, a python module for physics simulation for games. Robot Mach Learn. GitHub repository
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) Openai gym. arXiv preprint arXiv:1606.01540
Kobayashi T (2020) Proximal policy optimization with relative pearson divergence. arXiv preprint arXiv:2010.03290
Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329
Catalano MG, Grioli G, Garabini M, Bonomo F, Mancini M, Tsagarakis N, Bicchi A (2011) VSA-CubeBot: A modular variable stiffness platform for multiple degrees of freedom robots. In: IEEE International Conference on Robotics and Automation, pp. 5090–5095. IEEE
Acknowledgements
Not applicable.
Funding
This work was supported by Telecommunications Advancement Foundation Research Grant and JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.
Author information
Contributions
TK proposed the algorithm and wrote this manuscript. KY developed the hardware and performed the experiments. Both authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1. Experimental video. This video summarizes all the experiments using the snake robot for forward snaking locomotion. At first, we confirmed that a constant (more specifically, maximum) stiffness failed to produce forward locomotion, which clarifies the necessity of optimizing the stiffness. At the beginning of learning, the robot naturally could not maintain forward locomotion. By learning with the proposed method, the robot achieved forward locomotion using the composed policy. Even with the decomposed FB (red) or FF (blue) policy alone, we found almost the same motion. However, when the detection failure was intentionally applied, the FB policy failed to keep the locomotion forward, while the FF policy could do so.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kobayashi, T., Yoshizawa, K. Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures. Robomech J 9, 18 (2022). https://doi.org/10.1186/s40648-022-00232-w
DOI: https://doi.org/10.1186/s40648-022-00232-w
Keywords
 Feedback-feedforward policies
 Control as inference
 Variational lower bound of stochastic dynamics
 Sensing failures