Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures

Model-free or learning-based control, in particular reinforcement learning (RL), is expected to be applied to complex robotic tasks. Traditional RL requires that the policy to be optimized is state-dependent; that is, the policy is a kind of feedback (FB) controller. Because such an FB controller requires correct state observation, it is sensitive to sensing failures. To alleviate this drawback of FB controllers, feedback error learning integrates one of them with a feedforward (FF) controller. RL can be improved by handling the FB/FF policies together, but to the best of our knowledge, a methodology for learning them in a unified manner has not been developed. In this paper, we propose a new optimization problem for optimizing both the FB/FF policies simultaneously. Inspired by control as inference, the proposed optimization problem considers minimization/maximization of divergences between trajectories: one is predicted by the composed policy and a stochastic dynamics model, and the others are inferred as optimal/non-optimal ones. By approximating the stochastic dynamics model using a variational method, we naturally derive a regularization between the FB/FF policies. In numerical simulations and a robot experiment, we verified that the proposed method can stably optimize the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that the FF policy is robust to sensing failures and can hold the optimal motion.


arXiv:2104.00385v1 [cs.LG] 1 Apr 2021

Introduction
In the last decade, the tasks (or objects) required of robots have become steadily more complex. For such next-generation robot control problems, traditional model-based control like [1] seems to reach its limit due to the difficulty of modeling complex systems. Model-free or learning-based control like [2] has been expected to resolve these problems in recent years. In particular, reinforcement learning (RL) [3] is one of the most promising approaches to this end, and indeed, RL integrated with deep neural networks [4], so-called deep RL [5], has achieved several complex tasks: e.g. human-robot interaction [6]; manipulation of deformable objects [7]; and manipulation of various general objects from scratch [8].
In principle, RL makes an agent optimize a policy (a.k.a. controller) that stochastically samples an action (a.k.a. control input) depending on the state, which results from the interaction between the agent and the environment [3]. Generally speaking, therefore, the policy to be optimized can be regarded as one of the feedback (FB) controllers. Of course, the policy is more conceptual and general than traditional FB controllers such as those for regulation and tracking, but it is still a mapping from state to action.
Such an FB policy inherits the drawback of traditional FB controllers, i.e. sensitivity to sensing failures [9]. For example, if the robot has a camera to detect an object whose pose is given as the state of RL, the FB policy would sample erroneous actions according to a wrong pose caused by occlusion. Alternatively, if the robot system is connected to a wireless TCP/IP network to sense data from IoT devices, communication loss or delay due to poor signal conditions will occur at irregular intervals, causing erroneous actions.
To alleviate this fundamental problem of the FB policy, previous studies have developed policies that do not depend only on state. In a straightforward way, a time-dependent policy has been proposed by directly adding the elapsed time to the state [10] or by utilizing recurrent neural networks (RNNs) [11,12] to approximate that policy [13]. If the policy is computed according to the phase and spectrum information of the system, instantaneous sensing failures can be ignored [14,15]. In an extreme case, if the robot learns to generate the trajectory episodically, the adaptive behavior to state is completely lost, but the robot is never affected by sensing failures.
From the perspective of traditional control theory and biology, it has been suggested that this problem of the FB policy can be resolved by a feedforward (FF) policy with feedback error learning (FEL) [16,17,18,9]. FEL is a framework in which the FF controller is updated based on the error signal of the FB controller, and finally the control objective is achieved only by the FF controller. In other words, instead of designing only a single policy as in the previous studies above, FEL has both the FB/FF policies in the system and composes their outputs appropriately to complement each other's shortcomings: the sensitivity to sensing failures in the FB policy; and the lack of adaptability to changes of state in the FF policy. The two separate policies are more compact than the integrated one. In addition, although the composition of the outputs in the previous studies is a simple summation, there is room for designing different composition rules, which makes it easier for designers to adjust which of the FB/FF policies is preferred.
The purpose of this study is to bring the benefits of FEL into the RL framework, as shown in Fig. 1. To this end, we have to solve the two challenges below.
1. Since RL is not only for the tracking problem, which is the target of FEL, we need to design how to compose the FB/FF policies.
2. Since the FB policy is not fixed, unlike in FEL, both of the FB/FF policies must be optimized simultaneously.
For the first challenge, we assume that the composed policy is designed as a mixture distribution of the FB/FF policies, since the RL policy is stochastically defined. In addition, we heuristically design its mixture ratio depending on the confidences of the respective FB/FF policies so that the more confident policy is prioritized.
For the second challenge, inspired by control as inference [19], we derive a new optimization problem that minimizes/maximizes the divergences between the trajectory predicted by the composed policy and a stochastic dynamics model, and the optimal/non-optimal trajectories. Furthermore, by designing the stochastic dynamics model with variational approximation [20], we obtain a regularization between the FB/FF policies. We expect that the skill of the FB policy, which can be optimized faster than the FF policy, will be transferred into the FF policy via this regularization.
To verify that the proposed method can optimize the FB/FF policies in a unified manner, we conduct numerical simulations for statistical evaluation and a robot experiment as a demonstration. Through the numerical simulations, we show the capability of the proposed method, namely, stable optimization of the composed policy even with a learning law different from that of traditional RL. However, the proposed method occasionally fails to learn the optimal policy. We attribute this to extreme updates of the FF policy (or its RNNs) in a wrong direction. In addition, after training in the robot experiment, we clarify the value of the proposed method: the optimized FF policy robustly samples valuable actions under sensing failures even when the FB policy fails to achieve the optimal behavior.

Reinforcement learning
In RL [3], an agent interacts with an unknown environment using action a ∈ A sampled from policy π. The environment returns the result of the interaction as state s ∈ S and evaluates it according to reward function r(s, a) ∈ ℝ. The optimization problem of RL is to find the optimal policy π* that maximizes the return R, i.e. the sum of future rewards from the current time t. RL generally assumes that the environment follows a Markov process, i.e. the next state s′ is sampled from s′ ∼ p_e(s′ | s, a). By additionally limiting the policy to π(a | s), a Markov decision process (MDP) is satisfied. In that case, RL can be illustrated as the agent-environment loop at the top of Fig. 2. However, in practical use, measurement of state suffers from delay (e.g. due to overload in the communication networks) and/or loss (e.g. occlusion in camera sensors), as suggested at the bottom of Fig. 2. To solve this problem, this paper proposes a new method to optimize the FB/FF policies in a unified manner by formulating them without necessarily requiring an MDP.
In the conventional RL under MDP, the expected value of R is functionalized as V(s), the (state) value function, and Q(s, a), the (state-)action value function, and V can be learned via the temporal-difference (TD) error δ, as in eq. (2).
Note that Q can also be learned with a similar equation, although we do not use Q directly in this paper.
Based on δ, an actor-critic algorithm [21] updates π according to the policy gradient in eq. (3), where the expectation E_{p_e π}[·] is approximated by the Monte Carlo method.
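As a concrete illustration of the quantities above, the following minimal sketch computes a discounted return and a one-step TD error. It assumes a discount factor `gamma` and the standard form δ = r + γV(s′) − V(s); neither the symbol γ nor this exact form survives in the text, so both are assumptions, and the tabular `value_update` is only a stand-in for a gradient step on a value network.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards from the first entry, discounted by gamma (assumed)."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def td_error(reward, value_s, value_next, gamma=0.99, terminal=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s) (assumed standard form)."""
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s


def value_update(value_s, delta, lr=0.1):
    """Tabular-style update that moves V(s) toward the TD target."""
    return value_s + lr * delta


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    delta = td_error(reward=1.0, value_s=0.5, value_next=1.0, gamma=0.9)
    print(delta, value_update(0.5, delta))
```

A positive δ indicates that the sampled action turned out better than the current value estimate, which is why the actor-critic update uses δ to weight the log-likelihood gradient of the policy.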

Introduction of optimality variable in control as inference
Recently, RL has been reinterpreted as an inference problem, so-called control as inference [19]. This extension of interpretation is realized by introducing an optimality variable, o ∈ {0, 1}, which represents whether the current state s and action a are optimal (o = 1) or not (o = 0). Since it is defined as a random variable, its probability, p(o = 1 | s, a), is parameterized by reward r to connect the conventional RL with this interpretation, where c = max(r) is subtracted so that exp(r(s, a) − c) ≤ 1, and τ denotes the hyperparameter that clarifies uncertainty and can be adaptively tuned.
Furthermore, by considering the optimality in the future as O, we can connect this formulation with the conventional value functions. Specifically, the probabilities of O given s and given (s, a) can be derived in terms of V and Q (eqs. (5) and (6)), where C = max(V) = max(Q) theoretically, although its specific value is generally unknown.
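The probabilities described in this section can be written compactly. The block below is a reconstruction from the surrounding definitions (c = max(r), C = max(V) = max(Q), temperature τ) and from the factor exp{(V − C)τ^{-1}} used later in the derivation; the placement of τ in the first line is an assumption.

```latex
\begin{align}
  p(o = 1 \mid s, a) &= \exp\!\left(\frac{r(s, a) - c}{\tau}\right), \\
  p(O = 1 \mid s)    &= \exp\!\left(\frac{V(s) - C}{\tau}\right), \\
  p(O = 1 \mid s, a) &= \exp\!\left(\frac{Q(s, a) - C}{\tau}\right),
\end{align}
% with p(\cdot = 0 \mid \cdot) = 1 - p(\cdot = 1 \mid \cdot) in each case.
```

Since r ≤ c, V ≤ C, and Q ≤ C, each exponent is non-positive, so every expression is a valid probability in (0, 1].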
In this way, the optimality can be treated in probabilistic inference problems, facilitating integration with Bayesian inference and other methods. This paper utilizes this property to derive a new optimization problem, as described later.

Variational recurrent neural network
To model the state transition probability (i.e. p_e) with a stochastic dynamics model, we derive a method to learn it based on the variational recurrent neural network (VRNN) [20]. Therefore, in this section, we briefly introduce the VRNN.
The VRNN considers the maximization problem of the log likelihood of a prediction model of observation (s in the context of RL), p_m. s is assumed to be stochastically decoded from a lower-dimensional latent variable z, and z is also sampled according to the history of s, h^s, through the time-dependent prior p(z | h^s). Here, h^s is generally approximated by recurrent neural networks, and this paper employs deep echo state networks [22] for this purpose. Using Jensen's inequality, a variational lower bound, L_vrnn, is derived (eq. (7)), where p(s | z) and q(z | s, h^s) denote the decoder and encoder, respectively, and KL(· ‖ ·) is the Kullback-Leibler (KL) divergence between two probability distributions. L_vrnn is minimized via the optimization of p_m, which consists of p(s | z), q(z | s, h^s), and p(z | h^s). Note that, in the original implementation [20], the decoder also depends on h^s, but that is omitted in the above derivation for simplicity and for aggregating time information into z. In addition, the strength of regularization by the KL term can be controlled by following β-VAE [23] with a hyperparameter β ≥ 0.
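The bound described above has the standard variational form. The block below is a reconstruction from the definitions in this section (decoder p(s | z), encoder q(z | s, h^s), time-dependent prior p(z | h^s), weight β), written as a loss to be minimized; the sign convention is an assumption chosen to match the statement that L_vrnn is minimized.

```latex
\begin{align}
  \ln p_m(s \mid h^s)
    &= \ln \int p(s \mid z)\, p(z \mid h^s)\, dz \nonumber \\
    &\geq \mathbb{E}_{q(z \mid s, h^s)}\!\left[ \ln p(s \mid z) \right]
       - \mathrm{KL}\!\left( q(z \mid s, h^s) \,\|\, p(z \mid h^s) \right), \\
  \mathcal{L}_{\mathrm{vrnn}}
    &= -\mathbb{E}_{q(z \mid s, h^s)}\!\left[ \ln p(s \mid z) \right]
       + \beta\, \mathrm{KL}\!\left( q(z \mid s, h^s) \,\|\, p(z \mid h^s) \right).
\end{align}
```

With β = 1 the loss equals the negative of the lower bound; other values of β only rescale the regularization toward the prior.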

Derivation of proposed method

Overview
The outputs of the FB/FF policies should eventually coincide, but it is unclear how they will be updated if we directly optimize the composed policy according to the conventional RL. In this paper, we propose a unified optimization problem in which the FB/FF policies naturally coincide and the composed one is properly optimized. The key points of the proposed method are twofold:
1. The trajectory predicted with the stochastic dynamics model and the composed policy is expected to be close to/away from the optimal/non-optimal trajectories inferred with the optimality variable.
2. The stochastic dynamics model is trained via its variational lower bound, which naturally generates a soft constraint between the FB/FF policies.
Here, as an additional preliminary, we define the FB, FF, and composed policies mathematically: π_FB(a | s); π_FF(a | h^a); and their mixture distribution (eq. (8)), respectively, where w ∈ [0, 1] denotes the mixture ratio of the FB/FF policies. That is, for generality, the outputs of the FB/FF policies are composed by a stochastic switching mechanism, rather than a simple summation as in FEL [16]. Note that since the history of action, h^a, can be updated without s, the FF policy is naturally robust to sensing failures.
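The stochastic switching mechanism can be sketched as follows, reading the mixture as π = w π_FB + (1 − w) π_FF so that a larger w favors the FB policy. The function name, the callables `sample_fb` and `sample_ff`, and the Gaussian stand-ins in the usage part are hypothetical and only illustrate the sampling rule.

```python
import random


def sample_composed_action(sample_fb, sample_ff, w, rng=random):
    """Sample from pi = w * pi_FB + (1 - w) * pi_FF by stochastic switching.

    sample_fb / sample_ff are zero-argument callables (hypothetical helpers)
    that draw one action from the FB and FF policies, respectively.
    Returns the action and a flag telling whether the FB policy was used.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("mixture ratio w must lie in [0, 1]")
    use_fb = rng.random() < w
    action = sample_fb() if use_fb else sample_ff()
    return action, use_fb


if __name__ == "__main__":
    rng = random.Random(0)
    # Gaussian stand-ins; a real FB policy would condition on the state s
    # and a real FF policy on the action history h^a.
    fb = lambda: rng.gauss(0.5, 0.1)
    ff = lambda: rng.gauss(0.4, 0.2)
    picks = [sample_composed_action(fb, ff, w=0.7, rng=rng)[1] for _ in range(10000)]
    print(sum(picks) / len(picks))  # fraction of FB samples, close to w
```

Because only the selected component is sampled, the FF branch never needs the observed state, which is what makes it usable when sensing fails.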

Inference of optimal/non-optimal policies
First of all, we infer the optimal policy, which yields the optimal trajectory by interacting with the real environment p_e, and the non-optimal policy, which on the contrary causes the non-optimal trajectory. With eqs. (5) and (6), the policy conditioned on O, π*(a | s, h^a, O), can be derived through Bayes' theorem, where b(a | s, h^a) denotes the sampler distribution (e.g. the composed policy with old parameters or one approximated by target networks [24]).
By substituting {0, 1} for O, the optimal policy, π+, and the non-optimal policy, π−, are obtained. Although it is difficult to sample actions from these policies directly, they can be utilized for the analysis in the next section.
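The Bayesian step can be made explicit. The block below is a reconstruction that assumes the exponential forms p(O = 1 | s) = exp((V(s) − C)/τ) and p(O = 1 | s, a) = exp((Q(s, a) − C)/τ) for eqs. (5) and (6); under that assumption the unknown C cancels in π+ but not in π−.

```latex
\begin{align}
  \pi^{*}(a \mid s, h^a, O)
    &= \frac{p(O \mid s, a)\, b(a \mid s, h^a)}{p(O \mid s)}, \\
  \pi^{+}(a \mid s, h^a)
    &= \exp\!\left(\frac{Q(s, a) - V(s)}{\tau}\right) b(a \mid s, h^a)
       \qquad (O = 1), \\
  \pi^{-}(a \mid s, h^a)
    &= \frac{1 - \exp\!\left(\frac{Q(s, a) - C}{\tau}\right)}
            {1 - \exp\!\left(\frac{V(s) - C}{\tau}\right)}\, b(a \mid s, h^a)
       \qquad (O = 0).
\end{align}
```

Both policies are reweightings of the sampler b, so expectations under them can be estimated from actions sampled by b.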

Optimization problem for optimal/non-optimal trajectories
With the composed policy, π, and the stochastic dynamics model, given as p_m(s′ | s, a, h^s, h^a), a part of the trajectory is predicted as p_m π. As a reference, we can consider the part of the trajectory with π* in eq. (9) and the real environment, p_e, as p_e π*. The degree of divergence between the two can be evaluated by the KL divergence KL(p_e π* ‖ p_m π), where the term ln p_e π* inside the expectation is excluded since it is not related to the learnable p_m and π. The expectation with respect to p_e and b can be approximated by the Monte Carlo method; namely, we can optimize p_m and π using the above KL divergence with the appropriate conditions of O.
As the conditions, our optimization problem requires that p_m π be close to p_e π+ (i.e. the optimal trajectory) and away from p_e π− (i.e. the non-optimal trajectory), as shown in Fig. 3. The specific loss function to be minimized, L_traj, is therefore the difference of the two KL divergences, where 1 − exp{(V − C)τ^{-1}} and τ are multiplied to eliminate the unknown C and to scale the gradient at δ = 0 to be one, respectively. Note that the derived result is similar to eq. (3), but with a different coefficient from δ and a different sampler from π.
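The way the two multipliers remove C can be checked by a short derivation. It is a reconstruction under stated assumptions: the exponential forms of p(O = 1 | s) and p(O = 1 | s, a) in terms of V, Q, C, and τ, and the approximation of Q − V by the TD error δ.

```latex
\begin{align}
  \mathrm{KL}(p_e \pi^{*} \,\|\, p_m \pi)
    &\propto -\mathbb{E}_{p_e b}\!\left[
        \frac{p(O \mid s, a)}{p(O \mid s)} \left( \ln p_m + \ln \pi \right) \right], \\
  \mathrm{KL}(p_e \pi^{+} \,\|\, p_m \pi) - \mathrm{KL}(p_e \pi^{-} \,\|\, p_m \pi)
    &\propto -\mathbb{E}_{p_e b}\!\left[ \left(
        e^{(Q - V)/\tau}
        - \frac{1 - e^{(Q - C)/\tau}}{1 - e^{(V - C)/\tau}}
      \right) \left( \ln p_m + \ln \pi \right) \right].
\end{align}
% Multiplying by \tau \left(1 - e^{(V - C)/\tau}\right) > 0:
\begin{align}
  \tau \left[ \left(1 - e^{(V - C)/\tau}\right) e^{(Q - V)/\tau}
              - \left(1 - e^{(Q - C)/\tau}\right) \right]
    &= \tau \left( e^{(Q - V)/\tau} - 1 \right), \\
  \mathcal{L}_{\mathrm{traj}}
    &= -\mathbb{E}_{p_e b}\!\left[
        \tau \left( e^{\delta/\tau} - 1 \right)
        \left( \ln p_m + \ln \pi \right) \right],
       \qquad \delta \simeq Q - V.
\end{align}
```

The coefficient τ(exp(δ/τ) − 1) vanishes at δ = 0 and has unit slope there, which matches the stated role of the factor τ.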

Stochastic dynamics model with variational lower bound
Eq. (13) includes ln p_m, i.e. the stochastic dynamics model, which therefore has to be modeled. Indeed, we found that a model based on the VRNN [20] shown in eq. (7) naturally yields an additional regularization between the FB/FF policies.
In addition, such a method can be regarded as one for extracting latent Markovian dynamics in problems for which an MDP is not established in the observed state, and is similar to the latest model-based RL [25,26].
Specifically, we consider the dynamics of latent variable z as z′ = f(z, a), with f a learnable function, and a can be sampled from the time-dependent prior (i.e. the FF policy). In that case, eq. (7) is modified through the following derivation.
Since we know the composed policy π is the mixture of the FB/FF policies defined in eq. (8), the KL term between π and π_FF can be decomposed using variational approximation [27] and Jensen's inequality.
The general case of VAE omits the expectation operation by sampling only one z (and a in the above case) according to s. In addition, as explained before, the strength of regularization can be controlled by adding β [23]. With this fact, we can modify L_model as follows: where z ∼ q(z | s, h^s), a ∼ π(a | s, h^a), z′ = f(z, a), and β_z and β_a denote the strengths of the respective regularizations. Finally, the above L_model can be substituted into eq. (13) as −ln p_m.
As can be seen in eq. (16), the regularization between the FB/FF policies is naturally added. Its strength depends on w², that is, as the FB policy is prioritized (i.e. w is increased), this regularization is reinforced. In addition, since L_model is now inside L_traj, the regularization becomes strong only when δ is sufficiently positive, that is, when the agent knows the optimal direction for updating π. Usually, at the beginning of RL, the policy generates random actions, which make the FF policy be optimized; in contrast, the FB policy can be optimized under weak regularization (if the observation is sufficiently performed). Afterwards, if w is adaptively given (as introduced in the next section), the FB policy will be strongly connected with the FF policy. In summary, with this formulation, we can expect that the FB policy will be optimized first while the regularization is weak, and that its skill will gradually be transferred to the FF policy, as in FEL [16].

Design of mixture ratio based on policy entropy
For the practical implementation, we first design the mixture ratio w ∈ [0, 1] heuristically. As its requirements, the composed policy should prioritize whichever of the FB/FF policies has higher confidence. In addition, if the FB/FF policies are similar to each other, either can be selected. Finally, w must be computable for an arbitrary distribution model of the FB/FF policies.
As one solution for these requirements, we design w with the entropies of the FB/FF policies, H_FB and H_FF, and the L2 norm between the means of these policies, d, where β_T > 0 denotes the inverse temperature parameter, i.e. w tends to be deterministic at 0 or 1 with higher β_T, and vice versa. Note that since lower entropy means higher confidence, the negative entropies are fed into the softmax function.
If one of the entropies is sufficiently smaller than the other, w will converge to 1 or 0, prioritizing the FB or FF policy, respectively. However, if these policies output similar values on average, the robot can select actions from either policy, so the inverse temperature is adaptively lowered by d to make w converge to w ≈ 0.5.
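One way to realize these requirements is sketched below. It assumes that the effective inverse temperature is the product β_T · d, which is only one reading of "adaptively lowered by d"; the function name and argument layout are hypothetical. The two-element softmax over negative entropies is written in its equivalent logistic form.

```python
import math


def mixture_ratio(h_fb, h_ff, mean_fb, mean_ff, beta_t=1.0):
    """Weight w of the FB policy from a softmax over negative entropies.

    Assumption: the effective inverse temperature is beta_t * d, where d is
    the L2 distance between the policy means, so that similar policies
    (d -> 0) give w -> 0.5 regardless of their entropies.
    """
    d = math.sqrt(sum((m1 - m2) ** 2 for m1, m2 in zip(mean_fb, mean_ff)))
    # softmax([-H_FB, -H_FF] * beta_t * d)[0] == sigmoid(beta_t * d * (H_FF - H_FB))
    logit = beta_t * d * (h_ff - h_fb)
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # numerically stable branch for negative logits
    return e / (1.0 + e)


if __name__ == "__main__":
    # FB policy is more confident (lower entropy) and the means differ: w near 1.
    print(mixture_ratio(-2.0, 1.0, [0.0], [1.0], beta_t=10.0))
    # Same entropies but identical means: either policy may be used, w = 0.5.
    print(mixture_ratio(-2.0, 1.0, [0.3], [0.3], beta_t=10.0))
```

Only entropies and means are needed, so the rule stays computable for any policy distribution that exposes those two quantities.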

Partial cut of computational graph
In general, a VAE-based architecture keeps the computational graph of latent variable z, which gives paths for backpropagation, by the reparameterization trick. If this trick is applied to a in our dynamics model as it is, the policy π will be updated toward improving the prediction accuracy, not toward maximizing the return, which is the original purpose of policy optimization in RL.
To mitigate such wrong updates of π while preserving the capability to backpropagate the gradients to the whole network as in VAE, we partially cut the computational graph, where η denotes the hyperparameter that sets how much of the graph is kept, and the remaining part of a is cut from the computational graph and treated as a mere value.
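A partial cut of this kind is commonly written as a convex combination of the differentiable action and a detached copy. The block below is a sketch under that assumption; the exact expression used by the authors is not recoverable from the text, and the symbols \tilde{a} and \hat{a} are introduced here only for illustration.

```latex
\begin{align}
  \tilde{a} &= \eta\, a + (1 - \eta)\, \hat{a},
  \qquad \hat{a} = \text{value of } a \text{ with no computational graph}, \\
  \frac{\partial \tilde{a}}{\partial a} &= \eta,
  \qquad \tilde{a} = a \ \text{in value}.
\end{align}
```

The forward value is unchanged, while the gradient flowing from the dynamics model into the policy is scaled by η, so η = 0 removes the path entirely and η = 1 recovers the plain reparameterization trick.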

Auxiliary loss functions
As can be seen in eq. (17), if δ < 0, −L_model will be minimized, reducing the prediction accuracy of the dynamics. As for the policy, it is desirable to have a sign reversal of its loss according to δ to determine whether the update direction is good or bad. On the other hand, since the dynamics model should ideally have a high prediction accuracy for any state, this update rule may cause the failure of optimization.
In order not to reduce the prediction accuracy, we add an auxiliary loss function. We focus on the fact that the coefficient in eq. (17), τ(exp(δτ^{-1}) − 1), is bounded from below, and its lower bound can be found analytically to be −τ as δ → −∞. That is, by adding τ L_model as the auxiliary loss function, the dynamics model is always updated toward higher prediction accuracy, while its update amount is still weighted by exp(δτ^{-1}).
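The properties of this coefficient can be checked numerically. The sketch below uses hypothetical function names and evaluates τ(exp(δ/τ) − 1) together with the weight that remains on L_model after the auxiliary term τ L_model is added.

```python
import math


def traj_coefficient(delta, tau=1.0):
    """Coefficient tau * (exp(delta / tau) - 1) on the log-likelihood terms."""
    return tau * math.expm1(delta / tau)


def model_weight(delta, tau=1.0):
    """Weight on L_model after adding the auxiliary loss tau * L_model.

    Equals tau * exp(delta / tau), which is never negative.
    """
    return traj_coefficient(delta, tau) + tau


if __name__ == "__main__":
    tau = 0.5
    for delta in (-10.0, -1.0, 0.0, 1.0):
        print(delta, traj_coefficient(delta, tau), model_weight(delta, tau))
```

The coefficient is zero at δ = 0 with unit slope, approaches −τ for strongly negative δ, and the combined weight on L_model stays non-negative, so the dynamics model is never pushed toward lower prediction accuracy.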
To update the value function, V, the conventional RL uses eq. (2). Instead, we found that the minimization problem of the KL divergence between p(O | s, a) and p(O | s) yields a loss function, L_value, similar to eq. (17).
Note that, in this formula (and eq. (17)), δ has no computational graph for backpropagation, i.e. it is merely a coefficient.
Finally, the loss function to be minimized for updating π (i.e. π_FB and π_FF), V, and p_m is the combination of L_traj, L_value, and L_model, which are given in eqs. (17), (20), and (16), respectively. This loss function can be minimized by one of the stochastic gradient descent (SGD) methods like [28].

Objective
We verify the validity of the proposed method derived in this paper. This verification is done through a numerical simulation of a cart-pole inverted pendulum and an experiment on forward locomotion of a snake robot driven by central pattern generators (CPGs) [29]. Four specific objectives are listed below.
1. Through the simulation and the robot experiment, we verify that the proposed method can optimize the composed policy, and we also reveal its optimization process.
2. By comparing the successful and failing cases in the simulation, we clarify an open issue of the proposed method.
3. We compare the two behaviors obtained with the decomposed FB/FF policies to make sure there is little difference between them.
4. By intentionally causing sensing failures in the robot experiment, we illustrate the sensitivity/robustness of the FB/FF policies to them, respectively.

Setup of proposed method
The network architecture for the proposed method is designed using PyTorch [30], as illustrated in Fig. 4. All the modules (i.e. the encoder q(z | s, h^s), decoder p(s | z), time-dependent prior p(z | h^s), dynamics f(z, a), value function V(s), and the FB/FF policies π_FB(a | s), π_FF(a | h^a)) are represented by three fully connected layers with 100 neurons each. For these layers, we apply layer normalization [31] followed by the Swish activation function [32]. To represent the histories, h^s and h^a, as mentioned before, we employ deep echo state networks [22] (three layers with 100 neurons each). The probability density function output by every stochastic model is given as a Student-t distribution with reference to [33,34,35]. To optimize the above network architecture, a robust SGD, i.e. LaProp [28] with t-momentum [36] and d-AmsGrad [37] (so-called td-AmsProp), is employed with its default parameters except the learning rate. In addition, the optimization of V and π is accelerated by using adaptive eligibility traces [38] and stabilized by using a t-soft target network [24].
The parameters for the above implementation, including those unique to the proposed method, are summarized in Table 1. Many of these were empirically adjusted based on values from previous studies. Because of the large number of parameters involved, the influence of these parameters on the behavior of the proposed method is not examined in this paper. However, it should be remarked that a meta-optimization of them can be easily performed with packages such as Optuna [39], although such a meta-optimization requires a great deal of time.

Simulation for statistical evaluation
For the simulation, we employ the PyBullet dynamics engine wrapped by OpenAI Gym [40,41]. We attempt to solve the task (a.k.a. environment) InvertedPendulumBullet-v0, in which a cart tries to keep a pole standing on it. With different random seeds, 30 trials of 300 episodes each are performed.
First of all, we depict the learning curves of the score (a.k.a. the sum of rewards) and the mixture ratio in Fig. 5. Since five of the trials were obvious failures, for further analysis, we separately depict Failure (5) for the five failures and Success (25) for the remaining successful trials. We can see in the successful trials that the agent could solve this balancing task stably after 150 episodes, even with stochastic actions. Afterwards, stabilization progressed further and the composed policy became more deterministic, and in the end, the task was almost certainly accomplished by the proposed method in the 25 successful trials.
Focusing on the mixture ratio, the FB policy was dominant in the early stages of learning, as expected. Then, as the episodes passed, the FF policy was optimized toward the FB policy, and the mixture ratio gradually moved toward 0.5. Finally, it seems to have converged to around 0.7, suggesting that the proposed method is basically dominated by the FB policy under stable observation.
Although all the trials obtained almost the same curves until 50 episodes in both figures, the scores of the failure trials then suddenly decreased. In addition, probably due to the failure of optimization of the FF policy, the mixture ratio in the failure trials stayed at almost 1. It is necessary to clarify the cause of this apparent difference from the successful trials, i.e. the open issue of the proposed method.
To this end, we decompose the mixture ratio into the distance between the FB/FF policies, d, and the entropies of the respective policies, H_FB and H_FF, in Fig. 6. Extreme behavior can be observed around the 80th episode in d and H_FF. This suggests that the FF policy (or its base RNNs) was updated excessively in a wrong direction and could not recover from there. As a consequence, the FB policy was also constantly regularized toward the FF policy, i.e. in the wrong direction, causing the failures of the balancing task. Indeed, H_FB gradually increased toward H_FF. In summary, the proposed method lacks stabilization of the learning of the FF policy (or its base RNNs). This is, however, expected to be improved by suppressing the amount of policy updates as in the latest RL [42], regularizing the RNNs [43], and/or improving the initialization of the FF policy.

Robot experiment
The following robot experiment is conducted to illustrate the practical value of the proposed method. Since the statistical properties of the proposed method are verified via the above simulation, we analyze one successful case here.

Setup of robot and task
The snake robot used in this experiment is shown in Fig. 7. This robot has eight Qbmove actuators developed by QbRobotics, which can control their stiffness at the hardware level, i.e. they are variable stiffness actuators (VSAs) [44]. As can be seen in the figure, all the actuators are serially connected and mounted on casters so that the robot is easily driven by snaking locomotion. On the head of the robot, an AR marker is attached to detect its coordinates using a camera (ZED2 developed by Stereolabs).
To generate the primitive snaking locomotion, we employ CPGs [29] as mentioned before. Each CPG follows Cohen's model with a sine function, where ζ_i denotes the internal state, and θ_i corresponds to the reference angle of the i-th actuator. α, u^r_i, u^η_i, and u^A_i denote the internal parameters of this CPG model. For all the CPGs (one per actuator), we set the same parameters, α = 2, u^r_i = 10, u^η_i = 1, and u^A_i = π/4, respectively. dt is the discrete time step and is set to 0.02 s.
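A chain of coupled phase oscillators in the spirit of Cohen's model can be sketched as follows. The update rule is an assumption, since the model equation is not reproduced above: u^r is read as the intrinsic angular velocity, u^η as the desired phase lag between neighboring actuators, α as the coupling gain, and u^A as the output amplitude. The function names are hypothetical.

```python
import math


def cpg_step(zeta, alpha=2.0, u_r=10.0, u_eta=1.0, dt=0.02):
    """One Euler step of a nearest-neighbor phase-oscillator chain (assumed form).

    Each oscillator advances at rate u_r and is pulled by sine coupling
    toward a phase lag of u_eta behind its predecessor in the chain.
    """
    n = len(zeta)
    new_zeta = []
    for i in range(n):
        coupling = 0.0
        if i > 0:
            coupling += math.sin(zeta[i - 1] - zeta[i] - u_eta)
        if i < n - 1:
            coupling += math.sin(zeta[i + 1] - zeta[i] + u_eta)
        new_zeta.append(zeta[i] + dt * (u_r + alpha * coupling))
    return new_zeta


def reference_angles(zeta, u_a=math.pi / 4):
    """Reference joint angles theta_i = u_A * sin(zeta_i) (assumed output map)."""
    return [u_a * math.sin(z) for z in zeta]


if __name__ == "__main__":
    zeta = [0.0] * 8  # eight actuators, all starting in phase
    for _ in range(2000):
        zeta = cpg_step(zeta)
    lags = [zeta[i] - zeta[i + 1] for i in range(7)]
    print(lags)                    # phase lag between neighboring oscillators
    print(reference_angles(zeta))  # joint references bounded by pi / 4
```

In the phase-locked state every coupling term vanishes, so all oscillators advance by u_r · dt per step and the joint references form a traveling wave along the body.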
Even with this CPG model, the robot has room for optimization of the stiffness of each actuator, k_i. Therefore, the proposed method is applied to the optimization of k_i ∈ [0, 1] (i = 1, 2, . . ., 8). Let us introduce the state and action spaces of the robot.
As for the state of the robot s, the robot observes the internal state of each actuator: angle θ_i; angular velocity θ̇_i; torque τ_i; and stiffness k_i (which differs from the command value due to control accuracy). To evaluate its locomotion, the coordinates of its head, x and y, are additionally observed (see Fig. 8). In addition, as mentioned before, the action of the robot a is set to be the commands of k_i. In summary, s is 34-dimensional (four quantities for each of the eight actuators plus x and y) and a is 8-dimensional.
For the definition of the task, i.e. the design of the reward function, we consider forward locomotion. Since the primitive motion is already generated by the CPG model, this task can be accomplished only by restraining the sideward deviation, and we define the reward function accordingly.
The proposed method learns the composed policy for the above task. At the beginning of each episode, the robot is initialized at the same place with θ_i = 0 and k_i = 0.5. Afterwards, the robot starts to move forward, and if it goes outside the observable area (including a goal) or spends 2000 time steps, the episode is terminated. We ran 100 episodes in total.

Learning results
We depict the learning curves of the score (a.k.a. the sum of rewards) and the mixture ratio in Fig. 9. Note that a moving average with a window size of 5 is applied to make the learning trends easier to see. From the score, we can say that the proposed method improved the straightness of the snaking locomotion. Indeed, Fig. 10, which illustrates snapshots of the experiment before and after learning, clearly indicates that the robot succeeded in forward locomotion only after learning.
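The smoothing applied to these curves can be reproduced with a few lines. The sketch below uses a trailing window; the edge handling (averaging over the samples seen so far during the first few episodes) is an assumption, since the text only states the window size.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries average over the samples seen so far."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    scores = [1, 2, 3, 4, 5, 6]
    print(moving_average(scores, window=5))
```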
As in the successful trials in Fig. 5, the mixture ratio in this experiment also increased at first, and afterwards, the FF policy was optimized, reducing the mixture ratio toward 0.5 (although it converged to around 0.7). We found the additional feature that during episodes 10-30, probably when the transfer of skill from the FB policy to the FF policy was active, the score temporarily decreased. This would be due to the increased frequency of use of the still non-optimal FF policy, resulting in erroneous behaviors. After that period, however, the score became stably high, and we expect that the above skill transfer was almost complete and that optimal actions could be sampled even from the FF policy.

Demonstration with learned policies
To confirm the accomplishment of the skill transfer, after the above learning, we apply the decomposed FB/FF policies individually to the robot. At the top of Fig. 11, we show the overlapped snapshots (the red/blue robots correspond to the FB/FF policies, respectively). With the FF policy, of course, errors in the initial state gradually increased and accumulated, so the two results can never be completely consistent. However, the difference at the goal was only a few centimeters. This result suggests that the skill transfer from the FB policy to the FF policy has been achieved as expected, although there is room for further performance improvement.
Finally, we emulate a sensing failure in detecting the AR marker on the head. When the robot is in the left side of the video frame, the detection of the AR marker is forced to fail and returns wrong (and constant) x and y. In that case, the FB policy would collapse, while the FF policy is never affected by the sensing failure. At the bottom of Fig. 11, we show the overlapped snapshots, where the left side with the sensing failure is shaded. Until the robot left the left side, the locomotion obtained by the FB policy drifted toward the front of the video frame, and it was apparent that the robot could not recover before reaching the goal.
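The emulated failure amounts to replacing the measured head coordinates with a fixed wrong reading inside a region of the image. The sketch below illustrates this; the function name, the predicate `detection_fails`, and the choice of x < 0 for "the left side of the video frame" are hypothetical.

```python
def observe_head(true_xy, detection_fails, stale_xy):
    """Head coordinates as seen by the policy.

    detection_fails is a hypothetical predicate marking the region where
    AR-marker detection is forced to fail; stale_xy is the constant, wrong
    reading returned inside that region.
    """
    return stale_xy if detection_fails(true_xy) else true_xy


if __name__ == "__main__":
    # Assumption: the left side of the video frame corresponds to x < 0.
    left_side = lambda xy: xy[0] < 0.0
    for xy in [(-0.5, 0.10), (-0.1, 0.20), (0.3, 0.25)]:
        print(xy, "->", observe_head(xy, left_side, stale_xy=(0.0, 0.0)))
```

An FB policy receives the corrupted pair as part of its state, whereas an FF policy driven by the action history never reads it, which is the contrast the experiment is designed to expose.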
In detail, Fig. 12 illustrates the stiffness during this test. Note that the vertical axis is the unbounded version of k_i, which can be converted into the original k_i through the sigmoid function. As can be seen in the figure, the sensing failure clearly affected the outputs of the FB policy, while the FF policy ignored it and produced periodic outputs. Although this test is a proof of concept, it clearly shows the sensitivity/robustness of the FB/FF policies to sensing failures that may occur in a real environment. We therefore conclude that a framework that can learn both the FB/FF policies in a unified manner, such as the proposed method, is useful in practice.

Conclusion
In this paper, we derived a new optimization problem for both the FB/FF policies in a unified manner. Its point is to consider minimization/maximization of the KL divergences between the trajectory predicted by the composed policy and the stochastic dynamics model, and the optimal/non-optimal trajectories inferred based on control as inference. With the composed policy as a mixture distribution, the stochastic dynamics model approximated by a variational method yields a soft regularization, i.e. the cross entropy between the FB/FF policies. In addition, by designing the mixture ratio to prioritize the policy with higher confidence, we can expect that the FB policy is optimized first, since its state dependency can easily be found, and that its skill is then transferred to the FF policy via the regularization. Indeed, the numerical simulation and the robot experiment verified that the proposed method can stably solve the given tasks, that is, it has the capability to optimize the composed policy even with a learning law different from that of traditional RL. In addition, we demonstrated that, using our method, the FF policy can be appropriately optimized to generate behavior similar to that of the FB policy. As a proof of concept, we finally illustrated the robustness of the FF policy to sensing failures when the AR marker could not be detected.
However, we also found that the FF policy (or its base RNNs) occasionally failed to be optimized due to extreme updates in the wrong direction. To alleviate this problem, in the near future, we need to make the FF policy update conservatively, for example, using a soft regularization toward its prior. With more stable learning capability, the proposed method will be applied to various robotic tasks that are prone to sensing failures.

Figure 3 Trajectory optimization problem: the trajectory can be predicted with the composed policy and the stochastic dynamics model; the optimal/non-optimal trajectories can be inferred with the optimal/non-optimal policies and the true state transition probability; the predicted trajectory is desired to be close to the optimal trajectory and away from the non-optimal trajectory; the divergence between trajectories can be represented by the KL divergence.

Figure 12 During the sensing failures, the FB policy output obviously erroneous stiffness; in contrast, the FF policy could hold the periodic outputs; note that the phase and amplitude deviations in the area without the sensing failures can be attributed to incomplete skill transfer and recovery attempts from lateral deviation.

Figure 4 Network architecture of the proposed method: it contains seven modules for the encoder q(z | s, h_s), decoder p(s | z), time-dependent prior q(z | h_s), dynamics f(z, a), value function V(s), and the FB/FF policies π_FB(a | s), π_FF(a | h_a) with two RNN features, h_s and h_a; π_FB and π_FF are composed as π, while being regularized between each other.

Figure 5 Simulation results: 30 trials were divided into 5 failure and 25 successful cases; around 150 episodes, the proposed method mostly succeeded in balancing the pole on the cart, mainly using the FB policy, as shown by the mixture ratio close to 1; afterwards, the composed policy was made deterministic with further stabilization; during that time, the skill of the FB policy was probably transferred to the FF policy, as can be seen in the decrease of the mixture ratio.

Figure 6 Decomposition of mixture ratio: 30 trials were divided into 5 failure and 25 successful cases; around the 80th episode in the five failure cases, d and H_FF suddenly jumped to higher values; this suggests wrong updates of the FF policy (or its base RNNs); as a result of this erroneous behavior, H_FB was pulled in the wrong direction by the FF policy, thereby resulting in the failures of the balancing task.

Figure 7 Snake robot with eight VSAs serially connected: as its actuator, we use Qbmove developed by QbRobotics, which can control its stiffness; this robot is on casters to easily drive forward by snaking locomotion, the base of which is generated by CPGs.

Figure 8 Experimental field: at the top of this field, a camera is placed to detect the robot head by the AR marker; by controlling the stiffness of each actuator, the robot tries to move forward, i.e. in the x-direction.

Figure 9 Experimental results: for visibility of learning trends, a moving average with a window size of 5 is applied; the proposed method successfully improved the straightness of the snaking motion by optimizing the stiffness; we found the skill transfer from the FB policy to the FF policy, as can be seen in the mixture ratio as well as in Fig. 5; as a remarkable point, during this transfer ([…] episodes), the score temporarily decreased, probably due to the increased frequency of use of the non-optimal FF policy.

Figure 10 Snapshots before and after learning: before learning, the initial policy failed to make the snaking locomotion go forward; in contrast, the proposed method yielded the forward locomotion using the optimized composed policy.

Figure 11 Snapshots with/without the sensing failures: the robot was controlled by the decomposed FB (red) or FF (blue) policy; without the sensing failures, both policies generated almost the same forward locomotion, which indicates proper skill transfer; with the sensing failures in detecting the AR marker, indicated as the shaded area, the FB policy drifted the robot to the side due to the wrong signal; in contrast, the FF policy could achieve the forward locomotion since it ignores the wrong signal in principle.
Figure 1 Proposed RL framework: it contains both the FB/FF policies in parallel; the policies they output are composed to sample an action; according to the reward, both the FB/FF policies are optimized in a unified manner.

Figure 2 Loop of RL with sensing failures: in general RL, an agent on the left interacts with the environment on the right through an action sampled from a policy depending on the current state; according to the state transition probability, the new state is observed with the related reward; however, in practice, state observation carries the risk of sensing failures like occlusion and packet loss (figure labels: delay and/or loss; measurement by external sensors; communication via network).