Uniaxial attitude control of uncrewed aerial vehicle with thrust vectoring under model variations by deep reinforcement learning and domain randomization

The application of neural networks for nonlinear control has been actively studied in the field of aeronautics. Techniques based on deep reinforcement learning have been demonstrated to achieve improved control performance in simulation. To transfer a controller learnt in simulation to real systems, domain randomization encourages the adaptiveness of neural networks to changing environments by training them with randomized environment parameters. This approach also applies to an extended context in which the working environment, including the model configuration, changes. In previous studies, the adaptive performance of domain-randomization-based controllers was studied in a comparative fashion over model variations. To understand the practical applicability of this feature, further studies are necessary to quantitatively evaluate the learnt adaptiveness with respect to the training conditions. This study evaluates a controller based on deep reinforcement learning and domain randomization, with a focus on its adaptive performance over model variations. The model variations were designed to allow quantitative comparisons. The control performance was examined, with a specific focus on whether the model variations fell within or exceeded the randomization range used in training.


Introduction
In aeronautics, neural network (NN)-based controllers that utilize machine learning have the potential to replace existing controllers by making control system construction more efficient and by responding flexibly to disturbances. Supervised learning is often applied to replace the role of existing controllers and is expected to provide benefits such as more efficient gain scheduling and smoother control [1]. In contrast, deep reinforcement learning is expected to be viable when a teaching controller does not exist, or when performance beyond that of the existing controller is sought [2,3].
The application of NNs for nonlinear control has been actively studied in robotics and aeronautics. In particular, simulations have demonstrated techniques that enable complex tasks such as trajectory planning [4], aircraft landing under wind-induced disturbance [5], swarm flight [6], aerobatics [7], and fixed-wing aircraft attitude control [8] by optimizing the NN through deep reinforcement learning.
However, an appropriate approach must be designed to transfer the controllers learnt in simulation to real systems. This is because deep reinforcement learning generally optimizes the NN for a single environment represented by the simulator, and the theoretical control performance is often not achieved when the real environment differs from the simulation (the reality gap), for example in friction or time delay [9]. Sim-to-real transfer is an approach that applies a simulation-generated controller to a real-world environment without compromising its performance, and it has been actively studied, specifically in the field of robotics [10]. Domain randomization is one of the sim-to-real transfer techniques. For example, domain randomization has been used to randomly change the appearance [11,12] and dynamics [13,14] of objects from episode to episode in the simulation during training. The NN is thereby optimized to adaptively control the agent in these changing environments. In aeronautics, there have been examples of domain randomization in multirotor flight control [15] and attitude control using thrust vectoring [16,17,18].
Generally, domain randomization is used to maintain theoretical control performance in the real world; however, its ability to make controllers adaptive can be extended to applications in which the working environment, including the model configuration, changes. In previous studies, sim-to-real transfer was applied to the flight control of uncrewed aerial vehicles (UAVs) that took multiple model configurations [17,18]. The control performance of the domain-randomization-based controllers was studied in a comparative fashion over multiple model variations with different physical characteristics [17,18]. It is also essential to evaluate the "working range", that is, the extent to which the controller can adapt to changes in the environment. The relationship between the range of randomization in training and the learnt tolerance to environmental variations should be examined quantitatively. Such an understanding would provide a practical guideline for applying domain-randomization-based controllers to real situations in which a certain degree of uncertainty is expected, such as payload variations, UAV failure, or damage.
The aim of this research is to apply a controller based on deep reinforcement learning and domain randomization to aircraft control systems. In the development of aircraft control systems, it is difficult to construct a precise model of the entire aircraft because of turbulence and increasingly complex systems; furthermore, verification through experiments is costly and carries high safety risks. Therefore, if control systems could be developed on simulators without the need for precise models, cost reduction and shortened development time, among other benefits, could be expected. As a first step, this study evaluates a controller based on deep reinforcement learning and domain randomization, with a focus on its adaptive performance over model variations. The model variations were designed to allow quantitative comparisons. The attitude control system for a UAV based on thrust vectoring, which was constructed in our previous research [16], was utilized. The weight, inertia, and location of the center of gravity (CoG) of the model were varied. The control performance was examined, with a specific focus on whether the model variations fell within or exceeded the randomization range used in training.

UAV with thrust vectoring
UAV with thrust vectoring and experimental system
Figure 1 shows the developed UAV with thrust vectoring, and Fig. 2 shows the system configuration. The UAV has four control outputs: two electric ducted fans (EDFs) (JP Hobby, 120 mm and 14 CELL Motor 673 KV) and two actuators (FUTABA, HPS-A700) for thrust vectoring, which deflect the thrust direction of each EDF. Two 11.1 V batteries (MATRIX, LiPo 6 s-5100 mAh 35 C) were connected in series for each of these EDFs to provide a 22.2 V power supply. The maximum thrust per EDF was 7.25 N and the maximum run time was 5 min. The operating range of the thrust vectoring is ±25°.
The UAV is controlled by a Raspberry Pi 3 Model B and a Pixhawk 4. The Pixhawk senses the UAV states (position, attitude, etc.) and controls the actuators. The Raspberry Pi performs the control computation using the NN. First, all sensor information measured by the Pixhawk is sent to the Raspberry Pi. Next, the NN implemented on the Raspberry Pi calculates the control outputs for the actuators from this sensor information, and the outputs are sent back to the Pixhawk. The operating cycles of the Pixhawk and Raspberry Pi are 200 Hz and 50 Hz, respectively.
Moreover, to enable verification of the response of the control system to changes in the CoG, weights can be mounted on the outside of the fuselage. The weight, CoG, and moment of inertia of the UAV can be varied by changing the number and mounting position of the weights. The control devices, batteries, and control equipment are fixed to a plastic fuselage created by a 3D printer. The total weight of the UAV without the additional weights is 7.0 kg. The experimental system is shown in Fig. 3. The system consists of the UAV and a uniaxial rotation experimental device. The UAV is attached to the experimental device so that it can rotate only around the pitch (Y_b) axis. If the UAV is not weighted, the distance between the CoG and the axis of rotation is +49 mm in the X-axis direction and +28 mm in the Z-axis direction. The two EDFs and the two thrust vectoring actuators are set to behave identically, which suffices for a control task in which the UAV rotates only around the pitch axis. Therefore, the control output is effectively two variables.
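The effect of mounting weights on the mass, CoG, and moment of inertia can be illustrated with a short calculation. The following is a minimal sketch, assuming each added weight can be treated as a point mass and that positions are measured from the axis of rotation; the bare-UAV mass and CoG offsets are taken from the text, whereas the 0.5 kg weight and its position are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PointMass:
    mass_kg: float
    x_mm: float  # X position relative to the axis of rotation
    z_mm: float  # Z position relative to the axis of rotation


def combine(base, weights):
    """Return the total mass [kg] and the combined CoG (x, z) [mm]."""
    parts = [base, *weights]
    total = sum(p.mass_kg for p in parts)
    x = sum(p.mass_kg * p.x_mm for p in parts) / total
    z = sum(p.mass_kg * p.z_mm for p in parts) / total
    return total, (x, z)


def added_inertia(weights):
    """Point-mass contribution m * r^2 about the axis of rotation [kg m^2]."""
    return sum(
        w.mass_kg * ((w.x_mm / 1000) ** 2 + (w.z_mm / 1000) ** 2)
        for w in weights
    )


bare_uav = PointMass(7.0, 49.0, 28.0)      # values stated in the text
weights = [PointMass(0.5, 200.0, 0.0)]     # hypothetical added weight
total_mass, cog = combine(bare_uav, weights)
extra_iyy = added_inertia(weights)
print(total_mass, cog, extra_iyy)
```

In this sketch a single weight shifts the CoG mainly along the X axis and increases the moment of inertia in proportion to the square of its distance from the axis, which is why both the number and the mounting position of the weights matter.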

Theoretical model
Since this experiment is a uniaxial rotation test without translational movement, the experimental system of the UAV is modeled in the XZ plane of the global coordinate system. The rotational motion model in the XZ plane is shown in Fig. 4, and the equation of rotational motion around the Y_b axis is given as Eq. (1), where I_yy is the fuselage moment of inertia around the Y_b axis of the aircraft body coordinates, m is the mass of the entire aircraft, g is the gravitational acceleration, F_x and F_z are the components of the thrust F in the XZ plane, and G(x_g, z_g) is the position of the CoG of the UAV with respect to the center of rotation in the XZ plane. Because the center of rotation of the experimental setup does not coincide with the CoG of the UAV, R_G, R_T, θ_G, and θ_T were defined and transformed as in Eqs. (1)-(5).
Second, since there is a time delay between the command input and the generation of thrust in the EDF, the thrust generated by the EDFs is also modeled. The counter torque of the EDF is ignored because it does not affect the motion of the UAV in this experiment. Thrust response measurements were conducted to evaluate the relationship between the throttle command and the thrust. The thruster was connected to a force sensor, and the throttle command and the generated thrust were measured in time series. The measurement results are shown in Fig. 5. The primary axis shows the thrust and the secondary axis shows the throttle opening; the solid line indicates the force sensor measurement, the dashed line indicates the filtered measurement, and the single-dot dashed line indicates the throttle opening. From the analysis of the measurement results, the transfer function of the thruster is approximated by a first-order lag model with dead time:
$$R(s) = \frac{A e^{-T_d s}}{\tau s + 1} \tag{6}$$
where R(s) is the transfer function from the throttle signal to the thrust, A is the gain for converting the throttle signal to thrust, which was obtained from the experimental results by system identification, T_d is the dead time, and τ is the time constant. For these EDF units, A is 0.58, T_d is 0.08 s, and τ is 0.19 s.
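The behavior of a first-order lag with dead time can be illustrated with a short time-domain simulation. The following is a minimal sketch, assuming forward-Euler integration with a 10 ms step and a throttle signal expressed on a 0-100 scale; both of these choices, and the function name `simulate_thrust`, are assumptions made for illustration and are not taken from the authors' simulator. Only A, T_d, and τ come from the text.

```python
A = 0.58      # throttle-to-thrust gain (from the text)
T_D = 0.08    # dead time [s] (from the text)
TAU = 0.19    # time constant [s] (from the text)


def simulate_thrust(throttle, dt=0.01):
    """Respond to a throttle sequence sampled every dt seconds.

    The dead time is modeled by reading the throttle command from
    round(T_D / dt) samples earlier; the lag is integrated with
    forward Euler (an assumed, simple discretization).
    """
    delay_steps = round(T_D / dt)
    thrust = 0.0
    response = []
    for k in range(len(throttle)):
        delayed = throttle[k - delay_steps] if k >= delay_steps else 0.0
        thrust += dt / TAU * (A * delayed - thrust)
        response.append(thrust)
    return response


# Step command held for 3 s (throttle scale of 0-100 is an assumption).
step_response = simulate_thrust([100.0] * 300)
print(step_response[7], step_response[8], step_response[-1])
```

The output stays at zero during the dead time and then rises exponentially towards A times the command, so a controller acting on the thrust sees both a pure delay and a slow rise.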
Next, the response of the vectoring servo to the input was modeled. With a vectoring nozzle attached to the servo, a step command (pulse width modulation signal) from -50% to +50% was applied, and the resulting servo angle response was measured. The servo rotates -25° when the command is set to -50% and +25° when the command is set to +50%. System identification was performed on the measurement results, and the response of the servo was modeled as a first-order lag system. The model of the servo is shown in Eq. (7):
$$G(s) = \frac{K_s}{\tau_s s + 1} \tag{7}$$
where G(s) represents the transfer function from the servo command signal to the servo angle (in radians), K_s denotes the gain with a value of 1.024, and τ_s represents the time constant with a value of 0.109.
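The servo step test described above can be reproduced numerically. The following is a minimal sketch, assuming that the command percentage is first mapped linearly to an ideal angle (±50% to ±25°), that τ_s is expressed in seconds, and that the command is held constant between samples, so that the first-order lag can be discretized exactly; the function names are hypothetical.

```python
import math

K_S = 1.024     # servo gain (from the text)
TAU_S = 0.109   # servo time constant (from the text; seconds assumed)


def command_to_angle_rad(command_percent):
    """Ideal angle for a command, assuming ±50 % maps linearly to ±25 deg."""
    return math.radians(0.5 * command_percent)


def simulate_servo(command_percent, angle0=0.0, dt=0.001, steps=1000):
    """Step response of the first-order lag for a command held constant."""
    alpha = 1.0 - math.exp(-dt / TAU_S)   # exact for a held input
    target = K_S * command_to_angle_rad(command_percent)
    angle = angle0
    trajectory = []
    for _ in range(steps):
        angle += alpha * (target - angle)
        trajectory.append(angle)
    return trajectory


# Step from the -50 % steady state to a +50 % command, as in the measurement.
start = K_S * command_to_angle_rad(-50.0)
trajectory = simulate_servo(50.0, angle0=start)
print(math.degrees(trajectory[-1]))
```

With these assumptions, the servo covers about 63% of the commanded change after one time constant, which is short compared with the thrust lag of the EDFs.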

Comparison of experimental systems and theoretical model
The differences between the experimental system and the theoretical model are evaluated. The theoretical model is implemented in a dynamics simulator developed in Python to simulate the rotational motion of the UAV. In both the experiment and the simulation, the UAV was released from an initial pitch angle and left to oscillate without external forces. The free vibration response is shown in Fig. 6.
Figure 6 shows that the responses of the experimental system and the theoretical model are different. The solid and broken lines depict the vibrations of the experimental system and the theoretical model, respectively. In the simulation, the vibration period is 6.7% shorter and the damping ratio is 32% higher than in the experiment. This difference is a typical reality gap. In the domain randomization approach, rather than tuning the theoretical model towards higher fidelity, the controller is trained to be robust to the difference.
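A comparison of this kind requires the period and damping ratio to be extracted from each free-oscillation record. The following is a minimal sketch of one common way to do this, the logarithmic decrement method applied to successive positive peaks; it is not necessarily the procedure used by the authors, and the sample signal is synthetic, with an assumed natural frequency and damping ratio.

```python
import math


def find_positive_peaks(t, y):
    """Return (time, value) pairs of local maxima above zero."""
    return [
        (t[i], y[i])
        for i in range(1, len(y) - 1)
        if y[i - 1] < y[i] >= y[i + 1] and y[i] > 0.0
    ]


def period_and_damping(t, y):
    """Estimate the damped period and damping ratio from peak decay."""
    peaks = find_positive_peaks(t, y)
    n = len(peaks) - 1                       # number of full cycles
    period = (peaks[-1][0] - peaks[0][0]) / n
    delta = math.log(peaks[0][1] / peaks[-1][1]) / n   # log decrement
    zeta = delta / math.sqrt(4.0 * math.pi ** 2 + delta ** 2)
    return period, zeta


def percent_difference(value, reference):
    """Relative difference of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


# Synthetic decaying oscillation (assumed parameters, for illustration).
ZETA, OMEGA_N = 0.05, 4.0
omega_d = OMEGA_N * math.sqrt(1.0 - ZETA ** 2)
times = [i * 0.001 for i in range(10000)]
angles = [math.exp(-ZETA * OMEGA_N * s) * math.cos(omega_d * s) for s in times]
est_period, est_zeta = period_and_damping(times, angles)
print(est_period, est_zeta)
```

Applying `period_and_damping` to both the measured and the simulated records and passing the results to `percent_difference` gives relative differences of the kind quoted above.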

Generation of controller with deep reinforcement learning
The controller was generated with the NN structure shown in Fig. 7. The inputs (states) to the NN are the "pitch angle", the "difference between the pitch angle and target pitch angle", the "pitch angle velocity", the "thrust vectoring angle", and the "thrust". Here, the two EDFs and the two thrust vectoring actuators are set to behave identically; therefore, the "thrust vectoring angle" and "thrust" are one element each. Moreover, the "pitch angle" and the "difference between the pitch angle and target pitch angle" are each separated into sine and cosine components so that their values change smoothly within ±1, resulting in a total of seven input components. The controller outputs are the "thrust vectoring angle" and "thrust force"; however, the NN outputs were designed as the rates of the thrust vectoring angle and thrust force, that is, the differences between the next and current states. With this definition, the NN outputs can be used directly in the penalty function described later.
A long short-term memory (LSTM) layer is included in the NN for effective learning under domain randomization [13,14,19]. Under domain randomization, the Markov property of the decision process is not necessarily guaranteed. Therefore, it becomes essential to use memory-augmented policies [13,14,20] or to input time series of states to the policy [21,22]. This study employs the LSTM approach, following its recent success in domain randomization [13,14]. It is assumed that the LSTM layer enables adaptive control by implicitly recognizing changing dynamics from the stored histories of the system's inputs and outputs. The hidden layers other than the LSTM use the exponential linear unit (ELU) activation function, the output layer of the policy function uses the hyperbolic tangent (tanh) activation function, and the output layer of the value function uses a linear activation function.
During the NN training, the simulation is conducted at 50 Hz for 5 s, that is, one episode consists of 250 steps. Different initial conditions (initial attitude, initial angular velocity, etc.), dynamics, and data processing delay times are given for each episode. The initial conditions are generated by uniform random numbers, with ranges of ±25° for the target pitch angle, ±180° for the pitch angle, and ±18° for the pitch angle velocity. In addition, the thrust vectoring angle and thrust force are generated in the ranges of ±25° and 8-66 N, respectively, corresponding to the range of motion of the actual UAV. The data processing delay time ranges from 50 to 300 ms, with a fixed value per episode.
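The seven-component input and the rate-type output described above can be sketched as follows. This is a minimal illustration, assuming that the pitch error is defined as pitch minus target, that the remaining quantities are passed without normalization, and that the commanded value is clipped to the actuator limits after the rate is added; these details and the function names are assumptions, not the authors' implementation.

```python
import math


def build_state(pitch, target_pitch, pitch_rate, vector_angle, thrust):
    """Assemble the seven NN inputs; angles in radians."""
    error = pitch - target_pitch   # sign convention is an assumption
    return [
        math.sin(pitch), math.cos(pitch),    # pitch angle (2 components)
        math.sin(error), math.cos(error),    # pitch error (2 components)
        pitch_rate,
        vector_angle,
        thrust,
    ]


def apply_rate(current, rate, lower, upper):
    """Add an NN output (a rate) to the current command and clip it."""
    return min(max(current + rate, lower), upper)


state = build_state(math.radians(110.0), 0.0, 0.0, 0.0, 20.0)
limit = math.radians(25.0)
next_vector_angle = apply_rate(math.radians(24.0), math.radians(3.0), -limit, limit)
print(len(state), math.degrees(next_vector_angle))
```

The sine and cosine encoding makes +180° and -180° produce the same inputs, which avoids a discontinuity when the attitude wraps around.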
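The per-episode sampling described above can be sketched as follows. This is a minimal illustration using the ranges stated in the text; the dictionary keys, the unit of the pitch angle velocity (degrees per second is assumed), and the conversion of the delay into control steps are assumptions, and the dynamics variables of Table 1 are omitted because their ranges are not reproduced here.

```python
import random

CONTROL_HZ = 50
EPISODE_SECONDS = 5
EPISODE_STEPS = CONTROL_HZ * EPISODE_SECONDS   # 250 steps per episode


def sample_episode(rng):
    """Draw one set of initial conditions and the delay for an episode."""
    delay_s = rng.uniform(0.050, 0.300)        # fixed within the episode
    return {
        "target_pitch_deg": rng.uniform(-25.0, 25.0),
        "pitch_deg": rng.uniform(-180.0, 180.0),
        "pitch_rate_deg_s": rng.uniform(-18.0, 18.0),   # unit assumed
        "vector_angle_deg": rng.uniform(-25.0, 25.0),
        "thrust_n": rng.uniform(8.0, 66.0),
        "delay_s": delay_s,
        "delay_steps": round(delay_s * CONTROL_HZ),     # assumed conversion
    }


rng = random.Random(0)   # seeded only to make the sketch repeatable
episodes = [sample_episode(rng) for _ in range(1000)]
print(EPISODE_STEPS, episodes[0])
```

Because every episode draws a new combination, the policy cannot rely on a single delay or initial condition and has to infer the current situation from its observations.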
The concept of the reward function is shown in Eq. (8), where r is the reward. The aim of the reward function is to make the pitch follow the target angle by penalizing the error Δθ between the pitch angle and the target pitch angle. By penalizing the pitch angular velocity θ̇ and the control actions, which are the thrust rate a_F and the vectoring rate a_T, oscillatory motion of the actuators is suppressed to achieve smooth and lean motion. Furthermore, by penalizing the thrust F, the priority is designed so that the attitude is controlled by the vectoring operation rather than by changes in the thrust. The weights were set through the coefficients c of the individual terms so that approximately 60% of the total penalty is contributed by Δθ and the rest is shared evenly by the other terms. Proximal policy optimization (PPO) is applied as the learning algorithm [3,14,23], where the clip threshold is 0.2 and the learning rates of the policy and value functions are 2e-4 and 5e-4, respectively. The dynamics variables were varied within the ranges in Table 1 in each episode during learning, and the controller was optimized to adapt to those variables.
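The weighting of the penalty terms can be illustrated with a small sketch. The exact form of the reward is not reproduced here, so the following assumes a negative weighted sum of absolute values, with every term already normalized to a comparable scale; the coefficient values (0.6 for the pitch error and 0.1 for each of the four other terms) are a hypothetical reading of the stated 60% share and are not the authors' coefficients.

```python
# Hypothetical coefficients: 60 % for the pitch error, the remaining
# 40 % shared evenly by the four other penalty terms.
WEIGHTS = {
    "pitch_error": 0.6,    # error between pitch and target pitch
    "pitch_rate": 0.1,     # pitch angular velocity
    "thrust_rate": 0.1,    # a_F
    "vector_rate": 0.1,    # a_T
    "thrust": 0.1,         # F
}


def reward(terms):
    """Negative weighted sum of absolute penalties (assumed form).

    `terms` maps each name in WEIGHTS to a value that is assumed to be
    normalized so that the terms are comparable in magnitude.
    """
    return -sum(WEIGHTS[name] * abs(terms[name]) for name in WEIGHTS)


def penalty_share(terms, name):
    """Fraction of the total penalty contributed by one term."""
    return WEIGHTS[name] * abs(terms[name]) / -reward(terms)


equal_terms = {name: 1.0 for name in WEIGHTS}
print(reward(equal_terms), penalty_share(equal_terms, "pitch_error"))
```

Under these assumptions, a state at the target attitude with no motion and no control action incurs only the thrust penalty, so lower thrust is preferred whenever the attitude can be held by vectoring.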

Experimental conditions
The pitch control experiments were conducted using the trained NN controller. The initial pitch angle was the equilibrium angle at which the UAV rested under its own weight, and the target pitch angle was set to 0°. As shown in Fig. 3, the upright state has a pitch angle of 0°, whereas the equilibrium state is tilted by approximately 110° under its own weight, as shown in the photo labeled "0.1s" in Fig. 9. Four variations of the UAV model were prepared with different weights. The variations in the parameters are presented in Table 2. The same NN was used to control all of these models.
As shown in Fig. 3, Model 1 was the state with no weights added, namely, the state closest to the theoretical model. Weights were added to Model 1, and the added weight increased in the order of Models 2, 3, and 4. As a quantitative reference to the randomized range in training, Model 2 was set at approximately 50% of the randomization range, Model 3 was set at approximately 100% of the range, and Model 4 exceeded the range of randomization.
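The position of each model relative to the randomization range can be expressed as a simple fraction. The following is a minimal sketch, assuming that the fraction is measured from the nominal (unweighted) value to the upper limit of the randomization range; the 7.0 kg nominal mass is from the text, whereas the upper limit and the model masses are hypothetical numbers chosen only to reproduce the 50%, 100%, and over-range pattern, and are not the values of Table 1 or Table 2.

```python
def range_fraction(value, nominal, limit):
    """Position of value between nominal (0.0) and the range limit (1.0)."""
    return (value - nominal) / (limit - nominal)


NOMINAL_MASS_KG = 7.0        # unweighted UAV (from the text)
UPPER_LIMIT_KG = 9.0         # hypothetical upper limit of randomization

# Hypothetical masses for illustration only.
model_masses = {"Model 1": 7.0, "Model 2": 8.0, "Model 3": 9.0, "Model 4": 9.5}
fractions = {
    name: range_fraction(mass, NOMINAL_MASS_KG, UPPER_LIMIT_KG)
    for name, mass in model_masses.items()
}
outside_range = [name for name, f in fractions.items() if f > 1.0]
print(fractions, outside_range)
```

The same fraction can be computed for the inertia and the CoG position, which allows the model variations to be compared with the training range on a common scale.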

Experimental results and discussion
Figure 8 shows the experimental results, and Figs. 9-12 show the behavior of the UAV during the experiment for each model. In Fig. 8, from left to right, the pitch angle, vectoring angle, and total thrust of Models 1, 2, 3, and 4 are shown in time series. Each time series covers 6 s (8 s for Model 4 only), with experimental time t = 0 s set at 0.1 s before the start of control.
Figures 8a and 9 show that in Model 1, the pitch transitioned smoothly to the target pitch angle of 0° as soon as the control started and remained static after the target attitude was reached. The thrust increased to approximately 80 N at the initial start-up, but it remained at approximately 50 N in the static state. In contrast, the vectoring angle was manipulated rapidly in the range from -25° to 20° during the initial start-up, and it continued to be adjusted quickly during the subsequent static state. This behavior of maneuvering with vectoring more than with thrust indicates that the thrust penalty had the intended effect. It is also reasonable because the thrust operation included uncertainty in the time constant, which should discourage quick control of the thrust force.
In Models 2 and 3, the attitude control was as successful as in Model 1, despite the added weights. The thrust and vectoring angle responses behaved similarly to those of Model 1; however, the peak thrust at the start of the experiment increased to 80, 100, and 130 N for Models 1, 2, and 3, respectively, as the weight increased. These experimental results suggest that the NN adapts to changes in the model and adjusts its output during control.
In Model 4, a case that exceeded the training range, the pitch angle did not reach the target angle, as shown in Fig. 12. Figure 8d shows a repeated pattern in which the controller tried to raise the pitch to the target attitude, failed to reach it, and fell back. This behavior is an example of control failure when the physical model deviates from the domain randomization range. The thrust did not reach the maximum, nor even the value observed in Model 3. Presumably, the thrust penalty discouraged the controller from exploring control scenarios with higher thrust during training. Even though the thrust was insufficient, the vectoring was controlled to maximize the pitching moment, which was a reasonable action for reaching the target pitch.
These experiments showed that, by including the uncertainties of the model in training, it was possible to generate an NN controller that works in the real environment. Successful control was observed for the model variations that fell within the range assumed in training. These results indicate the possibility of providing a robust control system for pilots in aircraft with a high ratio of fuel to airframe weight, such as small aircraft and flying cars, even if the aircraft's CoG changes with the amount of remaining fuel and alters the control performance. Notably, as is the nature of deep reinforcement learning, the lessons learnt from the results tend to be case-specific, particularly the failure behavior for Model 4.
The results show the possibility of building a controller that can use the control devices (control surfaces and thrusts) up to their limits to bring the current value close to the target value without losing control, even for systems that are difficult to control because of their physical characteristics. In other words, the results suggest that such a controller could, like a human pilot, avoid a crash by using the available control devices in response to airframe damage or controller failure. More detailed evaluations of the relationship between the randomization range, the model variations, and the control performance are needed to gain a generalizable understanding.

Conclusion
In this study, deep reinforcement learning utilizing domain randomization was applied to the pitch control of a UAV to demonstrate a control technique that adapts to changes in the physical model. First, a UAV equipped with EDFs and thrust vectoring was developed, and the theoretical model was constructed. Second, an NN with an LSTM layer was designed and optimized by deep reinforcement learning with domain randomization. Finally, control tests were conducted using the same NN on multiple models with different CoG positions, inertias, and weights. The experimental results showed that control succeeded for model variations within the randomization range used in training and failed for the model whose variation exceeded the randomization range. Through these results, the working range of the controller trained by domain randomization was successfully depicted.
In the future, we aim to expand the control target from the single pitch axis to six degrees of freedom. In such an extended control task, in addition to the variation of mass, other factors such as delay and thrust uncertainties should be investigated to verify the control performance in reality. Furthermore, future research could look into a control technique for UAVs to perform complex tasks by controlling thrust vectoring and multiple control surfaces. External disturbances such as wind could also be considered.

Fig. 6 Comparison of free oscillation between experimental system and theoretical model

Table 1
Range of dynamics variables in domain randomization

Table 2
Model differences