EKF-based self-attitude estimation with DNN learning landscape information

This paper presents EKF-based self-attitude estimation with a DNN (deep neural network) that learns landscape information. The method integrates gyroscopic angular velocity and the DNN inference in an EKF. The DNN predicts a gravity vector in the camera frame; its input is a camera image, and its outputs are a mean vector and a covariance matrix of the gravity. It is trained and validated with a dataset of images and corresponding gravity vectors. The dataset is collected in a flight simulator because various gravity vectors can easily be obtained there, although the method is not limited to UAVs. Using a simulator removes the practical limit on the amount of data that can be collected with ground truth. The validation shows that the network can predict the gravity vector from only a single shot image, and that the covariance matrix expresses the uncertainty of the inference. The covariance matrix is used for integrating the inference in the EKF. Flight data of a drone is also recorded in the simulator, and the EKF-based method is tested with it. The results show that the method suppresses accumulative error by integrating the network outputs.


Introduction
Estimating the attitude of a robot or a UAV is one of the classic problems of mobile robotics. In particular, real-time estimation is required for real-time attitude control. The attitude is generally estimated with inertial sensors such as accelerometers and gyroscopes. However, mobile robots have their own acceleration. Moreover, on-road robots also receive shocks from the ground, and UAVs suffer from vibration of their rotors. These need to be filtered out of the accelerometer signal. On the other hand, integration of the gyroscopic angular rate suffers from drift and bias. These disturbances worsen the accuracy of the estimation. To complement each other, these inertial data are generally fused [1]. Nevertheless, dealing with the disturbances using only inertial sensors is quite difficult.
To reduce the influence of these disturbances, many kinds of SLAM (Simultaneous Localization And Mapping) [2] have been proposed. SLAM with LiDAR registers point clouds by ICP [3], NDT [4], and so on. Many visual SLAM methods with cameras have also been proposed [5,6]. SLAM often contains accumulative error, since relative pose changes with error are summed up. Some methods use prior information such as 3D maps for estimating the self-pose [7]. These methods correct the error by matching the prior information against data from the sensor. However, they work only in environments where maps are available. Moreover, creating a map is time-consuming, and the map also requires updates. Some methods [8,9] estimate attitude under the Manhattan world assumption, i.e. that planes and edges in the environment are orthogonal to each other. This helps achieve drift-free estimation. However, it is difficult for this kind of method to avoid being affected by objects which do not satisfy the assumption.
Deep learning has been used for attitude estimation in recent years. In [10], IMU-based odometry by end-to-end learning was proposed. In [11], a deep neural network identifies the measurement noise characteristics of an IMU. In [12], a neural network estimates angular rates from sequential images. It was trained with synthetic and real images; the large synthetic dataset was collected in AirSim [13], which offers visually realistic graphics. In [14], a gravity vector is directly estimated from a single shot image. This is based on the expectation that the network can learn edge, context, and landscape information; for example, most artificial buildings are built vertically, the sky is seen when the camera points upward, and so on. The method does not depend on a time sequence, since only a single shot image is used to estimate the attitude. This helps suppress drift, noise, and accumulative error. This method is the most similar to our proposed method. However, it has some problems. It cannot express the uncertainty of the prediction; for instance, the network outputs an estimate even when the camera is completely covered by obstacles, when few features are captured, and so on. Such outputs with large error worsen the estimation when they are used in filters such as the Kalman filter [15]. Therefore, they should be detected and rejected before the integration.
To address the issues above, this paper presents self-attitude estimation with a DNN which predicts a gravity vector from a single shot image, where the outputs are a mean and a variance. Only outputs with small variance are absorbed in the EKF (extended Kalman filter) to suppress accumulative error. Moreover, the output covariance matrices are used to adjust the uncertainty in the EKF update. The differences from the related work [14] are noted below:
• A larger dataset is collected in our work by using a simulator.
• Our network is trained with reliable ground truth, while the related work collects its dataset with a hand-carried phone.
• L2 normalization is applied to the output gravity vector, while ReLU is applied in the related work.
• Our network outputs a mean and a covariance matrix, while the related work directly outputs the gravity vector.
• Our method filters out DNN inferences with large variance before the integration in the EKF.
• Our method uses the covariance matrix output from the DNN as the noise covariance in the EKF update process.
The dataset and the source code used in this paper for deep learning, and for the EKF have been released in open repositories.

DNN estimating gravity vector
The proposed method makes the DNN learn landscape information for estimating a gravity vector in the camera frame. The gravity is expressed as a mean and a covariance to account for the uncertainty of the prediction.

Coordinate definition
A camera frame is defined as a right-handed coordinate system which is fixed on the camera pose. It is shown in Fig. 1.

Dataset collection
A dataset is collected in AirSim [13]. AirSim is a simulator for drones, cars and more, built on Unreal Engine, which provides visually realistic graphics. The dataset consists of images and corresponding gravity vectors g in the camera frame.
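The exact collection pipeline is not detailed here; as a minimal sketch under assumed conventions (world frame z-up, quaternion order (w, x, y, z), gravity magnitude 9.8 m/s²; the function names are ours, not AirSim's API), the gravity label for each image can be computed from the camera's ground-truth orientation:

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z),
    mapping camera-frame vectors into the world frame."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gravity_label(q_cam):
    """Gravity direction in the camera frame for a camera whose
    camera-to-world rotation is the quaternion q_cam.
    The label is L2-normalized, matching the paper's Eq. 1."""
    g_world = np.array([0.0, 0.0, -9.8])       # assumed: world z points up
    g_cam = quat_to_rot(q_cam).T @ g_world     # world -> camera frame
    return g_cam / np.linalg.norm(g_cam)
```

With the identity orientation this returns the straight-down unit vector; a 90-degree roll moves the gravity direction onto the camera's y axis, which is the behavior the network is trained to reproduce.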

Data transformation and augmentation
Input data and label data are transformed, and are augmented in each epoch of training.

Image (input)
The image is randomly rotated to augment the roll data. The rotation angle Δφ is limited to [-10 deg, 10 deg]. The image is resized to 224 × 224, and the RGB values are normalized; in this work, the normalization uses mean = (0.5, 0.5, 0.5) and std = (0.5, 0.5, 0.5). Fig. 4 shows an example of the data augmentation.

Gravity vector (label)
The gravity vector in the camera frame is rotated according to Δφ. Since the network does not need to learn the norm of the gravity, L2 normalization is also applied to the vector in order to make training efficient (Eq. 1).
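A minimal numpy sketch of this paired augmentation, assuming the camera's optical axis is z so that roll is a rotation about z (the axis choice and angle sign are assumptions), with the 2D image rotation itself left to an image library:

```python
import numpy as np

def augment_roll(image, g, max_deg=10.0, rng=np.random.default_rng()):
    """Draw a random roll angle, rotate the gravity label by the same
    angle, and L2-normalize the label (Eq. 1). The matching 2D image
    rotation would be done with an image library (e.g. scipy.ndimage
    or torchvision) and is omitted to keep this numpy-only."""
    dphi = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(dphi), np.sin(dphi)
    rot_z = np.array([[c, -s, 0.0],    # roll about the assumed
                      [s,  c, 0.0],    # optical (z) axis
                      [0.0, 0.0, 1.0]])
    g_rot = rot_z @ g
    return image, g_rot / np.linalg.norm(g_rot), dphi

def normalize_rgb(image):
    """Normalize RGB in [0, 1] with mean 0.5 and std 0.5, as in the paper."""
    return (image - 0.5) / 0.5
```

A label aligned with the rotation axis is left unchanged by any roll angle, while off-axis labels rotate with the image, keeping image and label consistent.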

Network
The proposed DNN is shown in Fig. 5. It consists of CNN layers and FC layers. Its input is a resized image, and its outputs are a mean vector of the gravity vector and a covariance matrix. Technically, the output of the FC layers is (µ_gx, µ_gy, µ_gz, L_0, ..., L_5), and the mean vector µ and the covariance matrix Σ are computed as Eqs. 2 and 3, respectively. Since the lower-triangular matrix L is required to have positive diagonal entries, an exponential function is applied to the diagonal elements.
It is expected that the CNN layers learn to extract features such as edges, and the FC layers learn landscape information. The feature module of VGG16 [16] pre-trained on ImageNet [17] is adopted as the CNN layers of the proposed method. All layers, except the final output layer, use the ReLU function [18] as the activation function. All FC layers, except the final output layer, use 10% dropout [19].
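Equations 2 and 3 are not reproduced in this text; a plausible numpy sketch of splitting the nine FC outputs into µ and Σ, assuming the common Cholesky parameterization Σ = L Lᵀ with the exponentiated diagonal described above:

```python
import numpy as np

def build_mean_and_cov(fc_out):
    """Split the 9-dim FC output (mu_x, mu_y, mu_z, L0, ..., L5)
    into the mean vector and the covariance matrix.

    L0..L5 fill the lower triangle row by row (an assumed ordering);
    exp() keeps the diagonal strictly positive, so that
    Sigma = L @ L.T is symmetric positive definite by construction.
    """
    mu = np.asarray(fc_out[:3], dtype=float)
    l0, l1, l2, l3, l4, l5 = fc_out[3:9]
    L = np.array([
        [np.exp(l0), 0.0,        0.0],
        [l1,         np.exp(l2), 0.0],
        [l3,         l4,         np.exp(l5)],
    ])
    return mu, L @ L.T
```

This parameterization lets the network output an unconstrained 6-vector while still guaranteeing a valid covariance, which is what allows Σ to be used directly in the EKF later.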

Loss function
By learning the distribution of the dataset and updating the weights to maximize the probability density of the outputs, the DNN can output a mean and a variance. Assuming that the estimation follows a multivariate normal distribution, the probability density is computed as Eq. 4, where k denotes the dimension of the variables, i.e. k = 3 in the proposed method. To make the network learn the probabilistic model, it is trained to maximize P(x_label | µ, Σ), where x_label (= g_trans) is a label of the dataset. The total probability density P_total of a dataset D_label is computed by multiplication as Eq. 5.
Since the natural logarithm is a monotonically increasing function, maximizing P_total can be simplified by taking its natural logarithm. This avoids the value becoming too small through repeated multiplication, and decreases the computational cost.
Here, P_log_total denotes the total log-probability.
Moreover, maximizing P_log_total is equivalent to minimizing -P_log_total. Finally, the loss function of the proposed method is Eq. 7.
where w denotes parameters of the network. The network minimizes the loss by updating w.
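Eqs. 4-7 amount to the negative log-likelihood of a multivariate Gaussian; a numpy sketch of the per-sample loss (mini-batch averaging and the handling of constant terms are our assumptions, not quoted from the paper):

```python
import numpy as np

def gaussian_nll(x_label, mu, sigma):
    """Negative log-likelihood of x_label under N(mu, sigma), k = 3.

    -log P = 0.5 * (k*log(2*pi) + log det(sigma)
                    + (x - mu)^T sigma^{-1} (x - mu))
    """
    k = x_label.shape[0]
    diff = x_label - mu
    sign, logdet = np.linalg.slogdet(sigma)   # stable log-determinant
    assert sign > 0, "sigma must be positive definite"
    maha = diff @ np.linalg.solve(sigma, diff)  # Mahalanobis term
    return 0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)
```

Minimizing this loss pulls µ toward the label while letting Σ grow on samples the network cannot fit, which is exactly the mechanism that later makes Σ a usable uncertainty signal.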

Optimization
Adaptive Moment Estimation (Adam) [20] is used to optimize the parameters. The learning rates are set as lr_CNN = 0.00001 and lr_FC = 0.0001, where lr_CNN is the value for the CNN layers and lr_FC is the value for the FC layers.

Validation of DNN
The proposed DNN was validated by comparing to other methods. The datasets used in this validation are listed in Table 1.

Compared methods
Definitions of methods which were used in this validation are summarized here.

MLE (ours, all)
"MLE (ours, all)" denotes the proposed method described in the "DNN estimating gravity vector" section. MLE is short for Maximum Likelihood Estimation.

MLE (ours, selected)
"MLE (ours, selected)" denotes the method that uses exactly the same network and parameters as "MLE (ours, all)", but only samples whose outputs have small variance are used for the validation of attitude estimation; samples with large variance are filtered out as outliers. Assuming β in Eq. 8 expresses the uncertainty of the prediction, samples with small variance are selected with a threshold TH_β. In this validation, the threshold is set to TH_β = 0.00008814 m³/s⁶.

Regression w/ L2 normalization
"Regression w/ L2 normalization" denotes the network whose final FC layer differs from "MLE (ours)". It outputs a 3D vector without covariance. The architecture is similar to that of [14], except that L2 normalization is applied to the final layer whereas ReLU is applied in [14]. The mean square error (MSE) between labels and outputs is used as the loss function. Fig. 6 shows the network architecture.

Regression w/o L2 normalization
"Regression w/o L2 normalization" denotes the regression network without L2 normalization. It is required to learn not only the direction but also the norm of the gravity vector, i.e. approximately 9.8 m/s², in order to minimize the loss. This information is not required to estimate the attitude φ, θ.

Training and validation
The network was trained with 10000 samples (#1) with a batch size of 200 for 200 epochs. Another 1000 samples (#2) were used for testing. They were collected in "Neighborhood" of AirSim. The training dataset and the test dataset were not mixed.
Loss values during training are plotted in Fig. 7. The regression models converged much faster than the MLE model did, and the regression model with L2 normalization converged slightly faster than the one without. Tables 2, 3 and 4 respectively show the loss values after 200 epochs of training. Note that gradients were not computed on the test datasets, and that the loss function of the MLE model differs from that of the regression models.
Another 1000 samples (#3) were also collected in "SoccerField" of AirSim as data from an unknown environment, and 1108 real data samples (#4) were collected with the sensors in Fig. 3. Note that these samples are only for testing; the DNN was not trained with them. The loss values on these samples are larger than those in the known environment where the network was trained.

Attitude estimation
The roll φ and pitch θ of the camera pose in the gravitational coordinate system are estimated using µ.
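The conversion formulas are not quoted here; under a common convention where the gravity direction is expressed in a frame with x forward, y right, z down (an assumption, not necessarily the paper's camera frame), roll and pitch can be recovered as:

```python
import math

def roll_pitch_from_gravity(gx, gy, gz):
    """Recover roll and pitch (rad) from a gravity direction
    (gx, gy, gz) in an assumed x-forward, y-right, z-down frame."""
    roll = math.atan2(gy, gz)
    pitch = math.atan2(-gx, math.hypot(gy, gz))
    return roll, pitch
```

Note that yaw is unobservable from gravity alone, which is why the EKF state in the following section contains only roll and pitch.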
The mean absolute error (MAE) of the estimation on the test dataset is shown in Table 5. The variance of the estimated attitude error is shown in Table 6. Both the error and the variance of "MLE (ours, selected)" are smaller than those of the others. 715 samples with β < TH_β = 0.00008814 m³/s⁶ were selected from the 1000 test samples (#2) for "MLE (ours, selected)".
Comparing "MLE (ours, all)" and "MLE (ours, selected)", filtering by TH_β is found to be valid, which means the network expresses the uncertainty by outputting the covariance matrix. To see this, the samples are sorted in Fig. 8. In Fig. 8a, the top 50 samples with the largest estimation error of "MLE (ours)" are shown in order; in Fig. 8b, the top 50 samples with the largest β are shown in order. 21 samples are common to both groups. In Fig. 8a, most of the sample images with large error are covered by obstacles, and they are dark with much less landscape information. The regression model has no way to detect them. The correlation between error and β is not perfect, but many samples with large error were detected by sorting the samples by β. An example with large β and one with small β are shown in Fig. 9, respectively. Obviously, the sample in Fig. 9a does not have enough landscape information to estimate the gravity direction, and the proposed network expresses this uncertainty with a large β. Comparing "Regression w/ L2 normalization" and "Regression w/o L2 normalization", L2 normalization does not contribute to the accuracy, although the model with L2 normalization converged slightly faster than the one without.
The dataset of the unknown environment "SoccerField" (#3) and the real-world dataset (#4) were also used for validation. Note that the neural networks were not trained in these environments. The MAE and the variance of the error are shown in Tables 7, 8, 9 and 10, respectively. Filtering by the threshold TH_β made the error smaller even in the unknown environments. However, the errors and their variances are larger than in the known environment (Tables 5, 6). To reduce the difference between the results in known and unknown environments, a wider variety of datasets is needed for training.

EKF-based self-attitude estimation with DNN
The proposed method integrates angular velocities from a gyroscope and the DNN outputs in the EKF. The proposed EKF architecture is shown in Fig. 10. The state vector x of the proposed Kalman filter consists of the roll and pitch of the robot. Both the vector x and the covariance matrix P are computed in a prediction process and an update process. The prediction process integrates the angular velocity from a gyroscope. The update process observes the output of the DNN.
Here, k denotes the time step, and S_φ, C_φ, T_φ are short for sin φ, cos φ, tan φ, respectively, in the following sections.

Prediction process
The state vector x and the covariance matrix P are respectively computed as follows.
where f is the state transition model, u denotes the control vector, and Rot_rpy denotes the rotation matrix for angular velocities.
where J_f denotes the Jacobian of f, and Q denotes the covariance matrix of the process noise.
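A numpy sketch of this prediction step, with x = (φ, θ) and the standard Euler-angle kinematics for body rates (p, q, r); the forms of f and J_f below are reconstructed from textbook kinematics matching the S_φ, C_φ, T_φ shorthand, not copied from the paper's equations:

```python
import numpy as np

def ekf_predict(x, P, gyro, dt, Q):
    """EKF prediction with gyro rates gyro = (p, q, r).

    x = (roll, pitch); assumed Euler-angle kinematics:
      roll'  = p + sin(roll)tan(pitch) q + cos(roll)tan(pitch) r
      pitch' =     cos(roll) q          - sin(roll) r
    """
    p, q, r = gyro
    phi, th = x
    s, c, t = np.sin(phi), np.cos(phi), np.tan(th)
    x_new = x + dt * np.array([p + s*t*q + c*t*r,
                               c*q - s*r])
    # Jacobian of the discrete transition: J_f = I + dt * d(xdot)/dx
    J = np.eye(2) + dt * np.array([
        [c*t*q - s*t*r, (s*q + c*r) / np.cos(th)**2],
        [-s*q - c*r,    0.0],
    ])
    P_new = J @ P @ J.T + Q
    return x_new, P_new
```

With zero rates the state is carried over unchanged while P grows by Q, which is the drift accumulation the DNN observation is meant to correct.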

Update process
Outputs of the DNN with a large variance value β are rejected. A threshold TH_β is set for judging β, and only outputs with β < TH_β are observed in the EKF. The observation vector is z.
where µ denotes the mean gravity vector output from the DNN. The observation model is h, where g_world denotes the gravity vector in the world frame, i.e. ‖g_world‖ = 9.8 m/s², and Rot_xyz denotes the rotation matrix applied to a vector. The observation noise covariance matrix is R.
where γ denotes a hyperparameter for adjusting the variance. The state vector x and the covariance matrix P are respectively updated as follows.
where J_h denotes the Jacobian of h, K denotes the Kalman gain matrix, and I denotes the identity matrix.
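A numpy sketch of the update step under the same assumed x-forward, y-right, z-down convention, with h(x) the predicted unit gravity, R = γΣ, and β taken as √det Σ (an assumption consistent with the m³/s⁶ units of TH_β; Eq. 8 is not reproduced here):

```python
import numpy as np

def ekf_update(x, P, mu, sigma, gamma=1.0, th_beta=8.814e-5):
    """EKF update with the DNN gravity mean mu and covariance sigma.

    Rejects the observation when beta = sqrt(det(sigma)) exceeds
    th_beta (the definition of beta is an assumption). R = gamma*sigma.
    """
    beta = np.sqrt(np.linalg.det(sigma))
    if beta >= th_beta:
        return x, P                      # outlier: keep the prediction

    phi, th = x
    # predicted unit-gravity observation h(x) and its Jacobian J_h
    h = np.array([-np.sin(th),
                  np.sin(phi) * np.cos(th),
                  np.cos(phi) * np.cos(th)])
    J = np.array([
        [0.0,                       -np.cos(th)],
        [np.cos(phi) * np.cos(th),  -np.sin(phi) * np.sin(th)],
        [-np.sin(phi) * np.cos(th), -np.cos(phi) * np.sin(th)],
    ])
    R = gamma * sigma
    S = J @ P @ J.T + R                  # innovation covariance
    K = P @ J.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (mu - h)
    P_new = (np.eye(2) - K @ J) @ P
    return x_new, P_new
```

A confident, consistent observation leaves the state unchanged but sharply shrinks P, while a high-β observation is skipped entirely, matching the rejection behavior described above.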

Validation of EKF
The proposed system was validated in real-time. Since ground truth is available in the simulator, synthetic flight data of a drone was used.

Compared methods
Definitions of methods which were used in this validation are summarized here.

Gyro
"Gyro" denotes an estimation method integrating angular velocity from a gyroscope.

Gyro+Acc
"Gyro+Acc" denotes an EKF-based estimation method integrating angular velocity and linear acceleration from IMU.
Gyro+NDT
"Gyro+NDT" denotes NDT SLAM [4] using a 32-layer LiDAR. Angular velocity from a gyroscope, linear velocity from ground truth, and the NDT output are integrated in the EKF. Note that the ground-truth linear velocity is available because the environment is a simulator.

Gyro+Regression
"Gyro+Regression" denotes an EKF-based estimation method integrating angular velocity from a gyroscope and gravity vectors from the regression network (Fig. 6). All outputs from the network are integrated in the EKF.

Experimental conditions
Flight data of a drone were recorded in "Neighborhood" and "SoccerField" of AirSim. "Neighborhood" is the environment where the DNN was trained. An IMU and a camera are installed on a drone in the simulator. A 32-layer LiDAR is also used for "Gyro+NDT". The sampling frequencies of the IMU, the camera and the LiDAR are approximately 100 Hz, 12 Hz and 20 Hz, respectively. The camera's horizontal FOV is 70 deg. Virtual noise was added to the IMU's 6-axis data; it was drawn from normal distributions with means of 0 rad/s and 0 m/s² and standard deviations of 0.1 rad/s and 0.1 m/s², respectively. The flight courses of "Neighborhood" and "SoccerField" are shown in Fig. 11. A computer with an i7-6700 CPU and a GTX1080 GPU with 16 GB of memory was used for the estimation. The DNN inference takes around 0.01-0.02 seconds on this computer. The proposed method is not only for UAVs, but it should be noted that this computer may be more powerful than those mounted on real UAVs. Therefore, testing with a typical UAV computer would be necessary when the method is used on real UAVs; this is not the focus of this study.

Experimental results
The estimated attitudes are plotted in Fig. 12a. Table 11 shows the MAE of the estimated attitudes in "Neighborhood". The MAE of "Gyro+MLE (ours)" is smaller than those of the other methods. "Gyro" had large accumulative error. That is natural, because noise was added and the method has no other observation. "Gyro+Acc" does not have accumulative error; however, it has constant error, since the acceleration values of the sensor contain the robot's own acceleration. "Gyro+NDT" accumulated error more slowly than "Gyro" by using LiDAR, but it could not remove the accumulation. "Gyro+Regression" and "Gyro+MLE (ours)" corrected the accumulative error by observing the estimated gravity. Comparing "Gyro+Regression" and "Gyro+MLE (ours)", filtering out the DNN outputs with large β is found to be valid. With "Gyro+MLE (ours)", 31% of the outputs were rejected by the threshold during the flight. The MAE of the estimation in "SoccerField" is shown in Table 12. "Gyro+Regression" and "Gyro+MLE (ours)" suppressed accumulative error even in the environment where the DNN was not trained. The MAE of "Gyro+MLE (ours)" is smaller than those of the other methods. With "Gyro+MLE (ours)", more outputs (58%) from the DNN were filtered out by the threshold TH_β in the unknown environment ("SoccerField") than in the known environment ("Neighborhood"). This avoids observing outputs with high uncertainty; however, it also reduces the chances of correcting the error. In other words, the proposed method cannot correct the error while the variance remains large. To compensate for this, the proposed method can integrate other observations, such as the IMU's acceleration or SLAM, in future work. In this experiment, however, only the angular rate and the DNN outputs were integrated in order to keep the validation simple.

Conclusions and future work
EKF-based self-attitude estimation with a DNN learning landscape information was proposed. The DNN estimates the direction of gravity from a single shot image. The network outputs not only the gravity vector but also a covariance matrix. Training was done with synthetic data from AirSim, and validation was done with both synthetic data and real sensor data. In the validation, many samples with large error were filtered out by judging the variance values, which means the proposed network expresses the uncertainty of the prediction by outputting covariance matrices. The outputs from the DNN and the angular velocity from a gyroscope are integrated in the EKF. The covariance matrix is used for adjusting the noise in the EKF update, and outputs with too large a variance are filtered out by a threshold. The EKF-based proposed method was validated with flight data of a drone in AirSim environments. In the experiments, the proposed method suppressed accumulative error by using the DNN.
In future work, a wider variety of datasets, including real data, is needed for the DNN to close the gap between the results in known and unknown environments. Simulator data can be used for pre-training the DNN before fine-tuning with real data. Only roll augmentation is applied in this paper because it is easy and simple, but pitch augmentation by homography transformation should also be valid for fine-tuning with fewer real samples. Using other sensors or combining other methods is another direction for future work. The experiments in this paper were done only in daytime conditions, but real robots should work day and night. Therefore, future work needs to compensate for the camera's weakness in night operations.