Skip to main content

EKF-based self-attitude estimation with DNN learning landscape information

Abstract

This paper presents an EKF-based self-attitude estimation with a DNN (deep neural network) learning landscape information. The method integrates gyroscopic angular velocity and DNN inference in the EKF. The DNN predicts a gravity vector in a camera frame. The input of the network is a camera image, the outputs are a mean vector and a covariance matrix of the gravity. It is trained and validated with a dataset of images and corresponded gravity vectors. The dataset is collected in a flight simulator because we can easily obtain various gravity vectors, although the method is not only for UAVs. Using a simulator breaks the limitation of amount of collecting data with ground truth. The validation shows the network can predict the gravity vector from only a single shot image. It also shows that the covariance matrix expresses the uncertainty of the inference. The covariance matrix is used for integrating the inference in the EKF. Flight data of a drone is also recorded in the simulator, and the EKF-based method is tested with it. It shows the method suppresses accumulative error by integrating the network outputs.

Introduction

Estimating attitude of a robot or a UAV is one of the classic problems of mobile robotics. Especially, real-time estimation is required for real-time attitude control. The attitude is generally estimated with inertial sensors such as accelerometers and gyroscopes. However, mobile robots have their own acceleration. Moreover, on-road robots also receive pulses from the ground, and UAVs suffer from vibration of their rotors. These need to be filtered out from the accelerometer. On the other hand, integration of gyroscopic angular rate has problems of drift and bias. These disturbances worsen the accuracy of the estimation. To complement each other, these inertial data are fused, generally [1]. Nevertheless, dealing the disturbances with only inertial sensors is quite difficult.

To reduce the influence of these disturbances, many kinds of SLAM (Simultaneous Localization And Mapping) [2] have been proposed. SLAM with LiDAR registers point clouds by ICP [3], NDT [4], and so on. Many visual SLAM with cameras have also been proposed [5, 6]. SLAM often contains accumulative error since relative pose changes with error are summed up. Some methods use prior information such as 3D maps for estimating self-pose [7]. These methods correct the error by matching the prior information against data from the sensor. However, they work only in environments where maps are available. Moreover, creating a map is time-consuming and also requires update. Some methods [8, 9] estimate attitude under Manhattan world assumption. They assume that planes and edges in the environment are orthogonal to each other. It helps achieving drift-free estimation. However, it is difficult for this kind of methods to avoid being affected by objects which do not satisfy the assumption.

Deep learning has been used for attitude estimation in recent years. In [10], IMU-based odometry by end-to-end learning has been proposed. In [11], a deep neural network identifies the measurement noise characteristics of IMU. In [12], a neural network estimates angular rates from sequential images. It was trained with synthetic and real images. The large synthetic dataset was collected in AirSim [13] which offers visually realistic graphics. In [14], a gravity vector is directly estimated from a single shot image. This is based on expectation that the network can learn edge, context, and landscape information; for example, most of all artificial buildings should be built vertically, the sky should be seen when the camera orients upper, and so on. The method does not depend on time sequence since only a single shot image is used to estimate the attitude. It helps suppressing drift, noise, and accumulative error. This method is the most similar to our proposed method. However, this method contains some problems. It cannot express uncertainty of the prediction; for instance, the network outputs estimation even when the camera is all covered by obstacles, when less features are captured, and so on. These outputs with large error worsen estimation when it is used in filter functions such as Kalman filter [15]. Therefore, they should be detected and be rejected before the integration.

To address these issues above, this paper presents self-attitude estimation with DNN which predicts a gravity vector from a single shot image, where the outputs are mean and variance. Only outputs with small variance are absorbed in the EKF (extended Kalman filter) to suppress accumulative error. Moreover, the output covariance matrices are used to adjust the process uncertainty in the EKF. The differences from the related work [14] are noted below:

  • A larger dataset is collected in our work by using a simulator.

  • Our network is trained with reliable ground truth while the related work collects a dataset with a hand carried phone.

  • L2 normalization is applied to the output gravity vector while ReLU is applied in the related work.

  • Our network outputs mean and a covariance matrix while the related work directly outputs the gravity vector.

  • Our method filters out DNN inferences with large variance before processing the integration in the EKF.

  • Our method uses the covariance matrix output from the DNN as update process noise in the EKF.

The dataset and the source code used in this paper for deep learning, and for the EKF have been released in open repositories.

DNN estimating gravity vector

The proposed method makes the DNN learn landscape information for estimating a gravity vector in a camera frame. The gravity is expressed as mean and covariance to consider uncertainty of the prediction.

Coordinate definition

A camera frame is defined as a right-handed coordinate system which is fixed on the camera pose. It is shown in Fig. 1.

Fig. 1
figure1

Screenshot of AirSim with coordinate description. An IMU and a camera are equipped to the drone in the simulator. The purpose of the proposed neural network is estimating a gravity vector in the camera frame from a front camera image

Dataset collection

A dataset is collected in AirSim [13]. AirSim is a simulator for drones, cars and more, built on Unreal Engine, which provides visually realistic graphics. The dataset consists of images and corresponded gravity vectors \({\varvec{g}}\) in the camera frame. The camera pose and weather parameters are randomized, and a image and a gravity vector are recorded at each pose. The range of random \(Z\) is limited as [2 m, 3 m] in this work. The ranges of random roll \(\phi\) and pitch \(\theta\) are limited as [-30 deg, 30 deg], respectively. Fig. 2 shows examples of the dataset.

Real data is also collected with the sensors in Fig. 3 for test. The stick is hand-carried, and images and linear acceleration vectors measured with the IMU (Xsens MTi-30) are recorded. They are saved only when the stick is shaking less than 0.001 m, 0.1 deg, and when it is at least 5 deg away from the last saved pose. The IMU is regarded as ground truth because it has enough accuracy (within 0.2 deg) in static according to the specification.

Fig. 2
figure2

Examples of datasets. The dataset consists of images and correspond gravity vectors \(\varvec{g}\;[{\text{m/s}}^2]\) in the camera frame. These examples were collected in “Neighborhood” of AirSim. The camera pose and weather parameters are randomized for creating a dataset. An example of the weather parameter is that the road in the lower right image is covered in snow

Fig. 3
figure3

Sensor stick. Images and acceleration are recorded with this stick when it is still. The judge whether it is still is processed by programing. Note that depth information is not used in this study although RealSense D435 is a RGB-D camera

Data transformation and augmentation

Input data and label data are transformed, and are augmented in each epoch of training.

Image (input)

The image is randomly rotated for augmenting the roll data. The rotation angle \(\Delta \phi\) is limited as [– 10 deg, 10 deg]. The image is resized to \(224 \times 224\). RGB values are normalized. In this work, this normalization is done following \({\varvec{mean}} = (0.5, 0.5, 0.5)\) and \({\varvec{std}} = (0.5, 0.5, 0.5)\). Fig. 4 shows an example of the data augmentation.

Fig. 4
figure4

Example of a transformed image. An image is randomly rotated according to \(\Delta \phi\). This example shows an image when \(\Delta \phi = 10{\mkern 1mu}\; {\text{deg}}\). It is also resized, and is normalized

Gravity vector (label)

A gravity vector in the camera frame is rotated according to \(\Delta \phi\). Since the network does not need to learn norm of the gravity, L2 normalization is also applied to the vector in order to make training efficient.

$$\begin{gathered} \varvec{g}_{{{\mathbf{trans}}}} \; = \varvec{Rot}_{{(\Delta \phi )}}^{{{\text{xyz}}}} \frac{\varvec{g}}{{|\varvec{g}|}} \hfill \\ \varvec{Rot}_{{(\Delta \phi )}}^{{{\text{xyz}}}} \; = \left( {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & {\cos ( - \Delta \phi )} & { - \sin ( - \Delta \phi )} \\ 0 & {\sin ( - \Delta \phi )} & {\cos ( - \Delta \phi )} \\ \end{array} } \right) \hfill \\ \end{gathered}$$
(1)

Network

The proposed DNN is shown in Fig. 5. It consists of CNN layers and FC layers. Its input is a resized image, and its outputs are a mean vector of a gravity vector and a covariance matrix. Technically, the output of the FC layers is \((\mu _{g_x}, \mu _{g_y}, \mu _{g_z}, L_0, \cdots , L_5)\), and the mean vector \({\varvec{\mu }}\) and the covariance matrix \({\varvec{\Sigma }}\) are computed as Eqs. 2, 3, respectively. Since the lower-triangular matrix \({\varvec{L}}\) is required to have positive-valued diagonal entries, an exponential function is applied to the diagonal elements.

$$\varvec{\mu} =\,\frac{{(\mu _{{g_{x} }} ,\mu _{{g_{y} }} ,\mu _{{g_{z} }} )^{{\text{T}}} }}{{|(\mu _{{g_{x} }} ,\mu _{{g_{y} }} ,\mu _{{g_{z} }} )^{{\text{T}}} |}}$$
(2)
$$\varvec{\Sigma} = \varvec{LL}^{{\text{T}}} ,\quad \varvec{L} = \left( {\begin{array}{*{20}c} {\exp (L_{0} )} & 0 & 0 \\ {L_{1} } & {\exp (L_{2} )} & 0 \\ {L_{3} } & {L_{4} } & {\exp (L_{5} )} \\ \end{array} } \right)$$
(3)

It is expected that the CNN layers lean extracting features such as edges, and the FC layers learn landscape information. Feature module of VGG16 [16] pre-trained on ImageNet [17] is adopted as the CNN layers of the proposed method. All layers, except the final output layer, use the ReLU function [18] as activation function. All FC layers, except the final output layer, use the 10% Dropout [19].

Fig. 5
figure5

Proposed network architecture. It consists of CNN layers and FC layers. The input data is a resized image, and the output data are a mean vector and a covariance matrix. They are computed with an output from the final layer as Eqs. 2, 3, respectively. Log-probability of multivariate normal distribution is used as a loss function of this model

Loss function

By learning the distribution of the dataset and updating the weights to maximize the probability density for the outputs, the DNN can output mean and variance. Assuming that the estimation follows a multivariate normal distribution, the probability density is computed as Eq. 4.

$$P(\varvec{x}|\varvec{\mu} ,\varvec{\Sigma} ) = \frac{{\exp ( - \frac{1}{2}(\varvec{x} - \varvec{\mu} )^{{\text{T}}} \varvec{\Sigma} ^{{ - 1}} (\varvec{x} - \varvec{\mu} ))}}{{\sqrt {(2\pi )^{k} |\varvec{\Sigma} |} }},\quad k = {\text{rank}}(\varvec{\Sigma} )$$
(4)

where \(k\) denotes the dimensions of the variables, i.e. \(k = 3\) in the proposed method. To make the network learn the probabilistic model, it is trained maximizing \(P({\varvec{x}}_\mathbf {label} | {\varvec{\mu }}, {\varvec{\Sigma }})\), where \({\varvec{x}}_\mathbf {label}\) (= \({\varvec{g}}_\mathbf {trans}\)) is a label of the dataset. The total probability density \(P_{{{\text{total}}}}\) of a dataset \(D_{{{\text{label}}}}\) is computed by multiplying as Eq. 5.

$$\begin{gathered} P_{{{\text{total}}}} = \prod\limits_{{i = 0}}^{n} P (\varvec{x}_{{{\mathbf{label}}_{i} }} |\varvec{\mu} _{i} ,\varvec{\Sigma} _{i} ) \hfill \\ D_{{{\text{label}}}} = [\varvec{x}_{{{\mathbf{label}}_{0} }} , \cdots ,\varvec{x}_{{{\mathbf{label}}_{i} }} , \cdots ,\varvec{x}_{{{\mathbf{label}}_{n} }} ] \hfill \\ \end{gathered}$$
(5)

Since the natural logarithm is a monotonically increasing function, maximizing \(P_{{{\text{total}}}}\) can be simplified by taking the natural logarithm. This avoids the value becoming too small by multiplying, and decreases computational cost. Here, \(P_{{{\text{log}}_{{{\text{total}}}} }}\) denotes a total value of the log-probability.

$$P_{{{\text{log}}_{{{\text{total}}}} }} = \sum\limits_{{i = 0}}^{n} {\ln } P(\varvec{x}_{{{\mathbf{label}}_{i} }} |\varvec{\mu} _{i} ,\varvec{\Sigma} _{i} ){\text{ }}$$
(6)

Moreover, maximizing \(P_{{{\text{log}}_{{{\text{total}}}} }}\) is equivalent to minimizing \(P_{{{\text{log}}_{{{\text{total}}}} }}\). Finally, the loss function of the proposed method is Eq. 7.

$$f_{{(\varvec{w})}} = - P_{{{\text{log}}_{{{\text{total}}}} }}$$
(7)

where \({\varvec{w}}\) denotes parameters of the network. The network minimizes the loss by updating \({\varvec{w}}\).

Optimization

Adaptative Moment Estimation (Adam) [20] is used to optimize the parameters. The learning rates are set as \(lr_{{{\text{CNN}}}} = 0.00001\), \(lr_{{{\text{FC}}}} = 0.0001\), where \(lr_\text {CNN}\) is a value for the CNN layers, \(lr_\text {FC}\) is a value for the FC layers.

Validation of DNN

The proposed DNN was validated by comparing to other methods. The datasets used in this validation are listed in Table 1.

Table 1 Dataset list

Compared methods

Definitions of methods which were used in this validation are summarized here.

MLE (ours, all)

“MLE (ours, all)” denotes the proposed method described in "DNN estimating gravity vector" section . MLE is short for Maximum Likelihood Estimation.

MLE (ours, selected)

“MLE (ours, selected)” denotes the method uses the exactly same network and the same parameters as “MLE (ours, all)” does, but only samples which output small variance are used for validation of attitude estimation. It means samples with large variance are filtered out as outliers. Assuming \(\beta\) in Eq. 8 expresses uncertainty of the prediction, samples with small variance are selected with a threshold \({\text{TH}}_{\beta }\). In this validation, the threshold is set as \({\text{TH}}_{\beta } = \frac{1}{n}\sum\limits_{{i = 0}}^{n} {\beta _{i} }\), where \(n\) is the number of samples in the testing dataset.

$$\begin{aligned} \beta = \sqrt{\Sigma _{0, 0}} \times \sqrt{\Sigma _{1, 1}} \times \sqrt{\Sigma _{2, 2}} ,\quad {\varvec{\Sigma }} = \begin{pmatrix} \Sigma _{0, 0} &{} \Sigma _{0, 1} &{} \Sigma _{0, 2} \\ \Sigma _{1, 0} &{} \Sigma _{1, 1} &{} \Sigma _{1, 2} \\ \Sigma _{2, 0} &{} \Sigma _{2, 1} &{} \Sigma _{2, 2} \end{pmatrix} \end{aligned}$$
(8)

Regression w/ L2 normalization

“Regression w/ L2 normalization” denotes the network which the final FC layer is difference from “MLE (ours)”. It outputs a 3d vector without covariance. It is a similar architecture as [14]. L2 normalization is applied to the final layer while ReLU is applied in [14]. Mean square error (MSE) between labels and outputs is used as a loss function. Fig. 6 shows the network architecture.

Fig. 6
figure6

Comparative network architecture. This network is almost same as [14]. Whereas the proposed method outputs the mean and variance, this one outputs only the mean

Regression w/o L2 normalization

“Regression w/o L2 normalization” denotes the regression network without L2 normalization. It is require to learn not only the direction, but also norm of the gravity vector, i.e. approximately 9.8 \({\text{m/s}}^{2}\), in order to minimize the loss. This information is not required to estimate attitude \(\phi\), \(\theta\).

Training and validation

The network was trained with 10000 samples (#1) with a batch size of 200 samples for 200 epochs. Another 1000 samples (#2) were used for test. They were collected in “Neighborhood” of AirSim. The training dataset and the test dataset were not mixed.

Loss values during training are plotted in Fig. 7. The regression models converged much faster than the MLE model did. The regression model with L2 normalization converged a bit faster than the regression model without L2 normalization did. Tables 2, 3 and 4 respectively show the loss values after 200 epoch training. It is noted that gradient was not computed with the test datasets. It is also noted a loss function of the MLE model and one of the regression models are difference.

Another 1000 samples (#3) were also collected in “SoccerField” of AirSim as data in an unknown environment. 1108 real data samples (#4) were also collected with the sensors in Fig. 3. Note that these samples are just for test, thus the DNN was not trained with them. The loss values of these samples are larger than ones of the known environment in which the network was trained.

Fig. 7
figure7

Loss plotting. It is noted a loss functions of the MLE model and one of the regression models are difference. Therefore, their values can not be simply compared

Table 2 Loss of MLE (ours) after 200 epochs
Table 3 Loss of Regression w/ L2 normalization after 200 epochs
Table 4 Loss of Regression w/o L2 normalization after 200 epochs

Attitude estimation

Roll \(\phi\) and pitch \(\theta\) of the camera pose in the gravitational coordinate are estimated by using \({\varvec{\mu }}\).

$$\begin{aligned} \phi = \tan ^{-1} \frac{ \mu _{g_y} }{ \mu _{g_z} } ,\quad \theta = \tan ^{-1} \frac{ -\mu _{g_x} }{ \sqrt{\mu _{g_y}^2 + \mu _{g_z}^2} } \end{aligned}$$
(9)

Mean absolute error (MAE) of the estimation of the test dataset is shown in Table 5. Variance of the estimated attitude error is shown in Table 6. Both of the error and the variance of “MLE (ours, selected)” are smaller than the others. 715 samples which has \(\beta < {\text{ TH}}_{\beta } = 0.00008814{\mkern 1mu} \; {\text{m}}^3 {\text{/s}}^6\) were selected from 1000 test samples (#2) with “MLE (ours, selected)”.

Comparing “MLE (ours, all)” and “MLE (ours, selected)”, filtering by \({\text{TH}}_{\beta }\) is found valid, which means the network expresses the uncertainty by outputting the covariance matrix. In order to see this, the samples are sorted in Fig. 8. In Fig. 8a, top 50 samples with largest estimation error with “MLE (ours)” are shown in order. In Fig. 8b, top 50 samples with the largest \(\beta\) with “MLE (ours)” are shown in order. 21 samples of the top 50 samples are mutual of both groups. In Fig. 8a, most of the sample images with large error are covered by obstacles, and they are dark with much less landscape information. There is no way to detect them by the regression model. Correlation between error and \(\beta\) are not complete, but many samples with large error were detected by sorting samples with \(\beta\). A good example with large \(\beta\) and one with small \(\beta\) are shown in Fig. 9, respectively. Obviously, the sample in Fig. 9a does not have enough landscape information to estimate the gravity direction, and the proposed network expresses the uncertainty with large \(\beta\).

Comparing “Regression w/ L2 normalization” and “Regression w/o L2 normalization”, L2 normalization does not contribute to the accuracy, although the one with L2 normalization converged a bit faster than the one without L2 normalization did.

A dataset of the unknown environment “SoccerField” (#3) and one of real world (#4) were also used for validation. Note that the neural networks were not trained in these environments. MAE and variance of the error are shown in Tables 7, 8, 9 and 10, respectively. Filtering by the threshold \({\text{TH}}_{\beta }\) made error smaller even in the unknown environments. However, the errors and the variances of errors are larger than the known environment (Tables 5, 6). To reduce difference of results between in known environments and in unknown ones, wider variety of datasets are needed for training.

Table 5 MAE of estimated attitude in known environment (#2)
Table 6 Variance of estimated attitude error in known environment (#2)
Fig. 8
figure8

Sorted samples. A number above each image is a index of a sample. In (a), top 50 samples are sorted in descending order of the error in “MLE (ours)”. Error of sample#608 in the regression model is \({ \phi _{{\rm error}}} = - 35.34\;{{\rm deg}}, \; {\theta _{{\rm error}}} = - 25.74\;{{\rm deg}}\). Error of sample#352 is \(\phi _{{{\text{error}}}} \; = \;2.80\;{\text{deg}}, \; \theta _{{{\text{error}}}} \; = \; - \;12.93\;{\text{deg}}\). In (b), top 50 samples are sorted in descending order of \(\beta\) in “MLE (ours)”. Mutual samples of the both groups are marked with red rectangles

Fig. 9
figure9

Examples with large \(\beta\) and small \(\beta\). Obviously, it is hard to estimate the gravity direction from the sample (a). The proposed network expresses uncertainty of the prediction of the sample (a) by outputting large covariance

Table 7 MAE of estimated attitude in unknown environment (#3)
Table 8 Variance of estimated attitude error in unknown environment (#3)
Table 9 MAE of estimated attitude in real world (#4)
Table 10 Variance of estimated attitude error in real world (#4)

EKF-based self-attitude estimstion with DNN

The proposed method integrates angular velocities from a gyroscope and the DNN outputs in the EKF. The proposed EKF architecture is shown in Fig. 10. The state vector \({\varvec{x}}\) of the proposed Kalman filter consists of roll and pitch of the robot. Both of the vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are computed in a prediction process and in an update process. The prediction process is computed by integrating angular velocity from a gyroscope. The update process is computed by observing the output of the DNN.

$$\varvec{x} = \left( {\begin{array}{*{20}c} \phi & \theta \\ \end{array} } \right)^{{\text{T}}}$$
(10)

Here, \(k\) denotes time step, \(S_{\phi }\), \(C_{\phi }\), \(T_{\phi }\) are short for \(\sin {\phi }\), \(\cos {\phi }\), \(\tan {\phi }\), respectively in the following sections.

Fig. 10
figure10

Proposed EKF architecture. Gyroscopic angular rates are integrated in the prediction process in the EKF. The DNN outputs are integrated in the update process in the EKF

Prediction process

The state vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are respectively computed as following.

$$\begin{gathered} \overline{{\varvec{x}_{k} }} = f_{{(\varvec{x}_{{k - 1}} ,\varvec{u}_{{k - 1}} )}} = \varvec{x}_{{k - 1}} + \varvec{Rot}_{{(\varvec{x}_{{k - 1}} )}}^{{{\text{rpy}}}} \varvec{u}_{{k - 1}} \hfill \\ \varvec{u}_{{k - 1}} = \varvec{\omega} _{{k - 1}} \Delta t = \left( {\begin{array}{*{20}c} {\omega _{{{\text{x}}_{{k - 1}} }} \Delta t} \\ {\omega _{{{\text{y}}_{{k - 1}} }} \Delta t} \\ {\omega _{{{\text{z}}_{{k - 1}} }} \Delta t} \\ \end{array} } \right) \hfill \\ \varvec{Rot}_{{(\varvec{x}_{{k - 1}} )}}^{{{\text{rpy}}}} \; = \;\left( {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} \begin{array}{*{20}c} {S_{{\phi _{{k - 1}} }} T_{{\theta _{{k - 1}} }} } \\ {C_{{\phi _{{k - 1}} }} } \\ \end{array} \begin{array}{*{20}c} {C_{{\phi _{{k - 1}} }} T_{{\theta _{{k - 1}} }} } \\ { - S_{{\phi _{{k - 1}} }} } \\ \end{array} } \right) \hfill \\ \end{gathered}$$
(11)

where \(f\) is a state transition model, \({\varvec{u}}\) denotes a control vector, and \(\varvec{Rot}^{{{\text{rpy}}}}\) denotes a rotation matrix for angular velocities.

$$\overline{{\varvec{P}_{k} }} \; = \;\varvec{J_f}_{{k - 1}} \varvec{P}_{{k - 1}} \varvec{J_f}^{T} _{{k - 1}} \; + \;\varvec{Q}_{{k - 1}} ,\varvec{J_f}_{{k - 1}} \;\left. {\frac{{\partial f}}{{\partial \varvec{x}}}} \right|_{{\varvec{x}_{{k - 1}} ,\varvec{u}_{{k - 1}} }}$$
(12)

where \({\varvec{J_f}}\) denotes \(f\) Jacobean, and \({\varvec{Q}}\) denotes a covariance matrix of the process noise.

Update process

Outputs of the DNN with a large variance value \(\beta\) are rejected. A threshold \({\text{TH}}_{\beta }\) is set for judging \(\beta\), and only outputs with \(\beta \;{\text{ < }}\;{\text{TH}}_{\beta }\) are observed in the EKF. The observation vector is \({\varvec{z}}\).

$$\begin{aligned} {\varvec{z}} = {\varvec{\mu }} \end{aligned}$$
(13)

where \({\varvec{\mu }}\) denotes a mean vector of gravity which is output from the DNN. The observation model is \(h\).

$$h_{{(\varvec{x}_{k} )}} = \varvec{Rot}_{{(\varvec{x}_{k} )}}^{{{\text{xyz}}}} \varvec{g}_{{{\mathbf{world}}}} ,\quad \varvec{g}_{{{\mathbf{world}}}} = \left( {\begin{array}{*{20}c} 0 \\ 0 \\ {g_{{{\text{world}}}} } \\ \end{array} } \right){\text{ }}$$
(14)
$$\varvec{Rot}_{{(\varvec{x}_{k} )}}^{{{\text{xyz}}}} \; = \;\left( {\begin{array}{*{20}c} {C_{{\theta _{k} }} C_{{\psi _{k} }} C_{{\theta _{k} }} S_{{\psi _{k} }} - S_{{\theta _{k} }} } \\ {S_{{\phi _{k} }} S_{{\theta _{k} }} C_{{\psi _{k} }} - C_{{\phi _{k} }} S_{{\psi _{k} }} S_{{\phi _{k} }} S_{{\theta _{k} }} S_{{\psi _{k} }} + C_{{\phi _{k} }} C_{{\psi _{k} }} S_{{\phi _{k} }} C_{{\theta _{k} }} } \\ {C_{{\phi _{k} }} S_{{\theta _{k} }} C_{{\psi _{k} }} + S_{{\phi _{k} }} S_{{\psi _{k} }} C_{{\phi _{k} }} S_{{\theta _{k} }} S_{{\psi _{k} }} - S_{{\phi _{k} }} C_{{\psi _{k} }} C_{{\phi _{k} }} C_{{\theta _{k} }} } \\ \end{array} } \right){\text{ }}$$
(15)

where \({\varvec{g}}_\mathbf {world}\) denotes a gravity vector in the world frame i.e. \(g_{{{\text{world}}}} \; \fallingdotseq\;9.8{\mkern 1mu}\; {\text{m/s}}^{{\text{2}}}\), \(\varvec{Rot}^{{{\text{xyz}}}}\) denotes a rotation matrix for a vector. The covariance matrix of the process noise is \({\varvec{R}}\).

$$\begin{aligned} {\varvec{R}} = \begin{pmatrix} \gamma \Sigma _{0, 0} &{} \Sigma _{0, 1} &{} \Sigma _{0, 2} \\ \Sigma _{1, 0} &{} \gamma \Sigma _{1, 1} &{} \Sigma _{1, 2} \\ \Sigma _{2, 0} &{} \Sigma _{2, 1} &{} \gamma \Sigma _{2, 2} \end{pmatrix} \end{aligned}$$
(16)

where \(\gamma\) denotes a hyperparameter for adjusting variance. The state vector \({\varvec{x}}\) and the covariance matrix \({\varvec{P}}\) are respectively computed as following.

$$\begin{gathered} \hat{\varvec{x}}_{k} \; = \;\varvec{x}_{k} + \varvec{K}_{k} (\varvec{z}_{k} - h_{{(\varvec{x}_{k} )}} ),\;\hat{\varvec{P}}_{k} \; = \;(\varvec{I} - \varvec{K}_{k} \varvec{J_h}_{{{k} }} )\varvec{P}_{k} \hfill \\ \varvec{J_h}_{{{k} }} \; = \;\left. {\frac{{\partial h}}{{\partial \varvec{x}}}} \right|_{{x_{k} }} ,\;\varvec{K}_{k} \; = \;\varvec{P}_{k} \varvec{J_h}^{T} _{{{k} }} (\varvec{J_h}_{{{k} }} \varvec{P}_{k} \varvec{J_h}^{T} _{{{k} }} + \varvec{R}_{k} )^{{ - 1}} \hfill \\ \end{gathered}$$
(17)

where \({\varvec{J_h}}\) denotes \(h\) Jacobean, \({\varvec{K}}\) denotes a gain matrix, and \({\varvec{I}}\) denotes an identity matrix.

Validation of EKF

The proposed system was validated in real-time. Since ground truth is available in the simulator, we used synthetic flight data of a drone.

Compared methods

Definitions of methods which were used in this validation are summarized here.

Gyro

“Gyro” denotes an estimation method integrating angular velocity from a gyroscope.

Gyro+Acc

“Gyro+Acc” denotes an EKF-based estimation method integrating angular velocity and linear acceleration from IMU.

Gyro+NDT

“Gyro+NDT” denotes NDT SLAM [4] using 32 layers of LiDAR. Angular velocity from a gyroscope, linear velocity of ground truth and NDT output are integrated in the EKF. Note that linear velocity of ground truth is available because the environment is a simulator.

Gyro+Regression

“Gyro+Regression” denotes an EKF-based estimation method integrating angular velocity from a gyroscope and gravity vectors from the regression network (Fig. 6). All outputs from the network are integrated in the EKF.

Gyro+MLE (ours)

“Gyro+MLE (ours)” denotes the proposed method described in "EKF-based self-attitude estimstion with DNN" section. The hyperparameters were set as \({\text{TH}}_{\beta } = 8 \times 10^{{ - 5}}\) and \(\gamma = 1 \times 10^4\). They were empirically determined based on the result in "Validation of DNN" section.

Experimental conditions

Flight data of a drone ware recorded in “Neighborhood” and “SoccerField” of AirSim. “Neighborhood” is the same environment where the DNN was trained. An IMU and a camera are installed on a drone in the simulator. A LiDAR with 32 layers is also used for “Gyro+NDT”. The sampling frequency of the IMU, the camera and LiDAR are approximately 100 Hz, 12 Hz and 20 Hz, respectively. The camera’s horizontal FOV is 70 deg. Virtual noise was add to the IMU’s 6-axis data. It was randomly added following a normal distribution with a mean of 0 \({\text{rad/s}}\), 0 \({\text{m/s}}^2\) and a standard deviation of 0.1 \({\text{rad/s}}\), 0.1 \({\text{m/s}}^2\), respectively. The flight courses of “Neighborhood” and “SoccerField” are shown in Fig. 11. A computer which has i7-6700 CPU and GTX1080 GPU with 16 GB memory was used for the estimation. The DNN inference computation takes around 0.01–0.02 seconds with the computer. The proposed method is not only for UAVs, but it needs to be noted that this computer may be over-performance for real UAVs. Therefore, testing with a general UAV’s computer specification should be necessary when the method is used on real UAVs, and it is not the focus of this study.

Fig. 11
figure11

Flight courses. In (a), a course in “Neighborhood” which is a known environment for the DNN is shown. The drone flew for 5 rounds. It took about 22 minutes. In (b), a course in “SoccerField” which is an unknown environment for the DNN is shown. The drone flew for 3 rounds. It took about 6 minutes

Experimental results

The estimated attitudes are plotted in Fig. 12a. Table 11 shows MAE of the estimated attitudes in “Neighborhood”. MAE of “Gyro+MLE (ours)” is smaller than ones of the other methods. “Gyro” had large accumulative error. That is natural because noise was added and the method does not have any other observation. “Gyro+Acc” does not have accumulative error. However it has error constantly, since the acceleration values of the sensor contain own acceleration of the robot. “Gyro+NDT” accumulated error slower than “Gyro” did by using LiDAR, but it could not remove the accumulation. “Gyro+Regression” and “Gyro+MLE (ours)” corrected the accumulative error by observing the estimated gravity. Comparing “Gyro+Regression” and “Gyro+MLE (ours)”, filtering out the DNN outputs with large \(\beta\) is found valid. With “Gyro+MLE (ours)”, 31 % of the outputs is rejected by the threshold in the flight.

MAE of the estimation in “SoccerField” is shown in Table 12. “Gyro+Regression” and “Gyro+MLE (ours)” suppressed accumulative error even in the environment where the DNN was not trained. MAE of “Gyro+MLE (ours)” is smaller than ones of the other methods. With “Gyro+MLE (ours)”, more outputs (58 %) from the DNN are filtered out by the threshold \({\text{TH}}_{\beta }\) in the unknown environment (“SoccerField”) than in the known environment (“Neighborhood”). This can avoid observing outputs with high uncertainty. However, it leads chances of correcting error less. In other words, the proposed method can not correct error when the large variance is continuing. To compensate this decrease of the chances, the proposed method actually can integrates other observations such as IMU’s acceleration, SLAM, and so on in a future work. Whereas, in this experiment, the proposed method integrates only angular rate and the DNN outputs in order to make the validation simple.

Fig. 12
figure12

Real-time attitude plotting. In (a), the result of the flight in “Neighborhood” is shown. In (b), the result of the flight in “SoccerField” is shown.

Table 11 MAE of estimated attitude during flight in known environment
Table 12 MAE of estimated attitude during flight in unknown environment

Conclusions and future work

EKF-based self-attitude estimation with DNN learning landscape information was proposed. The DNN estimates direction of gravity from a single shot image. The network outputs not only the gravity vector, but also a covariance matrix. Training was done with synthetic data of AirSim, and validation was done with both of the synthetic data and real sensors’ data. In the validation, many samples with large error are filtered out by judging variance values. It means the proposed network expresses uncertainty of the prediction by outputting covariance matrices. The outputs from the DNN and angular velocity from a gyroscope are integrated in the EKF. The covariance matrix is used adjusting process noise. Moreover, outputs with too large variance are filtered out by a threshold. This EKF-based proposed method was validated with flight data of a drone in AirSim environments. In the experiments, the proposed method suppressed accumulative error by using the DNN.

In a future work, wider variety of datasets including real data are needed for the DNN to close the gap between the results in known environments and in unknown environments. Simulator data can be used for pre-training of the DNN before fine tuning with the real data. Only the roll augmentation is applied in this paper because it is easy and simple, but pitch augmentation by the homography transformation should also be valid for fine-tuning with the less real samples. Using other sensors or combining other methods is another future work. The experiments in this paper were done only with the daytime condition, but real robots should work in day and night. Therefor the future work needs to compensate the camera’s weak point for night operations.

Availability of data and materials

Dataset consisting of images and their corresponding gravity vectors. https://github.com/ozakiryota/dataset_image_to_gravity. .https://github.com/ozakiryota/image_to_gravity.https://github.com/ozakiryota/dnn_attitude_estimation.

References

  1. 1.

    Vaganay J, Aldon MJ, Fournier A (1993) Mobile robot attitude estimation by fusion of inertial data. In: Proceedings of 1993 IEEE International Conference on Robotics and Automation (ICRA), pp. 277–282

  2. 2.

    Thrun S, Burgard W, Fox D (2005) In: Probabilistic Robotics, pp. 309–336. The MIT Press

  3. 3.

    Rusinkiewicz S, Levoy M (2001) Efficient variants of the icp algorithm. In: Proceedings of Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152

  4. 4.

    Biber P, er, WS (2003) The normal distributions transform: a new approach to laser scan matching. In: Proceedings of 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  5. 5.

    Engel J, Stueckler J, Cremers D (2014) Lsd-slam: Large-scale direct monocular slam. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 834–849

  6. 6.

    Mur-Artal R, Montiel JMM, Tardós JD (2015) Orb-slam: a versatile and accurate monocular slam system. IEEE Transact Robotic 31(5):1147–1163

    Article  Google Scholar 

  7. 7.

    Quddus MA, Ochieng WY, Noland RB (2007) Current map-matching algorithms for transport applications: State-of-the art and future research directions. Transportat Res Part C Emerg Technol 15(5):312–328

    Article  Google Scholar 

  8. 8.

    Kim P, Coltin B, Kim HJ (2018) Linear rgb-d slam for planar environments. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 333–348

  9. 9.

    Hwangbo M, Kanade T (2011) Visual-inertial uav attitude estimation using urban scene regularities. In: Proceedings of 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 2451–2458

  10. 10.

    do Monte Lima JPS, Uchiyama H, Taniguchi RI (2019) End-to-end learning framework for imu-based 6-dof odometry. Sensors 19(17):3777

    Article  Google Scholar 

  11. 11.

    Al-Sharman MK, Zweiri Y, Jaradat MAK, Al-Husari R, Gan D, Seneviratne LD (2020) Deep-learning-based neural network training for state estimation enhancement: application to attitude estimation. IEEE Transact Instrument Measure 69(1):24–34

    Article  Google Scholar 

  12. 12.

    Mérida-Floriano M, Caballero F, Acedo D, García-Morales D, Casares F, Merino L (2019) Bioinspired direct visual estimation of attitude rates with very low resolution images using deep networks. In: Proceedings of 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 5672–5678

  13. 13.

    Shah S, Dey D, Lovett C, Kapoor A (2017) Airsim: High-fidelity visual and physical simulation for autonomous vehicles. Field Serv Robotics 5:621–635

    Article  Google Scholar 

  14. 14.

    Ellingson G, Wingate D, McLain T (2017) Deep visual gravity vector detection for unmanned aircraft attitude estimation. In: Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  15. 15.

    Kalman RE (1960) A new approach to linear fi ltering and prediction problems. J Basic Eng 82:35–45

    Article  Google Scholar 

  16. 16.

    Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: arXiv Preprint, pp. 1409–1556

  17. 17.

    Deng J, Dong W, Socher R, L. Li KL, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255

  18. 18.

    Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. Proceed ICML 2010:807–814

    Google Scholar 

  19. 19.

    Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

    MathSciNet  MATH  Google Scholar 

  20. 20.

    Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference for Learning Representations (ICLR)

Download references

Acknowledgements

The authors are thankful for the generous support from the New Energy and Industrial Technology Development Organization (NEDO) for this study. This study was conducted under “Autonomous Robot Research Cluster” at Meiji University.

Funding

This study was supported by the New Energy and Industrial Technology Development Organization (NEDO).

Author information

Affiliations

Authors

Contributions

RO proposed the method described in this paper, implemented all the programing, conducted all the experiments, and drafted the manuscript. YK provided the inspiration for this study, provided advice, and checked and corrected the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Ryota Ozaki.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ozaki, R., Kuroda, Y. EKF-based self-attitude estimation with DNN learning landscape information. Robomech J 8, 9 (2021). https://doi.org/10.1186/s40648-021-00196-3

Download citation

Keywords

  • Attitude estimation
  • Mobile robotics
  • Deep learning
  • Extended Kalman filter