LiDAR DNN based self-attitude estimation with learning landscape regularities

This paper presents an EKF (extended Kalman filter) based self-attitude estimation method with a LiDAR DNN (deep neural network) learning landscape regularities. The proposed DNN infers the gravity direction from LiDAR data. The point cloud obtained with the LiDAR is transformed to a depth image to be input to the network. It is pre-trained with large synthetic datasets. They are collected in a flight simulator because various gravity vectors can be easily obtained, although this study focuses not only on UAVs. Fine-tuning with datasets collected with real sensors is done after the pre-training. Data augmentation is processed during the training in order to provide higher general versatility. The proposed method integrates angular rates from a gyroscope and the DNN outputs in an EKF. Static validations are performed to show the DNN can infer the gravity direction. Dynamic validations are performed to show the DNN can be used in real-time estimation. Some conventional methods are implemented for comparison.


Introduction
Estimating the attitude of a robot is one of the classic problems in mobile robotics. In particular, real-time estimation is required for real-time attitude control. The attitude is generally estimated with inertial sensors such as accelerometers and gyroscopes. However, mobile robots have their own acceleration. Moreover, on-road robots also receive pulses from the ground, and UAVs suffer from vibration of their multi-rotor. These need to be filtered out from the accelerometer. On the other hand, integration of gyroscopic angular rate has problems of drift and bias. These disturbances worsen the accuracy of the estimation. To complement each other, these inertial data are fused, generally [1]. Nevertheless, dealing with the disturbances with only inertial sensors is quite difficult.
To reduce the influence of these disturbances, many kinds of LiDAR odometry, VO (visual odometry) and SLAM (simultaneous localization and mapping) [2] have been proposed. LiDAR-based methods register point clouds by ICP [3], NDT [4], and so on. Visual methods often track features in image sequences [5,6]. However, these odometry methods and SLAMs often contain accumulative error since relative pose changes with error are summed up. In order to correct the accumulative error, prior information such as 3D maps is often used [7]. These methods correct the error by matching the prior information against the sensor data. However, they work only in environments where maps are available. Moreover, creating a map is time-consuming, and updating is also required. Some methods [8][9][10] estimate the attitude under Manhattan world assumption. They assume that planes or edges in the environment are orthogonal to each other. It helps achieving drift-free estimation. However, it is difficult for this kind of method to avoid being affected by objects which do not satisfy the assumption. We presented a method in [11] using looser restraints than Manhattan world assumption. It exploited a regularity which most buildings are built vertically. Therefore, the method can be applied not only to Manhattan world, but also to environments where planes are not orthogonal to each other. Vertical planes are extracted from LiDAR point cloud, and the gravity direction is estimated based on normals of the planes. However, the method misses many features even though there are more regularities in environments, since it uses only completely flat planes. Besides, it requires tuning many hyperparameters such as the searching radius for PCA (principal component analysis), and the threshold for the number of the normals, and the threshold for the clustering angle. In addition, the processing of the 3D point clouds takes much time.
Deep learning has been used for attitude estimation in recent years. In [12], IMU-based odometry by end-toend learning has been proposed. In [13], a deep neural network identifies the measurement noise characteristics of IMU. In [14], a neural network estimates angular rates from sequential images. It was trained with synthetic and real images. The large synthetic dataset was collected in AirSim [15] which offers visually realistic graphics. In [16], a gravity vector is directly estimated from a single shot image. The method can infer the orientation at which a picture was taken simply by looking at the picture, like a human. It implies there are regularities between the gravity direction and landscapes. Since the method is a camera-based one, it is cost effective, but it does not work well in nighttime.
To address these issues above, we present an EKFbased self-attitude estimation method with a LiDAR DNN learning the landscape regularities. By using a LiDAR, the method can estimate the attitude regardless of day or night. Moreover, problems of rule-based methods such as [11] described above can be solved by using deep learning. The main contributions of this paper are summarized below: • A LiDAR-based DNN which infers an absolute gravity direction which does not contain the robot's own acceleration nor vibration is proposed. Unlike camera-based methods, the DNN can estimate it regardless of day or night. • The DNN inference is integrated with a gyroscope for real-time estimation. • A data transformation and augmentation method for the LiDAR and IMU data is proposed. Converting the 3D point cloud into a 2D depth image reduces the computation time in the DNN. • Pre-training the DNN with large synthetic data before fine-tuning it with real data makes the learning efficient, while the related work [16] uses only real one.
The datasets and the source code used in this paper have been released in open repositories (see ' Availability of data and materials').

DNN estimating gravity vector
The proposed method makes the DNN learn landscape regularities for estimating the gravity vector.

Coordinate definition
A world frame is defined as a standard right-handed coordinate system. A robot frame is defined as a righthanded coordinate system which is fixed on the robot pose. Its x axis is on the robot's heading direction. They are shown in Fig. 1.

Dataset collection
Both synthetic and real data are collected in this study. The synthetic datasets are used for pre-training, and the real ones are used for fine-tuning. The datasets consist of LiDAR data and corresponded gravity vectors g in the robot frame. Technically, lidar data represents 2D depth images transformed from 3D point clouds. The details about the transformation are described in the next section. Fig. 2 shows examples of the datasets. The synthetic data is collected in AirSim [15]. AirSim is a simulator for drones, cars and more, built on Unreal Engine, which provides visually realistic graphics. An IMU and a LiDAR which has 32 layers are installed to a drone in the simulator. The robot pose and weather parameters are randomized, and the LiDAR data and a gravity vector are recorded at each pose. The range of the random Z is limited as [2 m, 3 m] in this work. The ranges of the random roll φ and pitch θ are limited as [−30 deg, 30 deg], respectively. The reason for limiting Z , φ and θ is that this study mainly focuses on on-road robots that tilt their bodies on which a LiDAR can be installed. The real datasets are collected with an IMU (Xsens MTi-30) and a LiDAR (Velodyne VLP-32) installed on a top of a stick (Fig. 3). The stick is hand-carried, and point clouds and linear acceleration vectors are recorded. They are saved only when the stick is shaking less than 0.001 m and 0.1 deg in 0.5 s, and when it is at least 5 deg away from the last saved pose. The IMU is regarded as ground truth because it has enough accuracy (within 0.2 deg) in static according to the specification. Learning the static IMU is valuable because the DNN can reproduce it even in dynamic.

Data preprocessing
Each input and label data are transformed, and are augmented in each epoch of training. Data augmentation is especially important for real data because collecting real data is time-consuming. Figure 4 shows an example of data transformation.

LiDAR data transformation
The point clouds obtained with the LiDAR are transformed to depth images as follows.   Each generated depth image is flipped in 50% of probability for augmenting the data. After the flipping process, the pixels of each depth image are randomly slid horizontally. The virtual yaw variant �ψ is computed as below.
where Δcol denotes the number of the slid pixels.

IMU data transformation
The gravity vector is also flipped and rotated according to �ψ . Since the network does not need to learn the norm of the gravity, L2 normalization is applied to the vector in order to make the training efficient. (1) The proposed DNN is shown in Fig. 5. It consists of CNN (convolutional neural network) layers and FC (fully connected) layers. The input to the network is the depth image, and the output is a gravity vector . Technically, the output of the FC layers is normalized. The CNN layers are expected to learn extracting features such as edges and planes. The FC layers are expected to learn the regularities between the features and the gravity direction.
It is expected that the CNN layers learn extracting features such as edges and planes, and the FC layers learn landscape regularities. All layers, except the final output layer, use the ReLU function [17] as an activation function. All FC layers, except the final output layer, use the 10% Dropout [18] to avoid the over-fitting problem.

Loss function
The MSE (mean square error) between the outputs and labels is used as a loss function of this model.
where denotes the parameters of the network, and #D denotes the number of samples. The network minimizes the loss by updating .

Optimization
Adam (adaptive moment estimation) [19] is used to optimize the parameters. For the training with the synthetic data, the learning rates are set as lr CNN = 0.00001 , lr FC = 0.0001 , where lr CNN is a value for the CNN layers, lr FC is a value for the FC layers. For the fine-tuning with real data, they are set smaller as lr CNN = 0.000001 , lr FC = 0.00001.

EKF-based real-time estimation
The outputs from the DNN are integrated with gyroscopic angular rate in an EKF. The proposed EKF architecture is shown in Fig. 6. It is based on [16] which simplifies the attitude estimator in [20]. The state vector x of the proposed Kalman filter consists of the roll φ and pitch θ of the robot pose.
Both of the vector x and the covariance matrix P are computed in a prediction process and an update process. The prediction process is computed by integrating angular velocity from a gyroscope. The update process is computed by observing the outputs of the DNN. Note that the covariance matrices for the prediction and observation are determined experimentally. Here, t denotes the time step, S , C , T are short for sin , cos , tan , respectively in the following sections.

Prediction process
The state vector x and the covariance matrix P are respectively computed as follows.
where f is a state transition model, u denotes a control vector, denotes the angular velocity measured with a gyroscope, and Rot rpy denotes a rotation matrix for angular velocities.
where J f denotes f Jacobean, and Q denotes a covariance matrix of the process noise.

Update process
The observation vector is z as below.
where ̂ denotes a gravity which is output from the DNN. The observation model is h.
where g denotes a gravity vector in the world frame i.e. g world ≒ 9.8 m∕s 2 , Rot xyz denotes a rotation matrix for vectors. The state vector x and the covariance matrix P are respectively computed as follows.
where J h denotes h Jacobean, K denotes a gain matrix, R denotes the covariance matrix of the process noise, and I denotes an identity matrix.

Validation
Static and dynamic experiments were performed on both synthetic and real data.

Static validation of DNN
The proposed DNN was trained with training datasets, and was evaluated with test datasets.

Method list
Definitions of methods which were used in this validation are summarized here.
• LiDAR DNN (ours): 'LiDAR DNN (ours)' denotes the proposed method described in the section above. • Camera DNN: 'Camera DNN' denotes a DNN where the input to it is a color image, and the output is a gravity vector. Its CNN module is the same feature module as VGG16 [21], which means this network is almost the same as the related work [16]. • Statistics: 'Statistics' denotes a method using the average of the label vectors as outputs for all samples, which means #D ι=0 g ι is used for estimating attitudes of all samples. Computing the error of this method is equivalent to calculating the standard deviation of the dataset. This method is regarded as the baseline in this study.

Training
The datasets used in this validation are listed in Table 1. The DNN was trained with 10000 synthetic samples (Dataset#1) with a batch size of 200 samples for 200 epochs. Another 1000 samples (Dataset#2) were used for test. They were collected in 'Neighborhood' of AirSim. The training dataset and the test dataset were not mixed. A computer which has W-2133 CPU and Quadro GV100 GPU with 32 GB memory was used for the training. The training took about 1.1 h with the computer.
The loss values during the training are plotted in Fig. 7(a). Table 2 shows the loss values after 200 epochs of training.

Fine-tuning
Fine-tuning with the real data was done after the training with the synthetic data. The DNN was tuned with 1941 real data samples (Dataset#3) with a batch size of 200 samples for 200 epochs. Another 1217 samples (Data-set#4-6) were used for test. Dataset#3-5 were collected in the same campus of Meiji University, but not in the same area. Dataset#6 was collected on a slope without surrounding buildings. Figure 8 shows pictures of one part of each environment.
The loss values during the fine-tuning are plotted in Fig. 7(b). Table 3 shows the loss values after 200 epochs of the fine-tuning. The loss value on the real dataset became smaller by the fine-tuning. However the loss value on the test dataset is larger than one on the training dataset. To reduce the difference of the results between the training data and the test data, a wider variety of datasets are needed for training.

Attitude estimation
The MAE (mean absolute error) and the variance of the static estimation are computed as below.  The MAE and the variance of the estimation on the synthetic datasets is shown in Table 4. Those on the real datasets are shown in Table 5. Both of 'LiDAR DNN (ours)' and 'Camera DNN' inferred the attitude with small errors, except with Dataset#6. The DNNs did not work well on Dataset#6. The distribution of error in Fig. 9(a) implies that the LiDAR DNN before fine-tuning underfit the dataset. On the other hand, Fig. 9(b) implies that the inferences given by LiDAR DNN after fine-tuning are biased. We consider it had an illusion that the slope is horizontal because the training datasets do not contain many samples of slopes. It might also have an illusion that the wall (in Fig. 8(c)) is vertical. Therefore, it should be noted that the DNNs do not perform well in situations which are not contained in the datasets. At the point in Fig. 8(c), the ground truth is gt = 9.23 deg, gt = −11.51 deg , and the inference given by LiDAR DNN after fine-tuning is est = 3.83 deg, est = 1.50 deg . According to this result, establishing a way to collect more data including slopes is our future work. On the other hand, the DNNs perform well even in unknown environments when regularities such as vertical buildings exist.
Focusing on the data collected in the nighttime (dataset #5), 'Camera DNN' showed a good result thanks to streetlights, but it had larger error than in the daytime. The variance is also much larger. On the other hand, 'LiDAR DNN' is not affected by the light condition.    Comparing 'before fine-tuning' and 'after fine-tuning' , the fine-tuning with the real datasets makes the error smaller. The number of the samples for the finetuning is not large, but it worked enough. It implies the pre-training with the large synthetic dataset is valid. Before the fine-tuning, the error of 'LiDAR DNN' is large. It is considered to be because the LiDAR in the simulator does not fully reproduce specifications of the real one, such as non-linear spacing of FOV (Fig. 10). This problem is solved by the fine-tuning.

Validation of real-time estimation in simulator
The proposed EKF-based real-time estimation was validated on synthetic flight data of a drone since ground truth is available in the simulator. Videos of the experiments have been released in public (see ' Availability of data and materials').

Method list
Definitions of methods which were used in this validation are summarized here.
• Gyro: 'Gyro' denotes an estimation method integrating angular velocity from a gyroscope. • Gyro+Acc: 'Gyro+Acc' denotes an EKF-based estimation method integrating angular velocity and linear acceleration from an IMU. • Gyro+NDT: 'Gyro+NDT' denotes NDT SLAM [4] using 32 layers of LiDAR. Angular velocity from a gyroscope, linear velocity of ground truth, and the NDT output are integrated in an EKF. Note that linear velocity of the ground truth is available because the environment is a simulator. • Gyro+DGSphere [11]: 'Gyro+DGSphere' denotes a method described in [11]. Vertical planes are extracted from the LiDAR point cloud, and the cross product of the planes' normals is used as an estimated gravity direction. 'DGSphere' is short for 'depth-Gaussian sphere' . • DNN: 'DNN' denotes a method using the proposed DNN directly without EKF. • Gyro+DNN (ours): 'Gyro+DNN (ours)' denotes the proposed method described in the section above.

Experimental conditions
Flight data of a drone was recorded in 'Neighborhood' of AirSim. The sampling frequency of the IMU and the LiDAR are approximately 100 Hz, 20 Hz, respectively. Virtual noise was added to the IMU's 6-axis data. It was randomly added following a normal distribution with a mean of 0 rad∕s , 0 m∕s 2 and a standard deviation of 0.5 rad∕s , 0.5 m∕s 2 , respectively. Note that the simulator does not reproduce the motion distortion of the LiDAR data. The flight course is shown in Fig. 11. A computer which has i7-6700 CPU and GTX1080 GPU with 16 GB memory was used for the estimation. The DNN inference computation takes around 0.005 s with the computer, while 'DGSphere' takes around 0.4 s every step.

Experimental results
The estimated attitudes in 'Neighborhood' are plotted in Fig. 12. Table 6 shows the MAE of the estimated attitude. The MAE of 'Gyro+DNN (ours)' is smaller than ones of the other methods. 'Gyro' had large accumulative error. That is natural because noise was added and the method does not have any other observation. 'Gyro+Acc' did not have accumulative error. However it had error constantly, since the acceleration values of the sensor contained own acceleration of the robot and noise. On the other hand, the proposed method can observe the gravity vector which does not contain them. 'Gyro+NDT' accumulated error slower than 'Gyro' did by using the LiDAR, but it could not remove the accumulation. 'Gyro+DGSphere' and 'Gyro+DNN' corrected the accumulative error by observing the estimated gravity. Comparing their MAE, the deep learning surpasses the rule-based method. Moreover, the DNN   outputs the estimation much faster than 'DGSphere' which processes 3D point clouds.

Validation of real-time estimation in real world
To see the fine-tuned DNN can work in real world, two types of experiments with the real sensors (Fig. 3) were performed.

Indoor experiment with motion capture
The sensors were hand-carried in an indoor environment of 4.5 m × 6 m for about 23 min. Motion capture cameras (Vicon Vero v1.3X) were used for measuring the ground truth. Note that the DNN was not trained in this area. Table 7 shows the MAE of the estimated attitude. The proposed method suppressed accumulation of error also in the real world. In the flat indoor environment, the MAE given by the proposed method is almost the same as that of 'Gyro+Acc' . The acceleration measured with the IMU is not integrated in the proposed EKF in this paper just for making the validation simple, but it actually can be integrated, and it would be a more stable estimation. For reference, the error given by 'Gyro+Acc+DNN' was error = 2.503 deg, error = 1.637 deg. 'DNN' without EKF showed worse performance compared with 'Gyro+DNN (ours)' although they performed similarly in the simulator. One of the reasons may be because the sensors moved more intensely in this real experiment. In order to see this, Fig. 13 plots the sensor attitude during the experiment. The DNN had larger error when the sensors moved rapidly. Motion distortion of the LiDAR data might affect the DNN in the real world, while the simulator does not reproduce the distortion. Another reason may be because 'DNN' does not interpolate the state between inferences unlike 'Gyro+DNN (ours)' .

Outdoor experiment
The motion capture cameras measure the attitude accurately, but the captured area is limited. To complement that, a long distance experiment was also performed. Detailed quantitative evaluation of the accuracy was done in the previous section, thus this section is just for seeing that the proposed method also be able to work outdoors.
The sensors were hand-carried for around 5 min in Area-II ( Fig. 11(b)) where the DNN was not trained. Since the ground truth is not available while the sensors are being carried, the estimated attitude at the end  The sensors were placed on a flat floor at the start and end of the experiment as Fig. 14, and the ground truth was assumed as gt = 0 deg, gt = 0 deg . This evaluation method is based on the related study [16]. Table 8 shows the error of the estimation at the last pose. The proposed method suppressed accumulation of error during the driving outdoor. 'Gyro+Acc' had very small error at the final pose because the sensor was still. The intermediate estimation results are also shown in Fig. 15 as reference.

Conclusions and future work
The proposed method integrates a gyroscope and the DNN for estimating self-attitude in real-time. The proposed network estimates the gravity direction from LiDAR data. It was trained with synthetic data, and was fine-tuned with real data. Pre-training with the large synthetic data and augmenting the data help making the learning efficient. The static experiment showed the DNN can infer the gravity direction from only single shot LiDAR data. It showed good results regardless of day or night. For dynamic estimation, angular rates from a gyroscope and the DNN's outputs are integrated in the EKF. The dynamic experiments showed the proposed method can be used for real-time estimation.
However, it should be noted that the proposed DNN did not perform well in the situations which are not contained in the training datasets, especially without buildings. A way to collect more variety of data or judging the difficulty of inferences is necessary in our future work. Besides, the proposed method does not cope with the distortion of the LiDAR data in this paper. It worked well in the experiments, but the distortion may affect the inference when the LiDAR moves much faster. Testing the effects and coping with it are our future work. As our other future work, adopting the invariant extended Kalman filter (IEKF) [22] as the estimator instead of the EKF should be considered. Combining the camera DNN and the LiDAR DNN, or using other sensors for estimating the attitude is another future work.  15 Real-time plotting in outdoor experiment. Note that the ground truth while the sensors were being carried is not available in the outdoor environments