Terrain traversability prediction for off-road vehicles based on multi-source transfer learning

In this paper, a novel terrain traversability prediction method is proposed for new operation environments. When an off-road vehicle is operated on rough terrains or slopes made up of unconsolidated materials, it is crucial to accurately predict terrain traversability to ensure efficient operations and avoid critical mobility risks. However, the prediction of traversability in new environments is challenging, especially for possibly risky terrains, because the traverse data available for such terrains is either limited or non-existent. To address this limitation, this study proposes an adaptive terrain traversability prediction method based on multi-source transfer Gaussian process regression. The proposed method utilizes the limited data available on low-risk terrains of the target environment to enhance the prediction accuracy on untraversed, possibly higher-risk terrains by leveraging past traverse experiences on multiple types of terrain surface. The effectiveness of the proposed method is demonstrated in scenarios where vehicle slippage and power consumption are predicted using a dataset of various terrain surfaces and geometries. In addition to predicting terrain traversability as continuous values, the utility of the proposed method is demonstrated in binary risk level classification of yet to be traversed steep terrains from limited data on safer terrains.


Introduction
The demand for mobile robots and/or autonomous ground vehicles in off-road operations has steadily increased. Examples of such applications are found in the construction, forestry, mining, disaster response, and planetary exploration industries. To safely and efficiently operate these vehicles, it is necessary to predict and assess the traversability of the target terrains. Depending on the operation, terrain traversability is conventionally measured by the existence of geometric obstacles, vehicle posture alterations, vibrations, required energy, and slippage.
Predicting terrain traversability is particularly important when a vehicle is expected to traverse potentially high-risk terrains, such as slopes made up of loose materials. One of the dangers of traversing such terrains is the slippage of vehicles. Slippage inhibits the smooth traverse of a vehicle, which increases the time and energy required for the operation. Furthermore, if the slip becomes significant, the vehicle cannot make successful forward progress, thereby requiring its operators to replan the route or even the entire mission. Therefore, to ensure successful operations, it is important to detect risky terrains so that effective routes can be properly selected.
However, the accurate prediction of vehicle motion on natural terrains is difficult owing to complicated vehicle-terrain interactions. Vehicle behaviors are generally influenced by several factors including terrain geometry (slope and roughness), surface type (sand, cohesive soil, rocks, bedrock, or mixtures of these), and surfaceaccumulated conditions (compacted or not, depth, and moisture content) as well as the vehicle size, weight, and locomotion configurations [1].
Numerous studies have investigated ways to improve the prediction of traversability on natural terrains [2,3]. Although many of the developed methods are relatively promising, sufficient traverse data on target environments Inotsume and Kubota ROBOMECH Journal (2022) 9:6 are still required for the accurate prediction of traversability. For example, if a vehicle's target is an ascending slope, traverse data related to that possibly hazardous terrain is primarily required for making an accurate prediction. However, during operations, vehicles usually avoid risky terrains for safety reasons. Therefore, traverse data on such terrains might be limited or even non-existent. Model parameters solely trained on benign terrains can overfit the limited data, resulting in either overestimating the traversability or underestimating the risks on more difficult terrains. This makes the traversability prediction of risky environments highly challenging.
This study addresses the above problem, which is inherent to the prediction of terrain traversability. Specifically, this study aims to improve the prediction accuracy of traversability on challenging terrains in new operation environments on the basis of (1) traverse data on relatively safe areas and (2) past experiences (Fig. 1). Accordingly, this study proposes an adaptive terrain traversability prediction method based on multi-source transfer Gaussian process regression (MS-TGPR) [4]. The proposed method leverages traverse experiences obtained from multiple terrain surfaces and improves the prediction accuracy on possibly risky terrains in the target mission environment, where in-situ traverse data are only available from benign terrains beforehand.
The contributions of this study are as follows.
1. The development of a method for predicting terrain traversability on untraversed, possibly risky terrains based on multi-source transfer learning. 2. The application of the proposed method to vehicle slip prediction, which is a significantly difficult tra-versability prediction challenge when training data is severely limited, owing to complicated terrain-vehicle interactions. 3. The evaluation of the effectiveness of the proposed method using a dataset collected with a mobile robot on multiple terrain geometries and surface types.
The basic concept of the proposed method was previously proposed and partially validated by using synthetic and real datasets in earlier works [5,6]. The work in this paper newly demonstrates the utility of the proposed method in 1) slip risk level classification and 2) power consumption prediction on slopes from limited in-situ data. This paper also presents thorough additional evaluations of the traversability prediction performance from the viewpoint of the usability of the method in real operations, with added data on more terrain surface types and conditions compared to the previous work [6]. The rest of the paper is organized as follows. The next section describes related works of this study. After that, the transfer learning-based traversability prediction method is proposed. Then, the experiments for collecting the traverse data of a mobile robot on multiple types of terrain are described. Following that, the proposed method is evaluated using the collected dataset and discusses its usability and extendibility. Finally the last section summarizes this study and concludes this paper with future work.

Related works
The prediction of terrain traversability for safe autonomous vehicle navigation has been studied for the last several decades [2,3]. Most Fig. 1 Concept of proposed traversability prediction based on multi-source transfer regression. By leveraging past experiences on multiple terrain surfaces, the proposed method improves the prediction accuracy on possibly risky terrains only from traverse data on safer terrains in new environments [6] Inotsume and Kubota ROBOMECH Journal (2022) 9:6 sufficiently robust for indoor or partial outdoor situations thanks to their detection and avoidance of dangerous geometric obstacles. However, conventional approaches are not applicable when a vehicle is required to traverse more difficult terrains characterized by unconsolidated materials, where vehicle behaviors are not easy to predict from pure-terrain geometry or visual information. Learning or model updating using in-situ traverse data is a straightforward way to improve the prediction accuracy of the terrain traversability. The work presented in [7] proposed a learning method that repeatedly improves the prediction accuracy of terrain traversability or the vehicle responses during traversals of the same operation area multiple times. In [8], models of control disturbances are learned and iteratively improved from repeated drives on the same terrain. The learned models are then used for better path tracking on the same off-road terrain. These methods are effective for vehicles deployed over a long duration in confined environments. However, if the vehicles are not supposed to traverse the same locations multiple times, this type of approach is not applicable.
Self-supervised learning approaches have been applied for terrain classification and traversability prediction [9][10][11], where proprioceptive sensing information is correlated to visual terrain information and utilized to supervise a vision-based terrain classifier, and vice versa. These approaches classify terrains and predict the traversability of distant fields by comparing underfoot terrain responses (such as slip, vibration, or force/torque) to visual cues at the learning phase. Because these methods only require a small amount of labeled data to train the terrain classifiers, they are relatively promising for classifying terrains from data on safe regions. However, they are limited when large variabilities in vehicle behaviors exist among intra-classes (sub-classes). For example, the behaviors of a vehicle can be significantly different even on similar-looking sandy surfaces depending on the accumulation conditions. In addition, even if the terrain classification is successful, it is still challenging to predict the traversability on higher-risk terrains of the corresponding class where training data are not available. The prediction model learned on limited available data can possibly be overfit on a low-risk terrain, thereby resulting in an underestimation of the risk of untraversed terrains.
Several approaches have been proposed to address the challenge of intra-class variability. In [12], the prediction accuracy of terrain traversability is improved by globally and locally modeling traversability. The local variations of vehicle motions are captured by learning traversability as a function of positions in addition to the global model constructed on the basis of terrain appearance and geometry. In [13], a first-in-first-out data structure was utilized to update traversability models. In this method, older data are removed from the training data whenever new local traverse data are observed, thereby improving the adaptation of the model to new operation environments. To address the intra-class variability challenge, [14] adopted a thermal sensor to examine the correlations between soil properties and the thermal inertia of terrains and showed that a mixture-of-experts approach utilizing surface thermal information can differentiate high-slip sand surfaces from low-slip sandy terrains and improve the slip prediction accuracy on sand-type surfaces. However, none of the above-mentioned studies explicitly address the challenge of the data limitation in traversability model learning in new environments.
One possible method for improving the prediction accuracy on high-risk terrains might be to aggressively explore such terrains and acquire the traverse data necessary for learning a better model [15]. However, this action can trigger critical conditions in the vehicle, and therefore should be avoided if the risk is uncertain. Another work [16] proposed a method to learn terrain traversability from physical simulations. As the authors also mentioned as a limitation, since the traversability is learned solely from terrain geometry for solid surfaces, this method is not applicable to deformable, slip-inducing terrains that are hard to realistically simulate without rigorous parameter tuning from real data. Another possible approach is simply to assume that terrains for which no data is available are untraversable.
In contrast to the above approaches, this study attempts to accurately predict the traversability of untraversed terrains by avoiding an overly conservative prediction for scenarios where the vehicle is required to traverse such terrains. The proposed method leverages data obtained from relatively safe terrains in the target environment and from past experiences of traversing multiple types of terrain surface, and it improves the prediction accuracy for challenging terrains without actually having to traverse such terrains. The proposed method is fundamentally a general traversability learning and prediction method, and as such, it can be integrated with the earlier developed self-supervised terrain classification approaches to develop an end-to-end learning framework for operations in new environments.

Proposed method
This study assumes that vehicle behaviors on any terrain can be represented by combinations of behaviors on the base terrains that constitute the target terrain. On the basis of this hypothesis, this study develops a traversability prediction method for new operation environments.
The proposed method is developed in accordance with two machine learning paradigms: transfer learning and ensemble learning. Transfer learning [17] is a method that applies data or learned models in one or more applications/tasks (source domains) to the model for another application/task (target domain). By transferring the model information from the source domain appropriately, transfer learning can improve the performance in the target domain, which has only limited training data. In contrast, ensemble learning [18] is a method that combines multiple models learned from the same training dataset but with different data subsets. This enables the generalization performance of the learned model to be improved; hence, its prediction performance is suitable for unseen data.

Problem definition
In this study, the traversability prediction is formalized as a regression problem in which the relationship between various input features x and an output traversability index of continuous values f (x) , along with its predictive uncertainty, is learned from training data in a supervised learning fashion. As described earlier, the traversability index can include, for example, vehicle posture alternations, vibration, required energy, or slippage. The input features x represent terrain geometries (terrain pitch, roll, and/or surface roughness). In general, x can also include other features, such as vehicle locations [7,12], visual features of the terrain [19], or control inputs to the vehicle [8]. This study aims at predicting the probabilistic distribution of the traversability (i.e., p(f (x) ) rather than performing deterministic predictions, as the vehicle motions can be significantly varied in difficult off-road environments even when the same vehicle traverses the same terrain and location multiple times. Operating a vehicle based on deterministic traversability prediction can lead to unexpected hazardous situations in such environments.
This study makes two key assumptions. First, the vehicle is able to estimate the terrain geometry x ahead of it by using on-board exteroceptive sensors (such as a stereo camera or range sensor) before traversing the terrain. In addition, the vehicle can estimate the traversability measurement y that corresponds to the predictive target traversability f (x) from the vehicle response during its traverse on the terrain corresponding to the input x by using on-board sensory information. The measurement is assumed to have a random noise ε that follows a zero-mean Gaussian distribution with variance σ 2 , i.e., ε ∼ N (0, σ 2 ) . Then, y can be expressed as Second, there exist a variety of terrain surface classes, or domains. Among these domains, the vehicle has collected a set of traverse data, which covers a range of terrain geometries from safe to (1) risky terrains, in N surface types (source domains) S = {S i : 1 ≤ i ≤ N } beforehand. The data of n S i observations in each source domain S i are denoted as In contrast, in a specific surface type (target domain) T, only traverse data of n T observations in relatively safe, benign terrains have been obtained by the vehicle. Similarly to the source domains, the target training data are denoted as j , y (T ) j )|j = 1, · · · , n T } . Each item of source and target domain data has been separately labeled and classified in accordance with the corresponding domain either by human experts or on the basis of an existing classification method (e.g., [9,10]).
Under these assumptions, the objective of this study is to stochastically predict the traversability f (T ) (x * ) , which corresponds to the input point x * in the target domain, using the already described available data samples, i.e., to predict the distribution p(f (T ) (x * )|x * , D (S 1 ) , · · · D (S N ) , D (T ) ) , before actually traversing that point. Specifically, this study builds the predictive target traversability model p(f * (x)) as a weighted sum of multiple weak traversability models {p(f (S i ,T ) (x)) : 1 ≤ i ≤ N } learned from the available data in the target domain T and those in the source domains S, as expressed below: where w i represents the weight of the model learned from the target data and the i-th source domain data, and satisfies w i = 1 . Although any algorithm can be adopted to learn the base model f (S i ,T ) (x) and the weight w i of that model, this study applies the MS-TGPR algorithm [4], a variant of Gaussian process regression (GPR) [20], for multi-source transfer learning.

Gaussian process regression
GPR is a data-driven, nonparametric approach to learning a regression model that does not require strong assumptions in its forms (e.g., linear, polynomial, or exponential), with the shape of the model determined on the basis of the training data. In addition, GPs can express the prediction uncertainties as variances, together with the predictive means. Owing to these features, GPR is suitable for modeling complicated vehicle behaviors in off-road environments. Several studies (e.g., [7,12,21]) have demonstrated the effectiveness of GPR for modelling terrain traversability when sufficient amounts of data are available.
The GPR-based traversability method assumes that the traversability latent function values f (X) of the input data X follow a multivariate Gaussian distribution, which is determined by the mean µ(X) and covariance matrix K(X, X) . In this study, the mean µ(X) is a zero vector assuming that the training traversability data are normalized before learning. The (k, l)-th element of the covariance matrix is computed on the basis of a covariance function (kernel function) as K k,l = k(x k , x l ) . In this study, the exponential covariance function defined below is adopted in both the GPR and the transfer GPR (TGPR) for slip-slope modeling: where a 1 is a constant scalar and A 2 is a constant diagonal matrix given by A 2 = diag(a 2 ) , where a 2 is a constant vector of the same length as that of the input feature x . The constants a 1 , a 2 and the measurement noise variance σ 2 are a set of hyper-parameters to be tuned. For modeling the relationship between the electric power consumption of actuators and the terrain slope, the following linear covariance function is adopted: where a 3 and a 4 are another set of hyper-parameters for the power modeling. Note that the linear covariance function was selected for power consumption because strong linear relations can be observed between power consumption and terrain slope, as reported in [22] and also later demonstrated in this paper. While GPR is a data-driven approach and does not need a strong assumption of its model shape, as mentioned earlier, the model shape can also be restricted by specifying a certain covariance function if prior knowledge on the shape is available.
Given the training data D = {X, y} of n data points, the joint distribution of the latent function and measurements, y , can be learned by tuning the hyper-parameters of the model. In the prediction step, the predictive distribution of the traversability at the point x * can be computed by , where the predictive mean m(x * ) and variance v(x * ) are given by the following equations, respectively: where K(·, ·) denotes the covariance matrix evaluated with the pairs of training input points or the prediction point x * based on Eq. (3) or Eq. (4), and I n denotes an n × n identity matrix.

Transfer Gaussian process regression
In TGPR [23], the regression for the target domain is modeled from the rich data in the source domain S i and the small data in the target domain T on the basis of the correlation between the two domain. This correlation is captured in addition to the correlation between data points by using the transfer covariance function k * (x, x ′ ) , as where i ∈ [−1, 1] denotes the similarity coefficient between the source domain S i and the target domain T, with a higher i value indicating a higher similarity.
On the basis of the transfer covariance function, the covariance matrix for a single TGPR is defined as where K S i and K T represent the covariance matrices of the individual domains S i and T, respectively, while K S i T = K T TS i is the covariance matrix between the two domains. These covariance matrices are computed from Eq. with σ 2 S i and σ 2 T representing the variances of the measurement noise in domains S i and T, respectively.

Multi-source transfer Gaussian process stacking
In the proposed terrain traversability prediction method, the weak TGPR models from multiple source domains are stacked to build the target prediction model in an ensemble learning fashion from Eq. (2). The predictive distribution of the traversability can be expressed as This can also be viewed as a mixture of Gaussian processes [24,25], where the weight w i corresponds to the gating function for the i-th transfer regression model. Figure 2 illustrates the general concept of the model learning.
To appropriately combine the multiple TGPR models incorporating domain similarities, the weight w i of the source domain S i is determined on the basis of the similarity coefficient i , as in [4]. Specifically, in this study, the following softmax function is adopted to assign a higher weight to a source domain with a higher, positive sourcetarget similarity.
In addition, a threshold th for the similarity coefficient is introduced in this study. If the domain similarity i is lower than the threshold, the weight w i of the corresponding source domain is set to zero by setting i = −∞ in Eq. (14). This threshold controls the amount of low-similarity domains included in the final model. The weight function and threshold can together avoid the negative transfer [26] from irrelevant source domains that would deteriorate the final learned model.
In summary, by learning the similarity coefficient i , as well as by tuning the hyper-parameters in the TGPR models, {p(f (S i ,T ) (x)) : 1 ≤ i ≤ N } , from the target training data and the source domain data, the proposed traversability prediction method develops the prediction model of the target domain p(f * (x)) based on Eq. (13).
Here, the set of model (hyper-)parameters, , to be tuned includes the domain similarity, i , for each source-target combination, the variance of the measurement noise in the target domain, σ 2 T , that of each source domain, σ 2 S i , and the hyper-parameters in the covariance function, a 1i and a 2i for slip or a 3i and a 4i for power consumption, for each source domain S i . There are thus (3 + l)N + 1 (hyper-)parameters to learn for a slip model, where l (14) Outline of traversability model learning based on multi-source transfer Gaussian process regression [6]. The domain similarity i between each source S i and target T domain, as well as the single TGPR model between them, are learned from the data present in S i and T. The prediction model for the target domain is then built as an ensemble of all TGPR models. The importance of each model is determined from i denotes the length of the input feature vector x , and 4N + 1 for a power consumption model. In this study, these (hyper-)parameters, , are learned such that the following log-likelihood of the joint distribution of the target training measurement y (T ) is maximized: As the derivative of Eq. (15) cannot be analytically solved in a closed form, random/grid search and k-fold cross validation are adopted in this study. Specifically, the i of each source domain is determined by random search and the σ T by grid search with 5-fold cross validation. Other hyper-parameters, σ S i , a 1i , a 2i , a 3i , and a 4i for the TGPR of the source S i , are adopted from those of a pre-tuned GPR model for the corresponding source domain to reduce the computation burden and also to increase the learning stability. Although this may limit the accuracy of the learned model, reasonable results can be obtained, as presented later in the Evaluation and Discussion section. Alternatively, the Expectation-Maximization (EM) algorithm [27] or gradient descent-based methods with a mean-squared error loss (as used in [4]) are other possible approaches for tuning all of the hyper-parameters.
In the prediction phase, the geometry of the terrain ahead of the vehicle, x * , is input to the learned model to obtain the predictive distribution of the traversability, p(f * (x * )) , on the target terrain surface.

Risk level classification
In addition to predicting the traversability as a continuous value, the learned regression model can also be used to predict a discrete risk level of the target terrain, e.g., "high risk" or "low risk". In some missions, it may be more useful to output such discrete labels rather than continuous traversability values.
One straightforward way to achieve this is to categorize the regression output from the proposed method into one of the risk levels based on pre-defined thresholds. For example, the risk of immobility on deformable terrains can be inferred by checking whether the predicted slip value y * is above a threshold or not, similarly to the approaches in [13,28,29]. Although some of the existing works classify the risk level in a deterministic manner by assessing mean predictive slip for this risk categorization, it would be more appropriate to do so in a stochastic fashion by taking into account uncertainties in the slip prediction for the risk leveling. As the proposed regression method can provide predictive distributions of a traversability index, the method can be easily applied to that direction. While a stochastic risk classification approach has also been proposed ([13]), where a risk level is (15) categorized by counting the number of data points above a threshold, this study adopts a predicted traversability distribution for the same assessment. Specifically, this study considers an estimation of the risk label l risk of the target terrain x * on the basis of the predicted traversability index, as where p risk (x * ) denotes the risk probability that is estimated as the probability of the traversability index y(x * ) being greater than or equal to a threshold y threshold , as The risk probability threshold p threshold in Eq. (16) controls the extent to which the amount of uncertainty in the prediction y(x * ) to be incorporated in the classification with a lower threshold makes the classification more conservative.

Data collection experiments
To evaluate the effectiveness of the proposed method, a set of experiments was conducted to collect traverse data on multiple terrain surfaces and geometries using a mobile robot. In this study, vehicle longitudinal slippage and electric power consumption of actuators are adopted as the terrain traversability to be predicted. The traverse data required for estimating these were collected in the experiments, along with some other data for future use. Figure 3 shows the test vehicle and field used in the experiments.

Experimental setting and procedures
In the experiments, a small mobile robot developed at JAXA/ISAS was used to collect traverse data. The dimensions of the robotic vehicle are 0.85 m × 0.80 m × 0.65 m, and its mass is approximately 50 kg. The vehicle has four wheels with independent driving and steering actuators, including a differential link for making all wheels come into contact with the surface on uneven terrains. The actuators are controlled by an on-board computer and motor drivers, and their rotation angles are measured by rotary encoders. The vehicle is also equipped with a Stereolabs ZED 2 stereo camera featuring a built-in inertial measurement unit (IMU). The 6-degrees-of-freedom vehicle motion (position and orientation) is estimated along with the terrain geometry using the stereo visualinertial simultaneous localization and mapping (VI-SLAM) software provided by Stereolabs.
An external PC generates the operation commands and sends them to the on-board computer of the vehicle via (16) Wi-Fi. The stereo camera is wire-connected to another PC and the VI-SLAM is run on the machine owing to the limited computing capability of the on-board computer and the compatibility of the software.
The size of the test field is 1.8 m × 1.8 m, as shown in Fig. 3a. The field can be manually tilted by hydraulic jacks and a lifting table, which enables the vehicle to be tested on a variety of terrain geometries.
Nine surface materials and/or conditions (shown in Fig. 4) were set in the test field: rough sand, loose sand, compacted sand, small rocks, gravel, gravel with sand, sand over bedrock, sand-covered bedrock, and (clean) bedrock. The details of each surface type are summarized in Table 1. In addition, several large rocks were positioned on the field to improve the accuracy of the VI-SLAM.
Before each experiment, the test surface was prepared and the vehicle was positioned in front of the field. The vehicle was then commanded to drive at a speed of approximately 5.5 cm/s in the longitudinal slope direction without any steering maneuver.
During each experiment, the encoder count and the applied current and voltage of the right front and rear wheel motors were obtained from the motor drivers at 10 Hz. The VI-SLAM computed and recorded the 6-degrees-of-freedom vehicle pose and registered point clouds at 30 Hz and 1 Hz, respectively. In addition, the ground truth of the vehicle pose was recorded at 100 Hz using a set of motion capture cameras placed around the test field. The motion capture data was only used to evaluate the accuracy of the VI-SLAM. The position errors of the VI-SLAM were at most a couple of centimeters in the 1-meter linear traverse without steering. Raw stereo camera images and IMU data were also recorded for future use; however, they were not utilized in this study.

Data processing
From each recorded traverse data, terrain inclination, vehicle slippage, and electric power consumption were extracted as data points in the following way. The recorded data were divided into non-overlapping time windows of a duration t . Then the terrain inclination, vehicle slippage, and power consumption corresponding to each time window were estimated. In this study, t was set to 1.0 s to capture local vehicle responses. Figure 3b shows the flow of the data processing. The longitudinal vehicle slippage is measured by the slip ratio [1] as follows: where d w represents the longitudinal travel distance in t , which is estimated from the wheel odometry, and d vi denotes the estimated distance from the VI-SLAM. A positive slip ratio indicates that the vehicle moves a shorter distance than commanded, and s x = 1.0 indicates that the vehicle does not make any forward movement.
A negative slip indicates that the vehicle travels further than commanded. The total electric power consumption P of the motors was estimated from the recorded voltage V i and current Only the voltage and current of the right wheels were recorded due to device limitations. Therefore, the total power was roughly estimated by using the same motor current and voltage values for the left wheels as for the right, assuming they are not significantly different when the vehicle travels on longitudinal slope.  The terrain inclination was estimated using the point clouds obtained from the VI-SLAM. For each time window, the positions of the vehicle wheels were estimated from the vehicle's position and geometry. The point clouds included in the volume swept by the wheels were then extracted, and a best-fit plane to the point clouds was computed using linear regression. The terrain pitch and roll angles were determined from the slope of the plane. The terrain roughness was also measured by the residual of the fitted plane from the points. The accuracy of the slope estimation was approximately 1 • .

Slip-slope and power-slope dataset
The collected slip vs. slope data and power consumption vs. slope data are plotted in Fig. 5, with different colors representing the surface domains. Individual data in each surface domain are shown in Figs. 6 and 7 for slip and power consumption, respectively.
The slip on the rough sand, loose sand, and compacted sand surfaces used in this study exhibited a relatively similar trend, as the slip rapidly increased along with the increase in the terrain slope angle. The slip on the rough sand had more variability and became higher than that on the other sand surfaces.
The largest slip variability was observed on the small rocks, where the average slip gradually increased with the terrain inclination. On the slopes of approximately 20 • , the vehicle repeatedly made slight uphill progress and then slid downhill, which resulted in no successful slope ascent. The slip ratio values higher than 1.0 on small rocks represent the downhill skid motion.
On the gravel and gravel with sand, a moderate increase of the slip was observed compared to the sand types. The slip on the gravel was relatively higher than that on the gravel with sand.
On the bedrock, the average slip was around zero, even on the slope of approximately 25 • , owing to the high frictional texture of the surfaces. On the 30 • slope, the slip slightly increased, but the vehicle could make steady uphill progress. Although a similar vehicle behavior was observed on shallow slopes of the sand-covered bedrock, the slip suddenly increased on more than 15 • , and the vehicle could not make continuous forward progress on the 25 • slope. The slip on the sand over bedrock type differed from that on the loose and compacted sand despite the similar visual outlooks. The slip behavior of the sand over bedrock was relatively similar to that on the sand-covered bedrock.
Note that the negative slip values of some data points were triggered by errors in the wheel and VI odometries, and also owing to the imperfect time synchronizations of the signals of these odometries.
The power consumption showed a linear relationship to the terrain slope (as seen in Fig. 7) in every surface domain. Other than that, the trends of the power consumption between different surface domains were similar to those of the slippage, as surface types that induced higher slippage tended to require higher power for the vehicle to drive.

Results and discussion
This section evaluates the effectiveness of the proposed method for predicting terrain traversability using the dataset described in the previous section.
The terrain slope pitch angle was used as the input feature x to model the separate latent functions f s x (x) and f P (x) for the longitudinal slip and power consumption, respectively.

Reference GPR models
First, GP regression models for the slip-slope characteristics and for the power-slope in all surface domains were learned for reference. The hyper-parameters of the reference models were also adopted for the TGPR models of the corresponding source domains in the later evaluations. In the learning of each model, 75% of the data were adopted for training and the remaining 25% for testing. Additional artificial slip data of s x = 0.0 was inserted at approximately zero degrees, and those of s x = 1.0 were added over the slope angles, where no uphill progress was expected, to stabilize the slip model curves around the two edges, where no test data was obtained. The learned models are presented in Fig. 6 for slip and in Fig. 7 for power consumption. As shown, given sufficient training data, GPR could capture the trends of slip and power consumption in all surface types to a certain extent. Better models for slip might be obtained by imposing a  constraint of monotonic increase of the slip against the increase in the slope, as in [12]. The root-mean-square errors (RMSEs) and log-likelihoods (LOGLIKs) of the reference GPR models were evaluated on the test data, and are presented in Table 2. RMSEs indicate the accuracy of the predictive mean (lower is better), while the LOGLIKs relatively indicate the feasibility of the predictive distribution (higher is better). The surface type with a larger slip variability resulted in a larger RMSE and a lower LOGLIK.

Evaluation 1: Comparison of MS-TGPR and GPR
In the first evaluation, the proposed MS-TGPR-based traversability prediction method was compared with GPRbased methods in a slip prediction scenario. Four surface types, namely, compacted sand, small rocks, gravel, and bedrock, were adopted as source domains, and four surface types, namely, rough sand, gravel with sand, sand over bedrock, and sand-covered bedrock, were set as target domains. The data presented in each target domain with slopes shallower than 5 • , as well as the entire data in the source domains, were used for training each MS-TGPR. Each model was then tested with the corresponding target domain data of the slopes steeper than 5 • . This simulates the prediction of traversability on slopes in a new environment from the data on relatively benign terrains. In this evaluation, the threshold of the domain similarity, th , was set to 0.8.
In addition to the proposed method, two GPR-based approaches were tested for comparison. The first approach, GPR-naive, learns a GPR model solely from the limited training data of the target domain. The second approach, GPR-conventional, utilizes the slip data in source domains with similar visual outlooks. In this latter approach, the training data comprises the target training data and the data in the source domains. Similar data augmentation methods have been conventionally adopted in practice when the available training data is limited [12]. Alternatively, this approach is also considered a type of online model updating, in which the pre-trained GPR model for a source domain is updated using the newly obtained in-situ traverse data. In this study, the data from the compacted sand were introduced to train the models for the rough sand and sand over bedrock, whereas the data on gravel were added to train the model for gravel with sand, and the data obtained from the bedrock were added to train the model for the sand-covered bedrock. No pre-processing was performed on the source data for the augmentation. The augmented source data was restricted to the range of the terrain inclination, which is not included in the target domain, i.e., slopes steeper than 5 • .
The learned slip-slope models for the rough sand and sand-covered bedrock domains are plotted in Fig. 8 as examples. The RMSEs and LOGLIKs of the models learned for all four of the target domains are presented in Table 3. In addition, this evaluation assesses how the learned predictive distribution p(f (x)) can represent the true distribution q(f (x)) based on the Kullback-Leibler divergence (KLD) [27]. The KLD measures the difference between two probability distributions, with a lower value indicating a higher similarity and KLD = 0 indicating that the two distributions are identical. Since the true distributions from which the traverse data were generated are unknown, the distributions predicted by the reference GPR, presented in Fig. 6, were adopted instead to compute the KLDs. Figure 8a, c, e presents the prediction models for the rough sand learned using GPR-naive, GPR-conventional, and MS-TGPR, respectively. Owing to the very limited data, the model learned by GPR-naive predicted just zero slips on every slope, as shown in (a). This result indicates that the prediction is a challenging extrapolation problem. In contrast, by adopting the data augmented from the compacted sand domain, GPR-conventional could predict the slip on the test slopes better than GPR-naive, as shown in (c). GPR-conventional worked well owing to the relatively similar trends in the slipslope curves for the rough sand and compacted sand. However, some test data were outside of the confidence interval, indicating that the model sometimes underestimated the possible vehicle slippage. The proposed MS-TGPR could also capture the slip trends of the rough sand on the test slopes, as shown in (e), with more test data included in the confidence interval. The source-target similarity i and the weight w i for each source domain are presented in Table 4. High similarity values were learned for the compacted sand, small rocks, and gravel, while the similarity to the bedrock was relatively less significant. As the threshold of th = 0.8 was set, the TGPR models from the former three domains were selected to develop the prediction model. As presented in Table 3, although GPRconventional exhibited the lowest RMSE, the proposed method achieved the highest LOGLIK and lowest KLD, indicating the best modeling of the predictive distribution and the prediction uncertainty among the three methods.
The slip prediction models for the sand-covered bedrock are presented in Fig. 8b, d, f. Again, although GPR-naive in (b) could capture the trends of the training data, the model predicted an almost constant slip on steeper slopes without available training data. At this point, GPR-conventional could not predict the abrupt increase in the slip over approximately 20 • , as it is biased by the augmented data from the bedrock domain, as illustrated in (d). In contrast, MS-TGPR could improve the slip prediction more than the other two approaches, as demonstrated in (f ) and presented in Table 3. Although the predictive mean was slightly lower than the actual slip at slopes higher than 20 • , the test data was covered by the confidence interval with low KLD. For this target domain, high similarity values were learned for the small rocks and bedrock domains, as presented in Table 4, so the TGPR models of these domains were selected to develop the prediction model. Table 3, similar results were obtained for the other two target domains.

Evaluation 2: Influence of source domains
In the second evaluation, the effect of the number of source domains on the performance of the learned model was analyzed. The same four surface domains as Evaluation 1 were set as the target domains, with data shallower than 5 • adopted for training and that steeper than 5 • for testing. The number of source domains N was set to 1, 2, 3, 5, or 8 (all source domains). For each case except  Figure 9 shows the statistics of the RMSE and KLD values of the MS-TGPR models learned with the varied number of source domains for each target domain. Note that, with N = 8 , only one combination of the source domains is possible. Therefore, the RMSE and KLD are plotted with a cross that represents the single result for each target domain.
When only one or two source domains were used, both the RMSE and KLD showed large variability due to the difference in the available source domains to transfer. Basically, lower RMSE and KLD were achieved when domains more similar to the target domain were included in the source domain sets, e.g., loose sand and compacted sand for the rough sand domain. On the other hand, higher RMSE and KLD resulted when the source domain sets only included non-similar surface types, e.g., sand over bedrock, sand-covered bedrock, and/or bedrock for the rough sand.
When the number of source domains increased, the variability in the RMSE and KLD decreased, whereas the median value did not significantly differ depending on the number of source domains. This is because when the number of source domains increases, the possibility of similar domains being included in the source sets increases. The minimum RMSE and KLD rose when the number of source domains increased.
The above results indicate that the improvement of the predictive performance does not depend on the available number of source domains; rather, it depends on whether similar domains are included in the sources. The model performance does not significantly improve even if the available source domains increase, but they are limited to irrelevant ones. For most of the target domains, the best model with the minimum RMSE and KLD could be attained when the model was learned with the single most similar source domain. An exception was the large KLD of the sand over bedrock domain, for which no single source domain could achieve an accurate predictive distribution when transferred. However, in actual usage, the true most similar domain cannot be known until the traverse data on slopes are obtained. Therefore, it may Table 2 RMSEs and LOGLIKs of reference GPR models for slip (Fig. 6) and power consumption (Fig. 7

Evaluation 3: Influence of target training data
This evaluation assessed the influence of the terrain inclinations included in the target training data on the predictive accuracy of the slippage. The rough sand, gravel with sand, sand over bedrock, and sand-covered bedrock were again set as the target domains. When training the model for each target, all of the eight surface domains except for the target domain itself were adopted as source domains for the transfer learning. All data in the target domain were first divided into training and test data at a ratio of 3:1. Multiple training datasets were then created from the training data with each dataset containing the data with upper boundaries of the slope angle varied from 2.5 • to 30 • . Prediction models were learned from each dataset and then evaluated with the test data. For MS-TGPR, five different domain similarity threshold th values ( −∞ , 0.0, 0.5, 0.8, 0.9) were tested. th = −∞ represents a model learned without the threshold, meaning that all TGPRs were adopted regardless of similarities. In contrast, th = 0 excludes domains with negative similarities, and th = 0.9 only accepts domains with very high similarities. Example MS-TPGR models trained from different target datasets for the rough sand domain are presented in Fig. 10. As shown, the model trained on the target data up to 2.5 • could capture the test slip data on steeper slopes in the confidence interval. While the confidence interval, or prediction uncertainty, was large in the model, the proposed method could rapidly improve the predictive distribution, which made the confidence interval tighter when additional target training data on slightly steeper terrains were available. Figure 11 shows the RMSEs and KLDs of the models learned based on GPR-naive, GPR-augmented, and MS-TGPR for the varied maximum slope angles included in the training dataset. Those of the reference GPR are also plotted. As shown, the proposed MS-TGPR-based  method could learn the prediction model with relatively smaller RMSE and KLD even when the target terrain data was limited to up to 2.5 • . The model was improved with the additional traverse data on steeper terrains, and depending on the target domain, its performance reached almost the same level as the reference GPR model from the data up to 7.5-15 • , which indicates that a significantly shallower terrain is required to train MS-TGPR models than the reference GPR models. Figure 12 shows the RMSEs and KLDs of the models learned based on MS-TGPR with various domain similarity thresholds th . The model learned with th = 0.9 achieved a lower RMSE or KLD than the other settings in the rough sand and sand-covered bedrock domains, especially when the training data were very limited on shallow slopes. This can be attributed to the fact that the high threshold in the domain similarity excluded the irrelevant source domains to be transferred, and thus avoided negative transfer. However, when more data were available on steeper terrains, the characteristics of the target domains started becoming distinct from those of the source domains. Therefore, domain similarities between the target and sources decreased. In this regard, the threshold th = 0.9 might have been too strict to transfer source models of moderate similarities to construct a good model. This resulted in the large fluctuation in the performances of the model with the high threshold value. Overall, th = 0.5 showed a relatively better stability in the performance over varied training datasets and various target domains.

Evaluation 4: Slip risk classification
In this evaluation, the proposed risk classification method was assessed in a scenario where the immobility training data test data predictive mean Fig. 10 Models trained on different target training data in Evaluation 3. Slip-slope models learned based on the proposed method are shown for a series of increased target training datasets with increased slope steepness. By using additional target training data, the proposed method could rapidly improve the model distribution with a tighter confidence interval. Results for the rough sand surface class are presented. Note that the shape of the model trained on the data up to 5 • (upper center) is slightly different from that shown in Fig. 8e because the source domains used for training are not the same risk of a vehicle was predicted. The slip prediction models trained for the target domains in Evaluation 1 were adopted to classify the risk. Here, as the slip threshold in Eq. (16), y threshold = 0.6 was adopted for classifying the risk level l risk into high or low. This threshold value is also utilized in NASA's Mars exploration rover missions as one of the slip threshold values [30]; detecting a single slip event over this value will immediately stop the vehicle from driving. Another possible slip threshold may include 0.2 or 0.8, where the former provides rigorous risk assessment and the latter may be used when the vehicle is required to travel over high-slip terrains.
In this evaluation, the risk probability threshold p threshold in Eq. (16) was varied from 0.01 to 0.50, and the classification performance of the different p threshold variations was compared. The lower the threshold, the more predictive uncertainty is taken into consideration, with p threshold = 0.50 corresponding to classifying the risk only from the mean slip prediction.
Example classification results for the rough sand domain are presented in Fig. 13 for p threshold = 0.01, 0.05, 0.10, and 0.50. The slopes classified as high and low risk are colored red and green, respectively. As shown in the figure, the lower p threshold moved the boundary of the two risk levels toward the shallower slope, indicating a more conservative risk assessment. Figure 14 presents the confusion matrices corresponding to the classification results shown in Fig. 13. With the lower p threshold , more true high-risk data were correctly classified as high risk, while miss-classification of true low-risk samples to high risk slightly increased.
The classification performance was quantitatively evaluated based on the following recall and precision scores for the high-risk class.
Higher recall indicates that a smaller number of highrisk samples are erroneously underestimated as low risk. On the other hand, higher precision indicates a smaller number of low-risk samples are incorrectly classified as high risk. Generally, recall and precision are in a trade-off (19)  Classification results of risk levels in Evaluation 4. The red and green areas represent the classified " high-risk" and " low-risk" slopes, respectively. The blue curve represents the predictive slip mean, while the blue-shaded area represents the confidence interval corresponding to the risk probability threshold p threshold . The dashed line represents the slip threshold value y threshold = 0.6 with which the risk level was classified. The red and green points represent the data of high and low risk, respectively. The slip regression model was trained from the target data on slopes shallower than 5 • . Lower p threshold makes the risk level boundary shift to a shallower slope and provides more conservative risk prediction. Results for the rough sand surface class are presented relationship, as trying to increase recall can result in a conservative classifier in which most of the samples are classified as high risk, resulting in low precision. In traversal risk assessment, miss-classifications of high-risk terrains as low risk and that of low-risk terrains as high risk cannot be treated equally. While the latter may result in an inefficient operation (e.g., taking a longer route to avoid terrains that are actually not hazardous), the former can lead to a catastrophic event, such as wheels embedded in the sand or vehicle turnover. Therefore, it is preferable for a risk level classifier to obtain high recall and moderate-to-high precision scores so that it can probably detect true risky terrains while not over-conservatively assessing the risk. Note that F-1 score, a harmonic average of the recall and precision, is often used for evaluating the classification performance. However, for the above reason, this study adopts recall and precision to assess both scores individually rather than using the integrated F1-score. Figure 15 presents the recall and precision scores resulting from the varied risk probability threshold p threshold in the four tested target domains. As qualitatively observed in Fig. 13, increasing the risk probability threshold p threshold results in a lower recall and increased precision, meaning a higher possibility of miss-classifying high-risk terrain as low risk, which is undesirable.
This result implies that it may be better to select p threshold between 0.05 and 0.3 (depending on the mission) to keep the recall score high while not sacrificing the precision, and to correctly detect high-risk terrains with a moderate possibility of incorrectly predicting true low-risk terrain as high risk. How much The risk level classifiers were tested on the target data of slopes steeper than 5 • , which were not used for training the model. With lower p threshold , more true high-risk data were correctly classified as high risk while miss-classification of true low-risk samples to high risk increased slightly. Results for the rough sand surface class are presented uncertainty should be incorporated also depends on the accuracy of the regression model, and on the variability of the traversability. The relatively low precision in the risk classification for the sand-covered bedrock domain is attributed to the relatively wide confidence interval for the domain. Note that the slip prediction model was learned from the target training data on slopes shallower than 5 • and tested on the data of slopes steeper than 5 • . As the proposed MS-TGPR-based traversability learning method can improve the prediction model with increased data from slightly steeper slopes, a higher risk classification score should be possible with the improved model.
The above results demonstrate the effectiveness of the proposed traversability prediction method for classifying the risk level from limited target data with a simple risk probability assessment. However, classifying the risk level on the basis of a more sophisticated measure, such as conditional value at risk (CVaR), would provide a better risk assessment [31].

Evaluation 5: Learning and prediction of power consumption
The final evaluation assesses how well the proposed method can learn and predict a traversability metric other than vehicle slippage, specifically, power consumption. The conditions of the source domains and target domains adopted are the same as those used in Evaluation 1.
The resulting models for power consumption vs. slope are presented in Fig. 16. The RMSE and KLD of the learned models are listed in Table 5. Overall, the models learned on the basis of the proposed method could addequately predict the power consumption on slopes that were not used for training, as the confidence intervals covered most of the test data.
The reason for the relatively large error on the sand over bedrock domain was the miss-learning of the domain similarities and the assignation of a high similarity only to the small rocks domain, resulting in a model learned only from the TGPR of that domain. The learned domain similarity and weight w of each source domain are presented in Table 6.
When comparing the learned domain similarities for the power consumption model (Table 6) and for the slip model (Table 4), it is clear that different similarities are learned for these two models. These differences occurred because the power and slip models were independently learned. It might be possible to improve the predictive accuracy of both models by learning them in a coupled fashion, where the same similarity coefficient set is assigned to models of different traversability metrics. Adopting a multi-output GP method [32] could be one way of accomplishing that.
Overall, the results indicate the possible capability of the proposed method to predict traversability indices other than vehicle slippage.

Discussion
The above evaluations demonstrate that the proposed method can improve traversability prediction accuracy and its predictive distribution when available training data are limited to only those obtained on benign terrains in the target environments. The results also indicate that the required terrains to traverse for obtaining a mature learned model are significantly shallower than those for the models trained without transfer learning.
If sufficient target domain data for learning the prediction model are available, the performance improvement by the proposed learning will be very limited. In such cases, it is preferable to adopt the basic GPR alone owing to its lower computational cost and better learning stability compared to MS-TGPR. One of the limitations of the current approach is that the method determines the source-target similarity i from limited target data and assumes that the similarity does not significantly differ over different terrain geometries. However, this is not always the case. Sometimes high similarity can be assigned to irrelevant domains, resulting in a low prediction accuracy, and also in a wide predictive distribution, or uncertainty. The large uncertainties can be triggered because the predictive distribution is estimated as a mixture of the distributions of the multiple TGPRs included in the model. To improve the traversability prediction, it is important to better estimate the similarity coefficient from the limited data. One possible approach to achieving this would be to use additional feature inputs, such   Table 6 Domain similarity i and weight w i learned for each source domain for the power consumption modeling in Evaluation 5

Target domain Source domain
as exteroceptive sensor information (e.g., visual features) or other proprioceptive sensor information (e.g., vehicle vibration). Adopting a multi-output regression approach might also improve the predictive performance by learning models for multiple traversability indices in a coupled manner, thus making the best use of the underlying relationships between these indices. As clarified in the evaluations, the predictive capability of the proposed method differs depending on the target domain, as well as on the available source domains to be transferred. The key concept of the proposed method is the extrapolation of limited traverse data on the target terrain by interpolating the data from source domains. Accordingly, if the target terrain geometry is located outside the data range in the source domains, the proposed method may fail in its prediction. Similarly, if the terrain traversability of the target domain is significantly outside the traversability characteristic envelope covered by the source domain data, the improvement of the prediction accuracy by the proposed method would also be limited, as interpolating the data from the source domain data is no longer possible in such situations. For example, because the bedrock class is located at the edges of all domains, the proposed method cannot perfectly predict the traversability for this class. Developing a method that can utilize available source domain data when the traversability characteristic of the target domain significantly differs from that of the source domains is an important future work.
As discussed earlier, the proposed method learns the traversability model in the target domain by learning the similarities between the small amount of target domain data and the data in source domains. Therefore, it is fundamentally not applicable if the vehicle has not been driven on the target environment yet, or when in-situ data are completely unavailable even on safe terrains. In such cases, the operators are required to implement some existing terrain classification and/or traversability prediction methods using information from on-board exteroceptive sensors or orbital/aerial observations. However, due to the lack of in-situ traverse data, the prediction result may not always be reliable enough. Once the vehicle obtains even a few traverse data on the target environment, the proposed method can contribute to better predicting the traversability on the new environment.
There are several directions in which the proposed method can be extended. For example, it could be implemented as part of an end-to-end learning framework by combining it with a self-supervised terrain classification method (e.g., [9,10]), in the following fashion. First, traverse data newly obtained in a target environment are classified using a pre-trained proprioceptive terrain classifier. If the data are classified to one of the pre-defined terrain classes with a high confidence, a traversability regression model already learned for that class is adopted to predict the traversability of the same terrain in a distant field. If not, the data is assumed to belong to a novel terrain class. In this case, a new traversability prediction model is learned for the novel terrain class by using the method proposed in this study. A vision-based classifier, if available, can be also updated with the newly defined class at this stage in a self-supervised manner for detecting the same surface class in distant fields. Once a sufficient amount of data samples are collected, the new terrain class is added as a new source domain for future use. The evaluation results presented in the previous section indicate that the model performance and robustness can be improved when more source domains are available for learning. This suggests that the proposed method, combined with self-supervised terrain classification, can lead to a lifelong learning paradigm [33], as the vehicle will become more adaptive to newly encountered terrains by utilizing more traverse experience on multiple types of terrain in its lifetime.
Furthermore, the proposed prediction method can be combined with path planning. In addition to the predictive mean accuracy, the proposed method showed a capability in improving the predictive distribution, which is crucial for generating a safe path if a significant variability and/or uncertainties in the vehicle motion exist. When combined with a stochastic planning approach (e.g., [34]), the proposed method can enable vehicles to be more safely and efficiently operated in novel and/or difficult off-road environments.

Conclusion
This study proposed a transfer regression method to improve the traversability prediction accuracy when insitu traverse data are only available on gentle terrains. Combined with the limited in-situ data, the proposed method leverages past traverse experiences on multiple types of surface to improve the prediction accuracy of the traversability on untraversed slopes. The effectiveness of the proposed method was demonstrated in evaluations using a traverse dataset collected from experiments. The results showed that the proposed method can learn a more accurate prediction model than conventional methods, both in terms of the predictive mean and distribution, from just data on almost flat surfaces of the target domain. In addition, the proposed method can rapidly improve the model performance with additional data obtained from significantly less steep terrains compared to methods without transfer learning. These findings demonstrate the high adaptability of the proposed method to new environments.