In this section, the relative poses of the 3D models of the robot, the floor, the walls, and the multiple fisheye cameras on the robot are estimated.

First of all, we define the world coordinate system so that the floor lies in the *xy*-plane and the origin \(\mathbf {O} = (0,0,0)^{{\mathsf {T}}}\) is a point on the floor. The *z*-axis points upward. In other words, \((p_x,p_y,0)^{{\mathsf {T}}}\) is a point on the floor and \((p_x,p_y,h)^{{\mathsf {T}}}\) is a point at height *h* above the floor.

As the 3D model of the walls is defined by the LRF in Eq. (1), the relative pose of the LRF should be estimated instead of that of the walls.

Therefore, the parameters that need to be estimated are as follows:

### Relative poses of the floor and fisheye cameras

The intrinsic parameters of the fisheye cameras need to be estimated in advance [14, 15]. We use OCamCalib because it supports fisheye lenses with a wide field of view (up to 195 degrees) and provides closed-form expressions for both distortion and undistortion.

The relative poses of the floor and the fisheye cameras are estimated in a similar manner to [5], except that instead of estimating the homography matrix of each image, we estimate the actual camera poses. In the previous method, square boards are placed on the floor as calibration targets so that two adjacent cameras can capture the same target, as shown in Fig. 5. For example, if the poses of four cameras are to be estimated, four targets should be placed on the floor. The four corners of each target are then selected manually in each image, and the corresponding points between the two images are used to estimate the parameters.

In this study, instead of selecting the corners of the targets manually, AprilTags [16, 17] are employed as the calibration targets so that the corners are detected automatically. Each AprilTag carries a 2D barcode that distinguishes it from the other targets, which makes it possible to find the corresponding points between two images automatically.

The relative poses are calculated by minimizing the reprojection error between the corresponding points on the images, which is formulated as follows:

$$\begin{aligned} E_{\text{reproj}} = \sum _{c_k \in \mathcal {C}} \sum _{i \in \mathcal {V}_{c_k}} \sum _{j=1}^{4} || \omega _{c_k} \left( \mathbf {q}^w_{ij}\right) - \mathbf {u}^{c_k}_{ij} ||^2, \end{aligned}$$

(2)

where \(\mathcal {C}\) is the set of all cameras, \(\mathcal {V}_{c_k}\) is the set of indices of the targets visible from camera \(c_k\), and \(\mathbf {u}^{c_k}_{ij} \in \Omega\) is the point corresponding to \(\mathbf {q}^w_{ij}\) captured by camera \(c_k\) in the image domain \(\Omega\). \(\mathbf {q}^w_{ij} \in \mathbb {R}^3\) is the *j*-th corner location of the *i*-th target in the world coordinate system, which is formulated as follows:

$$\begin{aligned} \left( \mathbf {q}^p_1 \quad \mathbf {q}^p_2 \quad \mathbf {q}^p_3 \quad \mathbf {q}^p_4 \right) = \begin{pmatrix} 0 & L & L & 0 \\ 0 & 0 & L & L \\ 0 & 0 & 0 & 0 \end{pmatrix}, \end{aligned}$$

(3)

$$\begin{aligned} \mathbf {q}^w_{ij} = \begin{pmatrix} \cos \alpha _i & -\sin \alpha _i & 0 \\ \sin \alpha _i & \cos \alpha _i & 0 \\ 0 & 0 & 1 \end{pmatrix} \mathbf {q}^p_j + \begin{pmatrix} \beta _i \\ \gamma _i \\ 0 \end{pmatrix}, \end{aligned}$$

(4)

where *L* is the side length of the targets. \(\alpha _i\), \(\beta _i\), and \(\gamma _i\) are the parameters to be optimized, which determine the pose of the *i*-th target in the *xy*-plane, as illustrated in Fig. 6.
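As a concrete illustration, Eqs. (3) and (4) can be evaluated in a few lines of NumPy. This is a minimal sketch; the function name and interface are our own, not part of the described system:

```python
import numpy as np

def target_corners(alpha, beta, gamma, L):
    """Corner locations q^w_ij of one target, following Eqs. (3)-(4).

    alpha: in-plane rotation of the target, beta/gamma: its translation
    in the xy-plane, L: side length of the square target.
    Returns a 3x4 array whose columns are the four corners in the world
    coordinate system (all corners lie on the floor, so z = 0).
    """
    # Eq. (3): corners in the target's own coordinate system
    q_p = np.array([[0.0, L,   L,   0.0],
                    [0.0, 0.0, L,   L],
                    [0.0, 0.0, 0.0, 0.0]])
    # Eq. (4): rotation about the z-axis plus translation on the floor
    c, s = np.cos(alpha), np.sin(alpha)
    R_z = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    t = np.array([[beta], [gamma], [0.0]])
    return R_z @ q_p + t
```

During optimization, the residuals of Eq. (2) would be evaluated against these corners for the current estimate of \((\alpha _i, \beta _i, \gamma _i)\).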

\(\omega _{c_k} :\mathbb {R}^3 \rightarrow \Omega\) is the function that projects a point \(\mathbf {q}^w_{ij}\) in the world coordinate system onto the image plane of camera \(c_k\), which is formulated as follows:

$$\begin{aligned} \omega _{c_k} \left( \mathbf {q}^w_{ij}\right) =\pi _{c_k} \left( \mathbf {R}_{c_kw} \mathbf {q}^w_{ij} + \mathbf {t}_{c_kw}\right) , \end{aligned}$$

(5)

where \(\mathbf {R}_{c_kw} \in \text {SO}(3)\) and \(\mathbf {t}_{c_kw} \in \mathbb {R}^3\) are the rotation matrix and the translation vector, respectively, to determine the pose of the camera \(c_k\). \(\pi _{c_k} :\mathbb {R}^3 \rightarrow \Omega\) is a fisheye projection function [14, 15].
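The projection of Eq. (5) can be sketched as follows. Since the calibrated OCamCalib polynomial for \(\pi _{c_k}\) is not reproduced here, an ideal equidistant fisheye model (\(r = f\theta\)) is used as a hypothetical stand-in, and `f`, `cx`, `cy` are assumed intrinsic parameters:

```python
import numpy as np

def project(q_w, R_cw, t_cw, f, cx, cy):
    """Project a world point onto a fisheye image, as in Eq. (5).

    The real pi_{c_k} is the calibrated OCamCalib polynomial model;
    here an ideal equidistant fisheye (r = f * theta) stands in, with
    focal length f and principal point (cx, cy).
    """
    # Rigid transform into the camera coordinate system
    q_c = R_cw @ q_w + t_cw
    # Angle between the viewing ray and the optical axis (z-axis)
    theta = np.arctan2(np.hypot(q_c[0], q_c[1]), q_c[2])
    r = f * theta                        # equidistant mapping
    phi = np.arctan2(q_c[1], q_c[0])     # azimuth in the image plane
    return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])
```

A point on the optical axis maps to the principal point, and points 90° off-axis remain finite, which is why such models suit fields of view beyond 180°.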

The cost function in Eq. (2) is minimized with the Levenberg-Marquardt (LM) method provided by [18] to estimate the camera poses \(\mathbf {R}_{c_kw}\) and \(\mathbf {t}_{c_kw}\). The rotation \(\mathbf {R}_{c_kw}\) is represented in axis-angle form via Rodrigues' rotation formula to avoid a redundant parameterization. The initial values for the LM method are calculated by IPPE [19], which is well suited to estimating a camera pose from a planar calibration target.
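The LM iteration can be illustrated with a minimal, self-contained loop. This is a toy sketch of the damped Gauss-Newton update, not the solver of [18]: in the actual system the parameter vector stacks the axis-angle rotation and translation of every camera together with \((\alpha _i, \beta _i, \gamma _i)\) of every target, and the residual stacks all reprojection errors of Eq. (2).

```python
import numpy as np

def levenberg_marquardt(residual, x0, iters=50, lam=1e-3):
    """Minimal Levenberg-Marquardt loop with a numerical Jacobian.

    residual: function mapping a parameter vector to a residual vector.
    Returns the parameter vector after a fixed number of iterations.
    """
    x = np.asarray(x0, dtype=float)
    r = residual(x)
    eps = 1e-6
    for _ in range(iters):
        # Forward-difference Jacobian of the residual vector
        J = np.column_stack([
            (residual(x + eps * np.eye(len(x))[i]) - r) / eps
            for i in range(len(x))
        ])
        # Damped normal equations: (J^T J + lam I) dx = -J^T r
        A = J.T @ J + lam * np.eye(len(x))
        dx = np.linalg.solve(A, -J.T @ r)
        r_new = residual(x + dx)
        if r_new @ r_new < r @ r:        # accept step, relax damping
            x, r, lam = x + dx, r_new, lam * 0.5
        else:                            # reject step, increase damping
            lam *= 10.0
    return x
```

The damping factor interpolates between gradient descent (large `lam`) and Gauss-Newton (small `lam`), which is what makes LM robust to poor initialization; IPPE supplies a good starting point so few iterations are needed.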

### Relative poses of the robot model

The 3D model of the robot is created using open-source structure-from-motion software [8,9,10,11], from which the scale of the 3D model cannot be recovered. Therefore, in addition to the relative pose between the robot model and the world coordinate system, the relative scale also needs to be estimated.

The relative scale and pose of the robot model are estimated by selecting reference points \(\mathbf {q}^m_i\) in the robot model coordinate system and their corresponding points \(\mathbf {q}^w_i\) in the world coordinate system. As the camera poses have already been estimated, the camera locations are used as the reference points. After selecting the points corresponding to the locations of the cameras in the model coordinate system, the relative scale and pose, *s*, \(\mathbf {R}_{wm}\), and \(\mathbf {t}_{wm}\), are estimated by minimizing the following cost function:

$$\begin{aligned} E({s}, \mathbf {R}_{wm}, \mathbf {t}_{wm}) = \sum _{c_k \in \mathcal {C}} || \left( {s}\mathbf {R}_{wm} \mathbf {q}^m_{c_{k}} + \mathbf {t}_{wm}\right) - \mathbf {q}^w_{c_{k}}||^2, \end{aligned}$$

(6)

where \(\mathbf {q}^w_{c_{k}}\) is the location of the camera \(c_k\) in the world coordinate, which is calculated as follows:

$$\begin{aligned} \mathbf {q}^w_{c_{k}} = -\mathbf {R}_{c_kw}^{{\mathsf {T}}} \mathbf {t}_{c_kw}, \end{aligned}$$

(7)

and \(\mathbf {q}^m_{c_{k}}\) is the location of the camera \(c_k\) in the robot model coordinate system. The cost function in Eq. (6) has a closed-form solution [20].
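The closed-form solution of Eq. (6) can be sketched with an SVD-based similarity estimator in the style of Umeyama's method, offered here as a stand-in for [20] (the function below is our own illustration). The columns of `q_w` would be the camera locations computed from Eq. (7):

```python
import numpy as np

def similarity_transform(q_m, q_w):
    """Closed-form estimate of (s, R, t) minimizing Eq. (6).

    q_m, q_w: 3xN arrays of corresponding points, e.g. the camera
    locations in the model and world coordinate systems.
    """
    mu_m = q_m.mean(axis=1, keepdims=True)
    mu_w = q_w.mean(axis=1, keepdims=True)
    A = (q_w - mu_w) @ (q_m - mu_m).T           # cross-covariance
    U, S, Vt = np.linalg.svd(A)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))    # guard against reflection
    R = U @ D @ Vt
    # Optimal scale: trace(D S) over the total variance of the model points
    s = np.sum(S * np.diag(D)) / np.sum((q_m - mu_m) ** 2)
    t = mu_w - s * R @ mu_m
    return s, R, t.ravel()
```

At least three non-collinear correspondences are required, which the four camera locations provide.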

### Relative pose of the LRF

The LRF is installed on top of the robot so that its scan plane is parallel to the floor; hence, the height of the LRF does not affect the 3D model of the walls. Therefore, only the translation \(t^{xy}_{wl} \in \mathbb {R}^2\) in the *xy*-plane and the rotation \(\alpha _{wl} \in \mathbb {R}\) around the *z*-axis need to be estimated.

The procedure for estimating the relative pose of the LRF is almost the same as that for the robot model, except that only a 2D rigid-body transformation is required. Some measurement points of the LRF and the corresponding points in the world coordinate system are chosen, and the relative pose of the LRF is estimated by the closed-form solution [20]. In this study, corner points are selected because they are easily distinguished from the other measurement points, as can be seen in Fig. 4. The corresponding points in the world coordinate system are selected in the bird’s-eye view image, which is generated by projecting the fisheye images onto the 3D model of the floor using the method described in the "Give textures from fisheye images" section.
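The 2D specialization can be sketched in the same spirit; this is our own illustration of a closed-form 2D rigid alignment, assuming the corner points have already been paired with their bird’s-eye view counterparts:

```python
import numpy as np

def rigid_transform_2d(p_l, p_w):
    """Closed-form 2D rigid transform (alpha_wl, t^xy_wl) for the LRF.

    p_l: 2xN LRF measurement points (e.g. corner points), p_w: 2xN
    corresponding points picked in the bird's-eye view image.
    """
    mu_l = p_l.mean(axis=1, keepdims=True)
    mu_w = p_w.mean(axis=1, keepdims=True)
    A = (p_w - mu_w) @ (p_l - mu_l).T
    # Optimal rotation angle directly from the 2x2 cross-covariance
    alpha = np.arctan2(A[1, 0] - A[0, 1], A[0, 0] + A[1, 1])
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    t = mu_w - R @ mu_l
    return alpha, t.ravel()
```

Because only three scalars are estimated, two well-separated corner correspondences already determine the pose, and additional points average out measurement noise.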