Occlusion handling for a target-tracking robot with a stereo camera

This paper presents an occlusion-handling method for a target-tracking robot with a stereo camera. One of the main challenges for such a robot is to continue tracking when the illumination changes and occlusion occurs. To cope with this challenge, we use both color and disparity images acquired from a stereo camera. The tracking system is composed of three phases: candidate extraction, target identification, and occlusion handling. First, target candidates are extracted using only three-dimensional (3D) information. Second, the target is identified from the candidates based on a combination of the color and location features of the target and the candidates. The weighting of the combination depends on illumination changes, which are inferred from changes in the white balance. Finally, the occlusion state is estimated from both the analysis of the positional relationship between the candidates and the result of target identification, and the procedure appropriate to that state is carried out. In off-line experiments, the proposed method is compared with previous methods. The proposed method is then applied to a mobile robot, and an on-line experiment is carried out. The experiments verify the effectiveness of the proposed method.


Introduction
Autonomous mobile robots must have various abilities to assist humans. Tracking a specific person is one of these abilities. This skill can be applied to the carrying of luggage for a person, surveillance, and communication. With a wide range of applications, this ability is expected to be used both indoors and outdoors, from industry to daily life. Currently, target-tracking robots have been deployed in shopping centers [1], military areas [2], golf courses [3], and other places [4]. In order to utilize robots in wide fields and dynamic environments, it is essential for robots to have high perceptual capabilities. Color cameras are most commonly used to give mobile robots such capabilities. Color information acquired from the camera provides features of a target for effective target tracking. However, the effectiveness is influenced by illumination changes and occlusion.
Changing illumination is the leading cause of changes in the color information. One strategy for coping with the problem is to use both the color and location features of a target, as humans do. Our previous method [5] was also based on the combination of both features. In that method, a parameter was adopted to express how the illumination conditions change. The degree to which each feature is relied on changes according to this parameter, which compensates for the weaknesses of using either feature alone. However, the method did not include any means of handling occlusion.
Occlusion occurs when a target is hidden in the frame and cannot be detected. It leads to the loss of a target because a similar candidate might be identified as the target. Additionally, under occlusion, robots cannot determine whether tracking should be recovered because the target is lost or simply continued. One solution is to estimate the occlusion state. Because the robot can recognize that the target is invisible in the frame during occlusion, it avoids treating the target as lost when the target is not expected to be detected. Furthermore, the estimation indicates which situation is occurring: loss or occlusion.
In this paper, we propose an occlusion-handling system for a target-tracking robot. The system exploits stereo vision, whose 3D information can be acquired stably under illumination changes. As a further development of our previous target-tracking method, the occlusion state is estimated, and the procedure is selected in accordance with the state. The process of estimating the occlusion state is implemented with only three-dimensional (3D) information. Three occlusion states are defined: no occlusion, partial occlusion with the exact target detected, and partial/total occlusion with no one detected as the target. Based on the state, the color or location models of a target are updated, and the locations of the other obstacles/people are predicted by trackers.
The paper is organized as follows. We review state-of-the-art target-tracking robots in "Efforts to overcome the occlusion problem" section. In "Proposed system" section, we detail the proposed system. "Experiments" section describes two types of target-tracking experiments, off-line and on-line. In the off-line experiments, the proposed system is compared with the previous ones. In the on-line experiment, the proposed system is applied to a mobile robot in real-world outdoor environments, and its effectiveness is verified. Finally, "Conclusion" section concludes the paper and discusses future work.

Related works
Several techniques can be used to carry out target tracking. Many of them introduce time-series filters to improve the accuracy of tracking. While someone overlaps a target, the estimation of the target's position reinforces robustness to occlusion. Some methods [6][7][8] use the filter without detecting the occlusion state. However, without the occlusion state, it remains ambiguous whether the robot should recover from loss or continue tracking.
The problem of occlusion is also a challenge in the field of human detection with a fixed camera. Researchers have proposed more occlusion-handling methods for fixed cameras than for moving cameras. Algorithmically, some of these fixed-camera methods can be applied to occlusion detection for target-tracking robots. Common methods using a fixed camera are based on classifiers that are built using machine learning algorithms. Using benchmark datasets or prepared samples, a classifier learns whether or not occlusion occurs. The Support Vector Machine (SVM) is one of the most common algorithms. Wang et al. [9] use a linear SVM classifier for human detection to detect occluded regions. For human features, Histograms of Oriented Gradients (HOG) are combined with Local Binary Patterns (LBP). The idea is that, in occluded regions, densely extracted blocks of HOG features tend to respond to the linear SVM with negative inner products. HOG and LBP features have the advantage of remaining usable under illumination changes. Extending this method, Shu et al. [10] introduced part-based detection of humans. The human model is created using features of the parts in each human region. It provides the advantages of excluding the effect of the background and obtaining regional features. Although these methods perform with high accuracy, scanning the windows from which the features are extracted makes the computational cost too high for tracking robots. Basso et al. [11] and Cielniak et al. [12] reduced the computational cost by using another learning algorithm, Adaptive Boosting (AdaBoost). Additionally, these methods have been successfully embedded into robot systems. In both methods, the color feature is used to train the classifier. Because color is an unambiguous feature, the classifier performs better and is composed of a larger number of weak classifiers than classifiers that are trained using features other than color. However, these methods are affected more easily by illumination changes.
Some approaches detect occlusion without any learning algorithm, by analyzing the appearance of a target or human. Pan et al. [13] proposed a content-adaptive progressive occlusion analysis. Occlusion detection is based on scanning the regions of interest (ROI). The occlusion situation is determined by analyzing the pixels in the ROI. Iterative scanning for target detection leads not only to high performance in the experiments but also to high computational cost. By evaluating both the distance between objects and the changes of object size in an image, Yilmaz et al. [14] proposed contour-based tracking to cope with the occlusion problem. Once the evaluation has detected an occlusion, modeling the contour changes alleviates the effect of shape variation from frame to frame. In [15], target-tracking and occlusion-handling methods were presented and applied to a mobile robot. The colors of a target's parts are used as features. The number of pixels in each region identifies the current situation from among three cases of occlusion. Based on the case, the tracking procedure is implemented appropriately. In these methods, the modeling fails when the objects are close to each other and the target's size changes. Changes in size occur for two reasons: one is occlusion, and the other is a change in the distance of the target as it moves closer to or farther away from the camera. Because these methods use only a color image, it is unclear which reason explains the modeling failure in this situation. In contrast, disparity-based occlusion detection is carried out in [16]. This study uses the changes in both the distance between humans and a stereo camera and the size of the human regions. However, the human regions are given by the result of a background subtraction method. This makes it easy to extract the human regions, but the method cannot be applied to a moving camera or dynamic environments. A method based on the change in a feature between frames is also proposed by Tran et al. [17]. The method uses the change in the number of people as a feature that indicates when occlusion starts or finishes. However, in dynamic environments, the number changes not only because of occlusion but also because of the movement of people.

Our previous method
In our previous method [18], the state of occlusion is estimated, and then the proper procedure is followed. The estimation is achieved using only two factors: the result of target identification and the analysis of the positional relationships between the target and others.
In the preprocessing for target identification, candidate targets are extracted. To help cope with illumination problems during candidate extraction, only 3D information is used. The 3D information is steadily acquired from a stereo camera, even under varying illumination. Based on the 3D information, a point cloud in a 3D space is produced. By using the point cloud, candidate regions can be extracted, even when the entire region of the target (from head to foot) is not visible due to partial occlusion. Furthermore, it reduces estimation errors regarding a target's position during occlusion. As in the method of [5], a target is identified from the candidates using a combination of both color and location models based on illumination parameters.
The positional relationship is analyzed based on the 3D information, and the occlusion state is estimated from the results of the analysis. The occlusion states are classified into three types: no occlusion occurs (STATE 1), so little occlusion occurs that a target is identified (STATE 2), and so much occlusion occurs that no one is identified as a target (STATE 3). In accordance with the state, it is determined whether each color or location is registered as a target feature.
The results of the experiments showed the method's effectiveness. However, there may be challenging situations for the method [18], as shown in Fig. 1. In each figure, the region of the identified target is depicted as a red rectangle. A target is drastically occluded by person A while the illumination changes extremely. In the method, the reliability of both color and location features is determined by the degree of change in the illumination. In this situation, target tracking relies heavily on the location feature. Then, during total occlusion of approximately 21 frames (3.0 s), estimation errors of the target's position accumulate. Finally, the estimate comes close to the position of A, and mis-identification occurs.
In order to cope with the problem, an occlusion-handling method is developed. In the method, candidates other than the target are also tracked from frame to frame. To prevent the escalation of computational costs, tracking is implemented only during STATE 2 and STATE 3. In the next section, we will detail the proposed method.

Proposed system
The proposed system is composed of a stereo camera mounted on a mobile robot. The target-tracking system consists of three procedures: candidate extraction, target identification, and occlusion detection.
The first phase of the system is candidate extraction. By projecting 3D information acquired from a disparity image into a 3D space, the candidate regions of a target are extracted. Second, a target region is distinguished from the others. In order to achieve the distinction, a target model is produced. The model is composed of a target's color and location features. Finally, in the process of occlusion detection, the positional relationship between a target and the others is analyzed. Using the results of both the analysis and target identification, the occlusion state is estimated. Depending on the state, the appropriate procedure is followed.

Candidate extraction
In our previous system [5], the segmentation method [19] was applied for candidate extraction. The method utilized an overhead (top-view) plane, and the 3D information in each pixel of a disparity image was projected onto the plane. The density of the projected points tended to be high in regions corresponding to humans, and the human regions were extracted based on this density. The method required the entire region of the person (from head to foot) to be visible because the height information was collapsed onto the plane. When partial occlusion occurs and a target region is only partly visible, the target may not be detected even as a human. As mentioned above, the longer occlusion continues, the more estimation errors regarding the target's position accumulate. To avoid this accumulation, it is desirable to extract partially occluded candidate regions in this phase as well. Therefore, in this paper, the candidate-extraction procedure utilizes a 3D space instead of an overhead 2D plane. By using 3D information acquired from a disparity image, a point cloud is obtained in the space. The space is defined by the X-Y-Z coordinate system, as shown in Fig. 2.
The procedure for candidate extraction is as follows. First, a point cloud is produced. For instance, Fig. 3 depicts an illustrative captured image, and the point cloud acquired from the corresponding disparity image is shown in Fig. 4. Note that the points on the ground have been eliminated. The cloud includes groups of points that correspond to the three people in Fig. 3. Second, in order to obtain the density of the points, the space is divided into boxes, and the number of points in each box is counted. The density in each box indicates the presence of a candidate. Therefore, if the density in a box exceeds a certain threshold, the box is assumed to be a part of a candidate. Third, boxes with high densities are extracted (Fig. 5) and labeled. In the labeling procedure, four-connected components are defined as belonging to the same label. Fourth, mean-shift clustering is implemented to merge or split labeled regions. Finally, candidate regions are extracted by thresholding with respect to the height, width, and depth of the regions. The result of candidate extraction for the illustrative condition is shown in Fig. 6. Three regions are extracted as the candidates of a target.
The candidate-extraction method gives robustness to both partial occlusion and illumination changes. In contrast to methods that use the contour features of people, our method does not require the entire contour to be visible. Furthermore, because the method uses only 3D information, it is not affected by varying illumination.
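As an illustration of the second and third steps above (dividing the space into boxes and keeping the dense ones), the following is a minimal sketch rather than the implementation used in the experiments. The function name `dense_boxes`, the cubic box shape, and the parameters `box_size` and `threshold` are assumptions introduced here; labeling and mean-shift clustering are not included.

```python
import numpy as np


def dense_boxes(points, box_size, threshold):
    """Return the integer indices and point counts of boxes whose count exceeds threshold.

    points    : (N, 3) array of X, Y, Z coordinates (ground points already removed)
    box_size  : edge length of a cubic box (hypothetical parameter)
    threshold : a box is kept only if it holds more than this many points
    """
    points = np.asarray(points, dtype=float)
    # Integer box index of every point along X, Y, and Z.
    indices = np.floor(points / box_size).astype(int)
    # Count how many points fall into each occupied box.
    boxes, counts = np.unique(indices, axis=0, return_counts=True)
    keep = counts > threshold
    return boxes[keep], counts[keep]


if __name__ == "__main__":
    cloud = [[0.05, 0.05, 0.05], [0.06, 0.02, 0.01],
             [0.01, 0.09, 0.03], [0.55, 0.05, 0.05]]
    boxes, counts = dense_boxes(cloud, box_size=0.1, threshold=2)
    print(boxes.tolist(), counts.tolist())
```

The retained boxes would then be passed to the labeling and clustering steps described above.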

Target identification
To compare extracted candidates with a target, the target's model is used. Both color and location features are components of the model.
The dissimilarity of color features is based on comparisons of the hue and saturation histograms. The color histogram of each candidate, $H_c$, and the color model of a target, $H_t$, are compared by the following equation: Note that $H_c(h, s)$ and $H_t(h, s)$ indicate the normalized frequencies of hue ($h$) and saturation ($s$). Additionally, the color model is compared with the pre-registered one, and if the dissimilarity is under a threshold, the color model is updated.
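To make the histogram comparison concrete, the sketch below builds a normalized hue-saturation histogram and compares two of them. The exact dissimilarity measure is not shown in this text, so the sketch substitutes the Bhattacharyya distance, a common choice for normalized histograms, purely as an assumption. The bin counts and the value ranges (hue in [0, 180), saturation in [0, 256)) are also assumptions.

```python
import numpy as np


def hs_histogram(hue, sat, bins=(16, 16)):
    """Normalized 2D hue-saturation histogram.

    Assumes hue values in [0, 180) and saturation values in [0, 256).
    """
    hist, _, _ = np.histogram2d(hue, sat, bins=bins,
                                range=((0, 180), (0, 256)))
    total = hist.sum()
    return hist / total if total > 0 else hist


def color_dissimilarity(h_c, h_t):
    """Bhattacharyya distance between two normalized histograms.

    This is a stand-in measure (an assumption), returning 0 for identical
    histograms and 1 for histograms with no overlapping bins.
    """
    coefficient = np.sum(np.sqrt(h_c * h_t))
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))


if __name__ == "__main__":
    model = hs_histogram(np.full(100, 10.0), np.full(100, 50.0))
    other = hs_histogram(np.full(100, 100.0), np.full(100, 200.0))
    print(color_dissimilarity(model, model), color_dissimilarity(model, other))
```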
The location model of a target is given by a Kalman filter. The filter is defined based on the assumption between frame $k$ and $(k + 1)$: where $X_{k+1}$ and $X_k$ are the states at the respective frames, $F_k$ and $H_k$ are the transition and observation models, $u_k$ is the control vector, and $z_k$ is the observation at frame $k$.
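The role of the Kalman filter as a location model can be sketched as follows. This is an illustrative example under stated assumptions, not the filter configuration of the paper: it assumes a constant-velocity state (position and velocity on the ground plane), omits the control vector, and uses hypothetical noise levels `q` and `r`. Calling `predict` without `update` corresponds to extending the position estimate while the target is hidden.

```python
import numpy as np


class LocationKalman:
    """Constant-velocity Kalman filter over (x, y) position (assumed model)."""

    def __init__(self, x0, y0, dt, q=1e-2, r=1e-2):
        self.x = np.array([x0, y0, 0.0, 0.0])        # state: x, y, vx, vy
        self.P = np.eye(4)                           # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)  # transition model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)  # observation model
        self.Q = q * np.eye(4)                       # process noise (hypothetical)
        self.R = r * np.eye(2)                       # observation noise (hypothetical)

    def predict(self):
        """Propagate the state one frame ahead and return the predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with an observed position z = (x, y)."""
        innovation = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]


if __name__ == "__main__":
    kf = LocationKalman(0.0, 2.0, dt=0.1)
    for k in range(1, 31):            # target visible: predict, then update
        kf.predict()
        kf.update((0.1 * k, 2.0))
    for _ in range(3):                # target hidden: prediction only
        print(kf.predict())
```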

The location feature between each candidate and the model is compared
where $(X_c, Y_c)$ is the location of a candidate, $(X_t, Y_t)$ is the location model of a target, and $k$ is a fixed value for normalizing $R_{\mathrm{location}}$. As with our previous method [5], the total dissimilarity between candidates' features and the model is calculated by combining these dissimilarities in accordance with the illumination changes. The total dissimilarity is defined as follows: where $\alpha$ is the parameter that represents illumination changes and has the relation $\alpha = p|W|$, $W$ is the amount of the white-balance change, and $p$ is a constant. The amount of white-balance change $W$ is the difference in the white-balance value between the present frame and the last frame in which the color model was updated. The value of $p$ is determined so that the relation $0 \le \alpha \le 1$ holds. In (7), $\alpha_{th}$ denotes the threshold of illumination change. When the illumination changes remarkably ($\alpha \ge \alpha_{th}$), we use only the location feature.

Fig. 5 Divided boxes
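The way the two dissimilarities might be combined can be sketched as below. The exact forms of the location dissimilarity and of the total dissimilarity are not shown in this text, so both functions are assumptions: the location term is taken to be the Euclidean distance divided by the normalizing constant, and the total is taken to be a convex combination weighted by the illumination parameter, switching to the location term alone once the parameter reaches its threshold (the one behavior that the text does state).

```python
import math


def location_dissimilarity(candidate_xy, model_xy, k):
    """Euclidean distance between candidate and location model, divided by k.

    Division by the fixed value k is an assumed form of the normalization.
    """
    return math.hypot(candidate_xy[0] - model_xy[0],
                      candidate_xy[1] - model_xy[1]) / k


def total_dissimilarity(r_color, r_location, alpha, alpha_th):
    """Blend color and location dissimilarities according to alpha.

    For alpha >= alpha_th only the location feature is used, as stated in
    the text. Below the threshold, the convex combination is an assumption.
    """
    if alpha >= alpha_th:
        return r_location
    return (1.0 - alpha) * r_color + alpha * r_location


if __name__ == "__main__":
    r_loc = location_dissimilarity((3.0, 4.0), (0.0, 0.0), k=5.0)
    print(r_loc, total_dissimilarity(0.2, 0.8, alpha=0.25, alpha_th=0.5))
```

With this form, stable illumination (small alpha) lets the color feature dominate, while larger white-balance changes shift the weight toward the location feature.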

Occlusion handling
Previously, we proposed an occlusion-detection method [18] that used a disparity image. First, by using this method, the occlusion state in the latest frame is determined. Then, the occlusion-handling procedure is performed in accordance with the state.

Occlusion detection
The positional relationship between a target and other objects/people is analyzed on an overhead (top-view) plane (Fig. 7). In the figure, the blue rectangle shows the stereo camera, and the dotted lines indicate the cameras' fields of view. D is the closest distance that can be measured by the stereo camera. Region A is defined as the inner region bounded by the lines that connect the left camera with the left edge and the right camera with the right edge of the target's region, with margins. When there are objects/people in Region A, part or all of the target's region is (or is going to be) hidden in the image, i.e., occlusion occurs. While occlusion continues, Region B is defined as the region hidden by the object in Region A (Fig. 8). By using Regions A and B, and also based on the result of target identification, the state of occlusion is classified into three types: STATE 1, 2, and 3.
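The membership test for Region A can be sketched geometrically as follows. This is a simplified illustration, not the implementation of [18]: it assumes the cameras sit on the line z = 0 at x = ±baseline/2, that the target's edges lie at a single depth, and that points nearer than the minimum measurable distance are ignored. All parameter names are hypothetical.

```python
def in_region_a(point_xz, target_left_x, target_right_x, target_z,
                baseline, margin, d_min):
    """Check whether a point on the overhead plane lies inside Region A.

    point_xz       : (x, z) of an object, x lateral and z depth
    target_left_x  : x of the target region's left edge
    target_right_x : x of the target region's right edge
    target_z       : depth of the target region
    baseline       : distance between the left and right cameras
    margin         : margin added to each edge of the target region
    d_min          : closest distance measurable by the stereo camera (D)
    """
    x, z = point_xz
    # Only objects between the measurable limit and the target can occlude.
    if not (d_min <= z < target_z):
        return False
    t = z / target_z  # fraction of the way from the camera line to the target
    left_cam_x, right_cam_x = -baseline / 2.0, baseline / 2.0
    # Line from the left camera to the (margin-extended) left edge.
    left_bound = left_cam_x + t * ((target_left_x - margin) - left_cam_x)
    # Line from the right camera to the (margin-extended) right edge.
    right_bound = right_cam_x + t * ((target_right_x + margin) - right_cam_x)
    return left_bound <= x <= right_bound


if __name__ == "__main__":
    args = dict(target_left_x=-0.25, target_right_x=0.25, target_z=3.0,
                baseline=0.12, margin=0.1, d_min=0.5)
    print(in_region_a((0.0, 1.5), **args), in_region_a((1.0, 1.5), **args))
```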

I. STATE 1: When no object/person violates Region A, the occlusion state of the frame is STATE 1. No occlusion occurs in this state.
II. STATE 2: When objects/people are present in Region A, occlusion is regarded as occurring. If an identified target is partially occluded, the occlusion state is STATE 2. In this situation, the edges of the target's region may be incorrectly determined because the correct edges might be hidden. Owing to the margins on the edges, this state can be assigned even when the target is partially in Region B. The width of the margins depends on the uncertainty of candidate extraction because the candidate region is given by boxes. Therefore, as shown in Fig. 9, when the target's region is not hidden but is close to Region B, it is also classified as this state. In other words, STATE 2 also covers the situation in which a target is going to be occluded in a few frames.
III. STATE 3: When the estimated region of a target is in Region B and a target is not identified, the occlusion state is determined to be STATE 3. The region of a target is defined as the model of the target's position acquired by a Kalman filter, with margins. The width of the margins is determined by the width of the target region that was identified in STATE 1 just prior to STATE 3. In this state, only the estimated location of the target can be assessed, due to occlusion.
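The three definitions above amount to a small decision rule, sketched below under the assumption that the geometric tests have already been reduced to booleans. The `LOST` outcome is not one of the three states; it is added here as a hypothetical fourth value for frames in which the target is not identified and its estimated region is not in Region B.

```python
from enum import Enum


class OcclusionState(Enum):
    STATE_1 = 1   # no occlusion
    STATE_2 = 2   # occlusion occurring or imminent, target still identified
    STATE_3 = 3   # target hidden and not identified
    LOST = 4      # hypothetical extra outcome: not identified, not explained by Region B


def classify_occlusion(object_in_region_a, target_identified, estimate_in_region_b):
    """Map the analysis results of one frame to an occlusion state."""
    if target_identified:
        # An object in Region A (margins included) means partial or imminent occlusion.
        return OcclusionState.STATE_2 if object_in_region_a else OcclusionState.STATE_1
    if estimate_in_region_b:
        return OcclusionState.STATE_3
    return OcclusionState.LOST


if __name__ == "__main__":
    print(classify_occlusion(False, True, False))
    print(classify_occlusion(True, True, False))
    print(classify_occlusion(True, False, True))
```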

The occlusion-handling procedure in each state
As defined above, each STATE describes a situation with or without occlusion. The strategy for continued tracking in each STATE is detailed as follows.
I. STATE 1: In this state, the color model of a target is considered to be correctly obtained, because no object/person occludes the target. Therefore, the color model is updated to adjust to illumination changes. In addition, the location model of the target is also updated to reduce the estimation errors of the target's position.
II. STATE 2: Though a target is visible, its region may not be completely visible. Therefore, the color model of the target is not updated. The location model is updated when the width of the target's region exceeds a threshold. Another problem during occlusion is the incorrect identification of a target. When the target region is occluded, another candidate may be erroneously identified as the target. Tracking objects/people between frames can avoid this problem. To reduce the computational cost, objects/people (whether or not they are extracted as candidates of the target) are tracked only during STATE 2 and STATE 3. Tracking is implemented based on Euclidean distance, as shown in Eq. 6. In STATE 2, the objects in Region A are tracked between frames. Once the objects invade Region A, the location features of the objects are registered and are tracked during STATE 2 or 3. Even when the registered feature is not similar to any object, the estimated position of the feature is updated using the measurement position. If the object position is not similar to that of any registered object, the position is newly registered.
III. STATE 3: Occlusion causes a target to be hidden and not identified. Therefore, the estimation of the target's position is extended until the estimated position moves out of Region B. Non-target objects/people are also tracked in this state. All of the objects/people within 0.5 m of the target's estimated position are tracked. This procedure allows for re-identification after long-term occlusion.
Additionally, if some objects are tracked and the value of the illumination parameter $\alpha$ exceeds a certain threshold, target identification might be incorrect. When the illumination changes and a neighboring candidate partly/totally occludes a target, identification failure might occur. The $R_{\mathrm{location}}$ of the identified target is compared with the distance between the position of the identified target and each registered position. If $R_{\mathrm{location}}$ is the smallest value, the target is regarded as having been identified correctly. On the contrary, if another distance value is the smallest, the identified target is considered to be not the correct target but the corresponding registered candidate. In that case, no target is identified, which leads to STATE 3.
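The frame-to-frame tracking of non-target objects described for STATE 2 and STATE 3 can be sketched as a nearest-neighbor association on Euclidean distance. This is a simplified illustration: the greedy matching order, the gating distance `gate`, and the integer identifiers are assumptions, and a registered object with no matching measurement simply keeps its last position here.

```python
import math


def associate(registered, measurements, gate):
    """Match measured positions to registered objects by Euclidean distance.

    registered   : dict mapping object id -> (x, z) last known position
    measurements : list of (x, z) positions measured in the current frame
    gate         : maximum distance for a match (hypothetical parameter)
    Returns a new dict with matched positions updated and unmatched
    measurements registered under new ids.
    """
    updated = dict(registered)
    next_id = max(registered, default=-1) + 1
    used = set()
    for m in measurements:
        best_id, best_dist = None, gate
        for obj_id, pos in registered.items():
            if obj_id in used:
                continue
            dist = math.dist(pos, m)
            if dist <= best_dist:
                best_id, best_dist = obj_id, dist
        if best_id is None:
            # Not similar to any registered object: register it as new.
            updated[next_id] = tuple(m)
            next_id += 1
        else:
            updated[best_id] = tuple(m)
            used.add(best_id)
    return updated


if __name__ == "__main__":
    tracks = {0: (0.0, 2.0), 1: (1.0, 3.0)}
    print(associate(tracks, [(0.1, 2.0), (3.0, 3.0)], gate=0.5))
```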
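The consistency check in the preceding paragraph can be sketched as a comparison of distances. The sketch assumes that both quantities are plain Euclidean distances on the same scale (so that any common normalizing constant cancels); the function and argument names are hypothetical.

```python
import math


def identification_is_consistent(identified_xy, target_model_xy, registered_xy):
    """Return True if the identified candidate is closer to the target's
    location model than to every registered non-target position.

    identified_xy   : position of the candidate identified as the target
    target_model_xy : position given by the target's location model
    registered_xy   : positions of the tracked non-target objects
    """
    d_target = math.dist(identified_xy, target_model_xy)
    return all(d_target <= math.dist(identified_xy, r) for r in registered_xy)


if __name__ == "__main__":
    print(identification_is_consistent((1.0, 2.0), (1.1, 2.0), [(2.0, 2.0)]))
    print(identification_is_consistent((1.0, 2.0), (1.1, 2.0), [(1.02, 2.0)]))
```

When the check fails, the frame would be handled as STATE 3, as described above.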

Experiments
The proposed system has been tested in outdoor environments with both illumination changes and occlusion. We have conducted two types of experiments, off-line and on-line. In the off-line experiments, images that had been captured in advance were used. The proposed method and our previous methods were applied to the images and compared. In the on-line experiment, a mobile robot tracked a target. In each experiment, we used the Bumblebee2, of Point Grey Research, as a stereo camera, and Blackship, of Segway Japan, as a mobile robot. Additionally, the parameters of the proposed method are shown in Table 1. The amount of white-balance change |W| is calculated as the sum of the changes of the red and blue gains of the camera. Each gain changes in 1024 steps. In addition, the minimum width at which the location model of a target is updated in STATE 2 is half of the target's width when the target was last identified.
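The calculation of |W| from the camera gains can be sketched as follows. The sketch assumes that the change of each gain is taken as an absolute difference and that gains are integers in the range 0 to 1023; the default value of `p` is a hypothetical choice that keeps the illumination parameter within [0, 1], not the value listed in Table 1.

```python
GAIN_STEPS = 1024  # each gain changes in 1024 steps


def white_balance_change(reference_gains, current_gains):
    """|W|: sum of the absolute changes of the red and blue gains.

    reference_gains : (red, blue) gains at the last color-model update
    current_gains   : (red, blue) gains in the present frame
    """
    (red_ref, blue_ref), (red_cur, blue_cur) = reference_gains, current_gains
    return abs(red_cur - red_ref) + abs(blue_cur - blue_ref)


def illumination_parameter(reference_gains, current_gains,
                           p=1.0 / (2 * (GAIN_STEPS - 1))):
    """alpha = p * |W|, with a hypothetical p that bounds alpha by 1."""
    return p * white_balance_change(reference_gains, current_gains)


if __name__ == "__main__":
    print(white_balance_change((500, 600), (520, 590)))
    print(illumination_parameter((500, 600), (520, 590)))
```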

Off-line experiments
Before the experiments, images had been captured by the stereo camera attached to the mobile robot. The robot was controlled by a human operator to follow a target. During this run, 3260 frames were captured at 12.8 Hz.
Details of the experimental environments are shown in Tables 2 and 3. In Table 2, the number of times when occlusion occurred and the average and maximum duration of occlusion are shown. In Table 3, the number of frames and the average duration are shown for each number of people. Additionally, Fig. 10 shows examples of the color images that were used in the experiments.
In the off-line experiments, three methods were applied. Method I is the proposed method, with procedures for detecting and handling occlusions. Method II is our previous method [18], with procedures for detecting occlusions, updating the color model, and extending the duration of estimation of the target's position. Method III is our previous method [5], without any procedures for detecting or handling occlusions.

Fig. 9 The situation when the object does not violate Region A. This state is also determined to be STATE 2 because of the margins attached to the target's region

Table 1 The parameters of the proposed method in the experiments

The effectiveness of each method is verified by three evaluation values: precision, recall, and F-measure (P, R, and F, respectively). These values represent accuracy, completeness, and the harmonic mean of precision and recall, respectively. They are computed from the following counts: A, the number of frames in which the target is correctly detected; B, the number of frames in which a non-target is detected (mis-identification); and C, the number of frames in which no object is detected (dis-identification).
Table 4 shows the evaluation values for each method. The proposed method achieves the highest values among all methods. Comparing Method II with Method III, the precision of Method II is lower than that of Method III, whereas its recall and F-measure are higher. This shows that, under occlusion, a target is readily lost without occlusion detection. The detection method helps complete identification but might cause mis-identification. Therefore, the proposed method, which aims to decrease mis-identification, is effective.
Situations in which a target is not identified are classified into two types: one is occlusion, and the other is the loss of a target. With Method I or II, when a target is not identified and there is no one in Region A, the frame is counted as one in which the target was lost. With Method III, whenever the target is not identified, the situation is defined as the loss of a target. The number of frames in which a target is lost with each method is shown in Table 5.
The number for Method I is the smallest of all methods, approximately 1% of all frames (3260 frames). The number for Method II is smaller than that for Method III. This also shows the necessity of occlusion estimation.
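For reference, the three evaluation values can be computed from the frame counts A, B, and C as sketched below. The formulas are the standard definitions of precision, recall, and F-measure, with B treated as false positives and C as false negatives; that mapping is an assumption, since the defining equations are not shown in this text.

```python
def evaluate(a, b, c):
    """Precision, recall, and F-measure from frame counts.

    a : frames in which the target is correctly detected
    b : frames in which a non-target is detected (mis-identification)
    c : frames in which no object is detected (dis-identification)
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Illustrative counts only, not values from the experiments.
    print(evaluate(90, 0, 10))
```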

On-line experiment
The proposed system was applied to the mobile robot with the stereo camera in real-world environments. The robot's behavior was based on a PID controller so as to keep the distance between the target and the robot at 1 m and the angle of direction at 0 rad. The entire experiment was composed of 1498 frames that were captured at 12.1 Hz. The experimental environments are classified into five scenes based on the illumination. Details of the experimental environments are explained in Tables 6 and 7. The same three evaluation values are used for evaluation (see "Off-line experiments" section). Figure 11 depicts the target-identification results. In the figure, red rectangles indicate the centroids of the target. The results of the evaluation are shown in Table 8.
The precision values are 100%. Both the recall and F-measure values are higher than 94%. The results are in line with the purpose of this method, which is to decrease target mis-identification. Even when there was a person near the target who caused occlusion, mis-identification did not occur. Additionally, the robustness to occlusion is shown by successful identification after 20 frames (about 1.9 s) of occlusion.
However, frames occasionally occurred in which no target was identified despite the target's presence, 92% (55 frames) of which were caused by occlusion. Due to estimation errors regarding the target's position, dis-identification occurred after occlusion in 23 frames. In these frames, the target was regarded not as the target but as another candidate and was associated across frames as such. It is impossible to estimate the target position correctly during long-term occlusion. Therefore, a recovery method is required that will help with re-identification whenever the target is lost.
Dis-identification by occlusion occurred in 21 frames due to illumination changes. Before or after these frames, the target was occluded, and the illumination changed. However, the illumination parameter did not change. Figure 12 shows this situation. In this situation, only the brightness of the images changed; therefore, the white-balance values did not change. To deal with the problem, other factors that reflect brightness changes should be adopted in the future.
Identification was prevented in the other 9 frames. In these frames, part of the target's region was still visible. Because the color model of a target was produced using the entire region of the target, the color histogram of the small part did not correspond to the model. Segmentation errors of the candidate regions caused the errors in another 3 frames, as the target's region was not correctly extracted and therefore was not identified as the target.