# Vision and force/torque integration for realtime estimation of fast-moving object under intermittent contacts

- Hyunki Bae
^{1}, - Soo Jeon
^{1}Email authorView ORCID ID profile and - Jan P. Huissoon
^{1}

**3**:15

**DOI: **10.1186/s40648-016-0054-2

© Bae et al 2016

**Received: **22 January 2016

**Accepted: **1 July 2016

**Published: **16 July 2016

## Abstract

This paper considers the fast and accurate estimation of motion variables of a rigid body object whose movement occurs from intermittent contacts with coordinating manipulators in nonprehensile manipulation tasks. The estimator operates under multiple sensory data including visual, joint torque, joint position and/or tactile measurements which are combined at the lower level to compensate for the latency and the slow sampling of the visual data. The estimator is real-time in the sense that it provides the motion data of the target object at the same fast sample rate of the servo controller without delay. The basic formulation is the multi-rate Kalman filter with the contact force vector as its process input, and the visual observation as its measurement output signal which is the down-sampled and delayed version of the configuration of the target object. Experimental tests are conducted for the case of planar object manipulation as well as the non-centroidal rotation under gravity using a robotic hand, and their results are presented to demonstrate the validity of the proposed estimation scheme.

### Keywords

Coordinated manipulation Kalman filter Sensors and sensing systems## Background

Coordinating multiple manipulators for handling an object has potential to substantially enhance dexterity and versatility, the level of which is not possible by traditional stand-alone manipulators [1–3]. Although we expect that this will bring us numerous benefits, there are a number of technical issues to be resolved before we demonstrate the practical use of coordinated manipulation with human-like dexterity. From a hardware perspective, there has been much progress in recent years in new designs of complex manipulation systems such as anthropomorphic robotic hands that are capable of highly coordinated manipulation tasks with a large number of degrees of freedom [4–7]. On the other hand, there is a consensus (see, for example, *A Roadmap for US Robotics* [1]) that we are still far behind in some key technology areas such as perception, robust high fidelity sensing and planning and control to achieve highly dexterous object manipulation capabilities. As for the sensing in particular, it is recognized that processing sensory data with multiple modalities (e.g. vision, joint angle, tactile and force) will play a key role in synthesizing dexterous manipulation skills [1], which is clearly the case for human dexterity as we have ample evidence from neuroscience [8, 9].

This paper considers the situation where we want to keep track of the movement of the object in realtime, while the object is being manipulated by intermittent contact forces from the coordinating manipulators. Such scenarios occur when we attempt to manipulate a rigid body object through, so called nonprehensile manipulation [10] which refers to a class of dexterous manipulation tasks that make an explicit use of the dynamic forces (e.g. inertial, gravitational, or centrifugal) to allow extra motion freedoms to the object. Typical examples of nonprehensile manipulation include pushing, throwing, caging, flipping, twirling, etc. In conventional approaches to coordinated (or dexterous) manipulation [11], the object is being held stably through force-closure condition during manipulation. The force closure condition ensures that the object and manipulating bodies are kinematically related throughout the manipulation process, and thus we can easily keep track of the movement of the object using the associated kinematic relations. On the contrary, in the nonprehensile situations, the movement of the target object cannot be inferred from the kinematic relation because the object frequently loses its contact with some or all of its manipulating bodies, making the contact force applied in an intermittent way. Therefore, an additional sensor such as a vision camera is often necessary to provide the information of object motions in nonprehensile cases. The visual data, however, may not be enough to provide all the information in a timely manner. The vision may be blocked from a certain direction (occlusion) [12]. Also, the movement data obtained from visual images are often slow and delayed (latency) [13, 14], thereby restricting the speed and the accuracy of achievable object manipulation tasks, especially when the object is moving relatively fast, as is often the case with nonprehensile tasks mentioned above.

The main idea of this paper is to design a real-time rigid body estimator by augmenting the visual data with other sensory inputs available from manipulating bodies including the joint force/torque sensor or the tactile sensor mounted at the surface of the end-effector. The estimator is termed “real-time” in the sense that the visual feedback and other sensory data are updated and utilized to guide the manipulators at the same rate of their motion controller (or the servo controller), which is usually much faster than the update rate of the vision data. Therefore, the information fusion occurs at the very low level of the sensory data processing. Such a technical issue has been addressed either by some algorithmic approach (see, for example, [13] and references therein) or by sensor-based approach [14]. In this paper, we try to combine different types of sensors but make explicit use of the physical model for the rigid body dynamics requiring the determination of contact forces in detail as a necessary step to achieve a good estimation performance.

Combining the vision with force/torque or tactile data has been tried by many researchers in recent years [15–19], mainly for task-orientated (i.e. high-level) operations of robotic manipulators such as opening doors [18] or grasping a stationary object [15, 19]. Because the time scale of these tasks is much larger than that of the controller, the timing issue (i.e. the latency of the visual data) is not critical in these applications. On the other hand, the nonprehensile manipulations mentioned above often involve much faster time scale associated with the object motion. Thus the compensation of the visual latency will be important for accurate and fast estimation of the object motion, especially when we use the estimated variables for real-time control purpose. In fact, some pioneering works on nonprehensile manipulation share the motivation with our study in this respect. Specifically, the work in [20] considered a concept called the dynamics matching to close the visual feedback loop at the same rate (i.e. 1 ms) as the other sensory inputs using a specially-designed high-speed vision system.

The remainder of this paper is organized as follows. “Preliminaries” section presents the mathematical description of general rigid body motion and the determination of contact forces using the contact kinematics. The main formulation of the multirate rigid body estimator is described in “Estimator design” section and their experimental results are presented in “Experimental results” section followed by the summary of the results in “Conclusions and discussions” section.

## Preliminaries

### Description of rigid body motion

*S*and

*B*denote the spatial and the body coordinate frame, respectively. The total mass and the inertia tensor of the object, evaluated at the center of mass, are denoted by

*m*and \(\mathcal {I}\), respectively, where \(\mathcal {I}\) is a diagonal matrix with moments of inertia \(J_1\), \(J_2\) and \(J_3\), i.e. \(\mathcal {I}_0 = \mathrm {diag}(J_1,\, J_2,\, J_3)\) when its bases lie along the principal axes. Then, the general form of rigid body dynamics can be written as [11]

*b*indicates that the corresponding quantities are written with respect to the body coordinate frame attached to the object mass center (shown as

*B*in Fig. 1). \(\mathcal {M}=\mathrm {diag}(mI_3,\, \mathcal {I}_0)\in \mathbb {R}^{6\times 6}\) is the generalized inertia matrix (with \(I_3\in \mathbb {R}^{3\times 3}\) denoting the identity matrix). The notation \([\bullet \times ]\in \mathbb {R}^{3\times 3}\) is the skew-symmetric matrix form of the cross product, i.e. \([a\times ]b = a\times b\). The vector \(V^b=\text {col}\left( v^b,\,\omega ^b\right) \in \mathbb {R}^6\) is the twist form for the generalized body velocity where \(v\in \mathbb {R}^3\) is the translational velocity, and \(\omega \in \mathbb {R}^3\) is the angular velocity. \(F^b\in \mathbb {R}^{6}\) is the resultant wrench applied at the center of mass, which can further be written with respect to those of each manipulator. Namely, denoting by \(n_p\) the total number of end-effectors, \(F^b = \sum _{i=1}^{n_p}{F_i^b}= \sum _{i=1}^{n_p}\text {col}\left( f_i^b,\,\tau _i^b\right)\) where \(f_i^b\in \mathbb {R}^3\) and \(\tau _i^b\in \mathbb {R}^3\) (\(i=1,\,2,\,...n_p\)) are the force and the torque vectors, respectively, for the

*i*th contact point (denoted by \(p_{c_i}\) in Fig. 1). Note that \(f_i\) can be either a contact force or the gravitational force. The point of application, \(p_{c_i}\), of the gravitational force will be located at the center of mass (see Fig. 1). We consider the situation where the object is freely moving without any environmental or disturbance forces (such as friction) except for the gravity.

*v*of Eq. (1) in the spatial coordinate frame (indicated by the superscript ’

*s*’) and separate the dynamics into a translational part and a rotational one, we have

In general, the dynamics of the object may behave in a little different way if the contact forces are applied in a non-smooth or impactive way [25]. However, in practice, we may not encounter such an ideal impact because of the softness (or the flexibility) of finger tips and the end-effectors.

### Determination of contact force

Often times, the contact force \(f_i\) in Fig. 1 cannot be directly measured but rather needs to be estimated using proprioceptive sensory data (e.g. joint angles, torques and/or tactile sensors) available from each manipulator. To explain specific ways of doing this, we briefly review the contact kinematics and coordinate transformation of generalized forces in this subsection. More detailed treatment for mathematical formulations of grasping and contact forces can be found in well-known textbooks such as [11] and [21].

*i*-th manipulator (finger) contacting the object at the point \(p_{c_i}\). \(D_i\) denotes the coordinate frame for the base of the

*i*-th manipulator that is fixed relative to

*S*. \(E_i\) is the coordinate frame for the reference end-effector that is attached to the terminal link. We will use \(m_i\) to denote the number of joints in the

*i*th manipulator. The joint space vector for the manipulator is denoted by \(\theta _i=\begin{bmatrix}\theta _{i,1},\ldots\theta _{i,m_i}\end{bmatrix}^T\) which is in \(m_i\)-torus \(\mathbb {T}^{m_i}=\mathbb {S}^1\times \cdots \times \mathbb {S}^1\). For the sake of illustration, we depicted the robotic finger as a 3 DOF (degrees-of-freedom) elbow manipulator, i.e. \(m_i=3\), in Fig. 2.

*B*in this case). Also, \(g^{-1}_{bc_i}\) is the inverse of the homogeneous transformation \(g_{bc_i}\in \mathbb {R}^{4\times 4}\) which belongs to

*SE*(3) and maps a vector in \(C_i\) coordinate into that in

*B*coordinate. (see [11] for more information about the homogeneous transformation and the adjoint transformation.)

*i*th manipulator evaluated with \(T_i\) as the terminal frame and \(D_i\) as the base frame. Combining Eqs. (7) and (8), we have

*i*-th manipulator [11] and the notation \(\bullet ^{\dagger }\) stands for the pseudo-inverse of \(\bullet\). The body wrench \(F^b_i\) is then related to \(\Upsilon _i\) by Eqs. (5), (6) and (9). For \(m_i\ge 3\), \(J_{h_i}\) is either a square or a tall matrix and thus its pseudo-inverse exists uniquely as long as it is full rank. If \(J_{h_i}\) is not full rank, the manipulator may be in a singular configuration and needs a special treatment, which is beyond the scope of this paper.

From Eq. (9), it is clear that we need at least three joint torque sensors to fully determine \(f^{t_i}_{i}\) in 3D. If the tactile sensing is available to provide the contact pressure, \(f^{t_i}_{i,3}\) can be directly obtained from the tactile sensor. This allows us to use only two joint torque sensors to figure out the remaining components for tangential direction, i.e. \(f^{t_i}_{i,1}\) and \(f^{t_i}_{i,2}\). If the manipulator is deficient in sensing, the contact force may not be fully determined in general. However, in some special cases, the contact force may be identified reasonably well even with fewer number of sensors. Manipulation by an impactive force is one example where the direction of the contact force is related to the pre-impact velocity. If the direction of the contact force vector can be known (for example by the vision sensor), we only need to find its magnitude which we may infer with just one joint torque sensor. An example of such cases is considered in the experimental tests.

## Estimator design

In this section, we derive the rigid body estimator for the general 3D case, using the rigid body dynamics and the contact force measurement equations presented in the previous subsections. Multimodal sensory integration will also be put into perspective through the formulation of a new state estimator combining signals from the visual image with the joint torque sensor and/or the tactile data.

### State space description

*t*and

*r*denote the translation and rotation, respectively, \(u(t)=\text {col}\left( f_1^b(t),\, ..., f_{n_p}^b(t)\right) \in \mathbb {R}^{3n_p}\) is the force input and \(w(t)\in \mathbb {R}^{3n_p}\) represents the corresponding process noise. The system matrices are given by

*i*-th end-effector is not in contact with the object, both \(p_{c_{i}}^b(t)\) and \(f_i^b\) can simply be set to zero.

Note that the translational dynamics depends on the rotation matrix *R*(*t*) to be estimated using the attitude dynamics. Similarly, the system matrix for the attitude dynamics, \(F_r(t)\), is a function of the state variable \(\omega ^b\) and the attitude input matrix \(G_r\) is a function of \(\left[ p^b_c(t)\times \right]\) which needs to be estimated from the translational motion. The state dependency of these system matrices can be handled by the pseudo-linear formulation [23, 24] which simply replaces the actual state variables (i.e. \(\omega ^b\), *R*, and \(\left[ \omega ^b(t)\times \right]\) ) in the system matrices by the estimated ones.

*k*, the discretetized form of Eq. (10) is

*u*(

*t*) is available from joint torque sensors and/or tactile sensors with a fast sample rate \(T_s\). On the other hand, the position and orientation of the rigid body can be measured through a stereo vision system at much slower sample rate and with latency due to the vision processing time. The corresponding output equations can be described as

Due to the multi-rate nature of the system equations given in Eqs. (12) and (13), the estimation process consists of two parts: the prediction step at every fast sample time \(T_s\), and the correction step at every vision sample time \(T_v\). The key point is that the correction is operated in a *retrospective way* to fully recover the measurement delay coming from the delayed vision data, thereby ensuring that the estimates reflect the current (real-time) state information. In the remaining part of the paper, the bar notation, \({\bar{\bullet }}\) will be used to denote the values measured by the sensors while the hat notation, \({\hat{\bullet }}\), will be used to denote the estimated values.

### Multirate estimation combining vision and contact force

*k*based on the vision data at the same time step, which we denote as \({\hat{\xi }}(k|k)\) and \({\hat{\eta }}(k|k)\), respectively. Then, until the next vision data become available at the time step \(k+N_v\), the state variables are predicted by propagating Eq. (12) using the contact force vector as an input signal

*k*is the discrete-time index for the moment at which the last correction was made by the vision data. \(\bar{u}(k)= u(k) + w(k)\) is the forces applied at the contact point(s) which can be measured by the force-torque and/or tactile sensors on fingers using the kinematic relation provided in the previous section. The sensor noise \(w(k)\in \mathbb {R}^{3n_p}\) is assumed independent, Gaussian distributed, and stationary with its covariance denoted by \(\mathbb {E}\left[ w(k)w(k)^T\right] =W\in \mathbb {R}^{3n_p\times 3n_p}\). Note that the state-dependent system matrices \(\Gamma _t\), \(\Phi _r\) and \(\Gamma _r\) are updated by the vision sample rate \(T_v\) in Eq. (14) (i.e. they are updated at the time step

*k*and kept constant for \(i=0,1,...,N_v-1\)). However, it is also possible to update them by the base sample rate \(T_s\) using the predicted state variables.

*Z*(

*k*|

*k*) as the estimation error covariance at time

*k*right after updating by the

*k*th image. Then,

*Z*(

*k*|

*k*) propagates through the prediction process in Eq. (14) as

*t*or

*r*in the remainder of the paper. When the new image is obtained at time \(k+N_v\), the Kalman filter gains can be computed for the translational and rotational motions as

*k*is, in fact, obtained from the camera image taken at the time step \(k-N_v\). Therefore, the prediction step Eq. (14) needs to be changed to

*q*(

*k*) (\(=y_r(k+N_v)\)) should have been. This means that we can make the correction for the \(N_v\) time step prior to the current time step and compute \({\hat{\xi }}(k|k)\) and \({\hat{\eta }}(k|k)\)

*in retrospect*. As soon as we compute \({\hat{\xi }}(k|k)\) and \({\hat{\eta }}(k|k)\), we also know what values of \({\hat{\xi }}(k+i+1|k)\) and \({\hat{\eta }}(k+i+1|k)\) should have been for \(i=0,\,1,\,...,\,N_v-1\), and can compute them all at once. All these estimates are for the past and thus useless except for the last pair, i.e. \({\hat{\xi }}(k+N_v|k)\) and \({\hat{\eta }}(k+N_v|k)\). So we should take them as the estimate for the current time step, \(k+N_v\). This process repeats for the next batch of the prediction steps. In summary, the correction step shown in the previous subsection should be modified into the following set of procedures. It applies to both translation and rotation in the same way so we explain here with the translational motion only.

- 1.As soon as we arrive at the time step \(k+N_v\), compute the Kalman filter gain for the time step
*k*, i.e. \(L_{t}(k)\) using Eq. (21). Then, using the delayed vision data \(y_t(k+N_v)=p^s(k)\), make a correction for the time step*k*, i.e. \({\hat{\xi }}(k|k)\), by Eq. (17a). At the same time, make a correction to the estimation error covariance in a similar way as Eq. (18) but with the twice of the vision step as the update interval, i.e.The reason for doubling the update interval will become clear at the end of step 3).$$\begin{aligned} Z_{t}(k|k)&=Z_{t}(k|k-2N_v) \\& \quad -L_{t}(k)C_{t}Z_{t}(k|k-2N_v). \end{aligned}$$(22) - 2.
Compute \({\hat{\xi }}(k+1|k),\,...,\,{\hat{\xi }}(k+N_v|k)\) using the prediction equation Eq. (14a), after which we take only \({\hat{\xi }}(k+N_v|k)\) as the estimate for the current time step \(k+N_v\). Similarly, compute the propagation of the estimation error covariance \(Z_t(k+1|k),\,...,\,Z_t(k+N_v|k)\) in the same way as Eq. (15). This completes the

*retrospective*correction for the time step \(k+N_v\). - 3.
Next, we progress for the subsequent time steps \(k+N_v+1,\,...,\,k+2N_v\) in a similar way as Eq. (19a). At the same time, we continue computing the propagation of the estimation error covariance using Eq. (20), after which we have \(Z_t(k+N_v|k-N_v)\). As a result, the update process operates by the estimation error accumulated during \(2N_v\) time steps.

- 4.
When we arrive at the next vision time step \(k+2N_v\), now we have \(y_t(k+2N_v)=p^s(k+N_v)\) and can compute the Kalman filter gain for the time step \(k+N_v\), i.e. \(L_t(k+N_v)\) using Eq. (21). Now, we are back to step 1) so can repeat the process.

## Experimental results

### Testbed configuration

*Barrett Technologies Inc.*, is mounted to the side of the table so that the robotic fingers can manipulate the object for its planar motion. A vision camera is mounted above to take images of the object while it is moving. Figure 5 shows the sample view from the camera, which indicates, with added labels, the index number for each finger and the locations of some reference frames that we defined in “Determination of contact force” section. The body coordinate frame

*B*is located at the center of the square object. The spatial (inertial) frame is attached to the center of the palm of the BHand and \(D_1\) (the base frame for finger 1) is placed at the inner joint (first joint) of the finger 1.

Specifications of vision

Specification | Values | Units |
---|---|---|

Communication | CameraLink™ | |

Exposure time | 10 | [ms] |

Bit depth | 8 | Bits |

Pixel size | 2048 × 1088 | Pixels |

Resolution | 284 | μm/pixel |

During the experiment, the BHand is manipulated by an operator who controls a set of potentiometers each of which corresponds to the desired position of the first joint of each finger. Each finger takes this desired position as a reference command and runs the PID controller. The square object is then manipulated through contacts with the fingers. As shown in Fig. 5, each finger of the BHand has its configuration of a two link manipulator. However, the second (or the outer) joint is kinematically linked to the first one so only one reference signal is used to control the whole finger. In the remaining part of the paper, F1, F2 and F3 will denote the finger 1, the finger 2 and the finger 3, respectively.

*mg*. The BHand (and the camera) is arranged in \(90^o\) rotation from to the previous configuration. So the back plane of the wrist of the BHand is mounted on the horizontal table. Initially, the object is stably supported by the F3 and one of its corners against gravity as shown in Fig. 6. Then, the object is moved by the action of F3 making a non-centroidal rotation about its corner at the bottom until it gets caught by the F1 (and F2).

### Estimator for 2D motion

Since the surface of the object aligns with *x* and *y* axes of the body coordinate frame, the coordinate chart for the object \(c_c\) can simply be taken as a mapping from one of these axes to the contact point represented by the body coordinate frame. Similarly, the contact frame \(C_1\) for the object is configured such that its *y*-axis is the same as the *x*-axis of *B* for the situation shown in Fig. 7b. Figure 7b also illustrates how the contact force \(f_1\) is represented with respect to the contact frame \(T_1\) for the finger. More specifically, \(f^{t_1}_{1,1}\) denotes the component of \(f_1\) along the *x*-axis (or, equivalently, the direction of \(\frac{\partial c_t}{\partial u_t}\)) of \(T_1\) and \(f^{t_1}_{1,2}\) corresponds to its normal component.

*c*” and “

*s*” denote the cosine and the sine functions, respectively, for the angles indicated by their subscripts, e.g. \(s_{u_t}:=\sin {u_t}\).

### Experimental results and evaluation

*National Instruments Inc.*The real-time loop consists of two dual-rate parallel processes; one for the fast loop (\(T_s\)) and the other for the vision loop (\(T_v\)). The sensor signals from the BHand can only be communicated through high speed CAN protocol, which made us to choose \(T_s=2.5\) ms as the fastest possible sample rate for acquiring the joint encoder position values and the strain gauge values. The reference command for the joint PID controller is also updated by the \(T_s\). Through the timing test of the vision camera, it has been identified that the total latency of the vision data is around 25 ms. To evaluate the performance of the estimated state variables, we need a monitoring signal (i.e. the true position and orientation data for the target object) for comparison. For this, the original vision data has been acquired every 25 ms and processed offline to remove the latency and to be interpolated into 2.5 ms data which is then stored as the monitoring signal. The original vision data sampled at 25 ms is further down-sampled to create \(T_v=50\) ms of the observation data with latency (i.e. \(y_t(k)\) and \(y_r(k)\) in Eq. (13)). In the experiments, we used the original vision data (i.e. delayed and down-sampled position measurement of the object) to identify the contact point, which worked reasonably well despite some errors due to the visual latency. It may be attributed to the fact that position variable does not change abruptly in contrast to the force which can be applied in a discontinuous (or an impactive) way.

Parameter values of the experiment

Parameter | Value | Unit |
---|---|---|

Object size | 10 × 10 cm | |

Object mass ( | 3.165 | kg |

Object inertia (\(\mathcal {I}_0\)) | 9.72 · 10 | kg m |

Base sample rate (\(T_s\)) | 2.5 | ms |

Vision sample rate (\(T_v\)) | 50 | ms |

Force noise covariance ( | 0.04 | N |

Vision noise covariance ( | 0.01 | mm |

Figure 9 shows the experimental results for the position variables; the position of the mass center \(p^s=\left[ x \,\,\, y\right] ^T\) and the orientation angle \(\phi\). The thick solid line is the monitoring signal obtained by offline processing of the original vision data sampled at 25 ms rate and is regarded as the true position of the object. The thin solid line is the real-time vision data with \(T_v=50\) ms latency, which we used as the real-time observation signal for the estimator. The thin dotted line is the motion data obtained from the real-time rigid body estimator. It shows that the latency and inter-sample values are compensated so that the estimated motion variables are close to the monitoring signal.

Figure 10 compares the velocities from the rigid body estimator with those of other signals. Again, the thick solid line is the reference (or the true) velocity which is obtained by numerical differential of the monitoring signal. The thin solid line is the velocity computed using the delayed vision data. Remember that we need two consecutive sample points to compute the velocity by numerical differentiation. Therefore, the velocity obtained from the observed vision data (i.e. the one with \(T_v=50\) ms sampling) has additional 50 ms of latency resulting in 100 ms of total latency as shown in Fig. 10. In contrast, as we can see in the thick dotted line, the rigid body estimator can effectively compensate for such latency providing more accurate real-time velocity signal.

## Conclusions and discussions

Motivated by the dynamic dexterous manipulation tasks where the fast and accurate estimation of the target object motion is often required, this paper presented a new rigid body estimation scheme combining the visual data with other sensory inputs for contact force measurement. The main point is the complementary use of the fast-sampled force measurement in compensating for the slow-sampled visual data with latency. Mathematically, the rigid body estimator takes the multi-rate extended Kalman filter form with retrospective correction step for the delay compensation.

As demonstrated by the experimental results, the estimator is expected to perform very well if the contact force can be correctly determined. This will be especially so for large values of vision latency since the estimator relies on the numerical integration (or the prediction) between two consecutive vision samples. Use of high fidelity tactile and joint torque sensors is expected to enhance the performance as well as the robustness of the proposed rigid body estimator.

In this paper, we assumed that we have a prior knowledge on the inertial properties of the object such as the mass or the inertia matrix. Such inertial properties may also be identified through preliminary manipulation trials and the application of parameter identification techniques. In fact, when we (humans) manipulate an object with our hands, in the beginning, we have only a limited sense of inertial properties but, through a variety of manipulation tasks, we gradually get to learn inertial properties more accurately and eventually become more skillful. Also, in order for the proposed estimator to be useful in a wide range of applications, it should be tested with more complex dynamic manipulation tasks such as throwing and catching an object in 3D space. We plan to pursue these research ideas as our future works.

## Declarations

### Authors’ contributions

HB worked on the formulation of the problem under the supervision of the second and third authors. He also designed the experimental setup and carried out the experimental tests. SJ initiated the research idea, supervised the first author for analysis and experiment, and helped him with drafting the initial version of the manuscript. JPH contributed to research ideas and helped drafting the paper by providing feedback. All authors read and approved the final manuscript.

### Acknowledgements

Authors acknowledge the financial support from NSERC (Natural Sciences and Engineering Research Council of Cananda) under the Discovery Grant Program (RGPIN-2015-05273).

### Competing interests

The authors declare that they have no competing interests.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- Robotics-VO (2013) A roadmap for U.S. robotics: from internet to robotics—2013 edition. http://www.robotics-vo.us/sites/default/files/2013%20Robotics%20Roadmap-rs.pdf. Accessed 10 July 2016
- Falco J, Marvel J, Elena M (2013) Dexterous manipulation for manufacturing applications workshop. http://www.nist.gov/el/isd/upload/NIST-IR-7940.pdf. Accessed 10 July 2016
- Okamura AM, Smaby N, Cutkosky MR (2000) An overview of dexterous manipulation. IEEE Int Conf Robot Autom 1:255–262Google Scholar
- Grebenstein M (2014) Approaching human performance. Springer, ChamView ArticleGoogle Scholar
- Martin J, Grossard M (2014) Design of a fully modular and backdrivable dexterous hand. Int J Robot Res 33(5):783–798View ArticleGoogle Scholar
- Quigley M, Salisbury C, Ng AY, Salisbury JK (2014) Mechatronic design of an integrated robotic hand. Int J Robot Res 33(5):706–720View ArticleGoogle Scholar
- Deshpande AD, Xu Z, Weghe MV, Brown BH, Ko J, Chang LY, Wilkinson DD, Bidic SM, Matsuoka Y (2013) Mechanisms of the anatomically correct testbed hand. IEEE/ASME Trans Mechatron 18(1):238–250View ArticleGoogle Scholar
- Kandel ER, Schwartz JH, Jessell TM et al (2000) Principles of neural science, vol 4. McGraw-Hill, New YorkGoogle Scholar
- Franklin DW, Wolpert DM (2011) Computational mechanisms of sensorimotor control. Neuron 72(3):425–442View ArticleGoogle Scholar
- LaValle SM (2006) Planning algorithms. Cambridge University Press, New YorkView ArticleMATHGoogle Scholar
- Murray RM, Sastry SS (1994) A mathematical introduction to robotic manipulation. CRC Press, Boca RatonMATHGoogle Scholar
- Rizzi AA, Koditschek DE (1996) An active visual estimator for dexterous manipulation. IEEE Trans Robot Autom 12(5):697–713View ArticleGoogle Scholar
- Wang C, Lin CY, Tomizuka M (2014) Statistical learning algorithms to compensate slow visual feedback for industrial robots. ASME J Dyn Syst Meas Control 137(3):031011View ArticleGoogle Scholar
- Jeon S, Tomizuka M, Katou T (2009) Kinematic kalman filter (kkf) for robot end-effector sensing. ASME J Dyn Syst Meas Control 131(2):021010View ArticleGoogle Scholar
- Hebert P, Hudson N, Ma J, Burdick J (2011) Fusion of stereo vision, force-torque, and joint sensors for estimation of in-hand object location. IEEE Int Conf Robot Autom 1:5935–5941Google Scholar
- Allen PK, Miller AT, Oh PY, Leibowitz BS (1997) Using tactile and visual sensing with a robotic hand. IEEE Int Conf Robot Autom 1:676–681Google Scholar
- Kamiyama K, Kajimoto H, Kawakami N, Tachi S (2004) Evaluation of a vision-based tactile sensor. IEEE Int Conf Robot Autom 2:1542–1547Google Scholar
- Prats M, Sanz PJ, Del Pobil AP (2009) Vision-tactile-force integration and robot physical interaction. IEEE Int Conf Robot Autom 1:3975–3980Google Scholar
- Kragic D, Crinier S, Brunn D, Christensen HI (2003) Vision and tactile sensing for real world tasks. IEEE Int Conf Robot Autom 1:1545–1550Google Scholar
- Namiki A, Komuro T, Ishikawa M (2002) High-speed sensory-motor fusion based on dynamics matching. Proc IEEE 90(7):1545–1550View ArticleGoogle Scholar
- Mason M (2001) Mechanics of robotic manipulation. MIT Press, CambridgeGoogle Scholar
- Titterton D, Weston J (2004) Strapdown inertial navigation technology, vol 17, 2nd edn. Radar, Sonar, Navigations and Avionics-IET, StevenageView ArticleGoogle Scholar
- Bar-Itzhack IY, Harman RR, Choukroun D (2002) State-dependent pseudo-linear filters for spacecraft attitude and rate estimation. In: AIAA Guidance, Navigation, and Control Conference and ExhibitGoogle Scholar
- Leung W, Damaren C (2004) A comparison of the pseudo-linear and extended kalman filters for spacecraft attitude estimation. In: AIAA Guidance, Navigation, and Control Conference and ExhibitGoogle Scholar
- Brogliato B (1999) Nonsmooth mechanics: models. Dynamics and control. Springer, LondonView ArticleMATHGoogle Scholar