Vision for Robotics [In progress]

Prerequisites
To get the most out of this Vision for Robotics module, it’s helpful to have:
Basic Mathematics
- Familiarity with trigonometry (sine, cosine, angle addition formulas).
- Understanding of linear algebra (vectors, matrices, basic matrix operations).
- Comfort with calculus (especially differentiation).
While you don’t need to be an expert in any one of these areas, having a comfortable grasp of each will make your study of vision for robotics more productive and enjoyable.
If you’d like a refresher on linear algebra, the following YouTube series is an excellent resource.
General Motivation
Cameras have become one of the most accessible and data-rich sensors for robots, offering a wealth of visual information compared to traditional positioning or distance sensors. Advances in hardware and algorithms, such as RGB-D cameras and visual-inertial fusion techniques, have significantly improved robot perception. In navigation, robots use vision to detect obstacles, estimate trajectories, and build 3D maps of their environment. For grasping, visual data helps identify objects, estimate their pose, and determine how to interact with them. The following sections will explore the geometric foundations of 3D vision and its applications in robotic grasping.
The following videos demonstrate an application of vision in robotics.
Course Content
Introduction
Welcome to this introduction on how a camera projects the three-dimensional (3D) world onto a two-dimensional (2D) image plane. We will discuss how to describe a point in 3D space with respect to a camera coordinate system and how these 3D points get projected into pixel coordinates on an image. We will then move on to intrinsic calibration and the issue of lens distortion.
By the end of this section, you should understand:
- How a 3D point is projected onto a 2D image plane using the pinhole camera model.
- The role of intrinsic and extrinsic camera parameters in this projection process.
- How lens distortion affects images and how it is mathematically modeled.
- How to perform camera calibration to recover intrinsic parameters.
- How to estimate the pose of a camera using known 3D landmarks (the PnP problem).
- How to use Structure from Motion (SfM) for sparse 3D reconstruction from video.
- How 3D vision techniques apply to robot navigation and grasping tasks.
We will keep the mathematical notation to a minimum but include enough details to grasp the core ideas. Small exercises are included to reinforce these concepts.
This course closely follows Chapter 32, 3-D Vision for Navigation and Grasping, of the Springer Handbook of Robotics, which can be read below.
Here are two introductory videos to help you understand the core problem.
Conceptual questions
Question 1: A short exposure / high shutter speed minimises motion‑blur, but it also means you need stronger lighting to obtain a clear image
Question 2: For best contrast, you should always illuminate a coloured part with LEDs of the same colour as the part (e.g., red part → red light).
Question 3: Increasing the camera’s megapixel count always yields better results in high‑speed robot pick‑and‑place applications.
Question 4: Roughly 70 % of a successful vision application depends on the proper choice of…
Question 5: Which camera type integrates sensor and on‑board processing in the same small housing?
Geometric Vision
Before we dive into algorithms and code, we first need a picture of how geometry, cameras, and images fit together. This section lays that foundation. We will
- build the two coordinate systems every vision problem starts with (world vs. camera),
- see how a simple rotation + translation moves points from one frame to the other,
- follow each 3-D point through the pinhole projection onto the image plane and on to actual pixel indices,
- introduce the five classic intrinsic parameters and the common lens-distortion model,
- and finish by explaining what it really means to have a calibrated camera.
The short video below previews these ideas visually. The text that follows walks through the maths step‑by‑step with conceptual questions so you can test your understanding as you go.
Transforming From World Coordinates to Camera Coordinates
Suppose there is a point in the real world, denoted as $(X,Y,Z)$. In order to describe how this point appears to a camera, we need to specify its location relative to the camera’s coordinate system. Usually, we place the camera coordinate system at its center of projection (roughly at the camera’s pinhole or main lens center) such that the $Z$-axis goes straight out from the camera (the optical axis).
Let:
- $X_{world}=(X,Y,Z)^T$ be the coordinates of the point in the world’s coordinate system.
- $\mathbf{X}_{ci}=(X_{ci},Y_{ci},Z_{ci})^T$ be the coordinates of the same point in camera $ci$’s coordinate system.
The two sets of coordinates are related by:
$$ \begin{bmatrix} X_{ci} \cr Y_{ci} \cr Z_{ci} \end{bmatrix} = R_i \begin{bmatrix} X \cr Y \cr Z \end{bmatrix} + T_i, $$
where:
- $R_i$ is a $3 \times 3$ rotation matrix describing how the axes of the world coordinate system relate to the camera’s axes. Because it’s a rotation matrix, $R_i^T R_i = I$ and $\det(R_i) = 1$.
- $T_i$ is a translation vector describing the shift from the camera’s origin to the world’s origin (or vice versa, depending on convention).
This transformation says:
“Take the point in world coordinates, rotate it so that the axes align with those of the camera, then translate it so the camera’s center is at the origin.”
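To make this concrete, here is a minimal NumPy sketch of the transformation; the values of $R_i$ (a rotation about the $Z$-axis) and $T_i$ are made up purely for illustration.

```python
import numpy as np

# Illustrative extrinsics: rotate 90 deg about the Z-axis, then shift by 2 m.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
T = np.array([0.0, 0.0, 2.0])

X_world = np.array([1.0, 0.5, 4.0])   # point in world coordinates
X_cam = R @ X_world + T               # X_ci = R_i X_world + T_i
print(X_cam)                          # -> [-0.5  1.   6. ]
```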
Conceptual questions
Question 1: The transformation from world coordinates to camera coordinates involves both a rotation and a translation.
Projection Onto the Image Plane
In the classical pinhole camera model, we project a 3D point $X_{ci} = (X_{ci}, Y_{ci}, Z_{ci})$ onto a 2D image plane. Typically, we assume the image plane is at $Z_{ci} = 1$. (In reality, camera sensors sit behind the pinhole/center of projection by some distance, but mathematically it is simpler to place a plane in front.)
If $\mathbf{X}_{ci}$ lies in front of the camera, the normalized image coordinates $(x_i, y_i)$ (before going into actual pixel coordinates) are:
$$ x_i = \frac{X_{ci}}{Z_{ci}}, \quad y_i = \frac{Y_{ci}}{Z_{ci}}. $$
The quantities $x_i$ and $y_i$ are often called normalized coordinates because we have divided by $Z_{ci}$.
Intuitive Explanation: Think of rays of light traveling from the 3D point in the scene, through the camera center, to the image plane. The intersection of that ray with the image plane is how you figure out the 2D image location. Mathematically, it boils down to dividing by $Z_{ci}$ in the simplest pinhole model.
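In code, the projection step is a single division by depth. A small sketch, assuming the point below is already expressed in the camera frame with $Z_{ci} > 0$:

```python
import numpy as np

X_cam = np.array([-0.5, 1.0, 6.0])    # (X_ci, Y_ci, Z_ci) in the camera frame
x, y = X_cam[:2] / X_cam[2]           # x_i = X_ci / Z_ci, y_i = Y_ci / Z_ci
print(x, y)                           # normalized image coordinates
```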
Conceptual questions
Multiple choice (choose all statements that are correct):
From Normalized Coordinates to Pixel Coordinates
In a real camera, the image you get consists of pixels indexed by $(u_i, v_i)$. To bridge the gap between the continuous $(x_i, y_i)$ and discrete pixel $(u_i, v_i)$, we often use an affine transformation:
$$ u_i = f \, \alpha \, x_i + \beta \, y_i + c_u, \quad v_i = f \, y_i + c_v. $$
Let’s break down these parameters:
- $f$: The focal length in pixels. It combines the physical focal length (in millimeters) and the sensor’s pixel size (in millimeters per pixel).
- $\alpha$: The aspect ratio, allowing for rectangular (non-square) pixels or different horizontal vs. vertical sampling rates.
- $\beta$: The skew factor. In an ideal camera, $\beta$ is zero. In real cameras where the sensor or read-out lines might be slightly tilted, $\beta$ can model that small shear.
- $(c_u, c_v)$: The principal point, or image center. It is where the optical axis (the camera’s $Z$-axis) intersects the image plane, expressed in pixel coordinates.
These parameters are called the intrinsic parameters of the camera. Determining them precisely is known as intrinsic calibration (how to find them is covered in the Calibration section below).
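These five parameters are usually collected into a $3 \times 3$ intrinsic matrix $K$, so the mapping becomes a matrix product with the homogeneous normalized coordinates. A sketch with illustrative values (not from any real camera):

```python
import numpy as np

f, alpha, beta = 800.0, 1.0, 0.0      # focal length [px], aspect ratio, skew
cu, cv = 320.0, 240.0                 # principal point [px]

K = np.array([[f * alpha, beta, cu],  # u = f*alpha*x + beta*y + cu
              [0.0,       f,    cv],  # v = f*y + cv
              [0.0,       0.0,  1.0]])

x, y = -0.0833, 0.1667                # normalized coordinates
u, v, _ = K @ np.array([x, y, 1.0])
print(u, v)                           # pixel coordinates
```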
Conceptual questions
Question 1: The conversion from normalized coordinates to pixel coordinates involves intrinsic parameters such as the focal length, aspect ratio, skew factor, and the image center.
Question 2: In the affine transformation $u_i = f \, \alpha \, x_i + \beta \, y_i + c_u$, which parameter determines the horizontal position of the image center?
Lens Distortion
Many practical camera systems, especially with wide-angle or fisheye lenses, introduce significant radial distortion. If you have ever seen lines near the edges of a photo curve outward (“barrel distortion”) or inward (“pincushion distortion”), that is due to lens imperfections.

A common way to model this is by adding polynomial correction terms that depend on $(r^2, r^4, r^6, \dots)$, where $r^2 = x_i^2 + y_i^2$. Thus, the distorted coordinates $(x_i^{\text{dist}}, y_i^{\text{dist}})$ become something like:
$$ x_i^{\text{dist}} = x_i \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots \right), $$
$$ y_i^{\text{dist}} = y_i \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots \right). $$
The coefficients $k_1, k_2, k_3, \dots$ are additional parameters to be calibrated, especially for wide-angle lenses.
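A short sketch of this distortion model; the coefficients below are invented for illustration only (real values come out of calibration):

```python
import numpy as np

def distort(x, y, k1, k2, k3=0.0):
    """Apply the polynomial radial-distortion model to normalized coords."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    return x * factor, y * factor

# Hypothetical barrel-distortion coefficients (k1 < 0 pulls points inward).
xd, yd = distort(0.3, -0.2, k1=-0.25, k2=0.05)
print(xd, yd)
```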
Conceptual questions
Question 1: Radial lens distortion is modeled by applying a polynomial function to the normalized coordinates based on their distance from the image center.
Question 2: In the context of lens distortion, what does the variable r represent?
Putting It All Together: Calibrated Systems
When we say a system is calibrated, it typically means:
- We know the intrinsic parameters ($f, \alpha, \beta, c_u, c_v, k_1, k_2, \dots$).
- We know how the camera is positioned in some external coordinate system (its rotation $R_i$ and translation $T_i$), known as the extrinsic parameters.
Once we have done an intrinsic calibration (which can be done using a checkerboard pattern or known calibration target) and accounted for distortion, we can confidently map between:
- 3D coordinates $\boldsymbol{X}_{ci}$ in the camera’s frame
- 2D pixel measurements $(u_i, v_i)$
This is critical for many robotic tasks such as navigation, obstacle avoidance, object tracking, and grasping, since everything eventually must go from real-world distances and geometry to image pixel coordinates.
Conceptual questions
Question 1: A calibrated camera system requires knowing both its intrinsic parameters (e.g., focal length, skew, distortion coefficients) and its extrinsic parameters (e.g., rotation and translation relative to the world).
Question 2: What is the main benefit of calibrating a camera system in the context of robotic vision?
Advanced Mathematical Exercises
Exercise 1:
- Determine the intrinsic parameter matrix ($K$) of a digital camera with an image size of 640×480 pixels and a horizontal field of view of 90°.
- Assume square pixels and the principal point at the center of the image (the intersection of the diagonals).
- What is the vertical field of view?
- What is the projection on the image plane of $^{c}P = [1, 1, 2]^T$?
Solution
Exercise 2:
Calibration
Camera calibration is the process by which we determine a camera’s intrinsic parameters (like focal length, principal point, and distortion coefficients) and extrinsic parameters (its position and orientation with respect to some world reference). A well-calibrated camera allows us to accurately map between real-world 3D coordinates and 2D image pixels, which is essential for tasks like navigation, 3D reconstruction, and robotic grasping.
As we saw in the previous sections, the pinhole camera model provides a neat mathematical description of how a point in 3D $(X,Y,Z)$ maps to a pixel coordinate $(u,v)$. However, real cameras have additional nuances:
- Focal length and principal point need to be estimated precisely (intrinsic calibration).
- Lens distortion can bend straight lines or enlarge/shrink certain regions (distortion calibration).
- Camera pose (rotation and translation) with respect to a world coordinate system may be unknown (extrinsic calibration).
Calibration is about figuring out all these parameters so that the projection model in your equations matches the actual camera you are using.
Basic Setup: Intrinsic Calibration
When the camera’s internal parameters remain constant (no zooming in/out) and you can take multiple images of a known reference pattern (e.g., a checkerboard), you can use common methods or toolboxes (e.g., the MATLAB Camera Calibration Toolbox, or OpenCV’s calibration functions based on Zhang’s method) to recover the following:
| Camera Parameters | Description | Symbol |
|---|---|---|
| Intrinsic Parameters | Define the camera’s internal characteristics. | |
| Focal length | Determines the scale of projection. | $ f $ |
| Principal point | The optical center of the image. | $ (c_u, c_v) $ |
| Skew factor | Accounts for potential shearing. | $ \beta $ |
| Aspect ratio | Accounts for pixel shape differences. | $ \alpha $ |
| Extrinsic Parameters | Define the camera’s position and orientation in the world. | |
| Rotation matrix | Describes the camera’s orientation. | $ R $ |
| Translation vector | Specifies the camera’s position relative to a reference frame. | $ T $ |
After calibration, the hope is that for any future image, you can “correct” lens distortions and map each pixel to the corresponding ideal pinhole-ray direction.
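For reference, here is a minimal, hedged sketch of such a checkerboard session with OpenCV; the image folder and the 9×6 inner-corner pattern size are placeholder assumptions for your own setup.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                  # inner corners per row/column
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # board frame

obj_points, img_points, size = [], [], None
for fname in glob.glob("calib_images/*.png"):     # placeholder path
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        size = gray.shape[::-1]                   # (width, height)

# Recovers K (intrinsics), distortion coefficients, and per-view extrinsics.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
print("reprojection error:", ret)
print("K =\n", K, "\ndistortion:", dist.ravel())
```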
Conceptual questions
Question 1: What are the typical intrinsic parameters we aim to find when calibrating a camera?
Question 2: During standard checkerboard-based calibration, we rely on known 3D positions (in the checkerboard reference) of the corners and their measured 2D positions in the images to solve for the camera’s intrinsic parameters.
Varying Intrinsics and Self-Calibration
Not all systems allow us to fix the camera intrinsics. For example, if the focal length can vary (zoom lenses) or if you cannot practically use a known reference pattern in the field, you might need more advanced methods:
- Self-calibration methods (such as the approach by Pollefeys et al.) rely on multiple views of unknown scenes. They track corresponding features across images and use constraints like the Kruppa equations to solve for the camera intrinsics and distortion.
- Stratified self-calibration typically requires at least three views and uses epipolar geometry and projective transformations to recover a consistent set of intrinsic parameters across all images.
Projection Matrix Form and Depth Elimination
Once we include lens distortion (and possibly correct it), the “ideal” pinhole mapping can be summarized in matrix form (assuming we now talk about undistorted, ideal pixel coordinates). Denote:
- $\mathbf{u}_i = (u_i, v_i, 1)^T$ as the homogeneous pixel coordinate of a point in the $i$-th image.
- $X = (X, Y, Z, 1)^T$ as the homogeneous coordinate of a world point.
Then, for camera $i$, we have: $$ \lambda_i \, u_i = K_i \begin{bmatrix} R_i & T_i \end{bmatrix} X $$ where:
- $\lambda_i = Z_{ci}$ is the depth of the point relative to camera $i$,
- $K_i$ is the $3 \times 3$ matrix of intrinsic parameters,
- $R_i$ and $T_i$ describe the rotation and translation from the world coordinate system to camera $i$’s coordinate system,
- the product $\begin{bmatrix} R_i & T_i \end{bmatrix}$ is often called the extrinsic part.
Because $\lambda_i$ is just a scalar, you can rearrange or eliminate it, leading to two main equations that relate the world coordinates $X$ and the pixel coordinates $u_i$. These become the basis for solving calibration problems in practice.
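As a concrete sketch, here is the full chain $\lambda_i \mathbf{u}_i = K_i [R_i \mid T_i] X$ in NumPy, with illustrative numbers; note how dividing by the third component removes the depth $\lambda_i$.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera axes aligned with the world
T = np.array([[0.0], [0.0], [2.0]])    # world origin 2 m in front of the camera

P = K @ np.hstack([R, T])              # 3x4 projection matrix
X = np.array([0.5, -0.25, 3.0, 1.0])   # homogeneous world point

lam_u = P @ X                          # lambda_i * u_i
u, v = lam_u[:2] / lam_u[2]            # divide out the depth lambda_i = Z_ci
print(u, v)                            # -> 400.0 200.0
```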
Exercises
Conceptual questions
Question 1: If a camera has perfectly square sensor pixels, the aspect‑ratio parameter α in its intrinsic matrix equals 1.
Question 2: When you zoom a camera lens during operation, which intrinsic parameter is most directly affected?
Question 3: If the matrix $K_i$ is unknown, how many different images of a known pattern are typically required to solve for these intrinsic parameters in a standard calibration method?
Question 4: Select all quantities that are typically categorised as extrinsic parameters.
Question 5: Radial-distortion coefficients $k_1, k_2, \dots$ are applied before we divide by depth $Z_{ci}$ when computing normalized image coordinates.
Question 6: Match each intrinsic term to the effect it compensates for.
| Parameter | Effect on image |
|---|---|
| Skew $\beta$ | → |
| Focal length $f$ | → |
| Principal point $(c_u, c_v)$ | → |
Question 7: Your calibration image is 1280 × 720 pixels and you assume the principal point is exactly in the centre. What value should you enter for cu?
Key Takeaway:
- Once you know $K_i$, $R_i$, and $T_i$, you can project any 3D point in the world straight into the 2D image.
- Calibration is about finding all those parameters so that 2D–3D correspondences match reality.
Advanced Mathematical Development
The following explores in more mathematical depth how to find the camera parameters.
Pose Estimation (PnP)
Once a camera is calibrated (i.e., we know its intrinsic parameters and can handle or correct for any lens distortion), we can tackle the problem of finding the camera’s extrinsic parameters (its rotation and translation) relative to known objects or landmarks in the world. This is often referred to as the Pose Estimation problem.
In many robotics tasks, we know the 3D coordinates of certain points in the environment (so-called landmarks or feature points) and we can detect their corresponding locations in the image. The goal is to solve for the camera’s exact position and orientation that makes those correspondences match the real world.
Here is a YouTube video giving a short overview of the pose estimation problem and how to solve it.
The PnP (Perspective-n-Point) Problem
Suppose you have:
- $N$ known 3D points in the world: $X_j = (X_j, Y_j, Z_j)$
- Their corresponding 2D points in the calibrated image: $x_j = (x_j, y_j)$
where the camera has already been calibrated, and any lens distortions are accounted for or removed. The PnP problem is to find a rotation matrix $R$ and a translation vector $T$ such that, for each 3D–2D match, the pinhole projection equation is satisfied: $$ z_j \begin{bmatrix} x_j \cr y_j \cr 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_j \cr Y_j \cr Z_j \cr 1 \end{bmatrix}, $$
where $K$ is the intrinsic matrix and $z_j$ is the point’s depth along the camera’s Z-axis. In simpler words:
“Given N known 3D points and their 2D images, recover the camera’s orientation and position.”
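In practice, the PnP problem is usually handed to a library solver. A hedged sketch with OpenCV’s `solvePnP`; the landmark coordinates, pixel measurements, and intrinsics below are placeholders.

```python
import cv2
import numpy as np

# Four known landmarks (a unit square on the ground plane) and their pixels.
object_pts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float64)
image_pts = np.array([[320, 240], [400, 238], [402, 320], [318, 322]], dtype=np.float64)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
dist = np.zeros(5)                     # distortion assumed already corrected

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)             # rotation vector -> rotation matrix R
print(R)
print(tvec.ravel())                    # translation T
```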
Minimal Example: 3 Points
When only three world points are visible we are in the Perspective‑3‑Point (P3P) setting – the smallest data set that still lets us compute a full camera pose.
Geometric Setup
Define:
- $d_i = |X_i - C|$ (the distance from the camera centre $C$ to point $X_i$),
- $d_{ij} = |X_i - X_j|$ (the known side lengths of the landmark triangle),
- $\cos \delta_{ij} = x_i^\top x_j$ (the measured angle between the image rays).
Because the rays and the segment $X_i X_j$ form a triangle, the Law of Cosines gives (for every $i \ne j$):
$$ d_i^2 + d_j^2 - 2 d_i d_j \cos \delta_{ij} = d_{ij}^2 \tag{1} $$
There are three such equations (one per edge of the landmark triangle) in the three unknowns $d_1, d_2, d_3$.

Fig. 1 The 3-point pose-estimation problem. Unknown camera–point distances $d_1, d_2, d_3$ and known inter-point distances $d_{12}, d_{13}, d_{23}$. The angles $\delta_{ij}$ between bearing rays are measured in the image.
Reducing the Unknowns
A classical trick (Gröbner-free) is to express two of the depths in terms of the first one:
$$ d_2 = u \, d_1, \quad d_3 = v \, d_1 \quad (u, v > 0). $$
Insert those into (1) and divide by $d_1^2$:
$$ d_{12}^2 = d_1^2 \left( u^2 + 1 - 2u \cos \delta_{12} \right), $$
$$ d_{13}^2 = d_1^2 \left( v^2 + 1 - 2v \cos \delta_{13} \right), $$
$$ d_{23}^2 = d_1^2 \left( u^2 + v^2 - 2uv \cos \delta_{23} \right). \tag{2} $$
Equation (2) immediately yields three expressions for the same $d_1^2$.
Equating any two of them eliminates $d_1$ and leaves a system of two quadratic equations in the two variables $(u, v)$:
$$ d_{12}^2 \left( v^2 + 1 - 2v \cos \delta_{13} \right) = d_{13}^2 \left( u^2 + 1 - 2u \cos \delta_{12} \right), $$
$$ d_{13}^2 \left( u^2 + v^2 - 2uv \cos \delta_{23} \right) = d_{23}^2 \left( v^2 + 1 - 2v \cos \delta_{13} \right). \tag{3} $$
Now you need to:
- Solve the second equation of (3) linearly for $u^2$.
- Substitute that expression into the first equation of (3).
The result is a single fourth-degree polynomial in $v$, which can have up to four real roots.
For every admissible root $v$:
- compute $u$ from the quadratic substitution,
- recover $d_1, d_2, d_3$,
- keep only solutions where all depths are positive (points must be in front of the camera).
Because each quadratic step can produce two signs, you obtain at most 8 real pose candidates – the well-known P3P eight-fold ambiguity.
From Depths to $R$ and $T$
Once the depths $\{d_i\}$ are known, the 3-D coordinates of the landmarks in the camera frame are
$$ X_i^{\text{cam}} = d_i \, x_i. $$
You now possess two 3-point sets:
| frame | point 1 | point 2 | point 3 |
|---|---|---|---|
| World | $X_1$ | $X_2$ | $X_3$ |
| Camera | $d_1 x_1$ | $d_2 x_2$ | $d_3 x_3$ |
Compute $R$, $T$ that best align the world set to the camera set.
That is the classic absolute orientation problem and has a closed-form SVD solution (Horn 1987):
$$ \min_{R \in \text{SO}(3), \, T} \sum_{i=1}^{3} \left| d_i x_i - (R X_i + T) \right|^2. $$
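A minimal NumPy sketch of this closed-form alignment (Kabsch/Horn style), assuming both point sets are stacked as rows of $(n, 3)$ arrays with $n \ge 3$:

```python
import numpy as np

def absolute_orientation(X_world, X_cam):
    """Closed-form R, T such that X_cam ≈ R @ X_world + T (SVD solution)."""
    cw, cc = X_world.mean(axis=0), X_cam.mean(axis=0)
    H = (X_cam - cc).T @ (X_world - cw)             # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    R = U @ D @ Vt
    return R, cc - R @ cw

# Usage for P3P: X_cam rows are d_i * x_i, X_world rows are the landmarks X_i.
```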
Additional content
For a more thorough lecture on PnP, you can watch the following YouTube videos.
Part 1:
Part 2:
Part 3:
Triangulation
Goal of this section: show how a single 3-D point can be rebuilt from (at least) two calibrated images by intersecting the two sight rays that pass through the image measurements.
Video explanation
Here is a good YouTube video explaining in a more visual way how to do triangulation. Note that the notation in this video may not follow the notation of this course, but the method is the same.
1 Why one view is never enough
With a single pinhole camera you can only say that the 3-D point lies somewhere on a ray that starts in the camera centre and passes through the pixel $u_1$. Mathematically (homogeneous notation)
$$ \lambda_1 u_1 = P_1 \begin{bmatrix} X \cr Y \cr Z \cr 1 \end{bmatrix} $$
where $P_1 = K_1\,[\,R_1\mid T_1]$ and $\lambda_1 = Z_{c1}$ is the unknown depth.
2 Two eyes give you depth
Add a second calibrated view
$$ \lambda_2 u_2 = P_2 \begin{bmatrix} X \cr Y \cr Z \cr 1 \end{bmatrix} $$
Stack the two equations and eliminate the depths.
Let $(x, y)$ be the measured pixel coordinates in the corresponding view and $P_i^{(k)}$ the $k$-th row of $P_i$:
$$ \underbrace{\begin{bmatrix} x\,P_1^{(3)}-P_1^{(1)} \cr y\,P_1^{(3)}-P_1^{(2)} \cr x\,P_2^{(3)}-P_2^{(1)} \cr y\,P_2^{(3)}-P_2^{(2)} \end{bmatrix}}_{\displaystyle A} \begin{bmatrix} X \cr Y \cr Z \cr 1 \end{bmatrix}=0. $$
In practice the two image rays do not intersect perfectly because of measurement noise, so matrix $A$ has full rank 4. The best estimate of $X$ is therefore the right-singular vector of $A$ corresponding to its smallest singular value (Direct Linear Triangulation, or DLT). A minimal implementation is sketched after the recipe below.
Quick recipe (DLT)
- Undistort and normalise the two image points.
- Build the $4\times4$ matrix $A$ in (2).
- Run an SVD $A=U\Sigma V^\top$ and take the last column of $V$.
- De-homogenise to obtain $(X,Y,Z)$.
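A minimal NumPy implementation of this recipe (assuming the pixel measurements have already been undistorted):

```python
import numpy as np

def triangulate_dlt(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; u1, u2: measured (x, y) image points."""
    A = np.vstack([u1[0] * P1[2] - P1[0],
                   u1[1] * P1[2] - P1[1],
                   u2[0] * P2[2] - P2[0],
                   u2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                # right-singular vector of the smallest singular value
    return X[:3] / X[3]       # de-homogenise to (X, Y, Z)
```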
3 Epipolar sanity check
The pair $(u_1,u_2)$ must satisfy the epipolar constraint
$$ \mathbf u_2^{\top}\,E\,\mathbf u_1 \;=\;0 $$
with $E$ the essential matrix built from the relative pose $[R\mid T]$ of the two cameras. If that constraint is violated the two rays can never meet and the SVD in (2) will merely return the least-squares compromise.
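A quick sketch of that sanity check, assuming normalized homogeneous image points and a known relative pose $[R \mid T]$; the residual should be close to zero for a valid correspondence (the acceptable threshold depends on your noise level):

```python
import numpy as np

def epipolar_residual(u1, u2, R, T):
    """|u2^T E u1| with E = [T]_x R, for homogeneous normalized points."""
    Tx = np.array([[0.0, -T[2],  T[1]],
                   [T[2],  0.0, -T[0]],
                   [-T[1], T[0],  0.0]])  # skew-symmetric matrix [T]_x
    E = Tx @ R                            # essential matrix
    return abs(u2 @ E @ u1)
```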
4 Numerical hints
- Centre and scale image measurements before forming $A$ (improves conditioning).
- A point that is very close to both cameras gives a tiny $Z$—beware of dividing by a noisy depth.
- Use more than two images whenever possible; each extra view adds two more rows to $A$ and makes the SVD solution more robust.
Exercises
Conceptual questions
True / False
- A single calibrated image suffices to recover a point’s depth.
Multiple choice
- The linear system $A\,X=0$ admits a finite 3-D solution only when …
Fill-in
- Each extra view adds __ new independent equations for the same 3-D point.
Mathematical questions
Two pin-hole cameras share
$$ K=\begin{bmatrix}800&0&320 \cr 0&800&240 \cr 0&0&1\end{bmatrix}. $$
Left camera: $P_1 = K[I\mid 0]$
Right camera: $P_2 = K[I\mid (-0.1,0,0)^\top]$
Measurements: $\mathbf u_1=(340,240)$ px in the left image and $\mathbf u_2=(300,240)$ px in the right image.
Estimate the 3-D coordinates of the observed point.
Hint: follow the four-step DLT recipe above.
Solution (sketch)
Normalised points $\tilde{x}_1 = (0.025, 0, 1)^\top$, $\tilde{x}_2 = (-0.025, 0, 1)^\top$. The disparity is $d = 40$ px with baseline $b = 0.1$ m, so the depth is $Z = f\,b/d = 800 \cdot 0.1 / 40 = 2.0$ m. Build $A$, run the SVD $\Rightarrow$ $X \approx (0.05, 0, 2.0)$ m in the left-camera frame.
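You can check this numerically, e.g. with OpenCV’s linear triangulation (a sketch using the exercise’s numbers):

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # left camera
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])  # right camera

u1 = np.array([[340.0], [240.0]])   # 2x1 measurement in the left image
u2 = np.array([[300.0], [240.0]])   # 2x1 measurement in the right image

Xh = cv2.triangulatePoints(P1, P2, u1, u2)  # 4x1 homogeneous result
print((Xh[:3] / Xh[3]).ravel())             # -> approx [0.05 0.   2.  ]
```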
Key Takeaway:
- Triangulation turns 2-D correspondences into 3-D positions once the two projection matrices are known.
- A clean linear formulation (DLT) relies on simple linear-algebra tools (SVD).
- The geometry behind it is nothing more than “find the intersection of two lines in space”—but careful algebra keeps that intersection stable in the presence of noise.
Moving Stereo
Goal of this section: a stereo rig gives you two eyes on the world.
A moving stereo rig (left camera $c_\ell$, right camera $c_r$) straps those eyes to a robot that moves.
This upgrade turns a fixed-baseline depth sensor into a rolling 3-D scanner and a self-motion estimator – the backbone of many visual-odometry and SLAM systems such as libviso2 and its successors.
Video explanation
Here are two good YouTube videos explaining in a more visual way how moving stereo works. Note that the notation in these videos may not follow the notation of this course, but the method is the same.
1 What exactly is “moving stereo”?
At time $k$ the rig observes some 3-D scene point $X$ in both cameras
$$ u_{\ell,k} = P_\ell\,X_k , \qquad u_{r,k} = P_r\,X_k , $$
where $P_\ell,P_r$ are the (known, calibrated) projection matrices of the fixed left–right pair.
At time $k+1$ the whole rig has moved by the rigid transform $(R_{k+1},T_{k+1})$, so the same world point now has camera-frame coordinates
$$ X_{k} \;=\; R_{k+1}\,X_{k+1} + T_{k+1}. \tag{1} $$
Once we know pairs $\{X_k, X_{k+1}\}$ we can solve (1) for the unknown pose $(R_{k+1},T_{k+1})$ – i.e. the robot’s motion between the two instants.
2 Two correspondence problems instead of one
To use (1) we need each 3-D point twice:
- Left ↔ Right at the same time → disparity → depth ⇒ $X_k$ (classic stereo triangulation).
- Left (k) ↔ Left (k+1) (or right↔right) → optical flow / feature tracking → cross-time matches.
Only then can we plug two metric 3-D clouds into (1).
3 Estimating the rig motion: Absolute orientation
Write the mean-free coordinates of the cross-time matches as
$$ \bar X_k \;=\; X_k - \frac{1}{n}\sum_{i=1}^{n} X_k^{(i)},\qquad \bar X_{k+1} \;=\; X_{k+1} - \frac{1}{n}\sum_{i=1}^{n} X_{k+1}^{(i)}. $$
Stack the $n$ pairs into the $n\times3$ matrices $A_k,A_{k+1}$; then minimise the Frobenius norm
$$ \min_{R\in\mathrm{SO}(3)} \;\bigl\lVert A_{k+1} - R\,A_k \bigr\rVert_F . $$
This Procrustes problem has the closed-form SVD solution
$R = U\,\mathrm{diag}(1,1,\det(UV^\top))\,V^\top$ where $U\Sigma V^\top = A_{k+1}^\top A_k$.
The translation follows from the centroids.
In practice we wrap the whole thing in RANSAC: draw minimal 3-point samples, compute $(R,T)$, count inliers, repeat.
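A toy sketch of that RANSAC loop around the minimal 3-point Procrustes solver; the inlier threshold and iteration count below are indicative values only.

```python
import numpy as np

def fit_rigid(src, dst):
    """Closed-form R, T with dst ≈ R @ src + T (points stacked as rows)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (dst - cd).T @ (src - cs)
    U, _, Vt = np.linalg.svd(H)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    return R, cd - R @ cs

def ransac_pose(X_next, X_prev, iters=150, thresh=0.05, seed=0):
    """Estimate the rig motion in (1): X_k ≈ R X_{k+1} + T, despite outliers."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)
    for _ in range(iters):
        idx = rng.choice(len(X_prev), size=3, replace=False)  # minimal sample
        R, T = fit_rigid(X_next[idx], X_prev[idx])
        err = np.linalg.norm(X_prev - (X_next @ R.T + T), axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best[2]:
            best = (R, T, inliers)
    return best   # (R, T, inlier count)
```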
4 Depth-on-frame or depth-once?
Triangulating every point at every frame is expensive.
An alternative is pose-only refinement:
- Track 2-D features across time in one eye (say the left image).
- Triangulate them once from the left–right disparity at time $k$.
- Estimate $(R,T)$ that aligns those 3-D points with their 2-D re-projections in the next left frame (a 3-D ↔ 2-D PnP problem).
Both routes lead to the same cost function in bundle adjustment; the choice is mostly engineering.
5 Why moving stereo beats monocular VO
- True scale – the fixed baseline recovers metric depth ⇒ no scale drift.
- Better convergence – having depth before motion estimation shrinks the search space.
- Robustness – parallax from the baseline adds “sideways” view change even on straight trajectories.
6 Implementation tips
| Step | Practical hint |
|---|---|
| Feature detection | FAST + ORB descriptors give many robust matches at real-time rates. |
| Left-right matching | Slide along the same image row (epipolar line) – rectification makes this 1-D. Reject matches with negative or too-large disparity. |
| Triangulation | Use linear DLT, but filter out points with depth $Z<0$ or large condition number. |
| Temporal tracking | Bucketing + KLT keeps features evenly spread and cheap to track. |
| Pose RANSAC | 3-point Procrustes is the minimal set. Run $\sim$150 iterations per frame pair. |
| Bundle adjustment | A sliding window of $5$–$10$ keyframes keeps the problem small yet corrects drift. |
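As a hedged illustration of the first two table rows, here is an ORB-based left–right matching sketch with OpenCV; the image file names are placeholders, and the same-row check assumes a rectified pair.

```python
import cv2

img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder paths
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp_l, des_l = orb.detectAndCompute(img_l, None)
kp_r, des_r = orb.detectAndCompute(img_r, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_l, des_r)

# On a rectified pair, matches should lie on (almost) the same image row and
# have positive disparity; reject everything else.
good = [m for m in matches
        if abs(kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1]) < 2.0
        and kp_l[m.queryIdx].pt[0] - kp_r[m.trainIdx].pt[0] > 0.0]
print(len(good), "stereo matches kept")
```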
Conceptual questions
True / False
Question 1: A moving-stereo rig can recover metric motion scale without any extra sensors.
Multiple choice
Question 2: What is the minimal number of 3-D point correspondences required to solve the Procrustes absolute-orientation problem?
Key takeaway
A rigid stereo pair that moves gives you the best of two worlds:
- Stereo provides instant depth at each frame.
- Motion across frames gives you odometry.
By tying them together through absolute orientation, a robot can track its 6-D pose and build a consistent map of its environment.
Structure from Motion
See corresponding chapter in the PDF Springer Handbook of Robotics
Programming
Let’s move on to perhaps the most exciting part: applying the vision concepts you’ve learned in code and seeing your robot working right in front of you!
(Please refer to the Install Webots section if you haven’t installed it yet.)
Step 1: Setup your environment
- 📁 Download the `irb` folder.
- Extract the downloaded `.zip` file.
- Launch Webots. From the top-left corner select File → Open World.
- Navigate to the extracted `irb/worlds` folder and select your `.wbt` file.
Step 2: Let’s start coding!
Once successfully opened, your robot and its environment should appear, as illustrated in the screenshot below:

Now, follow the instructions provided on the right side panel within Webots, and complete the code to make your robot move.
Once you’ve implemented all the “COMPLETE THIS LINE OF CODE” sections, click “Build” or “Save” (CTRL+S) to compile your project, and then start the simulation.
Good luck and have fun!
Exercise
Exercise 1
- Determine the intrinsic parameter matrix ($K$) of a digital camera with an image size of 640×480 pixels and a horizontal field of view of 90°.
- Assume square pixels and the principal point at the center of the image (the intersection of the diagonals).
- What is the vertical field of view?
- What is the projection on the image plane of $^{c}P = [1, 1, 2]^T$?
Solution


Credits
Resources
Books
- Springer Handbook of Robotics (Chapter 32. 3-D Vision for Navigation and Grasping)
- Springer Handbook of Robotics (Chapter 34. Visual Servoing)
- Robotic Manipulation (Chapter 4. Geometric Pose Estimation)
Videos
- Computer Vision (UC Berkeley)
- Multiple View Geometry - Lecture 1 (Prof. Daniel Cremers) (TU München)
- First Principles of Computer Vision (YouTube channel)
Free Online Courses
- Computer Vision (RWTH Aachen)