Geometry of Image Formation

In this post, we will explain image formation from a geometrical point of view.

Specifically, we will cover the math behind how a point in 3D gets projected onto the image plane.

This post is written with beginners in mind, but it is mathematical in nature. That said, all you need to know is matrix multiplication.

To make the problem concrete, let’s say you have a camera deployed in a room.

Given a 3D point P in this room, we want to find the pixel coordinates (u, v) of this point in the image taken by the camera.

There are three coordinate systems in play in this setup. Let’s go over them.

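World Coordinate System

Figure 1. The world and camera coordinate systems are related by a rotation and a translation. These six parameters (three for rotation, three for translation) are called the extrinsic parameters of the camera.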

To define the locations of points in the room, we first need to define a coordinate system for the room. This requires two things:

Origin : We can arbitrarily fix a corner of the room as the origin $(0,0,0)$
X, Y, Z axes : We can also define the X and Y axes of the room along the two dimensions of the floor and the Z axis along the vertical wall.

Using the above, we can find the 3D coordinates of any point in this room by measuring its distance from the origin along the X, Y, and Z axes.

This coordinate system attached to the room is referred to as the World Coordinate System. In Figure 1, it is shown using orange axes. We will use bold font (e.g. $\mathbf{X}_w$) to denote an axis, and regular font to denote a coordinate of a point (e.g. $X_w$).

Let us consider a point P in this room. In the world coordinate system, the coordinates of P are given by ($X_w$, $Y_w$, $Z_w$). You can find the $X_w$, $Y_w$, and $Z_w$ coordinates of this point by simply measuring its distance from the origin along the three axes.

Camera Coordinate System

Now, let’s put a camera in this room.

The image of the room will be captured using this camera, and therefore we are interested in a 3D coordinate system attached to this camera.

If we put the camera at the origin of the room and aligned it so that its X, Y, and Z axes coincided with the $\mathbf{X}_w$, $\mathbf{Y}_w$, and $\mathbf{Z}_w$ axes of the room, the two coordinate systems would be the same.

However, that is an absurd restriction. We want to be able to put the camera anywhere in the room and point it in any direction. In that case, we need to find the relationship between the 3D room (i.e. world) coordinates and the 3D camera coordinates.

Let’s say our camera is located at some arbitrary location ($t_X$, $t_Y$, $t_Z$) in the room. In technical jargon, we say the camera coordinate system is translated by ($t_X$, $t_Y$, $t_Z$) with respect to the world coordinate system.

The camera may also be looking in some arbitrary direction. In other words, we can say the camera is rotated with respect to the world coordinate system.

Rotation in 3D is captured using three parameters. You can think of the three parameters as yaw, pitch, and roll. You can also think of a rotation as an axis in 3D (two parameters) and an angle of rotation about that axis (one parameter).

However, it is often convenient for mathematical manipulation to encode rotation as a 3×3 matrix. Now, you may be thinking that a 3×3 matrix has 9 elements and therefore 9 parameters, but rotation has only 3 parameters. That’s true, and that is exactly why not every 3×3 matrix is a rotation matrix. Without going into the details, let us for now just note that a rotation matrix has only three degrees of freedom even though it has 9 elements.
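One way to see where the three degrees of freedom come from is the following short sketch. It uses the standard orthogonality conditions for rotation matrices, which the text above does not spell out. A rotation matrix must satisfy

$$ \mathbf{R}^\top \mathbf{R} = \mathbf{I}, \qquad \det \mathbf{R} = +1 $$

Because $\mathbf{R}^\top \mathbf{R}$ is symmetric, the first condition amounts to 6 independent scalar equations (3 saying each column has unit length, 3 saying the columns are mutually orthogonal), so the 9 elements carry $9 - 6 = 3$ free parameters. The determinant condition only rules out reflections and removes no further freedom.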

PS. The following short aside was generated with ChatGPT.

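Let's consider a rotation around the z-axis by an angle of π/4 radians (45 degrees). The rotation matrix for this transformation is:

$$ \mathbf{R}_z\left(\tfrac{\pi}{4}\right) = \begin{bmatrix} \cos\frac{\pi}{4} & -\sin\frac{\pi}{4} & 0 \\ \sin\frac{\pi}{4} & \cos\frac{\pi}{4} & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} $$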

This matrix represents a rotation in 3D space. The upper-left 2×2 block is the 2D rotation matrix for an angle of π/4, while the last row and column ensure that points on the z-axis remain unchanged.

You can see that the rows of this matrix are mutually orthogonal (the dot product of any two distinct rows is zero) and each has unit length, and the same holds for the columns. The determinant of the matrix is +1. These properties confirm that this is indeed a rotation matrix.
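These properties can be checked numerically. The following is a minimal sketch in plain Python (lists instead of a matrix library, to keep it dependency-free); the helper names `rot_z`, `dot`, and `det3` are made up for this illustration.

```python
import math

def rot_z(theta):
    """3x3 matrix for a rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

R = rot_z(math.pi / 4)

# gram[i][j] is the dot product of row i with row j.
# For a rotation matrix this should be (numerically) the identity:
# 1 on the diagonal (unit length), 0 elsewhere (orthogonal rows).
gram = [[dot(R[i], R[j]) for j in range(3)] for i in range(3)]

print("row dot products:", gram)
print("determinant:", det3(R))
```

Up to floating-point rounding, the row dot products form the identity matrix and the determinant is 1.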

Back to our original problem. The world coordinates and the camera coordinates are related by a rotation matrix $\mathbf{R}$ and a 3-element translation vector $\mathbf{t}$.

What does that mean?

It means that the point P, which has coordinates ($X_w$, $Y_w$, $Z_w$) in the world coordinate system, will have different coordinates ($X_c$, $Y_c$, $Z_c$) in the camera coordinate system. We show the camera coordinate system in red.

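The two sets of coordinates are related by the following equation (Equation 1):

$$ \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \mathbf{R} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \mathbf{t} $$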

Notice that representing rotation as a matrix allowed us to apply the rotation with a simple matrix multiplication instead of the tedious symbol manipulation required by other representations such as yaw, pitch, and roll. I hope this helps you appreciate why we represent rotations as matrices.
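As a small illustration of the relation $\mathbf{p}_c = \mathbf{R}\,\mathbf{p}_w + \mathbf{t}$, here is a minimal sketch in plain Python. The function name `world_to_camera` and the example pose (a 90 degree rotation about the Z axis and a translation of (1, 2, 3)) are assumptions chosen for illustration, not values from the original post.

```python
import math

def world_to_camera(R, t, p_w):
    """Apply p_c = R * p_w + t, with R a 3x3 list of rows and t, p_w 3-element lists."""
    return [sum(R[i][j] * p_w[j] for j in range(3)) + t[i] for i in range(3)]

# Hypothetical camera pose: rotate 90 degrees about the Z axis, then translate.
theta = math.pi / 2
c, s = math.cos(theta), math.sin(theta)
R = [[c, -s, 0.0],
     [s,  c, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]

p_w = [1.0, 0.0, 0.0]            # a point on the world X axis
p_c = world_to_camera(R, t, p_w)  # the same point in camera coordinates
print(p_c)
```

Note that with the equation written in this form, the world origin maps to $\mathbf{t}$, and the world point that lands on the camera origin is $-\mathbf{R}^\top\mathbf{t}$.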

Sometimes the expression above is written in a more compact form. The 3×1 translation vector is appended as a column at the end of the 3×3 rotation matrix to obtain a 3×4 matrix called the Extrinsic Matrix (Equation 2):

$$ \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \mathbf{P} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$

where the extrinsic matrix $\mathbf{P}$ is given by (Equation 3):

$$ \mathbf{P} = \begin{bmatrix} \mathbf{R} | \mathbf{t} \end{bmatrix} $$
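The compact form can be sketched as follows: build the 3×4 matrix $[\mathbf{R} \mid \mathbf{t}]$ and multiply it by the world point with a 1 appended. The helper names `extrinsic` and `apply_3x4` and the example pose are hypothetical, chosen only for illustration.

```python
def extrinsic(R, t):
    """Append t as a fourth column to the 3x3 matrix R, giving the 3x4 matrix [R | t]."""
    return [list(R[i]) + [t[i]] for i in range(3)]

def apply_3x4(P, p_h):
    """Multiply a 3x4 matrix by a homogeneous 4-vector."""
    return [sum(P[i][j] * p_h[j] for j in range(4)) for i in range(3)]

# Hypothetical pose: 90 degree rotation about Z, translation (1, 2, 3).
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [1.0, 2.0, 3.0]

P = extrinsic(R, t)
p_w_h = [1.0, 0.0, 0.0, 1.0]   # world point (1, 0, 0) in homogeneous form
p_c = apply_3x4(P, p_w_h)
print(p_c)
```

The trailing 1 in the homogeneous world point is what picks up the translation column, so a single multiplication by $\mathbf{P}$ gives the same result as rotating and then adding $\mathbf{t}$.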

Homogeneous coordinates : In projective geometry, we often work with a funny representation of coordinates where an extra dimension is appended to the coordinates. A 3D point ($X$, $Y$, $Z$) in Cartesian coordinates can be written as ($X$, $Y$, $Z$, $1$) in homogeneous coordinates. More generally, the point ($X$, $Y$, $Z$, $W$) in homogeneous coordinates, with $W \neq 0$, is the same as the point ($X/W$, $Y/W$, $Z/W$) in Cartesian coordinates. Homogeneous coordinates allow us to represent infinite quantities using finite numbers. For example, the point at infinity in the direction ($1$, $1$, $1$) can be represented as ($1$, $1$, $1$, $0$) in homogeneous coordinates. You may notice that we have used homogeneous coordinates in Equation 2 to represent the world coordinates.
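The conversion in both directions can be sketched in a few lines of plain Python. The function names `to_homogeneous` and `from_homogeneous` are made up for this example, and raising an error for $W = 0$ is one possible design choice, not something the text prescribes.

```python
def to_homogeneous(p):
    """Append W = 1 to a Cartesian point."""
    return list(p) + [1.0]

def from_homogeneous(p_h):
    """Divide by the last coordinate W to recover the Cartesian point."""
    *coords, w = p_h
    if w == 0:
        # W = 0 encodes a point at infinity, which has no finite Cartesian form.
        raise ValueError("point at infinity has no Cartesian equivalent")
    return [c / w for c in coords]

print(to_homogeneous([1.0, 2.0, 3.0]))
print(from_homogeneous([2.0, 4.0, 6.0, 2.0]))
```

Scaling all four homogeneous coordinates by the same nonzero factor leaves the Cartesian point unchanged, which is why (2, 4, 6, 2) and (1, 2, 3, 1) describe the same point.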

Image Coordinate System

Figure 2. The projection of point P onto the image plane.

Once we have the point in the 3D coordinate system of the camera, obtained by applying a rotation and translation to the point’s world coordinates, we are in a position to project the point onto the image plane to obtain its location in the image.

In Figure 2, we are looking at a point P with coordinates ($X_c$, $Y_c$, $Z_c$) in the camera coordinate system. Just a reminder: if we did not know the coordinates of this point in the camera coordinate system, we could transform its world coordinates using the Extrinsic Matrix (Equation 2) to obtain them.

Figure 2 shows the camera projection for a simple pinhole camera.

The optical center (pinhole) is represented by $O_c$. In reality, an inverted image of the point is formed on the image plane. For mathematical convenience, we simply do all the calculations as if the image plane were in front of the optical center, because the image read out from the sensor can be trivially rotated by 180 degrees to compensate for the inversion. In practice even this is not required. Reader Olaf Peters pointed out in the comments section: “It is even simpler: a real cameras sensor just reads out from the most bottom row in reverse order (from right to left), and then from bottom to top for each row. By this method the image is automatically formed upright and left and right are in correct order. So in practice there is no need to rotate the image anymore.”

The image plane is placed at a distance $f$ (the focal length) from the optical center.

Using high school geometry (similar triangles), we can show that the projected image location ($x$, $y$) of the 3D point ($X_c$, $Y_c$, $Z_c$) is given by (Equation 4):

$$ x = f{X_c \over Z_c} $$

$$ y = f{Y_c \over Z_c} $$
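The two projection equations translate directly into code. The sketch below is illustrative only: the function name `project_pinhole` is made up, and rejecting points with $Z_c \le 0$ (at or behind the optical center) is an added assumption that the text does not discuss.

```python
def project_pinhole(p_c, f):
    """Project a camera-frame point (X_c, Y_c, Z_c) using x = f*X_c/Z_c, y = f*Y_c/Z_c."""
    X_c, Y_c, Z_c = p_c
    if Z_c <= 0:
        # Assumption for this sketch: only points in front of the camera are projected.
        raise ValueError("point must be in front of the camera (Z_c > 0)")
    return (f * X_c / Z_c, f * Y_c / Z_c)

# Example: a point 4 units in front of the camera, focal length 2 (same units).
x, y = project_pinhole([2.0, 1.0, 4.0], f=2.0)
print(x, y)  # → 1.0 0.5
```

Doubling $Z_c$ halves both $x$ and $y$, which is the familiar effect of distant objects appearing smaller.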

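The above two equations can be rewritten in matrix form as follows (Equation 5):

$$ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$

where $x = x'/z'$ and $y = y'/z'$.

The matrix $K$ shown below (Equation 6) is called the Intrinsic Matrix and contains the intrinsic parameters of the camera.

$$ K = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} $$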

The simple matrix above contains only the focal length.

However, the pixels in the image sensor may not be square, and so we may have two different focal lengths $f_x$ and $f_y$.

The optical center ($c_x$, $c_y$) of the camera may not coincide with the center of the image coordinate system.

In addition, there may be a small skew $\gamma$ between the x and y axes of the camera sensor.

Taking all of the above into account, the camera matrix can be rewritten as (Equation 7):

$$ K = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} $$
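A small helper makes the layout of this matrix explicit: the focal lengths sit on the diagonal, the skew in the top row, and the optical center in the last column. The function name `intrinsic_matrix` and the numeric values are hypothetical; real values come from calibrating a specific camera.

```python
def intrinsic_matrix(fx, fy, cx, cy, skew=0.0):
    """Build the 3x3 intrinsic matrix from focal lengths, optical center, and skew."""
    return [[fx,  skew, cx],
            [0.0, fy,   cy],
            [0.0, 0.0,  1.0]]

# Made-up parameters for a 640x480 image with the optical center at its middle.
K = intrinsic_matrix(fx=800.0, fy=820.0, cx=320.0, cy=240.0)
for row in K:
    print(row)
```

Setting `fx == fy`, `skew = 0`, and `cx = cy = 0` recovers the simple focal-length-only matrix.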

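Figure 3. A more realistic scenario: when the origin of the image pixel coordinate system is at the top-left corner, the intrinsic camera matrix needs to account for the location of the principal point, the skew of the axes, and potentially different focal lengths along the two axes.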

In the projection equations for $x$ and $y$ above, the coordinates are measured with respect to the center of the image. When working with images, however, the origin is at the top-left corner of the image.

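Let’s represent the image coordinates by ($u$, $v$) (Equation 8):

$$ \begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$

where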

$$ u = {u' \over w'} $$

$$ v = {v' \over w'} $$
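Putting the pieces together, the whole chain from a world point to pixel coordinates can be sketched as below: rotate and translate into the camera frame, multiply by the intrinsic matrix, then divide by $w'$. The function name `project_point` and all numeric values are assumptions for illustration; the camera here sits at the world origin with no rotation so the result is easy to check by hand.

```python
def project_point(K, R, t, p_w):
    """Map a world point to pixel coordinates (u, v)."""
    # World -> camera: p_c = R * p_w + t
    p_c = [sum(R[i][j] * p_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Camera -> homogeneous image coordinates: (u', v', w') = K * p_c
    u_h, v_h, w_h = (sum(K[i][j] * p_c[j] for j in range(3)) for i in range(3))
    # Homogeneous -> pixel coordinates
    return (u_h / w_h, v_h / w_h)

# Made-up intrinsics: focal lengths of 800 pixels, optical center at (320, 240), no skew.
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
# Camera frame identical to the world frame.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

u, v = project_point(K, R, t, [0.5, -0.25, 2.0])
print(u, v)  # → 520.0 140.0
```

A point on the optical axis, such as (0, 0, 5), projects to ($c_x$, $c_y$) = (320, 240), which is a quick sanity check for any intrinsic matrix.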

Original source: https://learnopencv.com/geometry-of-image-formation/

Chinese translation by: Sogou Translate, DeepL Translate

Parts of the content were generated by ChatGPT