The Context
It’s February 1972; the A300 airliner is being unveiled in Toulouse; let’s go on an adventure in camera calibration!
We have seen this photo published in a magazine, and we want to learn as much about the dimensions of Airbus’s new aircraft as possible. To do so, we will need to mathematically reconstruct the camera used to take the photo and the scene itself.
Control points
In this case, we are lucky because we notice the hexagonal pattern on the floor. In particular, it’s a tessellating hexagonal pattern, which can only happen if all the hexagons have identical dimensions.
We don’t know the hexagons’ exact dimensions, but based on the height of the people in the photo, each side is approximately 1.6m long. If we assume the factory floor is flat and level, and take the centre of one hexagon as the origin (0, 0), we can work out the X and Y location of every other hexagon vertex we can see, with the Z coordinate of each point being 0.
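To see where the “magic” analytic formula in the code below comes from: for a regular, flat-topped hexagon with side length \(s\) centred at the origin, the six vertices sit at distance \(s\) from the centre, at angles that are multiples of \(60^\circ\):

\[\begin{equation*} (x_k, y_k) = s\,(\cos\theta_k,\ \sin\theta_k), \qquad \theta_k \in \{0^\circ, 60^\circ, 120^\circ, 180^\circ, 240^\circ, 300^\circ\} \end{equation*}\]

In the tessellation, the hexagon centres can then be indexed on a grid: incrementing the row index moves a centre by \(\tfrac{3}{2}s\) in \(x\), and incrementing the column index moves it by \(\tfrac{\sqrt{3}}{2}s\) in \(y\), with hexagons sitting only at positions where row and column have the same parity. This appears to be the indexing used by the annotation labels and the code in this post.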
Let’s spend ±5 minutes annotating the image using an annotation tool like LabelMe. I’ve generated a file, which you can find attached here:
Firstly, let’s load in all of the x and y points:

import json
import numpy as np

JSON = json.loads(open('data/2020-02-23-An-Adventure-In-Camera-Calibration/A300.json','r').read())
polygons = {}

for shape in JSON['shapes']:
    coords = shape['label'].split(',')
    x, y = int(coords[0]), int(coords[1])
    polygons[x, y] = shape['points']
OK, now let’s do some maths and work out the location of each vertex of our hexagons.
from sklearn.neighbors import KDTree

points = []
keys = sorted(polygons.keys())

for key in keys:
    poly = polygons[key]
    (pts_x, pts_y) = zip(*poly)
    pts_x = list(pts_x)
    pts_y = list(pts_y)

    #Magic analytic formula for working out the location of each point, based on which vertex of which hexagon it is.
    x_vertex = 0.5 * np.array([1,2,1,-1,-2,-1])
    y_vertex = 0.5 * np.array([np.sqrt(3),0,-np.sqrt(3),-np.sqrt(3),0,np.sqrt(3)])

    row, col = key
    x = row * 1.5 + x_vertex
    y = col * 0.5 * np.sqrt(3) + y_vertex

    #From before, we assume the sides of each hexagon are 1.6m
    x *= 1.6  #meters
    y *= 1.6  #meters

    for idx in range(0, 6):
        i = pts_x[idx]
        j = pts_y[idx]
        X = x[idx]
        Y = y[idx]
        Z = 0.0

        points.append([i, j, X, Y, Z])
Now we are presented with a minor problem: in many cases, we annotated the same point up to 3 times, where the vertices of neighbouring hexagons meet. So let’s find points that are within 10 pixels of each other and take their average. If we didn’t do this, we would effectively over-weight some points in the image at the expense of others.
points = np.asarray(points)
np.savetxt("data/2020-02-23-An-Adventure-In-Camera-Calibration/points.csv", points)

tree = KDTree(points[:,0:2], leaf_size=5)

merged_indicies = []
unique_points = []
for i in range(0, points.shape[0]):
    if i not in merged_indicies:
        dist, ind = tree.query(points[i,0:2].reshape(-1, 2), k=3)

        indicies_to_merge = []
        for j in range(0, 3):
            if dist[0][j] < 10:
                indicies_to_merge.append(ind[0][j])
                merged_indicies.append(ind[0][j])

        mean_points = np.mean(points[indicies_to_merge,:], axis=0)
        unique_points.append(mean_points)

unique_points = np.asarray(unique_points)
Camera Parameters
So, now we have many 3D points and corresponding 2D points in the photo.
Now it’s time to turn to the real magic, bundle adjustment. Our task is to find a camera that best fits our measured data.
Let’s talk more about cameras.
Mathematically, a camera is described by two sets of parameters: Intrinsic and Extrinsic. The Extrinsic parameters define the position and rotation of the camera with respect to the origin of the points it’s observing.
The Intrinsic parameters describe the camera itself: the focal length, the location of the camera’s optical centre (the principal point), and the distortion induced by the lens.
The Extrinsic parameters comprise 6 degrees of freedom: three for position, since our world is three-dimensional, and three for rotation.
The Intrinsic parameters are more complex. There are some great resources, for example, Multiple View Geometry in Computer Vision or the OpenCV documentation. However, in this case, I am assuming that the principal point, the focal length, and the radial distortion parameters are all unknown.
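For later reference, the code in this post packs all of these unknowns, together with the Extrinsic parameters, into a single 12-element vector. The sketch below is just the layout convention used here, matching the `camera_params` array that appears throughout the code:

import numpy as np

# Layout of camera_params as used in this post (a convention, not a library requirement):
#   [0:3]   Rodrigues rotation vector (Extrinsic: orientation)
#   [3:6]   camera centre C in world coordinates, in metres (Extrinsic: position)
#   [6]     focal length f, in pixels (Intrinsic)
#   [7:10]  radial distortion coefficients k1, k2, k3 (Intrinsic)
#   [10:12] principal point c_x, c_y, in pixels (Intrinsic)
camera_params = np.zeros(12)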
First, let’s import a bunch of stuff we will need later.
from __future__ import print_function
from warnings import filterwarnings
filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt

from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as Rot

%matplotlib inline
plt.rcParams["figure.figsize"] = (20,20)
points_2d = unique_points[:,0:2]
points_3d = unique_points[:,2:5]

print('We have {} unique points'.format(points_2d.shape[0]))
We have 51 unique points
Modelling the Camera
Intrinsic Matrix
Now we come to the real magic.
\[\begin{equation*} x = PX \end{equation*}\]
This equation models the camera, taking points in 3D space and converting them into points in 2D space.
There are lots of things going on here.
Firstly, let’s talk about the camera’s intrinsic matrix, \(K\), which maps points from the camera’s coordinate frame to pixel coordinates in the image.
\[\begin{equation*} K = \begin{bmatrix} f & 0 & c_{x} \\ 0 & f & c_{y} \\ 0 & 0 & 1 \end{bmatrix} \end{equation*}\]
We have the focal length, \(f\), and the camera’s optical centre, \(c_x\) and \(c_y\).
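Concretely, once a point has been expressed in the camera’s frame and divided by its depth \(z\) (the perspective divide that appears later in the `project` function), applying \(K\) just scales by the focal length and shifts by the optical centre:

\[\begin{equation*} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix} = \begin{bmatrix} f\,x/z + c_x \\ f\,y/z + c_y \\ 1 \end{bmatrix} \end{equation*}\]

Here \((u, v)\) are pixel coordinates.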
Extrinsic Matrix
Now let’s talk about the camera’s extrinsic matrix.
The 6 degrees of freedom describe its position and orientation within the world. That’s 3 degrees for the position and 3 for the orientation. At its heart, what we are doing is simple but confusing.
There are so many ways to represent our setup:

1. Coordinate systems: 2D and 3D.
   - Left-handed or right-handed?
2. Rotations:
   - Quaternions?
   - Proper Euler angles (6 different ways)?
   - Tait–Bryan angles (6 different ways)?
   - Rodrigues rotation formula?
   - A rotation matrix?
3. The location of the camera in the world (2 different ways).
Today, we will use two ways to represent the rotations: a Rodrigues rotation vector representation and a rotation matrix.
We use two different representations because it’s easier to optimise over a rotation vector with just 3 degrees of freedom than over a naive rotation matrix, which uses nine numbers to represent the same 3 degrees of freedom.
\(R\) represents the camera’s orientation in the world coordinate frame (the frame we use to describe our 3D points).
In Python, we can convert from the Rodrigues rotation vector to the rotation matrix as follows:
from scipy.spatial.transform import Rotation as Rot

rotation_vector = camera_params[:3]
R = Rot.from_rotvec(rotation_vector).as_matrix()
\[\begin{equation*} R = \begin{bmatrix} R_1 & R_2 & R_3 \\ R_4 & R_5 & R_6 \\ R_7 & R_8 & R_9 \end{bmatrix} \end{equation*}\]
Now, let’s talk about the projection matrix \(P\) of the camera. This takes points from their location in 3D world coordinates to pixel coordinates, assuming a camera without radial distortion. There are two main ways this could be formulated.
Firstly: \[\begin{equation*} P = KR[I|−C] \end{equation*}\]
Secondly: \[\begin{equation*} P = K[R | t] \end{equation*}\]
Where \(t\) is: \[\begin{equation*} t = −RC \end{equation*}\]
Let’s go with the first method, where \(C\) is the camera centre in world coordinates:
\[\begin{equation*} C = \begin{bmatrix} C_X\\ C_Y\\ C_Z \end{bmatrix} \end{equation*}\]
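The two formulations give the same projection matrix. A quick numerical sanity check of that claim, using arbitrary illustrative values for \(K\), \(R\) and \(C\) (none of these numbers come from the post), might look like this:

import numpy as np
from scipy.spatial.transform import Rotation as Rot

# Arbitrary example intrinsics, rotation and camera centre (illustrative values only)
K = np.asarray([[2000.0, 0.0, 1125.0],
                [0.0, 2000.0, 754.0],
                [0.0, 0.0, 1.0]])
R = Rot.from_euler('xyz', [170.0, 5.0, -3.0], degrees=True).as_matrix()
C = np.asarray([[2.0], [-4.0], [10.0]])

# First formulation: P = K R [I | -C]
P1 = np.matmul(K, np.matmul(R, np.hstack([np.eye(3), -C])))

# Second formulation: P = K [R | t], with t = -R C
t = -np.matmul(R, C)
P2 = np.matmul(K, np.hstack([R, t]))

print(np.allclose(P1, P2))  # True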
Lens distortion
However, there is one subtlety alluded to before, which is the impact of radial distortion. The camera’s lens distorts the incoming rays of light in a non-linear way.
We can model it using a Taylor series, where \(r\) is the distance of the (normalised) image point from the optical axis:
\[\begin{equation*} x_c = x(1 + k_1 r + k_2 r^2 + k_3 r^3) \end{equation*}\] \[\begin{equation*} y_c = y(1 + k_1 r + k_2 r^2 + k_3 r^3) \end{equation*}\]
In Python, we end up with the following:

r = np.sqrt(np.sum(points_proj**2, axis=1))
r = 1 + k1 * r + k2 * r**2 + k3 * r**3
points_proj *= r[:, np.newaxis]
Putting it all together
def project(points, camera_params):
    """Convert 3-D points to 2-D by projecting onto images."""
    #Rotation
    rotation_vector = camera_params[:3]
    R = Rot.from_rotvec(rotation_vector).as_matrix()

    #Camera centre
    C = camera_params[3:6].reshape(3,1)
    IC = np.hstack([np.eye(3), -C])
    RIC = np.matmul(R, IC)

    #Make points homogeneous
    points = np.hstack([points, np.ones((points.shape[0],1))])

    #Perform rotation and translation
    #(n,k), (k,m) -> (n,m)
    points_proj = np.matmul(points, RIC.T)

    #Perspective divide
    points_proj = points_proj[:, :2] / points_proj[:, 2, np.newaxis]

    f = camera_params[6]

    k1 = camera_params[7]
    k2 = camera_params[8]
    k3 = camera_params[9]
    c_x = camera_params[10]
    c_y = camera_params[11]

    #Radial distortion
    r = np.sqrt(np.sum(points_proj**2, axis=1))

    x = points_proj[:,0]
    y = points_proj[:,1]

    points_proj[:,0] = (1 + k1 * r + k2 * r**2 + k3 * r**3) * x
    points_proj[:,1] = (1 + k1 * r + k2 * r**2 + k3 * r**3) * y

    #Make points homogeneous
    points_proj = np.hstack([points_proj, np.ones((points_proj.shape[0],1))])

    K = np.asarray([[f, 0, c_x],
                    [0, f, c_y],
                    [0, 0, 1.0]])

    points_proj = np.dot(points_proj, K.T)
    points_proj = points_proj[:,:2]
    return points_proj
Initial Parameters
Let’s start by providing hints to the optimiser about the solution by putting in some reasonable starting conditions.
We know both the image width and height, and we can assume that the principal point is in the centre of the image.
The camera is about 10 meters off the ground.
Let’s rotate the camera so it faces directly down, to make optimisation easier. This means that the points should be in front of/below it.
Let’s also assume that the camera is centred above the points. It’s not strictly correct, based on what we see in the image, but it’s not horrifically wrong.
image_width = 2251
image_height = 1508

estimated_focal_length_px = 2000

camera_params = np.zeros(12)

r = Rot.from_euler('x', 180, degrees=True).as_rotvec()

#Rotation (Rodrigues vector)
camera_params[0] = r[0]
camera_params[1] = r[1]
camera_params[2] = r[2]

#C (camera centre)
camera_params[3] = points_3d[:,0].mean()
camera_params[4] = points_3d[:,1].mean()
camera_params[5] = 10

#f, k1, k2, k3
camera_params[6] = estimated_focal_length_px
camera_params[7] = 0
camera_params[8] = 0
camera_params[9] = 0

#c_x, c_y
camera_params[10] = image_width/2.0
camera_params[11] = image_height/2.0
Optimisation
The approach in this section is explained well in the SciPy optimisation demo referenced at the end of this post.
We are optimising to minimise a geometric error, that is, the distance between the 2D points we see and the projection of their 3D counterparts.
Through optimisation, we aim to find parameters that result in a low error, which means they should represent the actual parameters of the camera.
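Concretely, if \(x_i\) are the annotated 2D points, \(X_i\) their 3D counterparts, and \(\theta\) the 12 camera parameters, we are asking the optimiser to solve:

\[\begin{equation*} \min_{\theta} \sum_{i} \left\lVert \mathrm{project}(X_i; \theta) - x_i \right\rVert^2 \end{equation*}\]

The function below simply returns the stacked residuals \(\mathrm{project}(X_i; \theta) - x_i\), and `least_squares` takes care of squaring and summing them.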
def fun(camera_params, points_2d, points_3d):
    #Compute residuals.
    points_proj = project(points_3d, camera_params)
    return (points_proj - points_2d).ravel()

x0 = camera_params.ravel()

optimization_results = least_squares(fun, x0, verbose=1, x_scale='jac', ftol=1e-4, method='lm',
                                     loss='linear', args=(points_2d, points_3d))
`ftol` termination condition is satisfied.
Function evaluations 970, initial cost 3.7406e+07, final cost 1.7398e+02, first-order optimality 3.63e+03.
Results
Now let’s go and check out the results of our optimization process.
camera_params = optimization_results.x

R_Rodrigues = camera_params[0:3]
C = camera_params[3:6]

r = Rot.from_rotvec(R_Rodrigues)
R_matrix = r.as_matrix()

r = Rot.from_matrix(R_matrix.T)
R_Quaternion = r.as_quat()

print('Quaternions: X: {:.3f} Y: {:.3f} Z: {:.3f} W: {:.3f} '.format(R_Quaternion[0],R_Quaternion[1],R_Quaternion[2],R_Quaternion[3]))
print('Camera position relative to the origin in (M): X: {:.2f}, Y: {:.2f}, Z: {:.2f}'.format(C[0],C[1],C[2]))

focal_length_px = camera_params[6]
k1 = camera_params[7]
k2 = camera_params[8]
k3 = camera_params[9]

c_x = camera_params[10]
c_y = camera_params[11]

print('Focal length (Pixels): {:.2f}'.format(focal_length_px))
print('CX, CY: {:.2f} {:.2f}'.format(c_x,c_y))
print('K_1, K_2, K_3 : {:.6f}, {:.6f}, {:.6f}'.format(k1,k2,k3))
print('Mean error per point: {:.2f} pixels '.format(optimization_results.cost/points_2d.shape[0]))
Quaternions: X: 0.894 Y: -0.408 Z: 0.084 W: -0.166
Camera position relative to the origin in (M): X: -6.85, Y: -12.92, Z: 2.75
Focal length (Pixels): 1010.93
CX, CY: 1038.58 2663.52
K_1, K_2, K_3 : -0.327041, 0.175031, -0.030751
Mean error per point: 3.41 pixels
Ok, the mean error per point is 3-4 pixels. We have found a decent solution; however, some interesting things are happening.
In particular, the principal point lies outside the image, which is curious. One possibility is that the image was cropped.
Now let’s have a quick look at the errors for each point.
plt.hist(abs(optimization_results.fun), density=True)
plt.title('Histogram of Residuals')
plt.xlabel('Absolute Residual (Pixels)')
plt.grid()
plt.show()
So the histogram looks pretty good, apart from the one point with a high residual, probably due to sloppy labelling/annotation.
Now let’s compare the points we annotated with where they would be projected using the camera parameters we found:
points_2d_proj = project(points_3d, optimization_results.x)

img = plt.imread('data/2020-02-23-An-Adventure-In-Camera-Calibration/A300.jpg')

plt.imshow(img)
plt.scatter(points_2d[:,0], points_2d[:,1], label='Actual', c='r', alpha=0.5)
plt.scatter(points_2d_proj[:,0], points_2d_proj[:,1], label='Optimised', c='k', alpha=0.5)
plt.show()
Again, this looks great. Finally, let’s overlay the hexagons on the floor, to build confidence in our solution visually.
def plot_verticies(row, col):
    x_vertex = 0.5 * np.array([1,2,1,-1,-2,-1,1])
    y_vertex = 0.5 * np.array([np.sqrt(3),0,-np.sqrt(3),-np.sqrt(3),0,np.sqrt(3),np.sqrt(3)])

    x = row * 1.5 + x_vertex
    y = col * 0.5 * np.sqrt(3) + y_vertex

    x *= 1.6
    y *= 1.6

    points_3d = np.vstack([x, y, np.zeros(7)]).T
    points_2d_proj = project(points_3d, optimization_results.x)

    return points_2d_proj

plt.imshow(img)
for row in range(0,10,2):
    for col in range(0,10,2):
        points_2d_proj = plot_verticies(row, col)
        plt.plot(points_2d_proj[:,0], points_2d_proj[:,1], color='b', alpha=0.25)
        plt.text(np.mean(points_2d_proj[:,0]), np.mean(points_2d_proj[:,1]), str(row)+','+str(col), horizontalalignment='center', verticalalignment='center')

for row in range(1,11,2):
    for col in range(1,11,2):
        points_2d_proj = plot_verticies(row, col)
        plt.plot(points_2d_proj[:,0], points_2d_proj[:,1], color='r', alpha=0.25)
        plt.text(np.mean(points_2d_proj[:,0]), np.mean(points_2d_proj[:,1]), str(row)+','+str(col), horizontalalignment='center', verticalalignment='center')

plt.show()
In Conclusion
We have found a semi-reasonable solution.
However, the location of the principal point needs more thought. With most cameras, it lies near the centre of the image; in our case, it doesn’t. There are several possible explanations, for example, the image might have been cropped, but it’s still a little concerning.
I’m also worried about the camera’s height of only 2.75m above the ground; in the photo, the camera appears to be at roughly the same height as the aircraft’s roof, which is 7-10m above the ground.
In the future, let’s look at how we can extract more useful information from this image and understand how confident we can be in our solution.
Thanks to Nikolay Mayorov, who created the excellent SciPy optimisation demo that I built upon; you can find the original code here.
Multiple View Geometry in Computer Vision is an incredible book that I learn more from each time I read it. In particular, for further information, see:
- Finite cameras. Page 153, Multiple View Geometry in Computer Vision (Second edition)
- Note: Minimizing geometric error. Page 176, Multiple View Geometry in Computer Vision (Second edition)