Computer Vision — Mathematical Foundations

Teaching
Machines to See

A comprehensive, interactive exploration of the mathematics, algorithms, and deep learning architectures that enable computers to understand visual information.

01 — Image Formation

Light, Sensors &
Representation

A digital image is fundamentally a 2D function $f(x, y)$ where $(x,y)$ are spatial coordinates and the value $f$ represents intensity (for grayscale) or a vector of intensities (for color). This function is sampled at discrete pixel locations and quantized to a finite number of intensity levels.

$$f : \mathbb{R}^2 \to \mathbb{R}^c \quad \text{where } c = 1 \text{ (grayscale)}, c = 3 \text{ (RGB)}$$ $$I[i,j] \in \{0, 1, \ldots, 2^b - 1\}^c, \quad b = \text{bit depth}$$

Color Spaces

Raw sensor data arrives in the Bayer mosaic format: a grid of photosites behind red, green, and blue color filters in a 2:1:1 G:R:B ratio (the RGGB pattern; green is sampled twice as densely because the eye is most sensitive to it). Demosaicing algorithms reconstruct full-color images from this mosaic. Different color spaces have different mathematical relationships:

$$\text{RGB} \to \text{Grayscale}: \quad Y = 0.299R + 0.587G + 0.114B$$ $$\text{RGB} \to \text{HSV}: \quad H = \text{atan2}\!\left(\sqrt{3}(G-B),\; 2R - G - B\right) \cdot \frac{180°}{\pi}$$ $$\text{RGB} \to \text{YCbCr}: \begin{pmatrix} Y \\ C_b \\ C_r \end{pmatrix} = \begin{pmatrix}0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081\end{pmatrix}\begin{pmatrix}R\\G\\B\end{pmatrix}$$
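These conversions (hue aside) are per-pixel linear maps, so each is one line of NumPy. A minimal sketch, assuming an $(H, W, 3)$ RGB array with values in $[0, 1]$; the helper names are illustrative:

```python
import numpy as np

def rgb_to_grayscale(rgb):
    """Luma-weighted grayscale: Y = 0.299 R + 0.587 G + 0.114 B."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def rgb_to_ycbcr(rgb):
    """Apply the 3x3 YCbCr matrix to every pixel."""
    M = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])
    return rgb @ M.T

rgb = np.random.rand(4, 4, 3)          # toy image
print(rgb_to_grayscale(rgb).shape)     # (4, 4)
print(rgb_to_ycbcr(rgb).shape)         # (4, 4, 3)
```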

Sampling & Quantization

The continuous scene is sampled spatially (pixels) and in intensity (quantization). By the Nyquist–Shannon theorem, we must sample at least twice the highest spatial frequency to avoid aliasing:

$$f_s \geq 2 f_{\max} \quad \Leftrightarrow \quad \Delta x \leq \frac{1}{2 f_{\max}}$$
🔬

Spatial Resolution

Number of pixels per unit length. Higher resolution recovers finer spatial frequencies. Modern sensors: 12–50 MP.

🌈

Bit Depth

$b$ bits → $2^b$ intensity levels. 8-bit: 256 levels. 16-bit: 65,536 levels (medical/scientific imaging).

📷

Dynamic Range

Ratio of max to min luminance $= L_{\max}/L_{\min}$. Human eye: ~$10^5$:1. Typical camera: ~$10^3$:1.

📡

Noise Models

Photon shot noise follows Poisson statistics, so at low light $\text{SNR} = \sqrt{N}$ for $N$ collected photons. Read noise and thermal (dark-current) noise are approximately Gaussian, $\mathcal{N}(0, \sigma^2)$.

Interactive — Image Sampling & Quantization
Sampled & Quantized

Drag the sliders to see how spatial resolution (sampling rate) and bit depth (quantization levels) affect image quality.

At low resolution, aliasing (staircase artifacts on diagonals/circles) becomes visible. At low bit depth, banding appears in smooth gradients.

The total storage cost of an image is:
$\text{Size} = W \times H \times c \times b$ bits
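A minimal NumPy sketch of the quantization half of this demo plus the storage formula (the image here is random noise, just to keep the snippet self-contained):

```python
import numpy as np

def quantize(img, bits):
    """Reduce an 8-bit image to 2**bits intensity levels (coarser steps -> banding)."""
    step = 256 / (2 ** bits)
    return (np.floor(img / step) * step).astype(np.uint8)

def storage_bits(w, h, c, b):
    """Size = W x H x c x b bits."""
    return w * h * c * b

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(np.unique(quantize(img, 3)).size)          # at most 8 distinct gray levels
print(storage_bits(1920, 1080, 3, 8) / 8 / 1e6)  # ~6.2 MB uncompressed
```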

02 — Image Filtering

Convolution &
Spatial Filters

Spatial filtering is the foundation of classical computer vision. The core operation is discrete convolution — sliding a small kernel $K$ over the image $I$ and computing weighted sums at each location.

$$(I * K)[i,j] = \sum_{m=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{n=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} I[i-m,\, j-n] \cdot K[m,n]$$

In practice (and in deep learning), what's called "convolution" is actually cross-correlation, where the kernel is not flipped:

$$(I \star K)[i,j] = \sum_{m,n} I[i+m,\, j+n] \cdot K[m,n]$$

The Gaussian Filter

The isotropic 2D Gaussian is the only filter that is both separable and rotationally symmetric, and it attains the theoretical minimum of the uncertainty principle (optimal joint localization in space and frequency):

$$G_\sigma(x,y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

Its separability means the 2D convolution decomposes into two 1D passes — reducing complexity from $O(k^2)$ to $O(2k)$ per pixel:

$$G_\sigma(x,y) = G_\sigma(x) \cdot G_\sigma(y), \quad G_\sigma(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-x^2/2\sigma^2}$$ $$\Rightarrow I * G_\sigma = (I *_{\text{col}} G_\sigma) *_{\text{row}} G_\sigma$$
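A minimal sketch of separable Gaussian filtering with SciPy's 1-D convolution, assuming a grayscale float image; the kernel radius of $3\sigma$ is a common truncation choice:

```python
import numpy as np
from scipy.ndimage import convolve1d

def gaussian_kernel_1d(sigma):
    """Sampled, normalized 1-D Gaussian truncated at ~3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Two 1-D passes (columns, then rows) instead of one k x k 2-D convolution."""
    k = gaussian_kernel_1d(sigma)
    return convolve1d(convolve1d(img, k, axis=0), k, axis=1)

img = np.random.rand(128, 128)
print(gaussian_blur(img, sigma=2.0).shape)   # (128, 128)
```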

Edge Detection — The Sobel Operator

Edges correspond to regions of rapid intensity change — high image gradient magnitude. The Sobel operator approximates the image gradient using finite differences:

$$K_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad K_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$$ $$G_x = I * K_x, \quad G_y = I * K_y$$ $$|\nabla I| = \sqrt{G_x^2 + G_y^2}, \quad \theta = \operatorname{atan2}(G_y,\, G_x)$$
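The same filtering machinery gives Sobel gradients. A small sketch (note that $K_y$ is just the transpose of $K_x$):

```python
import numpy as np
from scipy.ndimage import convolve

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel_gradients(img):
    """Gradient magnitude and quadrant-aware orientation from the Sobel kernels."""
    gx = convolve(img, KX)
    gy = convolve(img, KY)
    return np.hypot(gx, gy), np.arctan2(gy, gx)

img = np.random.rand(64, 64)
mag, theta = sobel_gradients(img)
print(mag.shape, theta.shape)   # (64, 64) (64, 64)
```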

Canny Edge Detector

The Canny detector (1986) is optimal under three criteria: good detection, good localization, and single response. Its pipeline:

🔵
Gaussian Blur
📐
Sobel Gradient
🔎
Non-Max Suppression
🔗
Hysteresis Threshold
Edge Map

Non-maximum suppression thins edges to single-pixel width by suppressing gradient values that are not local maxima along the gradient direction $\theta$. Hysteresis uses two thresholds $T_{\text{high}} > T_{\text{low}}$: strong edges $>T_{\text{high}}$ are definite; weak edges in $[T_{\text{low}}, T_{\text{high}}]$ are kept only if connected to a strong edge.
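Implementing non-maximum suppression and hysteresis by hand is a good exercise, but in practice the whole pipeline is one call in OpenCV. A usage sketch, assuming `opencv-python` is installed and `img` is an 8-bit grayscale array (the thresholds 50/150 are illustrative):

```python
import cv2
import numpy as np

img = (np.random.rand(128, 128) * 255).astype(np.uint8)   # stand-in grayscale image

blurred = cv2.GaussianBlur(img, (5, 5), 1.4)   # step 1: Gaussian smoothing
edges = cv2.Canny(blurred, 50, 150)            # Sobel + NMS + hysteresis (T_low=50, T_high=150)
print(edges.shape, edges.dtype)                # same size, uint8; edge pixels are set to 255
```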

Interactive — Convolution Kernel Explorer
Original (draw here!)
Filtered Output

Draw on the left canvas, then apply different kernels. The kernel matrix shows the weights used at each pixel position.

Each output pixel = weighted sum of neighborhood

03 — Feature Detection

Corners, Keypoints
& Descriptors

Feature detection identifies stable, distinctive points in an image that can be reliably re-detected across viewpoint and illumination changes — fundamental for image matching, 3D reconstruction, and tracking.

Harris Corner Detector

Harris & Stephens (1988) formalized the intuition that corners have large intensity variation in all directions. The structure tensor (second moment matrix) $M$ captures this:

$$M = \sum_{(x,y)\in W} w(x,y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} = \begin{pmatrix} \sum w I_x^2 & \sum w I_x I_y \\ \sum w I_x I_y & \sum w I_y^2 \end{pmatrix}$$

where $I_x = \partial I/\partial x$, $I_y = \partial I/\partial y$ are image gradients and $w(x,y)$ is a Gaussian weighting window. The eigenvalues $\lambda_1, \lambda_2$ of $M$ classify the region:

Flat Region

$\lambda_1 \approx \lambda_2 \approx 0$

No significant gradient in any direction.

Edge

$\lambda_1 \gg \lambda_2 \approx 0$

Large gradient in one direction only.

Corner ✓

$\lambda_1 \approx \lambda_2 \gg 0$

Large gradient in all directions.

Rather than computing eigenvalues directly, the Harris response function uses the trace and determinant:

$$R = \det(M) - k \cdot \text{tr}(M)^2 = \lambda_1\lambda_2 - k(\lambda_1+\lambda_2)^2$$ $$R > 0 \Rightarrow \text{Corner}, \quad R < 0 \Rightarrow \text{Edge}, \quad |R| \approx 0 \Rightarrow \text{Flat}$$

Typical value: $k \in [0.04, 0.06]$. Harris corners are not scale-invariant — this limitation led to the development of SIFT. A minimal NumPy/SciPy sketch of the Harris response map follows (Sobel gradients for $I_x, I_y$, a Gaussian window for $w$, and $k = 0.05$).
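```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def harris_response(img, sigma=1.0, k=0.05):
    """R = det(M) - k * trace(M)^2 at every pixel."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ix, iy = convolve(img, kx), convolve(img, kx.T)
    # structure tensor entries, smoothed by the Gaussian window w(x, y)
    sxx, syy, sxy = (gaussian_filter(a, sigma) for a in (ix * ix, iy * iy, ix * iy))
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

img = np.random.rand(64, 64)
R = harris_response(img)
corners = np.argwhere(R > 0.01 * R.max())   # simple thresholding (no non-max suppression)
print(R.shape, len(corners))
```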

SIFT — Scale-Invariant Feature Transform

Lowe (2004) addressed scale invariance by detecting keypoints in scale-space — a family of progressively blurred images parameterized by $\sigma$:

$$L(x, y, \sigma) = G_\sigma * I(x,y)$$ $$D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) \approx (k-1)\sigma^2 \nabla^2 G * I$$

Extrema of the Difference-of-Gaussians (DoG) across scale and space approximate the Laplacian of Gaussian — a blob detector that finds characteristic scales. Each keypoint's descriptor is a 128-dimensional histogram of gradient orientations, normalized for robustness:

$$\mathbf{d} \in \mathbb{R}^{128}: \quad \text{4} \times \text{4 spatial bins} \times \text{8 orientation bins}$$ $$\hat{\mathbf{d}} = \frac{\mathbf{d}}{\|\mathbf{d}\|_2}, \quad \text{then clip at } 0.2, \text{ renormalize}$$

Feature Matching & RANSAC

Given two sets of SIFT descriptors, candidate matches are found by nearest-neighbor search with the Lowe ratio test: accept a match only if the distance to the nearest descriptor is less than 0.8 times the distance to the second-nearest, i.e. $d_1 / d_2 < 0.8$. Remaining outliers (mismatches) are then removed using RANSAC:

$$N_{\text{iter}} = \frac{\log(1 - p)}{\log(1 - \epsilon^s)}$$

where $p$ = desired probability of success, $\epsilon$ = inlier ratio, $s$ = minimum sample size (e.g. 4 for homography)
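Plugging in numbers shows why RANSAC is cheap when inliers are plentiful and expensive when they are not. A quick sketch of the iteration count:

```python
import math

def ransac_iterations(p=0.99, inlier_ratio=0.5, sample_size=4):
    """Draws needed so that at least one all-inlier sample occurs with probability p."""
    return math.ceil(math.log(1 - p) / math.log(1 - inlier_ratio ** sample_size))

print(ransac_iterations(0.99, 0.5, 4))   # ~72 iterations at 50% inliers (homography)
print(ransac_iterations(0.99, 0.2, 4))   # ~2900 iterations at 20% inliers
```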

04 — Camera Geometry

Projective Geometry
& 3D Vision

The Pinhole Camera Model

The fundamental model of image formation: a 3D world point $\mathbf{X} = (X,Y,Z)^T$ projects to image point $\mathbf{x} = (x,y)^T$ through the camera center (pinhole). In homogeneous coordinates:

$$\tilde{\mathbf{x}} = P \tilde{\mathbf{X}}, \quad P = K[R \mid \mathbf{t}]$$ $$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \underbrace{\begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}}_{K \text{ (intrinsic)}} \underbrace{\begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix}}_{[R|\mathbf{t}] \text{ (extrinsic)}} \begin{pmatrix}X\\Y\\Z\\1\end{pmatrix}$$

The intrinsic matrix $K$ encodes: focal lengths $(f_x, f_y)$ in pixels, principal point $(c_x, c_y)$, and skew $s \approx 0$ for modern cameras. The extrinsic $[R|\mathbf{t}]$ defines the camera's pose in the world.
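A minimal sketch of the projection $\tilde{\mathbf{x}} = K[R\mid\mathbf{t}]\tilde{\mathbf{X}}$ in NumPy; the intrinsics and pose below are made-up example values:

```python
import numpy as np

def project(X_world, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates via P = K [R | t]."""
    X_cam = X_world @ R.T + t              # world frame -> camera frame
    x_hom = X_cam @ K.T                    # apply intrinsics (homogeneous image coords)
    return x_hom[:, :2] / x_hom[:, 2:3]    # perspective divide

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])          # f = 800 px, principal point (320, 240)
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])  # camera axis-aligned, scene 5 units in front
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(project(pts, K, R, t))                 # [[320. 240.] [480. 400.]]
```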

Lens Distortion

Real lenses deviate from the pinhole model. Radial distortion is most significant:

$$x_d = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$ $$y_d = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$ $$r^2 = x^2 + y^2$$

$k_1 < 0$: barrel distortion (magnification falls off toward the edges, typical of wide-angle lenses). $k_1 > 0$: pincushion distortion.

Homography

When the scene is planar (or the camera rotates in place), points relate by a homography $H$, a $3\times 3$ projective transformation:

$$\mathbf{x}' \sim H\mathbf{x}, \quad H \in \mathbb{R}^{3\times 3}, \quad \det(H) \neq 0$$ $$\begin{pmatrix}x'\\y'\\1\end{pmatrix} \sim \begin{pmatrix}h_{11}&h_{12}&h_{13}\\h_{21}&h_{22}&h_{23}\\h_{31}&h_{32}&h_{33}\end{pmatrix}\begin{pmatrix}x\\y\\1\end{pmatrix}$$

Stereo Vision & Depth from Disparity

A stereo camera rig with known baseline $b$ can recover depth from the disparity $d$ between corresponding points:

$$Z = \frac{b \cdot f}{d}, \quad d = x_L - x_R$$ $$\text{Depth error: } \Delta Z = \frac{Z^2}{b \cdot f} \Delta d \quad \Rightarrow \text{error grows as } Z^2$$
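A small numerical check of how the depth error grows, assuming a made-up rig with $b = 0.12$ m and $f = 700$ px:

```python
b, f = 0.12, 700.0                    # baseline (m), focal length (px)
for d in (84.0, 8.4):                 # disparity in pixels
    Z = b * f / d
    dZ = Z ** 2 / (b * f)             # depth error per pixel of disparity error
    print(f"Z = {Z:5.2f} m, error per pixel of disparity ~ {dZ:.3f} m")
# 1 m away: ~1.2 cm per pixel; 10 m away: ~1.2 m per pixel -- the Z^2 growth in action.
```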

Essential & Fundamental Matrices

The epipolar constraint encodes the geometry between two uncalibrated views. For a world point $\mathbf{X}$, its projections $\mathbf{x}, \mathbf{x}'$ in two cameras satisfy:

$$\mathbf{x}'^T F \mathbf{x} = 0 \quad \text{(Fundamental Matrix, 7 DOF)}$$ $$\mathbf{x}'^T E \mathbf{x} = 0 \quad \text{(Essential Matrix, 5 DOF)}$$ $$E = K'^T F K, \quad E = [t]_\times R$$ $$[t]_\times = \begin{pmatrix}0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0\end{pmatrix}$$
Interactive — 3D to 2D Projection
3D World View
Camera Image Plane

Adjust the camera's focal length and pose to see how a 3D cube projects onto the image plane.

Notice: longer focal length → more telephoto/compressed. Moving camera ≠ rotating object in 3D.

05 — Deep Learning

Convolutional Neural
Networks

The Convolutional Neural Network (CNN) revolutionized computer vision by learning hierarchical feature representations directly from data. Unlike hand-crafted features, CNNs learn what to look for.

The Convolutional Layer

A conv layer applies $K$ learned filters to the input feature map. With input $\mathbf{X} \in \mathbb{R}^{H \times W \times C_{in}}$ and filter $\mathbf{W}^{(k)} \in \mathbb{R}^{F \times F \times C_{in}}$:

$$\mathbf{Z}^{(k)}[i,j] = \sum_{c=1}^{C_{in}} \sum_{m=0}^{F-1} \sum_{n=0}^{F-1} \mathbf{X}[i+m, j+n, c] \cdot \mathbf{W}^{(k)}[m,n,c] + b^{(k)}$$ $$\mathbf{Y}^{(k)} = \sigma(\mathbf{Z}^{(k)}) \quad \text{where } \sigma = \text{ReLU}(z) = \max(0,z)$$ $$\text{Output spatial size: } H_{out} = \left\lfloor \frac{H - F + 2P}{S} \right\rfloor + 1$$

where $P$ = padding, $S$ = stride. Parameter count: $K \cdot (F^2 \cdot C_{in} + 1)$ — much less than a fully connected layer.
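A quick sketch of the output-size and parameter-count formulas (the 7×7, 64-filter example matches a typical ResNet-style stem):

```python
def conv_output_size(h, w, f, p, s):
    """floor((H - F + 2P) / S) + 1 in each spatial dimension."""
    return (h - f + 2 * p) // s + 1, (w - f + 2 * p) // s + 1

def conv_params(k, f, c_in):
    """K * (F^2 * C_in + 1) weights plus biases."""
    return k * (f * f * c_in + 1)

print(conv_output_size(224, 224, f=7, p=3, s=2))   # (112, 112)
print(conv_params(k=64, f=7, c_in=3))              # 9472 parameters
print(conv_params(k=64, f=3, c_in=64))             # 36928 parameters
```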

| Layer Type | Operation | Purpose | Key Param |
| --- | --- | --- | --- |
| Conv2D | $\mathbf{X} * \mathbf{W} + b$ | Feature extraction | Filter size, stride |
| MaxPool | $\max_{(m,n)\in R} \mathbf{X}[i+m,j+n]$ | Spatial downsampling, translation invariance | Pool size $k$ |
| AvgPool | $\frac{1}{k^2}\sum_{(m,n)} \mathbf{X}[i+m,j+n]$ | Smoother downsampling | Pool size $k$ |
| BatchNorm | $\hat{x} = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}; \; y = \gamma\hat{x}+\beta$ | Training stability, regularization | $\gamma, \beta$ (learned) |
| Dropout | $\mathbf{z} = \mathbf{x} \odot \text{Bernoulli}(1-p)/(1-p)$ | Regularization, prevents co-adaptation | Drop probability $p$ |
| Softmax | $p_k = e^{z_k}/\sum_j e^{z_j}$ | Classification output (probability distribution) | Temperature $T$ |
| Architecture | Year | Innovation | Top-5 Error (ImageNet) |
| --- | --- | --- | --- |
| AlexNet | 2012 | Deep CNN + ReLU + Dropout + GPU training | 15.3% |
| VGGNet | 2014 | Very deep (16–19 layers) with 3×3 convolutions only | 7.3% |
| GoogLeNet/Inception | 2014 | Inception modules: parallel multi-scale convolutions | 6.7% |
| ResNet | 2015 | Residual connections: $\mathbf{y} = \mathcal{F}(\mathbf{x},\{W_i\}) + \mathbf{x}$ | 3.57% |
| DenseNet | 2016 | Dense connections: each layer connected to all subsequent layers | ~3% |
| EfficientNet | 2019 | Compound scaling of depth/width/resolution | 1.8% |
| ViT | 2020 | Pure transformer on image patches | <1.5% |

ResNet's skip connection solves the vanishing gradient problem: during backpropagation, gradients flow directly through identity connections, enabling training of networks with 1000+ layers.

Training minimizes the cross-entropy loss via stochastic gradient descent with backpropagation:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{k=1}^{C} y_{ik} \log \hat{p}_{ik} + \frac{\lambda}{2}\|\mathbf{W}\|_F^2$$ $$\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} \mathcal{L}$$

Backprop through a conv layer computes three gradients via the chain rule: with respect to the weights, the bias, and the input.

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(k)}} = \sum_{i,j} \delta^{(k)}[i,j] \cdot \mathbf{X}[i:i+F,\, j:j+F,\, :], \qquad \frac{\partial \mathcal{L}}{\partial b^{(k)}} = \sum_{i,j} \delta^{(k)}[i,j]$$ $$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \sum_k \delta^{(k)} * \text{flip}(\mathbf{W}^{(k)}) \quad \text{(full convolution)}$$

where $\delta^{(k)} = \frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(k)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}^{(k)}} \odot \mathbf{1}[\mathbf{Z}^{(k)} > 0]$ (ReLU gradient).

Modern optimizers like Adam adapt per-parameter learning rates:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$ $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$ $$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t$$
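A minimal NumPy sketch of the Adam update, driven by the gradient of a toy quadratic loss so the snippet runs end to end:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and updated moment estimates."""
    m = b1 * m + (1 - b1) * g            # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * g ** 2       # second moment (running mean of squared gradients)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):                 # minimize ||theta - 1||^2
    g = 2 * (theta - 1.0)
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)                             # approaching [1, 1, 1]
```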

Normalization layers are critical for training stability. Different methods normalize across different axes of the $(N, C, H, W)$ tensor:

$$\text{BatchNorm:} \quad \hat{x}_{nchw} = \frac{x_{nchw} - \mu_c}{\sqrt{\sigma_c^2+\epsilon}}, \quad \mu_c = \frac{1}{NHW}\sum_{n,h,w} x_{nchw}$$ $$\text{LayerNorm:} \quad \hat{x}_{nchw} = \frac{x_{nchw} - \mu_n}{\sqrt{\sigma_n^2+\epsilon}}, \quad \mu_n = \frac{1}{CHW}\sum_{c,h,w} x_{nchw}$$ $$\text{GroupNorm:} \quad \text{Normalize over groups of channels} \quad (G \text{ groups of } C/G \text{ channels each})$$

BatchNorm is best for large batches; LayerNorm is preferred in transformers; GroupNorm is used when batch sizes are small (detection, segmentation).
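The only difference between these schemes is which axes the statistics are computed over. A small NumPy sketch on an $(N, C, H, W)$ activation tensor, omitting the learned $\gamma, \beta$:

```python
import numpy as np

x = np.random.randn(8, 64, 32, 32)                 # (N, C, H, W) activations

def batch_norm(x, eps=1e-5):
    """Per-channel statistics over batch and spatial dims (axes N, H, W)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Per-sample statistics over channels and spatial dims (axes C, H, W)."""
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

print(batch_norm(x).shape, layer_norm(x).shape)    # both (8, 64, 32, 32)
```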

Interactive — CNN Feature Hierarchy
Hover over layers to see their role. Early layers detect edges → mid-layers detect textures/parts → deep layers detect objects.
06 — Object Detection

Localization &
Recognition

Object detection simultaneously classifies and localizes multiple objects: output is a set of bounding boxes $(x, y, w, h)$ with class probabilities.

Intersection over Union (IoU)

The standard metric for bounding box quality:

$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Area of overlap}}{\text{Area of union}} \in [0, 1]$$ $$\text{IoU} > 0.5 \Rightarrow \text{correct detection (PASCAL VOC standard)}$$
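IoU is a few lines of arithmetic on box corners. A sketch for axis-aligned boxes given as $(x_1, y_1, x_2, y_2)$:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.14
print(iou((0, 0, 10, 10), (2, 2, 12, 12)))   # 64 / 136 ~ 0.47, still below the 0.5 cutoff
```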

Anchor-Based Detection — YOLO

YOLO (You Only Look Once) divides the image into an $S \times S$ grid. Each cell predicts $B$ bounding boxes and $C$ class probabilities simultaneously in a single forward pass:

$$\text{Output} \in \mathbb{R}^{S \times S \times (B \cdot 5 + C)}$$ $$\text{Each box: } (t_x, t_y, t_w, t_h, p_{\text{obj}})$$ $$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y$$ $$b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$
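A sketch of decoding one predicted box, assuming a grid-cell offset $(c_x, c_y)$ and an anchor prior $(p_w, p_h)$; the numbers below are made up:

```python
import numpy as np

def decode_box(t, cell_xy, anchor_wh):
    """Map raw outputs (t_x, t_y, t_w, t_h) to a box center and size in grid units."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = sig(t[0]) + cell_xy[0]          # sigmoid keeps the center inside its cell
    by = sig(t[1]) + cell_xy[1]
    bw = anchor_wh[0] * np.exp(t[2])     # exp scales the anchor prior
    bh = anchor_wh[1] * np.exp(t[3])
    return bx, by, bw, bh

print(decode_box(np.array([0.2, -0.1, 0.3, 0.1]), cell_xy=(3, 5), anchor_wh=(1.5, 2.0)))
```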

YOLO Loss Function

$$\mathcal{L} = \lambda_{\text{coord}} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right]$$ $$+ \lambda_{\text{coord}} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right]$$ $$+ \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}}(C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i,j} \mathbf{1}_{ij}^{\text{noobj}}(C_i-\hat{C}_i)^2$$ $$+ \sum_{i} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2$$

Non-Maximum Suppression (NMS)

Detection models produce many overlapping predictions. NMS removes redundant boxes:

1. Sort boxes by $p_{\text{obj}}$ in descending order.
2. Select the highest-confidence box $\mathbf{b}^*$ and keep it.
3. Remove all remaining $\mathbf{b}_i$ with $\text{IoU}(\mathbf{b}^*, \mathbf{b}_i) > \tau_{\text{NMS}}$.
4. Repeat until no boxes remain.
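A greedy NMS sketch in NumPy following exactly those steps, with boxes as $(x_1, y_1, x_2, y_2)$ rows:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS on Nx4 boxes; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]               # 1. sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                                # 2. keep the highest-scoring box
        keep.append(int(i))
        rest = order[1:]
        iw = np.maximum(0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        ious = inter / (areas[i] + areas[rest] - inter)
        order = rest[ious <= iou_thresh]            # 3. suppress heavy overlaps, 4. repeat
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                           # [0, 2]: the second box is suppressed
```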

Mean Average Precision (mAP)

$$AP = \int_0^1 p(r)\, dr \approx \sum_{k=1}^{N} p(k) \Delta r(k)$$ $$\text{mAP} = \frac{1}{|C|}\sum_{c \in C} AP_c$$ $$\text{COCO mAP} = \frac{1}{10}\sum_{\tau \in \{0.5, 0.55, \ldots, 0.95\}} \text{mAP}(\tau)$$
Interactive — Bounding Box IoU Calculator
Drag the boxes on the canvas to see IoU update in real time.

Semantic Segmentation

Assigns a class label to every pixel. FCN (Fully Convolutional Networks) and U-Net use encoder-decoder architectures with skip connections:

$$\mathcal{L}_{\text{seg}} = -\frac{1}{HW} \sum_{i=1}^{H}\sum_{j=1}^{W} \sum_{c=1}^{C} y_{ijc} \log \hat{p}_{ijc}$$ $$\text{Dice Loss} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i^2 + \sum_i g_i^2}$$

Optical Flow — Lucas-Kanade

Optical flow estimates per-pixel motion between frames, assuming brightness constancy:

$$\frac{dI}{dt} = \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t} = 0$$ $$\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} \begin{pmatrix}u\\v\end{pmatrix} = -\begin{pmatrix}I_x I_t \\ I_y I_t\end{pmatrix}$$ $$\Leftrightarrow M \mathbf{v} = \mathbf{b} \quad \Rightarrow \quad \mathbf{v} = M^{-1}\mathbf{b}$$

Note: $M$ is exactly the Harris structure tensor! This reveals a deep connection between corner detection and motion estimation.
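A minimal sketch of solving the Lucas-Kanade system for a single patch; the synthetic patch is generated so that the true flow is known:

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Solve M v = b for the flow (u, v) of one patch from its gradient stacks."""
    M = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])   # the Harris structure tensor
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(M, b)   # only well-posed where the patch is corner-like

rng = np.random.default_rng(0)
Ix, Iy = rng.standard_normal((2, 15, 15))
u_true, v_true = 0.5, -0.2
It = -(Ix * u_true + Iy * v_true)        # brightness constancy: I_t = -(I_x u + I_y v)
print(lucas_kanade_patch(Ix, Iy, It))    # ~[0.5, -0.2]
```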

07 — Vision Transformers

Attention &
ViT Architecture

The Vision Transformer (ViT) treats an image as a sequence of patches and applies the transformer architecture. An image $I \in \mathbb{R}^{H \times W \times C}$ is split into $N = HW/P^2$ patches of size $P \times P$:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\; \mathbf{x}_p^1 \mathbf{E};\; \ldots;\; \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{\text{pos}}$$ $$\mathbf{E} \in \mathbb{R}^{(P^2 C) \times D}, \quad \mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1)\times D}$$

Multi-Head Self-Attention

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$ $$Q = \mathbf{z} W_Q, \quad K = \mathbf{z} W_K, \quad V = \mathbf{z} W_V$$ $$\text{MHA}(\mathbf{z}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O$$ $$\text{head}_i = \text{Attention}(\mathbf{z} W_Q^i, \mathbf{z} W_K^i, \mathbf{z} W_V^i)$$

The $\sqrt{d_k}$ scaling prevents dot products from growing large and pushing softmax into saturation regions with near-zero gradients.
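A minimal NumPy sketch of single-head scaled dot-product attention over a ViT-sized token sequence (196 patches plus the [CLS] token for a 224×224 image with $P = 16$):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (N, N) token-to-token affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V

N, d = 197, 64                                        # 14*14 patches + [CLS]
z = np.random.randn(N, d)
W_q, W_k, W_v = (np.random.randn(d, d) * 0.02 for _ in range(3))
out = attention(z @ W_q, z @ W_k, z @ W_v)
print(out.shape)                                      # (197, 64): every token attends to all others
```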

🔀

Global Receptive Field

Every patch attends to every other patch from the very first layer — unlike CNNs which build receptive field gradually through depth.

📍

Position Encoding

ViT uses learned position embeddings; the sinusoidal alternative is $PE(pos,2i) = \sin(pos/10000^{2i/d})$, $PE(pos,2i+1) = \cos(pos/10000^{2i/d})$.

Complexity

Self-attention is $O(N^2 d)$ — quadratic in sequence length. For high-res images: window attention (Swin), linear attention, or token reduction.

🏆

DINO / DINOv2

Self-supervised ViTs learn powerful visual features. Attention maps reveal semantic segments without any segmentation labels.

08 — 3D Reconstruction

Structure from Motion
& Neural Radiance Fields

Structure from Motion (SfM)

Recovers 3D structure and camera poses from a set of 2D images. The pipeline: feature extraction → matching → pose estimation via bundle adjustment:

$$\min_{\{R_i, \mathbf{t}_i\}, \{\mathbf{X}_j\}} \sum_{i} \sum_{j \in \mathcal{V}(i)} \rho\!\left(\left\|\mathbf{x}_{ij} - \pi(R_i \mathbf{X}_j + \mathbf{t}_i, K)\right\|^2\right)$$

where $\pi(\cdot)$ is the perspective projection function, $\rho$ is a robust loss (e.g., Huber), and the optimization is over all camera poses and 3D point positions simultaneously.

Neural Radiance Fields (NeRF)

NeRF (Mildenhall et al., 2020) represents a scene as a continuous 5D function $(\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$ — mapping a 3D position $\mathbf{x}$ and viewing direction $\mathbf{d}$ to color $\mathbf{c}$ and volume density $\sigma$:

$$F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma), \quad F_\Theta \text{ is an MLP}$$ $$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt$$ $$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

$T(t)$ is the transmittance — probability that the ray travels from $t_n$ to $t$ without hitting anything. Discrete approximation:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \quad T_i = \exp\!\left(-\sum_{j<i} \sigma_j \delta_j\right), \quad \delta_i = t_{i+1} - t_i$$

NeRF is trained purely with a photometric loss $\mathcal{L} = \sum_\mathbf{r} \|\hat{C}(\mathbf{r}) - C(\mathbf{r})\|_2^2$ — no 3D supervision needed, only posed 2D images.
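A sketch of the discrete rendering sum for a single ray, with made-up densities and colors standing in for the MLP outputs:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """C_hat = sum_i T_i (1 - exp(-sigma_i delta_i)) c_i along one ray."""
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j): transmittance up to (not including) sample i
    T = np.exp(-np.cumsum(np.concatenate([[0.0], sigmas[:-1] * deltas[:-1]])))
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0)

sigmas = np.random.rand(64) * 5.0      # densities at 64 samples along the ray
colors = np.random.rand(64, 3)         # RGB at each sample
deltas = np.full(64, 0.05)             # spacing between samples
print(render_ray(sigmas, colors, deltas))   # estimated pixel color, shape (3,)
```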

Positional encoding is key to NeRF's success: the MLP input is lifted to high-frequency space via $\gamma(\mathbf{x}) = (\sin(2^0\pi\mathbf{x}), \cos(2^0\pi\mathbf{x}), \ldots, \sin(2^{L-1}\pi\mathbf{x}), \cos(2^{L-1}\pi\mathbf{x}))$, enabling the network to represent high-frequency geometry and appearance.
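The encoding itself is a small function. A sketch with $L = 10$ frequency bands per coordinate (NeRF's choice for positions):

```python
import numpy as np

def positional_encoding(x, L=10):
    """gamma(x): concatenate sin(2^l pi x) and cos(2^l pi x) for l = 0 .. L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi              # 2^0 pi, ..., 2^(L-1) pi
    angles = x[..., None] * freqs                      # (..., dim, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., dim * 2L)

x = np.array([[0.2, -0.5, 0.7]])                       # one 3-D point
print(positional_encoding(x).shape)                    # (1, 60)
```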

∞ — The Full Picture

Computer Vision
Taxonomy

📷
Image
Capture
🔍
Preprocessing
& Filtering
📌
Feature
Extraction
🧠
Deep
Learning
🎯
Perception
Output
🏷️

Classification

Assign single class label to image. Output: $p(c|I)$.

📦

Detection

Locate + classify multiple objects. Output: $\{(bbox_i, c_i)\}$.

🎭

Segmentation

Per-pixel classification. Semantic, instance, or panoptic.

🗺️

Depth Estimation

Monocular or stereo depth prediction. Output: $Z(u,v)$.

🏃

Pose Estimation

Estimate body/object pose. 2D keypoints or 6-DoF pose.

🎬

Video Analysis

Tracking, action recognition, optical flow, temporal modeling.

🌐

3D Reconstruction

SfM, MVS, NeRF, 3DGS — scene representation from images.

🖼️

Generation

Image synthesis: GAN, diffusion models (DDPM, DDIM, latent diffusion).