A comprehensive, interactive exploration of the mathematics, algorithms, and deep learning architectures that enable computers to understand visual information.
A digital image is fundamentally a 2D function $f(x, y)$ where $(x,y)$ are spatial coordinates and the value $f$ represents intensity (for grayscale) or a vector of intensities (for color). This function is sampled at discrete pixel locations and quantized to a finite number of intensity levels.
Raw sensor data arrives in the Bayer mosaic format: a grid of Red, Green, and Blue photosites in a 1:2:1 (R:G:B) ratio, the RGGB pattern, with green sampled twice as densely because the eye is most sensitive to it. Demosaicing algorithms reconstruct a full-color image from this mosaic. Different color spaces have different mathematical relationships:
The continuous scene is sampled spatially (pixels) and in intensity (quantization). By the Nyquist–Shannon theorem, we must sample at least twice the highest spatial frequency to avoid aliasing:
Number of pixels per unit length. Higher resolution recovers finer spatial frequencies. Modern sensors: 12–50 MP.
$b$ bits → $2^b$ intensity levels. 8-bit: 256 levels. 16-bit: 65,536 levels (medical/scientific imaging).
Ratio of max to min luminance $= L_{\max}/L_{\min}$. Human eye: ~$10^5$:1. Typical camera: ~$10^3$:1.
Photon shot noise follows Poisson statistics: $\text{SNR} = \sqrt{N}$ for $N$ collected photons, which dominates at low light. Read noise and thermal noise add roughly Gaussian $\mathcal{N}(0, \sigma^2)$ contributions.
Drag the sliders to see how spatial resolution (sampling rate) and bit depth (quantization levels) affect image quality.
At low resolution, aliasing (staircase artifacts on diagonals/circles) becomes visible. At low bit depth, banding appears in smooth gradients.
The total storage cost of an image is:
$\text{Size} = W \times H \times c \times b$ bits
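As a quick sanity check, here is a minimal Python sketch of this formula; the 1080p RGB example is illustrative rather than taken from the text.

```python
def image_size_bytes(width, height, channels, bit_depth):
    """Size = W x H x c x b bits, returned in bytes."""
    return width * height * channels * bit_depth // 8

# Illustrative example: an uncompressed 8-bit RGB image at 1920x1080
print(image_size_bytes(1920, 1080, 3, 8))  # 6220800 bytes, about 6.2 MB
```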
Spatial filtering is the foundation of classical computer vision. The core operation is discrete convolution — sliding a small kernel $K$ over the image $I$ and computing weighted sums at each location.
In practice (and in deep learning), what's called "convolution" is actually cross-correlation, where the kernel is not flipped:
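A minimal NumPy sketch of this operation as deep-learning libraries typically implement it (valid padding, stride 1); flipping the kernel before the loop would turn it into true convolution.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and take weighted sums (no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# True convolution flips the kernel first:
# convolution(image, kernel) == cross_correlate2d(image, kernel[::-1, ::-1])
```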
The isotropic 2D Gaussian is the only filter that is simultaneously separable, isotropic, and achieves the theoretical minimum in the uncertainty principle (optimal joint localization in space and frequency):
Its separability means the 2D convolution decomposes into two 1D passes — reducing complexity from $O(k^2)$ to $O(2k)$ per pixel:
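A sketch of the separable implementation in NumPy, assuming the sampled kernel is truncated at roughly $3\sigma$.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled 1D Gaussian, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)           # truncation radius is an assumption
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def gaussian_blur_separable(image, sigma):
    """Two 1D passes (along rows, then columns) instead of one 2D convolution."""
    g = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, g, mode='same')
    return np.apply_along_axis(np.convolve, 0, rows, g, mode='same')
```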
Edges correspond to regions of rapid intensity change — high image gradient magnitude. The Sobel operator approximates the image gradient using finite differences:
The Canny detector (1986) is optimal under three criteria: good detection, good localization, and single response. Its pipeline:
Non-maximum suppression thins edges to single-pixel width by suppressing gradient values that are not local maxima along the gradient direction $\theta$. Hysteresis uses two thresholds $T_{\text{high}} > T_{\text{low}}$: strong edges $>T_{\text{high}}$ are definite; weak edges in $[T_{\text{low}}, T_{\text{high}}]$ are kept only if connected to a strong edge.
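The gradient magnitude and direction that feed non-maximum suppression can be computed with a short SciPy sketch; the symmetric boundary handling is an arbitrary choice.

```python
import numpy as np
from scipy.signal import correlate2d

# Sobel kernels: finite-difference approximations of dI/dx and dI/dy
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_gradients(image):
    """Gradient magnitude and direction theta, the inputs to NMS and hysteresis."""
    gx = correlate2d(image, SOBEL_X, mode='same', boundary='symm')
    gy = correlate2d(image, SOBEL_Y, mode='same', boundary='symm')
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```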
Draw on the left canvas, then apply different kernels. The kernel matrix shows the weights used at each pixel position.
Each output pixel = weighted sum of neighborhood
Feature detection identifies stable, distinctive points in an image that can be reliably re-detected across viewpoint and illumination changes — fundamental for image matching, 3D reconstruction, and tracking.
Harris & Stephens (1988) formalized the intuition that corners have large intensity variation in all directions. The structure tensor (second moment matrix) $M$ captures this:
where $I_x = \partial I/\partial x$, $I_y = \partial I/\partial y$ are image gradients and $w(x,y)$ is a Gaussian weighting window. The eigenvalues $\lambda_1, \lambda_2$ of $M$ classify the region:
$\lambda_1 \approx \lambda_2 \approx 0$ (flat region): no significant gradient in any direction.
$\lambda_1 \gg \lambda_2 \approx 0$ (edge): large gradient in one direction only.
$\lambda_1 \approx \lambda_2 \gg 0$ (corner): large gradient in all directions.
Rather than computing eigenvalues directly, the Harris response function uses the trace and determinant:
Typical value: $k \in [0.04, 0.06]$. Harris corners are not scale-invariant — this led to the development of SIFT.
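A compact SciPy sketch of the response map; the window $\sigma$ and $k$ below are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.5, k=0.05):
    """R = det(M) - k * trace(M)^2 from Gaussian-weighted gradient products."""
    Ix = sobel(image, axis=1)                 # dI/dx
    Iy = sobel(image, axis=0)                 # dI/dy
    # Entries of the structure tensor M, weighted by a Gaussian window w(x, y)
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det_M = Sxx * Syy - Sxy**2
    trace_M = Sxx + Syy
    return det_M - k * trace_M**2             # large positive R indicates a corner
```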
Lowe (2004) addressed scale invariance by detecting keypoints in scale-space — a family of progressively blurred images parameterized by $\sigma$:
Extrema of the Difference-of-Gaussians (DoG) across scale and space approximate the Laplacian of Gaussian — a blob detector that finds characteristic scales. Each keypoint's descriptor is a 128-dimensional histogram of gradient orientations, normalized for robustness:
Given two sets of SIFT descriptors, matches are found via nearest-neighbor with the Lowe ratio test: accept match if $d_1 / d_2 < 0.8$. Outliers (mismatches) are then removed using RANSAC:
where $p$ = desired probability of success, $\epsilon$ = inlier ratio, $s$ = minimum sample size (e.g. 4 for homography)
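The required trial count follows directly; a small sketch of $N = \lceil \log(1-p) / \log(1-\epsilon^s) \rceil$, treating $\epsilon$ as the inlier fraction as defined above.

```python
import math

def ransac_iterations(p=0.99, inlier_ratio=0.5, sample_size=4):
    """Trials needed to draw at least one all-inlier sample with probability p."""
    return math.ceil(math.log(1 - p) / math.log(1 - inlier_ratio**sample_size))

print(ransac_iterations(0.99, 0.5, 4))  # 72 iterations for a homography
```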
The fundamental model of image formation: a 3D world point $\mathbf{X} = (X,Y,Z)^T$ projects to image point $\mathbf{x} = (x,y)^T$ through the camera center (pinhole). In homogeneous coordinates:
The intrinsic matrix $K$ encodes: focal lengths $(f_x, f_y)$ in pixels, principal point $(c_x, c_y)$, and skew $s \approx 0$ for modern cameras. The extrinsic $[R|\mathbf{t}]$ defines the camera's pose in the world.
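A minimal sketch of the full projection for a batch of points; the intrinsic values are placeholders, not taken from the text.

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Pinhole projection of Nx3 world points: x ~ K [R | t] X."""
    X_cam = X_world @ R.T + t              # world -> camera frame (extrinsics)
    x_hom = X_cam @ K.T                    # apply intrinsics
    return x_hom[:, :2] / x_hom[:, 2:3]    # perspective divide by depth

# Placeholder intrinsics: fx = fy = 800 px, principal point (320, 240), zero skew
K = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])
print(project_points(np.array([[0., 0., 5.]]), K, np.eye(3), np.zeros(3)))  # [[320. 240.]]
```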
Real lenses deviate from the pinhole model. Radial distortion is most significant:
$k_1 > 0$: barrel distortion. $k_1 < 0$: pincushion distortion.
When the scene is planar (or the camera rotates in place), points relate by a homography $H$, a $3\times 3$ projective transformation:
A stereo camera rig with known baseline $b$ can recover depth from the disparity $d$ between corresponding points:
The epipolar constraint encodes the geometry between two uncalibrated views. For a world point $\mathbf{X}$, its projections $\mathbf{x}, \mathbf{x}'$ in two cameras satisfy:
Adjust the camera's focal length and pose to see how a 3D cube projects onto the image plane.
Notice: longer focal length → more telephoto/compressed. Moving camera ≠ rotating object in 3D.
The Convolutional Neural Network (CNN) revolutionized computer vision by learning hierarchical feature representations directly from data. Unlike hand-crafted features, CNNs learn what to look for.
A conv layer applies $K$ learned filters to the input feature map. With input $\mathbf{X} \in \mathbb{R}^{H \times W \times C_{in}}$ and filter $\mathbf{W}^{(k)} \in \mathbb{R}^{F \times F \times C_{in}}$:
where $P$ = padding, $S$ = stride. Parameter count: $K \cdot (F^2 \cdot C_{in} + 1)$ — much less than a fully connected layer.
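A quick sketch of the bookkeeping; symbols follow the formula above and the 224×224 example is illustrative.

```python
def conv2d_shape_and_params(H, W, C_in, K, F, S=1, P=0):
    """Output size floor((H - F + 2P)/S) + 1 and parameter count K*(F^2*C_in + 1)."""
    H_out = (H - F + 2 * P) // S + 1
    W_out = (W - F + 2 * P) // S + 1
    params = K * (F * F * C_in + 1)        # +1 for each filter's bias
    return (H_out, W_out, K), params

# 64 filters of size 3x3 on a 224x224x3 input, stride 1, padding 1:
print(conv2d_shape_and_params(224, 224, 3, 64, 3, S=1, P=1))  # ((224, 224, 64), 1792)
```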
| Layer Type | Operation | Purpose | Key Param |
|---|---|---|---|
| Conv2D | $\mathbf{X} * \mathbf{W} + b$ | Feature extraction | Filter size, stride |
| MaxPool | $\max_{(m,n)\in R} \mathbf{X}[i+m,j+n]$ | Spatial downsampling, translation invariance | Pool size $k$ |
| AvgPool | $\frac{1}{k^2}\sum_{(m,n)} \mathbf{X}[i+m,j+n]$ | Smoother downsampling | Pool size $k$ |
| BatchNorm | $\hat{x} = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}; \; y = \gamma\hat{x}+\beta$ | Training stability, regularization | $\gamma, \beta$ (learned) |
| Dropout | $\mathbf{z} = \mathbf{x} \odot \text{Bernoulli}(1-p)/(1-p)$ | Regularization, prevents co-adaptation | Drop probability $p$ |
| Softmax | $p_k = e^{z_k}/\sum_j e^{z_j}$ | Classification output (probability distribution) | Temperature $T$ |
| Architecture | Year | Innovation | Top-5 Error (ImageNet) |
|---|---|---|---|
| AlexNet | 2012 | Deep CNN + ReLU + Dropout + GPU training | 15.3% |
| VGGNet | 2014 | Very deep (16-19 layers) with 3×3 convolutions only | 7.3% |
| GoogLeNet/Inception | 2014 | Inception modules: parallel multi-scale convolutions | 6.7% |
| ResNet | 2015 | Residual connections: $\mathbf{y} = \mathcal{F}(\mathbf{x},\{W_i\}) + \mathbf{x}$ | 3.57% |
| DenseNet | 2016 | Dense connections: each layer connected to all subsequent | ~3% |
| EfficientNet | 2019 | Compound scaling of depth/width/resolution | 1.8% |
| ViT | 2020 | Pure transformer on image patches | <1.5% |
ResNet's skip connection solves the vanishing gradient problem: during backpropagation, gradients flow directly through identity connections, enabling training of networks with 1000+ layers.
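A minimal PyTorch sketch of an identity-shortcut block; this illustrates the pattern rather than reproducing the exact configuration from the ResNet paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x, {W_i}) + x with two 3x3 convolutions in the residual branch."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # gradients flow unchanged through the identity path
```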
Training minimizes the cross-entropy loss via stochastic gradient descent with backpropagation:
Backprop through a conv layer computes three gradients via the chain rule:
where $\delta^{(k)} = \frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(k)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{Y}^{(k)}} \odot \mathbf{1}[\mathbf{Z}^{(k)} > 0]$ (ReLU gradient).
Modern optimizers like Adam adapt per-parameter learning rates:
Normalization layers are critical for training stability. Different methods normalize across different axes of the $(N, C, H, W)$ tensor:
BatchNorm is best for large batches; LayerNorm is preferred in transformers; GroupNorm is used when batch sizes are small (detection, segmentation).
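The methods differ only in which axes the statistics are computed over; a NumPy sketch for an $(N, C, H, W)$ tensor, ignoring the learned $\gamma, \beta$ and running statistics.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 64, 32, 32)     # (N, C, H, W)
bn = normalize(x, axes=(0, 2, 3))      # BatchNorm: per channel, across the batch
ln = normalize(x, axes=(1, 2, 3))      # LayerNorm: per sample, across C, H, W

def group_norm(x, groups, eps=1e-5):
    """GroupNorm: split C into groups, normalize each (group, H, W) slice per sample."""
    N, C, H, W = x.shape
    xg = x.reshape(N, groups, C // groups, H, W)
    return normalize(xg, axes=(2, 3, 4), eps=eps).reshape(N, C, H, W)
```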
Object detection simultaneously classifies and localizes multiple objects: output is a set of bounding boxes $(x, y, w, h)$ with class probabilities.
The standard metric for bounding box quality:
YOLO (You Only Look Once) divides the image into an $S \times S$ grid. Each cell predicts $B$ bounding boxes and $C$ class probabilities simultaneously in a single forward pass:
Detection models produce many overlapping predictions. NMS removes redundant boxes:
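A minimal sketch of IoU and greedy NMS over axis-aligned boxes given as (x1, y1, x2, y2); this corner parameterization is chosen for brevity and differs from the (x, y, w, h) output format above.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```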
Assigns a class label to every pixel. FCN (Fully Convolutional Networks) and U-Net use encoder-decoder architectures with skip connections:
Optical flow estimates per-pixel motion between frames, assuming brightness constancy:
Note: $M$ is exactly the Harris structure tensor! This reveals a deep connection between corner detection and motion estimation.
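A sketch of the Lucas-Kanade step at a single pixel, solving $M\mathbf{v} = \mathbf{b}$ from the same gradient products that build the Harris tensor; the window size is an illustrative choice and $M$ is assumed well-conditioned.

```python
import numpy as np

def lucas_kanade_at(Ix, Iy, It, y, x, win=7):
    """Flow (u, v) at pixel (x, y) from spatial gradients Ix, Iy and temporal difference It."""
    r = win // 2
    wx = Ix[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wy = Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wt = It[y - r:y + r + 1, x - r:x + r + 1].ravel()
    M = np.array([[wx @ wx, wx @ wy],
                  [wx @ wy, wy @ wy]])      # the Harris structure tensor
    b = -np.array([wx @ wt, wy @ wt])
    return np.linalg.solve(M, b)            # fails where M is near-singular (flat regions)
```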
The Vision Transformer (ViT) treats an image as a sequence of patches and applies the transformer architecture. An image $I \in \mathbb{R}^{H \times W \times C}$ is split into $N = HW/P^2$ patches of size $P \times P$:
The $\sqrt{d_k}$ scaling prevents dot products from growing large and pushing softmax into saturation regions with near-zero gradients.
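A minimal single-head NumPy sketch of scaled dot-product attention over the patch sequence, with no masking and the learned projections omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for (N, d_k) queries/keys and (N, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (N, N) patch-to-patch similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # each patch mixes information from all patches
```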
Every patch attends to every other patch from the very first layer — unlike CNNs which build receptive field gradually through depth.
ViT uses learned position embeddings; the sinusoidal alternative is $PE(pos,2i) = \sin(pos/10000^{2i/d})$, $PE(pos,2i+1) = \cos(pos/10000^{2i/d})$.
Self-attention is $O(N^2 d)$ — quadratic in sequence length. For high-res images: window attention (Swin), linear attention, or token reduction.
Self-supervised ViTs learn powerful visual features. Attention maps reveal semantic segments without any segmentation labels.
Recovers 3D structure and camera poses from a set of 2D images. The pipeline: feature extraction → matching → pose estimation via bundle adjustment:
where $\pi(\cdot)$ is the perspective projection function, $\rho$ is a robust loss (e.g., Huber), and the optimization is over all camera poses and 3D point positions simultaneously.
NeRF (Mildenhall et al., 2020) represents a scene as a continuous 5D function $(\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$ — mapping a 3D position $\mathbf{x}$ and viewing direction $\mathbf{d}$ to color $\mathbf{c}$ and volume density $\sigma$:
$T(t)$ is the transmittance — probability that the ray travels from $t_n$ to $t$ without hitting anything. Discrete approximation:
NeRF is trained purely with a photometric loss $\mathcal{L} = \sum_\mathbf{r} \|\hat{C}(\mathbf{r}) - C(\mathbf{r})\|_2^2$ — no 3D supervision needed, only posed 2D images.
Positional encoding is key to NeRF's success: the MLP input is lifted to high-frequency space via $\gamma(\mathbf{x}) = (\sin(2^0\pi\mathbf{x}), \cos(2^0\pi\mathbf{x}), \ldots, \sin(2^{L-1}\pi\mathbf{x}), \cos(2^{L-1}\pi\mathbf{x}))$, enabling the network to represent high-frequency geometry and appearance.
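A sketch of $\gamma(\cdot)$ applied componentwise to a batch of 3D positions; $L$, the number of frequency bands, is a hyperparameter.

```python
import numpy as np

def positional_encoding(x, L):
    """gamma(x) = (sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^{L-1} pi x), cos(2^{L-1} pi x))."""
    freqs = (2.0 ** np.arange(L)) * np.pi        # 2^0 pi, ..., 2^{L-1} pi
    angles = x[..., None] * freqs                # (..., 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)        # each 3D point becomes a 3 * 2L vector

print(positional_encoding(np.array([[0.1, 0.2, 0.3]]), L=10).shape)  # (1, 60)
```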
Assign single class label to image. Output: $p(c|I)$.
Locate + classify multiple objects. Output: $\{(\text{bbox}_i, c_i)\}$.
Per-pixel classification. Semantic, instance, or panoptic.
Monocular or stereo depth prediction. Output: $Z(u,v)$.
Estimate body/object pose. 2D keypoints or 6-DoF pose.
Tracking, action recognition, optical flow, temporal modeling.
SfM, MVS, NeRF, 3DGS — scene representation from images.
Image synthesis: GAN, diffusion models (DDPM, DDIM, latent diffusion).