Computer Vision — Comprehensive Study Notes

🖼️ Overview

Computer vision studies how to make computers see and understand images and videos. It combines image formation, signal processing, geometry, and machine learning to solve tasks like detection, recognition, tracking, and 3D reconstruction. These notes summarize core ideas and practical points you need to master.

📸 Image Formation & Camera Models

A simple model is the pinhole camera: a 3D point X maps to an image point x by projection through the camera center. In homogeneous coordinates, x = K [R | t] X, where K is the camera intrinsic matrix and R, t are the extrinsics (rotation and translation). Real cameras add lens distortion (radial/tangential) that must be modeled or corrected.
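The projection above can be sketched in a few lines of NumPy. The intrinsics and the 3D point here are hypothetical values chosen for illustration (focal length 800 px, principal point at (320, 240), camera at the world origin):

```python
import numpy as np

# Hypothetical intrinsics/extrinsics, for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # camera aligned with world axes
t = np.array([0.0, 0.0, 0.0])      # camera at the world origin

def project(X, K, R, t):
    """Project a 3D world point X to pixel coordinates via x = K [R | t] X."""
    Xc = R @ X + t                 # world frame -> camera frame
    x = K @ Xc                     # camera frame -> homogeneous pixels
    return x[:2] / x[2]            # dehomogenize

X = np.array([0.1, -0.05, 2.0])    # a point 2 m in front of the camera
print(project(X, K, R, t))         # pixel coordinates [360. 220.]
```

Distortion correction would be applied to the pixel coordinates after this ideal projection.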

🎨 Color & Image Representation

Images are arrays of pixels in RGB, but alternative spaces like HSV or YCbCr separate intensity from color and help tasks such as segmentation or compression. Grayscale is often used for algorithmic simplicity.

🧮 Sampling, Quantization & Noise

Images are discrete samples of a continuous scene. Aliasing occurs if sampling is too coarse. Quantization limits intensity precision. Real sensors introduce noise (shot, thermal) — denoise with smoothing or learned models.

🔍 Convolution & Linear Filters

Convolution with a kernel is the basic linear operation: smoothing, sharpening, and derivative filters are all convolutions. A small kernel like Gaussian smooths; derivative kernels (Sobel) approximate gradients. Convolution is linear and shift-invariant; separable filters (e.g., Gaussian) are faster.
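Separability is easy to verify numerically: filtering with the outer-product kernel g gᵀ gives the same result as two 1D passes. A minimal sketch (naive "valid" correlation, a binomial approximation of a Gaussian):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D cross-correlation, for illustration only."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# 1D smoothing kernel (binomial, approximates a Gaussian) and its 2D outer product.
g = np.array([1.0, 2.0, 1.0]) / 4.0
G = np.outer(g, g)                  # separable: G = g g^T

rng = np.random.default_rng(0)
img = rng.random((6, 6))

full = conv2d_valid(img, G)               # one 2D pass: O(k^2) per pixel
rows = conv2d_valid(img, g[None, :])      # horizontal 1D pass
sep = conv2d_valid(rows, g[:, None])      # then vertical 1D pass: O(2k) per pixel
print(np.allclose(full, sep))             # True: the two routes agree
```

For a k×k kernel this turns k² multiply-adds per pixel into 2k, which is why separable Gaussians are the default in practice.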

⚡ Frequency Domain & Fourier

Fourier transform decomposes an image into frequencies. Low frequencies encode smooth content; high frequencies encode edges/detail. Filtering can be done in frequency domain to remove noise or enhance detail.
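A low-pass filter in the frequency domain is: FFT, zero out high frequencies, inverse FFT. A sketch on a random image (the radius-6 disc mask is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))

F = np.fft.fftshift(np.fft.fft2(img))              # move DC to the center
yy, xx = np.mgrid[:32, :32]
mask = (xx - 16) ** 2 + (yy - 16) ** 2 <= 6 ** 2   # keep a radius-6 disc of low frequencies
low = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# The filtered image keeps the mean (the DC term survives the mask)
# but loses high-frequency detail.
print(np.isclose(low.mean(), img.mean()))          # True
```

The same idea with a soft (e.g. Gaussian) mask avoids the ringing artifacts that a hard cutoff introduces.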

✂️ Image Gradients & Edge Detection

Image gradients measure change in intensity; their magnitude and direction are the raw material for edge detection. The Canny edge detector is a reliable pipeline: smoothing → gradient → non-maximum suppression → hysteresis thresholding. Non-maximum suppression thins responses to single-pixel-wide edges.
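The gradient stage of that pipeline can be sketched with Sobel kernels (naive "valid" correlation, illustration only; a vertical step edge gives a horizontal gradient):

```python
import numpy as np

# Sobel kernels approximate horizontal and vertical derivatives.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def gradients(img):
    """Gradient magnitude and direction via Sobel (naive 'valid' correlation)."""
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros_like(gx)
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i+3, j:j+3]
            gx[i, j] = np.sum(patch * SOBEL_X)
            gy[i, j] = np.sum(patch * SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# A vertical step edge: the gradient should point along +x.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
mag, ang = gradients(img)
print(mag[1, :])   # [0. 4. 4.] -- response only at the step
```

Canny would then keep, per pixel, only local maxima of `mag` along the direction `ang`, and link the survivors with two thresholds.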

🔎 Corners & Interest Points

Corners are points with intensity change in two directions. Algorithms: Harris (response from gradients), Shi–Tomasi, and FAST (fast binary test). Good interest points are repeatable under viewpoint/illumination changes.

🧩 Feature Descriptors

Descriptors convert local patches into vectors for matching. Classic descriptors: SIFT (scale-invariant, robust), SURF, ORB (binary, fast), BRIEF. SIFT includes scale-space detection and a histogram-of-gradients descriptor.

🔗 Matching & Verification

Match descriptors using distances (Euclidean for floats, Hamming for binary). Use ratio test (e.g., Lowe's) to reject ambiguous matches. Always verify matches geometrically (see RANSAC).
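Lowe's ratio test can be sketched as follows; the descriptors are toy 2D vectors (real ones are 128-D for SIFT), and 0.75 is a commonly used ratio:

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """Keep a match only if the best distance is clearly smaller
    than the second-best distance (Lowe's ratio test)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)    # Euclidean distances
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best < ratio * second:
            matches.append((i, int(order[0])))
    return matches

# Toy descriptors (hypothetical): query 0 has one clear match,
# query 1 is ambiguous (two equally good candidates) and is rejected.
d1 = np.array([[1.0, 0.0], [0.0, 1.0]])
d2 = np.array([[1.0, 0.1], [5.0, 5.0], [0.0, 1.0], [0.0, 1.0]])
print(ratio_test_matches(d1, d2))   # [(0, 0)]
```

For binary descriptors (ORB, BRIEF) the same logic applies with Hamming distance in place of the Euclidean norm.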

🔁 Robust Estimation: RANSAC

RANSAC fits a model (e.g., homography, fundamental matrix) by iteratively sampling minimal sets, fitting, and counting inliers. It tolerates many outliers. Tune iterations and inlier threshold based on noise and expected outlier fraction.
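The RANSAC loop is the same for any model; a minimal sketch for the simplest case, a 2D line (minimal set = 2 points), with synthetic data:

```python
import numpy as np

def ransac_line(pts, n_iters=200, thresh=0.1, seed=0):
    """RANSAC for a 2D line: sample a minimal set, fit, count inliers, keep the best.
    Homographies (4 points) and fundamental matrices follow the same loop."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        p, q = pts[i], pts[j]
        d = q - p
        norm = np.hypot(*d)
        if norm == 0:
            continue
        u = d / norm
        # Perpendicular point-to-line distance via the 2D cross product.
        dists = np.abs((pts[:, 0] - p[0]) * u[1] - (pts[:, 1] - p[1]) * u[0])
        inliers = dists < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# 20 points on y = 2x plus 5 gross outliers.
x = np.linspace(0, 1, 20)
line_pts = np.stack([x, 2 * x], axis=1)
outliers = np.array([[0.5, 5.0], [0.2, -3.0], [0.9, 9.0], [0.1, 4.0], [0.7, -2.0]])
pts = np.vstack([line_pts, outliers])
inliers = ransac_line(pts)
print(int(inliers.sum()))   # 20: all line points recovered, outliers rejected
```

In practice you would refit the model on all inliers afterwards; for geometry the minimal set size changes (4 for a homography, 7 or 8 for F) but nothing else does.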

🧭 Homography & Planar Transformations

A homography maps points between two views of a planar scene (or between any two images taken under pure camera rotation): x' ~ H x, where ~ denotes equality up to scale in homogeneous coordinates. Estimate H from at least 4 point correspondences (via DLT) and refine with all inliers.
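Applying an estimated H is just homogeneous matrix multiplication plus dehomogenization. A sketch with a hypothetical H (a 90° rotation plus translation written as a homography):

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2D points through a homography using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # lift to homogeneous
    mapped = pts_h @ H.T                               # x' ~ H x, row-wise
    return mapped[:, :2] / mapped[:, 2:3]              # dehomogenize

# Hypothetical homography: rotate 90 degrees, then translate by (2, 0).
H = np.array([[0.0, -1.0, 2.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
print(apply_homography(H, np.array([[1.0, 0.0]])))   # [[2. 1.]]
```

The division by the third coordinate is what makes perspective (non-affine) homographies work; forgetting it is a classic bug.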

🎯 Epipolar Geometry & Stereo

Two-camera geometry relates corresponding points via the fundamental matrix F: x'ᵀ F x = 0. If the cameras are calibrated, use the essential matrix E (which depends only on R, t): x'ᵀ E x = 0, with x, x' in normalized coordinates. Epipolar constraints reduce matching from a 2D search to a 1D search along epipolar lines.

🔍 Stereo Depth & Triangulation

Given correspondences and known camera poses, triangulate 3D points by intersecting rays. Depth accuracy depends on baseline and measurement noise. Rectified stereo simplifies correspondence to horizontal search.
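Linear (DLT) triangulation stacks two rows per view into A X = 0 and solves by SVD. A sketch with two hypothetical cameras separated by a unit baseline along x:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: each view contributes two rows of A X = 0;
    the solution is the right singular vector for the smallest singular value."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]            # dehomogenize

# Two hypothetical cameras (K = I for simplicity): identity pose,
# and one translated by a unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# Project a known 3D point into both views, then recover it.
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))   # recovers [0.5 0.2 4. ]
```

With noisy correspondences the rays no longer intersect exactly, and the SVD solution minimizes an algebraic (not reprojection) error; bundle adjustment fixes that later.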

🧭 Camera Calibration & PnP

Calibration estimates K and distortion using images of a known pattern (chessboard). PnP (Perspective-n-Point) recovers camera pose from 3D-to-2D correspondences; common solvers include EPnP and iterative refinements.

📈 Structure from Motion (SfM)

SfM reconstructs scene structure and camera motion from multiple images. Steps: feature detection → matching → robust pairwise geometry → incremental bundle adjustment. Bundle adjustment jointly optimizes camera parameters and 3D points to minimize reprojection error.

🧭 Optical Flow & Motion Estimation

Optical flow estimates per-pixel motion between frames. Classical methods: Lucas–Kanade (local, assumes small motion, solves linearized equations) and Horn–Schunck (global smoothness). Modern methods use deep networks (e.g., RAFT).

🎯 Object Detection & Localization

Detect objects and output bounding boxes. Classical pipeline: sliding windows + features + classifier (HOG + SVM). Modern detectors are two-stage (R-CNN family: region proposals, then classification) or one-stage (YOLO, SSD) for speed. Evaluate with IoU and mean Average Precision (mAP).
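IoU for axis-aligned boxes is a few lines; boxes here use the common (x1, y1, x2, y2) corner convention:

```python
def iou(a, b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # 0 if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two unit-overlap 2x2 boxes: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 0.142857...
```

A detection typically counts as correct when IoU with a ground-truth box exceeds a threshold (0.5 is a common choice); mAP averages precision over recall levels and classes.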

🧠 Image Classification & CNNs

Convolutional neural networks (CNNs) learn hierarchical features. Key layers: convolution, pooling, batch normalization, fully connected. Famous architectures: AlexNet, VGG, ResNet (residual connections). Training involves data augmentation, weight decay, and learning rate schedules to avoid overfitting.

🧩 Segmentation

Segmentation labels every pixel. Semantic segmentation assigns class labels; instance segmentation additionally separates object instances. Architectures: FCN, U-Net, DeepLab. Losses are often cross-entropy or Dice loss, the latter being more robust to class imbalance.
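The Dice coefficient on binary masks, 2|A∩B| / (|A| + |B|), can be sketched directly (1 − Dice is the loss; the epsilon guards against empty masks):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((4, 4), dtype=int); a[:2] = 1     # predicted mask: top half
b = np.zeros((4, 4), dtype=int); b[1:3] = 1    # ground truth: middle rows
print(float(dice_score(a, b)))                 # one shared row: 2*4/(8+8) = 0.5
```

In training, the same formula is used on soft (probability) masks so it stays differentiable, and it weights small foreground regions far more heavily than per-pixel cross-entropy does.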

🧠 Deep Features & Transfer Learning

Pretrained CNNs provide powerful features. Fine-tune for new tasks with limited data. Use feature extraction for SVM classifiers or end-to-end fine-tuning depending on data size.

🔎 Tracking

Tracking follows objects over time. Approaches: tracking-by-detection (detect each frame + associate), correlation-filter trackers (e.g., MOSSE), and learned trackers (Siamese networks). For state estimation, use Kalman filter (linear Gaussian) or particle filters (nonlinear/non-Gaussian).
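The Kalman filter's predict/update cycle can be sketched for a 1D constant-velocity tracker; the noise covariances Q and R here are assumed values for illustration:

```python
import numpy as np

# 1D constant-velocity model: state = [position, velocity], dt = 1.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # motion model
H = np.array([[1.0, 0.0]])               # we observe position only
Q = 1e-4 * np.eye(2)                     # process noise (assumed)
R = np.array([[0.25]])                   # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict + update cycle for a linear-Gaussian tracker."""
    # Predict: propagate the state and inflate uncertainty.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: blend prediction and measurement by the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.0, 5.1]:      # noisy positions, ~1 unit/frame
    x, P = kalman_step(x, P, np.array([z]))
print(x)   # estimated [position, velocity], close to [5, 1]
```

In tracking-by-detection, this filter predicts where a box should be in the next frame, which makes data association (matching detections to tracks) much more reliable.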

🏷️ Evaluation & Datasets

Common datasets: ImageNet (classification), COCO (detection/segmentation), KITTI (autonomous driving), Middlebury (stereo). Metrics: classification accuracy, precision/recall, IoU, mAP for detection, end-point error for flow.

🧰 Practical Tips & Tricks

  • Normalize images and input statistics consistently.
  • Use data augmentation (flip, crop, color jitter) to reduce overfitting.
  • Start with pretrained models and small learning rates.
  • Visualize filters, activations, and matches to debug.
  • Carefully handle coordinate conventions (pixel centers, homogeneous forms).

⚠️ Common Pitfalls

  • Forgetting camera distortion leads to bad geometry.
  • Mixing coordinate frames (image vs. normalized) causes errors in projection.
  • Overfitting small datasets without augmentation or regularization.
  • Relying on naive matching without geometric verification produces many false matches.

📐 Key Formulas (Quick Reference)

  • Pinhole projection: x = K [R | t] X (homogeneous coordinates).
  • Epipolar constraint: x'ᵀ F x = 0.
  • Homography: x' ~ H x (up to scale).
  • Reprojection error (bundle adjustment): minimize Σᵢ ||xᵢ − π(Pᵢ, Xᵢ)||², where π projects 3D to 2D.

✅ Final Notes

Understand both classical geometric methods and modern learning-based methods: geometry gives provable constraints and interpretability; learning gives robustness and end-to-end performance. Practice by implementing pipelines end-to-end: detect → describe → match → estimate geometry → refine. That experience ties theory to real-world behavior and common failure modes.
