New course launch: "Computer vision from scratch"

Machine Learning for practical computer vision


Feb 27, 2025

About 10 years ago, during my undergraduate studies at IIT Madras, I worked on an interesting problem: building a traditional (non-ML) algorithm for identifying rice grain varieties from various types of images. This was my first deep dive into machine vision research.

I used no ML, just pure math and logic applied to the outputs of traditional image filters for thresholding and edge detection. Our paper was accepted to the highly competitive International Conference on Machine Vision (ICMV), held in Nice, France. [Paper link]

While presenting at ICMV, one thing was very clear to us: the field was swiftly evolving. More and more researchers were abandoning purely rule-based vision techniques and gravitating toward Machine Learning, particularly deep neural networks. This shift signaled a major turning point in computer vision research.

In the decade since, deep learning has radically reshaped computer vision. Techniques that were once considered advanced (like hand-engineered features and classical algorithms) have given way to neural networks capable of learning powerful, generalized representations of images. This move has fueled breakthroughs in areas ranging from self-driving cars and facial recognition to medical imaging and robotics. While working on cellular-image processing at MIT, I had no doubt that I should use Deep Learning models.

After experiencing the initial wave of the deep learning revolution and watching it transform industries over the past ten years, I see a pressing need for a structured curriculum that covers computer vision end to end: its history, its foundations, and the modern techniques used in practical CV applications.

So I decided to create a full-blown course titled “Computer Vision from scratch”, which I will be releasing for free on Vizuara’s YouTube channel. The first part of the course starts from the very basics of image processing and covers the foundations of computer vision. The second part covers the practical aspects of computer vision, for folks who wish to transition their careers into the field.


Elon Musk has long been an advocate for camera-based vision systems in Tesla vehicles, arguing that they should operate much like humans do. Humans rely on their eyes to perceive depth and motion without external devices, so Musk believes that cars equipped with advanced neural networks and cameras can match or surpass human-level perception. This approach deliberately excludes LIDAR, which Musk has often described as unnecessary for achieving robust autonomy. By teaching the car to interpret visual cues the way we do, Tesla aims to replicate the natural behaviors of human drivers in an autonomous setting.


Whether or not future Tesla self-driving cars use LIDAR extensively, there could not be a better time to take a deep dive into computer vision. Computer vision is transforming industries like healthcare, manufacturing, and retail, enabling breakthroughs in medical diagnostics, quality control, and immersive user experiences. By automating the extraction of insights from visual data, it is changing everything from autonomous vehicles and robotics to surveillance and consumer applications, which makes now an ideal moment to learn it.

So let us start.

Course structure

PART 1: Modules 1-6 Computer Vision foundations

Module 1: Machine Learning for computer vision

Lecture 1: Rule-based vs ML-based approaches

Introduction to computer vision
Evolution from rule-based to ML approaches
Inception of AlexNet
Differences between ML and traditional programming
Deep learning use cases in computer vision

Module 2: Let us build some basic models

Lecture 2: Building a simple linear model (no activation function)

Defining the dataset
Reading images, scaling, and resizing
Using TensorFlow
Building a simple linear model for classification

Lecture 3: Building a simple Neural network (no convolution)

Difference between linear and non-linear models
Hidden layers and activation functions
Gradient descent, back-propagation and optimizers
Hyper-parameter tuning
Defining and training the neural network
Testing the neural network on custom dataset

Lecture 4: Overfitting

What is overfitting?
L1 and L2 penalties, dropout, early stopping
Using a validation dataset for monitoring
Balancing model complexity with dataset size

Module 3: Convolutional Neural Networks

Lecture 5: What is a Convolutional Neural Network?

Convolutional filters and local receptive fields
Parameter sharing for efficient image feature extraction
Historical development of CNN theory
Kernel size, stride, and padding
Max pooling versus average pooling
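
As a small preview of the kernel size, stride, and padding topic, here is a minimal sketch of the standard formula for the output size of a convolution or pooling layer along one spatial dimension. The function name and the example layer settings are illustrative assumptions, not material taken from the lecture itself.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial dimension.

    Uses the usual formula: floor((n + 2p - k) / s) + 1,
    where padding p is applied symmetrically on both sides.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    padded = input_size + 2 * padding
    if kernel_size > padded:
        raise ValueError("kernel does not fit inside the padded input")
    return (padded - kernel_size) // stride + 1


# A 3x3 kernel with stride 1 and padding 1 keeps a 32-pixel side unchanged.
same_size = conv_output_size(32, kernel_size=3, stride=1, padding=1)

# A 2x2 pooling window with stride 2 halves the side length.
pooled_size = conv_output_size(32, kernel_size=2, stride=2)
```

The same formula applies to pooling layers, which is why a 2x2 max pool with stride 2 halves each spatial dimension.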

Lecture 6: Historical CNN architectures

AlexNet breakthrough
Key ideas behind VGG and Inception
Role of competition datasets like ImageNet

Lecture 7: Deeper networks – ResNet and DenseNet

Skip connections and residual blocks
Trade-offs in network depth versus width
Dense connectivity patterns
Efficiency and accuracy trade-offs

Lecture 8: Transfer learning and fine-tuning

Loading pre-trained models for feature extraction
Full fine-tuning versus partial freezing
Learning rate scheduling and differential rates

Module 4: Vision transformer

Lecture 9: Vision transformer theory

Why transformers can replace convolutions in some cases
Attention mechanisms in computer vision
Strengths and weaknesses of transformer-based architectures

Lecture 10: Vision transformer application

Practical steps for training a vision transformer model
Adapting existing transformer libraries and checkpoints
Comparing performance and efficiency to CNN approaches

Module 5: Object detection

Lecture 11: Intro to object detection

Bounding boxes and intersection over union
Classical detection vs deep learning methods
Overview of relevant datasets
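
To give a flavour of the intersection over union topic, here is a minimal sketch of IoU for two axis-aligned boxes. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples. That format and the function name are assumptions made for this illustration, since detection datasets use several different conventions.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this format is an
    assumption for the sketch, not a fixed standard.
    """
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1 means the boxes coincide exactly, and 0 means they do not overlap at all.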

Lecture 12: YOLO architecture and training

Real-time detection principles
Anchor boxes and label formats
Scaling YOLO for different domains

Lecture 13: RetinaNet and focal loss

Addressing class imbalance in detection
Single-stage detection improvements
Understanding the focal loss function
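
For readers curious about the focal loss before the lecture, here is a minimal sketch for a single binary prediction, written in plain Python rather than a deep learning framework. It follows the commonly cited form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and the clipping constant are assumptions for this illustration, and the defaults gamma = 2 and alpha = 0.25 are the values reported in the RetinaNet paper.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class.
    y: ground-truth label, 1 (positive) or 0 (negative).
    gamma: focusing parameter; gamma = 0 removes the focusing term.
    alpha: weight for the positive class; None disables class weighting.
    """
    eps = 1e-7  # clip to avoid log(0); an arbitrary choice for this sketch
    p = min(max(p, eps), 1.0 - eps)

    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if y == 1 else 1.0 - alpha

    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_example = binary_focal_loss(0.9, 1)  # confident and correct
hard_example = binary_focal_loss(0.1, 1)  # confident and wrong
```

With gamma set to 0 and alpha set to None, the expression reduces to ordinary binary cross-entropy. The focusing term is what lets the many easy background examples contribute far less than the few hard ones.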

Module 6: Image segmentation

Lecture 14: Fundamentals of image segmentation

Semantic segmentation vs instance segmentation
Evaluation metrics like mIoU and Dice coefficient
Classical segmentation vs neural approaches
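
As a quick illustration of the Dice coefficient mentioned above, here is a minimal sketch for two binary masks stored as flat sequences of 0s and 1s. The flat-list representation, the function name, and the choice to return 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_coefficient(pred, target):
    """Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks.

    pred and target are equal-length sequences of 0/1 values
    (for example, flattened segmentation masks).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    # Both masks empty: treated as a perfect match here (a convention,
    # not a universal rule).
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One overlapping pixel out of two predicted and two true pixels.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

Mean IoU is computed in a similar spirit: the per-class ratio of intersection to union is averaged over all classes.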

Lecture 15: U-Net and Mask R-CNN

Encoder-decoder design patterns in U-Net
Instance segmentation with Mask R-CNN
Practical applications and labeling challenges

PART 2: Modules 7-12 Computer Vision practicals

Module 7: Creating vision datasets

Lecture 16: Dataset collection and labeling

Collecting images from various sources
Manual labeling for classification and detection
Multilabel tasks and bounding box considerations
Crowdsourcing and large-scale labeling services

Lecture 17: Automated labeling and addressing bias

Labels from related data and self-supervised learning
Noisy Student approach
Recognizing selection bias and measurement bias
Splitting datasets and minimizing data leakage

Module 8: Data preprocessing

Lecture 18: Data quality and transformations

Image resizing, cropping, and color space conversions
Ensuring consistent aspect ratios
Common pitfalls in data pipelines

Lecture 19: Data augmentation and training-serving consistency

Random flips, rotations, and color distortion
Information dropping (cutout, mixup)
Avoiding training-serving skew
Integrating preprocessing in the model vs external scripts

Module 9: Training pipeline

Lecture 20: Efficient data ingestion

Storing data in TFRecord format
Parallel reads, caching, and sharding
Maximizing GPU utilization

Lecture 21: Distributing training

Data parallelism with GPUs
Mirrored and multi-worker strategies
Introduction to TPUs and their advantages

Lecture 22: Checkpoints and automated workflows

Checkpointing best practices for resilience
Model export using SavedModel and deployment formats
Hyperparameter tuning with serverless pipelines

Module 10: Model quality and continuous evaluation

Lecture 23: Monitoring training and debugging

Using TensorBoard for metrics and visualizations
Detecting anomalies in gradients or loss curves
Interpreting weight histograms

Lecture 24: Metrics for classification, detection, and segmentation

Accuracy, precision, recall, and F1 score
ROC, AUC, and confusion matrices
Intersection over union and mean IoU

Lecture 25: Ongoing evaluation, bias, and fairness

Sliced evaluations for subpopulations
Measuring bias in model outcomes
Setting up continuous evaluation in production

Module 11: Model predictions and deployment

Lecture 26: Prediction workflows

Batch prediction using Apache Beam
Real-time serving with TF Serving and REST APIs
Handling pre- and post-processing at inference

Lecture 27: Edge deployment

Model compression and quantization strategies
TensorFlow Lite for mobile and embedded devices
Overview of federated learning and privacy considerations

Lecture 28: Trends in production ML

Pipelines and orchestration with Kubeflow
Explainability methods (Grad-CAM, saliency maps)
Comparing no-code solutions to custom approaches

Module 12: Advanced vision problems and generative models

Lecture 29: Advanced object measurement and pose estimation

Ratio-based measurement using reference objects
Counting objects via density estimation
Keypoint detection and multi-person setups

Lecture 30: Image retrieval and search

Building image embeddings for similarity search
Large-scale indexing and retrieval methods
Practical considerations for dimensionality reduction

Lecture 31: Autoencoders and Generative Adversarial Networks

Autoencoder architectures for reconstruction and anomaly detection
Introduction to GANs for image-to-image translation
Super-resolution and inpainting applications

Lecture 32: Image captioning and multimodal learning

Image-to-text pipeline fundamentals
Combining CNN and transformer-based language models
Future directions in multimodal AI

Lecture 33: Course summary

Recap of entire course modules
Major concepts and practical takeaways from each module
Common pitfalls and strategies for overcoming them
Recommendations for further learning and advanced reading
How to continue building real-world computer vision applications

Where will the lectures be released?

The lectures will be released on the “Computer Vision from Scratch” playlist on Vizuara’s YouTube channel: Link

Here is the link to the introductory lecture: