Originally published: July 10, 2025

The Pioneer

The story of LeNet and the origin of deep learning


LeNet-5

Before LeNet, computer vision relied on hand-crafted features. This involved researchers and engineers spending several hours, days, or even months manually designing image features based on domain knowledge and intuition about what visual patterns might be important for specific tasks (I suspect you probably had to be very smart for this). So, what are some of these features?

  • Edge detection: using hand-written mathematical formulas to identify edges in an image by computing how pixel intensity changes in the horizontal and vertical directions (a quick sketch of one such detector follows below)
  • Texture descriptors: identifying certain textures within an image
  • Corner and interest point detection
  • Template matching and correlation
  • Fourier and frequency domain features

Etc, etc (very mathy).
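To make "very mathy" a bit more concrete, here is a minimal sketch of one such hand-crafted feature, a Sobel edge detector. It assumes NumPy and SciPy are available and uses a random placeholder image; it's illustrative, not anyone's production pipeline.

```python
import numpy as np
from scipy.signal import convolve2d

# Standard Sobel kernels: hand-designed weights for horizontal and vertical gradients
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

image = np.random.rand(28, 28)  # placeholder grayscale image

gx = convolve2d(image, sobel_x, mode="same", boundary="symm")
gy = convolve2d(image, sobel_y, mode="same", boundary="symm")

# Large magnitude = sharp intensity change = probable edge
edge_magnitude = np.hypot(gx, gy)
```

The point is that every number in those kernels was chosen by a human; nothing here is learned.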

This was very time-consuming, did not generalize, and was not scalable due to the amount of manual effort required. One day in ~1998, a young Frenchman thought to himself, "I'm going to be unhappy if I spend the rest of my life writing down equations for the computer to detect images. Why am I working for the computer? The computer should work for me!" And then he started thinking: how can I get the computer to learn these features by itself, and then use the learned features to do the intended task? The name of this young Frenchman... Yann LeCun.

The goal was to learn the visual patterns directly from raw pixel data. His challenge? Document recognition! Now, given how good ChatGPT and the rest are at document recognition, you might be thinking that must be trivial, but back in the late 90s, it was seen as a miracle. At the time, Yann was exploring the use of neural networks, a gradient-based way of learning from high-dimensional patterns. In particular, he was interested in Convolutional Neural Networks, which are specifically designed to deal with the variability of 2D shapes.

Yann was trying to make it more efficient to read checks, both business and personal (for those as young as me, a check is a piece of paper that tells a bank you want to give someone some money, kind of like a Venmo/PayPal/Stripe transaction, except it was on paper, and someone physically had to read it, understand it, decide whether or not you were a liar, then give you the money. Must've been fun!). Since document understanding was a multi-stage problem, he used convolutional neural networks as one module within a learning paradigm called Graph Transformer Networks (lol, the foreshadowing), with jointly trained modules for field extraction, segmentation, recognition, and language modeling.

Why didn't anyone do this before? Well, people tried; it was just too computationally expensive, and working with neural networks on high-dimensional inputs required too much data. Lucky for Yann, computers had just gotten faster and the data was finally there (this is a recurring trend in deep learning, as we will see, my fellow journey<man/woman/person>).


So let's talk about Convolutional Neural Networks. The core of Yann's idea was a specialized version of the general neural network architecture, one particularly good at dealing with 2D data.

Feeding an image directly into a fully connected neural network would not be effective, because the input dimension blows up quickly. A 10 x 10 image already has 100 input features; a 1024 x 1024 image has 1,048,576. As images grow, it becomes completely unrealistic to feed a network inputs of such high dimension, because the number of weights explodes and learning becomes difficult and unstable.
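A quick back-of-the-envelope calculation (my numbers, not from the paper) shows how fast a plain fully connected layer blows up compared to a small convolutional filter whose weights are reused everywhere:

```python
def dense_params(height, width, hidden_units):
    # every pixel connects to every hidden unit, plus one bias per unit
    return height * width * hidden_units + hidden_units

def conv_params(kernel_size, in_channels, out_channels):
    # the kernel is shared across the whole image, so its size is independent of the image
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

print(dense_params(1024, 1024, 100))  # 104,857,700 weights for just 100 hidden units
print(conv_params(5, 1, 6))           # 156 weights, regardless of image size
```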

Another issue with feeding the entire image directly into the neural network is that there would be no built-in invariance or robustness to local distortion. What does this mean? Remember, the problem Yann is trying to solve here is handwriting recognition. Different people write very differently, from the way the i's are dotted to the way the c's are curved. There is no single way that everyone writes a character, but in most cases, if you see one you can tell what it is.

If the model is to learn from the pixels directly, it needs to handle shifts, rotations, folds, and other forms of distortion of the image, because that is exactly what it will face once the system is deployed in banks to read checks.
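Here is a tiny toy illustration (NumPy, made-up data) of why raw pixels are so brittle: shift a stroke by a single pixel and, as flat vectors, the two images have no overlap at all, even though any human would call them the same character.

```python
import numpy as np

canvas = np.zeros((8, 8))
canvas[2:6, 3] = 1.0                        # a crude vertical stroke
shifted = np.roll(canvas, shift=1, axis=1)  # the same stroke, one pixel to the right

# As raw flattened vectors, the two strokes have zero overlap
print(float(canvas.flatten() @ shifted.flatten()))  # 0.0
```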

To solve the invariance problem, Yann's idea was to use several small networks looking at different parts of the image, each identifying certain patterns, then pooling their results and agreeing on whether a particular feature was present. Each of those small networks shares the same weights: they are basically target hunters, sweeping the entire image for a particular shape, color, texture, or pattern, with the responses from all the various parts of the image aggregated to decide whether the target was found.
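As a minimal sketch of that "same target hunter everywhere" idea (NumPy again, random toy inputs, plain loops instead of an optimized library call): one shared kernel slides over the image, and its response at every position forms a map of where the target showed up.

```python
import numpy as np

def feature_map(image, kernel):
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # the *same* weights at every position
    return out

image = np.random.rand(28, 28)
kernel = np.random.rand(5, 5)  # one shared "target hunter"
print(feature_map(image, kernel).shape)  # (24, 24) responses, one per position
```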

Another challenge Yann had to tackle is particular to images: images have structure. The raw value of a single pixel may not mean much on its own; the relevant information comes from its structural context, i.e., the values of its neighbors and its positional relationship to those neighbors.

Convolutional Networks tackle both problems by having local receptive fields (don't look at the image as a whole, and don't look at just a single pixel; rather, look at a pixel and its neighbors) and by using shared weights/weight replication (spoiler alert: this idea stuck, and weight sharing is still at the heart of every CNN today).

Yann's idea for receptive fields also came from studies of the cat's visual cortex (I'm not a biologist, I am definitely not elaborating further). The receptive fields build up low-level patterns in the earlier layers, such as lines and edges; later layers use those lines and edges to detect shapes and groupings, and even deeper layers detect objects and targets. The idea is that shallow layers learn simpler features, which deeper layers then build upon.

The various small networks positioned as a plane across the image, all hunting for the same feature, together form what is called a feature map. A single convolutional layer has several feature maps, with each feature map trying to learn a particular thing about the image (depending on how deep in the network the layer sits, this could be lines, colors, etc).
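Putting the pieces together, here is a rough PyTorch sketch of a LeNet-5-style network: 6 feature maps in the first convolutional layer, 16 in the second, then fully connected layers on top. Note that this swaps in modern conveniences (ReLU, max pooling); the 1998 original used tanh-style activations and trainable subsampling layers, so treat this as an approximation rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 6 feature maps over the 32x32 input -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # subsampling: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 16 feature maps built from the first 6 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # subsampling: 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetStyle()
dummy = torch.randn(1, 1, 32, 32)  # a batch of one 32x32 grayscale "digit"
print(model(dummy).shape)          # torch.Size([1, 10])
```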

(This is still a WIP; more images, equations, and code coming soon :>)

Last updated: July 10, 2025