One Layer CNN Unplugged

Step 3: Convolution

The magnifying glass

Instead of looking at all 784 pixels at once, the network uses a tiny 3×3 window — a filter — that slides across every position in the image. At each position it multiplies its 9 weights by the 9 pixels underneath, sums them up, and records the result. A high result means "I found my pattern here." This model has 32 different filters, each trained to notice something different.

[Interactive panel: select one of the 32 filters to view its 3×3 weights (color scale from negative to positive) and a sample computation at position (10,10).]

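The same multiply-and-sum runs at all 676 window positions, producing a 26×26 feature map: a 3×3 window fits 28 − 3 + 1 = 26 times along each axis of a 28×28 image, and 26 × 26 = 676.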
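The sliding-window computation can be sketched in a few lines of plain Python. This is an illustration only, not the model's code: the function name `convolve_valid`, the test image, and the hand-written `vertical_filter` are all assumptions, and the weights are not any of the model's 32 learned filters. The sketch also leaves out the bias term that a trained convolutional layer would normally add to each sum.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over a square image (stride 1, no padding)."""
    k = len(kernel)
    out_size = len(image) - k + 1  # 28 - 3 + 1 = 26
    feature_map = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Multiply the k*k weights by the pixels under the window and sum.
            total = sum(kernel[i][j] * image[row + i][col + j]
                        for i in range(k) for j in range(k))
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Hypothetical 28x28 image: one vertical stroke down column 14.
image = [[1.0 if col == 14 else 0.0 for col in range(28)] for _ in range(28)]

# Hypothetical hand-written vertical-stroke filter (not a learned one).
vertical_filter = [[-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0]]

feature_map = convolve_valid(image, vertical_filter)

print(len(feature_map), len(feature_map[0]))  # 26 26
print(sum(len(r) for r in feature_map))       # 676
print(feature_map[10][13])  # 6.0  (window centered on the stroke)
print(feature_map[10][10])  # 0.0  (window over blank pixels)
```

Indices in `feature_map` refer to the top-left corner of the window, so `feature_map[10][13]` covers image columns 13 to 15, with the stroke under the filter's middle column. The high value there is the "I found my pattern here" signal; the zero at (10,10) means the window saw nothing but background.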

Why only one convolutional layer?

This model uses a single convolutional layer — by design. One layer is enough to detect the basic building blocks of letters: straight strokes, curves, and edges. With 32 filters operating on a 28×28 image, it achieves 91.74% validation accuracy. That's strong performance for a lean architecture.

A network with two or three convolutional layers would build on these basics. The second layer's filters would take the first layer's outputs as input, allowing it to detect more complex patterns — corners, crossings, letter fragments. You might see a "plus sign" detector emerge there, built by combining the vertical and horizontal detectors from layer one. A third layer could combine those into even more abstract shapes.

This model has no dedicated crossing detector, and it doesn't need one. When the vertical filter and the horizontal filter both fire near the same location, the dense layers downstream have learned to interpret that combination as "there's a crossing here." The combinatorial reasoning is distributed across the network rather than concentrated in a single filter.
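A toy version of that idea, under loud assumptions: the two filters below are hand-written rather than learned, the plus-sign image is synthetic, and the explicit "both responses above a threshold" rule is a hypothetical stand-in for whatever weighted combination the trained dense layers actually compute. The names `convolve_valid`, `both_fire`, and `THRESHOLD` are invented for this sketch.

```python
def convolve_valid(image, kernel):
    """Stride-1, no-padding sliding window: multiply and sum at each position."""
    k = len(kernel)
    n = len(image) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(n)] for r in range(n)]


# Hypothetical hand-written stroke detectors (not the model's learned filters).
VERTICAL = [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]
HORIZONTAL = [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]

# Synthetic 28x28 plus sign: a vertical stroke in column 14, a horizontal one in row 14.
image = [[1 if (r == 14 or c == 14) else 0 for c in range(28)] for r in range(28)]

v_map = convolve_valid(image, VERTICAL)
h_map = convolve_valid(image, HORIZONTAL)

THRESHOLD = 3  # hypothetical cut-off for "this filter fired"


def both_fire(v_map, h_map, threshold):
    """Return window centers (image coordinates) where both filters respond strongly."""
    hits = []
    for r in range(len(v_map)):
        for c in range(len(v_map[0])):
            if v_map[r][c] > threshold and h_map[r][c] > threshold:
                hits.append((r + 1, c + 1))  # +1 converts top-left index to window center
    return hits


crossings = both_fire(v_map, h_map, THRESHOLD)
print(crossings)               # [(14, 14)]
print(v_map[7][13], h_map[7][13])  # 6 0  (on the vertical stroke only)
```

Along the vertical stroke only the vertical filter responds, and along the horizontal stroke only the horizontal one does. The single place where both exceed the threshold is the crossing itself, which is the kind of co-occurrence a downstream layer can pick up without any one filter being a crossing detector.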

The principle: more layers means more expressive power, but also more training data required, longer training time, and greater risk of overfitting. For 26 letter classes on a 28×28 image, one conv layer keeps the model efficient and effective without asking it to do more work than the task requires.
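To put a rough number on "efficient", the sketch below traces tensor shapes and counts parameters for the architecture listed at the end of this page. It is back-of-the-envelope arithmetic, not output from the actual model, and it rests on assumptions: a single grayscale input channel, no padding in the convolution (consistent with the 26×26 feature map described earlier), a bias term on every layer, and batch normalization counted as four values per feature (scale, offset, running mean, running variance), which is how Keras's default layer would report it. The helper names are invented for illustration.

```python
def conv2d_valid(height, width, channels_in, filters, k):
    """Output shape and parameter count for a stride-1, no-padding conv layer."""
    out_shape = (height - k + 1, width - k + 1, filters)
    params = k * k * channels_in * filters + filters  # weights + one bias per filter
    return out_shape, params


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


# Assumed input: 28x28 image with one grayscale channel.
(h, w, c), conv_params = conv2d_valid(28, 28, 1, filters=32, k=3)  # (26, 26, 32), 320

h, w = h // 2, w // 2          # MaxPool 2x2 -> 13x13, no parameters
flat = h * w * c               # Flatten -> 5408 features

bn_params = 4 * flat           # 21632 (assumed: 4 values per feature)
d1 = dense_params(flat, 512)   # 2769408
d2 = dense_params(512, 128)    # 65664
d3 = dense_params(128, 37)     # 4773

total = conv_params + bn_params + d1 + d2 + d3   # 2861797

for name, n in [("conv", conv_params), ("batchnorm", bn_params),
                ("dense_512", d1), ("dense_128", d2), ("dense_37", d3),
                ("total", total)]:
    print(f"{name:<10} {n:>10,}")
```

Under these assumptions the 32 filters account for only 320 of roughly 2.86 million parameters; about 97% sit in the first dense layer. The convolutional step itself is very cheap, and most of the model's capacity goes into interpreting the combinations of filter responses.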

Architecture: 28×28 → Conv2D(32, 3×3, linear) → MaxPool(2×2) → Flatten → BatchNorm → Dense(512, ReLU) → Dense(128, ReLU) → Dense(37, softmax).
Optimizer: Adam (lr=0.0001, clipnorm=1.0). Batch size: 32. Epochs: 10. Training accuracy: 97.93%. Validation accuracy: 91.74%.
Dataset: EMNIST Letters (Cohen et al., 2017) — 88,799 samples, 80/20 train/val split. Classes: A–Z (indices 1–26).
Part of AI Literacy curriculum. Inside the Machine — DSC234, University of New England.