Every image is resized to 28×28 pixels. Each cell holds a brightness value from 0 (white background) to 1 (dark ink), so the network receives 784 values (28 × 28) for each image.
Instead of looking at all 784 pixels at once, the network uses a tiny 3×3 window — a filter — that slides across every position in the image. At each position it multiplies its 9 weights by the 9 pixels underneath, sums them up, and records the result. A high result means "I found my pattern here." This model has 32 different filters, each trained to notice something different.
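The sliding-window step described above can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the function name `convolve_valid`, the test image, and the hand-picked filter weights are all assumptions made for the example, and it leaves out the bias term and activation function that a trained layer would normally add.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over a square image with stride 1 and no padding."""
    k = len(kernel)
    out_size = len(image) - k + 1  # the window must fit entirely inside the image
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Multiply each of the k*k weights by the pixel underneath it, then sum.
            total = 0.0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[row + i][col + j]
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical 28x28 input: a single vertical ink stroke in column 14.
image = [[1.0 if col == 14 else 0.0 for col in range(28)] for _ in range(28)]

# Hypothetical hand-written weights that respond to vertical strokes.
# In the real model these 9 numbers are learned during training.
vertical_filter = [
    [-1.0, 2.0, -1.0],
    [-1.0, 2.0, -1.0],
    [-1.0, 2.0, -1.0],
]

feature_map = convolve_valid(image, vertical_filter)
print(len(feature_map), len(feature_map[0]))  # size of the resulting map
print(feature_map[10][13])  # window centred on the stroke
print(feature_map[10][12])  # window with the stroke on its right edge
```

The result is highest where the stroke lines up with the filter's positive centre column, negative where it falls on a side column, and zero over blank background.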
Why only one convolutional layer?
This model uses a single convolutional layer — by design. One layer is enough to detect the basic building blocks of letters: straight strokes, curves, and edges. With 32 filters operating on a 28×28 image, it achieves 91.74% validation accuracy. That's strong performance for a lean architecture.
A network with two or three convolutional layers would build on these basics. The second layer's filters would take the first layer's outputs as input, allowing it to detect more complex patterns — corners, crossings, letter fragments. You might see a "plus sign" detector emerge there, built by combining the vertical and horizontal detectors from layer one. A third layer could combine those into even more abstract shapes.
This model doesn't have a crossing detector — but it doesn't need one as a single filter. When both the vertical filter and horizontal filter fire near the same location, the dense layers downstream have learned to interpret that combination as "there's a crossing here." The combinatorial reasoning gets distributed across the network rather than concentrated in a single filter.
The principle: more layers give a network more expressive power, but they also demand more training data, take longer to train, and raise the risk of overfitting. For 26 letter classes on a 28×28 image, one convolutional layer keeps the model efficient and effective without asking it to do more work than the task requires.
Sliding a 3×3 filter across a 28×28 image without padding leaves 26 valid positions in each direction (28 − 3 + 1), so each filter produces a 26×26 map showing where its pattern appeared. Warm orange = strong activation (pattern found), cool blue = negative activation, light = near zero. Five of the 32 filters are shown below.
Each 26×26 feature map is compressed to 13×13 by taking the maximum value from every 2×2 block. This keeps the strongest signal in each block, discards the other three values, and makes the network less sensitive to exactly where a stroke appears. Two maps are shown for comparison.
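A minimal sketch of 2×2 max pooling follows, assuming non-overlapping blocks (stride 2), which is what the 26×26 to 13×13 reduction implies. The function name `max_pool_2x2` and the two test maps are hypothetical and exist only to show the effect.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value from each non-overlapping 2x2 block."""
    size = len(feature_map) // 2
    pooled = []
    for row in range(size):
        pooled_row = []
        for col in range(size):
            r, c = 2 * row, 2 * col  # top-left corner of this block
            pooled_row.append(max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 26x26 map: zero everywhere except one strong activation.
original = [[0.0] * 26 for _ in range(26)]
original[5][9] = 3.0

# The same activation moved one pixel up and one pixel left,
# which keeps it inside the same 2x2 block.
shifted = [[0.0] * 26 for _ in range(26)]
shifted[4][8] = 3.0

pooled = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
print(len(pooled), len(pooled[0]))  # size after pooling
print(pooled == pooled_shifted)     # small shift, same pooled map
```

Because both activations fall inside the same 2×2 block, the pooled maps are identical. A shift that crosses a block boundary would still move the pooled value by one cell, so pooling reduces sensitivity to position rather than removing it.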
After pooling, all 32 maps are flattened into a single list of 5,408 numbers (13 × 13 × 32), then normalized by a BatchNormalization layer before passing to the dense layers.
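The sizes quoted so far can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names and the `flatten` helper are assumptions, and the order in which values are laid out (map by map here) may differ from the order a particular framework uses. The normalization step is not shown.

```python
IMAGE_SIZE = 28
FILTER_SIZE = 3
POOL_SIZE = 2
NUM_FILTERS = 32

conv_size = IMAGE_SIZE - FILTER_SIZE + 1               # valid window positions per side
pooled_size = conv_size // POOL_SIZE                   # side length after 2x2 pooling
flat_length = pooled_size * pooled_size * NUM_FILTERS  # values handed to the next layer


def flatten(maps):
    """Concatenate a list of 2D maps into one flat list, map by map."""
    return [value for fmap in maps for row in fmap for value in row]


# Placeholder pooled maps filled with zeros, one per filter.
pooled_maps = [
    [[0.0] * pooled_size for _ in range(pooled_size)]
    for _ in range(NUM_FILTERS)
]
flat = flatten(pooled_maps)
print(conv_size, pooled_size, flat_length, len(flat))
```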
Three dense layers convert the 5,408 normalized values into a confidence score for each letter. The softmax function rescales the 26 scores so that each is positive and together they sum to 100%.
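Softmax itself is short enough to write out. In this sketch the raw scores (`logits`) are made-up numbers rather than output from the model, and the mapping from position 0 to the letter A is an assumption for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for 26 letters: the network leans towards
# position 0 ("A") with position 7 ("H") as a runner-up.
logits = [0.0] * 26
logits[0] = 4.0
logits[7] = 2.0

probs = softmax(logits)
best = max(range(26), key=lambda i: probs[i])
letter = chr(ord("A") + best)
print(letter, round(sum(probs), 6))
```

Multiplying each value in `probs` by 100 gives percentage confidence scores.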
Optimizer: Adam (lr=0.0001, clipnorm=1.0). Batch size: 32. Epochs: 10. Training accuracy: 97.93%. Validation accuracy: 91.74%.
Dataset: EMNIST Letters (Cohen et al., 2017) — 88,799 samples, 80/20 train/val split. Classes: A–Z (indices 1–26).
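Because the class indices run from 1 to 26 rather than 0 to 25, converting between a label and its letter needs an offset of one. The helpers below are hypothetical and only illustrate that bookkeeping. Whether this model shifts labels to a 0-based range before training is not stated here, so `label_to_output_index` is an assumption.

```python
def label_to_letter(label):
    """Map a 1-based class index (1..26) to its letter (A..Z)."""
    if not 1 <= label <= 26:
        raise ValueError(f"label must be in 1..26, got {label}")
    return chr(ord("A") + label - 1)


def label_to_output_index(label):
    """Assumed convention: shift a 1-based label to a 0-based output position."""
    return label - 1


letters = [label_to_letter(i) for i in range(1, 27)]
print(letters[0], letters[-1], label_to_output_index(1))
```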
Part of AI Literacy curriculum. Inside the Machine — DSC234, University of New England.