One-Sentence Definition
A convolutional neural network (CNN) is a type of deep learning model designed to process grid-structured data -- especially images -- by using learnable filters that detect visual features like edges, textures, and shapes.
How It Works
A CNN processes an image by sliding small filters (also called kernels) across the pixel grid. Each filter is a small matrix of weights, typically 3x3 or 5x5 pixels, that performs a mathematical operation called convolution. In the earliest layers, these filters learn to detect simple features: horizontal edges, vertical edges, color gradients. In deeper layers, the filters combine simple features into complex ones: eyes, wheels, handwritten digits.
A typical CNN architecture stacks three types of layers. Convolutional layers apply filters and produce feature maps. Pooling layers (usually max pooling) downsample the feature maps, reducing their spatial dimensions while keeping the most important information. Fully connected layers at the end take the extracted features and produce a final output -- a classification label, a bounding box, or a probability score.
The landmark architectures tell the story of the field. AlexNet (2012) proved deep CNNs could dominate image classification. VGGNet stacked more layers for better accuracy. ResNet introduced skip connections that allowed networks to go hundreds of layers deep without degrading. EfficientNet optimized the balance between depth, width, and resolution. In 2026, CNNs are often combined with transformer-based vision models -- architectures like ConvNeXt blend convolutional operations with design principles borrowed from transformers.
Why It Matters
CNNs are the reason your phone can recognize faces, your car can read speed-limit signs, and your doctor can get AI-assisted readings on medical scans. Tesla's Autopilot uses CNN-based vision systems to interpret road conditions in real time. Google Photos relies on CNNs for image search and face grouping. In healthcare, CNN models from companies like PathAI analyze pathology slides, and Aidoc uses them to flag critical findings in radiology scans.
Even as vision transformers (ViTs) have gained ground, CNNs remain widely deployed because they are computationally efficient, well understood, and highly effective for tasks where spatial locality matters. Many production systems in 2026 use hybrid architectures that combine convolutional layers with attention mechanisms.
Key Takeaway
Convolutional neural networks use learnable filters to extract visual features from images layer by layer, and they remain a foundational architecture for computer vision despite the rise of transformer-based alternatives.
Part of the AI Weekly Glossary.