What Are State Space Models? The Challenger to Transformers
Transformers have dominated AI since 2017. Every major language model, from GPT-4 to Claude to Gemini, is built on the transformer architecture and its self-attention mechanism. But transformers have a fundamental limitation: self-attention scales quadratically with sequence length. Processing a sequence twice as long costs four times as much. This makes long-context applications, like analyzing entire codebases, processing book-length documents, or maintaining extended conversations, expensive and slow.
State space models (SSMs) offer a fundamentally different approach. Instead of attending to every token in a sequence simultaneously, SSMs process sequences through a recurrent state that updates linearly with each new token. The result is linear scaling: processing a sequence twice as long costs only twice as much. This makes SSMs dramatically faster and more memory-efficient for long sequences.
In 2026, SSMs and their hybrids have moved from academic curiosity to production deployment. Mamba, Mamba-2, Mamba-3, Jamba, and Bamba represent a growing family of architectures that challenge the transformer's dominance. Understanding state space models is essential for anyone tracking the future of AI architecture.
The Transformer's Quadratic Problem
To appreciate why state space models matter, you need to understand what they fix.
How Self-Attention Works
In a transformer, every token attends to every other token in the sequence. For a sequence of length n, this creates an n-by-n attention matrix. Each entry in this matrix represents how much one token should attend to another. The matrix is computed, normalized, and used to create weighted combinations of token representations.
This global attention is powerful. It allows any token to directly access information from any other token, regardless of distance. A word at position 10,000 can directly attend to a word at position 1. This is why transformers excel at tasks requiring long-range dependencies.
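The n-by-n attention computation can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the projection matrices are random stand-ins for trained weights, and details like multi-head splitting and masking are omitted.

```python
import numpy as np

# Toy single-head self-attention over n tokens of dimension d.
def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # project tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])       # the n-by-n attention matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                            # weighted combination of values

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(x, *W)
print(out.shape)  # (8, 4): one output per token, each mixing all tokens
```

The `scores` array is where the quadratic cost lives: its size grows as the square of the number of tokens.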
The Cost of Global Attention
But computing an n-by-n matrix has O(n²) time and memory complexity. For a sequence of 1,000 tokens, the attention matrix has 1 million entries. For 10,000 tokens, it has 100 million entries. For 100,000 tokens, it has 10 billion entries.
This quadratic scaling creates practical problems. Processing a 128,000-token context window, standard for frontier models in 2026, requires enormous GPU memory and computation. Extending to million-token sequences, which some applications demand, pushes the limits of even the most powerful hardware.
Various approximations exist, such as sparse attention, linear attention, and sliding window attention, but these sacrifice some of the global connectivity that makes transformers effective.
What Are State Space Models?
State space models come from control theory and signal processing, where they have been used for decades to model dynamical systems. The key idea is to represent a sequence transformation through a hidden state that evolves over time.
The Core Mechanism
An SSM maps an input sequence to an output sequence through a hidden state. At each timestep, the model takes the current input and the previous hidden state, combines them according to learned parameters, and produces an output and a new hidden state.
This is fundamentally recurrent: the model processes one token at a time, maintaining a fixed-size state that summarizes everything it has seen so far. Unlike RNNs, which also process sequentially, SSMs have a linear mathematical structure that allows efficient parallel training.
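The recurrence is simple enough to sketch directly. The following is a minimal non-selective SSM in NumPy, using the standard state-space names (A, B, C, D, h); the specific matrix values are illustrative, not from any trained model.

```python
import numpy as np

# Minimal discrete SSM: the state h is fixed-size, so each token
# costs the same amount of work regardless of sequence length.
def ssm_scan(A, B, C, D, xs):
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                  # one token at a time
        h = A @ h + B * x         # update the hidden state
        ys.append(C @ h + D * x)  # emit an output from the state
    return np.array(ys)

A = np.array([[0.9, 0.0], [0.1, 0.8]])  # state transition
B = np.array([1.0, 0.5])                # input -> state
C = np.array([1.0, -1.0])               # state -> output
D = 0.1                                 # direct input -> output connection
xs = np.ones(1000)                      # length only changes the step count
ys = ssm_scan(A, B, C, D, xs)
print(ys.shape)  # (1000,)
```

Note that the loop body never touches earlier tokens: everything the model knows about the past is compressed into `h`.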
The State Equation
The continuous-time state space model is defined by four matrices: A, B, C, and D. Matrix A governs how the hidden state evolves over time. Matrix B controls how the input affects the hidden state. Matrix C maps the hidden state to the output. Matrix D provides a direct connection from input to output.
In practice, these continuous equations are discretized to work with discrete token sequences. The discretization step converts the continuous parameters into discrete recurrence relations that the model can compute step by step.
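For the diagonal A matrices used by modern SSM variants, the common zero-order-hold (ZOH) discretization has a closed form. The sketch below assumes a diagonal A; the values of `A_diag`, `B`, and the step size `dt` are illustrative.

```python
import numpy as np

# Zero-order-hold discretization for a diagonal state matrix A.
# dt plays the role of the step size (often written Delta).
def discretize_zoh(A_diag, B, dt):
    A_bar = np.exp(dt * A_diag)          # exact matrix exponential for diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B   # integrate the input over one step
    return A_bar, B_bar

A_diag = np.array([-1.0, -0.5])  # stable continuous dynamics (negative real part)
B = np.array([1.0, 1.0])
A_bar, B_bar = discretize_zoh(A_diag, B, dt=0.1)
print(A_bar)  # each entry lies in (0, 1): the state decays a little per step
```

After discretization, the recurrence h = A_bar * h + B_bar * x can be computed token by token.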
Why Linear Scaling
Because the SSM processes tokens one at a time with a fixed-size hidden state, the computation for each token is constant regardless of sequence length. Processing n tokens requires n steps, each with the same cost. Total computation scales linearly as O(n), compared with the transformer's O(n²).
During inference, this is straightforward: the model processes each new token by updating its state and producing an output. During training, the recurrence can be unrolled and computed in parallel using efficient convolution or scan operations, maintaining the linear scaling advantage.
S4: The Breakthrough That Started It All
S4, the Structured State Space sequence model introduced in 2022 by Albert Gu and colleagues in "Efficiently Modeling Long Sequences with Structured State Spaces," was the work that made SSMs competitive with transformers.
The Problem With Naive SSMs
Basic SSMs had been known for decades but performed poorly on real-world sequence tasks. The fundamental problem was that the state matrix A needed to capture long-range dependencies, which required specific mathematical properties that were hard to learn through gradient descent.
The S4 Solution
S4 solved this by initializing the A matrix using the HiPPO framework, a mathematical construction that produces matrices optimized for compressing long sequences into fixed-size states. HiPPO stands for High-Order Polynomial Projection Operators, and it provides a principled way to initialize the state dynamics so that the model can remember information over very long sequences.
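The HiPPO-LegS construction itself is a small, explicit matrix. The sketch below follows the formula from the HiPPO paper; sign conventions vary across implementations, and the negation at the end reflects the common choice of making the continuous dynamics stable.

```python
import numpy as np

# HiPPO-LegS initialization for the state matrix A (a sketch of the
# construction; implementations differ in sign and normalization details).
def hippo_legs(N):
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A  # negated so the resulting dynamics decay rather than explode

A = hippo_legs(4)
print(A)
```

The point of the structure is not the individual entries but the guarantee behind them: a state driven by this A approximates an online compression of the entire input history.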
S4 also introduced efficient algorithms for computing the SSM as a convolution during training, avoiding the sequential bottleneck of naive recurrence. This gave S4 the best of both worlds: linear scaling at inference time with parallel computation during training.
S4 achieved breakthrough results on the Long Range Arena benchmark, which specifically tests the ability to model dependencies across thousands of timesteps. On tasks where transformers struggled due to sequence length, S4 excelled.
Mamba: Selective State Spaces
Mamba, introduced in December 2023 by Albert Gu and Tri Dao, took SSMs from benchmark success to practical language modeling.
The Limitation of S4
S4 and its variants used fixed, input-independent parameters. The A, B, and C matrices were the same regardless of the input content. This is efficient but limiting. In language, what information to remember and what to forget depends on the content. A model reading a contract should remember key terms and dates while forgetting boilerplate. A fixed-parameter SSM cannot make these content-dependent decisions.
Selective State Spaces
Mamba's key innovation is making the SSM parameters input-dependent. The B, C, and discretization step size parameters are functions of the current input token. This selectivity allows the model to dynamically decide what to store in its hidden state and what to discard based on the content it is processing.
When the model encounters important information, it can adjust its parameters to write that information into the state. When it encounters irrelevant information, it can adjust to ignore it. This content-dependent processing, which Mamba calls a selection mechanism, gives the model the ability to perform context-dependent reasoning while maintaining linear scaling.
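A heavily simplified single-channel sketch of this selection mechanism is shown below. The key difference from the fixed SSM is that `dt`, `B`, and `C` are recomputed from each input; the weight vectors here are random stand-ins for trained projections, and real Mamba layers operate on many channels with learned parameters.

```python
import numpy as np

# Sketch of Mamba-style selectivity for one channel: the step size dt,
# write vector B, and read vector C all depend on the current input u.
def selective_ssm(us, A_diag, w_B, w_C, w_dt):
    h = np.zeros_like(A_diag)
    ys = []
    for u in us:
        dt = np.log1p(np.exp(w_dt * u))     # softplus keeps the step size positive
        A_bar = np.exp(dt * A_diag)         # content-dependent decay: what to keep
        h = A_bar * h + dt * (w_B * u) * u  # content-dependent write into the state
        ys.append((w_C * u) @ h)            # content-dependent read from the state
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 4
A_diag = -rng.uniform(0.5, 1.5, size=d_state)  # stable diagonal dynamics
w_B, w_C = rng.standard_normal((2, d_state))
us = rng.standard_normal(16)
ys = selective_ssm(us, A_diag, w_B, w_C, w_dt=1.0)
print(ys.shape)  # one output per input token
```

A small `dt` makes `A_bar` close to 1 and the write term close to 0, so the state coasts past the token; a large `dt` overwrites the state with new content. That is the "selection" in selective state spaces.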
Performance Results
Mamba demonstrated remarkable performance. The 3 billion parameter Mamba model outperformed transformers of the same size and matched transformers twice its size on both pretraining metrics and downstream evaluation. Mamba achieved 5x higher throughput than transformers of equivalent size and scaled effectively to sequences of up to one million tokens.
Mamba proved that SSMs could be competitive with transformers on language modeling, the transformer's home turf. This was the result that caught the attention of the broader AI community.
Mamba-2 and Mamba-3: Rapid Evolution
The Mamba architecture has evolved rapidly.
Mamba-2
Mamba-2, published in 2024, established a theoretical connection between SSMs and attention. It showed that the selective state space model can be viewed as a form of structured attention, revealing that SSMs and transformers are not fundamentally different paradigms but rather different points on a spectrum of sequence models.
Mamba-2 also introduced more efficient algorithms for the selective scan operation, making training faster and more hardware-friendly. The architecture achieved better performance than Mamba-1 at equivalent scale.
Mamba-3
Mamba-3, introduced in 2026, represents the latest evolution. Its key contribution is cleaner integration with attention layers in hybrid configurations. Rather than choosing between pure SSM or pure transformer, Mamba-3 supports inserting attention heads at specific layers where global attention adds value while keeping SSM efficiency for most of the model.
This hybrid-friendly design reflects the emerging consensus in the field: the future is not pure SSMs replacing pure transformers, but hybrid architectures that combine the best of both.
Jamba: The Hybrid That Proved the Concept
Jamba, developed by AI21 Labs, is the model that proved hybrid SSM-transformer architectures work at production scale.
Architecture
Jamba interleaves blocks of transformer layers and Mamba layers within a single model. It also incorporates Mixture-of-Experts (MoE) for additional efficiency. The result is a triple-hybrid: transformer attention, Mamba SSM, and MoE sparsity, all in one model.
The transformer layers provide global attention for tasks that require it, like in-context retrieval and precise positional reasoning. The Mamba layers provide efficient sequence processing for the bulk of the computation. The MoE layers provide parameter efficiency by routing tokens to specialized experts.
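The interleaving pattern is easy to sketch. The ratios below are purely illustrative placeholders, not AI21's published Jamba configuration.

```python
# Hypothetical sketch of a Jamba-style layer stack: mostly Mamba layers,
# periodic attention layers, and MoE feed-forward blocks on a fixed cadence.
def build_hybrid_stack(n_blocks, attn_every=8, moe_every=2):
    layers = []
    for i in range(n_blocks):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        ffn = "moe" if i % moe_every == 1 else "dense"
        layers.append((mixer, ffn))
    return layers

stack = build_hybrid_stack(16)
print(sum(1 for mixer, _ in stack if mixer == "attention"))  # 2 of 16 layers
```

With `attn_every=8`, only two of the sixteen blocks pay the quadratic attention cost; the rest scale linearly.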
Practical Benefits
Jamba supports an effective context length of 256,000 tokens, the largest among open-weight models at the time of release. It achieves this with substantially less memory than an equivalent pure transformer, because the Mamba layers require no attention cache that grows with sequence length.
The throughput advantage is dramatic. For long sequences, Jamba provides significantly higher tokens-per-second than pure transformer models of equivalent quality, because the linear-scaling Mamba layers dominate the computation.
The Recipe
Jamba's architecture uses a specific interleaving ratio: one transformer layer for every several Mamba layers. The exact ratio can be tuned based on the target use case. More transformer layers improve performance on retrieval-heavy tasks. More Mamba layers improve efficiency and long-sequence performance.
Where State Space Models Excel
SSMs have clear advantages in specific scenarios.
Long Sequences
This is the primary advantage. Any task that involves very long sequences benefits from linear scaling. Document analysis, book summarization, codebase understanding, genomics, and time-series analysis all involve sequences where quadratic attention becomes a bottleneck.
Mamba's performance improves on real data up to million-length sequences. For applications that need to process extremely long inputs, SSMs are not just more efficient than transformers, they are the only practical option without aggressive approximation.
Inference Efficiency
During autoregressive generation, transformers maintain a key-value (KV) cache that stores the attention keys and values for all previous tokens. This cache grows linearly with sequence length and can consume enormous memory. For a model serving many concurrent users with long conversations, KV cache memory becomes the binding constraint.
SSMs require no KV cache. The fixed-size hidden state summarizes all previous context. This makes SSMs dramatically more memory-efficient during inference, enabling higher throughput and more concurrent users on the same hardware.
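A back-of-the-envelope comparison makes the gap concrete. The model dimensions below are assumed, round-number values chosen for illustration, not the configuration of any specific model.

```python
# KV cache memory grows with sequence length; SSM state memory does not.
# All sizes here are illustrative assumptions (fp16, 32 layers, etc.).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # keys + values for every past token in every layer
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * dtype_bytes

def ssm_state_bytes(n_layers=32, d_model=4096, d_state=16, dtype_bytes=2):
    # one fixed-size state per layer, independent of sequence length
    return n_layers * d_model * d_state * dtype_bytes

for n in (1_000, 100_000):
    print(f"{n:>7} tokens: KV {kv_cache_bytes(n)/1e9:.2f} GB "
          f"vs SSM state {ssm_state_bytes()/1e6:.1f} MB")
```

Under these assumptions, a 100,000-token context needs roughly 13 GB of KV cache per sequence, while the SSM state stays at a few megabytes no matter how long the conversation runs.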
Real-Time and Streaming Applications
SSMs are naturally suited to streaming data because they process one token at a time, updating a fixed-size state. This makes them ideal for real-time applications like live transcription, continuous monitoring, and interactive systems where data arrives incrementally.
Code Generation
Codestral Mamba, released by Mistral, demonstrated that a pure SSM with no attention at all could beat CodeLlama 34B at code generation. Code has natural sequential structure that SSMs capture well, and codebases can be extremely long, playing to the SSM's scaling advantage.
Where Transformers Still Win
SSMs are not universally superior. Transformers retain advantages in important areas.
In-Context Retrieval
Tasks that require finding and using a specific piece of information from earlier in the sequence favor transformers. Self-attention creates a direct path between any two tokens, enabling precise retrieval. SSMs must compress all previous information into a fixed-size state, which can lose specific details.
The classic test is the "needle in a haystack" task: finding a specific fact buried in a long document. Pure SSMs struggle with this because the fixed-size state may not retain the specific needle. Hybrid architectures address this by using attention layers for retrieval-critical operations.
Precise Positional Reasoning
Tasks that require precise knowledge of where tokens appear in a sequence, like certain types of string manipulation or formatting tasks, favor attention because the attention matrix explicitly encodes positional relationships.
Few-Shot Learning
Transformers excel at in-context learning, where a few examples in the prompt teach the model a new task. This ability depends on the model attending to the examples and generalizing from them. Pure SSMs are weaker at this, though hybrid architectures recover much of the capability.
Established Ecosystem
Transformers benefit from years of engineering optimization. Hardware like GPUs and TPUs is optimized for attention computation. Frameworks, quantization methods, serving infrastructure, and fine-tuning tools are all designed for transformers. SSMs are catching up but do not yet have equivalent ecosystem support.
Hybrid Architectures: The Emerging Consensus
The field is converging on hybrid architectures that combine SSM and transformer layers.
The Hybrid Recipe
The emerging best practice is to use a small ratio of attention layers, roughly 1-in-8 or 1-in-10, within an otherwise SSM-based model. The attention layers handle tasks requiring global retrieval and precise positional reasoning. The SSM layers handle the bulk of sequence processing efficiently.
This approach captures most of the efficiency benefits of SSMs while retaining the capabilities that attention provides. The resulting models are faster than pure transformers, more capable than pure SSMs, and more memory-efficient than either.
Models Following This Pattern
Jamba pioneered the hybrid approach. Bamba, an IBM research model, followed with a similar architecture. Mamba-3's design explicitly supports hybrid configurations. NVIDIA's research has explored optimal ratios of attention to SSM layers for different task categories.
The consensus is forming: pure transformers are giving way to hybrids, with SSM layers handling most of the computation and attention layers providing targeted global reasoning.
Going Pure Mamba
Some use cases justify pure SSM architectures with no attention at all. If the workload involves very long sequences with no need for precise retrieval, pure Mamba models offer maximum efficiency. Code generation, genomic sequence analysis, and continuous time-series monitoring are examples where pure SSMs have proven effective.
The rule of thumb: start hybrid, use attention layers where retrieval matters, and only go pure Mamba if you have validated that your specific workload does not need it.
The Technical Details: How SSMs Train Efficiently
A common question about SSMs is how a recurrent model trains efficiently. RNNs also process sequences one token at a time, and they are notoriously slow to train because the sequential computation cannot be parallelized.
The Convolution Trick
S4 showed that a linear SSM can be expressed as a convolution operation during training. Instead of processing tokens one at a time, the entire sequence can be processed in parallel using Fast Fourier Transform (FFT) based convolution. This gives S4 the speed of convolutions during training while retaining the efficiency of recurrence during inference.
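The equivalence is easy to verify numerically: unrolling the recurrence gives a causal convolution with kernel K_k = C A^k B. The sketch below uses a direct convolution for clarity; S4 computes the same thing with an FFT for speed, and the matrices here are illustrative.

```python
import numpy as np

A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
xs = np.random.default_rng(0).standard_normal(32)

# Recurrent form: process tokens one at a time
h, y_rec = np.zeros(2), []
for x in xs:
    h = A @ h + B * x
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# Convolutional form: materialize the kernel K_k = C A^k B, then
# apply a causal convolution (S4 would do this step with an FFT)
L = len(xs)
K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])
y_conv = np.array([sum(K[k] * xs[t - k] for k in range(t + 1)) for t in range(L)])

print(np.allclose(y_rec, y_conv))  # True: both forms compute the same outputs
```

Training can therefore use the parallel convolutional form, while inference uses the cheap recurrent form, with identical results.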
The Scan Operation
Mamba's selective parameters make the convolution trick inapplicable, since the SSM parameters change at each timestep. Instead, Mamba uses a hardware-aware parallel scan algorithm. The parallel scan computes the recurrence across the entire sequence in O(n) work and O(log n) depth, enabling efficient parallel execution on GPUs.
Tri Dao, co-author of Mamba and creator of FlashAttention, designed the scan implementation to maximize GPU utilization, using techniques like kernel fusion and memory-efficient IO patterns. This engineering effort was crucial to making Mamba competitive in wall-clock training time, not just theoretical complexity.
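The mathematical ingredient behind the parallel scan is that the linear recurrence h_t = a_t h_{t-1} + b_t has an associative combine operator. The sketch below verifies the operator serially; an actual GPU kernel applies the same `combine` in a balanced tree to get O(log n) depth.

```python
import numpy as np

# Two consecutive steps (a1, b1) then (a2, b2) compose into a single
# equivalent step: h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2).
def combine(p, q):
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, size=8)
b = rng.standard_normal(8)

# Sequential reference
h, hs = 0.0, []
for t in range(8):
    h = a[t] * h + b[t]
    hs.append(h)

# Prefix scan via the combine operator, starting from the identity (1, 0)
# (still serial here; parallelism comes from combine's associativity)
acc, hs_scan = (1.0, 0.0), []
for t in range(8):
    acc = combine(acc, (a[t], b[t]))
    hs_scan.append(acc[1])

print(np.allclose(hs, hs_scan))  # True: the scan reproduces the recurrence
```

Because `combine` is associative, the steps can be grouped in any order, which is exactly what lets a parallel scan split the sequence across GPU threads.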
Applications of State Space Models
Beyond language modeling, SSMs are finding applications across domains.
Genomics
DNA sequences are extremely long, often millions of base pairs. SSMs' linear scaling makes them practical for genomic sequence modeling where transformers cannot operate at full resolution. Mamba has shown strong results on genomic benchmarks, modeling long-range dependencies in DNA that correspond to biological function.
Audio and Speech
Audio signals are inherently sequential and long. A few seconds of audio at standard sampling rates contains tens of thousands of timesteps. SSMs process these efficiently, and their streaming capability makes them natural for real-time audio applications.
Time Series
Financial data, sensor readings, and IoT telemetry are all time series that can span millions of timesteps. SSMs' linear scaling and constant-memory inference make them ideal for continuous monitoring and forecasting at scale.
Scientific Computing
Recent research has applied Mamba to solving partial differential equations (PDEs), where the model learns operators that map between function spaces. The linear scaling allows these models to operate on fine spatial and temporal grids that would be intractable for transformers.
The Road Ahead
State space models are on a trajectory that may reshape AI architecture.
Hardware co-design. Current GPUs are optimized for the dense matrix multiplications that attention requires. As SSMs gain adoption, hardware vendors are beginning to optimize for the scan and element-wise operations that SSMs rely on. Custom hardware for SSMs could further widen the efficiency gap.
Scaling laws for SSMs. Transformer scaling laws are well understood. SSM scaling laws are still being established. Early results suggest that SSMs have favorable scaling properties, but comprehensive studies at frontier model scale are ongoing.
Hybrid as default. The trajectory is clear: hybrid SSM-transformer architectures are becoming the new default for efficiency-conscious deployments. From Jamba to Bamba to Mamba-3's hybrid-friendly design, the ecosystem is consolidating around models that use both paradigms.
Attention as a premium feature. In future architectures, attention may be treated as a premium feature used sparingly for tasks that require it, rather than the default computation applied at every layer. This inversion, from attention-everywhere to attention-where-needed, would represent a fundamental shift in how sequence models are designed.
Conclusion
State space models offer a compelling alternative to the transformer's quadratic attention mechanism. By processing sequences through a recurrent hidden state with linear scaling, SSMs deliver dramatically better efficiency for long sequences while achieving competitive quality on standard benchmarks.
Mamba proved that SSMs can match transformers on language modeling. Jamba proved that hybrid architectures work at production scale. Mamba-3 and the growing hybrid ecosystem are establishing SSMs as a core component of next-generation AI architectures.
The question is no longer whether state space models can compete with transformers. It is how to best combine them. For anyone building or deploying AI systems that handle long sequences, process streaming data, or need to serve many concurrent users efficiently, state space models are no longer optional knowledge. They are the architecture that makes the next generation of AI practical.