Nous Research Lighthouse speeds long-context pretraining 1.7x
Key insights
- Lighthouse Attention delivers 1.4-1.7x end-to-end pretraining wall-clock speedup at 98K context with no custom sparse kernels.
- At 512K context on a single B200, forward and backward passes run 17x faster than cuDNN SDPA.
- A brief dense-SDPA resumption near training end converts the checkpoint into a standard full-attention model matching dense baselines.
Why this matters
Long-context pretraining is among the most compute-intensive operations in modern LLM development, and a 1.4-1.7x wall-clock speedup at 98K context with no inference-time tax could meaningfully cut compute budgets for any lab training at scale. The technique sidesteps the two main adoption blockers of prior sparse attention work -- custom CUDA kernels and inference-side complexity -- making it a credible drop-in for existing pretraining pipelines. Open-sourcing the code lowers the barrier for independent researchers and smaller labs, narrowing the gap between frontier training infrastructure and community-level runs.
Summary
Nous Research has open-sourced Lighthouse Attention, a training-only sparse attention method that sharply cuts pretraining cost at long context without custom CUDA kernels or any inference-time architectural changes.
The mechanism uses hierarchical, selection-based attention during training to skip low-value token interactions, then runs a brief dense-SDPA resumption phase near training end to produce a standard full-attention checkpoint. The final model is inference-identical to a normally trained transformer and matches or exceeds dense-from-scratch quality at the same token budget.
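The two-phase idea can be illustrated with a toy sketch. This is not the Lighthouse algorithm: the flat top-k selection rule, the function names, and the `dense_tail_fraction` parameter are all assumptions made for illustration, and the source does not say how long the dense resumption lasts. The toy also scores every key before selecting, so it shows only the shape of the computation, not the speedup, which would depend on a cheaper (for example hierarchical) selection step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention(query, keys, values):
    """Standard softmax attention for one query over every key."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def selected_attention(query, keys, values, top_k):
    """Attend only to the top_k highest-scoring keys.

    Hypothetical selection rule for illustration; the real method is
    described in the source only as hierarchical and selection-based.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep.sort()  # restore original token order among the kept keys
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]

def attention_mode(step, total_steps, dense_tail_fraction=0.1):
    """Return 'sparse' for most of training and 'dense' for the final tail.

    dense_tail_fraction is an assumed knob, not a value from the source.
    """
    dense_steps = round(total_steps * dense_tail_fraction)
    dense_start = total_steps - dense_steps
    return "dense" if step >= dense_start else "sparse"
```

In a training loop, `attention_mode(step, total_steps)` would decide which attention function runs at each step, so that the final checkpoint has been trained with ordinary dense attention and needs no special handling at inference.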
In effect, Nous Research hands the open-source pretraining community a low-commitment speedup compatible with most existing pipelines.
- At 98K context, Lighthouse delivers 1.4-1.7x end-to-end wall-clock pretraining speedup.
- At 512K context on a single B200, forward and backward passes run 17x faster than cuDNN SDPA.
- No auxiliary losses, no custom sparse kernels, and no inference-side modifications are required.
For labs training at 100K+ context, this translates into lower compute bills or faster iteration cycles without touching inference infrastructure.
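To see what the reported range means for a budget, a speedup of s cuts wall-clock time by 1 - 1/s. The short sketch below applies that to the 1.4x and 1.7x end-to-end figures; the 1,000-hour baseline is a hypothetical number chosen only for illustration. The 17x figure is reported for forward and backward passes against cuDNN SDPA rather than as an end-to-end number, so it should probably not be plugged into the same calculation for a whole run.

```python
def time_saved_fraction(speedup):
    """Fraction of wall-clock time removed by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

def accelerated_hours(baseline_hours, speedup):
    """Wall-clock hours after applying the speedup to a baseline run."""
    return baseline_hours / speedup

if __name__ == "__main__":
    baseline = 1000.0  # hypothetical dense-baseline run length, in hours
    for s in (1.4, 1.7):
        saved = time_saved_fraction(s)
        hours = accelerated_hours(baseline, s)
        print(f"{s}x speedup: {saved:.1%} less wall-clock time, "
              f"{hours:.0f} h instead of {baseline:.0f} h")
```

Under these assumptions the reported range corresponds to roughly 29% to 41% less wall-clock time at 98K context.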
Potential risks and opportunities
Risks
- Labs that undersize the dense-SDPA resumption phase could ship models with subtle capability gaps versus dense baselines, a defect unlikely to surface until post-deployment evaluation on long-context benchmarks
- The 17x speedup figure is measured on a single B200 -- multi-node distributed settings may see substantially smaller gains, which could mislead teams planning large-scale pretraining runs
- Concentrated adoption of one sparse attention recipe means a discovered flaw in Lighthouse's token-selection mechanism could simultaneously affect many open-source pretraining runs, creating correlated quality failures across the ecosystem
Opportunities
- Cloud GPU providers (CoreWeave, Lambda Labs) can package Lighthouse-optimized long-context training as a premium offering, directly targeting labs budgeting for 100K+ context runs
- Open-source model builders (Mistral, EleutherAI, Allen AI) can integrate Lighthouse immediately to push context lengths further within existing compute budgets, accelerating competitive release timelines
- MLOps and training infrastructure vendors (Modal, Determined AI, Weights & Biases) could build first-class Lighthouse support into their orchestration layers, capturing mindshare in the pretraining efficiency workflow segment
What we don't know yet
- Whether the 1.4-1.7x speedup holds across model scales and architectures beyond the specific configurations reported in arXiv 2605.06554
- How the minimum duration of the dense-SDPA resumption phase affects final model quality, and whether shorter resumptions introduce measurable capability degradation
- Whether Lighthouse Attention composes cleanly with distributed long-context techniques like ring attention or sequence parallelism at multi-node scale
Originally reported by marktechpost.com
Original headline: Nous Research Open-Sources Lighthouse Attention: 1.4–1.7× Pretraining Speedup at Long Context, 17× Faster Forward+Backward at 512K on B200