marktechpost.com web signal May 16th 2026

Nous Research Lighthouse speeds long-context pretraining up to 1.7x

inference open source attention-mechanisms long-context-pretraining inference-efficiency

Key insights

Lighthouse Attention delivers 1.4-1.7x end-to-end pretraining wall-clock speedup at 98K context with no custom sparse kernels.
At 512K context on a single B200, forward and backward passes run 17x faster than cuDNN SDPA.
A brief dense-SDPA resumption near training end converts the checkpoint into a standard full-attention model matching dense baselines.

Why this matters

Long-context pretraining is among the most compute-intensive operations in modern LLM development, and a 1.4-1.7x wall-clock speedup at 98K context with no inference-time tax could meaningfully shrink compute budgets for any lab training at scale. The technique sidesteps the two main adoption blockers of prior sparse attention work -- custom CUDA kernels and inference-side complexity -- making it a credible drop-in for existing pretraining pipelines. Because the code is open source, independent researchers and smaller labs can adopt it immediately, narrowing the gap between frontier training infrastructure and community-level runs.
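To make the budget impact concrete: a wall-clock speedup of s means the same run finishes in 1/s of the time, so the fraction of time saved is 1 - 1/s. The sketch below applies that arithmetic to the reported 1.4-1.7x range. The function name and the 30-day baseline are illustrative assumptions, not figures from the source.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved by a given end-to-end speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Reported end-to-end speedup range at 98K context.
REPORTED_SPEEDUPS = (1.4, 1.7)

# Hypothetical baseline: a dense run that would take 30 days.
BASELINE_DAYS = 30.0

for s in REPORTED_SPEEDUPS:
    saved = time_saved_fraction(s)
    print(
        f"{s:.1f}x -> {saved:.1%} less wall-clock time, "
        f"{BASELINE_DAYS / s:.1f} days instead of {BASELINE_DAYS:.0f}"
    )
```

Under those assumptions, the reported range corresponds to roughly 29% to 41% less wall-clock time for the same token budget.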

Summary

Nous Research has open-sourced Lighthouse Attention, a training-only sparse attention method that sharply cuts pretraining cost at long context without custom CUDA kernels or any inference-time architectural changes. The mechanism uses hierarchical, selection-based attention during training to skip low-value token interactions, then runs a brief dense-SDPA resumption phase near the end of training to produce a standard full-attention checkpoint. The final model is inference-identical to a normally trained transformer and matches or exceeds dense-from-scratch quality at the same token budget. In effect, Nous Research hands the open-source pretraining community a low-commitment speedup compatible with most existing pipelines.

At 98K context, Lighthouse delivers 1.4-1.7x end-to-end wall-clock pretraining speedup.
At 512K context on a single B200, forward and backward passes run 17x faster than cuDNN SDPA.
No auxiliary losses, no custom sparse kernels, and no inference-side modifications are required.

For labs training at 100K+ context, this translates into lower compute bills or faster iteration cycles without touching inference infrastructure.
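The two-phase recipe described above can be pictured as a step-based schedule: sparse attention for most of training, then dense SDPA for a short final stretch. The following is a minimal sketch under stated assumptions, not Nous Research's implementation. The names `AttentionSchedule` and `dense_fraction` and the 5% value are hypothetical; the source does not say how long the resumption phase is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSchedule:
    """Hypothetical schedule: sparse attention first, dense SDPA at the end."""

    total_steps: int
    dense_fraction: float  # assumed share of final steps run with dense attention

    def __post_init__(self) -> None:
        if self.total_steps <= 0:
            raise ValueError("total_steps must be positive")
        if not 0.0 <= self.dense_fraction <= 1.0:
            raise ValueError("dense_fraction must be in [0, 1]")

    @property
    def switch_step(self) -> int:
        """First 0-indexed step that uses dense attention."""
        dense_steps = round(self.total_steps * self.dense_fraction)
        return self.total_steps - dense_steps

    def mode(self, step: int) -> str:
        """Return which attention path a given training step should use."""
        if not 0 <= step < self.total_steps:
            raise IndexError("step out of range")
        return "dense" if step >= self.switch_step else "sparse"


# Illustrative values only: 10,000 steps with the last 5% run dense.
schedule = AttentionSchedule(total_steps=10_000, dense_fraction=0.05)
for step in (0, schedule.switch_step - 1, schedule.switch_step, 9_999):
    print(step, schedule.mode(step))
```

In a real training loop, the value returned by `mode` would select between the sparse training path and a standard dense attention call. Because only the training path changes, the saved checkpoint needs no special handling at inference time.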

Potential risks and opportunities

Risks

Labs that undersize the dense-SDPA resumption phase could ship models with subtle capability gaps versus dense baselines, a defect unlikely to surface until post-deployment evaluation on long-context benchmarks
The 17x speedup figure is measured on a single B200 -- multi-node distributed settings may see substantially reduced gains, potentially misleading teams planning large-scale pretraining runs
Concentrated adoption of one sparse attention recipe means a discovered flaw in Lighthouse's token-selection mechanism could simultaneously affect many open-source pretraining runs, creating correlated quality failures across the ecosystem

Opportunities

Cloud GPU providers (CoreWeave, Lambda Labs) can package Lighthouse-optimized long-context training as a premium offering, directly targeting labs budgeting for 100K+ context runs
Open-source model builders (Mistral, EleutherAI, Allen AI) can integrate Lighthouse immediately to push context lengths further within existing compute budgets, accelerating competitive release timelines
MLOps and training infrastructure vendors (Modal, Determined AI, Weights & Biases) could build first-class Lighthouse support into their orchestration layers, capturing mindshare in the pretraining efficiency workflow segment

What we don't know yet

Whether the 1.4-1.7x speedup holds across model scales and architectures beyond the specific configurations reported in arXiv 2605.06554
How the minimum duration of the dense-SDPA resumption phase affects final model quality, and whether shorter resumptions introduce measurable capability degradation
Whether Lighthouse Attention composes cleanly with distributed long-context techniques like ring attention or sequence parallelism at multi-node scale
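One way to reason about the resumption-length question is the speed side of the trade-off: every step spent in dense attention forgoes the sparse-phase speedup. The sketch below uses simple Amdahl-style arithmetic and assumes every dense step costs one time unit and every sparse step costs 1/s units. Both `sparse_step_speedup` and `dense_fraction` are hypothetical inputs; the source reports only end-to-end figures and does not break them down this way.

```python
def end_to_end_speedup(sparse_step_speedup: float, dense_fraction: float) -> float:
    """Overall speedup when a fraction of steps falls back to dense attention.

    Assumes a dense step costs 1 time unit and a sparse step costs
    1 / sparse_step_speedup units (a simplification, not a measured model).
    """
    if sparse_step_speedup <= 0:
        raise ValueError("sparse_step_speedup must be positive")
    if not 0.0 <= dense_fraction <= 1.0:
        raise ValueError("dense_fraction must be in [0, 1]")
    relative_time = dense_fraction + (1.0 - dense_fraction) / sparse_step_speedup
    return 1.0 / relative_time


# Illustrative sweep with an assumed 2.0x per-step speedup in the sparse phase.
for dense_fraction in (0.0, 0.05, 0.10, 0.25):
    overall = end_to_end_speedup(2.0, dense_fraction)
    print(f"dense_fraction={dense_fraction:.2f} -> {overall:.2f}x overall")
```

Under this simplified model, a longer resumption phase steadily erodes the overall speedup, which is why the minimum resumption length that still recovers dense-baseline quality matters for planning.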

Originally reported by marktechpost.com

Original headline: Nous Research Open-Sources Lighthouse Attention: 1.4–1.7× Pretraining Speedup at Long Context, 17× Faster Forward+Backward at 512K on B200