NanoEuler Implements GPT-2 Scale Transformer in Pure C and CUDA
TL;DR
- NanoEuler is a complete 116M-parameter transformer built in C and CUDA with no PyTorch or external ML framework.
- A hand-written FlashAttention kernel achieves roughly 3x speedup over naive attention using tiled computation and online softmax.
- Gradient verification against central finite differences in double precision yields maximum relative errors under 1.02e-04 across all parameter tensors.
The interesting thing about NanoEuler is not that it produces a capable model. The developer is candid that the roughly 116M-parameter model generates "fluent-ish English" with minimal real-world knowledge and no genuine understanding. The interesting thing is that the whole stack, from tokenizer through pretraining to supervised fine-tuning, was written by hand in C and CUDA without touching PyTorch or any ML framework.
The architecture is a decoder-only transformer with RMSNorm, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and grouped-query attention. The GPU version uses 768 dimensions, 12 query heads, 4 key/value heads, 16 layers, and a 512-token context window. The byte-level BPE tokenizer uses a 4,096-token vocabulary averaging roughly 3.4 bytes per token on English text. The developer also built a gradient checking suite comparing analytical gradients against central finite differences in double precision, with maximum relative errors across all parameter tensors coming in under 1.02e-04. That is a real verification discipline rather than a hope-it-works approach.
The CUDA engine delegates matrix multiplication to cuBLAS with TF32 tensor cores and includes a hand-written FlashAttention kernel using tiled computation and online softmax, achieving approximately 3x speedup over naive attention. The CPU version carries no external dependencies at all; the GPU path adds cuBLAS, libm, and OpenMP. The codebase is 73.7% CUDA and 23.2% C. Training ran in two stages: pretraining on Project Gutenberg classics and FineWeb-Edu educational content, then supervised fine-tuning on Alpaca instruction data with loss masked to response tokens.
The developer plans DPO (direct preference optimization) next and describes scaling to roughly 270M parameters as a future milestone. Two caveats apply. The project describes itself as a research artifact rather than a production tool, and the GPU path requires an NVIDIA card with compute capability 8.9 or higher, which narrows who can run the CUDA version without modification. The project also provides no benchmark score or training compute budget, so how it compares in efficiency to similar from-scratch efforts is an open question. For practitioners who want to understand what actually executes on GPU silicon rather than what a framework abstracts away, exposing that hidden layer is precisely what this codebase is designed to do.
Originally reported by github.com
Original headline: Show HN: NanoEuler — GPT-2 Scale Transformer Built in Pure C/CUDA From Scratch, Zero External Dependencies