HAKARI-Bench benchmarks 55 retrieval models in 43 languages
TL;DR
- HAKARI-Bench evaluates 55 models across 551 retrieval tasks and 43 languages under identical conditions, covering dense, sparse, ColBERT, reranker, and BM25 architectures.
- Nano-sets of 50-200 queries and 1K-10K documents reproduce full-benchmark rankings at Spearman correlations of 0.973 to 0.983 across three benchmark families.
- Int8 quantization with rescoring loses only 0.09 nDCG@10 versus full precision, while architecture rankings shift substantially across languages and domains.
Evaluating retrieval models is genuinely expensive. With dozens of embedding architectures, sparse methods, and rerankers to compare across languages and domains, most teams default to a single leaderboard and ship. HAKARI-Bench is a new open evaluation framework designed to change that calculus, covering 35 benchmarks and 551 retrieval tasks across 43 languages under unified conditions.
The framework tested 55 models in total: 33 dense embeddings, 4 sparse representations, 6 late interaction (ColBERT-family) models, 11 rerankers, and BM25 as a lexical baseline, all evaluated on identical task sets. That cross-architecture comparison on the same footing is harder to arrange than it sounds, since most prior leaderboards mix evaluation protocols and make dense-versus-reranker conclusions unreliable.
The core engineering bet is what the authors call Nano-sets: each task is compressed to 50 to 200 queries and 1K to 10K documents. The paper reports that these compact versions reproduce full-benchmark model rankings at Spearman correlations of 0.975, 0.983, and 0.973 when compared against MMTEB, MTEB v2, and BEIR respectively. On quantization, int8 with rescoring loses only 0.09 nDCG@10 versus full precision, which is effectively lossless given the storage and retrieval cost reduction it enables.
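To make the ranking-agreement figure concrete, here is a minimal sketch of how a Spearman correlation between full-benchmark and Nano-set scores can be computed. It is not taken from the HAKARI-Bench code: the function names are hypothetical and the five scores are invented purely for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented nDCG@10 values for five hypothetical models (same model order in both lists).
full_scores = [0.52, 0.48, 0.61, 0.39, 0.55]
nano_scores = [0.57, 0.50, 0.66, 0.41, 0.56]

rho = spearman(full_scores, nano_scores)
print(f"Spearman rho = {rho:.3f}")
```

In this toy example the absolute scores differ between the two lists and two neighbouring models swap places, yet the correlation stays high. That is the sense in which a compact set can preserve rankings without matching absolute scores.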
The framework comes with real caveats. Absolute scores on Nano-sets can differ from official benchmarks by up to 7 points, so they are not a drop-in replacement for final production validation. Reranker evaluation relies on a fixed hybrid candidate set that achieves roughly 87% relevant-document coverage and does not fully reflect end-to-end two-stage pipelines. Inference speed is also explicitly not measured, a meaningful gap for teams where latency is the binding constraint.
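The coverage caveat matters because a reranker can only reorder the candidates it is given: any relevant document missing from the candidate set is unrecoverable. The sketch below shows one way such a coverage figure could be computed. It assumes coverage is micro-averaged over all judged-relevant (query, document) pairs, which the article does not specify; the function name and the tiny dataset are hypothetical.

```python
def relevant_coverage(candidates, qrels):
    """Fraction of judged-relevant (query, doc) pairs present in the candidate sets.

    candidates: dict mapping query id -> list of candidate doc ids
    qrels:      dict mapping query id -> set of relevant doc ids
    Assumption: micro-average over pairs, not a per-query average.
    """
    found = 0
    total = 0
    for query_id, relevant_docs in qrels.items():
        pool = set(candidates.get(query_id, []))
        found += len(relevant_docs & pool)
        total += len(relevant_docs)
    return found / total if total else 0.0


# Invented relevance judgments and candidate lists for three queries.
qrels = {
    "q1": {"d1", "d2"},
    "q2": {"d3"},
    "q3": {"d4", "d5", "d6"},
}
candidates = {
    "q1": ["d1", "d9", "d2"],  # both relevant docs retrieved
    "q2": ["d7", "d8"],        # the relevant doc is missing
    "q3": ["d4", "d5", "d0"],  # two of three relevant docs retrieved
}

coverage = relevant_coverage(candidates, qrels)
print(f"coverage = {coverage:.1%}")
```

Here four of six relevant pairs are present, so even a perfect reranker would be evaluated against an incomplete pool for queries q2 and q3.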
The practical upside is scope-aware model selection. Architecture rankings shift substantially depending on language and domain: on the English BEIR subset alone, late interaction models move to first place and learned sparse methods enter the top quartile, patterns that do not hold across all 43 languages. A multi-axis filtering interface covering language, domain, query length, and model size helps teams find the right model for their actual use case rather than the global leaderboard default. The code and Nano-sets are released under an MIT license at github.com/hakari-bench/hakari-bench.
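As a rough illustration of scope-aware selection, the sketch below ranks models by mean nDCG@10 over only the tasks that match a set of filters. It does not use the HAKARI-Bench interface or data: the record layout, the `rank_models` helper, the model names, and every score are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented per-task results; a real export would have many more rows and axes.
results = [
    {"model": "model_a", "language": "en", "domain": "web", "ndcg10": 0.50},
    {"model": "model_a", "language": "ja", "domain": "web", "ndcg10": 0.60},
    {"model": "model_b", "language": "en", "domain": "web", "ndcg10": 0.58},
    {"model": "model_b", "language": "ja", "domain": "web", "ndcg10": 0.40},
]


def rank_models(rows, **filters):
    """Rank models by mean nDCG@10 over the rows that match every filter."""
    per_model = defaultdict(list)
    for row in rows:
        if all(row.get(axis) == value for axis, value in filters.items()):
            per_model[row["model"]].append(row["ndcg10"])
    averaged = ((model, mean(scores)) for model, scores in per_model.items())
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


global_ranking = rank_models(results)
english_ranking = rank_models(results, language="en")

print("all tasks:", [model for model, _ in global_ranking])
print("English only:", [model for model, _ in english_ranking])
```

With these made-up numbers the model that leads on the global average is not the one that leads on English tasks, which is the kind of reordering a filtered view is meant to surface.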
Originally reported by huggingface.co
Original headline: HAKARI-Bench: Lightweight Retrieval Framework Tests 55 Models Across 35 Benchmarks and 43 Languages, Nano-Sets Reproduce Full Rankings at 0.975 Spearman