huggingface.co web signal

NMO benchmark exposes drug-discovery bias in molecular models

TL;DR

  • f-RAG and GenMol scored 0/5 on all three NMO physics tasks; the simpler molGA genetic algorithm cleared two of the three.
  • The authors report top molecules at ZT = 8.5, κph = 0.1 pW/K, and P = 9.9 using a new Graph Group SELFIES representation pretrained on 300,000 synthetic graphs.
  • NMO replaces pharmaceutical proxy oracles with xTB quantum simulations on a fixed budget of 10,000 evaluations per seed across five seeds.

A new benchmark from a University of Augsburg group swaps the drug-discovery proxy oracles that generative molecular models usually train and test against for actual quantum chemistry, and the leaderboard collapses. According to the paper on Hugging Face, two of the more visible recent methods, the fragment-retrieval model f-RAG and the diffusion-based GenMol, hit zero on every one of the three new tasks. A plain training-free genetic algorithm called molGA cleared two of them.

The benchmark, Nanotechnology Molecular Optimization, targets three materials problems: single-molecule junctions that block heat (phonon thermal insulators), molecular junctions that convert heat to electricity (thermoelectrics, scored by the figure of merit ZT), and self-assembled monolayers for terahertz detection (molecular optomechanics). The protocol is strict: a fixed budget of 10,000 oracle evaluations per seed, five consecutive seeds, a single configuration across all three tasks, and a shared fragment library, with the underlying physics computed via the semi-empirical xTB method rather than the cheap proxy oracles that drug-discovery leaderboards lean on.
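The fixed-budget protocol described above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's actual harness: the names `BudgetedOracle` and `run_benchmark`, the optimizer signature, and the toy scoring function are all hypothetical, and the real oracle would be an xTB-based simulation rather than a string-length score.

```python
import random
from typing import Callable, List, Sequence


class BudgetedOracle:
    """Wraps a scoring function and refuses calls past a fixed budget (hypothetical helper)."""

    def __init__(self, score_fn: Callable[[str], float], budget: int = 10_000):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    def __call__(self, molecule: str) -> float:
        if self.calls >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        self.calls += 1
        return self.score_fn(molecule)


def run_benchmark(
    optimizer: Callable[[BudgetedOracle, random.Random], float],
    score_fn: Callable[[str], float],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
    budget: int = 10_000,
) -> List[float]:
    """Run one optimizer under one configuration, with a fresh budget per seed."""
    best_per_seed = []
    for seed in seeds:
        oracle = BudgetedOracle(score_fn, budget)  # budget resets for every seed
        best_per_seed.append(optimizer(oracle, random.Random(seed)))
    return best_per_seed


# Toy stand-ins (assumptions for illustration only).
def toy_score(molecule: str) -> float:
    return float(len(molecule))


def random_search(oracle: BudgetedOracle, rng: random.Random) -> float:
    best = float("-inf")
    for _ in range(oracle.budget):  # spend exactly the allowed evaluations
        best = max(best, oracle("C" * rng.randint(1, 5)))
    return best


if __name__ == "__main__":
    print(run_benchmark(random_search, toy_score, budget=100))
```

The point of the wrapper is that every method sees the same number of oracle calls per seed, so a method cannot look better by simply evaluating more candidates.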

The authors argue that recent gains in generative molecular design have ridden on pharmaceutical pretraining and pharma-shaped proxies, and do not carry over once the physics changes. They pair that analysis with a new representation, Graph Group SELFIES, which natively models electrode attachment points as source and sink nodes in a directed graph. To avoid pharmaceutical bias, they pretrain it on 300,000 randomly generated synthetic graphs. The reported top molecules reach ZT = 8.5 for thermoelectrics, κph = 0.1 pW/K for phonons, and P = 9.9 for optomechanics, each beating the reference numbers the paper compares against.
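To make the idea of a directed graph with source and sink nodes concrete, here is a small sketch of one way to generate random graphs of that shape. It is not the paper's generator or the Graph Group SELFIES encoding: the function name `random_source_sink_graph`, the `extra_edge_prob` parameter, and the construction itself (forward-only edges, node 0 as source, last node as sink) are assumptions made purely for illustration.

```python
import random
from typing import Dict, List


def random_source_sink_graph(
    n_nodes: int, rng: random.Random, extra_edge_prob: float = 0.2
) -> Dict[int, List[int]]:
    """Return an adjacency list where node 0 is the source and node n_nodes-1 the sink.

    Edges only run from a lower index to a higher one, so the graph is acyclic.
    (Hypothetical construction, not taken from the paper.)
    """
    if n_nodes < 2:
        raise ValueError("need at least a source and a sink")
    adj: Dict[int, List[int]] = {i: [] for i in range(n_nodes)}

    # Give every non-source node one incoming edge from an earlier node,
    # so each node is reachable from the source.
    for i in range(1, n_nodes):
        adj[rng.randrange(0, i)].append(i)

    # Give every non-sink node at least one outgoing edge to a later node,
    # so each node can reach the sink.
    for i in range(n_nodes - 1):
        if not adj[i]:
            adj[i].append(rng.randrange(i + 1, n_nodes))

    # Sprinkle in extra forward edges for variety.
    for i in range(n_nodes - 1):
        for j in range(i + 1, n_nodes):
            if j not in adj[i] and rng.random() < extra_edge_prob:
                adj[i].append(j)
    return adj


if __name__ == "__main__":
    print(random_source_sink_graph(6, random.Random(0)))
```

In a molecular-junction setting, the source and sink would stand in for the two electrode attachment points, with everything between them forming the conducting backbone.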

The honest caveat is that xTB is itself semi-empirical, and the authors flag that it can overestimate up-conversion on the optomechanics task, with DFT validation recommended on top candidates. Only five seeds were run, the fragment library is hand-curated by domain experts (the paper calls this an explicit, controllable bias), and synthetic accessibility scores tell you a molecule might be makeable, not that anyone has made one. What the reporting does not give you is wet-lab confirmation that any of the top hits actually works in a device.

The forward-looking question is what this implies for the field. If pharmaceutical priors are doing more of the work in generative molecule benchmarks than the headline numbers implied, the development to watch is materials-science groups building their own pretraining sets and physics-grounded benchmarks instead of borrowing pharma's.