huggingface.co web signal

MuSViT: first foundation vision model built for sheet music

TL;DR

  • MuSViT is a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages of the International Music Score Library Project (IMSLP).
  • Under linear probing on full-page score recognition, MuSViT reports 16.4% SER versus 48.6% for PaliGemma 2 and 56.9% for DINOv3-7B.
  • MuSViT ships in 85M and 25M parameter variants and is evaluated on four tasks: full-page and staff-level recognition, symbol detection, and difficulty classification.

Sheet music is one of those 2D symbolic document types that general-purpose vision transformers keep tripping over, and a new paper on Hugging Face argues the answer is not more scale but a domain-specific foundation model. MuSViT, from a team led by Antonio Rios-Vila, is described as the first foundation vision model for sheet music representation, a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the International Music Score Library Project (IMSLP).

The framing matters because of the size gap. MuSViT ships as an 85M-parameter encoder plus a 25M-parameter light variant, and the authors pit both against much larger general-purpose baselines, including DINOv3-7B, which has roughly 80 times the parameters of the larger MuSViT variant. Under linear probing on full-page music score recognition, the authors report a Symbol Error Rate (SER, lower is better) of 16.4% for MuSViT versus 48.6% for PaliGemma 2, 51.0% for Qwen3-VL and 56.9% for DINOv3-7B. Fine-tuned, MuSViT reports 10.9% SER, against 20.0% for what the paper identifies as the task-specific state of the art. On music symbol detection, linear-probed MuSViT reports 79.7% mAP (higher is better) against 70.4% for DINOv3-7B.
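
For readers unfamiliar with the metric, here is a minimal sketch of how a Symbol Error Rate is commonly computed in OMR work: the edit distance between predicted and reference symbol sequences, divided by the number of reference symbols. This write-up does not specify the paper's tokenization or normalization, so the token names, the function names and the corpus-level averaging below are assumptions for illustration, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of symbol tokens."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference symbol
                curr[j - 1] + 1,     # insertion of a spurious symbol
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def symbol_error_rate(refs, hyps):
    """Corpus-level SER in percent: total edits / total reference symbols."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_symbols = sum(len(r) for r in refs)
    if total_symbols == 0:
        raise ValueError("reference set contains no symbols")
    return 100.0 * total_edits / total_symbols


# Hypothetical token vocabulary, chosen only for this example.
reference = ["clef-G2", "time-4/4", "note-C4_quarter", "note-D4_quarter", "barline"]
predicted = ["clef-G2", "time-4/4", "note-C4_quarter", "note-E4_quarter"]

# One substitution (D4 -> E4) plus one deletion (barline) over 5 symbols.
print(symbol_error_rate([reference], [predicted]))  # 40.0
```

On this definition, the reported 16.4% corresponds to roughly one edit for every six reference symbols, and 56.9% to more than one edit for every two.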

The recipe behind those numbers is a two-stage curriculum. Stage one is a synthetic warm-up on DeepScoresV2 at 512×512 with a 50% mask ratio; stage two is MAE on the full IMSLP corpus at 1,024×1,024 with a 70% mask ratio. The authors say training directly on IMSLP without the warm-up causes dimensional collapse, where the learned embeddings occupy only a small subspace of the representation space, so they describe the curriculum as necessary rather than merely beneficial. An embedding-transcription consistency analysis then reports positive Pearson correlations of 0.606 and 0.665 for MuSViT, where general-purpose encoders come out negative. The authors read that as evidence MuSViT is actually encoding symbolic musical structure in its representation space rather than treating the page as generic pixels.
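
To make the two stages concrete, the sketch below works out how many patches the encoder would actually see per image under each setting, and draws one random mask. The resolutions and mask ratios come from the description above; the 16×16 patch size, the function names and the stage labels are assumptions, since the write-up does not state them. The visible-patch count follows the convention of the original MAE reference implementation (truncating `num_patches * (1 - mask_ratio)`).

```python
import random

PATCH_SIZE = 16  # assumption: not stated in the write-up

# Resolutions and mask ratios as reported; names are illustrative only.
STAGES = [
    {"name": "warm-up (DeepScoresV2)", "image_size": 512, "mask_ratio": 0.5},
    {"name": "main (IMSLP)", "image_size": 1024, "mask_ratio": 0.7},
]


def mae_patch_budget(image_size, patch_size, mask_ratio):
    """Return (total patches, patches left visible to the encoder)."""
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_visible = int(num_patches * (1 - mask_ratio))
    return num_patches, num_visible


def random_mask(num_patches, num_visible, rng):
    """Pick a uniformly random subset of patch indices to keep visible."""
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked


rng = random.Random(0)
for stage in STAGES:
    total, visible_count = mae_patch_budget(
        stage["image_size"], PATCH_SIZE, stage["mask_ratio"]
    )
    visible, masked = random_mask(total, visible_count, rng)
    print(stage["name"], total, len(visible), len(masked))
# warm-up: 1024 patches, 512 visible, 512 masked
# main:    4096 patches, 1228 visible, 2868 masked
```

Under these assumptions, the second stage quadruples the number of patches per page while hiding a larger share of them, so the reconstruction task gets both finer-grained and harder after the warm-up.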

The honest caveat is that this is one research paper doing its own benchmarking on its own suite, and OMR has a contested state-of-the-art picture across historical corpora. What the write-up does not really give you is how the encoder holds up on adversarial inputs, on degraded scans well outside the IMSLP distribution, on notation systems other than Common Western Music Notation and mensural notation, or when it is dropped into a full engraving-to-MusicXML pipeline rather than a benchmark head. Take the reported specifics as reported, not as settled community consensus.

Still, the direction is what interests me. If a domain-specific 85M ViT can outrun a 7B general model on structured symbolic pages, the read-across is that other structured 2D document genres (chemistry drawings, circuit schematics, engineering diagrams) are probably sitting on similar unclaimed foundation-model gains for whichever teams are willing to curate the corpus.