When a music producer searches for a reference track, they rarely want "something that sounds similar overall." They want a track with the same groove but a different melody, or the same timbre but a completely different tempo. Current music similarity tools cannot make that distinction. MERIT, a new framework from researchers at the Advanced Music Technology and AI Lab, changes that.

Published on arXiv on May 26, 2026, MERIT (Music dEcomposed RepresentaTion) introduces disentangled music representations that separate similarity into three independent dimensions: melody, rhythm, and timbre. Each dimension gets its own neural head trained to respond only to its target perceptual quality, ignoring the others.

What Happened

A team led by Abhinaba Roy, Junyi Liang, and Dorien Herremans at SUTD trained three lightweight projection heads on top of a frozen MERT encoder, a music-specific transformer pre-trained on large-scale audio. Each head is a two-layer MLP optimized with Circle Loss to pull together audio pairs that share only one musical factor while pushing apart pairs that differ on that factor.

The training data was generated synthetically using conditional music generation and source separation, creating over 296,000 triplets where exactly one musical dimension changes at a time. That deliberate control is what forces each head to learn only what it is supposed to learn.

The result: the melody head achieves 99.9% accuracy on melody triplets but drops to 58.4% on rhythm triplets, barely above chance. The rhythm head hits 100% on rhythm triplets and falls to 47.7% on melody. The timbre head scores 99.6% on timbre. Each head truly responds to its intended dimension and ignores the others.

Why It Matters for Creators

Most music similarity tools produce a single score. Feed them two songs and you get a number. The problem is that a track with the same chord progression but completely different drums might score low, while a track with the same drum pattern and a totally different melody might score high. The result is unreliable for creative workflows.

MERIT gives producers three separate levers:

  • Melody similarity: find tracks with the same melodic contour, useful for cover detection, composition research, and licensing clearance
  • Rhythm similarity: find tracks with the same groove and tempo feel, critical for DJ mixing, beat selection, and rhythm section reference
  • Timbre similarity: find tracks with the same sonic color, useful for sound design matching, orchestration reference, and stem-level mixing decisions

On zero-shot external benchmarks, the rhythm head reaches 88.0% on the Ballroom dance style dataset, outperforming raw MERT (78.0%) without any task-specific fine-tuning. The timbre head reaches 78.9% on instrument classification from MUSDB18-HQ. These results generalize to new tasks without retraining.

How MERIT Works

The architecture is deliberately lightweight. The MERT backbone stays frozen, meaning the whole system trains quickly on modest hardware. The three projection heads learn to project 768-dimensional MERT embeddings into factor-specific similarity spaces.

MERIT music similarity model with melody rhythm and timbre separation
ComponentRoleTraining Triplets
Melody headPitch contour and harmonic movement125,000 (rhythm + timbre varied)
Rhythm headBeat pattern and temporal feel125,000 (melody + timbre varied)
Timbre headInstrument color and spectral texture46,241 (melody + rhythm varied)

Triplets were created by taking a source track, using conditional generation to regenerate it with only one musical property changed, then using source separation to isolate stems. This gives precise control over what changes between pairs.

The dataset and pretrained models are available on GitHub and the training triplets are on HuggingFace under a CC BY-NC-SA 4.0 license.

Creator Outcome: What You Can Build Today

MERIT is a research release with working code and pretrained models. Here is what it enables right now:

Music search by similarity factor using MERIT AI system
  1. Factor-specific sample search: Index your library with MERIT embeddings, then query by melody, rhythm, or timbre independently using cosine similarity in each embedding space
  2. Mix reference matching: Retrieve songs with the same timbral profile for mixing decisions without being thrown off by tempo or melody differences
  3. Cover detection: The melody head alone is a strong signal for identifying cover versions and melodic interpolations, more reliable than overall similarity scores
  4. DJ playlist curation: Use the rhythm head to build playlists where every track has the same energy and tempo feel, regardless of genre or instrumentation

For producers already working with local stem separation tools or open-source music generation, MERIT adds a semantic search layer on top of the output. Producers using stem-based production workflows can apply MERIT to individual stems rather than the mixed track for more granular similarity matching.

What to Try Next

The MERIT repository includes inference scripts for computing embeddings for any audio file. You need Python, PyTorch, and the pretrained weights from the GitHub release:

Local MERIT music analysis setup with headphones and waveform
  1. Clone the repository: github.com/AMAAI-Lab/MERIT
  2. Install dependencies and download the pretrained checkpoint
  3. Run the embedding script on your audio files to generate factor-specific vectors
  4. Build a nearest-neighbor index using FAISS or cosine similarity over your library
  5. Query by uploading any track and specifying which dimension to match

Human evaluation confirmed the synthetic training data reflects real perceptual differences: melody pairs scored 60.0/100 on melodic similarity, rhythm pairs scored 65.8/100 on rhythmic similarity in blind listening tests.

Frequently Asked Questions

What is the difference between MERIT and a music fingerprinting tool like AudD?

Fingerprinting tools identify exact or near-exact audio matches for copyright detection. MERIT identifies perceptual similarity in specific musical dimensions. A different production of the same song might fingerprint differently but score high on MERIT melody similarity.

Can MERIT analyze any genre of music?

MERIT was trained on a diverse dataset, but evaluation focused on Western music. The rhythm head showed strong results on Ballroom dance styles. Performance on non-Western rhythmic systems has not been formally evaluated.

Does MERIT require a GPU to run?

The MERT backbone benefits from GPU acceleration, but the projection heads are lightweight MLPs. Inference on CPU is feasible for small libraries; larger-scale indexing benefits from GPU.

Is the training data available for fine-tuning on my own music?

Yes. The 296,000+ training triplets are on HuggingFace under CC BY-NC-SA 4.0. You can use them to retrain or fine-tune the projection heads for domain-specific music.

How does MERIT handle songs with multiple instruments?

The MERT backbone processes the full mixed audio signal. Source separation is used during training data generation but not at inference time. The timbre head responds to dominant instruments in the mix, which aligns with how producers typically perceive timbre.