
# From Sound to Meaning: Leveraging Audio Language Models for Music Relevance Assessment

*How foundation audio models and Audio LLMs bridge the gap between acoustic similarity and perceptual relevance*

## TL;DR

- Acoustic similarity ≠ perceptual relevance. Two tracks can share timbre and tempo yet serve opposite listening contexts.
- Embedding models measure sound; Audio LLMs reason about context.
- Foundation models (MusicFM, CLAP, MuQ-MuLan) provide fast vector retrieval.
- Audio LLMs (Qwen2-Audio, Music Flamingo, Qwen3-Omni, LTU) add compositional reasoning and explainability.
- The future of music recommendation is fast retrieval + slow reasoning. A hybrid pipeline retrieves candidates via embeddings, captions them offline, and reranks with an LLM judge — then feeds those judgments back as hard negatives to strengthen the embedding space.

## Introduction

...
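The retrieve-then-rerank loop from the TL;DR can be sketched in a few lines. This is a toy illustration, not a production pipeline: the embedding vectors, captions, and the keyword-overlap "judge" below are all hypothetical stand-ins for real model outputs (e.g. CLAP embeddings and an Audio LLM reranker).

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Stage 1 — fast retrieval: rank catalog tracks by embedding similarity.
# In practice the vectors would come from a foundation model such as CLAP.
def retrieve(query_vec, catalog, k=2):
    scored = sorted(catalog, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return scored[:k]

# Stage 2 — slow reasoning: an LLM judge reranks candidates using their
# offline captions. A keyword-overlap stub stands in for the real LLM call.
def llm_judge(query_caption, candidates):
    query_words = set(query_caption.split())
    return sorted(candidates,
                  key=lambda t: len(query_words & set(t["caption"].split())),
                  reverse=True)

catalog = [
    {"id": "a", "vec": [0.9, 0.1], "caption": "upbeat workout electronic"},
    {"id": "b", "vec": [0.8, 0.2], "caption": "calm ambient sleep"},
    {"id": "c", "vec": [0.1, 0.9], "caption": "acoustic folk ballad"},
]

candidates = retrieve([1.0, 0.0], catalog)           # acoustically similar tracks
ranked = llm_judge("upbeat workout playlist", candidates)

# Track "b" sounds similar but fails the context check — a hard negative
# that can be fed back to sharpen the embedding space.
hard_negatives = [t["id"] for t in ranked[1:]]
print([t["id"] for t in ranked], hard_negatives)  # → ['a', 'b'] ['b']
```

The key point the sketch captures: the embedding stage admits track "b" because it sounds like the query, and only the caption-based judge demotes it, producing exactly the kind of hard negative the feedback loop needs.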