From acoustic similarity to perceptual relevance

From Sound to Meaning: Leveraging Audio Language Models for Music Relevance Assessment

How foundation audio models and Audio LLMs bridge the gap between acoustic similarity and perceptual relevance.

TL;DR: Acoustic similarity ≠ perceptual relevance; two tracks can share timbre and tempo yet serve opposite listening contexts. Embedding models measure sound; Audio LLMs reason about context. Foundation models (MusicFM, CLAP, MuQ-MuLan) provide fast vector retrieval, while Audio LLMs (Qwen2-Audio, Music Flamingo, Qwen3-Omni, LTU) add compositional reasoning and explainability. The future of music recommendation is fast retrieval plus slow reasoning: a hybrid pipeline retrieves candidates via embeddings, captions them offline, and reranks with an LLM judge, then feeds those judgments back as hard negatives to strengthen the embedding space. ...
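As a rough illustration of that retrieve-then-rerank loop, here is a minimal Python sketch. Everything in it is a placeholder assumption, not the post's actual implementation: embed() stands in for a foundation audio encoder such as CLAP or MuQ-MuLan, judge() stands in for an Audio LLM scoring offline captions (a crude word-overlap score is used instead), and the 0.2 threshold is arbitrary.

```python
"""Sketch of the fast-retrieval / slow-reasoning hybrid pipeline.

All function bodies are stand-ins: embed() mimics a foundation audio
encoder, judge() mimics an Audio LLM reranking from offline captions.
"""
import numpy as np

RNG = np.random.default_rng(42)

def embed(track_ids: list) -> np.ndarray:
    # Placeholder for a real audio encoder: returns L2-normalized
    # vectors so that a dot product equals cosine similarity.
    v = RNG.normal(size=(len(track_ids), 512))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def retrieve(query: np.ndarray, catalog: np.ndarray, k: int = 50) -> np.ndarray:
    # Fast stage: brute-force cosine search over precomputed embeddings
    # (swap in an ANN index like FAISS at catalog scale).
    return np.argsort(catalog @ query)[::-1][:k]

def judge(query_caption: str, cand_caption: str) -> float:
    # Slow stage: an Audio LLM would reason over captions generated
    # offline; here, Jaccard word overlap is a crude stand-in score.
    q = set(query_caption.lower().split())
    c = set(cand_caption.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank(cand_ids, captions, query_caption, threshold=0.2):
    scored = [(i, judge(query_caption, captions[i])) for i in cand_ids]
    keep = [i for i, s in scored if s >= threshold]
    # Candidates the judge rejects become hard negatives for the next
    # round of contrastive fine-tuning of the embedding model.
    hard_negatives = [i for i, s in scored if s < threshold]
    return keep, hard_negatives

if __name__ == "__main__":
    tracks = [f"track_{i}" for i in range(1000)]
    catalog = embed(tracks)
    query_vec = embed(["query"])[0]
    captions = {i: "mellow acoustic folk for late-night study" for i in range(1000)}
    cands = retrieve(query_vec, catalog)
    keep, hard_neg = rerank(cands, captions, "calm late-night study playlist")
    print(len(keep), "kept,", len(hard_neg), "hard negatives for retraining")
```

The point of the feedback arrow is that the expensive LLM judgments are amortized: each rejected candidate sharpens the cheap embedding space, so over time the fast stage needs the slow stage less often.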

March 7, 2026 · 22 min · 4623 words · Nikolai Zakharov (罗一阳)

[AI for all] What is Intelligence?

🤔 So what exactly is intelligence? Intelligence is what makes us “Homo sapiens” - literally wise humans, thinking beings. What makes us intelligent is that we can anticipate that our actions will lead us to our goals. And what makes machines intelligent? The same thing - their actions should lead to achieving their goals. The catch is that we set the goals, and machines learn to optimize how best to achieve them. So we must be absolutely sure we set the goals correctly; otherwise we risk over-optimizing, especially if (when?) we build machines that are much smarter than us and may well be able to optimize us too :) ...

January 29, 2025 · 1 min · 209 words · Nikolai Zakharov (罗一阳)