Can AI Hear What We Feel?

An Investigation into Machine Music Perception
A Data-Driven Exploration of Artificial Intelligence and Human Emotion
0.047: AI-human correlation (practically zero)
45.6%: predictions off by more than 20 points
0/9: emotions where the AI beats a simple baseline

The AI that can write code, explain physics, and pass the bar exam
completely fails at understanding music emotion.

Consider, for a moment, what happens when you hear a song. Not just any song, but one that stops you mid-step, that floods you with a feeling you can't quite name. Maybe it's nostalgia—a bittersweet ache for something you can't remember losing. Maybe it's joy, pure and unfiltered, making you want to move. Or maybe it's something darker: tension, sadness, a knot in your chest that music somehow knows how to tie.

This is what music does. It reaches past our defenses, past language, past logic, and touches something fundamental in us. We don't question this. We accept it as one of those mysterious gifts of being human.

But what if a machine—an artificial intelligence—could do the same thing? Not feel the music, of course, but understand it. Predict what emotions a song would evoke in a listener. Not by analyzing lyrics or reading metadata, but by listening to the raw audio itself, the way you and I do.

This is not a hypothetical question anymore. It's an experiment we can run. And the results—well, they tell us something profound about both artificial intelligence and ourselves.

The Experiment

The setup was elegantly simple. Take 40 musical excerpts—songs that had already been rated by real humans using a carefully designed emotional framework called GEMS-9 (Geneva Emotional Music Scales). Each listener indicated whether they felt each of nine specific emotions while listening: amazement, solemnity, tenderness, nostalgia, calmness, power, joyful activation, tension, and sadness.

These weren't casual impressions. This was data from the Emotify dataset, where anywhere from 11 to 53 people rated each song, creating a statistical portrait of human emotional response to music. The complete dataset represents hundreds of hours of careful annotation using a validated research methodology.
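If you're curious what that "statistical portrait" looks like in practice, here is a minimal sketch of the aggregation step, assuming the raw annotations arrive as one row per listener and song with binary 0/1 emotion columns; the file and column names are hypothetical, not the Emotify originals.

```python
# A minimal sketch of building the human "ground truth", assuming one row per
# (listener, song) with binary 0/1 columns per emotion. Names are hypothetical.
import pandas as pd

EMOTIONS = ["amazement", "solemnity", "tenderness", "nostalgia", "calmness",
            "power", "joyful_activation", "tension", "sadness"]

raw = pd.read_csv("emotify_annotations.csv")      # placeholder path
human = raw.groupby("song_id")[EMOTIONS].mean()   # proportion endorsing each emotion, 0 to 1
```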

Then came the AI's turn. Google's Gemini—one of the most advanced multimodal AI systems available—was given the exact same audio files. No lyrics. No artist names. No context. Just the sound waves.

Listen For Yourself

[Interactive chart: Emotional Profile, AI vs. Humans. Select a song to compare the AI's predicted emotion profile against the human ratings.]

The Verdict

Here's what you need to know, upfront: Gemini failed. Spectacularly.

0.047: the correlation between AI predictions and human ratings (where 1.0 would be perfect)

That number—0.047—is statistically indistinguishable from zero. It means that across 360 song-emotion pairs, the AI's predictions had essentially no relationship to what humans actually felt.

But that's not even the most damning part. When you compare Gemini to the most brain-dead baseline imaginable—just predicting the average rating for each emotion every single time—Gemini performed worse. Between 1.23 and 2.05 times worse, depending on the emotion.

The AI wasn't just noisy. It wasn't adding any predictive value over a constant guess.
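For readers who want to sanity-check those headline numbers, here is a rough sketch of both comparisons, assuming a long-format table with one row per song-emotion pair and hypothetical columns emotion, human_mean, and ai_mean.

```python
# A minimal sketch of the two headline comparisons, assuming a long-format
# DataFrame with 360 rows (40 songs x 9 emotions) and hypothetical columns
# "emotion", "human_mean", "ai_mean".
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("ai_vs_human_long.csv")          # placeholder path

# Overall correlation across all 360 song-emotion pairs.
r, _ = pearsonr(df["ai_mean"], df["human_mean"])
print(f"overall Pearson r = {r:.3f}")

# Constant baseline: always predict each emotion's average human rating.
df["baseline"] = df.groupby("emotion")["human_mean"].transform("mean")

# How many times larger is the AI's error than the baseline's, per emotion?
mae = lambda pred, truth: (pred - truth).abs().mean()
for emotion, g in df.groupby("emotion"):
    ratio = mae(g["ai_mean"], g["human_mean"]) / mae(g["baseline"], g["human_mean"])
    print(f"{emotion:>18}: AI MAE is {ratio:.2f}x baseline")
```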

How Bad Is Bad?

Let's put this in perspective. The AI's mean absolute error was 0.227—which, remember, is on a scale from 0 to 1, where the number represents the proportion of listeners who endorsed that emotion.

In plain English: on average, the AI was off by 22.7 percentage points. If 60% of humans felt joy listening to a song, the AI might predict 37%. Or 83%. It was essentially throwing darts.

And nearly half the time—45.6% of predictions—the error exceeded 20 percentage points.
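Continuing with the same hypothetical table, both error figures fall straight out of the absolute differences.

```python
# A minimal sketch of the error figures quoted above, reusing the df from the
# previous sketch ("human_mean" and "ai_mean" columns).
errors = (df["ai_mean"] - df["human_mean"]).abs()

print(f"mean absolute error: {errors.mean():.3f}")                    # ~0.227, i.e. 22.7 points
print(f"share of errors > 20 points: {(errors > 0.20).mean():.1%}")   # ~45.6%
```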

The Pattern in the Chaos

Now, here's where it gets interesting. Because when you drill into the data, you discover that Gemini's failures weren't random. They were systematic. The AI had clear biases, clear blind spots, clear patterns of misunderstanding.

Emotion by Emotion

Some emotions confused the AI more than others. Here's the breakdown of how well Gemini correlated with human ratings for each emotion:

[Chart: AI Performance by Emotion, correlation with human ratings for each of the nine GEMS emotions]

Notice something? The "best" performance—nostalgia at r=0.264—is still abysmal by any practical standard. And solemnity actually has a negative correlation: the AI tends to rate songs as solemn when humans don't, and vice versa.
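The per-emotion breakdown is the same comparison as before, just grouped by emotion; a sketch, continuing with the same hypothetical table:

```python
# A minimal sketch of the per-emotion correlations, reusing df from above.
from scipy.stats import pearsonr

for emotion, g in df.groupby("emotion"):
    r, _ = pearsonr(g["ai_mean"], g["human_mean"])   # correlation across the 40 songs
    print(f"{emotion:>18}: r = {r:+.3f}")
```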

The Bias Problem

But correlation is only part of the story. The AI also showed consistent directional biases—systematically over-predicting some emotions and under-predicting others.

What does this mean in practice? The AI has a kind of Pollyanna syndrome. It hears music as happier, more tender, more powerful than humans do. And it dramatically underestimates tension—particularly when a song is genuinely tense.
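A quick way to see that bias, sketched against the same hypothetical table, is the mean signed error per emotion:

```python
# A minimal sketch of the directional-bias check, reusing df from above.
bias = (df["ai_mean"] - df["human_mean"]).groupby(df["emotion"]).mean()
print(bias.sort_values())   # positive = AI over-predicts, negative = under-predicts
```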

The Worst Offenders

Not all songs were equally misunderstood. Some were catastrophically misjudged.

Take Song 40. When you calculate the within-song correlation—whether the AI at least got the relative ranking of emotions right for this particular piece—you get -0.878. That's not just wrong. That's inverted. The emotions the AI thought were strongest were actually the weakest, and vice versa.

The AI predicted this song would evoke primarily joyful activation. Humans felt nostalgia.
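The within-song check is just a correlation across the nine emotions of a single clip; a sketch for Song 40, assuming a hypothetical song_id column in the same table:

```python
# A minimal sketch of the within-song check for one clip, reusing df from above.
from scipy.stats import pearsonr

song = df[df["song_id"] == 40]
r, _ = pearsonr(song["ai_mean"], song["human_mean"])   # across its nine emotions
print(f"within-song r = {r:.3f}")
print("AI's top emotion:   ", song.loc[song["ai_mean"].idxmax(), "emotion"])
print("Humans' top emotion:", song.loc[song["human_mean"].idxmax(), "emotion"])
```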

[Chart: Best and Worst Predictions, songs ranked by within-song correlation]

The Prior Problem

Here's perhaps the most revealing pattern: when you look at which emotion the AI selected as "top" for each song, you see a stunning lack of diversity.

Out of 40 songs, the AI's top pick was overwhelmingly either joyful activation or calmness.

Meanwhile, human ratings were distributed much more evenly across nostalgia, tension, solemnity, and other emotions.

The AI appears to be operating with a strong prior belief that most music is either 'joyful activation' or 'calmness'—a belief that has little to do with the actual audio.
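A sketch of that tally, again on the same hypothetical table: take each song's highest-rated emotion on both sides and count how often each one wins.

```python
# A minimal sketch of the "top emotion" tally, reusing df from above.
def top_emotion(g, col):
    return g.loc[g[col].idxmax(), "emotion"]

ai_top = df.groupby("song_id").apply(top_emotion, col="ai_mean")
human_top = df.groupby("song_id").apply(top_emotion, col="human_mean")

print(ai_top.value_counts())      # concentrated in a couple of emotions
print(human_top.value_counts())   # spread across many more
```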

The Entanglement Effect

There's another problem, more subtle but equally damning. When you analyze how the AI's emotion predictions relate to each other, you discover something strange: they're far more correlated than human ratings are.

For humans, different emotions can coexist independently. A song can be both nostalgic and joyful, both powerful and calm. The average absolute correlation between different emotions in human ratings is around 0.37.

For the AI? 0.58.

What this means is that the AI is producing low-dimensional, template-like responses: in its predictions, certain emotions rise and fall together far more tightly than they do in human ratings.

The AI isn't really hearing nine separate emotional dimensions. It's hearing maybe two or three, and generating the others mechanically based on internal correlations that have little to do with the music itself.
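The 0.37-versus-0.58 comparison can be sketched by pivoting the table to one column per emotion and averaging the absolute off-diagonal correlations:

```python
# A minimal sketch of the entanglement measure, reusing df from above.
import numpy as np

def mean_abs_offdiagonal(wide):
    c = wide.corr().abs().to_numpy()
    n = c.shape[0]
    return (c.sum() - np.trace(c)) / (n * (n - 1))

human_wide = df.pivot(index="song_id", columns="emotion", values="human_mean")
ai_wide = df.pivot(index="song_id", columns="emotion", values="ai_mean")

print(f"humans: {mean_abs_offdiagonal(human_wide):.2f}")   # ~0.37 in this study
print(f"AI:     {mean_abs_offdiagonal(ai_wide):.2f}")      # ~0.58
```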

[Charts: Emotion Correlations, AI vs. Humans. Side-by-side correlation matrices for human ratings and AI predictions.]

Is It Even Listening?

This raises an uncomfortable question: is the AI actually processing the audio, or is it just generating plausible-looking emotion distributions based on its training?

The evidence suggests the latter. The strong priors, the low-dimensional structure, the consistent biases regardless of musical content—these are all signs of a system that's learned to produce human-like emotional distributions in the abstract, but hasn't learned to listen.

To test this properly, you'd need ablation studies: send the AI silence, or white noise, or the same audio under different filenames. If the outputs barely change, you'd know for certain that the "listening" is mostly theater.
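Generating those control stimuli is trivial; here is a sketch assuming numpy and the soundfile library, where the durations and file names are placeholders rather than anything from this experiment.

```python
# A minimal sketch of the control stimuli such an ablation would need.
# Assumes numpy and soundfile; durations and file names are placeholders.
import numpy as np
import soundfile as sf

SR, SECONDS = 44100, 30
n = SR * SECONDS

sf.write("control_silence.wav", np.zeros(n), SR)                     # pure silence
sf.write("control_noise.wav", np.random.uniform(-0.5, 0.5, n), SR)   # white noise

# Running these through the same prompt, and re-sending real clips under
# shuffled filenames, would show how much the predictions track the audio.
```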

What This Means

For companies dreaming of automating music emotion tagging—of replacing human listeners with AI systems that can instantly categorize the emotional content of millions of songs—this experiment is a wake-up call.

The technology isn't there. Not even close.

At best, an AI like Gemini could serve as a rough assistant: suggesting a short list of possible emotions for a human to confirm. The data shows that when you look at the AI's top-3 predictions, there's about a 70% overlap with human top-3 emotions. That's not nothing—it's enough to potentially speed up human labeling work.
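That 70% figure is a simple set overlap; a sketch on the same hypothetical table:

```python
# A minimal sketch of the top-3 overlap check, reusing df from above.
def top3(g, col):
    return set(g.nlargest(3, col)["emotion"])

overlaps = [len(top3(g, "ai_mean") & top3(g, "human_mean")) / 3
            for _, g in df.groupby("song_id")]
print(f"mean top-3 overlap: {sum(overlaps) / len(overlaps):.0%}")   # ~70%
```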

But autonomous tagging? Replacing human judgment? The numbers say no. The AI is literally worse than predicting the average every time.

0%: emotions where AI performance exceeds the simple baseline

The Deeper Question

But there's something more interesting here than just a failed automation attempt. This experiment tells us something about the nature of music perception itself.

Music emotion isn't just pattern recognition. It's not just mapping acoustic features (tempo, key, timbre) to emotional labels. If it were, an AI trained on vast amounts of data would excel at this task.

Instead, what we're seeing is that emotional response to music is deeply contextual, deeply personal, deeply tied to human experience in ways that current AI can't capture.

The AI can't under-predict tension by accident. It can't systematically mishear solemnity by chance. These are failures of understanding—of having a model of what these emotions actually mean in the context of human musical experience.

We thought we were testing whether AI could label music. We discovered we were testing whether AI understands what it means to be moved.

The Process

1. Audio Input: 40 musical excerpts in .opus format
2. AI Analysis: Google Gemini 3 Pro processes the raw audio
3. Emotion Prediction: the nine GEMS emotions, each estimated as a mean and standard deviation
4. Human Comparison: predictions validated against the Emotify dataset

The Prompt

The AI was given the following instruction for each audio file:

You are analyzing a music audio clip. This audio has been listened to and rated by N people. Each person indicated whether they strongly felt each of the following emotions:

- amazement (wonder, awe, happiness)
- solemnity (transcendence, inspiration, thrills)
- tenderness (sensuality, affect, feeling of love)
- nostalgia (dreamy, melancholic, sentimental)
- calmness (relaxation, serenity, meditative)
- power (strong, heroic, triumphant, energetic)
- joyful_activation (bouncy, animated, like dancing)
- tension (nervous, impatient, irritated)
- sadness (depressed, sorrowful)

Based only on the audio content, estimate:
1. The average rating (mean, between 0 and 1)
2. The standard deviation of the ratings
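For context, here is roughly what sending a clip to the model looks like, sketched with the google-genai Python SDK; the model name, file path, and response handling are placeholders rather than the study's actual code.

```python
# A minimal sketch of sending one clip to Gemini, assuming the google-genai
# Python SDK and a GEMINI_API_KEY in the environment. Model name and paths
# are placeholders, not the study's exact setup.
from google import genai

client = genai.Client()   # picks up GEMINI_API_KEY from the environment

PROMPT = "You are analyzing a music audio clip. ..."   # the full instruction shown above

def rate_clip(path, model="gemini-2.5-pro"):
    audio = client.files.upload(file=path)             # upload the .opus excerpt
    response = client.models.generate_content(
        model=model,
        contents=[PROMPT, audio],                      # prompt + raw audio, nothing else
    )
    return response.text                               # parse the mean/std estimates downstream

print(rate_clip("clips/song_01.opus"))
```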


What Comes Next

This is not the end of the story. It's the beginning of a question.

Could a better prompt help? Almost certainly. The current prompt provides definitions, but it doesn't anchor them with examples. It doesn't calibrate what a "0.6" means versus a "0.3" in practical terms. Few-shot learning—showing the AI a handful of correctly labeled examples—might dramatically improve performance.

Could better post-processing help? Yes. Even with weak raw predictions, statistical calibration could correct for systematic biases. If you know the AI always over-predicts joyful activation by 18 points on average, you can subtract 18 points.
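As a sketch of what that correction could look like, assuming the biases are estimated on a held-out calibration split with the same hypothetical columns as before:

```python
# A minimal sketch of per-emotion bias correction. Assumes long-format frames
# with hypothetical columns "emotion", "human_mean", "ai_mean".
import pandas as pd

def calibrate(calib: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Subtract each emotion's average signed error (learned from `calib`)
    from the AI predictions in `test`, clipped back to the 0-1 scale."""
    bias = (calib["ai_mean"] - calib["human_mean"]).groupby(calib["emotion"]).mean()
    return (test["ai_mean"] - test["emotion"].map(bias)).clip(0, 1)
```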

Could a different model help? Maybe. This was one AI system, at one point in time. The field is evolving rapidly.

But here's what won't change: music emotion is hard. It's hard because it's not purely in the signal. It's in the interaction between the signal and a lifetime of human experience, culture, memory, and meaning.

The AI can learn correlations. It can learn patterns. But can it learn what it feels like to hear a song that reminds you of someone you loved? Can it learn the particular ache of a minor chord progression? Can it learn why one person hears power where another hears sadness?

These are not rhetorical questions. They're empirical ones. And right now, the data suggests the answer is no.

Not yet.