The Ambiguous Song

When People Can't Agree on How Music Makes Them Feel

Fourteen people listen to the same sixty seconds of music. Five completely different emotional reactions emerge. Not noise. Not error. Just the messy, beautiful reality of how we experience sound.

Track 158 (Rock)

Joy: 43% | Tension: 43% | Nostalgia, Amazement, Calmness: 29% each

What did you feel? If you said joy, you're in the 43%. If you said tension, also 43%. And if somehow you felt nostalgia, amazement, or calmness—each of those pulled exactly 29% of listeners.

This isn't measurement error. It's the puzzle at the heart of music emotion research: how do you predict something humans themselves can't agree on?

People Feel Multiple Things Simultaneously

When researchers asked people to label emotions in 400 one-minute music clips, they didn't force a single choice. Listeners could select as many emotions as they felt from nine categories: amazement, solemnity, tenderness, nostalgia, calmness, power, joy, tension, and sadness.

These nine emotions come from the Geneva Emotional Music Scale (GEMS), designed specifically for music by Zentner, Grandjean, and Scherer in 2008.

The average person selected 1.94 emotions per track. Almost exactly two. Music doesn't fit into emotional boxes—it occupies multiple states at once.

Track 156 (Rock) — Joy, Unanimous

Joy: 100% | But also: Power (43%), Amazement (21%)

Even when everyone agrees on joy, raters still select an average of 1.79 emotions each, so joy rarely travels alone. The joy is universal, but joy alone doesn't capture what people hear.

How Many Emotions Do People Select?

Distribution across 400 tracks, ~8,400 ratings

The Crowd Has Favorite Words

If you count all 16,182 emotion selections across nine categories, they don't distribute evenly. Not even close.

The Emotion Bias

Not all feelings are created equal

Calmness captures nearly 16% of all selections. Amazement? Just 7%. This isn't about the music—it's about which emotions people reach for when labeling what they feel.

You can't predict how people will label Track 158 without first knowing that people say "calmness" more than twice as often as "amazement" (16% versus 7% of all selections), regardless of what's playing.
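The baseline described above can be sketched in a few lines. This is a minimal illustration, not the article's pipeline: the function name `selection_shares` and the toy ratings are hypothetical, and each rating is assumed to be the set of emotions one listener selected for one track.

```python
from collections import Counter

def selection_shares(ratings):
    """Share of all emotion selections that went to each emotion."""
    counts = Counter(emotion for selected in ratings for emotion in selected)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: n / total for emotion, n in counts.items()}

# Toy data (hypothetical): three listeners, each selecting one or more emotions.
toy_ratings = [
    {"calmness", "nostalgia"},
    {"calmness"},
    {"joy", "calmness", "amazement"},
]
shares = selection_shares(toy_ratings)
```

Comparing a track's emotion rates against these shares shows whether a label is unusually common for that track or just common everywhere.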

All raw rating data: data.csv.gz

The Shape of Disagreement

Some tracks create consensus. Others splinter the crowd. Two metrics reveal the pattern:

max_p: What proportion of raters selected the top emotion? (1.0 = everyone agreed; 0.29 = barely a plurality)
entropy: How evenly spread are selections? (Higher = more ambiguous)
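Both metrics are easy to compute from a track's per-emotion selection rates. The sketch below is one plausible reading, not necessarily the exact formula behind the chart: because listeners can pick several emotions, the rates do not sum to 1, so the entropy here is taken over the rates after rescaling them to sum to 1 (an assumption), and it is measured in bits. The example dictionaries contain only the emotions quoted in this article for each track.

```python
import math

def consensus_metrics(rates):
    """Return (max_p, entropy_bits) for one track.

    rates maps emotion -> fraction of raters who selected it.
    """
    max_p = max(rates.values())
    total = sum(rates.values())
    entropy = 0.0
    if total > 0:
        for rate in rates.values():
            if rate > 0:
                share = rate / total  # rescale so the shares sum to 1 (assumption)
                entropy -= share * math.log2(share)
    return max_p, entropy

# Rates as quoted in the article; emotions not listed there are omitted.
track_158 = {"joy": 0.43, "tension": 0.43, "nostalgia": 0.29,
             "amazement": 0.29, "calmness": 0.29}
track_156 = {"joy": 1.0, "power": 0.43, "amazement": 0.21}

max_p_158, entropy_158 = consensus_metrics(track_158)
max_p_156, entropy_156 = consensus_metrics(track_156)
```

On these inputs the split track gets the lower max_p and the higher entropy, which is the pattern the two metrics are meant to expose.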

The Consensus Spectrum

Each dot is one track. Low entropy + high consensus = agreement. High entropy + low consensus = ambiguity.

Tracks in the top-left are easy: nearly everyone picks the same dominant emotion. Joyful anthems. Solemn classics. Unambiguous power ballads.

Tracks in the bottom-right are puzzles. High entropy. Low consensus. Five emotions in play, and none reaches a majority:

Track 265 (Electronic) — High Disagreement

Joy: 45% | Tension: 36% | Amazement: 36% | Calmness: 27% | Power: 27%

The ambiguous tracks aren't errors. They're the ones that resist a single label.

Genres Are Lenses, Not Features

Classical trends toward calmness (34%) and solemnity (26%). Electronic leans into tension (37%). Pop and rock converge on nostalgia (33% and 30%).

How Genre Shapes Emotional Perception

Average emotion rates by genre (100 tracks each)

But every genre evokes every emotion. The differences are shifts in probability, not binary switches. A classical piece can be joyful (28% are). A rock track can be calm (26% are).

Here's what matters: we deliberately excluded genre from the prediction model. The algorithm saw only audio features—spectral shape, rhythm, timbre. No labels. No metadata. The question: can you predict human emotion labels from sound alone?

Genre analysis: build_story_data.py

What Audio Features Actually Predict Emotion?

The best model—a Random Forest regressor—learned from 82 audio features. Not genre. Not metadata. Just acoustic fingerprints. But what are these features, really?

Spectral Contrast

Imagine looking at the skyline of a city. Some buildings are tall, some are short, and that difference in height is contrast. In sound, spectral contrast measures the gap in energy between the strongest frequencies (peaks) and the weakest (valleys) within each frequency band.

A smooth jazz saxophone has low contrast (gentle, even energy). A rock guitar with distortion has high contrast (sharp peaks and valleys). Low contrast often signals calmness; high contrast suggests power or tension.

MFCCs (Mel-Frequency Cepstral Coefficients)

Think of this as the "color" of a sound. A flute and a violin playing the same note sound different because of their timbre—their unique sonic fingerprint. MFCCs capture that fingerprint in numbers.

Warm, wooden tones (like an acoustic guitar) have different MFCC patterns than cold, metallic synths. These timbral textures correlate strongly with reflective emotions like nostalgia and tenderness.

Onset Strength & Attack Rate

How suddenly does the sound hit you? A drum strike has a sharp, immediate attack. A bowed string swells gradually. Attack measures the speed and intensity of these "punches."

High attack rate = lots of percussive hits, sharp transients. This drives joy, power, and tension. Low attack = smooth, flowing sounds that correlate with calmness and sadness.

Top 15 Predictive Features (Overall)

Permutation importance from Random Forest model

The most important feature? spectral_contrast_1_p25—the 25th percentile of contrast in a specific mid-frequency band. When you scramble this one feature, predictions degrade more than for any other variable.
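The "scramble one feature" idea can be shown without any machine-learning library. The analysis itself used scikit-learn's permutation importance; what follows is only a plain-Python sketch of the same idea. The toy model, the data, and the choice of mean squared error as the score are all assumptions made for illustration.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column scrambled.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a trained model: it only looks at feature 0.
def toy_model(row):
    return 0.8 * row[0]

X = [[i / 10, (7 * i % 10) / 10] for i in range(10)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
```

Shuffling the feature the model relies on raises the error; shuffling the ignored feature changes nothing, so its importance is zero.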

Different emotions rely on different acoustic signatures:

Top Features by Emotion

What acoustic properties predict each emotion best?

Joy, power, tension (high arousal) lean on onset strength and attack rate. Sharp. Percussive. Energetic.

Calmness, tenderness, sadness (low arousal) depend on spectral contrast and MFCCs. Smooth timbral textures. Less aggressive dynamics.

Amazement pulls from MFCC variance and chroma variance: timbral variability and harmonic color shifts. In other words, surprise.

Feature extraction: extract_features.py | Importance analysis: feature_importance.py

Can You Actually Predict Emotions?

Short answer: sort of.

Pearson correlation (predictions vs actual): 0.52

Top-3 emotion overlap (did we get the ranking right?): 0.65

When you train on 80% of the data and test on 20% (balanced by genre), that 0.52 correlation means the model captures something real. Far better than guessing. But nowhere near perfect.
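For readers who want to see what the two headline numbers measure, here is a small sketch. Pearson correlation is standard. The article does not spell out how top-3 overlap is computed, so the version below (the fraction of the three highest-rated emotions that predictions and raters share) is an assumption. The predicted values are made up; the "actual" values are the rates quoted for Track 248 later in this article.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def top3_overlap(predicted, actual):
    """Fraction of the top three emotions shared by both rankings (assumed definition)."""
    top_pred = set(sorted(predicted, key=predicted.get, reverse=True)[:3])
    top_actual = set(sorted(actual, key=actual.get, reverse=True)[:3])
    return len(top_pred & top_actual) / 3

actual = {"power": 1.00, "tension": 0.64, "joy": 0.50, "amazement": 0.36}
predicted = {"power": 0.80, "tension": 0.50, "joy": 0.30, "amazement": 0.40}  # hypothetical

emotions = sorted(actual)
r = pearson([predicted[e] for e in emotions], [actual[e] for e in emotions])
overlap = top3_overlap(predicted, actual)
```

In this toy case the model gets power and tension into its top three but swaps joy for amazement, so two of the three top emotions match.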

Now the twist: what happens when you test on a genre the model has never seen?

Performance Degrades Across Genres

Stratified (in-genre) vs LOGO (cross-genre) evaluation

In "Leave-One-Genre-Out" evaluation—train on three genres, test on the fourth—correlation drops to 0.39. Still signal. Still better than random. But noticeably weaker.
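The splitting logic behind leave-one-genre-out is simple enough to write by hand. scikit-learn provides a `LeaveOneGroupOut` splitter for this pattern; the plain-Python sketch below just makes the mechanics visible. The function name and the eight-track genre list are hypothetical.

```python
def leave_one_genre_out(genres):
    """Yield (held_out_genre, train_indices, test_indices) for each genre."""
    for held_out in sorted(set(genres)):
        train = [i for i, genre in enumerate(genres) if genre != held_out]
        test = [i for i, genre in enumerate(genres) if genre == held_out]
        yield held_out, train, test

# Hypothetical miniature dataset: two tracks per genre.
genres = ["classical", "rock", "pop", "electronic"] * 2
splits = list(leave_one_genre_out(genres))
```

Each split trains on three genres and tests on the fourth, so the model is always scored on a genre it has never seen.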

Implication: emotion "signatures" are partially genre-dependent. A joyful rock song sounds acoustically different from a joyful classical piece, even though humans label both as "joyful." The model learned patterns that don't fully transfer.

Audio can't read minds. But it can read fingerprints—and those fingerprints are written in different dialects across genres.

But How Does It Compare to a Human Rater?

Here's the question that matters: is the RandomForest "as good as a human" at aligning with the crowd?

Think about it this way. When you listened to Track 158 earlier, your emotional response was one data point. But you're trying to match what other people felt on average. How well would you do at that task?

Turns out, not as well as the algorithm.

For each track and emotion, we can calculate the expected error if we use a single random human's label to predict the crowd's average. The math is simple: one person's label is all-or-nothing (selected or not), while the crowd's average is a proportion. When the crowd splits 50-50, a single human lands 0.5 away from the crowd whichever way they answer, which is the worst case. When the crowd is unanimous, a single human will nail it.
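That expected error has a closed form. Assume the single rater is drawn at random from the crowd, so they select the emotion with probability p, where p is the crowd's proportion. With probability p they answer 1 and miss by 1 - p; with probability 1 - p they answer 0 and miss by p. The expected absolute error is therefore 2p(1 - p). The function name below is hypothetical, and whether the analysis used exactly this formula is an assumption.

```python
def expected_single_rater_error(p):
    """Expected |label - p| for one rater who selects the emotion with probability p."""
    return p * (1 - p) + (1 - p) * p

error_split = expected_single_rater_error(0.5)       # crowd split 50-50
error_unanimous = expected_single_rater_error(1.0)   # crowd unanimous
error_track_158_joy = expected_single_rater_error(0.43)
```

The error peaks when the crowd is evenly split and falls to zero at unanimity, so contested emotions are exactly where a single listener is the weakest stand-in for the crowd.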

The RandomForest achieves an average error of 0.116 (mean absolute error across all emotions). A single random human? 0.267. The model's error against the crowd is 2.3× smaller than a single person's.

But there's another way to frame this: does the model stay within the range of human responses?

For each track-emotion pair in the test set, we can construct a statistical confidence interval around the human proportion—a range that says "this is where we expect the true crowd consensus to be." Using a 95% Wilson interval (which accounts for uncertainty when sample sizes are small), we can check: does the RandomForest prediction fall inside this human-plausible zone?
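The Wilson interval itself takes only a few lines. The formula is the standard one; the function names and the example of 6 selections out of 14 raters (roughly Track 158's 43% for joy) are illustrative, and the prediction being checked is made up.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """95% Wilson score interval for a proportion (z = 1.959964)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

def within_human_range(prediction, successes, n):
    """True if a model prediction falls inside the Wilson interval."""
    low, high = wilson_interval(successes, n)
    return low <= prediction <= high

low, high = wilson_interval(6, 14)            # 6 of 14 raters selected the emotion
inside = within_human_range(0.35, 6, 14)      # hypothetical model prediction
```

With only about a dozen raters per track the interval is wide, which is why a prediction can miss the observed percentage by a fair margin and still count as plausible.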

RF predictions within 95% human consensus range: 82%

Alignment with the crowd compared with a single human rater: 2.3× better

About 82% of RandomForest predictions land inside the 95% confidence interval of human consensus. The model isn't just learning audio patterns—it's learning to think like the crowd.

But not perfectly. The hardest emotions to nail? Joy (70% within range) and calmness (74%). These are the emotions where human variability is highest, where the "right" answer is most contested. The easiest? Amazement (89%), solemnity (86%), and power (86%)—emotions with clearer acoustic signatures and stronger human agreement.

There's one edge case worth noting: tracks where every single human agrees. About 14% of track-emotion pairs show perfect unanimity: all raters select the same label, or all reject it. In these cases, the observed human proportion collapses to a point: either 0% or 100%.

The RandomForest almost never hits these extremes. Even when humans are unanimous, the model hedges slightly, predicting 0.92 instead of 1.0, or 0.08 instead of 0.0. This is expected behavior for a regressor trained on noisy labels: it learns to be cautious. But it also means that, measured against the observed proportion itself rather than a Wilson interval (which keeps some width even at 0% or 100%), the model is technically "outside" the human range in essentially every unanimity case.

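Is that a failure? Not really. It's caution built into how the model works: a random forest averages the outputs of many trees, so it lands on exactly 0 or 1 only when every tree agrees. Perfect certainty is rare in subjective judgments, and the model rarely claims more confidence than the data warrants.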

Model training: train_models.py | Metrics: metrics.csv

The Full Emotional Range

Let's hear more examples across the spectrum:

Track 248 (Electronic) — Power, Unanimous

Power: 100% | Also: Tension (64%), Joy (50%), Amazement (36%)

Track 354 (Pop) — Calmness, Near-Consensus

Calmness: 92% | Also: Tenderness (67%), Nostalgia (50%)

Track 370 (Pop) — Bittersweet: Nostalgia + Sadness

Nostalgia: 89% | Sadness: 72% | Tenderness: 61%

Track 27 (Classical) — Solemnity

Solemnity: 75% | Also: Calmness (42%), Amazement (33%)

Track 39 (Classical) — Amazement

Amazement: 63% | Also: Power (38%), Solemnity (38%), Tension (25%)

Track 161 (Rock) — Tension

Tension: 93% | Also: Power (21%)

Track 376 (Pop) — Tenderness

Tenderness: 75% | Also: Nostalgia (42%), Calmness (42%)

All embedded audio: data/audio/

What This Means (And What It Doesn't)

First: Emotion in music is multi-dimensional. Forcing people to pick one label discards information. The "right" answer isn't singular—it's a distribution.

Second: People have systematic biases in how they label emotions. Calmness gets overused; amazement gets underused. Any model that ignores this baseline will misinterpret the data.

Third: Audio features—especially spectral contrast (energy peaks vs valleys), MFCCs (sonic color/timbre), and onset characteristics (attack speed)—can predict a meaningful fraction of human emotion judgments. Not all. But enough to be useful.

Fourth: Predictions degrade when you cross genre boundaries. The acoustic fingerprint of joy in rock doesn't perfectly match joy in classical. Context matters.

Honest Caveats

This analysis has limits. The dataset is small (400 tracks). The raters are not demographically diverse. The emotion categories themselves are culturally loaded—"solemnity" means different things to different people. The models here are correlational, not causal; we can't claim spectral contrast causes calmness, only that they co-occur. And multi-label judgments are inherently subjective; there's no ground truth, only crowd consensus.

But here's what makes this puzzle worth solving: disagreement isn't failure. When fourteen people listen to Track 158 and split five ways, that's not measurement error. That's the real phenomenon.

Music is ambiguous. Emotion is multi-faceted. Prediction is hard because the thing we're predicting is genuinely, irreducibly complex.

The question isn't whether we can predict emotions perfectly. It's whether we can predict them honestly—with all the noise, bias, and beautiful disagreement intact.

Data & Code: All raw data, feature extraction pipelines, model training scripts, and evaluation metrics are available in the GitHub repository. Technical summary: SUMMARY.md.

Audio Source: 400 one-minute clips from four genres (classical, rock, pop, electronic), rated by multiple human annotators using the Geneva Emotional Music Scale (GEMS).

Methods: Features extracted with librosa. Models trained with scikit-learn and XGBoost. Permutation importance via scikit-learn. Evaluation: stratified 80/20 split and leave-one-genre-out cross-validation.

Visualizations: Built with Chart.js.