When People Can't Agree on How Music Makes Them Feel
Fourteen people listen to the same sixty seconds of music. Five completely different emotional reactions emerge. Not noise. Not error. Just the messy, beautiful reality of how we experience sound.
What did you feel? If you said joy, you're in the 43%. If you said tension, also 43%. And if somehow you felt nostalgia, amazement, or calmness—each of those pulled exactly 29% of listeners.
This isn't measurement error. It's the puzzle at the heart of music emotion research: how do you predict something humans themselves can't agree on?
When researchers asked people to label emotions in 400 one-minute music clips, they didn't force a single choice. Listeners could select as many emotions as they felt from nine categories: amazement, solemnity, tenderness, nostalgia, calmness, power, joy, tension, and sadness.
These nine emotions come from the Geneva Emotional Music Scale (GEMS), designed specifically for music by Zentner, Grandjean, and Scherer in 2008.
The average person selected 1.94 emotions per track. Almost exactly two. Music doesn't fit into emotional boxes—it occupies multiple states at once.
Even when everyone agrees on joy, they still select an average of 1.79 other emotions. The joy is universal, but joy alone doesn't capture what people hear.
If you count all 16,182 emotion selections across nine categories, they don't distribute evenly. Not even close.
Calmness captures nearly 16% of all selections. Amazement? Just 7%. This isn't about the music—it's about which emotions people reach for when labeling what they feel.
You can't predict how people will label Track 158 without first knowing that people say "calmness" more than twice as often as "amazement," regardless of what's playing.
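A minimal sketch of how those baseline label rates could be computed, assuming each rating is stored as the set of emotions one listener selected for one track. The function names and the toy ratings are illustrative, not taken from the project's scripts or from data.csv.gz.

```python
from collections import Counter

# The nine GEMS categories named in the article.
GEMS = ["amazement", "solemnity", "tenderness", "nostalgia", "calmness",
        "power", "joy", "tension", "sadness"]

def label_base_rates(ratings):
    """Share of all selections that went to each emotion.

    ratings: list of sets, one set of emotion names per (listener, track) rating.
    """
    counts = Counter(emotion for selected in ratings for emotion in selected)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no selections to count")
    return {emotion: counts[emotion] / total for emotion in GEMS}

def mean_labels_per_rating(ratings):
    """Average number of emotions selected in a single rating."""
    return sum(len(selected) for selected in ratings) / len(ratings)

# Toy data (hypothetical): four ratings, eight selections in total.
toy_ratings = [
    {"calmness", "tenderness"},
    {"calmness"},
    {"joy", "power", "amazement"},
    {"calmness", "nostalgia"},
]
toy_rates = label_base_rates(toy_ratings)
print(toy_rates["calmness"], toy_rates["amazement"])
print(mean_labels_per_rating(toy_ratings))
```

Comparing a track's label shares against these base rates, rather than against a uniform one-in-nine expectation, is one way to separate what the music did from what listeners tend to say anyway.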
All raw rating data: data.csv.gz
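Some tracks create consensus. Others splinter the crowd. Two metrics reveal the pattern: consensus, how strongly raters converge on one dominant emotion, and entropy, how widely their selections spread across the nine categories.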
Tracks in the top-left are easy: nearly everyone picks the same dominant emotion. Joyful anthems. Solemn classics. Unambiguous power ballads.
Tracks in the bottom-right are puzzles. High entropy. Low consensus. Five raters, five answers.
The ambiguous tracks aren't errors. They're the ones that resist a single label.
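One way to put numbers on "easy" versus "puzzle" tracks, as a hedged sketch. The definitions here are assumptions, since the article does not spell them out: consensus is taken as the share of raters who picked the most popular emotion, and entropy as the Shannon entropy (in bits) of the selection distribution. The Track 158 counts are back-calculated from the 43% and 29% figures with fourteen raters.

```python
import math

def consensus_and_entropy(counts, n_raters):
    """Return (consensus, entropy_bits) for one track.

    counts: dict mapping emotion name to number of raters who selected it.
    n_raters: number of people who rated the track.
    """
    # Assumed definition: share of raters choosing the top emotion.
    consensus = max(counts.values()) / n_raters
    # Shannon entropy of the normalized selection counts.
    total = sum(counts.values())
    entropy = sum((c / total) * math.log2(total / c)
                  for c in counts.values() if c > 0)
    return consensus, entropy

# Track 158: 6 of 14 is about 43%, 4 of 14 is about 29%.
track_158 = {"joy": 6, "tension": 6, "nostalgia": 4,
             "amazement": 4, "calmness": 4}
c158, h158 = consensus_and_entropy(track_158, n_raters=14)
print(round(c158, 3), round(h158, 3))

# A hypothetical unanimous track for contrast.
print(consensus_and_entropy({"joy": 14}, n_raters=14))
```

Under these definitions Track 158 comes out close to the maximum possible entropy for five labels (log2 of 5, about 2.32 bits), while the unanimous track scores zero.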
Classical trends toward calmness (34%) and solemnity (26%). Electronic leans into tension (37%). Pop and rock converge on nostalgia (33% and 30%).
But every genre evokes every emotion. The differences are shifts in probability, not binary switches. A classical piece can be joyful (28% are). A rock track can be calm (26% are).
Here's what matters: we deliberately excluded genre from the prediction model. The algorithm saw only audio features—spectral shape, rhythm, timbre. No labels. No metadata. The question: can you predict human emotion labels from sound alone?
Genre analysis: build_story_data.py
The best model—a Random Forest regressor—learned from 82 audio features. Not genre. Not metadata. Just acoustic fingerprints. But what are these features, really?
The most important feature? spectral_contrast_1_p25—the 25th percentile of contrast in a specific mid-frequency band. When you scramble this one feature, predictions degrade more than for any other variable.
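The "scramble one feature" idea is permutation importance. The project computes it with scikit-learn (which provides `sklearn.inspection.permutation_importance`); the version below is a hand-rolled, standard-library sketch meant only to show the mechanics. The toy model, data, and function names are hypothetical.

```python
import random

def mean_abs_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Average increase in error after shuffling each feature column.

    predict: function mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists of equal length); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mean_abs_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_abs_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy setup (hypothetical): the "model" uses feature 0 and ignores feature 1.
toy_X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
toy_y = [row[0] for row in toy_X]
toy_model = lambda row: row[0]

toy_importances = permutation_importance_sketch(toy_model, toy_X, toy_y)
print(toy_importances)
```

Shuffling a feature the model depends on raises the error; shuffling one it ignores changes nothing. That asymmetry is what ranks spectral_contrast_1_p25 first.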
Different emotions rely on different acoustic signatures:
Joy, power, tension (high arousal) lean on onset strength and attack rate. Sharp. Percussive. Energetic.
Calmness, tenderness, sadness (low arousal) depend on spectral contrast and MFCCs. Smooth timbral textures. Less aggressive dynamics.
Amazement pulls from MFCC variance and chroma variance: timbral variability and harmonic color shifts. In other words, surprise.
Feature extraction: extract_features.py | Importance analysis: feature_importance.py
Short answer: sort of.
When you train on 80% of the data and test on the remaining 20% (balanced by genre), the model's predictions reach a correlation of 0.52 with the human labels. That means the model captures something real. Far better than guessing. But nowhere near perfect.
Now the twist: what happens when you test on a genre the model has never seen?
In "Leave-One-Genre-Out" evaluation—train on three genres, test on the fourth—correlation drops to 0.39. Still signal. Still better than random. But noticeably weaker.
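A minimal sketch of how a leave-one-genre-out split can be built, assuming each track carries a genre string. The helper name and the toy genre list are illustrative. scikit-learn offers the same idea as `sklearn.model_selection.LeaveOneGroupOut`; whether the project uses that class is not stated here.

```python
def leave_one_genre_out(genres):
    """Yield (held_out_genre, train_indices, test_indices) for each genre.

    genres: list with one genre label per track, in dataset order.
    """
    for held_out in sorted(set(genres)):
        train = [i for i, g in enumerate(genres) if g != held_out]
        test = [i for i, g in enumerate(genres) if g == held_out]
        yield held_out, train, test

# Toy example (hypothetical): six tracks across the four genres.
toy_genres = ["classical", "rock", "pop", "electronic", "rock", "pop"]
for held_out, train_idx, test_idx in leave_one_genre_out(toy_genres):
    print(held_out, train_idx, test_idx)
```

Each fold trains on three genres and scores only tracks from the fourth, so any acoustic cue that is specific to the held-out genre is unavailable at training time.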
Implication: emotion "signatures" are partially genre-dependent. A joyful rock song sounds acoustically different from a joyful classical piece, even though humans label both as "joyful." The model learned patterns that don't fully transfer.
Audio can't read minds. But it can read fingerprints—and those fingerprints are written in different dialects across genres.
Here's the question that matters: is the Random Forest "as good as a human" at aligning with the crowd?
Think about it this way. When you listened to Track 158 earlier, your emotional response was one data point. But you're trying to match what other people felt on average. How well would you do at that task?
Turns out, not as well as the algorithm.
For each track and emotion, we can calculate the expected error if we use a single random human's label to predict the crowd's average. The math is simple: if a fraction p of the crowd selects an emotion, one randomly drawn yes/no label misses p by 2p(1 − p) on average. When the crowd splits 50-50, every individual label is off by 0.5, the worst case. When the crowd is unanimous, a single human will nail it.
The Random Forest achieves a mean absolute error of 0.116, averaged across all emotions. A single random human? 0.267. The model's error is about 2.3× lower than a single person's.
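The single-human baseline can be worked out in a few lines. This sketch assumes one rater's label is a yes/no draw that comes up "yes" with probability p, the crowd's proportion; the function name is illustrative and the project's own calculation may differ in detail.

```python
def expected_single_rater_error(p):
    """Expected absolute gap between one yes/no label and the crowd share p.

    With probability p the rater says yes (error 1 - p);
    with probability 1 - p the rater says no (error p).
    """
    return p * (1 - p) + (1 - p) * p  # simplifies to 2 * p * (1 - p)

for share in (0.0, 6 / 14, 0.5, 1.0):
    print(round(share, 3), round(expected_single_rater_error(share), 3))

# Ratio of the two reported errors from the article.
print(round(0.267 / 0.116, 1))
```

The error peaks at 0.5 for an evenly split crowd and falls to zero at unanimity, which is why contested emotions dominate the human baseline.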
But there's another way to frame this: does the model stay within the range of human responses?
For each track-emotion pair in the test set, we can construct a statistical confidence interval around the human proportion—a range that says "this is where we expect the true crowd consensus to be." Using a 95% Wilson interval (which accounts for uncertainty when sample sizes are small), we can check: does the RandomForest prediction fall inside this human-plausible zone?
About 82% of Random Forest predictions land inside the 95% confidence interval of human consensus. The model isn't just learning audio patterns. In most cases, its output is a value the crowd itself could plausibly have produced.
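For readers who want to reproduce the check, here is a sketch of the standard Wilson score interval and the "inside the interval" test. The rater count of 14 and the example predictions are illustrative; the function names are not taken from the project's code.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

def within_human_range(prediction, successes, n):
    low, high = wilson_interval(successes, n)
    return low <= prediction <= high

# Example: 6 of 14 raters selected the emotion (hypothetical prediction 0.35).
print(wilson_interval(6, 14))
print(within_human_range(0.35, 6, 14))

# Example: nobody selected the emotion.
print(wilson_interval(0, 14))
```

Note that a Wilson interval keeps a nonzero width even at 0 of n or n of n, so whether a hedged prediction counts as "outside" in unanimous cases depends on whether the check uses this interval or the raw observed proportion.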
But not perfectly. The hardest emotions to nail? Joy (70% within range) and calmness (74%). These are the emotions where human variability is highest, where the "right" answer is most contested. The easiest? Amazement (89%), solemnity (86%), and power (86%)—emotions with clearer acoustic signatures and stronger human agreement.
There's one edge case worth noting: tracks where every single human agrees. About 14% of track-emotion pairs show perfect unanimity—all raters select the same label, or all reject it. In these cases, the human "range" collapses to a point: either 0% or 100%.
The Random Forest almost never hits these extremes. Even when humans are unanimous, the model hedges slightly, predicting 0.92 instead of 1.0, or 0.08 instead of 0.0. This is expected behavior for a regressor trained on noisy labels: a forest averages the outputs of many trees, so its predictions get pulled away from 0 and 1. It learns to be cautious. But it also means that in strict unanimity cases, the model is technically "outside" the human range 100% of the time.
Is that a failure? Not really. It's humility encoded in the weights. The model knows that perfect certainty is rare in subjective judgments, and it refuses to be more confident than the data warrants.
Model training: train_models.py | Metrics: metrics.csv
Let's hear more examples across the spectrum:
All embedded audio: data/audio/
First: Emotion in music is multi-dimensional. Forcing people to pick one label discards information. The "right" answer isn't singular—it's a distribution.
Second: People have systematic biases in how they label emotions. Calmness gets overused; amazement gets underused. Any model that ignores this baseline will misinterpret the data.
Third: Audio features—especially spectral contrast (energy peaks vs valleys), MFCCs (sonic color/timbre), and onset characteristics (attack speed)—can predict a meaningful fraction of human emotion judgments. Not all. But enough to be useful.
Fourth: Predictions degrade when you cross genre boundaries. The acoustic fingerprint of joy in rock doesn't perfectly match joy in classical. Context matters.
But here's what makes this puzzle worth solving: disagreement isn't failure. When fourteen people listen to Track 158 and split five ways, that's not measurement error. That's the real phenomenon.
Music is ambiguous. Emotion is multi-faceted. Prediction is hard because the thing we're predicting is genuinely, irreducibly complex.
The question isn't whether we can predict emotions perfectly. It's whether we can predict them honestly—with all the noise, bias, and beautiful disagreement intact.
Data & Code: All raw data, feature extraction pipelines, model training scripts, and evaluation metrics are available in the GitHub repository. Technical summary: SUMMARY.md.
Audio Source: 400 one-minute clips from four genres (classical, rock, pop, electronic), rated by multiple human annotators using the Geneva Emotional Music Scale (GEMS).
Methods: Features extracted with librosa. Models trained with scikit-learn and XGBoost. Permutation importance via scikit-learn. Evaluation: stratified 80/20 split and leave-one-genre-out cross-validation.
Visualizations: Built with Chart.js.