
# Emotion Prompts Don’t Help. Reasoning Does

I’ve heard a lot of prompt engineering tips: be polite, praise the model, offer an incentive, appeal to emotion, shame it, bully it, express fear, claim expertise, show curiosity. And, of course, ask it to reason step by step.

I’ve repeated some of this advice myself. But for the first time, I tested it. Here’s what I learnt.

Here’s how each prompt performed, aggregated across all models:

| Prompt | Better | Worse | Same | Score (%) | p-value |
|--------|-------:|------:|-----:|----------:|--------:|
| 🔴 Emotion | 7 | 21 | 372 | -3.50 | 1.2% |
| 🔴 Shaming | 7 | 20 | 373 | -3.25 | 1.9% |
| 🟢 Reasoning | 31 | 17 | 352 | +3.50 | 5.9% |
| 🟠 Polite | 11 | 20 | 369 | -2.25 | 14.9% |
| 🟠 Praise | 13 | 22 | 365 | -2.25 | 17.5% |
| 🟠 Fear | 11 | 19 | 370 | -2.00 | 20.0% |
| 🟡 Expert | 15 | 22 | 363 | -1.75 | 32.4% |
| 🟡 Incentive | 13 | 18 | 369 | -1.25 | 47.3% |
| 🟡 Bullying | 10 | 14 | 375 | -1.00 | 54.1% |
| 🟡 Curious | 11 | 14 | 375 | -0.75 | 69.0% |

- 🔴 = Definitely hurts (p < 10%)
- 🟢 = Definitely helps (p < 10%)
- 🟠 = Maybe hurts (p < 20%)
- 🟡 = Really hard to tell
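
The score and p-value columns are consistent with a simple scheme: the score is net wins (better minus worse) as a percentage of all comparisons, and the p-value matches a two-sided sign test on the better/worse counts, dropping ties. A quick Python reconstruction (my sketch, not the repo’s actual analysis code):

```python
from scipy.stats import binomtest

def summarize(better: int, worse: int, same: int) -> tuple[float, float]:
    """Net-win score (%) and a two-sided sign-test p-value."""
    total = better + worse + same
    score_pct = 100 * (better - worse) / total  # net wins as % of all comparisons
    # Sign test: drop ties; under H0, wins ~ Binomial(better + worse, 0.5)
    p_value = binomtest(better, n=better + worse, p=0.5).pvalue
    return round(score_pct, 2), p_value

print(summarize(7, 21, 372))   # Emotion:   (-3.5, ~0.0125) vs. table -3.50, 1.2%
print(summarize(31, 17, 352))  # Reasoning: (3.5, ~0.059)   vs. table 3.50, 5.9%
```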

The benefit of the reasoning prompt is highest on non-reasoning models (understandably), but it is also high for a reasoning model like o3-mini-high. It actually hurts reasoning models like Gemini 2.5 Flash/Pro:

| Model | Better | Worse | Same | Score (%) |
|-------|-------:|------:|-----:|----------:|
| openai/gpt-4o-mini | 3 | 0 | 7 | +30.0 |
| anthropic/claude-opus-4 | 3 | 0 | 7 | +30.0 |
| google/gemini-2.0-flash-001 | 3 | 0 | 7 | +30.0 |
| openrouter:openai/o3-mini-high | 3 | 0 | 7 | +30.0 |
| openai/gpt-4.1-nano | 2 | 0 | 8 | +20.0 |
| amazon/nova-lite-v1 | 2 | 0 | 8 | +20.0 |
| google/gemini-2.5-pro-preview | 0 | 2 | 8 | -20.0 |
| google/gemini-2.5-flash-preview-05-20:thinking | 0 | 3 | 7 | -30.0 |

*(Chart: impact of each model, by prompt.)*
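
How the better/worse/same counts arise isn’t spelled out above; my best guess is that each variant run is paired with the baseline run for the same model, test, and repeat, and their scores compared. A hypothetical sketch (the field names are invented for illustration, not promptfoo’s actual export schema):

```python
from collections import Counter

def count_outcomes(runs: list[dict], variant: str, baseline: str = "baseline") -> Counter:
    """Tally better/worse/same for a variant prompt against the baseline.

    Assumes each run is a dict with hypothetical keys:
    model, test, repeat, prompt, score.
    """
    scores = {(r["model"], r["test"], r["repeat"], r["prompt"]): r["score"] for r in runs}
    counts = Counter(better=0, worse=0, same=0)
    for (model, test, repeat, prompt), score in scores.items():
        if prompt != variant:
            continue
        base = scores.get((model, test, repeat, baseline))
        if base is None:
            continue  # no matching baseline run to compare against
        if score > base:
            counts["better"] += 1
        elif score < base:
            counts["worse"] += 1
        else:
            counts["same"] += 1
    return counts
```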

Caveats:

- Difficulty matters.

## Setup

```bash
git clone git@github.com:sanand0/llmevals.git
cd llmevals/emotion-prompts/
export OPENROUTER_API_KEY=...                 # provider credentials
export OPENAI_API_KEY=...
npx -y promptfoo eval --repeat 10             # run every test 10 times
npx -y promptfoo export latest -o evals.json  # export the latest eval as JSON
```
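
Each prompt variant presumably wraps the same underlying question, so the whole grid of prompts × providers fits in one promptfoo config. A minimal sketch of what such a promptfooconfig.yaml could look like (illustrative only: the prompt wordings here are invented, not the repo’s):

```yaml
# Illustrative promptfooconfig.yaml sketch; not the repo's actual config.
prompts:
  - "{{question}}"                                       # baseline
  - "Think step by step, then answer. {{question}}"      # Reasoning
  - "This is very important to my career. {{question}}"  # Emotion
providers:
  - openai:gpt-4o-mini
  - openrouter:openai/o3-mini-high
tests:
  - vars:
      question: "..."
```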

## Results