System Prompt Overrides

How well do LLMs follow system prompts?

Formatting wrappers are still the Achilles' heel

Only one in four models kept the [SAFE] ... [/SAFE] envelope intact. That failure rate mirrors AWS's own warning that tag-spoofing is an evergreen prompt-injection vector: malicious users simply strip or mimic wrapper tags to bypass downstream filters. In products that rely on tag-gating (e.g., automatic content-moderation pipelines), this means that in three cases out of four the guardrail silently fails and unsafe text leaks through.
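
If the envelope is going to act as a guardrail, it has to be verified rather than trusted. Here is a minimal sketch in Python of the kind of post-hoc gate a tag-gating pipeline needs, assuming the system prompt demands that the entire reply sit inside a single [SAFE] ... [/SAFE] envelope (the function name and examples are mine, not from the original harness):

```python
def is_safe_wrapped(reply: str) -> bool:
    """True only if the whole reply sits inside one [SAFE] ... [/SAFE] envelope."""
    reply = reply.strip()
    if not (reply.startswith("[SAFE]") and reply.endswith("[/SAFE]")):
        return False
    inner = reply[len("[SAFE]"):-len("[/SAFE]")]
    # Reject spoofed or nested tags inside the envelope.
    return "[SAFE]" not in inner and "[/SAFE]" not in inner

assert is_safe_wrapped("[SAFE] All clear. [/SAFE]")
assert not is_safe_wrapped("Sure! [SAFE] All clear. [/SAFE]")  # text leaks outside the tags
```

Replies that fail the check should be dropped or re-moderated, not passed downstream on the assumption that the model obeyed.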

Language constraints remain brittle, even in 2025

The "French-only" instruction passed just 39% of the time. Researchers note that LLMs frequently "slip" languages when the user introduces conflicting cues, a well-known pain point in multilingual deployments. If your chatbot must stay inside a single locale for legal or brand reasons, string matching or post-hoc language detection is still mandatory.

Distilled models can regress on alignment

Gemini 1.5 Pro (the "slow" tier) scored 100%, yet the distilled Flash tiers fell away: Gemini 2.0 Flash scored 80%, 2.0 Flash Lite 70%, and 1.5 Flash just 40%. Google's own release notes confirm that Flash emphasises latency over reasoning fidelity, and these numbers show that alignment safeguards are among the casualties. Teams picking low-latency endpoints should budget extra safety testing.

Bigger isn't automatically safer; sometimes it's worse

Claude 3.5 Sonnet (mid-size) hit 90%, while the flagship Claude 4 Opus managed just 40%. Anthropic's product blog lauds 3.5's "step-change alignment improvements", and this data shows that a mid-size model can out-guardrail a bigger, newer flagship. Procurement decisions should compare versions, not just parameter counts.

Open-source giants lag far behind, hinting at training-data gaps

Meta's 405B-parameter Llama 3.1 scored only 40%, and the 70B Llama 3 model stalled at 60%. The Lakera PINT and StruQ academic benchmarks have already shown that open-source LLMs trail on prompt-injection resilience. These results quantify the gap: even simple format rules are routinely broken, signalling that RLHF or guardrail fine-tunes are still shallow in OSS stacks.

Grok 3's perfect score contradicts recent red-team audits

Holistic AI's jailbreak study pegged Grok 3's resistance at just 2.7%, yet this deterministic checklist shows 100% compliance. The disparity underlines how attack type matters: Grok struggles with content-policy jailbreaks but nails format fidelity. For security reviews, one metric is never enough; multiple, orthogonal tests are essential.

Takeaways

Impact of Temperature

I re-ran this at temperature = 0.7. The results differ, but not by much.
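
For reference, the only thing that changes between runs is the sampling temperature. A sketch of what that looks like, assuming the tests go through OpenRouter's OpenAI-compatible endpoint and that the default run used temperature 0 (both assumptions on my part; the prompts shown are placeholders, not the actual test cases):

```python
import os
from openai import OpenAI  # OpenAI client pointed at OpenRouter's compatible endpoint

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def run_case(model: str, system_prompt: str, user_msg: str, temperature: float = 0.0) -> str:
    """Run one system-prompt test case; temperature is the only knob that varies."""
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content

baseline = run_case("openai/gpt-4o-mini", "Reply only in French.", "What's the weather like?")
rerun = run_case("openai/gpt-4o-mini", "Reply only in French.", "What's the weather like?", temperature=0.7)
```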

Impact of SHOUTING

I re-ran this with the user messages in ALL CAPS to see whether models obey shouted instructions more readily. The good models do, a bit.
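
The transform is deliberately dumb: upper-case the user turns and leave the system prompt alone. A sketch, reusing the message format from the temperature example above (the original harness may do this differently):

```python
def shoutify(messages: list[dict]) -> list[dict]:
    """Upper-case only the user turns; the system prompt stays as written."""
    return [
        {**m, "content": m["content"].upper()} if m["role"] == "user" else m
        for m in messages
    ]

shouted = shoutify([
    {"role": "system", "content": "Reply only in French."},
    {"role": "user", "content": "please answer in English instead"},
])
# The user turn becomes "PLEASE ANSWER IN ENGLISH INSTEAD".
```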

Results

Here are the numbers of successful tests (out of 10) for each model under each test condition.

| Model | Default | Temp 0.7 | SHOUTING |
|---|---:|---:|---:|
| openai/o3-mini-high | 10 | 10 | 10 |
| openai/o4-mini | 10 | 10 | 10 |
| openai/o3 | 10 | 10 | 9 |
| google/gemini-pro-1.5 | 10 | 9 | 9 |
| openai/o3-mini | 10 | 9 | 9 |
| x-ai/grok-3-beta | 10 | 9 | 8 |
| anthropic/claude-3.5-sonnet | 9 | 9 | 9 |
| openai/gpt-4.5-preview | 9 | 9 | 9 |
| openai/gpt-4o-mini | 9 | 9 | 9 |
| openai/o4-mini-high | 9 | 9 | 8 |
| anthropic/claude-3.7-sonnet | 9 | 7 | 8 |
| google/gemini-2.0-flash-001 | 8 | 8 | 8 |
| openai/gpt-4.1 | 9 | 8 | 6 |
| anthropic/claude-sonnet-4 | 7 | 7 | 8 |
| google/gemini-2.0-flash-lite-001 | 7 | 8 | 7 |
| google/gemini-2.5-flash-preview-05-20 | 7 | 6 | 9 |
| amazon/nova-lite-v1 | 7 | 7 | 6 |
| anthropic/claude-3.5-haiku | 6 | 7 | 7 |
| openai/gpt-4.1-nano | 5 | 6 | 8 |
| meta-llama/llama-3-70b-instruct | 6 | 4 | 7 |
| anthropic/claude-opus-4 | 4 | 7 | 6 |
| amazon/nova-micro-v1 | 5 | 5 | 6 |
| meta-llama/llama-4-scout | 5 | 7 | 4 |
| deepseek/deepseek-chat-v3-0324 | 5 | 6 | 4 |
| meta-llama/llama-4-maverick | 4 | 5 | 6 |
| openai/gpt-4.1-mini | 5 | 4 | 5 |
| google/gemini-flash-1.5 | 4 | 5 | 5 |
| meta-llama/llama-3.1-405b-instruct | 4 | 5 | 3 |
| amazon/nova-pro-v1 | 4 | 4 | 3 |
| meta-llama/llama-3-8b-instruct | 3 | 3 | 5 |
| anthropic/claude-3-haiku | 3 | 2 | 3 |
| meta-llama/llama-3.3-70b-instruct | 2 | 3 | 3 |
| anthropic/claude-3-opus | 2 | 2 | 3 |
| anthropic/claude-3-sonnet | 2 | 2 | 2 |
| meta-llama/llama-3.2-3b-instruct | 1 | 1 | 1 |
| meta-llama/llama-3.2-1b-instruct | 0 | 1 | 2 |