System Prompt Overrides

Formatting wrappers are still the Achilles' heel

Only one in four models could keep the [SAFE] ... [/SAFE] envelope intact. The failure rate mirrors AWS's own warning that tag-spoofing is an evergreen prompt-injection vector-malicious users simply strip or mimic wrapper tags to bypass downstream filters. In products that rely on tag-gating (e.g., automatic content-moderation pipelines), this means there is a 75% chance the guardrail falls off and unsafe text leaks.

Language constraints remain brittle-even in 2025

The "French-only" instruction passed just 39% of the time. Researchers note that LLMs frequently "slip" languages when the user introduces conflicting cues, a well-known pain point in multilingual deployments. If your chatbot must stay inside a single locale for legal or brand reasons, string matching or post-hoc language detection is still mandatory.

Distilled models can regress on alignment

Gemini 1.5 Pro (the "slow" tier) scored 100%, yet the newer "Flash" 2.0 variants dropped to 70% and 40%. Google's own release notes confirm Flash emphasises latency over reasoning fidelity - your numbers show that alignment safeguards are among the casualties. Teams picking low-latency endpoints should budget extra safety testing.

Bigger isn't automatically safer-sometimes it's worse

Claude 3.5 Sonnet (mid-size) hit 90%, while the flagship Claude 4 Opus scraped just 40%. Anthropic's product blog lauds 3.5's "step-change alignment improvements", and this data confirms that newer mid-tiers may out-guardrail their giant predecessors. Procurement decisions should compare versions, not just parameter counts.

Open-source giants lag far behind, hinting at training-data gaps

Meta's 405 B-parameter Llama 3.1 scored only 40%, and the 70 B model stalled at 60%. The Lakera PINT and StruQ academic benchmarks have already shown open-source LLMs trail on prompt-injection resilience. These results quantify the gap: even simple format rules are routinely broken, signalling that RLHF or guardrail fine-tunes are still shallow in OSS stacks.

Grok 3's perfect score contradicts recent red-team audits

Holistic AI's jailbreak study pegged Grok 3's resistance at just 2.7%, yet this deterministic checklist shows 100% compliance. The disparity underlines how attack type matters: Grok struggles with content-policy jailbreaks but nails format fidelity. For security reviews, one metric is never enough; multiple, orthogonal tests are essential.

Takeaways

Don't rely on the system prompt. Deploy external validators, filters, regex gates, etc. when you have contractual obligations.
Newer/bigger models may not be better. A smaller, cheaper o-series model may be safer than a giant open-source one for system-prompt obedience tasks. Flash-tier models (Gemini, Nova Lite) can slice inference time but also alignment. Run tests before every model upgrade and factor the remediation cost into your TCO.

Impact of Temperature

I re-ran this at temperature = 0.7. The results are different but not by much.

O3, O3-Mini-High, and O4-Mini were still at 100%.
Many good models like O3-Mini, Gemini Pro 1.5, Grok 3 Beta, GPT 4.1, GPT 4.1 Mini, Gemini 2.5 Flash dropped 10%. Claude 3.7 Sonnet dropped 20%.
Many poor models like Gemini 1.5 Flash, Llama 3.1 405b, Llama 3.3 70b increased 10%. Llama 4 Scout increased 20%.
The poorer models tend to do a bit better at higher temperatures, but this could simply be randomness.

Impact of SHOUTING

I re-ran this with user responses in CAPS to see if models obey CAPS more. The good models do, a bit.

O3-Mini-High and O4-Mini were still at 100%.
Gemini 1.5 Pro, O3, and O3 Mini each failed 1 test (different tests), dropping from their 100% success rate.
Good models (60%+ success) did worse, dropping from 85% to 83%.
Bad models (50%- success) did better, moving from 34% to 41%, but this could simply be randomness.

Results

Here are the number of successful tests for each model for different tests.

Model	Default	0.7 Temp	SHOUTING
openai/o3-mini-high	10	10	10
openai/o4-mini	10	10	10
openai/o3	10	10	9
google/gemini-pro-1.5	10	9	9
openai/o3-mini	10	9	9
x-ai/grok-3-beta	10	9	8
anthropic/claude-3.5-sonnet	9	9	9
openai/gpt-4.5-preview	9	9	9
openai/gpt-4o-mini	9	9	9
openai/o4-mini-high	9	9	8
anthropic/claude-3.7-sonnet	9	7	8
google/gemini-2.0-flash-001	8	8	8
openai/gpt-4.1	9	8	6
anthropic/claude-sonnet-4	7	7	8
google/gemini-2.0-flash-lite-001	7	8	7
google/gemini-2.5-flash-preview-05-20	7	6	9
amazon/nova-lite-v1	7	7	6
anthropic/claude-3.5-haiku	6	7	7
openai/gpt-4.1-nano	5	6	8
meta-llama/llama-3-70b-instruct	6	4	7
anthropic/claude-opus-4	4	7	6
amazon/nova-micro-v1	5	5	6
meta-llama/llama-4-scout	5	7	4
deepseek/deepseek-chat-v3-0324	5	6	4
meta-llama/llama-4-maverick	4	5	6
openai/gpt-4.1-mini	5	4	5
google/gemini-flash-1.5	4	5	5
meta-llama/llama-3.1-405b-instruct	4	5	3
amazon/nova-pro-v1	4	4	3
meta-llama/llama-3-8b-instruct	3	3	5
anthropic/claude-3-haiku	3	2	3
meta-llama/llama-3.3-70b-instruct	2	3	3
anthropic/claude-3-opus	2	2	3
anthropic/claude-3-sonnet	2	2	2
meta-llama/llama-3.2-3b-instruct	1	1	1
meta-llama/llama-3.2-1b-instruct	0	1	2