Only one in four models could keep the [SAFE] ... [/SAFE]
envelope intact. The failure rate mirrors AWS's own warning that tag-spoofing is an evergreen prompt-injection vector-malicious users simply strip or mimic wrapper tags to bypass downstream filters. In products that rely on tag-gating (e.g., automatic content-moderation pipelines), this means there is a 75% chance the guardrail falls off and unsafe text leaks.
The "French-only" instruction passed just 39% of the time. Researchers note that LLMs frequently "slip" languages when the user introduces conflicting cues, a well-known pain point in multilingual deployments. If your chatbot must stay inside a single locale for legal or brand reasons, string matching or post-hoc language detection is still mandatory.
Gemini 1.5 Pro (the "slow" tier) scored 100%, yet the newer "Flash" 2.0 variants dropped to 70% and 40%. Google's own release notes confirm Flash emphasises latency over reasoning fidelity - your numbers show that alignment safeguards are among the casualties. Teams picking low-latency endpoints should budget extra safety testing.
Claude 3.5 Sonnet (mid-size) hit 90%, while the flagship Claude 4 Opus scraped just 40%. Anthropic's product blog lauds 3.5's "step-change alignment improvements", and this data confirms that newer mid-tiers may out-guardrail their giant predecessors. Procurement decisions should compare versions, not just parameter counts.
Meta's 405 B-parameter Llama 3.1 scored only 40%, and the 70 B model stalled at 60%. The Lakera PINT and StruQ academic benchmarks have already shown open-source LLMs trail on prompt-injection resilience. These results quantify the gap: even simple format rules are routinely broken, signalling that RLHF or guardrail fine-tunes are still shallow in OSS stacks.
Holistic AI's jailbreak study pegged Grok 3's resistance at just 2.7%, yet this deterministic checklist shows 100% compliance. The disparity underlines how attack type matters: Grok struggles with content-policy jailbreaks but nails format fidelity. For security reviews, one metric is never enough; multiple, orthogonal tests are essential.
I re-ran this at temperature = 0.7. The results are different but not by much.
I re-ran this with user responses in CAPS to see if models obey CAPS more. The good models do, a bit.
Here are the number of successful tests for each model for different tests.
Model | Default | 0.7 Temp | SHOUTING |
---|---|---|---|
openai/o3-mini-high | 10 | 10 | 10 |
openai/o4-mini | 10 | 10 | 10 |
openai/o3 | 10 | 10 | 9 |
google/gemini-pro-1.5 | 10 | 9 | 9 |
openai/o3-mini | 10 | 9 | 9 |
x-ai/grok-3-beta | 10 | 9 | 8 |
anthropic/claude-3.5-sonnet | 9 | 9 | 9 |
openai/gpt-4.5-preview | 9 | 9 | 9 |
openai/gpt-4o-mini | 9 | 9 | 9 |
openai/o4-mini-high | 9 | 9 | 8 |
anthropic/claude-3.7-sonnet | 9 | 7 | 8 |
google/gemini-2.0-flash-001 | 8 | 8 | 8 |
openai/gpt-4.1 | 9 | 8 | 6 |
anthropic/claude-sonnet-4 | 7 | 7 | 8 |
google/gemini-2.0-flash-lite-001 | 7 | 8 | 7 |
google/gemini-2.5-flash-preview-05-20 | 7 | 6 | 9 |
amazon/nova-lite-v1 | 7 | 7 | 6 |
anthropic/claude-3.5-haiku | 6 | 7 | 7 |
openai/gpt-4.1-nano | 5 | 6 | 8 |
meta-llama/llama-3-70b-instruct | 6 | 4 | 7 |
anthropic/claude-opus-4 | 4 | 7 | 6 |
amazon/nova-micro-v1 | 5 | 5 | 6 |
meta-llama/llama-4-scout | 5 | 7 | 4 |
deepseek/deepseek-chat-v3-0324 | 5 | 6 | 4 |
meta-llama/llama-4-maverick | 4 | 5 | 6 |
openai/gpt-4.1-mini | 5 | 4 | 5 |
google/gemini-flash-1.5 | 4 | 5 | 5 |
meta-llama/llama-3.1-405b-instruct | 4 | 5 | 3 |
amazon/nova-pro-v1 | 4 | 4 | 3 |
meta-llama/llama-3-8b-instruct | 3 | 3 | 5 |
anthropic/claude-3-haiku | 3 | 2 | 3 |
meta-llama/llama-3.3-70b-instruct | 2 | 3 | 3 |
anthropic/claude-3-opus | 2 | 2 | 3 |
anthropic/claude-3-sonnet | 2 | 2 | 2 |
meta-llama/llama-3.2-3b-instruct | 1 | 1 | 1 |
meta-llama/llama-3.2-1b-instruct | 0 | 1 | 2 |