I asked 67 LLMs to multiply these pairs of numbers, five times each:
- 12 x 12
- 123 x 456
- 1,234 x 5,678
- 12,345 x 6,789
- 123,456 x 789,012
- 1,234,567 x 8,901,234
- 987,654,321 x 123,456,789
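For reference, the exact products can be computed with arbitrary-precision integers. This is a minimal Python sketch; the names `PROBLEMS` and `EXPECTED` are illustrative and are not taken from the repo.

```python
# The seven multiplication problems listed above, as (a, b) pairs.
PROBLEMS = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]

# Python ints are arbitrary precision, so every product is exact.
EXPECTED = [a * b for a, b in PROBLEMS]

for (a, b), product in zip(PROBLEMS, EXPECTED):
    print(f"{a:,} x {b:,} = {product:,}")
```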
LLMs aren't good tools for math and this is just an informal check. But the results are interesting:
Update: 15 June 2026
- Three newer models got every multiplication right: GPT-5.5, Claude Opus 4.8, and Grok 4.3 scored 35/35.
- Qwen 3.7 Plus nearly matched them, scoring 34/35, followed by Claude Sonnet 4.6 at 32/35.
- Gemini 3 Flash Preview, Gemini 3.1 Pro Preview, and GPT-5.4 Mini each scored 30/35.
- Not every newer model improved: Amazon Nova 2 Lite scored 10/35.
The perfect scores are a major improvement over May 2025, when no model solved the hardest 9-digit multiplication.
The update also fixes the hardest test's expected value, which JavaScript rounded in the earlier run. Older models were not rerun, so comparisons on that test are imperfect.
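The rounding most likely happens because a JavaScript Number is an IEEE 754 double, which represents integers exactly only up to 2^53, and the 9-digit product is larger than that. Python's `float` uses the same format, so the effect can be reproduced with a small sketch. This illustrates the rounding and is not the repo's actual code.

```python
a, b = 987_654_321, 123_456_789

exact = a * b                        # arbitrary-precision integer product
rounded = int(float(a) * float(b))   # same product computed in float64

MAX_SAFE_INTEGER = 2**53 - 1         # same value as JavaScript's Number.MAX_SAFE_INTEGER

print(exact)
print(rounded)
print(exact > MAX_SAFE_INTEGER)      # True: the product is beyond the exact-integer range
print(exact == rounded)              # False: the float64 result has lost the low digits
```

In JavaScript itself, `BigInt` arithmetic (for example `987654321n * 123456789n`) keeps the product exact.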
Original summary: 16 May 2025
- OpenAI's reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication.
- OpenAI's other models and DeepSeek V3 were next, getting the first 5 of 7 right. Notably, GPT 4.1 Mini beat GPT 4.1, and DeepSeek V3 beat DeepSeek R1.
- 16 models, including the latest Gemini, Anthropic, and Llama models, got 4/7 right.
- The Amazon models and older Llama, Anthropic, Google, and OpenAI models got 3 or fewer right.
Models use human-like mental math tricks.
For example, O3-Mini-High calculated 1234567 × 8901234 using a recursive strategy.
DeepSeek V3 double-checks results and hallucinates a "reliable computational tool".
O3 Mini reframes 8901234 as (9000000 − 98766) to simplify the calculation.
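The reframing works because of the distributive law: a × (c − d) = a × c − a × d. Here is a minimal Python sketch of that identity for this problem; the variable names are illustrative, and the model's actual intermediate steps may have differed.

```python
a = 1_234_567
b = 8_901_234

# Rewrite b as a round number minus a small correction.
round_part, correction = 9_000_000, 98_766
assert round_part - correction == b

# Distribute: a * (round_part - correction) = a * round_part - a * correction
via_trick = a * round_part - a * correction

print(via_trick == a * b)  # True
```

Multiplying by 9,000,000 is a single-digit multiplication followed by a shift, so most of the remaining work is in the smaller correction term.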
For more details, see the repo at github.com/sanand0/llmmath.