I asked 50 LLMs to multiply 2 numbers:
- 12 x 12
- 123 x 456
- 1,234 x 5,678
- 12,345 x 6,789
- 123,456 x 789,012
- 1,234,567 x 8,901,234
- 987,654,321 x 123,456,789
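For reference, the exact answers are easy to generate with Python's arbitrary-precision integers. A quick sketch (not part of the original test harness):

```python
# Ground-truth products for the seven test multiplications.
# Python ints are arbitrary precision, so these are exact.
pairs = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]
for a, b in pairs:
    print(f"{a:,} x {b:,} = {a * b:,}")
```

Comparing each model's answer against these exact products is how a pass/fail score like 4/7 is decided.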
LLMs aren't good tools for math, and this is just an informal check. But the results are interesting:
- OpenAI's reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication.
- OpenAI's other models and DeepSeek V3 came next, getting the first 5 of 7 right. Notably, GPT 4.1 Mini beat GPT 4.1, and DeepSeek V3 beat DeepSeek R1.
- 16 models, including the latest Gemini, Anthropic, and Llama models, got 4/7 right.
- The Amazon models, plus older Llama, Anthropic, Google, and OpenAI models, got 3 or fewer right.
Models use human-like mental math tricks. For example:
- O3-Mini-High calculated 1234567 × 8901234 using a recursive strategy.
- DeepSeek V3 double-checks its results and hallucinates a "reliable computational tool".
- O3 Mini reframes 8901234 as (9000000 − 98766) to simplify the calculation.
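That round-number reframing checks out with exact arithmetic. A minimal sketch (variable names are mine, not the model's):

```python
# O3 Mini's trick: 8,901,234 = 9,000,000 - 98,766, so the product
# becomes an easy round multiple minus a smaller correction term.
a = 1_234_567
easy = a * 9_000_000       # trivial: shift-and-multiply by 9
correction = a * 98_766    # the smaller, harder part
result = easy - correction
assert result == a * 8_901_234  # matches the direct product
print(f"{result:,}")
```

The trick trades one hard multiplication for an easy one plus a smaller one, which is exactly how a human would attack it on paper.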
For more details, see the repo at github.com/sanand0/llmmath.