LLM Mental Math

How well can LLMs multiply?

I asked 50 LLMs to multiply pairs of increasingly large numbers:

  1. 12 x 12
  2. 123 x 456
  3. 1,234 x 5,678
  4. 12,345 x 6,789
  5. 123,456 x 789,012
  6. 1,234,567 x 8,901,234
  7. 987,654,321 x 123,456,789
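
For reference, the exact products are easy to get from Python's arbitrary-precision integers. A quick sketch for checking a model's answers:

```python
# Ground truth for the seven problems, via Python's big integers.
problems = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]

for a, b in problems:
    print(f"{a:,} x {b:,} = {a * b:,}")
```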

LLMs aren't good tools for math, and this was just an informal check. But the results are interesting:

  1. OpenAI's reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication.
  2. OpenAI's other models and DeepSeek V3 came next, getting the first 5 of the 7 right. Notably, GPT 4.1 Mini beat GPT 4.1, and DeepSeek V3 beat DeepSeek R1.
  3. 16 models, including the latest Gemini, Anthropic, and Llama models, got 4/7 right.
  4. The Amazon models, along with older Llama, Anthropic, Google, and OpenAI models, got 3 or fewer right.

Models use human-like mental math tricks.

For example, O3-Mini-High calculated 1234567 × 8901234 using a recursive strategy.
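
I don't know the model's exact steps, but the classic recursive multiplication strategy is split-and-recombine: break each number into high and low halves, multiply the smaller pieces, and reassemble. A minimal Karatsuba-style sketch (the function name and splitting choices are my own illustration, not the model's transcript):

```python
def karatsuba(x, y):
    """Recursively multiply x and y by splitting their decimal digits."""
    # Base case: small numbers are multiplied directly.
    if x < 10 or y < 10:
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    p = 10 ** half
    a, b = divmod(x, p)  # x = a * 10^half + b
    c, d = divmod(y, p)  # y = c * 10^half + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    # One recursive call recovers the cross terms: (a+b)(c+d) - ac - bd = ad + bc.
    mid = karatsuba(a + b, c + d) - ac - bd
    return ac * 10 ** (2 * half) + mid * p + bd

print(karatsuba(1_234_567, 8_901_234))
```

The trick is that three recursive products replace the four a naive split would need.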

DeepSeek V3 double-checks its results and hallucinates a "reliable computational tool".

O3 Mini reframes 8901234 as (9000000 − 98766) to simplify the calculation.
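
That round-number rewrite is just the distributive law: one easy multiply-and-shift plus one smaller correction product. A quick check of the decomposition:

```python
a, b = 1_234_567, 8_901_234

# 8,901,234 = 9,000,000 - 98,766: a round number minus a small correction.
assert 9_000_000 - 98_766 == b

# The hard product becomes an easy one plus a smaller one.
result = a * 9_000_000 - a * 98_766
print(result == a * b)  # True
```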

For more details, see the repo at github.com/sanand0/llmmath.