I asked 50 LLMs to multiply 2 numbers:
- 12 x 12
- 123 x 456
- 1,234 x 5,678
- 12,345 x 6,789
- 123,456 x 789,012
- 1,234,567 x 8,901,234
- 987,654,321 x 123,456,789
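For reference, the exact answers are easy to generate with Python's arbitrary-precision integers. A quick sketch (not part of the original test harness):

```python
# Ground-truth products for the seven test multiplications.
# Python ints are arbitrary precision, so these are exact.
pairs = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]
for a, b in pairs:
    print(f"{a:,} x {b:,} = {a * b:,}")
```

Comparing each model's answer against these exact products is how a pass/fail score like 4/7 is decided.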
LLMs aren't good tools for math, and this is just an informal check. But the results are interesting:
- OpenAI's reasoning models cracked it, scoring 6/7, stumbling only on the 9-digit multiplication.
- OpenAI's other models and DeepSeek V3 came next, getting the first 5 of 7 right. Notably, GPT 4.1 Mini beat GPT 4.1, and DeepSeek V3 beat DeepSeek R1.
- 16 models, including the latest Gemini, Anthropic, and Llama models, got 4/7 right.
- The Amazon models, plus older Llama, Anthropic, Google, and OpenAI models, got 3 or fewer right.
Models use human-like mental math tricks. For example:
- O3-Mini-High calculated 1234567 × 8901234 using a recursive strategy.
- DeepSeek V3 double-checks its results and hallucinates a "reliable computational tool".
- O3 Mini reframes 8901234 as (9000000 − 98766) to simplify the calculation.
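That round-number reframing checks out with exact arithmetic. A minimal sketch (variable names are mine, not the model's):

```python
# O3 Mini's trick: 8,901,234 = 9,000,000 - 98,766, so the product
# becomes an easy round multiple minus a smaller correction term.
a = 1_234_567
easy = a * 9_000_000       # trivial: shift-and-multiply by 9
correction = a * 98_766    # the smaller, harder part
result = easy - correction
assert result == a * 8_901_234  # matches the direct product
print(f"{result:,}")
```

The trick trades one hard multiplication for an easy one plus a smaller one, which is exactly how a human would attack it on paper.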
For more details, see the repo at github.com/sanand0/llmmath.