How well can LLMs multiply numbers in their head?
I asked 50 LLMs to multiply two numbers, across seven problems of increasing size (a scoring sketch follows the list):
1. 12 x 12
2. 123 x 456
3. 1,234 x 5,678
4. 12,345 x 6,789
5. 123,456 x 789,012
6. 1,234,567 x 8,901,234
7. 987,654,321 x 123,456,789
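Scoring this is trivial to reproduce. Here's a minimal sketch in Python, assuming a hypothetical `ask_model(prompt)` wrapper around whatever LLM API you're testing; this is my illustration, not the actual harness from the repo linked below:

```python
# The seven problems; Python integers give exact ground truths.
PROBLEMS = [
    (12, 12),
    (123, 456),
    (1_234, 5_678),
    (12_345, 6_789),
    (123_456, 789_012),
    (1_234_567, 8_901_234),
    (987_654_321, 123_456_789),
]

def is_correct(answer: str, a: int, b: int) -> bool:
    # Accept "7,006,652" and "7006652" alike by dropping separators.
    return answer.replace(",", "").replace(" ", "").strip() == str(a * b)

def run_benchmark(ask_model) -> str:
    # `ask_model` is a hypothetical callable: prompt in, answer string out.
    score = sum(is_correct(ask_model(f"What is {a} x {b}?"), a, b)
                for a, b in PROBLEMS)
    return f"{score}/{len(PROBLEMS)}"
```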
LLMs aren't good tools for math, and this is just an informal check. But the results are interesting:
𝗢𝗽𝗲𝗻𝗔𝗜'𝘀 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗺𝗼𝗱𝗲𝗹𝘀 𝗰𝗿𝗮𝗰𝗸𝗲𝗱 𝗶𝘁, scoring 6/7, stumbling only on the 9-digit multiplication.
𝗢𝗽𝗲𝗻𝗔𝗜'𝘀 𝗼𝘁𝗵𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 𝗮𝗻𝗱 𝗗𝗲𝗲𝗽𝗦𝗲𝗲𝗸 𝗩𝟯 𝘄𝗲𝗿𝗲 𝗻𝗲𝘅𝘁, getting the first 5/7 right. Notably: GPT 4.1 Mini beat GPT 4.1, and DeepSeek V3 beat DeepSeek R1.
16 models, including the latest Gemini, Anthropic, and Llama models, got 4/7 right.
The Amazon models and the older Llama, Anthropic, Google, and OpenAI models got 3/7 or fewer right.
𝗠𝗼𝗱𝗲𝗹𝘀 𝘂𝘀𝗲 𝗵𝘂𝗺𝗮𝗻-𝗹𝗶𝗸𝗲 𝗺𝗲𝗻𝘁𝗮𝗹 𝗺𝗮𝘁𝗵 𝘁𝗿𝗶𝗰𝗸𝘀.
For example, O3-Mini-High calculates 1234567 × 8901234 using a recursive strategy.
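I won't claim this is the model's exact decomposition, but Karatsuba-style splitting is one plausible shape for such a recursive strategy: cut each factor in half by digits, recurse on three smaller products, and recombine with shifts and additions:

```python
def karatsuba(x: int, y: int) -> int:
    """Recursive multiplication: reduce one big product to three
    smaller ones (a plausible sketch, not the model's own steps)."""
    if x < 10 or y < 10:              # small enough to multiply directly
        return x * y
    n = max(len(str(x)), len(str(y))) // 2
    p = 10 ** n
    x_hi, x_lo = divmod(x, p)         # split each factor at digit n
    y_hi, y_lo = divmod(y, p)
    hi = karatsuba(x_hi, y_hi)
    lo = karatsuba(x_lo, y_lo)
    # One extra recursive product recovers the cross terms.
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo) - hi - lo
    return hi * p * p + mid * p + lo

assert karatsuba(1_234_567, 8_901_234) == 1_234_567 * 8_901_234
```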
DeepSeek V3 double-checks results and hallucinates a "reliable computational tool".
O3 Mini reframes 8901234 as (9000000 − 98766) to simplify the calculation.
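That complement trick is just distributivity, and it's easy to verify: the round-number product is a shift, and what's left is a much smaller multiplication:

```python
a = 1_234_567
# 8,901,234 sits just below 9,000,000, so take the easy round
# product and subtract a smaller correction product.
round_part = a * 9_000_000   # a single-digit multiply plus zeros
correction = a * 98_766      # the small leftover product
assert round_part - correction == a * 8_901_234
```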
Explore the results at https://lnkd.in/gqnXhTyq and the repo at https://lnkd.in/gruKgds9