# GPT-4 with Calc

An exploration of using calculation generation to improve GPT-4's capabilities for numeric reasoning.

**What's it for?** Making AI outputs more reliable, trustworthy, auditable and explicable.

**Stage:** Completed

**Team:** Don Syme, Albert Ziegler, Johan Rosenkilde

GPT-4 and other LLMs (Large Language Models) are driving a tidal wave of innovation in applied AI. However, used without augmentation, they have very limited calculational capabilities and make mistakes when calculating with numbers. In this project, we describe a simple, general technique to address this, apply it to some widely reported real-world failures of GPT-4-based AI, carry out some evaluation, and discuss related issues.

These models also make mistakes when writing text that involves comparative logic over numbers. For example, GPT-4 quite happily writes sentences like this (emphasis added):

User

Give a numerical comparison of these two reports, deriving new interesting numbers as part of the comparison.

GPT-4

Both companies reported an increase in net sales compared to the same quarter last year, but Gap Inc. had a much larger absolute and relative increase (\$4.04 billion, up 2%) than lululemon (\$1.9 billion, up 28%).

Unfortunately, 2 is not greater than 28: the text as written is half-right, but it contains a serious mistake that makes it nonsensical.

## The hope

The hope lies in this observation:

1. LLMs are weak at numeric calculation but good at writing numeric calculation code.
2. Python and many other tools are perfect at evaluating numeric calculation code.

The answer is obvious: get GPT-4 to write the numeric calculation code relevant to the question, get Python or some other tool to evaluate it, and use GPT-4 for the rest. Our aim is to "equip" or "augment" GPT-4 with a numeric calculator. In practice the approach is simple and can be used to augment any LLM invocation. Further, the calculation code and results can be shown to the user as supplementary output, and can be separately checked, audited, explained and kept for record-keeping.

## How does it work?

Without equipping, GPT-x models are often used to implement this function directly:

`LLM: Question --> Answer`

With numeric calculation equipping, the same function is computed in three steps:

1. LLM: Question --> Numeric calculation code
2. Evaluator: Numeric calculation code --> Numeric calculation answers
3. LLM: Question + Numeric calculation code + Numeric calculation answers --> Answer

The overall function computed is still `Question --> Answer`.
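The three steps above can be sketched in code. This is a minimal illustration rather than the project's actual implementation: `llm` is a hypothetical stand-in for a GPT-4 API call, and the prompt shapes are simplified.

```python
import types


def evaluate_code(code: str) -> str:
    """Step 2: run the generated calculation code and report each resulting value."""
    namespace: dict = {}
    exec(code, namespace)
    return "\n".join(
        f"{name} = {value!r}"
        for name, value in namespace.items()
        if not name.startswith("__") and not isinstance(value, types.ModuleType)
    )


def answer_with_calc(llm, question: str) -> str:
    """Steps 1-3: code generation, evaluation, final answer."""
    # Step 1: ask the LLM for calculation code relevant to the question.
    calc_code = llm(f"# Question\n\n{question}\n\n# Relevant calculations\n")
    # Step 2: evaluate that code with a real interpreter.
    results = evaluate_code(calc_code)
    # Step 3: fold question, code and results into the final request.
    final_prompt = (
        f"# Question\n\n{question}\n\n"
        f"# Relevant calculations\n\n{calc_code}\n\n"
        f"# Results of calculations\n\n{results}\n\n# Answer\n"
    )
    return llm(final_prompt)
```

Note that only step 2 touches an interpreter; the LLM never has to do arithmetic itself, only to write it down and then read off the results.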

## Does it work?

We developed several indicative problem sets, partly in order to explore the boundary of what this technique can handle. The results indicate dramatic improvements when using numeric calculation equipping, with the exception of date calculations, where the improvement was only marginal. This is discussed further in the evaluation.

| Problem set | Examples | Mistake rate without Equip | Mistake rate with Equip |
| --- | --- | --- | --- |
| Raw calculation | decimals, integers, math functions | 50% (77/153 correct) | 0% (153/153 correct) |
| Financial calculations | mortgage, rate of return, WACC | 71% (4/14 correct) | 14% (12/14 correct) |
| Tabular data calculations | | 50% (4/8 correct) | 0% (8/8 correct) |
| Unit conversion calculations | | 36% (7/11 correct) | 0% (11/11 correct) |
| Date calculations | | 30% (7/10 correct) | 20% (8/10 correct) |
| Word puzzles | | 11% (2050/2303 correct) | 6.2% (2158/2303 correct) |

Some variations on calculation code generation were also tested; see the technical report and evaluation.

## Example

Here is a classic fail for LLM-based chat. Given this question:

``````# Question

What is the number of years it takes for an item growing at 30% annually to double?
``````

Here is a typical incorrect response:

``````# Answer

It takes about 2.53 years for an item growing at 30% annually to double.
``````

Calculation equipping to the rescue! Instead of using the LLM to answer the question directly, we use the sample equipping prompt from the technical report, which begins:

``````# Guidance

calculation and comparisons relevant to answering the question.

After the question write a code block with up to three sections containing
content relevant to answering the question.

...
``````

Running the LLM with this prompt yields the following calculation code:

``````# Relevant definitions

growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling

# Relevant calculations

doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)
``````

We now evaluate this code using a Python interpreter:

``````doubling_time = 2.6419267958111403
``````
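The evaluation step can be as simple as `exec`-ing the generated block in a namespace where `np` is bound. This is a sketch under an assumption about the harness, not the project's actual evaluator:

```python
# Minimal sketch of the evaluator step (an assumption about the harness,
# not the project's actual implementation).
try:
    import numpy as np
except ImportError:
    # The generated code only calls np.log on scalars,
    # so Python's math module is a drop-in replacement here.
    import math as np

calc_code = """
growth_rate = 0.3      # annual growth rate as a decimal fraction
initial_amount = 1     # initial amount of the item
doubling_amount = 2    # amount of the item after doubling
doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)
"""

namespace = {"np": np}
exec(calc_code, namespace)
print(f"doubling_time = {namespace['doubling_time']}")
# doubling_time = 2.6419267958111403
```

Because the generated code is ordinary Python, the same mechanism also gives us an artifact that can be inspected, re-run, and kept for auditing.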

We now fold the calculation code and answers into a request for an answer to the original question:

``````# Question

What is the number of years it takes for an item growing at 30% annually to double?

# Relevant definitions

growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling

# Relevant calculations

doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)

# Results of calculations

doubling_time = 2.6419267958111403
``````

When answering this form of the question, the written answer is now correct:

``````# Answer

The number of years it takes for an item growing at 30% annually to double is approximately 2.64 years.
``````
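The answer is also easy to sanity-check: compounding at 30% per year for the computed doubling time should land exactly on a factor of two.

```python
import math

# doubling_time solves 1.3 ** t == 2, i.e. t = log(2) / log(1.3)
doubling_time = math.log(2) / math.log(1.3)

# Growing at 30% annually for doubling_time years should double the amount.
assert abs(1.3 ** doubling_time - 2) < 1e-12
print(round(doubling_time, 2))  # 2.64
```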

## Report and Evaluation

A GitHub Next technical report, with evaluation and implementation, is available.

## What’s next?

We have concluded this investigation and invite you to read our technical report. We will be looking at applying this technique where necessary in our own projects, and encourage you to consider it for your own GPT-based development.