GPT-4 and other LLMs (Large Language Models) are driving a tidal wave of innovation in applied AI. However, used without augmentation, they have very limited calculation capabilities and make mistakes when calculating with numbers. In this project, we describe a simple, general technique to address this, apply it to some widely reported real-world failures of GPT-4-based AI, evaluate it, and discuss related issues.
These models also make mistakes when writing text that requires comparative logic about numbers. For example, GPT-4 quite happily writes sentences like this (emphasis added):
Give a numerical comparison of these two reports, deriving new interesting numbers as part of the comparison.
Both companies reported an increase in net sales compared to the same quarter last year, but Gap Inc. had a much larger absolute and relative increase ($4.04 billion, up 2%) than lululemon ($1.9 billion, up 28%).
Unfortunately, 2 is not greater than 28. The text as written is half-right, but it contains a serious mistake that makes it nonsensical.
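The comparison in the quoted sentence is easy to check mechanically. The sketch below assumes that "up 2%" and "up 28%" are year-over-year growth rates on the reported net sales, and derives the implied absolute increases from them; the variable names are illustrative only.

```python
# Reported net sales (billions of dollars) and year-over-year growth,
# taken from the GPT-4 sentence quoted above.
gap_sales, gap_growth = 4.04, 0.02
lulu_sales, lulu_growth = 1.9, 0.28

# Assumption: growth is measured against the same quarter last year,
# so prior-year sales = current sales / (1 + growth).
gap_increase = gap_sales - gap_sales / (1 + gap_growth)
lulu_increase = lulu_sales - lulu_sales / (1 + lulu_growth)

print(f"Gap absolute increase:       ${gap_increase:.3f} billion")
print(f"lululemon absolute increase: ${lulu_increase:.3f} billion")
print("Gap larger in relative terms:", gap_growth > lulu_growth)
print("Gap larger in absolute terms:", gap_increase > lulu_increase)
```

Under this assumption both comparisons print False: the implied increase is roughly $0.08 billion for Gap Inc. and roughly $0.42 billion for lululemon.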
The hope
The hope lies in these observations:
- LLMs are weak at numeric calculation but good at writing numeric calculation code.
- Python and many other tools are perfect at evaluating numeric calculation code.
The answer is obvious: get GPT-4 to write the numeric calculation code relevant to the question, get Python or some other tool to evaluate it, and use GPT-4 for the rest. Our aim is to "equip" or "augment" GPT-4 with a numeric calculator. In practice, the approach we use is simple and can be used to augment any LLM invocation. Further, the calculation code and results can be shown to the user as supplementary output, as well as separately checked, audited, explained and kept for record keeping.
How does it work?
Without equipping, GPT-x models are often used to implement this function:
- LLM: Question --> Answer
With numeric calculation equipping:
- LLM: Question --> Numeric calculation code
- Evaluator: Numeric calculation code --> Numeric calculation answers
- LLM: Question + Numeric calculation code + Numeric calculation answers --> Answer
The overall function computed is still Question --> Answer.
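The three steps above can be sketched as a composition of functions. This is a minimal illustration, not the implementation from the technical report: `equipped_answer` and the three stand-in functions are hypothetical names, and the two "LLM" steps are hard-coded stubs so the sketch runs without a model.

```python
from typing import Callable, Dict


def equipped_answer(
    question: str,
    write_code: Callable[[str], str],
    evaluate: Callable[[str], Dict[str, float]],
    write_answer: Callable[[str, str, Dict[str, float]], str],
) -> str:
    """Compose the three steps; the overall shape is still Question -> Answer."""
    code = write_code(question)                   # LLM: Question -> code
    results = evaluate(code)                      # Evaluator: code -> answers
    return write_answer(question, code, results)  # LLM: all three -> Answer


# Hypothetical stand-ins for the two LLM calls (no model is invoked here).
def fake_write_code(question: str) -> str:
    return "total = 19 * 21"


def simple_evaluate(code: str) -> Dict[str, float]:
    namespace: Dict[str, float] = {}
    # exec is only acceptable for trusted code; real use needs a sandbox.
    exec(code, {}, namespace)
    return namespace


def fake_write_answer(question: str, code: str, results: Dict[str, float]) -> str:
    return f"The answer is {results['total']}."


answer = equipped_answer(
    "What is 19 times 21?", fake_write_code, simple_evaluate, fake_write_answer
)
print(answer)
```

Passing the steps in as callables keeps the evaluator independent of the model, so either side can be swapped or audited separately.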
Does it work?
We developed some indicative problem sets that exhibit the following characteristics:
- decimals, integers, math functions
- mortgage, rate of return, WACC
These sets were developed partly in order to explore the boundary of what this technique can handle. The results indicate dramatic improvements when using numeric calculation equipping, with the exception of date calculations, where improvements were only marginal. This is discussed more in the evaluation.
Some variations on calculation code generation were also tested; see the technical report and evaluation.
Example
Here is a classic fail for LLM-based chat. Given this question:
# Question
What is the number of years it takes for an item growing at 30% annually to double?
Here is a typical incorrect response:
# Answer
It takes about 2.53 years for an item growing at 30% annually to double.
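A quick way to see that this response is off is to plug the claimed time back into the compound growth formula. The sketch below assumes annual compounding, so the amount is multiplied by (1 + rate) raised to the number of years.

```python
growth_rate = 0.30
claimed_years = 2.53

# Compound growth: after t years the amount is multiplied by (1 + rate) ** t.
growth_factor = (1 + growth_rate) ** claimed_years
print(f"Growth factor after {claimed_years} years: {growth_factor:.3f}")
print("Has the item doubled?", growth_factor >= 2)
```

The growth factor comes out a little under 2 (about 1.94), so the item has not yet doubled after 2.53 years.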
Calculation equipping to the rescue! Instead of using the LLM to answer the question, we use the sample equipping prompt from the technical report, which begins:
# Guidance
Do not answer the question. Instead, your task is to write some numeric
calculation and comparisons relevant to answering the question.
After the question write a code block with up to three sections containing
content relevant to answering the question.
...
Executing the LLM with this prompt gives the following calculation code:
# Relevant definitions
growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling
# Relevant calculations
doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)
We now evaluate this code using a Python interpreter:
doubling_time = 2.6419267958111403
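One way to perform this evaluation step is to execute the generated code in a fresh namespace and collect the numeric variables it defines. The sketch below is an assumption about how such an evaluator could look, not the one from the technical report. The generated code refers to `np`; to keep the sketch free of third-party dependencies, `np` is bound here to the standard `math` module as a stand-in for NumPy, which works only because this code needs nothing beyond `log`.

```python
import math

# Calculation code as produced by the LLM in the step above.
generated_code = """
# Relevant definitions
growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling
# Relevant calculations
doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)
"""


def evaluate(code: str) -> dict:
    """Run calculation code and return the numeric variables it defines."""
    # Assumption: `np` is a stand-in bound to `math`, not real NumPy.
    namespace = {"np": math}
    # exec is only acceptable for trusted code; real use needs a sandbox.
    exec(code, namespace)
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, (int, float)) and not name.startswith("_")
    }


results = evaluate(generated_code)
for name, value in results.items():
    print(f"{name} = {value}")
```

This prints every numeric variable, including the three definitions; an evaluator could equally report only the names assigned in the calculations section.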
We now fold the calculation code and answers into a request for an answer to the original question:
# Question
What is the number of years it takes for an item growing at 30% annually to double?
# Relevant definitions
growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling
# Relevant calculations
doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)
# Results of calculations
doubling_time = 2.6419267958111403
When answering this form of the question, the written answer is now correct:
# Answer
The number of years it takes for an item growing at 30% annually to double is approximately 2.64 years.
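The folded request used in this example is plain text concatenation of the question, the calculation code, and the results. A minimal sketch of that assembly follows; `build_equipped_prompt` is a hypothetical helper name, and the section headings simply mirror the ones shown in the example.

```python
def build_equipped_prompt(question: str, code: str, results: dict) -> str:
    """Fold the calculation code and its results into the question."""
    result_lines = "\n".join(f"{name} = {value}" for name, value in results.items())
    return (
        "# Question\n"
        f"{question}\n"
        f"{code.strip()}\n"
        "# Results of calculations\n"
        f"{result_lines}\n"
    )


question = (
    "What is the number of years it takes for an item growing at 30% "
    "annually to double?"
)
code = """# Relevant definitions
growth_rate = 0.3 # annual growth rate as a decimal fraction
initial_amount = 1 # initial amount of the item
doubling_amount = 2 # amount of the item after doubling
# Relevant calculations
doubling_time = np.log(doubling_amount / initial_amount) / np.log(1 + growth_rate)"""
results = {"doubling_time": 2.6419267958111403}

prompt = build_equipped_prompt(question, code, results)
print(prompt)
```

Because the prompt is ordinary text, the same string can be logged alongside the final answer for later checking or audit.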
Report and Evaluation
A GitHub Next technical report, with evaluation and implementation, is available.
What’s next?
We have concluded this investigation and invite you to read our technical report. We will be looking at applying this technique where necessary in our own projects, and encourage you to consider it for your own GPT-based development.