Calibrated Estimates

2024-10-04

Why?

We are often required to supply estimates. Project managers would like us to say how long a task will take, and that number should stand on firm ground. Sometimes we need to give a single number, sometimes a range like “best case – average case – worst case”.

Most of the time our focus is on one of the range’s numbers, and the other numbers are derived from that. A naive estimation like “plus/minus 30%” is common. We mostly try to set the anchor value reliably.

This estimation is difficult in its own right, but another side is often neglected: what confidence do we have in the estimate?

Demanding 100% confidence is not very sensible, because the ranges get immense. And since the estimate isn’t an end in itself, but is supposed to be used, for example in project plans, it is important to assign a sensible confidence level to the estimate. That confidence level should be quantifiable, consistent and reproducible.

There is no “right” value for the confidence level we aim at, but a good rule of thumb is 90%. Most estimates are then correct, yet the estimated ranges do not have to become extreme just to include rare outliers.

Now we have an obvious problem: we are bad at estimating a single value. But at least we know deep in our hearts that we’re bad at it. But who can claim to choose their estimates such that on many repetitions a given confidence level is reached?

That is why we need to calibrate our estimates.

This is not about improving the estimated value itself, making it more accurate. It’s about aligning an estimate to a given confidence level.

If I usually give estimates that are aligned to this given confidence level, we say that I’m a calibrated estimator.

And that is what this article is about. How do I find out how well calibrated I currently am, and how can I improve my calibration?

Exercises

Preliminaries

If you do these exercises in a group setting, some ground rules are important:

Every participant answers the questions on a piece of paper all by themselves.
Every participant evaluates their answers by themselves (guidance to follow).
The moderator does not ask for their results.
Every participant takes their piece of paper with their answers with them when they leave the room.
Afterwards, every participant can shred the piece of paper, throw it into the garbage bin, or post it in the office kitchenette, according to their personal level of extraversion.

Part One: Intervals

Ten questions. The solution to every question is a single number, like a year or a speed.

The task is to estimate an interval “min – max” or “earliest – latest”. Two numbers.

Such that this interval has a confidence level of 90%.

That means that the interval should be wide enough to be almost sure. It should not be so wide that you’re totally sure. For example, when the question is about a year of birth, the answer “from the Big Bang until now” is a very good estimate, as it is certainly correct, but this estimate is also useless and without any value.

If you repeat these question–estimate games many times, ninety percent of the correct solutions should lie inside your estimated intervals and ten percent should be outside of them.

It does not matter how far inside or outside of the interval the solution lies. Inside is inside, outside is outside. There is no “almost correct”.

Now the questions. I’m sorry, they are kind of German-centric. But it does not matter for the exercise. If you’re from somewhere else, you can still do it.

How much does a Learjet 75 weigh (kg)?
What is the radius of the geostationary orbit, measured from the center of the earth to the satellite (km)?
How deep under water did the wrecked Russian submarine Kursk lie (m)?
How long is a 10 Euro bill (mm)?
In which year did the German stock index DAX exceed 5000 for the first time?
At which temperature does helium vaporize (°C)?
In which year was the German-language Sesame Street first broadcast?
How many Pokémon are there?
Which year was Macbeth’s premiere?
How much did the Volkswagen Golf 1 cost (Deutsche Mark)?

Part Two: Confidence

This exercise consists of ten statements of fact. Each one is either correct or wrong.

This time you’re not answering with an interval, but simply with “true” or “false”, and additionally with your personal confidence: how sure are you about your answer?

You can choose 50%, 60%, 70%, 80%, 90% and 100% as confidence levels. Please don’t answer 82.7%. And not less than 50%. If you lean towards 40%, flip the answer and give 60% as your confidence level.
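The flipping rule can be written down as a tiny Python sketch. The function name `normalize_answer` is made up for this illustration, and confidence is given in whole percent to keep the arithmetic exact.

```python
def normalize_answer(answer, confidence_percent):
    """Return (answer, confidence) with the confidence never below 50%."""
    if confidence_percent < 50:
        # Being 40% sure of "true" is the same as being 60% sure of "false".
        return (not answer, 100 - confidence_percent)
    return (answer, confidence_percent)


print(normalize_answer(True, 40))  # (False, 60)
```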

The statements:

A 1 Euro coin is heavier than a Compact Disc.
Buzz Aldrin was the second man on the moon.
World War II is closer to today than to the American Civil War.
Some tortoises live to 200 years.
There are more than 20000 kilometers of Autobahn in Germany.
There were more than 20 German recipients of the Nobel Prize in Physics.
California’s gross domestic product exceeds that of Italy.
The distance (as the crow flies) between Vladivostok and Mumbai is larger than the distance between Wuppertal and Moscow.
Hanover (Germany) has more town precincts than Stuttgart.
An ice hockey puck fits into a golf hole.

Evaluation

Let’s evaluate your estimates. First for Part One:

Question	Solution
How much does a Learjet 75 weigh (kg)?	6168
What is the radius of the geostationary orbit, measured from the center of the earth to the satellite (km)?	42157
How deep under water did the wrecked Russian submarine Kursk lie (m)?	108
How long is a 10 Euro bill (mm)?	127
In which year did the German stock index DAX exceed 5000 for the first time?	1998
At which temperature does helium vaporize (°C)?	-269
In which year was the German-language Sesame Street first broadcast?	1973
How many Pokémon are there?	890
Which year was Macbeth’s premiere?	1606
How much did the Volkswagen Golf 1 cost (Deutsche Mark)?	7995

Mark the solutions that lie inside your estimated intervals. If you’re already a well-calibrated estimator, there should be about nine of the ten. Of course there is natural statistical variation; more about that later.
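If you prefer to let a script do the marking, the step looks roughly like this in Python. This is only a sketch: the function name `count_hits` and the sample intervals are invented for illustration, and the interval bounds are treated as inclusive, which is an assumption the exercise itself does not spell out.

```python
def count_hits(intervals, solutions):
    """Count how many solutions lie inside their estimated (low, high) interval."""
    hits = 0
    for (low, high), solution in zip(intervals, solutions):
        # Inside is inside, outside is outside: the distance to the bounds is ignored.
        if low <= solution <= high:
            hits += 1
    return hits


# Solutions for the first three questions from the table above.
solutions = [6168, 42157, 108]
# Hypothetical estimates, not taken from any real participant.
intervals = [(3000, 9000), (20000, 40000), (50, 500)]

print(count_hits(intervals, solutions))  # 2 of the 3 solutions are inside
```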

Now Part Two:

Question	Solution
A 1 Euro coin is heavier than a Compact Disc.	wrong (7.5 g vs. 15 g)
Buzz Aldrin was the second man on the moon.	true
World War II is closer to today than to the American Civil War.	wrong (79 years vs. 74 years)
Some tortoises live to 200 years.	wrong (more than 176 is the highest estimate)
There are more than 20000 kilometers of Autobahn in Germany.	wrong (just over 13000 kilometers)
There were more than 20 German recipients of the Nobel Prize in Physics.	true (23.5 – dual citizenship was counted as half)
California’s gross domestic product exceeds that of Italy.	true (3.2 trillion USD vs. 2.3 trillion USD)
The distance (as the crow flies) between Vladivostok and Mumbai is larger than the distance between Wuppertal and Moscow.	true (6078 kilometers vs. 2056 kilometers)
Hanover (Germany) has more town precincts than Stuttgart.	wrong (51 vs. 152)
An ice hockey puck fits into a golf hole.	true (3 inches vs. 4.25 inches)

Mark the statements that you correctly answered.

Now convert the confidence levels from percent to number values (70% to 0.7) and add all ten up. The sum is the number of correct answers you should expect.

An example:

statement	your answer	answer correct?	confidence in %	confidence (number)
wrong	wrong	yes	50%	0.5
true	true	yes	70%	0.7
wrong	true	no	100%	1.0
wrong	wrong	yes	90%	0.9
wrong	wrong	yes	90%	0.9
true	wrong	no	50%	0.5
true	wrong	no	80%	0.8
true	wrong	no	80%	0.8
wrong	wrong	yes	100%	1.0
true	wrong	no	60%	0.6

That is 5 correct answers and a sum of confidence values of 7.7.

In this example you should have answered about 8 statements correctly, but you only had 5 correct answers.
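The same bookkeeping as a minimal Python sketch. The list reproduces the “answer correct?” and “confidence (number)” columns of the example table; the variable names are chosen for illustration only.

```python
# (answer correct?, confidence) for the ten rows of the example table
results = [
    (True, 0.5), (True, 0.7), (False, 1.0), (True, 0.9), (True, 0.9),
    (False, 0.5), (False, 0.8), (False, 0.8), (True, 1.0), (False, 0.6),
]

# The sum of the confidence values is the expected number of correct answers.
expected_correct = sum(confidence for _, confidence in results)
actual_correct = sum(1 for correct, _ in results if correct)

print(f"expected: {expected_correct:.1f}, actual: {actual_correct}")
# expected: 7.7, actual: 5
```

Fewer correct answers than expected point towards overconfidence, more correct answers than expected towards underconfidence.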

Don’t worry, results like these (and worse) are normal.

Limitations and objections

Statistical significance

Of course, this was only a single exercise with ten questions, a very small sample size. “I had seven answers inside the interval, that’s within statistical variation” you might be tempted to say. But is that so?

For Part One there is an easy plausibility check. If we treat every question as an independent Bernoulli trial, so that the number of correct answers follows a binomial distribution (and that is sensible), we can ask:

Assume I’m a calibrated estimator. How probable is my result?

And the answer is this table:

Number correct	Probability
0	1.00E-10
1	9.00E-09
2	3.64E-07
3	8.75E-06
4	1.38E-04
5	1.49E-03
6	1.12E-02
7	5.74E-02
8	1.94E-01
9	3.87E-01
10	3.49E-01
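These values come from the binomial formula: the probability of exactly k hits in ten questions is C(10, k) × 0.9^k × 0.1^(10−k). A short Python sketch that should reproduce the table up to rounding; the function name and its defaults are my own choice for this illustration.

```python
from math import comb


def probability_of_hits(k, n=10, p=0.9):
    """Probability that a calibrated estimator gets exactly k of n intervals right."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# One line per possible number of correct answers, in the table's notation.
for k in range(11):
    print(f"{k}\t{probability_of_hits(k):.2E}")
```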

Let’s check it:

The probability of zero correct answers is exactly ten to the power of minus ten. As a calibrated estimator you’re right in 90% of cases, that is, you’re wrong in ten percent. The probability of answering all ten questions incorrectly is 0.1 × 0.1 × … × 0.1, that is ten to the power of minus ten.
The highest probability is at nine correct answers, as expected. But ten correct answers are much more likely than eight correct answers. Why? Because you’re right 90% of the time, so deviations fall more easily on the side of “too many correct answers”.

Again as a diagram:

Figure: Evaluating Part One of Calibrated Estimates

It’s obvious that seven correct answers are already very improbable, and everything below that practically vanishes.

If you have seven correct answers or fewer, it is implausible that you are already a calibrated estimator, despite the small set of questions.
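To put a number on “implausible”, one can add up the table rows for seven or fewer correct answers. A Python sketch under the same assumptions as the table (ten independent questions, 90% hit rate each); the function name is illustrative.

```python
from math import comb


def probability_at_most(k_max, n=10, p=0.9):
    """Probability of k_max or fewer hits out of n for a calibrated estimator."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1))


print(f"{probability_at_most(7):.3f}")  # 0.070
```

A calibrated estimator would end up with seven or fewer correct answers in only about 7% of such ten-question rounds.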

Questions

The next objection that is usually offered is “those weren’t questions in my field of expertise, but trivia” or “the questions were silly”.

That’s right, and it is on purpose, because this set of questions can be used quite independently of the participants, at least in Germany. So I don’t have to prepare new questions for every type of audience. Furthermore, trivia questions loosen up the exercise. Nobody has to fear losing face because they answered a question in their field of expertise incorrectly.

“If you had asked something about electrical engineering my estimates would have been better”.

This objection is similar, but originates in a misunderstanding.

Of course, a typical circuit layout designer could estimate the necessary dimensioning of a capacitor more accurately.

I would certainly expect a narrower interval here than with a trivia question.

But we didn’t evaluate the width of the interval at all! At this point many participants pause and flip back to the evaluation. But it’s true. The evaluation was binary: either the solution was inside the interval, or not. There were no bonus points for choosing a narrow interval.

The interval’s width did play a role, of course, though not for the question of inside or outside, but for the calibration: do you estimate too conservatively or too boldly?

These were not trick questions. But they were chosen so that participants have to really think about how sure they are.

Improving

Very few people are naturally calibrated estimators. The good news is: almost everybody can improve through exercise (studies say that about 5% don’t improve).

It’s fruitful to repeat this exercise every now and then, with different questions, of course.

A psychological trick is to act as if you’re betting money on your answer. Betting for real works even better, but pretending already helps.

The “equivalent bet test” asks “would you rather bet money on your answer or on this wheel of fortune with a 90% probability of winning?”. Every participant should be indifferent about this question, of course, but often an instinctive reaction for or against the wheel of fortune indicates a problem with their estimate.

It can help to just assume that the estimate is incorrect, and to question it by an explicit effort.

It is fine to use absurdly wide intervals as a starting point. They are but stepping stones on the way towards a better interval.

With most questions the interval bounds should be symmetric. That means that if you estimate an interval between 100 and 200 with confidence 90%, you should assign both intervals “minus infinity to 200” and “100 to infinity” a confidence level of 95%, because the remaining 10% should split evenly above and below.
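As a small illustration of that split, here is a Python sketch that turns a two-sided interval into its two one-sided statements. The function name and the returned dictionary are invented for this example.

```python
def one_sided_confidences(low, high, confidence=0.9):
    """Split a symmetric two-sided interval into its two one-sided claims."""
    tail = (1 - confidence) / 2  # probability mass on each side of the interval
    return {
        f"solution <= {high}": 1 - tail,
        f"solution >= {low}": 1 - tail,
    }


for claim, level in one_sided_confidences(100, 200).items():
    print(f"{claim}: {level:.0%}")
# solution <= 200: 95%
# solution >= 100: 95%
```

Checking each bound on its own at 95% is often easier than judging the whole interval at once.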

In closing

I find it important to align my estimates to a given confidence level. At work, though, I have never encountered anyone who requested that.

My personal results with these two exercises were catastrophic. I was much too confident (i.e. my intervals were markedly too narrow), and if you believe the studies that is the common case.

So I have resolved to do exercises like these regularly.