How Do You Know When You’ve Tested Enough in Engine Calibration?

Author: Simon Daigneault, Product Marketing Engineer, Monolith

Read Time: 7 mins


In 2025, we worked on close to a hundred AI projects and proposals across automotive, aerospace, motorsport, and battery development. This included work with global OEMs such as Nissan, advanced aerospace programmes at Vertical Aerospace, and high-performance racing teams including JOTA Sport and PREMA Racing.

Late in a calibration programme, there’s often a moment where the data looks stable. Torque targets are met. Emissions are compliant. Fuel consumption is competitive. The engine behaves consistently across the key operating points. The current map feels strong. 

At that point, the discussion usually shifts from finding improvements to confirming the result. Someone suggests sweeping ±2° spark around the current best region. Another proposes trying a slightly richer lambda. A neighbouring boost plateau gets tested “just to check.” 

These tests are rarely driven by an expectation of improvement. They’re driven by a need for confidence. No one wants to be in a review meeting explaining why a potential improvement wasn’t explored. So the additional runs are scheduled. 

They typically come back slightly worse, or statistically indistinguishable within repeatability limits. The original calibration remains the best. And now there is evidence to support that position. 

This pattern is common. It feels responsible and thorough. It is also where diminishing returns begin to accumulate. 

 

 


From Exploration to Confirmation 

Early-stage calibration is characterised by clear gradients. Parameter changes produce visible shifts in torque, efficiency, combustion stability, or emissions. The relationship between cause and effect is strong enough that progress is easy to recognise. 

As the programme matures, that changes. Improvements become smaller. Trade-offs between competing objectives become tighter. Measurement variability becomes a larger fraction of the observed delta. Late-stage work becomes less about discovering large gains and more about balancing constraints. 

At that point, the objective subtly shifts. Instead of asking, “What improves performance?” the team begins asking, “Have we checked everything that could reasonably be better?” 

This shift is understandable. Calibration decisions have consequences. A small change in spark or lambda can affect knock margin, exhaust temperatures, or aftertreatment behaviour in ways that only become visible under certain conditions. The cost of missing a better solution feels significant. 

Without a structured way to quantify how much opportunity remains, the natural response is to continue testing. 

 


The Role of Noise and Repeatability 

One of the challenges in late-stage calibration is that engines are not perfectly deterministic systems. Even in controlled dyno environments, there are sources of variability: thermal state, intake conditions, fuel variation, sensor drift, transient effects, and cycle-to-cycle combustion differences. 

As improvements become smaller, they approach the repeatability limits of the test environment. A 0.2–0.3% shift in BSFC may be meaningful, but it may also sit close to the measurement floor. When deltas are of the same order as noise, interpreting results becomes less straightforward. 

This creates ambiguity. A new test that appears marginally worse may simply reflect variability. A test that appears marginally better may not be statistically significant. To reduce uncertainty, more runs are added. This often takes the form of local sweeps around the current best point. 

The intention is to build confidence. The result is a growing number of runs that provide limited new information. 
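
To make the repeatability problem concrete, here is a minimal sketch of this kind of check, using entirely hypothetical back-to-back BSFC repeats and a standard Welch t-test rather than any particular team's acceptance criterion:

```python
# Illustrative only: is a small BSFC delta distinguishable from repeatability?
import numpy as np
from scipy import stats

# Hypothetical back-to-back repeats at the same operating point [g/kWh]
baseline = np.array([238.1, 238.4, 237.9, 238.3, 238.0])
candidate = np.array([237.6, 238.2, 237.8, 238.1, 237.7])

delta_pct = 100 * (candidate.mean() - baseline.mean()) / baseline.mean()
t_stat, p_value = stats.ttest_ind(candidate, baseline, equal_var=False)

print(f"Observed delta: {delta_pct:+.2f}% BSFC")
print(f"Welch t-test p-value: {p_value:.2f}")
# A p-value well above ~0.05 suggests the delta sits within repeatability:
# the 'slightly better' candidate may simply be noise.
```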

Local Confidence vs Global Coverage 

Another structural issue is dimensionality. Modern powertrains involve many interacting calibration parameters: spark timing, injection timing and strategy, rail pressure, lambda targets, boost control, variable valve timing, EGR rates, throttle behaviour, and often hybrid coordination or aftertreatment strategy.

These variables interact in nonlinear ways across speed, load, and transient conditions. 

 

Figure: Local density vs design space coverage

 

In practice, late-stage testing tends to focus on local regions around the current best candidate. This increases density in that neighbourhood and improves confidence that small variations nearby are inferior. 

However, dense local testing does not necessarily imply broad coverage of the full design space. It is possible to have high confidence locally while remaining uncertain globally. In other words, you may have demonstrated that nothing within ±2° spark is better, but you have not formally demonstrated that no other region in the broader space offers improvement. 

This is not a criticism of calibration practice; it is a limitation of manual exploration in high-dimensional systems. Human reasoning works well with two or three interacting variables. Engine behaviour does not limit itself to two or three dimensions.
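
One way to see the gap between local density and global coverage is to measure how far the worst-covered region of the design space sits from any tested point. The sketch below does this for a hypothetical two-parameter space normalised to [0, 1], using a simple fill-distance metric; the points and dimensions are assumptions chosen purely for illustration.

```python
# Illustrative sketch: local density does not imply global coverage.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# A dense local sweep around the current best point (0.6, 0.4) in a
# hypothetical 2-D normalised design space (e.g. spark and lambda)...
local_sweep = np.array([0.6, 0.4]) + 0.05 * rng.standard_normal((40, 2))
# ...plus a handful of earlier screening points elsewhere in the space.
screening = rng.random((8, 2))
tested = np.vstack([local_sweep, screening])

# Fill distance: distance from the worst-covered candidate to its nearest test.
grid = np.linspace(0, 1, 50)
candidates = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
nearest, _ = cKDTree(tested).query(candidates)

print(f"{len(tested)} tests run, fill distance = {nearest.max():.2f}")
# A large fill distance means whole regions of the map remain effectively
# unexplored, however many points sit inside the local sweep.
```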

 


The Asymptote Effect

If you plot performance against test iteration, most calibration campaigns follow a similar pattern. Early iterations show clear improvement. Mid-stage iterations show smaller gains. Eventually, the “best so far” flattens. 
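
Computing that "best so far" curve from a campaign log is straightforward; the sketch below uses a made-up sequence of BSFC results purely to show the shape.

```python
# Minimal sketch of the 'best so far' curve most campaigns produce.
# bsfc_log is a hypothetical sequence of results in test order [g/kWh].
import numpy as np

bsfc_log = np.array([249.0, 246.5, 244.2, 243.1, 242.8, 242.9, 242.7, 242.8, 242.7])
best_so_far = np.minimum.accumulate(bsfc_log)   # lower BSFC is better

for i, (run, best) in enumerate(zip(bsfc_log, best_so_far), start=1):
    print(f"test {i}: {run:6.1f}   best so far: {best:6.1f}")
# The curve flattens after a handful of tests, but the plateau alone does not
# say whether further improvement is unlikely or simply untested.
```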

What is rarely measured directly is the probability that meaningful improvement still exists beyond that plateau. 

A flat best-so-far curve can mean several things: 

  • The global optimum has effectively been reached within practical tolerance.
  • The team is exploring locally around a strong candidate but not testing elsewhere.
  • Improvements are below repeatability limits and difficult to detect.

From the outside, these scenarios look similar. Without a model of the response surface and its uncertainty, it is difficult to distinguish between them. 


Figure: Information gain

 


Reframing the Late-Stage Question 

Instead of asking whether the latest test improved performance, a more useful question in late-stage calibration is: 

What is the probability that a calibration exists which improves the current best by more than a defined, meaningful threshold? 

That threshold might be 0.3% BSFC, a defined emissions margin, or a composite KPI that reflects programme priorities. The specific number will vary, but most teams implicitly operate with such a threshold. 

Once that threshold is explicit, the decision becomes clearer. If the probability of exceeding it is very low, and the cost of another dyno run is known, then the decision to stop or continue can be framed as a comparison between expected benefit and cost. 
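
As a back-of-the-envelope illustration of that comparison, with every number below a hypothetical placeholder (in practice the probability would come from a model of the response surface, as discussed in the next section):

```python
# Illustrative stop/continue check with assumed numbers.
p_meaningful = 0.02        # assumed probability that a >0.3% BSFC gain still exists
value_if_found = 150_000   # assumed programme value of capturing that gain
cost_per_run = 4_000       # assumed fully loaded cost of one additional dyno run
runs_planned = 10

expected_benefit = p_meaningful * value_if_found
expected_cost = runs_planned * cost_per_run

print(f"Expected benefit of further search: ~{expected_benefit:,.0f}")
print(f"Expected cost of {runs_planned} more runs:  ~{expected_cost:,.0f}")
# If the expected benefit does not clearly exceed the cost, 'a few more tests'
# is no longer the obviously safe default.
```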

Without this framing, the safer path almost always appears to be “run a few more tests.” 

 


A More Structured Approach 

Calibration teams we work with often describe this late-stage phase as the most difficult to manage. The early programme feels productive. The late programme feels uncertain. 

A structured approach involves three key elements. 

First, building an empirical model of how performance and constraints respond to calibration inputs across the tested space. This does not replace engineering knowledge; it formalises it using data already generated during the campaign. 

Second, quantifying uncertainty across the design space. This allows the team to identify not just where performance is high, but where knowledge is weak. Regions of high uncertainty are candidates for exploration; regions of high confidence and low expected improvement are not. 

Third, using that model to recommend the next test based on expected information gain or expected improvement, rather than proximity to the current best point. This reduces the likelihood of becoming trapped in local optima and improves overall design space coverage. 

When the model indicates that the probability of achieving an improvement greater than the defined threshold is low, further testing can be stopped with a defensible rationale. 
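
A minimal sketch of what those three elements can look like is shown below, using a Gaussian-process surrogate and an expected-improvement acquisition on synthetic data. This is a generic illustration of the technique, not Monolith's implementation; the data, the Matérn kernel, the noise level, and the 1% stopping threshold are all assumptions.

```python
# Sketch: fit a surrogate, quantify uncertainty, recommend the next test, and
# check whether a meaningful improvement is still plausible. Illustrative only.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)

# Hypothetical campaign: 30 tests over two normalised inputs (e.g. spark, lambda),
# measuring BSFC [g/kWh] with some repeatability noise.
X = rng.random((30, 2))
bsfc = (240 + 8 * ((X[:, 0] - 0.55) ** 2 + (X[:, 1] - 0.4) ** 2)
        + 0.3 * rng.standard_normal(30))

# 1) Empirical model of the response, with noise set to the assumed repeatability.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.3 ** 2, normalize_y=True)
gp.fit(X, bsfc)

# 2) Predicted mean and uncertainty across a dense grid of the design space.
g = np.linspace(0, 1, 40)
candidates = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
mu, sigma = gp.predict(candidates, return_std=True)
sigma = np.maximum(sigma, 1e-9)           # guard against numerically zero std

# 3) Probability and expected size of beating the best by a meaningful margin.
best = bsfc.min()
target = best * (1 - 0.003)               # 'meaningful' = at least 0.3% lower BSFC
z = (target - mu) / sigma
p_better = norm.cdf(z)
expected_improvement = (target - mu) * norm.cdf(z) + sigma * norm.pdf(z)

if p_better.max() < 0.01:                 # illustrative stopping threshold
    print("Stop: <1% chance any candidate beats the best by the margin that matters.")
else:
    nxt = candidates[np.argmax(expected_improvement)]
    print(f"Recommend next test at normalised settings {np.round(nxt, 2)}")
```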

 

 

Monolith has designed a tool that balances exploration and exploitation to cover engineering design spaces in the most efficient way.

Discover: Monolith's Next Test Recommender

 


What Changes in Practice 

When calibration programmes adopt this type of approach, the behaviour in late-stage reviews changes noticeably. 

Instead of presenting a series of local sweeps to demonstrate that nearby points are worse, teams can present: 

  • An assessment of overall design space coverage. 
  • An estimate of remaining headroom relative to defined performance thresholds.
  • A comparison between expected improvement and the cost of additional testing. 

The decision to stop becomes less about comfort and more about quantified risk. 

 

Engineering the Decision to Stop 

The proof loop in late-stage engine calibration, running additional tests to confirm the current best rather than improve it, is common because it is a practical way to manage uncertainty. It is not a failure of engineering judgement. It is what happens when remaining opportunity cannot be quantified directly.

With modern data-driven modelling and next-test recommendation techniques, it is possible to make that uncertainty explicit and reduce unnecessary iterations without compromising performance confidence. 

If this reflects challenges you have seen in your own calibration programmes, we are always open to discussing how these approaches have been applied in practice, using real dyno data and real constraints. 

 

Book some time with our team to learn more about how we're applying this approach to achieve 90% reductions in calibration timelines, cutting months of test time down to days. 

 

Request a demo

 

 


About the author

An experienced Product Marketing Engineer translating advances in AI into practical insights for battery development. At Monolith, I work across product, engineering, and commercial teams to ensure innovations in our platform deliver real-world value for OEMs. My background includes an MEng in Mechanical Engineering from Imperial College London, with a specialisation in battery testing, and hands-on experience at a battery energy storage startup in pack design, testing, and system integration. 
