How Much Test Data Do I Need To Use Machine Learning?
In the realm of machine learning, where algorithms have changed the way engineers approach data and make predictions, one crucial question often lingers in the minds of both beginners and seasoned engineers: "How much test data do I need to collect?" It's a query that sits at the core of any machine learning project, as the right answer can be the difference between a model that thrives and one that falls short of expectations.
In a traditional testing approach, the engineering team may consider many different factors and conditions with wide ranges to be validated for a new design. The theoretical test matrix may become too large to implement in a physical test plan.
Test engineers are forced to find the best combination of test parameters that will result in the optimal design and validate it against key performance requirements.
In this blog post, we embark on a journey to demystify the enigma surrounding the amount of test data required for effective machine learning. As we dive into the intricacies of data science, we'll explore the factors that influence this pivotal decision and guide you through the process of striking the perfect balance.
The longer you are in the business and the more design iterations you have completed, the bigger your database becomes. You might look back on decades of engineering experience: besides having designed and produced great products for your customers, you have piled up a large amount of data. That is a treasure of engineering wealth, and not making use of it is like hiding your money under your pillow. Our recommendation is to use algorithms to learn from existing data, build strong and accurate ML models, run optimisations and improve your way of engineering. If the data was generated without ML in mind, make sure to explore it carefully, as your design space sampling may not be optimal.
Organisations designing thousands of products with various options and permutations can significantly shorten and automate the design process for new products using AI. This is every engineering CIO's or CTO's dream. In this scenario, you enter customer requirements into a software utility, where an AI algorithm leverages your product data management (PDM) system and creates a new product or component for these requirements automatically. We have seen this work for complex systems like EV batteries and fuel cells, and for components like sealing solutions, pumps and bearings, with thousands of design permutations and tests every year.
You might have a fair amount of data (maybe a few hundred points), and you are wondering if that is enough to train a good model. Unfortunately, there is no magic threshold above which you are guaranteed a model with good accuracy. Although the amount of data is one of the most powerful levers for increasing accuracy, the result also depends on other things, such as how non-linear the problem is and how many dimensions (input parameters) there are. The way to find out is to train several models and assess their accuracy on a separate, held-out test set. You might find, for example, that for your problem a random forest or a Gaussian process performs better than a neural network.
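The "train models and check them on a separate test set" step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, there is a single input parameter, and two deliberately simple models (a straight-line fit and a k-nearest-neighbours average) stand in for the random forest, Gaussian process or neural network you would normally get from an ML library.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle (x, y) pairs and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_linear(train):
    """Ordinary least squares for a single input: y = intercept + slope * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(train, k=5):
    """k-nearest-neighbours regression: average the k closest training targets."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def rmse(model, test):
    """Root-mean-square error of a fitted model on unseen (x, y) pairs."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in test) / len(test))


# Synthetic, made-up data: a non-linear response with measurement noise.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 6.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.1)))

train, test = train_test_split(data)
scores = {
    "linear": rmse(fit_linear(train), test),
    "knn": rmse(fit_knn(train), test),
}
# Report models from best to worst on the held-out data.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: test RMSE = {score:.3f}")
```

The important part is that the error is measured only on points the model never saw during training. Repeating the same comparison with progressively larger training subsets is one way to see whether more data is still improving accuracy.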
Organisations that have run 200-500 tests can already build recommender systems that deliver useful insights into which other solutions or design approaches are worth investigating. You can run targeted optimisation codes that highlight the specific design factors needed to meet new performance or compliance requirements. This is the typical size of a design of experiments (DoE) for optimisation studies based on test data, so we tend to see a lot of those at this data level.
AI models can predict the outcome of repetitive processes with very good accuracy, sometimes as good as or even better than simplified physics-based simulations, provided the data supplied to the model is of sufficient quality. Using these more accurate AI models, companies can significantly reduce time-to-market for new designs by optimising their test plans to reduce or eliminate tests in the process.
Do you want to use insights from product testing or development to help your engineers make better decisions faster? Have you noticed considerable repetition in the tests you've been conducting? You can learn quite a few things using algorithmic methods. You can detect correlations, investigate failure scenarios, and build AI models that recommend what to test next. At this data level, AI models can be a productive extension of the engineering expertise you already have.
You might, like a lot of engineering companies, have only a small amount of data, perhaps because acquiring it is costly (expensive simulations, physical tests on a prototype, etc.). This might be historical data, or tests that you must run anyway, such as those required for regulatory purposes. In that case, companies often want to reach an optimised design with a minimum number of additional tests or simulations. The truth is that even with a small amount of data, AI can be extremely valuable. Methods like Bayesian optimisation, with acquisition functions such as expected improvement or lower confidence bound, can tell you which tests are the most valuable to run next so that you understand and optimise your product faster.
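To make the two acquisition functions named above concrete, here is a minimal sketch in plain Python. It assumes a minimisation problem and a surrogate model (for example a Gaussian process) that returns a predicted mean and standard deviation for each untested design. The surrogate itself is not shown; the candidate names and numbers below are invented for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far for minimisation,
    given a Gaussian prediction with the supplied mean and std."""
    improvement = best_so_far - mean
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)


def lower_confidence_bound(mean, std, kappa=2.0):
    """Optimistic estimate for minimisation; smaller is more promising.
    kappa (an assumed default) controls how much uncertainty is rewarded."""
    return mean - kappa * std


# Hypothetical surrogate predictions (mean, std) for three untested designs.
candidates = {
    "design_a": (1.00, 0.05),
    "design_b": (1.10, 0.40),
    "design_c": (0.98, 0.01),
}
best_so_far = 1.00  # best (lowest) value measured so far, also made up

next_by_ei = max(
    candidates,
    key=lambda name: expected_improvement(*candidates[name], best_so_far),
)
next_by_lcb = min(
    candidates,
    key=lambda name: lower_confidence_bound(*candidates[name]),
)
print("Next test by expected improvement:", next_by_ei)
print("Next test by lower confidence bound:", next_by_lcb)
```

Both criteria trade off a good predicted value against high uncertainty, which is why a design with a slightly worse predicted mean but a large standard deviation can still be the most informative test to run next.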
You might be very new in the business, with (nearly) no data available to build Machine Learning (ML) models. This means it might take longer before you get all the value AI can offer. On the other hand, there is also good news: eventually, you will produce data by performing tests.
As you are starting from scratch, you can plan your data creation and acquisition strategy to match the needs of ML and AI.
You can choose your data points so that they cover your entire design space. Grid sampling, Latin hypercube sampling and orthogonal sampling are well-established methods for doing this. This way, you start building a database that has the needs of ML in mind from the beginning.
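As an illustration of one of these methods, the sketch below generates a Latin hypercube test plan in plain Python. The two input parameters and their ranges are hypothetical, and a dedicated library would normally offer more refined variants (for example with better space-filling properties).

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points. Each input range is split into n_samples
    equal strata, and every stratum is used exactly once per input."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across inputs
        width = (high - low) / n_samples
        # Draw one uniformly random value inside each stratum.
        columns.append([low + (s + rng.random()) * width for s in strata])
    # Transpose the per-input columns into one tuple per test point.
    return list(zip(*columns))


# Hypothetical design space: temperature in degC and pressure in bar.
bounds = [(20.0, 80.0), (1.0, 5.0)]
plan = latin_hypercube(10, bounds)
for temperature, pressure in plan:
    print(f"{temperature:6.2f} degC  {pressure:5.2f} bar")
```

With ten samples, every tenth of the temperature range and every tenth of the pressure range contains exactly one test point, so no region of either input is left unexplored, which a purely random plan cannot guarantee.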
You can accelerate new product design with some thoughtful organisation of your product design and test data. For example, you can make new engineers more productive simply by providing access to previous test information. Say you are a test engineer who has just joined a company, and you need to test a new line of battery systems. The first step would be to review the most recent tests the company designed and learn from them, so that you do not overtest and exhaust resources. At the same time, you do not want to risk undertesting, as these are safety-critical systems. The goal is a test plan that provides maximum information gain.
You can make this search-and-learn process a lot easier if you save the data in a Monolith-friendly, commonly used format such as .csv. Everything from data pre-processing and training the models to exploring the results and sharing them with customers or within your organisation then becomes straightforward.
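As a rough sketch of what "saving the data as .csv" can look like, the snippet below writes one row per test with one column per parameter or measured result, then reads it back. The column names and values are invented for illustration, and nothing here reflects a specific layout required by Monolith or any other tool.

```python
import csv
import io

# Hypothetical column names and values; one row per physical test.
FIELDS = ["test_id", "cell_temperature_degC", "charge_rate_C", "capacity_fade_pct"]
records = [
    {"test_id": "T001", "cell_temperature_degC": 25.0,
     "charge_rate_C": 1.0, "capacity_fade_pct": 2.1},
    {"test_id": "T002", "cell_temperature_degC": 45.0,
     "charge_rate_C": 2.0, "capacity_fade_pct": 4.8},
]


def write_tests(stream, rows):
    """Write a header line followed by one CSV line per test record."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_tests(stream):
    """Read rows back, converting every column except test_id to float."""
    rows = []
    for row in csv.DictReader(stream):
        rows.append({key: (value if key == "test_id" else float(value))
                     for key, value in row.items()})
    return rows


# An in-memory buffer stands in for a real file opened with
# open("tests.csv", "w", newline="").
buffer = io.StringIO()
write_tests(buffer, records)
buffer.seek(0)
loaded = read_tests(buffer)
print(loaded)
```

Keeping units in the column names and one test per row means the same file can be opened in a spreadsheet, loaded by an ML tool, or handed to a colleague without extra explanation.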
In conclusion, the amount of test data required for effective machine learning varies depending on the specific context and the stage of your engineering projects. For those with extensive historical data, leveraging machine learning can lead to significant advancements in design optimisation and automation. In cases where you have a moderate amount of data, recommender systems and optimisation studies become feasible, enabling improved decision-making and time-to-market reduction.
Even with a relatively small dataset, AI models can offer valuable insights and help engineers make informed choices. For those just starting out with little to no data, careful planning and systematic data acquisition strategies can set the stage for future AI utilisation, while organised data management can make the transition smoother and more productive. Regardless of the amount of data at your disposal, embracing machine learning and AI can enhance your engineering processes, increase efficiency, and ultimately lead to more successful product designs.
One challenge that engineering organisations face is generating accurate predictions during a given design cycle while making use of data from various testing procedures under tens to hundreds of operating conditions. The impact and cost of errors increase significantly as the development workflow progresses.
At the same time, the chance of a faulty design is much higher in the earliest stages of the process. The data available to engineers at this stage is therefore vital, and this is typically where AI and machine learning can improve the accuracy of early-stage predictions.