Accurate Machine Learning Models are not Enough

Pitfalls when Communicating Data-Driven Insights to Decision Makers

Machine learning-based predictive models are increasingly being adopted by decision makers across small and large organizations. These automated models produce predictions that support tasks ranging from predictive machine maintenance to customer churn prediction.

However, with the ever-increasing adoption of these methods, product owners often focus primarily on a single numerical performance metric, such as the accuracy of the deployed machine learning model. This increases the risk of misinterpretation and can lead to suboptimal business decisions based on these tools.
 
In this post, we discuss an example of a common pitfall caused by confusion between the reported accuracy and the actual predictive power of such models. In future blog posts, we will then discuss ideas we are developing at Decision Labs AB for interface designs that communicate these differences to business end-users who may not themselves be data-science experts.

Is High Accuracy Alone Enough for Predictive Maintenance?

Your team has bought a new deep learning-based solution to predict when any one of the 10,000 machines in your production line will fail. The team is excited by the new predictive maintenance AI model, which reports "98% prediction accuracy" in predicting whether a given machine will fail in the coming month.

The new ML system is presented to the CEO, who needs to implement a business decision based on it. The CEO is told that the cost of not replacing a machine that will fail is 100,000 USD, since a failure shuts down the production line, while the cost of replacing a machine before a failure occurs is only 50,000 USD.
 
How would you implement the decision process for predictive maintenance based on this new machine learning solution?  

Discussion of the problem

First, we observe that the statement "98% accuracy" has to be broken down into the two components of accuracy of the machine learning-based system: sensitivity and specificity. So, more precisely, this is what the engineering test report may actually state:

Latest Predictive Maintenance Algorithm Test Report:
Sensitivity: 99%
Specificity: 98%

Explanation: 

The sensitivity of the predictive maintenance solution reports the percentage of machines that will actually fail that have been correctly classified as "will fail" by your new machine learning solution.

The specificity of the predictive maintenance solution reports the percentage of machines that will not fail that have been correctly classified as "will not fail" by your new machine learning solution.
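As a concrete illustration, sensitivity and specificity could be computed from held-out test labels roughly as follows. This is our own minimal sketch with made-up toy labels, not the vendor's evaluation code:

```python
# Illustrative sketch: sensitivity and specificity from binary labels,
# where 1 = "will fail" and 0 = "will not fail".

def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = failure)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of actual failures flagged "will fail"
    specificity = tn / (tn + fp)  # share of non-failures flagged "will not fail"
    return sensitivity, specificity

# Toy example with made-up labels:
actual    = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(sensitivity_specificity(actual, predicted))  # (0.666..., 0.857...)
```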

What about the robustness and reliability of these numbers, you may wonder? Indeed, we can already note that a good user interface design should report further details on how sensitivity and specificity were estimated, and provide intuitive hints on whether the end-user can trust these measures; after all, these numbers are themselves estimates based on past performance. When accuracy is reported based on outdated test datasets or a limited number of evaluations, neither measure may be trustworthy to begin with. How to communicate confidence in the reported accuracy to users of varying backgrounds is, however, a topic we will investigate in a future blog post. For the purpose of this discussion, let us assume that we can trust these accuracy estimates.

Ask yourself: based on this information, would you be inclined to replace a machine in the production line whenever a failure is predicted? What additional questions would you ask your engineers before deciding on a machine maintenance/replacement strategy?

Perhaps surprisingly to some end-users, the ML interface is still missing important information that the business owner needs to make an informed decision! In fact, no decision should be taken at all until this additional information is provided.

The Missing Insight

The key insight not communicated by the machine learning model is that the correct business decision on when to replace a production machine depends crucially on the overall prior likelihood of machine failure. Consider Scenario 1, where only 1% of your machines are expected to fail per month. For 10,000 machines, we would then expect the following outcome over one month of operation:

Scenario 1: Expected outcomes after 1 month with 10,000 machines in operation
Total failures: 100 (1%)
Failures correctly predicted as failure: 99 (99% of 100)
Failures incorrectly predicted as not failing: 1 (1% of 100)
Total non-failures: 9,900 (99%)
Non-failures correctly predicted as not failing: 9,702 (98% of 9,900)
Non-failures incorrectly predicted as failing: 198 (2% of 9,900)

From this analysis, we can see immediately that, given that the model predicts a failure, a failure will actually have occurred in only 99/(99+198) = 33% of cases. This is much lower than most non-expert end-users would have guessed. The short sketch below spells out the calculation.
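For readers who prefer code to tables, here is a minimal Python sketch of this base-rate calculation (Bayes' rule). The variable names are our own; the numbers are exactly those of Scenario 1:

```python
# Expected confusion counts and P(failure | predicted failure) for Scenario 1.

n_machines   = 10_000
failure_rate = 0.01   # 1% of machines fail per month
sensitivity  = 0.99
specificity  = 0.98

failures     = n_machines * failure_rate          # 100
non_failures = n_machines - failures              # 9,900

true_positives  = sensitivity * failures          # 99 correctly flagged
false_negatives = failures - true_positives       # 1 missed failure
true_negatives  = specificity * non_failures      # 9,702 correctly cleared
false_positives = non_failures - true_negatives   # 198 false alarms

# Probability that a machine actually fails, given the model predicts failure:
ppv = true_positives / (true_positives + false_positives)
print(f"P(failure | predicted failure) = {ppv:.1%}")  # ~33.3%
```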

We can analyse the expected business value of this example as follows.

Business Value Analysis, 1% likelihood of failure
Strategy 1: Replace a machine whenever failure is predicted
Business cost: 50,000 * (99 + 198) + 100,000 * 1 = 14,950,000 USD
Strategy 2: Ignore the ML system and replace upon failure
Business cost: 100 * 100,000 = 10,000,000 USD

We conclude that Strategy 2 is best for minimizing business cost: we should completely ignore the ML-based predictive maintenance system. If instead we observe a failure rate of 2%, we can repeat the same analysis to obtain:

Scenario 2: Expected outcomes after 1 month with 10,000 machines in operation
Total failures: 200 (2%)
Failures correctly predicted as failure: 198 (99% of 200)
Failures incorrectly predicted as not failing: 2 (1% of 200)
Total non-failures: 9,800 (98%)
Non-failures correctly predicted as not failing: 9,604 (98% of 9,800)
Non-failures incorrectly predicted as failing: 196 (2% of 9,800)

From this analysis, we can see that, given that the model predicts a failure, a failure will actually have occurred in 198/(198+196) = 50.3% of cases. We repeat the business value analysis for this scenario:

Business Value Analysis, 2% likelihood of failure
Strategy 1: Replace a machine whenever failure is predicted
Business cost: 50,000 * (198 + 196) + 100,000 * 2 = 19,900,000 USD
Strategy 2: Ignore the ML system and replace upon failure
Business cost: 200 * 100,000 = 20,000,000 USD

In this case, Strategy 1 is narrowly better than Strategy 2, and Strategy 1 becomes a clear winner the higher the failure rate climbs. A resulting observation, which we will bookmark for our next blog post, is that the system should communicate this relationship between failure rate and business value to the end user in a clear graphical and textual manner! The sketch below makes the relationship explicit.
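As a minimal illustration (the cost figures are those stated above; the function names and the break-even computation are our own additions), the expected monthly cost of each strategy can be written as a function of the failure rate:

```python
# Expected monthly cost of each strategy as a function of the failure rate p.
# Cost assumptions are those stated above; everything else is illustrative.

N                = 10_000   # machines in operation
COST_FAILURE     = 100_000  # USD: an unplanned failure shuts down the line
COST_REPLACEMENT = 50_000   # USD: a planned replacement before failure
SENSITIVITY      = 0.99
SPECIFICITY      = 0.98

def strategy1_cost(p):
    """Replace a machine whenever the model predicts failure."""
    failures, non_failures = N * p, N * (1 - p)
    replacements = SENSITIVITY * failures + (1 - SPECIFICITY) * non_failures
    missed       = (1 - SENSITIVITY) * failures  # false negatives still fail
    return COST_REPLACEMENT * replacements + COST_FAILURE * missed

def strategy2_cost(p):
    """Ignore the model; replace only after a machine fails."""
    return COST_FAILURE * N * p

for p in (0.01, 0.02, 0.03):
    print(f"p={p:.0%}: strategy 1 = {strategy1_cost(p):,.0f} USD, "
          f"strategy 2 = {strategy2_cost(p):,.0f} USD")
# p=1%: strategy 1 = 14,950,000 USD, strategy 2 = 10,000,000 USD
# p=2%: strategy 1 = 19,900,000 USD, strategy 2 = 20,000,000 USD
# p=3%: strategy 1 = 24,850,000 USD, strategy 2 = 30,000,000 USD
```

Under these assumptions, the break-even failure rate sits just below 2% (about 1.98%), which is why Strategy 1 wins only narrowly in Scenario 2 and pulls clearly ahead at 3%.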

While the distinction between accuracy and predictive power is standard practice in data science, business decision makers may not always be aware of it, and numerical accuracy benchmarks of decision support systems carry a real risk of creating a false sense of security. Note also that only a small change is required to shift the underlying probability of failure: perhaps the company introduces a faster production line that puts additional strain on the machines and increases their probability of failure from 1% to 3%. This would immediately require a revision of the predictive maintenance strategy and highlights the need for constant monitoring within a structured decision support framework. We can conclude that interface designs focusing on static accuracy alone are not sufficient here and need to be supplemented with end-user guidance and a well thought-out framework for data-driven decision-making.

The discussed problem of misguided reliance on accuracy is also commonplace in medicine, where it occurs in the diagnosis of diseases. The BBC article "Do doctors understand test results?" (https://www.bbc.com/news/magazine-28166019) highlights the challenges of educating medical professionals on these differences. With the widespread adoption of machine learning today, there is an increased need to innovate in how we use complex support systems and how they are integrated with business decision processes, to avoid similar misinterpretations in the business context.

Conclusion

We have discussed an example where presenting accuracy as the key KPI of a machine learning-based decision support system is not sufficient and may in fact mislead decision makers into making incorrect decisions. We observe that user interfaces need to clearly communicate the business implications of model outputs and take into account the background knowledge available to business decision makers.
 
At Decision Labs AB, we focus on developing decision support systems and user interfaces that present machine learning-based decision input in the context of business value, in order to mitigate these risks. In the next blog post, we will discuss strategies for improving user interfaces for the predictive maintenance problem, with the goal of helping end-users from a wide range of backgrounds to maximize business value.