McKinsey: adopting a smart approach to big data
Industrial companies are embracing artificial intelligence (AI) as part of the fourth digital revolution. AI leverages big data; it promises new insights that derive from applying machine learning to datasets with more variables, longer timescales, and higher granularity than ever.
According to a new McKinsey report, using months or even years’ worth of information, analytics models can tease out efficient operating regimes based on controllable variables, such as pump speed, or disturbance variables, such as weather. These insights can be embedded into existing control systems, bundled into a separate advisory tool, or used for performance management.
McKinsey recommends that to succeed with AI, companies should leverage historical data via automation. They will need to adapt their big data into a form amenable to AI. This ‘smart data’ can improve predictive accuracy and support root-cause analysis. Additionally, bolstering and upskilling expert staff to manage the process can result in an EBITDA increase of 5 to 15%.
A common failure mode for companies looking to leverage AI is poor integration of operational expertise into the data-science process. McKinsey advocates applying machine learning only after process data have been analysed, enriched, and transformed with expert-driven data engineering using the following steps:
Define the process
Outline the steps of the process with experts and plant engineers, sketching out physical changes (such as grinding and heating) and chemical changes (such as oxidation and polymerization). Identify critical sensors and instruments, along with their maintenance dates, limits, units of measure, and whether they can be controlled.
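As a rough illustration of what a record of critical sensors might look like, the Python sketch below stores the attributes listed above (maintenance date, limits, unit of measure, controllability). The class name, field names, tags, and limit values are all hypothetical and do not come from the McKinsey report.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class SensorSpec:
    """Metadata for one critical sensor (hypothetical schema)."""
    tag: str
    unit: str
    low_limit: float
    high_limit: float
    controllable: bool
    last_maintenance: Optional[date] = None

    def in_range(self, value: float) -> bool:
        """Return True when a reading lies within the sensor's stated limits."""
        return self.low_limit <= value <= self.high_limit


# Illustrative entries only; the limits and dates are made up.
SENSORS = [
    SensorSpec("pump_speed", "rpm", 0.0, 1800.0, True, date(2023, 1, 15)),
    SensorSpec("ambient_temp", "degC", -40.0, 60.0, False),
]


def controllable_tags(sensors: List[SensorSpec]) -> List[str]:
    """List the tags an operator can actually act on."""
    return [s.tag for s in sensors if s.controllable]
```

Keeping controllable variables (pump speed) apart from disturbance variables (weather) at this stage makes it easier to check later which model findings can be acted on.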
Enrich the data
Raw process data nearly always contain deficiencies. Thus, creating a high-quality dataset should be the focus, rather than striving for the maximum number of observables for training. Teams should be aggressive in removing nonsteady-state information, such as the ramping up and down of equipment, along with data from unrelated plant configurations or operating regimes. Generic methods for treating missing or anomalous data, such as imputing with averages, “clipping” to a maximum, or fitting to an assumed normal distribution, should be avoided. Instead, teams should start with the critical sensors identified by process experts and carefully address data gaps using virtual sensors and physically correct imputations.
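One possible way to remove ramp-up and ramp-down periods is to keep only samples whose recent history is nearly constant. The sketch below is a simple heuristic under that assumption, not a method prescribed by the report; the function name, window length, and threshold are hypothetical, and the readings are invented for illustration.

```python
from statistics import pstdev


def steady_state_mask(values, window=5, max_std=1.0):
    """Flag a sample as steady when the standard deviation of the trailing
    `window` samples (including the current one) is at most `max_std`."""
    mask = []
    for i in range(len(values)):
        if i + 1 < window:
            mask.append(False)  # not enough history to judge
            continue
        recent = values[i + 1 - window : i + 1]
        mask.append(pstdev(recent) <= max_std)
    return mask


# Invented pump-speed readings: a ramp-up followed by steady operation.
speed_rpm = [0, 200, 400, 600, 800, 1000, 1000, 1001, 999, 1000, 1000, 1000]
mask = steady_state_mask(speed_rpm, window=4, max_std=2.0)
steady = [v for v, keep in zip(speed_rpm, mask) if keep]
```

Because the window trails the current sample, the first few readings after the ramp are also dropped. That conservative behaviour matches the advice to be aggressive in removing nonsteady-state data, though the right window and threshold depend on the process.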
Reduce the dimensionality
AI algorithms build a model by matching outputs, known as observables, to a set of inputs, known as features, which consist of raw sensor data or derivations thereof. Generally, the number of observables must greatly exceed the number of features to yield a generalized model. A common data-science approach is to engineer input combinations to produce new features. When combined with the sheer number of sensors available in modern plants, this necessitates a massive number of observations. Instead, teams should pare the features list to include only those inputs that describe the physical process, then apply deterministic equations to create features that intelligently combine sensor information (such as combining mass and flow to yield density). Often, this is an excellent way to reduce the dimensionality of and introduce relationships in the data, which in turn minimizes the number of observables required to adequately train a model.
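The density example can be sketched as a deterministic feature that replaces two raw inputs with one physically meaningful quantity. This sketch assumes that "mass and flow" means a mass flow rate and a volumetric flow rate, which is an interpretation rather than something the report states; the column names and the retained pump-speed input are likewise hypothetical.

```python
def density_feature(mass_flow_kg_s, vol_flow_m3_s):
    """Density (kg/m^3) as mass flow rate divided by volumetric flow rate.

    Returns None when the volumetric flow is not positive, leaving the gap
    for an expert-driven treatment rather than a generic imputation.
    """
    if vol_flow_m3_s <= 0:
        return None
    return mass_flow_kg_s / vol_flow_m3_s


def build_features(row):
    """Collapse two raw sensor readings into one engineered feature and
    keep one controllable input (hypothetical column names)."""
    return {
        "density_kg_m3": density_feature(row["mass_flow_kg_s"], row["vol_flow_m3_s"]),
        "pump_speed_rpm": row["pump_speed_rpm"],
    }


# Invented reading for illustration.
features = build_features(
    {"mass_flow_kg_s": 500.0, "vol_flow_m3_s": 0.5, "pump_speed_rpm": 1200.0}
)
```

Here three raw inputs become two features, and the model no longer has to learn the ratio between the two flows from data.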
Apply machine learning
Industrial processes can be characterized by deterministic and stochastic components. In practice, first principle–based features should provide the deterministic portion, with machine-learning models capturing the statistical portion from ancillary sensors and data. Teams should evaluate features by inspecting their importance and therefore their explanatory power. Ideally, expert-engineered features that capture, for example, the physics of the process should rank among the most important. Overall, the focus should be on creating models that drive plant improvement, as opposed to tuning a model to achieve the highest predictive accuracy. Teams should bear in mind that process data naturally exhibit high correlations. In some cases, model performance can appear excellent, but it is more important to isolate the causal components and controllable variables than to solely rely on correlations. Finally, errors in the underlying sensor data should be evaluated with respect to the objective function. It is not uncommon for data scientists to strive for higher model accuracy only to find that it is limited by sensor accuracy.
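The split between a deterministic portion and a statistical portion can be sketched as a residual model: a first-principles term predicts most of the output, and a simple fitted model explains what is left using an ancillary sensor. Everything specific here is an assumption made for illustration: the proportional relation between flow and pump speed, the coefficient `k`, the choice of ambient temperature as the ancillary input, and the use of a one-variable least-squares line in place of a real machine-learning model.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def physics_flow(speed_rpm, k=0.01):
    """Hypothetical deterministic term: flow proportional to pump speed."""
    return k * speed_rpm


def fit_hybrid(speed, ambient, flow):
    """Fit the statistical portion on the residual left by the physics term."""
    residuals = [f - physics_flow(s) for s, f in zip(speed, flow)]
    slope, intercept = fit_line(ambient, residuals)

    def predict(speed_rpm, ambient_temp):
        return physics_flow(speed_rpm) + slope * ambient_temp + intercept

    return predict


# Invented data in which flow depends on speed plus a small ambient effect.
speed = [1000.0, 1200.0, 1400.0, 1600.0]
ambient = [10.0, 20.0, 15.0, 25.0]
flow = [10.0, 13.0, 14.5, 17.5]
predict = fit_hybrid(speed, ambient, flow)
```

In a setup like this, the physics term would be expected to carry most of the explanatory power, which is the pattern the feature-importance check described above is meant to confirm.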
Implement and validate the models
Impact can be achieved only if models (or their findings) are implemented. Taking action is critical. Teams should continuously review model results with experts by examining important features to ensure they match the physical process, reviewing partial dependence plots (PDPs) to understand causality, and confirming what can actually be controlled. Additional meetings should be set up with operations colleagues to gauge what can be implemented and to agree on baseline performance. It is not uncommon for teams to convey model results in real time to operators in a control room or to engage in on-off testing before investing in production-grade, automated solutions.
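For readers unfamiliar with partial dependence plots, the sketch below computes the values behind one: for each candidate setting of a feature, that feature is overwritten in every row, and the model's predictions are averaged. The function name, the dictionary-based row format, and the toy model are assumptions for illustration; in practice a library implementation would normally be used.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value.

    `predict` takes one row (a dict of feature name to value) and returns a number.
    """
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


# Toy stand-in for a trained model (hypothetical).
def toy_model(row):
    return 2 * row["speed"] + row["temp"]


rows = [{"speed": 1, "temp": 10}, {"speed": 2, "temp": 20}]
curve = partial_dependence(toy_model, rows, "speed", [0, 5])
```

A partial dependence curve shows how the model's average prediction responds to one input. Reviewing it with process experts helps judge whether that response is physically plausible, since the curve alone reflects the model and not necessarily the plant.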
Industrial companies are looking to AI to boost their plant operations—to reduce downtime, proactively schedule maintenance, improve product quality, and so on. However, achieving operational impact from AI is not easy. To be successful, these companies will need to engineer their big data to include knowledge of the operations (such as mass-balance or thermodynamic relationships). They will also need to form cross-functional data-science teams that include employees who are capable of bridging the gap between machine-learning approaches and process knowledge. Once these elements are combined with an agile way of working that advocates iterative improvement and a bias to implement findings, a true transformation can be achieved.