Editor's note: The following is a guest article from Matthew Honaker, principal data scientist at Bsquare.
Predictive analytics is dramatically improving operations across today's industrial organizations connected by the internet of things.
By correctly forecasting future events, businesses can avoid downtime, cut costs, increase revenue and prevent customer service interruptions. However, the complexity of predictive programs puts them at greater risk for failure.
A CIO plays an important role in keeping these initiatives on track and ensuring they achieve a return on investment. Here are five early warning signs CIOs can look out for and correct before it's too late:
No. 1. No specific end goal
Predictions only work if they help contribute to the end goal. So, if an organization doesn't know what outcomes it's aiming for, it is unlikely to generate meaningful predictions.
That's why it's important to define, at the start, both the objective and how achievable predictions will affect it.
For example, a manufacturing business might want to predict daily manufacturing yields and investigate how small changes in personnel or suppliers affect those yields. And an oil and gas business might want to predict pump and compressor failures in order to schedule maintenance in advance and avoid downtime surprises.
The next step is syncing with subject matter experts to determine which predictions will aid in achieving the goal and whether the organization has all of the data required to make those predictions.
In some cases, plans can be made to obtain data that is lacking. At other times, an organization may need to reset goals.
Those that have an established end goal at the outset are better able to design strategies for collecting and managing data, build scalable and robust predictive models, and leverage the value of these predictions across the business.
No. 2. Manually entering data
Ensuring good data quality often requires significant time and effort. An excellent start to achieving better data quality is avoiding manual data entry whenever possible.
Application integration tools are one way to automate data entry and keep typographical errors, alternate spellings and individual idiosyncrasies out of the data.
Careful data preparation is also key to good data quality. This involves clear communication and documentation of how the dataset handles placeholder values, calculation and association logic, and cross-dataset keys.
Data collection and storage should also rely on well-defined industry standards and continuous anomaly detection, as well as statistical validation techniques (such as tracking the frequency and distribution characteristics of incoming and historical data).
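As a minimal sketch of that kind of statistical validation (assuming a pandas/SciPy stack and a hypothetical sensor_reading column), an incoming batch can be checked against a historical baseline before it enters the pipeline:

```python
import pandas as pd
from scipy import stats

def validate_batch(historical: pd.Series, incoming: pd.Series,
                   alpha: float = 0.01) -> dict:
    """Flag an incoming batch whose distribution drifts from the historical baseline."""
    # Two-sample Kolmogorov-Smirnov test compares the two distributions.
    ks_stat, p_value = stats.ks_2samp(historical.dropna(), incoming.dropna())
    return {
        "missing_rate": incoming.isna().mean(),      # frequency of gaps
        "mean_shift": incoming.mean() - historical.mean(),
        "ks_statistic": ks_stat,
        "drift_detected": p_value < alpha,           # distributions no longer match
    }

# Example usage (file and column names are illustrative):
# history = pd.read_parquet("readings_90d.parquet")["sensor_reading"]
# today = pd.read_parquet("readings_today.parquet")["sensor_reading"]
# report = validate_batch(history, today)
```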
No. 3. Big data blindness
Despite the current hype around "big data," an overabundance of data can actually create a host of problems that prevent robust and timely predictive analytics.
In these instances, reducing features and employing data selection and reduction techniques — such as PCA and penalization methods — can provide some relief.
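To make those two techniques concrete, here is a brief sketch, assuming scikit-learn and a generic wide feature matrix (the data below is synthetic stand-in data, not a real industrial dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # stand-in for a wide, mostly irrelevant feature set
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)

# Option 1: PCA keeps only the components explaining 95% of the variance.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LassoCV(cv=5))
pca_model.fit(X, y)

# Option 2: L1 penalization (lasso) drives irrelevant coefficients to zero.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
kept = np.sum(lasso[-1].coef_ != 0)
print(f"lasso retained {kept} of {X.shape[1]} features")
```

Either route shrinks the feature space before modeling, which is usually where the relief from oversized datasets comes from.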
One common misstep is collecting too much data that is unrelated to reaching a goal (see No. 1). If datasets become too large, companies may fall into the trap of developing seemingly excellent predictive models that don't deliver results because they fit high-variance fields and fail to generalize well.
Conversely, if a company tracks a large number of occurrences without robust validation procedures and statistical tests in place, rare events may appear more frequent, and more significant, than they actually are. In either case, validation and testing routines are paramount.
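One such test, sketched below with scipy.stats and entirely hypothetical counts, asks whether an apparent spike in a rare event is genuinely elevated or just noise:

```python
from scipy import stats

# Hypothetical counts: 12 failures observed over 4,000 machine-hours this month,
# against a long-run baseline failure rate of 0.2% per machine-hour.
observed_failures, exposure, baseline_rate = 12, 4000, 0.002

# A one-sided binomial test checks whether 12 failures is genuinely elevated
# or consistent with normal variation around the baseline.
result = stats.binomtest(observed_failures, exposure, baseline_rate,
                         alternative="greater")
print(f"p-value = {result.pvalue:.3f}")  # a large p-value suggests the spike is chance
```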
No. 4. Not discussing data ownership
Privacy and data ownership issues have become increasingly fraught. To ensure privacy, remove personal identifiers whenever possible and implement security measures in all storage and data transfer solutions.
Aggregating data also makes it very difficult to extract enough information about singular events to associate them with the private details of an individual.
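A minimal sketch of that kind of aggregation, assuming pandas and hypothetical column names (site, week, event_id, runtime_hours), publishes only coarse group-level summaries and suppresses groups small enough to identify individuals:

```python
import pandas as pd

def aggregate_for_sharing(df: pd.DataFrame, min_group_size: int = 10) -> pd.DataFrame:
    """Publish coarse aggregates only, suppressing groups too small to stay anonymous."""
    grouped = (df.groupby(["site", "week"])            # coarse keys, no operator IDs
                 .agg(events=("event_id", "count"),
                      avg_runtime=("runtime_hours", "mean")))
    # Drop any group small enough that individual records could be inferred.
    return grouped[grouped["events"] >= min_group_size].reset_index()
```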
In an ideal situation, data ownership questions are settled by clearly defining who owns which records, as well as the circumstances under which others are allowed access.
Wholesale transfer of data or direct insights across projects should be avoided to prevent violation of data integrity and knowledge ownership.
No. 5. Ballooning costs
Because computational time is fairly cheap these days, the main cost in predictive analytics is expert time and attention. Good documentation practices and excellent communication between groups can greatly reduce redundant work and investigation, saving both time and money.
Robust data collection and quality standards will also mitigate time spent on data cleaning and preparation.
Another way to keep a lid on predictive analytics costs is to employ statistically validated sampling strategies when working with large datasets. This will enable rapid prototyping, saving computational costs and preventing minor analytic inefficiencies from propagating and producing larger problems.
Identifying which variables and sample rates are most relevant to achieve the desired business goal helps reduce the size of data for analysis.
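As a rough illustration of that sampling idea, assuming pandas and a hypothetical asset_type column to stratify on, a reproducible fractional sample can preserve the mix of each stratum while shrinking the data used for prototyping:

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, strata: str, frac: float = 0.05,
                      seed: int = 42) -> pd.DataFrame:
    """Draw a reproducible sample that keeps each stratum's share of the data intact."""
    return (df.groupby(strata, group_keys=False)
              .sample(frac=frac, random_state=seed))

# Prototype on roughly 5% of events while preserving the proportion of each asset type.
# events = pd.read_parquet("telemetry.parquet")
# prototype_df = stratified_sample(events, strata="asset_type")
```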