Google Cloud and storage hardware manufacturer Seagate joined forces to develop a machine learning system that can predict hard drive failures.
The system can forecast the probability of a recurring failing disk: a disk that experiences three or more problems in 30 days.
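The "recurring failing disk" definition above can be illustrated with a small sliding-window check. This is only a sketch of the stated rule, not the production logic: the function name, the list-of-dates input, and the choice to treat problems exactly 30 days apart as falling inside the window are all assumptions made for illustration.

```python
from datetime import date, timedelta


def is_recurring_failure(problem_dates, window_days=30, threshold=3):
    """Return True if `threshold` or more problems fall within any
    `window_days`-day span (hypothetical helper, not Google's code)."""
    dates = sorted(problem_dates)
    window = timedelta(days=window_days)
    start = 0
    for end in range(len(dates)):
        # Move the left edge forward until the span fits in the window.
        # Assumption: problems exactly `window_days` apart still count.
        while dates[end] - dates[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# Three problems within 25 days: flagged as recurring.
print(is_recurring_failure([date(2021, 1, 3), date(2021, 1, 12), date(2021, 1, 28)]))
```

A disk with problems spread months apart would not be flagged under this rule, however many problems it accumulates overall.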
Google Cloud hopes to save time and money by using AI, as a hard drive failure could potentially cause mass outages across a multitude of products and services.
Business case ‘only getting stronger’
Google Cloud said that previously, once a disk was flagged for a problem, repairs were conducted on-site using software, which required draining the data from the drive, isolating the drive, running diagnostics, and then re-introducing it to traffic.
The newly developed ML system can automatically export and analyze data from a disk prior to repair, and uses the findings to predict the probability of the failure happening again.
The system was built using several tools and Google Cloud services, including Terraform, BigQuery, and Dataflow.
The custom Transformer-based TensorFlow model was built and trained using AI Platform Notebooks for experimentation and AutoML Tables for development.
Initially, Google Cloud and Seagate tested two models: one based on AutoML Tables and one custom-built.
The AutoML Tables model proved the more successful, with precision of 98 percent, compared to the custom-developed model’s 70-80 percent.
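For readers unfamiliar with the metric, precision is the share of disks flagged as failing that really were failing. The sketch below shows the calculation; the counts used are hypothetical and chosen only to reproduce a 98 percent figure, since the article does not give the underlying numbers.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("precision is undefined with no positive predictions")
    return true_positives / predicted_positive


# Hypothetical counts: 98 correctly flagged disks, 2 false alarms.
print(f"{precision(98, 2):.0%}")  # prints "98%"
```

Precision says nothing about failing disks the model missed; that is measured separately by recall, which the article does not report.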
“Google's MLOps environment allowed us to create a seamless soup-to-nuts experience, from data ingestion all the way to easy to monitor executive dashboards,” Elias Glavinas, Seagate’s director of quality data analytics, tools, and automation, said.
“AutoML Tables, specifically, proved to be a substantial time and resource saver on the data science side, offering auto feature engineering and hyperparameter tuning, with model prediction results that matched or exceeded our data scientists' manual efforts. Add to that the capability for easy and automated model retraining and deployment, and this turned out to be a very successful project.”
Google Cloud technical program manager Nitin Aggarwal and AI engineer Rostam Dinyari wrote on Google Cloud’s blog that the business case for using an ML-based system to predict HDD failure is “only getting stronger.”
“When engineers have a larger window to identify failing disks, not only can they reduce costs, but they can also prevent problems before they impact end-users. We already have plans to expand the system to support all Seagate drives—and we can't wait to see how this will benefit our OEMs and our customers,” the pair wrote.
For a storage manufacturer like Seagate, one of the most important metrics is MTBF – mean time between failures. Back in 2016, the company faced a class action lawsuit after hosting provider Backblaze reported that it had purchased more than 4,000 units of a certain model, and that a whopping 32 percent suffered critical failure in less than four years – despite promised annualized failure rates of less than 1 percent.
However, Backblaze itself was criticized – for the unusual decision to equip its servers with essentially desktop-grade drives – and the lawsuit appears to have been quietly dismissed.
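To put the Backblaze figures in perspective, the short calculation below estimates how many drives would be expected to fail over four years at a 1 percent annualized failure rate. It assumes a constant failure rate that is independent from year to year, which is a simplification, and the function name is hypothetical.

```python
def cumulative_failure_probability(annual_failure_rate, years):
    """Chance that a drive fails at least once within `years`,
    assuming a constant, independent annual failure rate."""
    return 1.0 - (1.0 - annual_failure_rate) ** years


# A 1 percent annualized failure rate sustained over four years.
print(f"{cumulative_failure_probability(0.01, 4):.2%}")  # prints "3.94%"
```

Under that assumption, roughly 4 percent of drives would fail within four years, far below the 32 percent reported.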