The growth of cloud computing and the corresponding increase in demand for data centres in recent years have been unprecedented. As digital technologies permeate our societies and businesses, downtime has become unacceptable.
Yet the 2019 annual Uptime Institute Global Data Center Survey found that a third of 1,600 respondents have experienced a downtime incident or severe service degradation. Indeed, a number of these incidents resulted in serious business-related financial losses.
And while four out of five said that their most recent outage could have been prevented, respondents still faced a full recovery time of one to four hours, with one in three reporting a recovery time of five hours or longer.
A Critical to Reliability Strategy
So how can colocation providers and hyperscale providers ensure that their digital systems stay up? According to Mark Bidinger of Schneider Electric, one place to start is to recognise the difference between product quality and product reliability. A quality uninterruptible power supply (UPS) might work fine immediately after it is installed. However, this offers no hint about how long it can stay failure-free in a production environment, or how quickly potential failures get rectified.
The challenge mounts within the context of the data centre. How does one ensure that scores of UPS units, switchgear, backup generators and other power-centric equipment that rely on each other will continue working, so that hyperscalers and colocation providers can deliver on their reliability promise?
The solution is to embrace a formulaic approach to ensure that reliability standards are met, explained Andy Durand, a customer advocate with Schneider Electric’s Customer Satisfaction and Quality team.
By leveraging data gathered from physical infrastructure assets and their performance in the field, stakeholders can analyse a fleet of assets to know how long particular systems (and groups of interrelated systems) can run without failures.
This data can be collected by Schneider Electric’s EcoStruxure platform, providing a baseline of actual time-until-failure metrics to better understand true system reliability.
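As a rough illustration of what a time-until-failure baseline might look like, the Python sketch below estimates mean time between failures (MTBF) for one asset type across a fleet, then combines assets that depend on each other. The record layout, asset names and figures are hypothetical and are not drawn from EcoStruxure or any Schneider Electric interface. The series calculation assumes constant failure rates and that the group fails as soon as any one member fails.

```python
from dataclasses import dataclass


@dataclass
class AssetRecord:
    asset_id: str           # hypothetical identifier
    asset_type: str         # e.g. "UPS", "generator"
    operating_hours: float  # hours in service during the observation window
    failures: int           # failures recorded in that window


def fleet_mtbf(records, asset_type):
    """Estimate MTBF in hours: total operating hours / total failures."""
    selected = [r for r in records if r.asset_type == asset_type]
    hours = sum(r.operating_hours for r in selected)
    failures = sum(r.failures for r in selected)
    if failures == 0:
        return None  # no failures observed, so no estimate from this data
    return hours / failures


def series_mtbf(mtbfs):
    """MTBF of interdependent assets in series.

    Assumes constant failure rates, so the individual rates (1 / MTBF) add up.
    """
    return 1.0 / sum(1.0 / m for m in mtbfs)


# Made-up field data for illustration only.
records = [
    AssetRecord("ups-01", "UPS", 8000.0, 1),
    AssetRecord("ups-02", "UPS", 12000.0, 0),
    AssetRecord("ups-03", "UPS", 10000.0, 2),
    AssetRecord("gen-01", "generator", 5000.0, 1),
]

ups = fleet_mtbf(records, "UPS")
gen = fleet_mtbf(records, "generator")
print(ups)  # 10000.0
print(series_mtbf([ups, gen]))
```

Under these assumptions the combined figure is always lower than that of the weakest member, which is why a group of interrelated systems needs to be assessed together rather than asset by asset.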
Implementing Failure Analysis
To fix affected systems, a sub-process of the Critical to Reliability (CTR) strategy called issue-to-prevention is implemented. It automatically generates repair work orders and supports the relevant processes to dispatch and coordinate service to those systems.
The intent of the CTR process is to pre-empt failures with accurate predictions, document issues when they occur, and rank them in terms of criticality. Failure analysis is hence an important consideration.
The steps for this can be summed up as follows, says Bidinger.
- Establish relevant KPIs around the systems to be tracked.
- Measure the effectiveness and speed of each case dispatched.
- Analyse the data collected to improve the accuracy of reliability forecasts.
- Investigate why systems fail and determine if a systemic issue exists – such as an increase in capacitor failures.
- Improve product designs to eliminate potential defects.
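The measurement and investigation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service cases, the quarter labels, the `factor` threshold and both function names are hypothetical, and a real analysis would draw on far more data than this.

```python
from collections import Counter
from statistics import mean

# Hypothetical closed service cases:
# (quarter, failed component, hours from dispatch to completed repair)
cases = [
    ("2019-Q1", "capacitor", 4.0),
    ("2019-Q1", "fan", 2.0),
    ("2019-Q2", "capacitor", 3.0),
    ("2019-Q2", "capacitor", 5.0),
    ("2019-Q2", "capacitor", 6.0),
    ("2019-Q2", "battery", 2.0),
]


def mean_repair_hours(cases):
    """KPI for dispatch speed: average hours from dispatch to repair."""
    return mean(hours for _, _, hours in cases)


def rising_components(cases, previous, current, factor=2.0):
    """Flag components whose failure count grew by at least `factor`.

    A component with no failures in the previous period is compared
    against a count of one, so a single new failure is not flagged.
    """
    prev = Counter(comp for q, comp, _ in cases if q == previous)
    curr = Counter(comp for q, comp, _ in cases if q == current)
    return sorted(
        comp for comp, n in curr.items() if n >= factor * max(prev[comp], 1)
    )


print(mean_repair_hours(cases))
print(rising_components(cases, "2019-Q1", "2019-Q2"))  # ['capacitor']
```

A flagged component is a prompt for investigation rather than a conclusion: the jump in capacitor failures here would still need root-cause analysis before any design change is made.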
You can read more about how colocation providers are focusing on reliability for hyperscaler and enterprise customers at this Schneider Electric blog.
Article by Michael Kurniawan, Vice President – Singapore, Malaysia & Brunei, Secure Power Division, Schneider Electric