At most companies, two separate organizations contribute to data center management: IT and facilities. The IT department oversees the data center's computer infrastructure and applications, and typically reports up to the company's CIO. The facilities department handles energy and cooling requirements, and typically reports up to the COO or VP of Corporate Real Estate. This divided organizational structure, long the norm among large businesses, often results in poor communication between the people responsible for maintaining workloads and the people responsible for delivering power to them.
Historically, inadequate consultation between IT and facilities has posed little danger to data center availability. Until recently, workloads and power requirements in even the largest data centers were modest enough that IT managers could safely reposition servers and workloads as they wished without putting excessive strain on electrical or cooling systems.
Today's massive server infrastructures, however, are growing larger, hotter and more power-hungry all the time. Moreover, widespread adoption of blade servers and virtualization—which simplify administration and raise server utilization rates but also dramatically increase compute densities and heat generation—has only accelerated these trends. In today's sprawling, searing data centers, moving workloads or hardware around without consulting a facilities engineer could result in overloaded electrical feeds or overwhelmed HVAC systems, which could in turn bring down critical systems.
Unfortunately, while data centers themselves have evolved significantly in recent years, data center organizational structures have not. IT and facilities remain isolated from each other and too often fail to communicate adequately about important operational matters.
Best practice: To decrease the incidence of power-related downtime, businesses should establish clearly defined and documented procedures for how and when IT managers and facilities managers consult with one another before implementing data center modifications.
To further facilitate communication between IT and facilities, companies should also consider changing their organizational chart such that IT and facilities report up to the same C-level executive. This can make enforcing interaction between IT and facilities personnel easier by subjecting both organizations to a common set of expectations and a common reporting structure.
At many companies, short-term and long-term priorities are in conflict during the construction or renovation of a data center. Senior executives generally urge the people responsible for building data centers to hold down costs and shorten completion times. As a result, supply chain participants, engineers, contractors and project managers on data center construction projects tend to make equipment selections based on who submitted the lowest bid and promised the quickest delivery.
The people responsible for operating data centers, however, have a different set of priorities that are often better aligned with the company's long-term interests. Lowest-bid hardware does indeed save money during data center construction. But if that affordably priced equipment fails to meet operating specifications as defined in the original architectural design, it can wind up costing an organization dearly over time in the form of reduced efficiency and uptime.
Best practice: Executives with review and decision-making authority over a data center construction or renovation project should carefully scrutinize the procurement decisions that line managers and contractors are making to ensure that no one trades long-term risk for short-term savings. They should also clearly communicate the importance of adhering scrupulously to original operating specifications, even if it means spending a little more during the construction process.
Companies may also wish to define goals and objectives for facilities construction managers that put less emphasis on near-term cost reduction. Rewarding construction teams for taking a long-term approach to procurement can lessen their incentive to cut corners in ways that adversely impact availability over a data center's lifespan.
IT departments are increasingly utilizing standardized best practice frameworks such as the Information Technology Infrastructure Library (ITIL) to help them systematize and enhance their work processes. Developed by the British government in the 1980s, ITIL defines specific, effective and repeatable ways to handle incident management, service desk operation and other common IT tasks. Organizations that follow ITIL guidelines usually enjoy better control over IT assets, enabling them to more easily diagnose and address IT outages.
Unfortunately, few facilities organizations employ rigorous, uniform maintenance processes such as those defined by ITIL, relying instead on ad hoc procedures and the accrued knowledge of facilities managers. As a result, maintenance standards for power and cooling systems are often lower or less consistent than for IT systems, resulting in increased downtime.
Best practice: Though facilities process frameworks as thorough and proven as ITIL have yet to be developed, facilities departments can and should take steps to develop standardized, documented processes of their own. Performing essential activities in consistent, repeatable ways can significantly lower the likelihood of power and cooling breakdowns while simultaneously increasing the productivity of facilities technicians.
Aviation engineers and maintenance professionals have long understood the importance of strong change management processes. Preserving a thorough and accurate record of all maintenance procedures performed on a given aircraft is critical to ensuring that the plane is safe to fly. Furthermore, should an accident occur, maintenance records can provide vital forensic clues to the root causes underlying a catastrophic system failure.
For similar reasons, ITIL places particular emphasis on carefully tracking all changes to IT resources in a comprehensive configuration management database (CMDB). Information in the CMDB can help IT employees resolve service interruptions more effectively, and can be especially valuable in emergency situations when accessing important data in a timely manner is critical.
Unfortunately, few facilities departments maintain a CMDB. As a result, the only record of how old a data center's uninterruptible power systems (UPSs) are or what servers or other loads they are currently feeding, to cite but two examples, often lies in a facilities manager's head. When that manager leaves for another job or retires, that precious knowledge leaves with him or her, exposing the data center to unnecessary downtime and lengthier recoveries after power and cooling disturbances.
Best practice: Facilities departments should establish and rigorously maintain a CMDB of their own. ITIL guidelines offer a useful starting point for such an initiative, and companies can also draw on a variety of specialized CMDB software applications.
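As a rough illustration of what a facilities CMDB entry might capture, the Python sketch below records the two facts mentioned above, a UPS's age and the loads it feeds, together with a change history. The class names, fields and sample values are all hypothetical; this is not an ITIL-defined schema or the data model of any particular CMDB product.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChangeRecord:
    """One logged change to a facilities asset."""
    when: date
    who: str
    description: str


@dataclass
class FacilityAsset:
    """A power or cooling asset tracked in a facilities CMDB (hypothetical schema)."""
    asset_id: str
    kind: str                                          # e.g. "UPS", "chiller", "PDU"
    installed: date
    feeds: list[str] = field(default_factory=list)     # loads this asset powers
    history: list[ChangeRecord] = field(default_factory=list)

    def age_years(self, today: date) -> float:
        """Approximate age of the asset in years as of `today`."""
        return (today - self.installed).days / 365.25

    def log_change(self, when: date, who: str, description: str) -> None:
        """Append a change record so the knowledge outlives any one employee."""
        self.history.append(ChangeRecord(when, who, description))


# Illustrative data only.
ups = FacilityAsset("UPS-01", "UPS", date(2015, 3, 1), feeds=["rack-A1", "rack-A2"])
ups.log_change(date(2021, 6, 14), "j.doe", "Replaced battery string 2")
ups.feeds.append("rack-B4")
ups.log_change(date(2022, 1, 9), "j.doe", "Added rack-B4 to output feed")
```

Even a structure this simple answers "how old is this UPS and what does it feed?" without depending on any one person's memory.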
People often use "availability" and "reliability" interchangeably. In fact, however, the two words have related but distinct meanings.
Reliability (as measured by the mean time between system failures, or MTBF) is one of two key components of availability. The other is the mean time required to repair a given system when it fails, or MTTR. The formula for availability is as follows:
Availability = MTBF / (MTBF + MTTR)
A server, switch or power supply may be highly reliable, in that it rarely fails, yet not highly available because it takes a long time to repair. Even so, IT departments often overlook repair time entirely when assessing a system's availability.
To see how that oversight can compromise data center availability, consider the hypothetical case of a company trying to decide whether to use ordinary fluorescent light bulbs or a more sophisticated LED lighting system in its new corporate headquarters. The LED system is highly reliable, as it rarely experiences mechanical problems. But when problems do occur, if spare LED lamps are not kept in local inventory or available from local suppliers, replacing them can be a time-consuming process. Fluorescent bulbs, on the other hand, have an MTBF of approximately 6,000 hours, making them significantly less reliable. But replacing them is typically a quick and relatively inexpensive process, since they are a standard product. Taking both reliability and average repair time into account, then, fluorescent bulbs may actually provide better availability than the LED system.
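The lighting comparison can be worked through with the availability formula given earlier. In the Python sketch below, only the 6,000-hour fluorescent MTBF comes from the text; the LED MTBF and both repair times are assumed figures chosen purely to illustrate the effect.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# 6,000 h is the fluorescent MTBF quoted in the text.
# All other numbers are assumptions for illustration:
#   fluorescent swap from local stock: 0.5 h
#   LED MTBF: 50,000 h; replacement lamp must be ordered: 72 h
fluorescent = availability(mtbf_hours=6_000, mttr_hours=0.5)
led = availability(mtbf_hours=50_000, mttr_hours=72)

print(f"fluorescent availability: {fluorescent:.5f}")
print(f"LED availability:         {led:.5f}")
```

Under these assumed figures the less reliable fluorescent bulb comes out ahead, because its short repair time outweighs its shorter MTBF. With spare LED lamps stocked locally, the result would reverse.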
The same logic applies to power system infrastructure components. Systems designed to run smoothly for long periods without interruption may not provide high availability if repairing them is a time-consuming operation.
Best practice: When evaluating power system components, companies should look for products that are both highly reliable and quickly repairable. In particular, they should carefully investigate how swiftly and effectively a given power system manufacturer can service its products. How many service engineers does the manufacturer employ, where are they stationed, and how rapidly can they be on site at your data center after an outage? Is 24/7 support available? How thoroughly do service engineers know the manufacturer's products? Do they have access to escalation resources if they can't solve a problem themselves? Even the most well-made and reliable power system may ultimately deliver poor availability if its manufacturer can't dispatch properly trained and equipped service personnel promptly after a breakdown.
Companies should also seek out products with redundant, modular designs. Should a module fail in such a system, other modules compensate automatically, increasing the parent unit's MTBF. In addition, replacement modules tend to be more readily obtainable than conventional components, and are usually easy enough for as few as one or two technicians to install quickly, often without manufacturer assistance. The result is lower MTTR, and hence better availability.
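The benefit of redundant modules can be approximated with a standard "k of n" calculation. The Python sketch below assumes identical modules that fail independently of one another, which real systems only approximate, and it uses an assumed per-module availability of 0.999. It does not model any specific manufacturer's design.

```python
from math import comb


def k_of_n_availability(module_availability: float, n: int, k: int) -> float:
    """Probability that at least k of n independent, identical modules are up."""
    a = module_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


single = 0.999                                        # assumed availability of one module
no_spare = k_of_n_availability(single, n=3, k=3)      # 3 modules, all 3 required
n_plus_1 = k_of_n_availability(single, n=4, k=3)      # same load, plus 1 spare module

print(f"three modules, no spare: {no_spare:.6f}")
print(f"three modules plus one:  {n_plus_1:.6f}")
```

Adding one spare module lifts the system above the availability of any single module, while the same modules without a spare fall below it.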
Contrary to popular belief, few systems fail without warning, except in disasters. Their warnings simply go unheeded too often, because the monitoring practices in place are reactive rather than proactive.
For example, imagine that a UPS fails late one night, bringing your data center down with it. Odds are good that in the days or hours leading up to the failure, the UPS was emitting signals suggestive of future trouble. Perhaps the UPS or its batteries were beginning to overheat or exhibit degraded performance, for instance. Yet if facilities managers weren't monitoring those performance indicators, they probably knew nothing about the impending breakdown until after it occurred.
Best practice: The latest enterprise management products can help businesses monitor and proactively administer mission-critical equipment, including power, environmental and life/safety systems. But even the best software does little good if it's not consulted diligently. So while deploying power system monitoring and diagnostic software is an important start, facilities departments must also ensure that they have disciplined work processes in place for consulting that software and responding swiftly to signs of danger.
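To make the difference between reactive and proactive monitoring concrete, the Python sketch below flags not only readings that cross a fixed limit but also readings that are trending upward while still below it. The function name, thresholds and sample battery temperatures are hypothetical, and the logic is far simpler than what commercial monitoring software provides.

```python
from statistics import mean


def check_readings(readings, warn_limit, window=3, max_rise=2.0):
    """Return alert strings for a series of periodic sensor readings.

    Alerts when the latest reading is at or above `warn_limit`, or when the
    average of the last `window` readings exceeds the average of the
    preceding `window` readings by at least `max_rise`.
    """
    alerts = []
    if not readings:
        return alerts
    latest = readings[-1]
    if latest >= warn_limit:
        alerts.append(f"latest reading {latest} at or above limit {warn_limit}")
    if len(readings) >= 2 * window:
        recent = mean(readings[-window:])
        earlier = mean(readings[-2 * window:-window])
        rise = recent - earlier
        if rise >= max_rise:
            alerts.append(f"rising trend: +{rise:.1f} over last {window} samples")
    return alerts


# Assumed hourly UPS battery temperatures in degrees Celsius.
battery_temps_c = [25.0, 25.2, 25.1, 26.8, 28.0, 29.5]
alerts = check_readings(battery_temps_c, warn_limit=35.0)
for alert in alerts:
    print(alert)
```

In this sample the battery never reaches the 35-degree limit, so a purely threshold-based check would stay silent, yet the trend check raises an alert that someone still has to read and act on.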
Electrical power system practices
Every data center has critical dependencies on external providers of electricity, fuel and water. And every such external provider is virtually guaranteed to experience a service interruption at some point in time. The only question is whether or not you're prepared for the crisis when it occurs.
Most data centers maintain contingency plans for dealing with a loss of power or water. In the case of a power outage, those plans typically involve utilizing a diesel-powered generator until electrical service is restored. But what if the 24- to 48-hour supply of diesel fuel many companies stockpile runs out before the electricity comes back? That's precisely the situation that confronted numerous organizations in the northeastern United States and parts of Canada in August 2003, when a major blackout left an estimated 55 million people without power for several days. Many companies, including a major financial services provider, exhausted their supply of diesel generator fuel before electrical power was restored. Unlike most of its peers, however, the financial services provider had a large reserve of cash on hand for occasions just like this one. As a result, it was able to get the additional fuel it needed despite skyrocketing demand, while other companies scrambled to gather funds or secure credit.
Best practice: IT and facilities groups have direct control over many of the problems that can bring down a data center. But even the most well-designed and carefully constructed facility is vulnerable to problems beyond an organization's control. Businesses, therefore, must think comprehensively about external issues that could impact their data centers, and carefully weigh the costs and benefits of preparing for them.
For example, stockpiling enough diesel fuel and water for chillers for five days instead of two may be expensive, but it's significantly less costly than three days of downtime. And the chances of losing power for more than 48 hours may be greater than you think: When a massive ice storm struck New England and upstate New York in December 2008, for instance, more than 100,000 customers were still without power nearly a week later.
When it comes to contingency planning, then, "hope for the best but expect the worst" is a sound rule of thumb.
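The fuel question comes down to simple arithmetic that is worth running against real figures. The Python sketch below uses an assumed generator burn rate and tank size. Actual consumption varies with generator model and load, so treat the numbers as placeholders.

```python
def generator_runtime_hours(fuel_on_hand_gal: float, burn_rate_gal_per_hour: float) -> float:
    """Hours of generator operation the current fuel stock supports."""
    if burn_rate_gal_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_on_hand_gal / burn_rate_gal_per_hour


def extra_fuel_needed_gal(target_hours: float, fuel_on_hand_gal: float,
                          burn_rate_gal_per_hour: float) -> float:
    """Additional fuel required to reach the target runtime (never negative)."""
    return max(0.0, target_hours * burn_rate_gal_per_hour - fuel_on_hand_gal)


# Assumed figures: 70 gal/h at the expected load, 3,360 gal stored on site.
burn_rate = 70.0
on_hand = 3_360.0

runtime = generator_runtime_hours(on_hand, burn_rate)
shortfall = extra_fuel_needed_gal(target_hours=5 * 24, fuel_on_hand_gal=on_hand,
                                  burn_rate_gal_per_hour=burn_rate)

print(f"current runtime: {runtime:.0f} h")
print(f"extra fuel for five days: {shortfall:.0f} gal")
```

With these assumptions the site holds exactly two days of fuel and would need 5,040 more gallons, either stored or contractually guaranteed, to ride out a five-day outage.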
Power system topology has a major impact on procurement costs, operational expenses, reliability and average repair times. The more redundancy you build into a given data center, the more it will cost you to build and run, but the better it will withstand and recover from an outage. The Uptime Institute, an independent research organization that serves owners and operators of enterprise data centers, has defined four power system topologies for mission-critical facilities that illustrate this principle. In ascending order of redundancy, they are Tier I (basic capacity), Tier II (redundant capacity components), Tier III (concurrently maintainable) and Tier IV (fault tolerant).
A Tier I or II topology will cost less than a Tier III or IV topology, but will also provide less reliability and uptime.
Best practice: There is no single correct answer when it comes to selecting a power system topology. Organizations should match their power system topology to their particular circumstances and needs.
For example, a Tier II topology might be fine for a data center that hosts a Web application, assuming multiple backup sites are available, because users are unlikely to complain if they occasionally encounter a few seconds of latency. On Wall Street, however, a few seconds of latency can result in lost millions, so a data center that hosts a financial trading application would be wise to utilize a Tier IV topology.
Electrical power anomalies can disrupt the operation of sensitive electronic equipment. In the worst cases they cause component outages that can have significant impacts on an entire enterprise.
Data centers utilize UPS equipment to protect against power anomalies. Such systems condition "dirty" electrical power and provide emergency power during outages. Until recently, however, the most highly available double-conversion UPS systems tended to be the least efficient with respect to power consumption, and vice versa. As a result, organizations looking to hold down operating costs may have implemented energy-efficient UPS products that delivered below-average availability, while organizations more concerned about uptime deployed high-availability UPS systems that wasted electricity.
Best practice: Proven UPS technology available today enables organizations to enjoy both high availability and high efficiency in a single unit. Companies using older UPS technology should consider upgrading to this newer generation of devices so as to increase application availability and reduce total cost of ownership simultaneously.
Most data center managers think they know what their power systems are capable of delivering. Far fewer, however, actually know. That's because most businesses fail to audit their power infrastructure on a regular basis.
Only by auditing power systems and the operational processes you use to support them can you establish your data center's maximum load parameters concretely. Relying instead on product specifications and contractor assurances leaves you at risk of exposing capacity shortfalls the hard way, when you need to put important new IT workloads into production but can't due to insufficient power.
Best practice: Audit your power systems thoroughly and regularly.
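One output of such an audit is a headroom figure for each power feed, based on measured peak load rather than nameplate assumptions. The Python sketch below shows the calculation. The 80 percent utilization ceiling, the feed names and the kilowatt figures are all assumptions for illustration, and your own design margin may differ.

```python
def headroom_kw(rated_kw: float, measured_peak_kw: float,
                max_utilization: float = 0.8) -> float:
    """Capacity left before the feed exceeds its planned utilization ceiling."""
    return rated_kw * max_utilization - measured_peak_kw


def can_add_load(rated_kw: float, measured_peak_kw: float, new_load_kw: float,
                 max_utilization: float = 0.8) -> bool:
    """True if the proposed load fits within the remaining headroom."""
    return new_load_kw <= headroom_kw(rated_kw, measured_peak_kw, max_utilization)


# Assumed audit results: feed name -> (rated kW, measured peak kW).
feeds = {"UPS-A": (500.0, 365.0), "UPS-B": (500.0, 410.0)}

for name, (rated, peak) in feeds.items():
    room = headroom_kw(rated, peak)
    status = "OK" if room >= 0 else "OVER CEILING"
    print(f"{name}: headroom {room:.1f} kW ({status})")
```

A feed can look comfortable against its rated capacity and still be over its planned ceiling, which is the kind of shortfall an audit is meant to surface before a new workload does.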
Maintaining availability in today's large, hot and complex data centers is more difficult, and more strategically vital, than ever, especially given global economic conditions, sustainability pressures and an aging and often shrinking workforce. Businesses already utilize a variety of technologies and processes to ensure that mission-critical IT systems enjoy access to clean, dependable power. Yet most organizations could further mitigate their exposure to downtime by adopting the proven best practices discussed in this white paper. Some such practices admittedly require incremental investments in new hardware or software. But many are as simple as getting IT and facilities personnel talking to one another.
Of course, the 10 best practices discussed in this white paper hardly exhaust the myriad ways businesses can protect their data centers from power-related service interruptions. Organizations serious about data center availability should continually and closely study best-of-class data centers for further processes and technologies they can adopt themselves. Time spent on such a task is almost certain to pay off in the form of new ideas for ensuring continuous data center operations.
For further discussion of your data center requirements, contact Bomara Associates.
3 Courthouse Lane, Chelmsford, MA 01824 USA