by Dave King
The human race has acquired an insatiable demand for IT services (or rather the 35% that have access to the internet has), services that have to be available 24 hours a day, seven days a week.
As this demand has grown, data centers have evolved to become either the place where all revenue is generated, or the place that enables all revenue generation for a business. Just as I am writing this, an advert has popped up on LinkedIn for a Network Infrastructure Engineer for Greggs the Baker (for our international readers, Greggs are a high street baker; they sell cakes, pasties and various other tasty things). That’s right, the baker needs to employ someone well versed in Linux, Cisco and Juniper!
Back in the old days, operators could fly by the seat of their pants, using gut instincts and experience to keep things running. A little bit of downtime here and there wasn’t the catastrophic, career-ending event it is today. But, as the data center has undergone its transformation into the beating heart of the digital business, the pressure on those poor souls looking after the data center environment to keep it up 24/7 has gone through the roof.
In response to this, managers have invested heavily in monitoring systems to understand just what the heck is going on inside these rooms. Now armed with a vast amount of data about their data center (interesting question: how many data centers’ worth of data does data center monitoring generate?), and some way to digest it, people are starting to breathe a little easier.
But there’s still a nervous air hanging over many operations rooms. Like the bomb disposal expert who is fairly sure it’s the green wire, but who is still going to need a new pair of underwear, people are left watching those monitor traces after any change in the data center, hoping they don’t go north.
Meet Bob. Bob works in data center operations for MegaEnterpriseCorp Ltd. His job is to approve server MACs (moves, adds, changes), and he is judged on two criteria:
- No increase in the PUE value for the facility
- No loss of availability under any conditions, barring complete power failure.
The boss also dictates that, as long as there is capacity in the facility, a MAC must be approved unless Bob can prove that it will fail either criterion.
If a MAC fails on either criterion, or if Bob says no to his boss, he risks a pink slip. Bob has at his disposal the most comprehensive DCIM monitoring solution you can imagine. What would you do in this situation?
Let’s think about this for a minute. Say that the equipment to be installed had a fan more like a jet engine than a server; Bob has a gut feeling that it’s going to cause all sorts of problems. How could he prove that it would fail either criterion? Thanks to his all-singing-all-dancing DCIM stack, he has all the information he could want about the environment inside the data center right now. It’s saying that everything looks fine, mostly because the horrible jet server hasn’t been installed yet.
The only way to find out what kind of carnage that server may wreak on the environment is to install it, switch it on and watch the monitor traces in trepidation to see what happens. If the PUE doesn’t change then great, but how much headroom have you lost in terms of resilience? The only way to find out? Fail the cooling units and see what happens…
The more astute among you will have noticed that this is a lose-lose situation for poor old Bob. He can’t stop any deployments unless he can prove they will reduce availability or have an impact on PUE, but he can’t prove they will cause problems without making the change and seeing what happens! Catch-22!
The problem is that all the changes are being made to the production environment; there is no testing ground data center to make mistakes in – it’s all happening live! And that’s why everyone is on the edge of their seat, all the time. In many other industries, simulation is used in situations like this – where physical testing is impossible or impractical – to allow people to see and analyze design changes and what-if scenarios. There is no reason the data center industry should be different.
Let’s go back to Bob, but this time we’ll give him a simulation tool in addition to his DCIM suite. For each proposed MAC, he sets up a model in the simulation tool using the data from the DCIM system and then looks at the simulated PUE and availability. He can fail cooling units in the simulation without any risk to either the IT or himself. If PUE goes up, or availability goes down, Bob can print out the simulation results as proof, say no to his boss and keep his job.
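To make Bob's new workflow concrete, here is a minimal sketch of the approval rule in Python. It is not the output of any real simulation tool: the `SimResult` structure, the `review_mac` function, the sample numbers and the 27 °C inlet limit are all hypothetical, and using the hottest server inlet temperature during a cooling failure as a stand-in for "availability" is an assumption made purely for illustration. PUE is computed in the usual way, as total facility power divided by IT power.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    """Headline numbers from one simulated scenario (hypothetical structure)."""
    total_facility_kw: float  # everything the facility draws, cooling included
    it_load_kw: float         # power drawn by the IT equipment alone
    max_inlet_temp_c: float   # hottest server inlet temperature in the scenario

    @property
    def pue(self) -> float:
        # PUE = total facility power / IT equipment power
        return self.total_facility_kw / self.it_load_kw


def review_mac(baseline, proposed, failure_cases,
               inlet_limit_c=27.0, pue_tolerance=1e-3):
    """Return (approved, reasons) for a proposed move, add or change.

    baseline      -- simulation of the room as it is today
    proposed      -- simulation with the MAC applied, all cooling running
    failure_cases -- dict of scenario name -> SimResult with the MAC applied
                     and a cooling unit failed
    inlet_limit_c -- assumed availability proxy, not a universal limit
    """
    reasons = []

    # Criterion 1: no increase in PUE.
    if proposed.pue > baseline.pue + pue_tolerance:
        reasons.append(
            f"PUE rises from {baseline.pue:.3f} to {proposed.pue:.3f}")

    # Criterion 2: no loss of availability when cooling units fail.
    for name, result in failure_cases.items():
        if result.max_inlet_temp_c > inlet_limit_c:
            reasons.append(
                f"{name}: inlet reaches {result.max_inlet_temp_c:.1f} C "
                f"(limit {inlet_limit_c:.1f} C)")

    return (not reasons, reasons)


# Made-up numbers for the "jet engine" server.
baseline = SimResult(1500.0, 1000.0, 24.0)
proposed = SimResult(1530.0, 1010.0, 25.0)
failure_cases = {"cooling unit 3 failed": SimResult(1525.0, 1010.0, 31.0)}

approved, reasons = review_mac(baseline, proposed, failure_cases)
print("APPROVE" if approved else "REJECT")
for reason in reasons:
    print(" -", reason)
```

The point of the sketch is that every input comes from a simulated scenario rather than from the live room, so the list of reasons is exactly the printout Bob can hand to his boss.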
As a senior consultant engineer who has been parachuted into troubled data centers the world over, and who has had the opportunity to advise lots of Bobs over the years, it still amazes me that the obvious solution is not more widely used. The case for simulation is compelling, so why has the adoption of simulation in the data center industry been so slow? A lack of awareness is certainly a factor, but it has also been seen by many as unnecessary, too complicated and inaccurate. Let’s address these points…
While the benefits of simulation have always been there to be had, it is certainly true that in the past there was an argument for placing it on the “nice to have” pile. Thermal densities were much lower and over-engineering more acceptable. But, as data center operations have been forced by business to become leaner, the operational envelope is being squeezed as tightly as possible. The margin for error is all but disappearing, and having the ability to test a variety of scenarios without risking the production environment places organizations at a big advantage.
Simulation tools can be complicated and it would be wrong to say otherwise. But this complexity was an unfortunate consequence of the deliberate intention to make these tools versatile. Here at Future Facilities, we’ve spent 10 years doing the exact opposite: making a simulation tool that is focused on a single application, data centers. This tool is aimed at busy data center professionals, not PhD students who have hours to spend fiddling with a million different settings. This means that modelling a data center is now as simple as dragging and dropping cooling units, racks and servers from a library of these ready-to-use items. Take a free trial and have a go yourself!
That just leaves us with the question of accuracy. The accuracy of CFD technology has already been proven – the real problem comes down to the quality of the models themselves. Make a rubbish model and you’ll get meaningless results. Many in the data center industry have been burned in the past by simulation used badly, but this is a ‘people problem’ – operator error – not an issue with the technology! If you’re going to use simulation, the model has to represent the reality and must be proven to do so before it’s used to make operational changes. This process of calibrating the model ensures that agreement between simulation results and physical measurements is reached (read this paper to find out how the calibration process works). If someone is selling you simulation and isn’t willing to put their money where their mouth is, be very, very wary.
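As a rough illustration of what "agreement between simulation results and physical measurements" can look like in practice, the sketch below compares simulated temperatures against monitored ones, sensor by sensor. It is not the calibration process referred to above: the function name `calibration_report`, the sensor names, the readings and the 1 °C tolerance are all hypothetical, and a real calibration would typically look at more than temperatures alone.

```python
def calibration_report(measured, simulated, tolerance_c=1.0):
    """Compare simulated and measured temperatures, sensor by sensor.

    measured, simulated -- dicts of sensor name -> temperature in deg C
    tolerance_c         -- assumed acceptance threshold, chosen for illustration
    """
    missing = sorted(set(measured) - set(simulated))
    if missing:
        raise ValueError(f"model has no prediction for sensors: {missing}")

    # Signed error per sensor: positive means the model predicts too hot.
    errors = {name: simulated[name] - measured[name] for name in measured}
    worst = max(errors, key=lambda name: abs(errors[name]))
    max_abs_error = abs(errors[worst])

    return {
        "errors": errors,
        "worst_sensor": worst,
        "max_abs_error_c": max_abs_error,
        "calibrated": max_abs_error <= tolerance_c,
    }


# Made-up readings from the monitoring system and the model.
measured = {"rack_a1": 22.0, "rack_a2": 24.5, "rack_b1": 23.0}
simulated = {"rack_a1": 22.5, "rack_a2": 27.0, "rack_b1": 23.0}

report = calibration_report(measured, simulated)
print(report["worst_sensor"], report["max_abs_error_c"], report["calibrated"])
```

A model that fails a check like this is telling you where it stops representing reality (here, the hypothetical `rack_a2`), which is where to go looking before trusting any prediction it makes.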
There’s just room here for me to say a few words on “real-time CFD” or “CFD-like” capabilities – the latest strap-lines for a number of DCIM providers. We’ll blog about this separately in the future, but let us be very clear: there is, at present, no such thing for data centers. It is marketing hype. When people talk about real-time CFD, they really mean one of two things: 1) they use monitor data to make a picture that looks like the output of a simulation, with zero predictive capability, or 2) they use a type of CFD known as potential flow, which trades accuracy for speed by making a lot of assumptions. Renowned physicist, bongo player and all-round good guy Richard Feynman considered potential flow to be so unphysical that the only fluid to obey the assumptions was “dry water”.
So the questions you have to ask yourself are: do I want a tool that can actually predict, and do I want a tool that can predict accurately? A full CFD simulation (typically RANS) may not be real-time, but it is the only way to get the real answer!