The challenge with commodity hardware is the integration of software. IT is a complex, sometimes temperamental beast. Solution A, running perfectly well on one brand of commodity hardware, might throw up a strange, undefined critical error on another brand. You never know what you might get until you’ve tested it. You lose some hardware awareness when you go ‘software defined’. When you abstract and simplify, you also lose detail and control.
Let’s take an object storage system for example: It may be provided as an appliance (more $$$) or as software defined on commodity hardware (less $$$). The system will naturally have hundreds, if not thousands, of hard drives used to provide the capacity required. What happens if the software doesn’t understand the hard drives? What if it doesn’t understand when one hard drive is about to fail or doesn't know how to trigger the warning light if there is a failure? Imagine if the software can't tell how many times a drive has been written to or read from. What if it doesn’t understand whether a drive has been moved from one system to another?… you can see the problems here.
Imagine it this way: A technician is trying to replace a faulty hard drive without an error light. He is not 100% familiar with the server, so trying to find which server in the rack and which drive has failed is a real challenge. The chances of the wrong drive or the wrong server being serviced are worryingly high. If a high density server chassis went down and needed to be replaced, you’d have 80+ 8TB hard drives that would all need to be re-striped with data. It could take weeks if the software didn’t understand that these drives already had the right data and only the server chassis had changed!
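To put a rough number on "weeks", here is a back-of-the-envelope sketch. The 500 MB/s aggregate rebuild rate is purely an assumption for illustration (real rates depend on the network, the cluster load and the software), and `rebuild_days` is a hypothetical helper, not part of any product.

```python
def rebuild_days(drive_count: int, drive_tb: float, throughput_mb_s: float) -> float:
    """Days needed to rewrite every drive at a sustained aggregate throughput."""
    total_mb = drive_count * drive_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / throughput_mb_s
    return seconds / 86_400  # seconds in a day


# 80 drives of 8 TB each, at an assumed 500 MB/s aggregate rebuild rate
print(round(rebuild_days(80, 8, 500), 1))  # 14.8
```

Under that assumption a full re-stripe of one chassis takes roughly two weeks, and that is with nothing else competing for the bandwidth.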
There is a better way! Rest assured, you can have your cake and eat it too. Some object storage solutions run on industry standard, commodity x86 hardware with awareness built into the software - the software is deeply integrated with the hardware. This provides the appliance experience at commodity prices.
How is this possible? Two ideas make it work:
1. Combined Operating System (OS) and Software image
2. Rigorous testing and a Hardware Compatibility List (HCL)
Previously, you would install the OS onto your servers and then install the software on top of that. If you had any driver or OS updates to apply, you had to do that at the server or OS management level. You would then need to manage the software on top of that and make sure the drivers, OS and software were all at compatible versions. Now, by combining the OS and software into one single image, when you update your software it includes the updated OS with correct drivers and versions. You save yourself the time and energy required for those multiple steps and remove the risk of problems occurring.
With the OS and software integrated you can also manage the hardware at the level you require. You can turn on the hard drive lights when they fail; you can track each hard drive's statistics and predict when a failure may occur (and pre-emptively move data off the drive). All of this saves you time and money, and more importantly, it makes your life easier.
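As a rough illustration of what "track each drive's statistics and predict a failure" could look like, here is a minimal sketch. Everything in it is hypothetical: the `DriveStats` fields, the thresholds and the action text are assumptions made up for the example, not how any particular product does it.

```python
from dataclasses import dataclass


@dataclass
class DriveStats:
    slot: str                  # physical bay label (hypothetical naming scheme)
    reallocated_sectors: int   # sectors the drive has remapped
    pending_sectors: int       # sectors waiting to be remapped
    read_errors_last_day: int  # read errors seen in the last 24 hours


# Hypothetical thresholds; real values would come from vendor guidance and testing.
MAX_REALLOCATED = 50
MAX_PENDING = 10
MAX_READ_ERRORS = 100


def at_risk(stats: DriveStats) -> bool:
    """Return True if any counter is above its (assumed) threshold."""
    return (
        stats.reallocated_sectors > MAX_REALLOCATED
        or stats.pending_sectors > MAX_PENDING
        or stats.read_errors_last_day > MAX_READ_ERRORS
    )


def plan_actions(drives: list[DriveStats]) -> list[tuple[str, str]]:
    """List the slots that should be drained and flagged for a technician."""
    return [
        (d.slot, "move data off drive and turn on fault light")
        for d in drives
        if at_risk(d)
    ]


drives = [
    DriveStats("chassis-1/bay-04", 2, 0, 1),
    DriveStats("chassis-1/bay-17", 120, 3, 40),
]
for slot, action in plan_actions(drives):
    print(slot, "->", action)
```

The point of the sketch is the pairing: the same software that spots the at-risk drive also knows which physical bay it sits in, so the technician gets a lit fault light instead of a guessing game.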
In order to combine the OS and include all the relevant drivers, you need to know exactly how the hardware will be configured. That’s why you have an HCL. This lists all the commodity hardware that has been tested with and is understood by your chosen software/OS bundle. Generally, the HCL will have a handful of different commodity hardware options, so you can choose the commodity hardware you’re most comfortable with.
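In practice, an HCL check boils down to a lookup. The sketch below assumes a made-up list keyed by vendor and model with a minimum firmware version; the vendor names, model names and versions are all placeholders, and a real HCL would be published by the software vendor.

```python
# Hypothetical HCL: (vendor, model) -> minimum tested firmware version.
HCL = {
    ("VendorA", "Model-X1"): (2, 4),
    ("VendorB", "Model-S9"): (1, 0),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.10' into (2, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def is_supported(vendor: str, model: str, firmware: str) -> bool:
    """True only if the hardware is listed and its firmware is new enough."""
    minimum = HCL.get((vendor, model))
    if minimum is None:
        return False  # not on the list means not tested
    return parse_version(firmware) >= minimum


print(is_supported("VendorA", "Model-X1", "2.10"))  # True
print(is_supported("VendorA", "Model-X1", "2.3"))   # False
print(is_supported("VendorC", "Model-Q2", "9.9"))   # False
```

Anything not on the list is treated as unsupported, which matches the spirit of an HCL: untested hardware is exactly where the strange, undefined errors come from.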
I recently heard a story of one company that deployed an object storage system and unfortunately received a bad batch of drives. Roughly 30% failed within the space of two months. Luckily, because they had a solution that understood the hardware and knew when a drive was about to fail, they managed to stay online and retain ALL their data. In other environments this sort of failure would have been catastrophic.
What would happen in your data centre if 30% of your storage system drives failed in a month?