High Availability Options

Let's face it, today's disk drives are faster, have higher capacity, and are less expensive than ever before, but I don't think anyone would say they are more reliable. And worse, disks are no longer repaired when they fail; they are replaced. As we store more and more data on our systems, it is imperative that we protect that data.

So what are the options for data protection? First, we have to get clear definitions of two terms, namely Fault Tolerance and High Availability. Fault tolerance simply means that a disk failure does not result in lost production data. High availability means that not only do you not lose production data as a result of a disk failure, you also don't have any system downtime. Logging and shadowing are generally considered fault tolerant solutions. If you are considering high availability or fault tolerance, then this paper should help get you versed in the terms you will encounter.

Fault tolerant products such as Image Database Logging and Netbase Shadowing/Shareplex are still good options, but they require a manual recovery effort to bring the system back up to date after a disk failure. User volumes will greatly lessen the overall impact of a failed disk, but they will not entirely eliminate lost data. Shadowing cannot be considered a high availability option unless several factors are addressed. First, the shadow system must have enough horsepower to handle the load that the master system carries. Second, there must be an automatic switchover process to move all connections to the backup machine. Typically this would be done via a DNS server pointing all new connections to the slave machine's IP address. This works very effectively for web-enabled machines that are processing stateless connections. Third, source control and production control must be carefully planned and maintained so that the shadow machine is ready to take over the processing. This can be a major issue in terms of performance and application viability on systems where the shadow system is also used for testing and development.
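
To make the DNS switchover idea concrete, here is a minimal sketch of what the relevant record might look like. The host name and addresses are made up for illustration; the point is that a short TTL keeps clients from caching the old address for long, so once the record is repointed at the shadow machine, new (stateless) connections land there quickly.

    ; Hypothetical zone fragment -- name and addresses are for illustration only
    app.example.com.   60   IN   A   192.0.2.10   ; master HPe3000 (60-second TTL)
    ;
    ; After a failure the record is repointed at the shadow machine:
    ; app.example.com. 60   IN   A   192.0.2.20   ; shadow (slave) HPe3000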

High Availability options include:

  • EMC Disk Arrays. Generally a high-end offering in terms of capacity. EMC disks are reliable, scalable, and the most expensive. You can also host multiple systems (and mixed system types: MPE, UNIX, NT) on a single array, which is an attractive option. Also, EMC does all of the maintenance on the array, diagnostics, etc. Some people like to outsource the whole thing; others don't like the feeling that they don't have complete control.
  • HP XP256 Disk Array. Similar to the EMC disk array. I don't know of anyone using them in production on an HPe3000.
  • HP Surestore E Disk Array Model 12H (a.k.a. Autoraid Model 12H.) This is a middle-of-the-road offering in terms of capacity. Performance-wise, it can effectively go up to about 80 GB. The Autoraid is supposed to handle everything for you automatically. There is management software that runs on the HPe3000 host, but it is still relatively new. At this point in time I consider Autoraid on MPE to be a still-evolving technology.
  • HP Nike Model 10/Model 20 disk arrays. A somewhat older technology, these can be found on the used market and offer an attractive price.
  • MPE Mirror/iX Software. The operating system maintains mirroring by writing simultaneously to each disk in a mirrored pair.

All of the hardware options let you choose the RAID level, whether it is full mirroring (RAID 1) or striping with parity (RAID 5). The Autoraid can even switch back and forth dynamically. You configure a LUN (logical unit) of n GB of disk space to present to the system as a single disk, and the operating system is then configured with a single LDEV number for each LUN. The disk array decides which physical disk(s) actually store the data. Moreover, all of the disk arrays include hot-swappable disks, so there is no need to shut down to replace a failed disk (high availability). The disk arrays generally include large read and write memory caches to increase throughput; however, performance can be a serious issue with any of them.
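
As a rough illustration of the capacity trade-off between the two RAID levels, the sketch below (in Python, purely illustrative, not anything that runs on MPE) computes the usable space a LUN can present for a given set of physical disks. The disk counts and sizes are hypothetical; the formulas are the standard ones: mirroring yields half the raw space, striping with parity yields all but one disk's worth.

    # Rough usable-capacity arithmetic for a LUN built from n identical disks.
    # These are the textbook RAID formulas, not a model of any particular array.

    def raid1_usable_gb(disks: int, size_gb: float) -> float:
        """RAID 1 (mirroring): every block is written twice, so half the raw space is usable."""
        return disks * size_gb / 2

    def raid5_usable_gb(disks: int, size_gb: float) -> float:
        """RAID 5 (striping with parity): roughly one disk's worth of space holds parity."""
        return (disks - 1) * size_gb

    # Hypothetical example: six 9 GB disks presented as one LUN.
    print(raid1_usable_gb(6, 9))   # 27.0 GB usable if mirrored
    print(raid5_usable_gb(6, 9))   # 45 GB usable if parity-striped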

On the other hand, Mirror/iX software has some distinct differences. It only offers RAID 1 (mirroring); no other RAID levels are available. The system actually sees two different LDEVs for each logical volume of a volume set. That is, PROD:MEMBER2 exists on both LDEV 32 and LDEV 42, and the two are mirrored images of each other. The advantages of this are: A) when reading data, the operating system can choose which disk of the pair to read from, which can give a performance improvement; B) since you configure the mirrored partners on different SCSI channels, there is built-in redundancy. The downside to Mirror/iX is that you can only use it on user volumes; you can't mirror the system volume set. This is not a big problem for most users. They will simply move all of their production data to the user volumes, so that only the operating system and (third party) utilities are left on the system volume set. In the event of a lost disk in the system volume set, a simple re-install and a relatively small restore will get you back up and running. Those looking to be close to 100% uptime will use a Nike disk array for the system volume set.
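
The read-balancing advantage in point A is easy to picture with a small sketch. This is not MPE's actual selection algorithm, just an illustration of the idea, using the LDEV numbers from the example above and a hypothetical count of outstanding requests per drive: send each read to whichever member of the mirrored pair is currently less busy.

    # Illustrative only: choose which half of a mirrored pair services a read.
    # The pair (32, 42) mirrors one logical volume; the queue depths are made up.

    outstanding = {32: 3, 42: 1}   # pending I/Os per LDEV (hypothetical numbers)

    def choose_read_ldev(pair):
        """Return the LDEV of the pair with the shorter request queue."""
        a, b = pair
        return a if outstanding[a] <= outstanding[b] else b

    print(choose_read_ldev((32, 42)))   # -> 42: the less busy mirror copy serves the read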

In the past, HP would recommend that you use hot-swappable disks (not arrays, just hot-swappable enclosures) known as Jamaica disks for mirrored volume sets. In this way, if a disk failed you could replace it while the system was still running, eliminating downtime. We prefer an alternative to that. By utilizing standard disks and disk enclosures and keeping several spare disks on the system at all times, you can provide your own “hot swappable” disks. If a disk fails, rebuild the mirror onto a different disk that is already configured and ready to go. Then you can plan downtime to replace any disks that have failed. This can be a significant cost savings, as the Jamaica enclosures and disks are more expensive than traditional disks and enclosures. One other advantage of the Mirror/iX product is the ability to split the mirror (stop mirroring) and run your backup against the static copy of the data while users continue to work on the “master” copy. Once the backup is complete, the mirroring software automatically rebuilds the mirrored copies. This gives you an alternative to 24×7 backups. The drawback is that your data is not protected from a disk failure during the backup.
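
The split-mirror backup flow boils down to three steps. The outline below is a sketch in Python form with made-up helper names (split_mirror, run_backup, resume_mirror); the real work is done with the Mirror/iX and backup utilities, not from a script like this. It simply shows the sequence, and why the data is exposed to a single-disk failure while the backup runs.

    # Sketch of a split-mirror backup; the helper functions are hypothetical
    # stand-ins for the real Mirror/iX and backup operations.

    def split_mirror(volume_set):
        print(f"split mirror on {volume_set}: one copy is frozen, mirroring stops")

    def run_backup(volume_set):
        print(f"back up the static copy of {volume_set} while users work on the master copy")

    def resume_mirror(volume_set):
        print(f"resume mirroring on {volume_set}: Mirror/iX rebuilds the mirrored copies")

    def nightly_backup(volume_set):
        split_mirror(volume_set)
        try:
            # Exposure window: until mirroring resumes and the rebuild finishes,
            # a disk failure is not covered by a second copy.
            run_backup(volume_set)
        finally:
            resume_mirror(volume_set)

    nightly_backup("PROD")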

Conclusion

Zero downtime is technically impossible under MPE. Other platforms handle it with complicated clusters of redundant systems. You can reduce the amount of downtime caused by a disk failure, but you have to consider all of the components of the CPU that are not redundant (power supplies, I/O card cages, I/O adapters, etc.). With that in mind, any of these options should give you no lost data and close to zero downtime due to a disk failure. One of the reasons there are so many options is that no one option fits all circumstances. Please feel free to call and speak to us about your specific situation.