Mission Critical Systems - RAID & Clusters

Abstract: RAID - a new kind of bug spray for computers, or what? Given the complexity of today's computers, it is a wonder that they run at all. Yet users demand flawless operation. The nature of hardware is that it WILL fail at some point in time. (Even software has bugs - known and unknown.) These facts cannot be changed; however, we can do something to mitigate the damage. Mechanical devices like hard discs are the most prone to failure, so a system's uptime can be significantly increased by concentrating on what to do when a disc fails. This is the essence of a RAID array of discs (Redundant Array of Independent Discs). There are various levels of RAID, but essentially the idea is to write redundant data to two or more discs. If data is lost in one location, the second can be used to recover the original data. RAID 0, 1, 5 and 50 will be discussed as some of the more popular arrangements, and AMI's RAID controllers will be introduced as industry-standard products used by many OEMs today. Clustered systems build on RAID technology and apply the same ideas to their logical extension: a whole system acting as a redundant partner to a second system. The latest AMI products in this field will be discussed as examples of clustering solutions in a quad Pentium II system for high-end PC, mission-critical applications.

Introduction: Have you ever had a system crash? How about a power failure in the middle of saving an important piece of work? Weaknesses in computer hardware, environmental stresses, or plain software bugs can cause the loss of important data. The consequences vary enormously. The home user may be annoyed at needing to re-enter a few hours of work, but the bank that has just electronically transferred a billion dollars will have a very feverish concern about a failure during the transfer. There are many examples dealing with funds, private information like medical records, or military security data in which the sudden loss of data is catastrophic. There is no doubt that in time, as costs decrease, every computer sold will have some level of RAID; it will, unnoticed by the user, simply correct and recover data. For now, only the more "mission critical" applications can justify the expense.

Our objective is to provide data protection, usually discussed in terms of recovering from any single point of failure. Since the hard disc is the main data storage device, it makes sense to look at how we can make this component of the computer more reliable. The manufacturers of hard disc drives have certainly assisted in this mission by improving their production techniques: it is now common to see MTBF (Mean Time Between Failures) figures at the 1,000,000-hour level, compared to 20,000 hours at the beginning of the PC era. Nevertheless, 'mean times' are based on statistics and probability of failure. For any single disc, failure of all or part of the disc is a random event. What happens when a failure occurs?

RAID (Redundant Array of Independent Discs) provides the answer to this question. In a RAID 1 array (level 1), two discs mirror each other. In essence, data is written to both discs at the same time. If one disc crashes, the other disc can be used to recover, and thereby reconstruct, the data on the failed disc. RAID arrays are typically built with SCSI controllers.
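The mirroring idea is simple enough to sketch in a few lines of code. The following is a minimal illustration only, with invented names (a Raid1 class, in-memory dictionaries standing in for the member discs), not AMI firmware: every write goes to both discs, and a read falls back to the mirror when the primary has failed.

    # Minimal sketch of RAID 1 mirroring. The two dicts stand in for
    # physical discs; block numbers map to block contents.
    class Raid1:
        def __init__(self):
            self.discs = [{}, {}]        # primary and mirror
            self.failed = [False, False]

        def write(self, block_no, data):
            # Every write lands on both members of the mirror pair.
            for disc in self.discs:
                disc[block_no] = data

        def read(self, block_no):
            # Read from the first healthy member.
            for i, disc in enumerate(self.discs):
                if not self.failed[i]:
                    return disc[block_no]
            raise IOError("both members of the mirror have failed")

        def rebuild(self, failed_ix):
            # After a disc swap, copy every block from the survivor.
            survivor = self.discs[1 - failed_ix]
            self.discs[failed_ix] = dict(survivor)
            self.failed[failed_ix] = False

A controller such as the MegaRAID does the same bookkeeping in firmware, with the added ability to rebuild the replacement disc in the background while the array stays online.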
The firmware on the controller can detect the occurrence of a failure and take some action, such as sounding an audible alarm, to notify operators. The MegaRAID controller from AMI behaves in this manner, and we will use this device to illustrate practical implementations of these concepts.

There are other RAID levels; we will discuss levels 0, 5 and 50. RAID 0 provides a method for striping data across an array of discs. Assuming an array of four discs, the controller in RAID 0 operation will split the data into four sections and write them across all four discs. Striping does not improve the reliability of the data, but it does improve performance. This is because semiconductor speeds are generally several orders of magnitude faster than a mechanical device like a hard disc: the SCSI controller can easily cycle a write request to each hard disc and be back at the first disc before that disc's heads are aligned. At the system level, it appears that the striped data is written to all four discs in parallel. Of course there are limits. For example, a three-channel MegaRAID controller can handle up to 45 hard discs on its own. [As an aside, up to six such controllers can be installed in a system, which opens the door to very large and very fast disc arrays.] With Ultra-2 SCSI, each channel can stream data at up to 80 MB/s, while the PCI (v2.1) bus has a capacity of 266 MB/s. RAID 0 therefore opens the possibility of tuning performance to an optimum in which bandwidth is matched from disc to controller to PCI bus to memory. This level of RAID is the most popular for 'transaction' processing (many small bursts of data versus one long continuous stream), which is the common data exchange found in a commercial office.

In a RAID 5 array, data is striped, and a parity block is computed for the striped data by XOR-ing the data blocks. With four discs, each stripe's data is written to three discs and the parity stored on the fourth (RAID 5 rotates which disc holds the parity from stripe to stripe). For each write of data, the discs save the parity changes. If a data track or an entire disc fails, the controller detects the failure on reading back the data and checking it against the parity; the lost data is reconstructed and saved properly. All this happens transparently to the user. Of course there are more complexities. In rewriting an existing file, the striped data must first be read, modified with the new data, a new parity computed, and then written back to the discs. This takes more time than a straight write. Fortunately, the effects of striping help keep RAID 5 arrays fast: reads will be faster than on a single disc, and since there are usually more reads than writes, the RAID 5 array generally shows improved performance. Fundamentally, RAID 5 provides high reliability by allowing continued operation through a single point of failure (one hard disc).

RAID 5 arrays can be made very large: six controllers with 45 discs each allow up to 270 discs! The chance of a single failure increases as the number of discs increases. A RAID 50 array offers an improvement in both performance and reliability. Under RAID 50, the bank of hard discs is divided into two groups, each logically organized as a RAID 5 array, and data is then striped across these arrays. For example, eight discs would be divided into two sub-groups of four discs, each sub-group organized as a RAID 5 array, with the data striped across the two banks. This organization delivers a performance boost from the striping, and since there are now two RAID 5 arrays, the system can tolerate a single point of failure in each array. RAID 50 begins to show the complexity of the firmware in a RAID controller.
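The reconstruction arithmetic behind RAID 5 is plain XOR, which is easy to demonstrate. The sketch below is illustrative only: the function names, in-memory lists standing in for discs, and the simple parity-rotation scheme are assumptions for the example, not the MegaRAID's actual on-disc layout. The parity block is the XOR of the data blocks in a stripe, so XOR-ing the surviving blocks regenerates whatever one disc lost.

    # Toy RAID 5: lists stand in for discs; each stripe places its parity
    # on a different disc (a simple rotation, assumed for illustration).
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def write_stripe(discs, stripe_no, data_blocks):
        parity_ix = stripe_no % len(discs)        # rotate the parity disc
        blocks = list(data_blocks)
        blocks.insert(parity_ix, xor_blocks(data_blocks))
        for disc, block in zip(discs, blocks):
            disc.append(block)

    def read_stripe(discs, stripe_no, failed_disc=None):
        parity_ix = stripe_no % len(discs)
        blocks = [disc[stripe_no] for disc in discs]
        if failed_disc is not None:
            # XOR of all surviving blocks rebuilds the missing one.
            survivors = [b for i, b in enumerate(blocks) if i != failed_disc]
            blocks[failed_disc] = xor_blocks(survivors)
        return [b for i, b in enumerate(blocks) if i != parity_ix]

    discs = [[], [], [], []]                       # four-disc array
    write_stripe(discs, 0, [b"AAAA", b"BBBB", b"CCCC"])
    assert read_stripe(discs, 0, failed_disc=2) == [b"AAAA", b"BBBB", b"CCCC"]

The same arithmetic explains the read-modify-write penalty noted above: updating one data block requires reading the old data and parity before the new parity can be written.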
The entire bank of drives is seen as one logical drive by the user. The RAID firmware provides the system manager with the ability to arrange the physical drives into the desired logical configuration. It is also the firmware's responsibility to notify the system manager when a disc error occurs.

The balance of this paper is aimed at system-level improvements that assist the reliability and uptime of a system (usually a server). We will then discuss clustering, and finally the concept of remote analysis as provided by the MegaRAC product.

At the system level, in hardware, one cannot simply open the system and disconnect power and SCSI cables without problems. Hence, discs must be housed in 'docking kits', which provide a key-switch power disconnect. We also like to use docking kits with SCSI ID setting control. The SCSI cable should be terminated with an external terminator on the cable, not with the last disc in the chain; this avoids removing the SCSI termination along with a disc. With a docking tray the MTTR (Mean Time To Repair) is very short, since the system need not be opened: the docking tray is simply unplugged from the front and a new drive inserted. A spare disc and docking tray will improve the swap time even more. Notice that even though time is needed to replace the hard disc, the system has never stopped; the MTTR does not hurt system uptime or the ability of users to access data.

Duplexing is the ability of two controllers to control the same RAID array in a redundant mode: should one controller fail, the second takes over operation. As with hot-swap discs, it is possible to have hot-swap controllers. In this case the motherboard must support hot swapping and, since the controller is at one end of the cable, the SCSI termination needs to be handled properly.
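A back-of-the-envelope calculation shows why driving MTTR down matters so much. The standard availability formula is A = MTBF / (MTBF + MTTR); the repair times below are illustrative assumptions, not measurements.

    # Availability of a single repairable disc: A = MTBF / (MTBF + MTTR).
    mtbf_hours = 1_000_000          # modern disc, per the figures above
    mttr_cold  = 24.0               # assumed: open the case, source a drive
    mttr_hot   = 0.1                # assumed: hot-swap docking tray, minutes

    for label, mttr in [("cold swap", mttr_cold), ("hot swap", mttr_hot)]:
        a = mtbf_hours / (mtbf_hours + mttr)
        print(f"{label}: availability = {a:.7f}")

With a redundant array the comparison is even more lopsided: the array keeps serving data during the repair, so a short MTTR mainly shrinks the window in which a second failure could hurt.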
RAID controllers like the MegaRAID carry cache memory to improve performance (32 to 64 MB is a typical size). It is possible for a power failure to occur in the middle of a disc write. In such a case the operating system believes it has written the data successfully, but the data is still in cache memory and has not reached the disc. By supplying a battery to keep the cache alive, the data in this memory is preserved. Without this feature, on restart the system will find corrupt data in the file, and the data will be lost. With battery backup, the controller has the smarts to resume operation at the point it was stopped and thereby complete the write properly.
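The battery-backed cache behaves like a tiny journal that survives power loss. Here is a minimal sketch of the idea, with assumed names (a dictionary as the battery-backed cache, another as the disc); a real controller tracks dirty cache lines in battery-backed DRAM and replays them to the array at power-on.

    # Toy write-back cache with battery backup. 'dirty' survives the
    # simulated power failure; the disc does not see data until flush().
    class BatteryBackedCache:
        def __init__(self):
            self.dirty = {}                  # block_no -> data, battery-backed

        def write(self, block_no, data):
            self.dirty[block_no] = data      # OS gets an immediate acknowledge

        def flush(self, disc):
            # At power-on the controller replays unfinished writes.
            for block_no, data in sorted(self.dirty.items()):
                disc[block_no] = data
            self.dirty.clear()

    disc, cache = {}, BatteryBackedCache()
    cache.write(7, b"payroll record")
    # ... power fails here, before the data reaches the disc ...
    cache.flush(disc)                        # replay on restart
    assert disc[7] == b"payroll record"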
As well as redundant power, it is useful to have hot-swap power, so that if a failure occurs, the failed supply can be replaced without shutting down the system. The MegaPlex system from AMI uses three power supplies in a redundant, hot-swap arrangement. For the sake of completeness, truly critical data should always have a secondary backup; tape drives or even rewritable CDs are good choices.

By monitoring system health parameters, failures can be minimized and preventive maintenance can be taken early, before a permanent failure occurs and before any user is prevented from doing their work. It is possible to monitor these parameters remotely (using the MegaRAC as an example) so that proper support can be called in as required. RAID systems also benefit because discs most often fail gradually: tracks and sectors begin to have problems before an entire disc is lost. By monitoring these soft failures, we can determine when a disc is going bad.
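One way to act on these soft failures is a simple trend check: count recoverable errors per disc and flag a drive once the count crosses a threshold. The sketch below is a generic illustration with assumed names and an assumed threshold, not the MegaRAID's or MegaRAC's actual policy.

    # Generic soft-failure monitor: flag a disc whose recoverable-error
    # count grows past a threshold, before it fails outright.
    ERROR_THRESHOLD = 50            # assumed policy, tune per site

    def failing_discs(error_counts):
        return [disc for disc, errors in error_counts.items()
                if errors >= ERROR_THRESHOLD]

    counts = {"disc0": 3, "disc1": 71, "disc2": 0}
    for disc in failing_discs(counts):
        print(f"warning: {disc} shows {counts[disc]} recovered errors; "
              "schedule replacement")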
Another example is the new MegaRAC controller. This is an intelligent card based on the Intel i960 RISC processor and a proprietary ASIC created by AMI. The board is battery backed, so even during a general power failure it can report remotely. As well as monitoring all the system health parameters, the MegaRAC has built-in diagnostic abilities. The boot process can be monitored remotely (including all POST codes); the video stream is captured and can be viewed remotely; keyboard input can be provided remotely to walk a system through a boot process; and a watchdog timer is provided to reboot a system that has locked up. In short, system health can be monitored remotely to provide early warning and response, and stop-gap measures can be exercised remotely while a technical person travels to the system site.

In the past, clustered systems were very expensive, running into millions of dollars. With the introduction of Microsoft's phase 1 cluster software for Windows NT (Enterprise Edition) and AMI's complementary hardware, we now have the ability to provide cluster servers at the PC level. Compaq, Digital, HP, IBM, Intel, NCR and Tandem are all involved with Microsoft's initiative. AMI is the first to have a cluster kit, certified by Microsoft, that provides the essential components for a cluster system. This discussion will be limited to phase 1 of the cluster software code-named "Wolfpack" by Microsoft.

Microsoft refers to Pfister's definition of a cluster: a collection of interconnected whole computers used as a single, unified computing resource. Microsoft also states that "the goal of a cluster is to make it possible to share a computing load over several systems without either the users or system administrators needing to know that more than one system is involved."

The Wolfpack phase 1 cluster uses a "shared nothing" organization. This means that any resource is owned and controlled by only one system in the cluster at any one time. A system in a cluster is referred to as a Node.
In this type of organization, resources are owned by a node. When a failure occurs, the failing node loses control of all of its resources to the surviving (failover) node.
In this way, intricate sharing schemes are avoided. When a failure occurs in a node, a Failover occurs: ownership and control of the failed node's resources move to the remaining node. The failure is transparent to the user. On recovery (repair of the failed node), a Failback occurs, restoring the cluster.
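The ownership bookkeeping behind failover and failback can be sketched in a few lines. This is a schematic of the shared-nothing idea only, with invented names, not the Wolfpack API: each resource has exactly one owning node, and a lost heartbeat triggers the transfer.

    # Schematic of shared-nothing failover: every resource has exactly
    # one owner at any time; all names are invented for illustration.
    class Cluster:
        def __init__(self, node_a, node_b, resources):
            self.owner = {r: node_a for r in resources}   # node_a owns all
            self.nodes = {node_a: True, node_b: True}     # node -> alive?

        def heartbeat_lost(self, dead_node):
            # Failover: the surviving node takes the dead node's resources.
            self.nodes[dead_node] = False
            survivor = next(n for n, up in self.nodes.items() if up)
            for r, owner in self.owner.items():
                if owner == dead_node:
                    self.owner[r] = survivor

        def node_restored(self, node, resources):
            # Failback: the repaired node reclaims its resources.
            self.nodes[node] = True
            for r in resources:
                self.owner[r] = node

    c = Cluster("node_a", "node_b", ["disc_set", "ip_address", "sql_service"])
    c.heartbeat_lost("node_a")            # users keep working, now on node_b
    c.node_restored("node_a", ["disc_set", "ip_address", "sql_service"])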
From the hardware point of view, all system components must be redundant so that there is no single point of failure. The general architecture is shown in the diagram. In essence, two full systems are connected so as to control a RAID 1 array of discs. The RAID 1 array is a mirrored array, as previously described, and therefore has built-in redundancy. Furthermore, the array uses a dual-bus architecture, so we can even lose a cable without going down. Each node must have a RAID controller card; in our example, the MegaRAID '428' controller is used. The controller has been modified to be cluster-aware: when one controller sees another controller on its SCSI bus, it is able to identify its counterpart in a cluster. The system motherboard must also be modified to be cluster-aware.

One important consideration in the RAID cluster structure is the SCSI bus termination. The bus must remain properly terminated (to allow reliable access to the data) at all times: a cable removal, system removal, or power failure on one node must not interfere with reliable access. This is accomplished with the AMI bus extender/terminators. These are the essential components in a cluster; they, of course, require the Windows NT cluster software to complete the system's operation. A final note on the hardware configuration is that dual UPSs, each tied to a separately fused AC line, should be used. This completes the full redundancy of the system.

AMI has created a cluster kit consisting of a pair of cluster-ready RAID controllers, as well as the RAID storage unit and the components that ensure full redundancy and proper SCSI termination on the failure of any single component. Below is our rack-mount cluster system, one convenient package for a full cluster. The main computers are based on either a dual Pentium II motherboard or a quad Pentium Pro system. A quad Pentium II system is also available but requires different packaging to accommodate the unique physical features of the quad Pentium II board. Other options include a second storage rack with all the accompanying cabling. The customer may also choose the size and quantity of hard disc drives required.
As an added point of redundancy on the network, we also offer dual network cards for the global net [plus a NIC for the internal (local) LAN connection], along with an adaptive switch. This arrangement maintains connectivity even through a NIC failure. Indeed, up to four extra NICs may be added, with each NIC sharing the total load. On a NIC failure, the remaining NICs will adapt and share the total load without any loss to the user, as sketched below.
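Adaptive load sharing across redundant NICs reduces, at its core, to re-hashing traffic over whichever links are still up. The sketch below shows the idea only; the names and the flow-hashing scheme are assumptions for illustration, not how any particular adapter team implements it.

    # Illustrative NIC team: traffic flows hash onto the live adapters,
    # and a failed adapter simply drops out of the hash.
    def pick_nic(nics, flow_id):
        live = [name for name, up in nics.items() if up]
        if not live:
            raise RuntimeError("all NICs in the team have failed")
        return live[hash(flow_id) % len(live)]

    nics = {"nic0": True, "nic1": True, "nic2": True}
    print(pick_nic(nics, "client-42"))   # some live NIC carries the flow
    nics["nic1"] = False                 # adapter failure detected
    print(pick_nic(nics, "client-42"))   # traffic re-shared, no user impact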
Conclusion: This paper has briefly covered developments in PC systems today that address reliable, mission-critical computing. More and more applications continue to be added to computers. The Internet has become a backbone of commerce, and cutting that vital link can effectively shut down an organization; computers control the links to the Internet highways. This is one universal example that will drive the need for 100% uptime and, therefore, the desire to have RAID and cluster systems as essential, commonplace components of an enterprise-wide network of computers.

Peter DeVita