

Clustering Implementations

Since clustering implementations vary from manufacturer to manufacturer, a closer look may help to illustrate some of the key features these manufacturers provide. In the following sections, you will explore two of the major implementations and the differences between them. Let’s begin with a look at the Qualix Group offering, then move on to the Microsoft and Digital Equipment Corporation implementation.

Qualix Group Clustering Implementation

The Qualix Group clustering implementation (Octopus HA+) can use a node-based methodology to connect two servers over a LAN or WAN connection, as shown in Figure 10.2. These servers are configured in nondedicated mode to provide rollover support if either server fails. This is an implementation of a redundant server configuration. As long as the LAN or WAN connection is still operating, if Server A fails, then Server B will assume the responsibilities of Server A, and clients will continue to operate, as shown in Figure 10.3. If a WAN connection is used, these clients may operate more slowly, because a WAN link usually runs far slower than a LAN connection, but they will still be able to access system resources.
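The rollover behavior described above can be sketched in a few lines. This is only an illustration of the pattern, not Octopus HA+ code; the `Server` class, the `rollover` function, and the service names are all hypothetical:

```python
# Minimal sketch of the rollover pattern: when one server in the pair
# fails, the surviving partner assumes its responsibilities.
# All names here are illustrative, not part of the Octopus HA+ product.

class Server:
    def __init__(self, name, services):
        self.name = name
        self.services = set(services)   # services this server normally hosts
        self.alive = True

def rollover(failed, partner):
    """If the failed server is down, move its services to the survivor."""
    if not failed.alive and partner.alive:
        partner.services |= failed.services
        failed.services = set()
    return partner

server_a = Server("A", {"file-shares"})
server_b = Server("B", {"print-queue"})

server_a.alive = False        # simulate Server A failing
rollover(server_a, server_b)  # Server B now carries both workloads

print(sorted(server_b.services))
```

Over a WAN link the same logic applies; only the speed of client access changes, not which server answers.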


Figure 10.2  Using Octopus HA+ to connect two servers as a single node.


Figure 10.3  Maintaining client access with Octopus HA+ during a single server failure.

You can also configure Octopus HA+ to function as a generic protection server. A protection server can assume the capabilities of any member server; it can even assume the role of more than one member server at a time. This is an implementation of a fault-tolerant server. Consider the following example: assume you have two servers (A and B) with clients connected to each server on separate network subnets. Both subnets are also connected to the protection server (C). This scenario is depicted in Figure 10.4.


Figure 10.4  Protecting your clients using Octopus HA+ and a protection server.

If Server A fails, the protection server (Server C) assumes the role of the failed server, as illustrated in Figure 10.5. If Server B then fails too, the protection server assumes the role of Server B as well, as shown in Figure 10.6. As far as I have been able to determine, this is a unique feature of Octopus HA+. Not only does it offer server redundancy, but server fault tolerance as well, and it operates over LAN or WAN connections. To my mind, this solution shows promise for use in a mission-critical environment.


Figure 10.5  A protection server assuming the role of a single failed member server.


Figure 10.6  A protection server assuming the role of dual failed member servers.
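The generic protection-server behavior just described can be sketched as follows. This is an illustration of the pattern only; the class and method names are hypothetical, not Octopus HA+ interfaces:

```python
# Sketch of the protection-server pattern: one standby machine (C)
# can absorb the roles of any number of failed member servers.
# Illustrative only; not actual Octopus HA+ code.

class ProtectionServer:
    def __init__(self, name):
        self.name = name
        self.assumed_roles = {}          # failed server name -> its services

    def absorb(self, failed_name, services):
        """Take over every service the failed member was providing."""
        self.assumed_roles[failed_name] = set(services)

    def active_services(self):
        """All services this protection server is currently standing in for."""
        out = set()
        for services in self.assumed_roles.values():
            out |= services
        return out

c = ProtectionServer("C")
c.absorb("A", {"database"})       # Server A fails (as in Figure 10.5)
c.absorb("B", {"file-shares"})    # Server B fails too (as in Figure 10.6)

print(sorted(c.active_services()))
```

The key point is that the protection server accumulates roles rather than replacing one with another, which is what distinguishes fault tolerance here from simple one-for-one redundancy.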

Microsoft And Digital Equipment Corporation Clustering Implementation

The Microsoft and Digital Equipment Corporation clustering implementation is designed strictly to provide redundancy on a per-server or per-service basis. Both implementations are designed around two servers operating as a single node, but this is also where they begin to diverge.

The Digital Equipment Corporation clustering implementation, like the Qualix Group’s, is based on standalone servers. Each server contains its own hardware, which is used in the case of a single point of failure. Both the Microsoft and Digital Equipment Corporation implementations require that you configure the system drives to provide the suggested data redundancy on each server. To protect a server in case of a disk failure, you must use mirror sets on its disk subsystem. To improve performance, you could use a stripe set with parity and gain a bit of fault tolerance as well, or you could use both mirror sets and stripe sets with parity. Both the Microsoft and Digital Equipment Corporation implementations are also designed to have the slave server assume the functions of the master server if the master fails.
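As a rough guide to the trade-off between the two NT software-RAID options just mentioned, the usable capacity works out as follows. This is a back-of-the-envelope sketch assuming equal-size disks, not vendor-specific tooling:

```python
# Usable-capacity arithmetic for the two Windows NT software-RAID
# options discussed above, assuming all disks are the same size.

def mirror_set_capacity(disk_gb):
    # A mirror set writes every block to both disks,
    # so usable space equals the size of one disk.
    return disk_gb

def stripe_with_parity_capacity(disk_gb, n_disks):
    # A stripe set with parity devotes one disk's worth
    # of space to parity information.
    return disk_gb * (n_disks - 1)

print(mirror_set_capacity(4))             # two 4 GB disks mirrored
print(stripe_with_parity_capacity(4, 5))  # five 4 GB disks striped with parity
```

The mirror set survives a disk failure at the cost of half the raw space; the stripe set with parity gives better capacity and read performance but only tolerates a single disk failure, which is why the text suggests combining the two where budget allows.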

The Microsoft implementation, on the other hand, uses a shared disk subsystem. There are actually two servers, but only one disk subsystem, as shown in Figure 10.7.


Figure 10.7  A shared disk subsystem in a Microsoft node.

This makes it easier to replicate data because, in actuality, there is only one copy of the data. But it also introduces a single point of failure into the clustering implementation. Consider, for example, a power supply failure on the shared disk subsystem. If you do not have a redundant power supply, the entire shared disk subsystem will fail, and then it doesn’t do you much good to have redundant servers, does it? Of course, in reality, no one should run a shared disk subsystem without a redundant power supply, nor should your servers be without one. You should also have a UPS available to filter the incoming AC power and provide at least 10 minutes of AC power during an outage under full-load conditions. I prefer a minimum of 20 minutes of AC power under full load myself, but 10 minutes is adequate. A 20-minute UPS provides you with a bit more capacity, which might help in the long run, because all batteries deteriorate over time. When your batteries reach half of their load capacity, it’s time to replace the batteries in your UPS.
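The battery-replacement rule of thumb above is easy to express as a quick check. The 20-minute original runtime here is just the example figure from the text:

```python
# Rule of thumb from the text: replace the UPS batteries once they can
# deliver only half of their original full-load runtime.

ORIGINAL_RUNTIME_MIN = 20   # full-load runtime when the batteries were new

def needs_replacement(measured_runtime_min):
    """True when the measured full-load runtime is at or below half capacity."""
    return measured_runtime_min <= ORIGINAL_RUNTIME_MIN / 2

print(needs_replacement(14))  # still above half capacity
print(needs_replacement(9))   # at or below half capacity: replace
```

The measured runtime comes from the monthly self-discharge test recommended in the tip that follows.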


TIP:  Don’t forget to schedule your UPS software to automatically test the UPS each day to make sure it is functioning properly. You should also schedule a self-discharge at least once a month to verify the battery capacity. If your UPS can record temperature and humidity variations, have the UPS log the temperature and humidity levels as well. This can help you with long-term reliability problems.

One advantage of the Microsoft clustering implementation is that you can use it every day to improve the performance of your network; you don’t have to use the clustering software only when a server failure occurs. Much as a redundant disk system under Windows NT improves read throughput by accessing both disks simultaneously, the Wolfpack implementation can access both servers simultaneously. This does not mean that both servers provide the same service (such as SQL Server), but rather that one server in the node can provide one service (such as SQL Server) while the other server in the node provides another service (such as shared files). This enables you to load balance your virtual server (the two nodes joined into a single whole) and provide improved response time to your clients. This feature makes the Microsoft implementation more cost effective than some other solutions, because you are making full use of both servers in the node.
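The active/active idea behind this design can be sketched as a simple placement function. The names and the round-robin policy here are illustrative assumptions, not the actual Wolfpack API:

```python
# Sketch of active/active service placement: each service runs on exactly
# one node, spread so both nodes do useful work every day; when a node
# fails, its services land on the survivor. Illustrative names only.

def place_services(services, nodes):
    """Assign services round-robin across the healthy nodes.

    nodes maps node name -> healthy flag.
    """
    healthy = sorted(n for n, up in nodes.items() if up)
    if not healthy:
        raise RuntimeError("no healthy node in the cluster")
    placement = {}
    for i, svc in enumerate(sorted(services)):
        placement[svc] = healthy[i % len(healthy)]
    return placement

nodes = {"node1": True, "node2": True}
print(place_services({"SQL Server", "file-shares"}, nodes))

nodes["node1"] = False   # node1 fails: everything moves to node2
print(place_services({"SQL Server", "file-shares"}, nodes))
```

With both nodes healthy, each hosts one service; after a failure, the same function concentrates both services on the survivor, which is the redundancy half of the story.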

But being cost effective isn’t the only requirement you should consider, which brings us to the subject of our next discussion: choosing the best clustering implementation to suit your needs.

