WO2001082079A2 - Method and apparatus for providing fault tolerant communications between network appliances - Google Patents

Method and apparatus for providing fault tolerant communications between network appliances

Info

Publication number
WO2001082079A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
communications
channel
channels
storage
Prior art date
Application number
PCT/US2001/012864
Other languages
French (fr)
Other versions
WO2001082079A3 (en)
WO2001082079A9 (en)
Inventor
Daniel A. Davis
Marty P. Johnson
Ben H. Mcmillan, Jr.
Original Assignee
Ciprico, Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ciprico, Inc filed Critical Ciprico, Inc
Priority to AU2001257132A priority Critical patent/AU2001257132A1/en
Publication of WO2001082079A2 publication Critical patent/WO2001082079A2/en
Publication of WO2001082079A9 publication Critical patent/WO2001082079A9/en
Publication of WO2001082079A3 publication Critical patent/WO2001082079A3/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2002 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F 11/2007 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media
    • G06F 11/201 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media between storage system components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2053 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F 11/2089 Redundant storage control functionality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/14 Multichannel or multilink protocols
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/40 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Definitions

  • the invention relates to network appliances and, more particularly, the invention relates to a method and apparatus for providing fault tolerant communications between network appliances.
  • Network appliances may include a general purpose computer that executes particular software to perform a specific network task, such as file server services, domain name services, data storage services, and the like. Because these network appliances have become important to the day-to-day operation of a network, the appliances are generally required to be fault-tolerant.
  • fault tolerance is accomplished by using redundant appliances, such that, if one appliance becomes disabled, another appliance takes over its duties on the network.
  • the process for transferring operations from one appliance to another leads to a loss of network information. For instance, if a pair of redundant data storage units are operating on a network and one unit fails, the second unit needs to immediately perform the duties of the failed unit.
  • the delay in transitioning from one storage unit to another may cause a loss of some data.
  • the communications network amongst appliances ensures that the appliances have knowledge of the present configuration information of other appliances that are connected to the network. Such communications are accomplished through a single link that informs another appliance of a catastrophic failure of a given appliance.
  • the communications is provided by using a remote procedure call (RPC) technique that is supported by a number of software manufacturers. Such notification causes the other appliance to take over the network functions that were provided by the failed appliance.
  • RPC: remote procedure call
  • Such a single link is prone to false failure notifications and limited diagnostic information transfer. For example, if the single link between appliances is severed, the system may believe the appliance has failed when it has not.
  • the disadvantages associated with the prior art are overcome by the present invention of a method and apparatus for performing fault-tolerant network computing using redundant communications modules that communicate with one another through a plurality of communications paths.
  • the apparatus comprises a pair of network appliances coupled to a network.
  • the appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance.
  • Each appliance monitors the status and present configuration of at least one other appliance using multiple, redundant communication channels.
  • the communication channels are formed using a plurality of network interface cards.
  • the apparatus comprises a pair of storage controller modules (SCM) that are coupled to a storage pool, i.e., one or more data storage arrays.
  • the storage controller modules are coupled to a host network (or local area network (LAN)).
  • the network comprises a plurality of client computers that are interconnected by the network.
  • Each SCM comprises a status message generator and a status message monitor.
  • the status message generators produce periodic status messages (referred to as heartbeat messages) on multiple communications channels.
  • the status message monitors monitor all the communications channels and analyze status messages to detect failed communications channels. Upon detecting a failed channel, the monitor executes a fault analyzer to determine the cause of a fault and a remedy.
  • the communications module facilitates communication on a plurality of independent logical channels to achieve synchronization of configuration information across the network appliances.
  • the module uses remote procedure calls on multiple channels to create a redundant fault tolerant communications protocol. When a channel fails, the module rapidly reconnects the channel (if possible) or identifies the fault to the fault analyzer.
  • FIG. 1 depicts a block diagram of one embodiment of the present invention
  • FIG. 2 depicts a functional block diagram of the communications channels that interconnect a pair of storage controller modules;
  • FIG. 3 depicts a state diagram for a communications module
  • FIG. 4 depicts a series of channel tables that are used during initialization
  • FIG. 5 depicts a series of channel tables that are used during channel failure
  • FIG. 6 depicts a block diagram of the network appliances communicating through a zoned switch.
  • FIG. 7 depicts a state diagram for a lightweight communications module.
  • One embodiment of the invention is a modular, high-performance, highly scalable, highly available, fault tolerant network appliance that is illustratively embodied in a data storage system that uses a redundant channel communications technique to facilitate fault tolerant communications between appliances.
  • FIG. 1 depicts a data processing system 50 comprising a plurality of client computers 102, 104, and 106, a host network 130, and a storage system 100.
  • the storage system 100 comprises a plurality of network appliances 108 and 110 and a storage pool 112.
  • the plurality of clients comprise one or more of a network attached storage (NAS) client 102, a direct attached storage (DAS) client 104 and a storage area network (SAN) client 106.
  • the plurality of network appliances 108 and 110 comprise a storage controller module A (SCM A) 108 and storage controller module B (SCM B) 110.
  • the storage pool 112 is coupled to the storage controller modules 108, 110 via a fiber channel network 114.
  • One embodiment of the storage pool 112 comprises a pair of storage arrays 116, 118 that are coupled to the fiber channel network 114 via a pair of fiber channel switches 124, 126 and a communications gateway 120, 122.
  • the DAS client directly accesses the storage pool 112 via the fiber channel network 114, while the SAN client accesses the storage pool 112 via both the LAN 130 and the fiber channel network 114.
  • the SAN client 104 communicates via the LAN with the SCMs 108, 110 to request access to the storage pool 112.
  • the SCMs inform the SAN client 104 where in the storage arrays the requested data is located or where the data from the SAN client is to be stored.
  • the SAN client 104 then directly accesses a storage array using the location information provided by the SCMs.
  • the NAS client 106 only communicates with the storage pool 112 via the SCMs 108, 110.
  • a fiber channel network is depicted as one way of connecting the SCMs 108, 110 to the storage pool 112, the connection may be accomplished using any form of data network protocol such as SCSI, HIPPI, SSA and the like.
  • the storage system is a hierarchy of system components that are connected together within the framework established by the system architecture.
  • the major active system level components are:
  • Fiber channel switches, hubs, and gateways
  • the system architecture provides an environment in which each of the storage components that comprise the storage system embodiment of the invention operate and interact to form a cohesive storage system.
  • the architecture is centered around a pair of SCMs 108 and 110 that provide storage management functions.
  • the SCMs are connected to a host network that allows the network community to access the services offered by the SCMs 108, 110.
  • Each SCM 108, 110 is connected to the same set of networks. This allows one SCM to provide the services of the other SCM in the event that one of the SCMs becomes faulty.
  • Each SCM 108, 110 has access to the entire storage pool 112.
  • the storage pool is logically divided by assigning a particular storage device (array 116 or 118) to one of the SCMs 108, 110.
  • a storage device 116 or 118 is only assigned to one SCM 108 or 110 at a time.
  • both SCMs 108, 110 are connected to the entirety of the storage pool 112, the storage devices 116, 118 assigned to a faulted SCM can be accessed by the remaining SCM to provide its services to the network community on behalf of the faulted SCM.
  • the SCMs communicate with one another via the host networks. Since each SCM 108, 110 is connected to the same set of physical networks as the other, they are able to communicate with each other over these same links. These links allow the SCMs to exchange configuration information with each other and synchronize their operation.
  • the host network 130 is the medium through which the storage system communicates with the clients 104 and 106.
  • the SCMs 108, 110 provide network services such as NFS and HTTP to the clients 104, 106 that reside on the host network 130.
  • the host network 130 runs network protocols through which the various services are offered. These may include TCP/IP, UDP/IP, ARP, SNMP, NFS, CIFS, HTTP, NDMP, and the like.
  • front-end interfaces are network ports running file protocols.
  • the front end interfaces are facilitated by execution of communication software.
  • RSCM: remote SCM communications module
  • the back-end interface of each SCM provides channel ports running raw block access protocols.
  • the SCMs 108, 110 accept network requests from the various clients and process them according to the command issued.
  • the main function of the SCM is to act as a network-attached storage (NAS) device. It therefore communicates with the clients using file protocols such as NFSv2, NFSv3, SMB/CIFS, and HTTP.
  • the SCM converts these file protocol requests into logical block requests suitable for use by a direct-attach storage device.
  • the storage array on the back-end is a direct-attach disk array controller with RAID and caching technologies.
  • the storage array accepts the logical block requests issued to a logical volume set and converts them into a set of member disk requests suitable for a disk drive.
  • the redundant SCMs will both be connected to the same set of networks. This allows either of the SCMs to respond to the IP address of the other SCM in the event of failure of one of the SCMs.
  • the SCMs support 10BaseT, 100BaseT, and 1000BaseT.
  • the SCMs may be able to communicate with each other through a dedicated inter-SCM network 132. This optional dedicated connection is at least a 100BaseT Ethernet.
  • the SCMs 108, 110 connect to the storage arrays 116, 118 through parallel differential SCSI (not shown) or a fiber channel network 114. Each SCM 108, 110 may be connected through its own private SCSI connection to one of the ports on the storage array.
  • the storage arrays 116, 118 provide a high availability mechanism for RAID management. Each of the storage arrays provides a logical volume view of the storage to a respective SCM.
  • the SCM does not have to perform any volume management.
  • FIG. 2 depicts an embodiment of the invention having the SCMs 108, 110 coupled to the storage arrays 116, 118 via SCSI connections 200.
  • Each storage array 116, 118 comprises an array controller 202, 204 coupled to a disk array 206, 208.
  • the array controllers 202, 204 support RAID techniques to facilitate redundant, fault tolerant storage of data.
  • the SCMs 108, 110 are connected to both the host network 130 and to array controllers 202, 204. Note that every host network interface card (NIC) 210 connection on one SCM is duplicated on the other. This allows a SCM to assume the IP address of the other on every network in the event of a SCM failure.
  • One of the NICs 212 in each SCM 108, 110 is dedicated for communications between the two SCMs.
  • each SCM 108, 110 is connected to an array controller 202, 204 through its own host SCSI port 214. All volumes in each of the storage arrays 202, 204 are dual-ported through SCSI ports 216 so that access to any volume is available to both SCMs 108, 110.
  • the SCM 108, 110 is based on a general purpose computer (PC) such as a ProLiant 1850R manufactured by COMPAQ Computer Corporation. This product is a Pentium PC platform mounted in a 3U 19" rack-mount enclosure.
  • the SCM comprises a plurality of network interface controllers 210, 212, a central processing unit (CPU) 218, a memory unit 220, support circuits 222 and SCSI ports 214. Communication amongst the SCM components is supported by a PCI bus 224.
  • the SCM employs, as a support circuit 222, dual hot-pluggable power supplies with separate AC power connections and contains three fans (one fan resides in each of the two power supplies).
  • the SCM is, for example, based on the Pentium III architecture running at 600 MHz and beyond.
  • the PC has 4 horizontal mount 32-bit 33 MHz PCI slots.
  • the PC comes equipped with 128 MB of 100 MHz SDRAM standard and is upgradable to 1 GB.
  • a Symbios 53c8xx series chipset resides on the 1850R motherboard that can be used to access the boot drive.
  • the SCM boots off the internal hard drive (also part of the memory unit 220).
  • the internal drive is, for example, a SCSI drive and provides at least 1 GB of storage.
  • the internal boot device must be able to hold the SCSI executable image, a mountable file system with all the configuration files, HTML, documentation, and the storage administration application. This information may consume anywhere from 20 to 50 MB of disk space.
  • a disk array 116, 118 that can be used with the embodiment of the present invention is the Synchronix 2000 manufactured by ECCS, Inc. of Tinton Falls, New Jersey.
  • the Synchronix 2000 provides disk storage, volume management and RAID capability. These functions may also be provided by the SCM through the use of custom PCI I/O cards.
  • each of the storage arrays 116, 118 uses 4 PCI slots in a 1 host/3 target configuration, so 6 SCSI target channels are available, allowing six Synchronix 2000 units each with thirty 50GB disk drives. As such, the 180 drives provide 9 TB of total storage.
  • Each storage array 116, 118 can utilize RAID techniques through a RAID processor 226 such that data redundancy and disk drive fault tolerance is achieved.
  • the purpose of the RSCM 152, 156 is to support communication between the two SCMs over multiple ethernet interfaces, while providing the following two benefits: 1. Provide fault-tolerance by trying multiple channels and allowing modules to avoid known failed or congested network channels. 2. Provide performance enhancements by load-balancing multiple network channels.
  • When the RSCM 152, 156 starts, the module must build a data structure of logical channel information from available configuration information. The interfaces and IP addresses of the local system are not sufficient for this configuration, as the corresponding interfaces and IP addresses of the remote system are also needed. Additionally, the network interface card (NIC) information includes information on network masks and broadcast addresses. From this information, the RSCM 152, 156 builds a data structure of RSCM logical channels, i.e., an RSCM logical channel table.
  • the RSCM logical channel table has a fixed size equal to the number of network interfaces the system can support, and so some channels may not be configured.
  • FIG. 3 depicts a state diagram that depicts the states of each channel extending from the INIT state. The state transitions 301-311 are identified in FIG. 3 and described in Table I.
  • This transition occurs on initialization of the RSCM for each configured channel.
  • This loop-transition occurs whenever a channel error or timeout occurs in the initializing state 312. Until the system has had time to initialize, the channel state is not known and no channel will be failed due to timeouts.
  • This transition to the good state 313 occurs when any successful connection is made on the channel. This includes connections made by the status monitoring module.
  • This transition to the failed state 314 occurs when a certain number of timeouts or communication errors have occurred on any one channel.
  • the principal logic that allows the system to load balance and provide fault tolerant communication is the capability of rscm_ClientOpenChannel() to try all RSCM Logical Channels in the Initializing or Good state.
  • the RSCM keeps a pointer into the RSCM Logical Channel Table that allows rscm_ClientOpenChannel() to implement a round-robin algorithm over all RSCM Logical channels. If all Initializing or Good channels fail to make a connection, any configured failed channels are tried next. If this also fails, then rscm_ClientOpenChannel() returns -1 to indicate an error.
  • FIG. 4 depicts the progression of channel table entries as communications is established between the SCMs.
  • the channel table 400 has eight entries 0-7, and there are four configured channels 0-3. After initialization, the four configured channels 0-3 are initializing, but have not yet successfully connected. The channel to try next is channel 0.
  • the system tries channel 0 (as shown by pointer 406) and successfully connects. Channel 0 is now "Good", i.e., in the good state.
  • Table 402 represents the current state of the system. If all channels connect and work, they will all be Good after the first four connections are made.
  • the table 404 represents the state of the system with all four configured channels operating.
  • the channel table 500 has eight entries 0-7 and there are three configured channels 0-2. All of the channels 0-2 are in the Good state, but a network switch 600 has just failed. The failure affects all the channels except one (channel 2) because the switch is zoned to create two separate LANs (zone 1 602 and zone 2 604).
  • Three separate tasks are attempting to make a connection using the failed switch: (a) a shared file manager (task 1), (b) a status monitor fault analyzer (task 2), and (c) a persistent shared object manager (task 3) using the RSCM when the failure first affects the system.
  • tasks and others that use the RSCM are described in U.S. patent application serial number , filed simultaneously herewith (Attorney Docket ECCS 005), which is incorporated herein by reference. For simplicity, this example assumes there are no connections when the failure first affects the system and the channel pointer 406 points to logical channel 0 at the failure.
  • the three connecting tasks (tasks 1, 2, 3) may all try to connect at the same time, but each will try to connect through a different channel 0, 1, 2.
  • a "light weight" version of the RSCM may be used in conjunction with the RSCM to provide procedure calls to a remotely located SCM.
  • This adjunct module is referred to herein as a remote SCM Light Weight Procedure Call (RSCMLWPC) module.
  • the purpose of the RSCMLWPC module is to provide a transaction oriented module that uses the RSCM, so that one failed transaction initiated on the local system may be successfully retried. Furthermore, each connection may be handled in a separate light weight process, so that one remote transaction may block waiting for another to complete, and each may use normal, single- system synchronization calls.
  • the unit of transaction is the remote procedure call.
  • the remote procedure call mechanism is not as elaborate as that used for single channel, remote procedure call techniques of the prior art.
  • the remote system calls rscmlwpc_svcrun() to start the service for a particular port (similar to rscm_ServerOpenChannel()). This starts a task to accept concurrent connections to the procedure call service. The service will stop if rscmlwpc_svcstop() is called.
  • the local system opens a socket connection to the server using the rscmlwpc_create() call. Thereafter, a remote function call can be made on that socket by calling rscmlwpc_call().
  • the task running the service calls the appropriate function using appropriate arguments once it has received all the data. Even a zero byte reply via rscmlwpc_reply() results in acknowledgement data being returned. An error occurs if the call is made with no reply.
  • rscmlwpc_call() blocks until the reply is returned.
  • rscmlwpc_svcrun() also has some stringent requirements on implementation. Each program call executed must be executed in a separate thread of execution. This is due to the way that the procedure call library will be used.
  • TOKEN_ID := UINT32 # A unique identifier for each client
  • the service function has been called, and a reply has been sent (the Served state). Respond with that same reply, but do not call the client function.
  • the RSCMLWPC module provides a process for initiating communications between SCMs and supporting the communications using procedure calls on multiple communications channels.

Abstract

A method and apparatus for performing fault-tolerant network computing using redundant communications modules. The apparatus comprises a pair of network appliances coupled to a network. The appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance. Each appliance monitors the status of another appliance using multiple, redundant communication channels that are formed using a plurality of network interface cards.

Description

METHOD AND APPARATUS FOR PROVIDING FAULT TOLERANT COMMUNICATIONS BETWEEN NETWORK APPLIANCES
BACKGROUND OF THE DISCLOSURE
1. Field of the Invention
The invention relates to network appliances and, more particularly, the invention relates to a method and apparatus for providing fault tolerant communications between network appliances.
2. Description of the Background Art
Data processing and storage systems that are connected to a network to perform task specific operations are known as network appliances. Network appliances may include a general purpose computer that executes particular software to perform a specific network task, such as file server services, domain name services, data storage services, and the like. Because these network appliances have become important to the day-to-day operation of a network, the appliances are generally required to be fault-tolerant.
Typically, fault tolerance is accomplished by using redundant appliances, such that, if one appliance becomes disabled, another appliance takes over its duties on the network. However, the process for transferring operations from one appliance to another leads to a loss of network information. For instance, if a pair of redundant data storage units are operating on a network and one unit fails, the second unit needs to immediately perform the duties of the failed unit. However, the delay in transitioning from one storage unit to another may cause a loss of some data.
One factor in performing a rapid transition between appliances and rapid recovery from a failure is to enable each redundant appliance to effectively communicate with another redundant appliance. The communications network amongst appliances ensures that the appliances have knowledge of the present configuration information of other appliances that are connected to the network. Such communications are accomplished through a single link that informs another appliance of a catastrophic failure of a given appliance. The communication is provided by using a remote procedure call (RPC) technique that is supported by a number of software manufacturers. Such notification causes the other appliance to take over the network functions that were provided by the failed appliance. However, such a single link is prone to false failure notifications and limited diagnostic information transfer. For example, if the single link between appliances is severed, the system may believe the appliance has failed when it has not.
Therefore, a need exists in the art for a method and apparatus for providing robust, fault tolerant communications between fault tolerant network appliances.
SUMMARY OF THE INVENTION
The disadvantages associated with the prior art are overcome by the present invention of a method and apparatus for performing fault-tolerant network computing using redundant communications modules that communicate with one another through a plurality of communications paths. The apparatus comprises a pair of network appliances coupled to a network. The appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance. Each appliance monitors the status and present configuration of at least one other appliance using multiple, redundant communication channels. The communication channels are formed using a plurality of network interface cards.
In one embodiment of the invention, the apparatus comprises a pair of storage controller modules (SCM) that are coupled to a storage pool, i.e., one or more data storage arrays. The storage controller modules are coupled to a host network (or local area network (LAN)). The network comprises a plurality of client computers that are interconnected by the network. Each SCM comprises a status message generator and a status message monitor. The status message generators produce periodic status messages (referred to as heartbeat messages) on multiple communications channels. The status message monitors monitor all the communications channels and analyze status messages to detect failed communications channels. Upon detecting a failed channel, the monitor executes a fault analyzer to determine the cause of a fault and a remedy.
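The heartbeat mechanism above is described only at a functional level; the C sketch below shows one plausible shape for such a status message generator and monitor. The constants, the message format, the callback signatures, and the function names (heartbeat_generate, heartbeat_note_arrival, heartbeat_scan) are illustrative assumptions, not details taken from the patent.

```c
/* Hedged sketch of a per-channel heartbeat generator and monitor.
 * All names and constants here are assumptions for illustration only. */
#include <stddef.h>
#include <time.h>

#define MAX_CHANNELS        8   /* assumed channel table size            */
#define HEARTBEAT_TIMEOUT_S 5   /* assumed silence threshold per channel */

struct channel_status {
    int    configured;
    time_t last_heard;          /* time the last heartbeat arrived */
};

static struct channel_status chan[MAX_CHANNELS];

/* Status message generator: emit one heartbeat on every configured channel. */
void heartbeat_generate(void (*send_on_channel)(int idx, const void *msg, size_t len))
{
    static const char msg[] = "HEARTBEAT";
    for (int i = 0; i < MAX_CHANNELS; i++)
        if (chan[i].configured)
            send_on_channel(i, msg, sizeof(msg));
}

/* Status message monitor: record each arrival on a channel... */
void heartbeat_note_arrival(int idx)
{
    chan[idx].last_heard = time(NULL);
}

/* ...and periodically hand channels that have gone silent to a fault
 * analyzer, which determines the cause of the fault and a remedy. */
int heartbeat_scan(void (*fault_analyzer)(int idx))
{
    int failed = 0;
    time_t now = time(NULL);
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (chan[i].configured && now - chan[i].last_heard > HEARTBEAT_TIMEOUT_S) {
            fault_analyzer(i);  /* channel is presumed failed */
            failed++;
        }
    }
    return failed;
}
```

In such a sketch the generator would be driven from a periodic timer and the monitor from the receive path of each channel, so that a single failed link never silences the status exchange entirely.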
The communications module facilitates communication on a plurality of independent logical channels to achieve synchronization of configuration information across the network appliances. The module uses remote procedure calls on multiple channels to create a redundant fault tolerant communications protocol. When a channel fails, the module rapidly reconnects the channel (if possible) or identifies the fault to the fault analyzer.
BRIEF DESCRIPTION OF THE DRAWINGS
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
FIG. 1 depicts a block diagram of one embodiment of the present invention;
FIG. 2 depicts a functional block diagram of the communications channels that interconnect a pair of storage controller modules;
FIG. 3 depicts a state diagram for a communications module;
FIG. 4 depicts a series of channel tables that are used during initialization;
FIG. 5 depicts a series of channel tables that are used during channel failure;
FIG. 6 depicts a block diagram of the network appliances communicating through a zoned switch; and
FIG. 7 depicts a state diagram for a lightweight communications module.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION
One embodiment of the invention is a modular, high-performance, highly scalable, highly available, fault tolerant network appliance that is illustratively embodied in a data storage system that uses a redundant channel communications technique to facilitate fault tolerant communications between appliances.
FIG. 1 depicts a data processing system 50 comprising a plurality of client computers 102, 104, and 106, a host network 130, and a storage system 100. The storage system 100 comprises a plurality of network appliances 108 and 110 and a storage pool 112. The plurality of clients comprise one or more of a network attached storage (NAS) client 102, a direct attached storage (DAS) client 104 and a storage area network (SAN) client 106. The plurality of network appliances 108 and 110 comprise a storage controller module A (SCM A) 108 and storage controller module B (SCM B) 110. The storage pool 112 is coupled to the storage controller modules 108, 110 via a fiber channel network 114. One embodiment of the storage pool 112 comprises a pair of storage arrays 116, 118 that are coupled to the fiber channel network 114 via a pair of fiber channel switches 124, 126 and a communications gateway 120, 122. A tape library 128 is also provided for storage backup. In storage system 100, the DAS client directly accesses the storage pool 112 via the fiber channel network 114, while the SAN client accesses the storage pool 112 via both the LAN 130 and the fiber channel network 114. For example, the SAN client 104 communicates via the LAN with the SCMs 108, 110 to request access to the storage pool 112. The SCMs inform the SAN client 104 where in the storage arrays the requested data is located or where the data from the SAN client is to be stored. The SAN client 104 then directly accesses a storage array using the location information provided by the SCMs. The NAS client 106 only communicates with the storage pool 112 via the SCMs 108, 110. Although a fiber channel network is depicted as one way of connecting the SCMs 108, 110 to the storage pool 112, the connection may be accomplished using any form of data network protocol such as SCSI, HIPPI, SSA and the like.
The storage system is a hierarchy of system components that are connected together within the framework established by the system architecture. The major active system level components are:
SCM - Storage Controller Module
SDM - Storage Device Module (Storage Pool)
Fiber channel switches, hubs, and gateways
The system architecture provides an environment in which each of the storage components that comprise the storage system embodiment of the invention operate and interact to form a cohesive storage system.
The architecture is centered around a pair of SCMs 108 and 110 that provide storage management functions. The SCMs are connected to a host network that allows the network community to access the services offered by the SCMs 108, 110. Each SCM 108, 110 is connected to the same set of networks. This allows one SCM to provide the services of the other SCM in the event that one of the SCMs becomes faulty. Each SCM 108, 110 has access to the entire storage pool 112. The storage pool is logically divided by assigning a particular storage device (array 116 or 118) to one of the SCMs 108, 110. A storage device 116 or 118 is only assigned to one SCM 108 or 110 at a time. Since both SCMs 108, 110 are connected to the entirety of the storage pool 112, the storage devices 116, 118 assigned to a faulted SCM can be accessed by the remaining SCM to provide its services to the network community on behalf of the faulted SCM. The SCMs communicate with one another via the host networks. Since each SCM 108, 110 is connected to the same set of physical networks as the other, they are able to communicate with each other over these same links. These links allow the SCMs to exchange configuration information with each other and synchronize their operation.
The host network 130 is the medium through which the storage system communicates with the clients 104 and 106. The SCMs 108, 110 provide network services such as NFS and HTTP to the clients 104, 106 that reside on the host network 130. The host network 130 runs network protocols through which the various services are offered. These may include TCP/IP, UDP/IP, ARP, SNMP, NFS, CIFS, HTTP, NDMP, and the like.
From an SCM point of view, its front-end interfaces are network ports running file protocols. The front end interfaces are facilitated by execution of communication software. The software that facilitates configuration information synchronization across multiple SCMs is referred to as the remote SCM communications module (RSCM) 152, 156 that is stored in the memory 150, 154 of each SCM 108, 110. The back-end interface of each SCM provides channel ports running raw block access protocols.
The SCMs 108, 110 accept network requests from the various clients and process them according to the command issued. The main function of the SCM is to act as a network-attached storage (NAS) device. It therefore communicates with the clients using file protocols such as NFSv2, NFSv3, SMB/CIFS, and HTTP. The SCM converts these file protocol requests into logical block requests suitable for use by a direct-attach storage device.
The storage array on the back-end is a direct-attach disk array controller with RAID and caching technologies. The storage array accepts the logical block requests issued to a logical volume set and converts them into a set of member disk requests suitable for a disk drive.
The redundant SCMs will both be connected to the same set of networks. This allows either of the SCMs to respond to the IP address of the other SCM in the event of failure of one of the SCMs. The SCMs support 10BaseT, 100BaseT, and 1000BaseT. Optionally, the SCMs may be able to communicate with each other through a dedicated inter-SCM network 132. This optional dedicated connection is at least a 100BaseT Ethernet.
The SCMs 108, 110 connect to the storage arrays 116, 118 through parallel differential SCSI (not shown) or a fiber channel network 114. Each SCM 108, 110 may be connected through its own private SCSI connection to one of the ports on the storage array.
The storage arrays 116, 118 provide a high availability mechanism for RAID management. Each of the storage arrays provides a logical volume view of the storage to a respective SCM. The SCM does not have to perform any volume management.
The illustrative storage system summarily described above is described in detail in U.S. patent application serial number , filed simultaneously herewith (Attorney docket ECCS 005), which is hereby incorporated herein by reference.
FIG. 2 depicts an embodiment of the invention having the SCMs 108, 110 coupled to the storage arrays 116, 118 via SCSI connections 200. Each storage array 116, 118 comprises an array controller 202, 204 coupled to a disk array 206, 208. The array controllers 202, 204 support RAID techniques to facilitate redundant, fault tolerant storage of data. The SCMs 108, 110 are connected to both the host network 130 and to array controllers 202, 204. Note that every host network interface card (NIC) 210 connection on one SCM is duplicated on the other. This allows a SCM to assume the IP address of the other on every network in the event of a SCM failure. One of the NICs 212 in each SCM 108, 110 is dedicated for communications between the two SCMs.
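The patent states that a surviving SCM assumes the IP address of its failed peer but does not spell out the mechanism. The sketch below shows one conventional way this is done on a Unix-like system: the peer's address is added as an alias on a local interface, after which a gratuitous ARP would normally be broadcast. The interface label, the address handling, and the function name are hypothetical illustrations, not the patent's implementation.

```c
/* Illustrative IP takeover: bring up the failed peer's address as an alias
 * on a local NIC so this SCM can answer requests addressed to the peer.
 * Assumes a Linux-style alias interface such as "eth0:1"; requires root. */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int takeover_peer_ip(const char *alias_ifname, const char *peer_ip)
{
    struct ifreq ifr;
    struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, alias_ifname, IFNAMSIZ - 1);   /* e.g. "eth0:1" */
    sin->sin_family = AF_INET;
    if (inet_pton(AF_INET, peer_ip, &sin->sin_addr) != 1) {
        close(fd);
        return -1;
    }

    if (ioctl(fd, SIOCSIFADDR, &ifr) < 0) {   /* assign the peer's address */
        close(fd);
        return -1;
    }
    /* A gratuitous ARP would typically be sent next so that switches and
     * clients map the taken-over address to this SCM's MAC address. */
    close(fd);
    return 0;
}
```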
On the target channel side of the SCM, each SCM 108, 110 is connected to an array controller 202, 204 through its own host SCSI port 214. All volumes in each of the storage arrays 202, 204 are dual-ported through SCSI ports 216 so that access to any volume is available to both SCMs 108, 110.
The SCM 108, 110 is based on a general purpose computer (PC) such as a ProLiant 1850R manufactured by COMPAQ Computer Corporation. This product is a Pentium PC platform mounted in a 3U 19" rack-mount enclosure. The SCM comprises a plurality of network interface controllers 210, 212, a central processing unit (CPU) 218, a memory unit 220, support circuits 222 and SCSI ports 214. Communication amongst the SCM components is supported by a PCI bus 224. The SCM employs, as a support circuit 222, dual hot-pluggable power supplies with separate AC power connections and contains three fans (one fan resides in each of the two power supplies). The SCM is, for example, based on the Pentium III architecture running at 600 MHz and beyond. The PC has 4 horizontal mount 32-bit 33 MHz PCI slots. As part of the memory (MEM) unit 220, the PC comes equipped with 128 MB of 100 MHz SDRAM standard and is upgradable to 1 GB. A Symbios 53c8xx series chipset resides on the 1850R motherboard that can be used to access the boot drive.
The SCM boots off the internal hard drive (also part of the memory unit 220). The internal drive is, for example, a SCSI drive and provides at least 1 GB of storage. The internal boot device must be able to hold the SCSI executable image, a mountable file system with all the configuration files, HTML, documentation, and the storage administration application. This information may consume anywhere from 20 to 50 MB of disk space. One example of a disk array 116, 118 that can be used with the embodiment of the present invention is the Synchronix 2000 manufactured by ECCS, Inc. of Tinton Falls, New Jersey. The Synchronix 2000 provides disk storage, volume management and RAID capability. These functions may also be provided by the SCM through the use of custom PCI I/O cards.
Depending on the I/O card configuration, multiple Synchronix 2000 units can be employed in this storage system. In one illustrative implementation of the invention, each of the storage arrays 116, 118 uses 4 PCI slots in a 1 host/3 target configuration, so 6 SCSI target channels are available, allowing six Synchronix 2000 units each with thirty 50GB disk drives. As such, the 180 drives provide 9 TB of total storage. Each storage array 116, 118 can utilize RAID techniques through a RAID processor 226 such that data redundancy and disk drive fault tolerance is achieved.
The purpose of the RSCM 152, 156 is to support communication between the two SCMs over multiple ethernet interfaces, while providing the following two benefits: 1. Provide fault-tolerance by trying multiple channels and allowing modules to avoid known failed or congested network channels. 2. Provide performance enhancements by load-balancing multiple network channels.
When the RSCM 152, 156 starts, the module must build a data structure of logical channel information from available configuration information. The interfaces and IP addresses of the local system are not sufficient for this configuration, as the corresponding interfaces and IP addresses of the remote system are also needed. Additionally, the network interface card (NIC) information includes information on network masks and broadcast addresses. From this information, the RSCM 152, 156 builds a data structure of RSCM logical channels, i.e., an RSCM logical channel table. The RSCM logical channel table has a fixed size equal to the number of network interfaces the system can support, and so some channels may not be configured. FIG. 3 depicts a state diagram that depicts the states of each channel extending from the INIT state. The state transitions 301-311 are identified in FIG. 3 and described in Table I.
TABLE I
Transition Description
301 This transition occurs on initialization of the RSCM for each configured channel.
302 This transition occurs on initialization of the RSCM for each non-configured channel.
303 This loop-transition occurs whenever a channel error or timeout occurs in the initializing state 312. Until the system has had time to initialize, the channel state is not known and no channel will be failed due to timeouts.
304 This transition to the good state 313 occurs when any successful connection is made on the channel. This includes connections made by the status monitoring module.
305 This transition to the failed state 314 occurs when a certain number of timeouts or communication errors have occurred on any one channel.
306 This transition occurs when a successful connection is made by the status monitoring module on the channel. While in the failed state 314, the channel will not be used by other modules.
307 This transition occurs when the system transitions to degraded mode after the system controller determines that the remote has failed. After this transition successful communication has failed.
308 This transition occurs when the system transitions to Degraded mode after the HASC
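As a concrete illustration of the logical channel table and the per-channel states of FIG. 3 and Table I, a minimal C sketch might look like the following. The field names, the table layout, and the error counter are assumptions, since the patent describes the table only at a functional level.

```c
/* Sketch of an RSCM logical channel table; names and layout are assumed. */
#include <netinet/in.h>

#define RSCM_MAX_CHANNELS 8          /* fixed size: the most NICs the system supports */

enum rscm_chan_state {
    RSCM_NOT_CONFIGURED,             /* transition 302: no interface/peer address known */
    RSCM_INITIALIZING,               /* transition 301: configured but not yet proven   */
    RSCM_GOOD,                       /* transition 304: a connection has succeeded      */
    RSCM_FAILED                      /* transition 305: repeated errors or timeouts     */
};

struct rscm_channel {
    enum rscm_chan_state state;
    struct in_addr       local_addr;   /* address of the local NIC                    */
    struct in_addr       remote_addr;  /* corresponding address on the remote SCM     */
    struct in_addr       netmask;
    struct in_addr       broadcast;
    unsigned             error_count;  /* consecutive errors/timeouts on this channel */
};

struct rscm_channel_table {
    struct rscm_channel chan[RSCM_MAX_CHANNELS];
    int                 next;          /* round-robin cursor used when opening channels */
};
```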
The primary functional interface for the RSCM is described in TABLE II.
TABLE II
The principal logic that allows the system to load balance and provide fault tolerant communication is the capability of rscm_ClientOpenChannel() to try all RSCM Logical Channels in the Initializing or Good state. The RSCM keeps a pointer into the RSCM Logical Channel Table that allows rscm_ClientOpenChannel() to implement a round-robin algorithm over all RSCM Logical channels. If all Initializing or Good channels fail to make a connection, any configured failed channels are tried next. If this also fails, then rscm_ClientOpenChannel() returns -1 to indicate an error.
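A sketch of that round-robin selection, reusing the channel table structures sketched above, is shown below. rscm_connect() and RSCM_ERROR_LIMIT are hypothetical stand-ins; the patent does not disclose the module's internals at this level of detail.

```c
/* Round-robin channel selection over Initializing/Good channels, falling
 * back to configured-but-Failed channels, as described in the text above. */
#define RSCM_ERROR_LIMIT 3                    /* assumed failure threshold */

int rscm_connect(struct rscm_channel *c);     /* hypothetical: connect over one channel */

int rscm_ClientOpenChannel(struct rscm_channel_table *t)
{
    /* First pass: only channels in the Initializing or Good state. */
    for (int tries = 0; tries < RSCM_MAX_CHANNELS; tries++) {
        int i = (t->next + tries) % RSCM_MAX_CHANNELS;
        struct rscm_channel *c = &t->chan[i];
        if (c->state != RSCM_INITIALIZING && c->state != RSCM_GOOD)
            continue;
        int fd = rscm_connect(c);
        if (fd >= 0) {
            c->state = RSCM_GOOD;                      /* transition 304 */
            c->error_count = 0;
            t->next = (i + 1) % RSCM_MAX_CHANNELS;     /* advance the cursor */
            return fd;
        }
        if (++c->error_count >= RSCM_ERROR_LIMIT)
            c->state = RSCM_FAILED;                    /* transition 305 */
    }

    /* Second pass: any configured channel already marked Failed. */
    for (int i = 0; i < RSCM_MAX_CHANNELS; i++) {
        struct rscm_channel *c = &t->chan[i];
        if (c->state != RSCM_FAILED)
            continue;
        int fd = rscm_connect(c);
        if (fd >= 0) {
            c->state = RSCM_GOOD;                      /* channel recovered */
            return fd;
        }
    }
    return -1;                                         /* no channel could connect */
}
```

In FIG. 4 terms, the cursor starts at channel 0, and each successful connection both returns a socket and promotes the channel to the Good state.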
FIG. 4 depicts the progression of channel table entries as communications is established between the SCMs. The channel table 400 has eight entries 0-7, and there are four configured channels 0-3. After initialization, the four configured channels 0-3 are initializing, but have not yet successfully connected. The channel to try next is channel 0. For the first call to rscm_ClientOpenChannel(), the system tries channel 0 (as shown by pointer 406) and successfully connects. Channel 0 is now "Good", i.e., in the good state. Table 402 represents the current state of the system. If all channels connect and work, they will all be Good after the first four connections are made. These transitions can occur without any specific task checking the goodness before use. The table 404 represents the state of the system with all four configured channels operating.
In another scenario as depicted in FIGs. 5 and 6, the channel table 500 has eight entries 0-7 and there are three configured channels 0-2. All of the channels 0-2 are in the Good state, but a network switch 600 has just failed. The failure affects all the channels except one (channel 2) because the switch is zoned to create two separate LANs (zone 1 602 and zone 2 604).
Three separate tasks are attempting to make a connection using the failed switch: (a) a shared file manager (task 1), (b) a status monitor fault analyzer (task 2), and (c) a persistent shared object manager (task 3) using the RSCM when the failure first affects the system. These tasks and others that use the RSCM are described in U.S. patent application serial number , filed simultaneously herewith (Attorney Docket ECCS 005), which is incorporated herein by reference. For simplicity, this example assumes there are no connections when the failure first affects the system and the channel pointer 406 points to logical channel 0 at the failure. The three connecting tasks (tasks 1, 2, 3) may all try to connect at the same time, but each will try to connect through a different channel 0, 1, 2.
Each time channel 0 or 1 is tried, error counters identify the channel as non-operative. Meanwhile, task 2 registers a loss of status information and invokes a failure analysis as described in U.S. patent application serial number , filed simultaneously herewith (Attorney docket ECCS 006), which is hereby incorporated herein by reference. After some time, channels 0 and 1 will transition to a failed state.
In an alternative embodiment of the invention, a "light weight" version of the RSCM may be used in conjunction with the RSCM to provide procedure calls to a remotely located SCM. This adjunct module is referred to herein as a remote SCM Light Weight Procedure Call
(RSCMLWPC) module. The purpose of the RSCMLWPC module is to provide a transaction oriented module that uses the RSCM, so that one failed transaction initiated on the local system may be successfully retried. Furthermore, each connection may be handled in a separate light weight process, so that one remote transaction may block waiting for another to complete, and each may use normal, single- system synchronization calls.
The unit of transaction is the remote procedure call. The remote procedure call mechanism is not as elaborate as that used for single channel, remote procedure call techniques of the prior art. The remote system calls rscmlwpc_svcrun() to start the service for a particular port (similar to rscm_ServerOpenChannel()). This starts a task to accept concurrent connections to the procedure call service. The service will stop if rscmlwpc_svcstop() is called.
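A sketch of the accept loop implied by rscmlwpc_svcrun() follows, using POSIX sockets and threads. The per-call worker handle_call() and the shutdown handling are illustrative assumptions, and for brevity the loop runs inline rather than in a separately spawned task as the text describes.

```c
/* Hedged sketch of an rscmlwpc-style service: one thread per accepted
 * connection, so one remote call may block waiting on another to complete. */
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile int svc_running;

static void *handle_call(void *arg)      /* hypothetical per-call worker */
{
    int csock = *(int *)arg;
    free(arg);
    /* ... read a HEADER, dispatch the requested function, send the reply ... */
    close(csock);
    return NULL;
}

int rscmlwpc_svcrun(unsigned short port)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (lsock < 0 || bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 8) < 0)
        return -1;

    svc_running = 1;
    while (svc_running) {
        int csock = accept(lsock, NULL, NULL);
        if (csock < 0)
            continue;
        int *arg = malloc(sizeof(*arg));
        if (!arg) {
            close(csock);
            continue;
        }
        *arg = csock;
        pthread_t tid;
        /* Each procedure call runs in its own light weight process (thread). */
        if (pthread_create(&tid, NULL, handle_call, arg) == 0)
            pthread_detach(tid);
        else {
            free(arg);
            close(csock);
        }
    }
    close(lsock);   /* a real stop would also close this to unblock accept() */
    return 0;
}

void rscmlwpc_svcstop(void) { svc_running = 0; }
```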
The local system opens a socket connection to the server using the rscmlwpc_create() call. Thereafter, a remote function call can be made on that socket by calling rscmlwpc_call(). The task running the service calls the appropriate function using appropriate arguments once it has received all the data. Even a zero byte reply via rscmlwpc_reply() results in acknowledgement data being returned. An error occurs if the call is made with no reply. rscmlwpc_call() blocks until the reply is returned. rscmlwpc_svcrun() also has some stringent requirements on implementation. Each program call executed must be executed in a separate thread of execution. This is due to the way that the procedure call library will be used: rscmlwpc_svcstop() must cause the server (the SCM that is initiating the channel) to stop executing. When the server is stopped, the following should be true: (1) all current light weight processes running the server program have completed and exited, (2) all sockets including the listening file descriptor have been closed/shut. On both sides of the connection the HEADER described below is the minimum data that should cause select to mark the socket readable. The following protocol messages shall be used for communication:
1. START CLIENT := Open a socket connecting to the listening file descriptor on the server. Client will call rscmlwpc_create().
2. CLIENT CALL := HEADER DATA
   1. HEADER := MAGIC_NUMBER LENGTH ERROR CALL_ID TOKEN_ID FUNCTION
   2. MAGIC_NUMBER := UINT16 (0x7e1a)
   3. LENGTH := UINT16 # Length of data in the data portion
   4. CALL_ID := UINT32 # A client controlled sequence number that increases with each client call
   5. TOKEN_ID := UINT32 # A unique identifier for each client
   6. FUNCTION := SINT32
   7. DATA := BYTES (declared length) # Variable length data of the length declared
3. SERVER REPLY := HEADER DATA
4. END CLIENT := Close your end of the socket. Use rscmlwpc_destroy()
Fault tolerance is assured as follows. The rscmlwpc_create() routine assigns a unique token to each new client on the local system. Each client has an entry in a client reply cache on the remote system identified by the unique client token. Based on a sequence number incremented for each call on the client system, the remote server system can determine whether a call is a duplicate of a previous call, a duplicate of a call currently in service, or a new call that has not been seen before. FIG. 7 depicts the states 700 that each line in the server system's client reply cache goes through to facilitate communication. Table III describes the events that cause the state transitions to the states of FIG. 7.
TABLE III
When a new call is received on the server, the manner in which the call is satisfied depends on the system state shown in TABLE IV, i.e., the table shows various network states at which failure may occur and the procedure call that is being performed at that network state.
TABLE IV
State in Client Reply Cache    Remote Procedure Call Processing Description
Starting    Satisfy by calling the service function and allow the service function to reply.
Serving    The service function has been called, but has not yet returned. Change the socket for the reply and indicate to the thread serving the request that a duplicate call has been received.
Served    The service function has been called, and a reply has been sent. Respond with that same reply, but do not call the client function.
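Putting the HEADER fields and the client reply cache together, a sketch of the server-side duplicate detection might look like the following. The in-memory header layout, the cache sizing, and the classification helper are assumptions; only the field list and the Starting/Serving/Served behavior come from the text above.

```c
/* Sketch of lightweight-RPC duplicate detection via a client reply cache.
 * Layout, sizes, and helper names are illustrative assumptions. */
#include <stdint.h>

struct lwpc_header {
    uint16_t magic;      /* MAGIC_NUMBER (0x7e1a)                               */
    uint16_t length;     /* LENGTH of the data portion                          */
    uint32_t error;      /* ERROR                                               */
    uint32_t call_id;    /* CALL_ID: client sequence number, increases per call */
    uint32_t token_id;   /* TOKEN_ID: unique identifier assigned per client     */
    int32_t  function;   /* FUNCTION to invoke on the server                    */
};

enum reply_state { SLOT_EMPTY, STARTING, SERVING, SERVED };

struct reply_cache_entry {
    uint32_t         token_id;
    uint32_t         last_call_id;
    enum reply_state state;
    uint16_t         reply_len;
    char             reply[512];     /* assumed bound on a cached reply */
};

#define REPLY_CACHE_SLOTS 32         /* assumed cache size */
static struct reply_cache_entry cache[REPLY_CACHE_SLOTS];

enum { CALL_NEW, CALL_DUP_IN_SERVICE, CALL_DUP_SERVED, CALL_STALE };

/* Classify an incoming call: a new call runs the service function (Starting);
 * a duplicate of a call still in service redirects the reply socket (Serving);
 * a duplicate of an answered call is satisfied from the cached reply (Served). */
int classify_call(const struct lwpc_header *h)
{
    struct reply_cache_entry *e = &cache[h->token_id % REPLY_CACHE_SLOTS];

    if (e->state == SLOT_EMPTY || e->token_id != h->token_id ||
        h->call_id > e->last_call_id)
        return CALL_NEW;
    if (h->call_id < e->last_call_id)
        return CALL_STALE;
    return (e->state == SERVED) ? CALL_DUP_SERVED : CALL_DUP_IN_SERVICE;
}
```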
As such, the RSCMLWPC module provides a process for initiating communications between SCMs and supporting the communications using procedure calls on multiple communications channels.
Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.

Claims

What is claimed is:
1. Apparatus for providing fault tolerant communications between multiple network appliances: a first network appliance comprising a first communications module coupled to a plurality of network interface ports; a second network appliance comprising a second communications module coupled to a plurality of network interface ports; a multiple channel communications network coupled to said network interface ports of the first and second network appliances; where said first and second communications modules provide redundant communications paths between the first and second network appliances.
2. The apparatus of claim 1 wherein said first and second communications modules comprise: means for initializing each network interface port; means for transitioning to a good state upon said initialization being completed; and means for transitioning to a failed state upon said initialization not being completed.
3. The apparatus of claim 1 further comprising means for detecting a channel failure and means for transitioning said communications traffic to an operative network channel.
4. The apparatus of claim 1 further comprising means for providing load balancing of communications traffic across a plurality of network channels.
5. The apparatus of claim 1 further comprising a means for providing fault tolerant procedure calls.
6. The apparatus of claim 1 wherein said first and second communications modules are middleware software.
7. In a system having a plurality of network appliances, a method of communicating between network appliances comprising: initializing a plurality of network interface ports to form a plurality of communications channels between said network appliances; transitioning to a good state upon said initialization being completed for a network interface port; and transitioning to a failed state upon said initialization not being completed for a network interface port.
8. The method of claim 7 further comprising detecting a channel failure in said plurality of channels and transitioning said communications traffic to an operative network channel.
9. The method of claim 7 further comprising providing load balancing of communications traffic across a plurality of network channels.
10. The method of claim 7 further comprising transmitting procedure calls through said plurality of channels.
PCT/US2001/012864 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances WO2001082079A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001257132A AU2001257132A1 (en) 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US55278000A 2000-04-20 2000-04-20
US09/552,780 2000-04-20

Publications (3)

Publication Number Publication Date
WO2001082079A2 true WO2001082079A2 (en) 2001-11-01
WO2001082079A9 WO2001082079A9 (en) 2002-04-11
WO2001082079A3 WO2001082079A3 (en) 2002-07-18

Family

ID=24206775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/012864 WO2001082079A2 (en) 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances

Country Status (2)

Country Link
AU (1) AU2001257132A1 (en)
WO (1) WO2001082079A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8670303B2 (en) * 2011-10-05 2014-03-11 Rockwell Automation Technologies, Inc. Multiple-fault-tolerant ethernet network for industrial control

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4692918A (en) * 1984-12-17 1987-09-08 At&T Bell Laboratories Reliable local data network arrangement
US5931916A (en) * 1994-12-09 1999-08-03 British Telecommunications Public Limited Company Method for retransmitting data packet to a destination host by selecting a next network address of the destination host cyclically from an address list
US5918021A (en) * 1996-06-03 1999-06-29 Intel Corporation System and method for dynamic distribution of data packets through multiple channels
EP0942554A2 (en) * 1998-01-27 1999-09-15 Moore Products Co. Network communications system manager

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1525682A1 (en) * 2002-06-28 2005-04-27 Harris Corporation System and method for supporting automatic protection switching between multiple node pairs using common agent architecture
EP1525689A1 (en) * 2002-06-28 2005-04-27 Harris Corporation Software fault tolerance between nodes
EP1525682A4 (en) * 2002-06-28 2006-04-12 Harris Corp System and method for supporting automatic protection switching between multiple node pairs using common agent architecture
EP1525689A4 (en) * 2002-06-28 2009-03-11 Harris Corp Software fault tolerance between nodes
WO2010073414A1 (en) * 2008-12-26 2010-07-01 Hitachi, Ltd. Storage system for optimally controlling a plurality of data transfer paths and method therefor
US8122151B2 (en) 2008-12-26 2012-02-21 Hitachi, Ltd. Storage system for optimally controlling a plurality of data transfer paths and method therefor

Also Published As

Publication number Publication date
WO2001082079A3 (en) 2002-07-18
AU2001257132A1 (en) 2001-11-07
WO2001082079A9 (en) 2002-04-11

Similar Documents

Publication Publication Date Title
US6701449B1 (en) Method and apparatus for monitoring and analyzing network appliance status information
US8443232B1 (en) Automatic clusterwide fail-back
US7380163B2 (en) Apparatus and method for deterministically performing active-active failover of redundant servers in response to a heartbeat link failure
US7565566B2 (en) Network storage appliance with an integrated switch
US8266472B2 (en) Method and system to provide high availability of shared data
US7028218B2 (en) Redundant multi-processor and logical processor configuration for a file server
US7401254B2 (en) Apparatus and method for a server deterministically killing a redundant server integrated within the same network storage appliance chassis
US20030158933A1 (en) Failover clustering based on input/output processors
US6718481B1 (en) Multiple hierarichal/peer domain file server with domain based, cross domain cooperative fault handling mechanisms
US6865157B1 (en) Fault tolerant shared system resource with communications passthrough providing high availability communications
EP1498816B1 (en) System and method for reliable peer communication in a clustered storage system
US20050125557A1 (en) Transaction transfer during a failover of a cluster controller
US8370494B1 (en) System and method for customized I/O fencing for preventing data corruption in computer system clusters
US7627780B2 (en) Apparatus and method for deterministically performing active-active failover of redundant servers in a network storage appliance
US7437423B1 (en) System and method for monitoring cluster partner boot status over a cluster interconnect
US20050108593A1 (en) Cluster failover from physical node to virtual node
US20030018927A1 (en) High-availability cluster virtual server system
US20140173330A1 (en) Split Brain Detection and Recovery System
US20050028028A1 (en) Method for establishing a redundant array controller module in a storage array network
JPH1185644A (en) System switching control method for redundancy system
WO2001082079A2 (en) Method and apparatus for providing fault tolerant communications between network appliances
JP2002344450A (en) High availability processing method, and executing system and processing program thereof
US11372553B1 (en) System and method to increase data center availability using rack-to-rack storage link cable
WO2001082080A2 (en) Network appliance
George et al. Analysis of a Highly Available Cluster Architecture

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/5-5/5, DRAWINGS, REPLACED BY NEW PAGES 1/5-5/5; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP