WO2001082079A2 - Method and apparatus for providing fault tolerant communications between network appliances - Google Patents

Method and apparatus for providing fault tolerant communications between network appliances

Info

Publication number
WO2001082079A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
communications
channel
channels
storage
Prior art date
Application number
PCT/US2001/012864
Other languages
French (fr)
Other versions
WO2001082079A3 (en)
WO2001082079A9 (en)
Inventor
Daniel A. Davis
Marty P. Johnson
Ben H. Mcmillan, Jr.
Original Assignee
Ciprico, Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ciprico, Inc filed Critical Ciprico, Inc
Priority to AU2001257132A priority Critical patent/AU2001257132A1/en
Publication of WO2001082079A2 publication Critical patent/WO2001082079A2/en
Publication of WO2001082079A9 publication Critical patent/WO2001082079A9/en
Publication of WO2001082079A3 publication Critical patent/WO2001082079A3/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2002 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F 11/2007 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media
    • G06F 11/201 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media between storage system components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2053 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F 11/2089 Redundant storage control functionality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/14 Multichannel or multilink protocols
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/40 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Definitions

  • the invention relates to network appliances and, more particularly, the invention relates to a method and apparatus for providing fault tolerant communications between network appliances.
  • Network appliances may include a general purpose computer that executes particular software to perform a specific network task, such as file server services, domain name services, data storage services, and the like. Because these network appliances have become important to the day-to-day operation of a network, the appliances are generally required to be fault-tolerant.
  • fault tolerance is accomplished by using redundant appliances, such that, if one appliance becomes disabled, another appliance takes over its duties on the network.
  • the process for transferring operations from one appliance to another leads to a loss of network information. For instance, if a pair of redundant data storage units are operating on a network and one unit fails, the second unit needs to immediately perform the duties of the failed unit.
  • the delay in transitioning from one storage unit to another may cause a loss of some data.
  • the communications network amongst appliances ensures that the appliances have knowledge of the present configuration information of other appliances that are connected to the network. Such communications are accomplished through a single link that informs another appliance of a catastrophic failure of a given appliance.
  • the communications is provided by using a remote procedure call (RPC) technique that is supported by a number of software manufacturers. Such notification causes the other appliance to take over the network functions that were provided by the failed appliance.
  • RPC: remote procedure call
  • Such a single link is prone to false failure notifications and limited diagnostic information transfer. For example, if the single link between appliances is severed, the system may believe the appliance has failed when it has not.
  • the disadvantages associated with the prior art are overcome by the present invention of a method and apparatus for performing fault-tolerant network computing using redundant communications modules that communicate with one another through a plurality of communications paths.
  • the apparatus comprises a pair of network appliances coupled to a network.
  • the appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance.
  • Each appliance monitors the status and present configuration of at least one other appliance using multiple, redundant communication channels.
  • the communication channels are formed using a plurality of network interface cards.
  • the apparatus comprises a pair of storage controller modules (SCM) that are coupled to a storage pool, i.e., one or more data storage arrays.
  • the storage controller modules are coupled to a host network (or local area network (LAN)).
  • the network comprises a plurality of client computers that are interconnected by the network.
  • Each SCM comprises a status message generator and a status message monitor.
  • the status message generators produce periodic status messages (referred to as heartbeat messages) on multiple communications channels.
  • the status message monitors monitor all the communications channels and analyze status messages to detect failed communications channels. Upon detecting a failed channel, the monitor executes a fault analyzer to determine the cause of a fault and a remedy.
  • the communications module facilitates communication on a plurality of independent logical channels to achieve synchronization of configuration information across the network appliances.
  • the module uses remote procedure calls on multiple channels to create a redundant fault tolerant communications protocol. When a channel fails, the module rapidly reconnects the channel (if possible) or identifies the fault to the fault analyzer.
  • FIG. 1 depicts a block diagram of one embodiment of the present invention
  • FIG. 2 depicts a functional block diagram of the communications channels that interconnect a pair of storage controller modules;
  • FIG. 3 depicts a state diagram for a communications module
  • FIG. 4 depicts a series of channel tables that are used during initialization
  • FIG. 5 depicts a series of channel tables that are used during channel failure
  • FIG. 6 depicts a block diagram of the network appliances communicating through a zoned switch.
  • FIG. 7 depicts a state diagram for a lightweight communications module.
  • One embodiment of the invention is a modular, high-performance, highly scalable, highly available, fault tolerant network appliance that is illustratively embodied in a data storage system that uses a redundant channel communications technique to facilitate fault tolerant communications between appliances.
  • FIG. 1 depicts a data processing system 50 comprising a plurality of client computers 102, 104, and 106, a host network 130, and a storage system 100.
  • the storage system 100 comprises a plurality of network appliances 108 and 110 and a storage pool 112.
  • the plurality of clients comprise one or more of a network attached storage (NAS) client 102, a direct attached storage (DAS) client 104 and a storage area network (SAN) client 106.
  • the plurality of network appliances 108 and 110 comprise a storage controller module A (SCM A) 108 and storage controller module B (SCM B) 110.
  • the storage pool 112 is coupled to the storage controller modules 108, 110 via a fiber channel network 114.
  • One embodiment of the storage pool 112 comprises a pair of storage arrays 116, 118 that are coupled to the fiber channel network 114 via a pair of fiber channel switches 124, 126 and a communications gateway 120, 122.
  • the DAS client directly accesses the storage pool 112 via the fiber channel network 114, while the SAN client accesses the storage pool 112 via both the LAN 130 and the fiber channel network 114.
  • the SAN client 104 communicates via the LAN with the SCMs 108, 110 to request access to the storage pool 112.
  • the SCMs inform the SAN client 104 where in the storage arrays the requested data is located or where the data from the SAN client is to be stored.
  • the SAN client 104 then directly accesses a storage array using the location information provided by the SCMs.
  • the NAS client 106 only communicates with the storage pool 112 via the SCMs 108, 110.
  • a fiber channel network is depicted as one way of connecting the SCMs 108, 110 to the storage pool 112, the connection may be accomplished using any form of data network protocol such as SCSI, HIPPI, SSA and the like.
  • the storage system is a hierarchy of system components that are connected together within the framework established by the system architecture.
  • the major active system level components are:
  • Fiber channel switches, hubs, and gateways
  • the system architecture provides an environment in which each of the storage components that comprise the storage system embodiment of the invention operate and interact to form a cohesive storage system.
  • the architecture is centered around a pair of SCMs 108 and 110 that provide storage management functions.
  • the SCMs are connected to a host network that allows the network community to access the services offered by the SCMs 108, 110.
  • Each SCM 108, 110 is connected to the same set of networks. This allows one SCM to provide the services of the other SCM in the event that one of the SCMs becomes faulty.
  • Each SCM 108, 110 has access to the entire storage pool 112.
  • the storage pool is logically divided by assigning a particular storage device (array 116 or 118) to one of the SCMs 108, 110.
  • a storage device 116 or 118 is only assigned to one SCM 108 or 110 at a time.
  • both SCMs 108, 110 are connected to the entirety of the storage pool 112, the storage devices 116, 118 assigned to a faulted SCM can be accessed by the remaining SCM to provide its services to the network community on behalf of the faulted SCM.
  • the SCMs communicate with one another via the host networks. Since each SCM 108, 110 is connected to the same set of physical networks as the other, they are able to communicate with each other over these same links. These links allow the SCMs to exchange configuration information with each other and synchronize their operation.
  • the host network 130 is the medium through which the storage system communicates with the clients 104 and 106.
  • the SCMs 108, 110 provide network services such as NFS and HTTP to the clients 104, 106 that reside on the host network 130.
  • the host network 130 runs network protocols through which the various services are offered. These may include TCP/IP, UDP/IP, ARP, SNMP, NFS, CIFS, HTTP, NDMP, and the like.
  • front-end interfaces are network ports running file protocols.
  • the front end interfaces are facilitated by execution of communication software.
  • RSCM: remote SCM communications module
  • the back-end interface of each SCM provides channel ports running raw block access protocols.
  • the SCMs 108, 110 accept network requests from the various clients and process them according to the command issued.
  • the main function of the SCM is to act as a network-attached storage (NAS) device. It therefore communicates with the clients using file protocols such as NFSv2, NFSv3, SMB/CIFS, and HTTP.
  • the SCM converts these file protocol requests into logical block requests suitable for use by a direct-attach storage device.
  • the storage array on the back-end is a direct-attach disk array controller with RAID and caching technologies.
  • the storage array accepts the logical block requests issued to a logical volume set and converts them into a set of member disk requests suitable for a disk drive.
  • the redundant SCMs will both be connected to the same set of networks. This allows either of the SCMs to respond to the IP address of the other SCM in the event of failure of one of the SCMs.
  • the SCMs support 10BaseT, 100BaseT, and 1000BaseT.
  • the SCMs may be able to communicate with each other through a dedicated inter-SCM network 132. This optional dedicated connection is at least a 100BaseT Ethernet.
  • the SCMs 108, 110 connect to the storage arrays 116, 118 through parallel differential SCSI (not shown) or a fiber channel network 114. Each SCM 108, 110 may be connected through its own private SCSI connection to one of the ports on the storage array.
  • the storage arrays 116, 118 provide a high availability mechanism for RAID management. Each of the storage arrays provides a logical volume view of the storage to a respective SCM.
  • the SCM does not have to perform any volume management.
  • FIG. 2 depicts an embodiment of the invention having the SCMs 108, 110 coupled to the storage arrays 116, 118 via SCSI connections 200.
  • Each storage array 116, 118 comprises an array controller 202, 204 coupled to a disk array 206, 208.
  • the array controllers 202, 204 support RAID techniques to facilitate redundant, fault tolerant storage of data.
  • the SCMs 108, 110 are connected to both the host network 130 and to array controllers 202, 204. Note that every host network interface card (NIC) 210 connection on one SCM is duplicated on the other. This allows a SCM to assume the IP address of the other on every network in the event of a SCM failure.
  • One of the NICs 212 in each SCM 108, 110 is dedicated for communications between the two SCMs.
  • each SCM 108, 110 is connected to an array controller 202, 204 through its own host SCSI port 214. All volumes in each of the storage arrays 202, 204 are dual-ported through SCSI ports 216 so that access to any volume is available to both SCMs 108, 110.
  • the SCM 108, 110 is based on a general purpose computer (PC) such as a ProLiant 1850R manufactured by COMPAQ Computer Corporation. This product is a Pentium PC platform mounted in a 3U 19" rack-mount enclosure.
  • the SCM comprises a plurality of network interface controllers 210, 212, a central processing unit (CPU) 218, a memory unit 220, support circuits 222 and SCSI ports 214. Communication amongst the SCM components is supported by a PCI bus 224.
  • the SCM employs, as a support circuit 222, dual hot-pluggable power supplies with separate AC power connections and contains three fans (one fan resides in each of the two power supplies).
  • the SCM is, for example, based on the Pentium III architecture running at 600 MHz and beyond.
  • the PC has 4 horizontal mount 32-bit 33 MHz PCI slots.
  • the PC comes equipped with 128 MB of 100 MHz SDRAM standard and is upgradable to 1 GB.
  • a Symbios 53c8xx series chipset resides on the 1850R motherboard that can be used to access the boot drive.
  • the SCM boots off the internal hard drive (also part of the memory unit 220).
  • the internal drive is, for example, a SCSI drive and provides at least 1 GB of storage.
  • the internal boot device must be able to hold the SCSI executable image, a mountable file system with all the configuration files, HTML, documentation, and the storage administration application. This information may consume anywhere from 20 to 50 MB of disk space.
  • a disk array 116, 118 that can be used with the embodiment of the present invention is the Synchronix 2000 manufactured by ECCS, Inc. of Tinton Falls, New Jersey.
  • the Synchronix 2000 provides disk storage, volume management and RAID capability. These functions may also be provided by the SCM through the use of custom PCI I/O cards.
  • each of the storage arrays 116, 118 uses 4 PCI slots in a 1 host/3 target configuration, so 6 SCSI target channels are available, allowing six Synchronix 2000 units each with thirty 50GB disk drives. As such, the 180 drives provide 9 TB of total storage.
  • Each storage array 116, 118 can utilize RAID techniques through a RAID processor 226 such that data redundancy and disk drive fault tolerance is achieved.
  • the purpose of the RSCM 152, 156 is to support communication between the two SCMs over multiple ethernet interfaces, while providing the following two benefits: 1. Provide fault-tolerance by trying multiple channels and allowing modules to avoid known failed or congested network channels. 2. Provide performance enhancements by load-balancing multiple network channels.
  • When the RSCM 152, 156 starts, the module must build a data structure of logical channel information from available configuration information. The interfaces and IP addresses of the local system are not sufficient for this configuration, as the corresponding interfaces and IP addresses of the remote system are also needed. Additionally, the network interface card (NIC) information includes information on network masks and broadcast addresses. From this information, the RSCM 152, 156 builds a data structure of RSCM logical channels, i.e., an RSCM logical channel table.
  • the RSCM logical channel table has a fixed size equal to the number of network interfaces the system can support, and so some channels may not be configured.
  • FIG. 3 depicts a state diagram that depicts the states of each channel extending from the INIT state. The state transitions 301-311 are identified in FIG. 3 and described in Table I.
  • This transition occurs on initialization of the RSCM for each configured channel.
  • This loop-transition occurs whenever a channel error or timeout occurs in the initializing state 312. Until the system has had time to initialize, the channel state is not known and no channel will be failed due to timeouts.
  • This transition to the good state 313 occurs when any successful connection is made on the channel. This includes connections made by the status monitoring module.
  • This transition to the failed state 314 occurs when a certain number of timeouts or communication errors have occurred on any one channel.
  • the principal logic that allows the system to load balance and provide fault tolerant communication is the capability of rscm_ClientOpenChannel() to try all RSCM Logical Channels in the Initializing or Good state.
  • the RSCM keeps a pointer into the RSCM Logical Channel Table that allows rscm_ClientOpenChannel() to implement a round-robin algorithm over all RSCM Logical channels. If all Initializing or Good channels fail to make a connection, any configured failed channels are tried next. If this also fails, then rscm_ClientOpenChannel() returns -1 to indicate an error.
  • FIG. 4 depicts the progression of channel table entries as communications is established between the SCMs.
  • the channel table 400 has eight entries 0-7, and there are four configured channels 0-3. After initialization, the four configured channels 0-3 are initializing, but have not yet successfully connected. The channel to try next is channel 0.
  • the system tries channel 0 (as shown by pointer 406) and successfully connects. Channel 0 is now "Good", i.e., in the good state.
  • Table 402 represents the current state of the system. If all channels connect and work, they will all be Good after the first four connections are made.
  • the table 404 represents the state of the system with all four configured channels operating.
  • the channel table 500 has eight entries 0-7 and there are three configured channels 0-2. All of the channels 0-2 are in the Good state, but a network switch 600 has just failed. The failure affects all the channels except one (channel 2) because the switch is zoned to create two separate LANs (zone 1 602 and zone 2 604).
  • Three separate tasks are attempting to make a connection using the failed switch: (a) a shared file manager (task 1), (b) a status monitor fault analyzer (task 2), and (c) a persistent shared object manager (task 3) using the RSCM when the failure first affects the system.
  • tasks and others that use the RSCM are described in U.S. patent application serial number , filed simultaneously herewith (Attorney Docket ECCS 005), which is incorporated herein by reference. For simplicity, this example assumes there are no connections when the failure first affects the system and the channel pointer 406 points to logical channel 0 at the failure.
  • the three connecting tasks (tasks 1, 2, 3) may all try to connect at the same time, but each will try to connect through a different channel 0, 1, 2.
  • a "light weight" version of the RSCM may be used in conjunction with the RSCM to provide procedure calls to a remotely located SCM.
  • This adjunct module is referred to herein as a remote SCM Light Weight Procedure Call (RSCMLWPC) module.
  • the purpose of the RSCMLWPC module is to provide a transaction oriented module that uses the RSCM, so that one failed transaction initiated on the local system may be successfully retried. Furthermore, each connection may be handled in a separate light weight process, so that one remote transaction may block waiting for another to complete, and each may use normal, single- system synchronization calls.
  • the unit of transaction is the remote procedure call.
  • the remote procedure call mechanism is not as elaborate as that used for single channel, remote procedure call techniques of the prior art.
  • the remote system calls rscmlwpc_svcrun() to start the service for a particular port (similar to rscm_ServerOpenChannel()). This starts a task to accept concurrent connections to the procedure call service. The service will stop if rscmlwpc_svcstop() is called.
  • the local system opens a socket connection to the server using the rscmlwpc_create() call. Thereafter, a remote function call can be made on that socket by calling rscmlwpc_call().
  • the task running the service calls the appropriate function using appropriate arguments once it has received all the data. Even a zero byte reply via rscmlwpc_reply() results in acknowledgement data being returned. An error occurs if the call is made with no reply.
  • rscmlwpc_call() blocks until the reply is returned.
  • rscmlwpc_svcrun() also has some stringent requirements on implementation. Each program call executed must be executed in a separate thread of execution. This is due to the way that the procedure call library will be used.
  • TOKEN_ID := UINT32 # A unique identifier for each client
  • the service function has been called, and a reply has been sent (the Served state). Respond with that same reply, but do not call the client function.
  • the RSCMLWPC module provides a process for initiating communications between SCMs and supporting the communications using procedure calls on multiple communications channels.

Abstract

A method and apparatus for performing fault-tolerant network computing using redundant communications modules. The apparatus comprises a pair of network appliances coupled to a network. The appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance. Each appliance monitors the status of another appliance using multiple, redundant communication channels that are formed using a plurality of network interface cards.

Description

METHOD AND APPARATUS FOR PROVIDING FAULT TOLERANT COMMUNICATIONS BETWEEN NETWORK APPLIANCES
BACKGROUND OF THE DISCLOSURE
1. Field of the Invention
The invention relates to network appliances and, more particularly, the invention relates to a method and apparatus for providing fault tolerant communications between network appliances.
2. Description of the Background Art
Data processing and storage systems that are connected to a network to perform task specific operations are known as network appliances. Network appliances may include a general purpose computer that executes particular software to perform a specific network task, such as file server services, domain name services, data storage services, and the like. Because these network appliances have become important to the day-to-day operation of a network, the appliances are generally required to be fault-tolerant.
Typically, fault tolerance is accomplished by using redundant appliances, such that, if one appliance becomes disabled, another appliance takes over its duties on the network. However, the process for transferring operations from one appliance to another leads to a loss of network information. For instance, if a pair of redundant data storage units are operating on a network and one unit fails, the second unit needs to immediately perform the duties of the failed unit. However, the delay in transitioning from one storage unit to another may cause a loss of some data.
One factor in performing a rapid transition between appliances and rapid recovery from a failure is to enable each redundant appliance to effectively communicate with another redundant appliance. The communications network amongst appliances ensures that the appliances have knowledge of the present configuration information of other appliances that are connected to the network. Such communications are accomplished through a single link that informs another appliance of a catastrophic failure of a given appliance. The communication is provided by using a remote procedure call (RPC) technique that is supported by a number of software manufacturers. Such notification causes the other appliance to take over the network functions that were provided by the failed appliance. However, such a single link is prone to false failure notifications and limited diagnostic information transfer. For example, if the single link between appliances is severed, the system may believe the appliance has failed when it has not.
Therefore, a need exists in the art for a method and apparatus for providing robust, fault tolerant communications between fault tolerant network appliances.
SUMMARY OF THE INVENTION
The disadvantages associated with the prior art are overcome by the present invention of a method and apparatus for performing fault-tolerant network computing using redundant communications modules that communicate with one another through a plurality of communications paths. The apparatus comprises a pair of network appliances coupled to a network. The appliances interact with one another to detect a failure in one appliance and instantly transition operations from the failed appliance to a functional appliance. Each appliance monitors the status and present configuration of at least one other appliance using multiple, redundant communication channels. The communication channels are formed using a plurality of network interface cards.
In one embodiment of the invention, the apparatus comprises a pair of storage controller modules (SCM) that are coupled to a storage pool, i.e., one or more data storage arrays. The storage controller modules are coupled to a host network (or local area network (LAN)). The network comprises a plurality of client computers that are interconnected by the network. Each SCM comprises a status message generator and a status message monitor. The status message generators produce periodic status messages (referred to as heartbeat messages) on multiple communications channels. The status message monitors monitor all the communications channels and analyze status messages to detect failed communications channels. Upon detecting a failed channel, the monitor executes a fault analyzer to determine the cause of a fault and a remedy.
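The heartbeat mechanism above is described only at a functional level; the C sketch below shows one plausible shape for such a status message generator and monitor. The constants, the message format, the callback signatures, and the function names (heartbeat_generate, heartbeat_note_arrival, heartbeat_scan) are illustrative assumptions, not details taken from the patent.

```c
/* Hedged sketch of a per-channel heartbeat generator and monitor.
 * All names and constants here are assumptions for illustration only. */
#include <stddef.h>
#include <time.h>

#define MAX_CHANNELS        8   /* assumed channel table size            */
#define HEARTBEAT_TIMEOUT_S 5   /* assumed silence threshold per channel */

struct channel_status {
    int    configured;
    time_t last_heard;          /* time the last heartbeat arrived */
};

static struct channel_status chan[MAX_CHANNELS];

/* Status message generator: emit one heartbeat on every configured channel. */
void heartbeat_generate(void (*send_on_channel)(int idx, const void *msg, size_t len))
{
    static const char msg[] = "HEARTBEAT";
    for (int i = 0; i < MAX_CHANNELS; i++)
        if (chan[i].configured)
            send_on_channel(i, msg, sizeof(msg));
}

/* Status message monitor: record each arrival on a channel... */
void heartbeat_note_arrival(int idx)
{
    chan[idx].last_heard = time(NULL);
}

/* ...and periodically hand channels that have gone silent to a fault
 * analyzer, which determines the cause of the fault and a remedy. */
int heartbeat_scan(void (*fault_analyzer)(int idx))
{
    int failed = 0;
    time_t now = time(NULL);
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (chan[i].configured && now - chan[i].last_heard > HEARTBEAT_TIMEOUT_S) {
            fault_analyzer(i);  /* channel is presumed failed */
            failed++;
        }
    }
    return failed;
}
```

In such a sketch the generator would be driven from a periodic timer and the monitor from the receive path of each channel, so that a single failed link never silences the status exchange entirely.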
The communications module facilitates communication on a plurality of independent logical channels to achieve synchronization of configuration information across the network appliances. The module uses remote procedure calls on multiple channels to create a redundant fault tolerant communications protocol. When a channel fails, the module rapidly reconnects the channel (if possible) or identifies the fault to the fault analyzer.
BRIEF DESCRIPTION OF THE DRAWINGS
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
FIG. 1 depicts a block diagram of one embodiment of the present invention;
FIG. 2 depicts a functional block diagram of the communications channels that interconnect a pair of storage controller modules;
FIG. 3 depicts a state diagram for a communications module;
FIG. 4 depicts a series of channel tables that are used during initialization;
FIG. 5 depicts a series of channel tables that are used during channel failure;
FIG. 6 depicts a block diagram of the network appliances communicating through a zoned switch; and
FIG. 7 depicts a state diagram for a lightweight communications module.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION
One embodiment of the invention is a modular, high-performance, highly scalable, highly available, fault tolerant network appliance that is illustratively embodied in a data storage system that uses a redundant channel communications technique to facilitate fault tolerant communications between appliances.
FIG. 1 depicts a data processing system 50 comprising a plurality of client computers 102, 104, and 106, a host network 130, and a storage system 100. The storage system 100 comprises a plurality of network appliances 108 and 110 and a storage pool 112. The plurality of clients comprise one or more of a network attached storage (NAS) client 102, a direct attached storage (DAS) client 104 and a storage area network (SAN) client 106. The plurality of network appliances 108 and 110 comprise a storage controller module A (SCM A) 108 and storage controller module B (SCM B) 110. The storage pool 112 is coupled to the storage controller modules 108, 110 via a fiber channel network 114. One embodiment of the storage pool 112 comprises a pair of storage arrays 116, 118 that are coupled to the fiber channel network 114 via a pair of fiber channel switches 124, 126 and a communications gateway 120, 122. A tape library 128 is also provided for storage backup. In storage system 100, the DAS client directly accesses the storage pool 112 via the fiber channel network 114, while the SAN client accesses the storage pool 112 via both the LAN 130 and the fiber channel network 114. For example, the SAN client 104 communicates via the LAN with the SCMs 108, 110 to request access to the storage pool 112. The SCMs inform the SAN client 104 where in the storage arrays the requested data is located or where the data from the SAN client is to be stored. The SAN client 104 then directly accesses a storage array using the location information provided by the SCMs. The NAS client 106 only communicates with the storage pool 112 via the SCMs 108, 110. Although a fiber channel network is depicted as one way of connecting the SCMs 108, 110 to the storage pool 112, the connection may be accomplished using any form of data network protocol such as SCSI, HIPPI, SSA and the like.
The storage system is a hierarchy of system components that are connected together within the framework established by the system architecture. The major active system level components are:
SCM - Storage Controller Module
SDM - Storage Device Module (Storage Pool)
Fiber channel switches, hubs, and gateways
The system architecture provides an environment in which each of the storage components that comprise the storage system embodiment of the invention operate and interact to form a cohesive storage system.
The architecture is centered around a pair of SCMs 108 and 110 that provide storage management functions. The SCMs are connected to a host network that allows the network community to access the services offered by the SCMs 108, 110. Each SCM 108, 110 is connected to the same set of networks. This allows one SCM to provide the services of the other SCM in the event that one of the SCMs becomes faulty. Each SCM 108, 110 has access to the entire storage pool 112. The storage pool is logically divided by assigning a particular storage device (array 116 or 118) to one of the SCMs 108, 110. A storage device 116 or 118 is only assigned to one SCM 108 or 110 at a time. Since both SCMs 108, 110 are connected to the entirety of the storage pool 112, the storage devices 116, 118 assigned to a faulted SCM can be accessed by the remaining SCM to provide its services to the network community on behalf of the faulted SCM. The SCMs communicate with one another via the host networks. Since each SCM 108, 110 is connected to the same set of physical networks as the other, they are able to communicate with each other over these same links. These links allow the SCMs to exchange configuration information with each other and synchronize their operation.
The host network 130 is the medium through which the storage system communicates with the clients 104 and 106. The SCMs 108, 110 provide network services such as NFS and HTTP to the clients 104, 106 that reside on the host network 130. The host network 130 runs network protocols through which the various services are offered. These may include TCP/IP, UDP/IP, ARP, SNMP, NFS, CIFS, HTTP, NDMP, and the like.
From an SCM point of view, its front-end interfaces are network ports running file protocols. The front end interfaces are facilitated by execution of communication software. The software that facilitates configuration information synchronization across multiple SCMs is referred to as the remote SCM communications module (RSCM) 152, 156 that is stored in the memory 150, 154 of each SCM 108, 110. The back-end interface of each SCM provides channel ports running raw block access protocols.
The SCMs 108, 110 accept network requests from the various clients and process them according to the command issued. The main function of the SCM is to act as a network-attached storage (NAS) device. It therefore communicates with the clients using file protocols such as NFSv2, NFSv3, SMB/CIFS, and HTTP. The SCM converts these file protocol requests into logical block requests suitable for use by a direct-attach storage device.
The storage array on the back-end is a direct-attach disk array controller with RAID and caching technologies. The storage array accepts the logical block requests issued to a logical volume set and converts them into a set of member disk requests suitable for a disk drive.
The redundant SCMs will both be connected to the same set of networks. This allows either of the SCMs to respond to the IP address of the other SCM in the event of failure of one of the SCMs. The SCMs support 10BaseT, 100BaseT, and 1000BaseT. Optionally, the SCMs may be able to communicate with each other through a dedicated inter-SCM network 132. This optional dedicated connection is at least a 100BaseT Ethernet.
The SCMs 108, 110 connect to the storage arrays 116, 118 through parallel differential SCSI (not shown) or a fiber channel network 114. Each SCM 108, 110 may be connected through its own private SCSI connection to one of the ports on the storage array.
The storage arrays 116, 118 provide a high availability mechanism for RAID management. Each of the storage arrays provides a logical volume view of the storage to a respective SCM. The SCM does not have to perform any volume management.
The illustrative storage system summarily described above is described in detail in U.S. patent application serial number , filed simultaneously herewith (Attorney docket ECCS 005), which is hereby incorporated herein by reference.
FIG. 2 depicts an embodiment of the invention having the SCMs 108, 110 coupled to the storage arrays 116, 118 via SCSI connections 200. Each storage array 116, 118 comprises an array controller 202, 204 coupled to a disk array 206, 208. The array controllers 202, 204 support RAID techniques to facilitate redundant, fault tolerant storage of data. The SCMs 108, 110 are connected to both the host network 130 and to array controllers 202, 204. Note that every host network interface card (NIC) 210 connection on one SCM is duplicated on the other. This allows a SCM to assume the IP address of the other on every network in the event of a SCM failure. One of the NICs 212 in each SCM 108, 110 is dedicated for communications between the two SCMs.
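The patent states that a surviving SCM assumes the IP address of its failed peer but does not spell out the mechanism. The sketch below shows one conventional way this is done on a Unix-like system: the peer's address is added as an alias on a local interface, after which a gratuitous ARP would normally be broadcast. The interface label, the address handling, and the function name are hypothetical illustrations, not the patent's implementation.

```c
/* Illustrative IP takeover: bring up the failed peer's address as an alias
 * on a local NIC so this SCM can answer requests addressed to the peer.
 * Assumes a Linux-style alias interface such as "eth0:1"; requires root. */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int takeover_peer_ip(const char *alias_ifname, const char *peer_ip)
{
    struct ifreq ifr;
    struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, alias_ifname, IFNAMSIZ - 1);   /* e.g. "eth0:1" */
    sin->sin_family = AF_INET;
    if (inet_pton(AF_INET, peer_ip, &sin->sin_addr) != 1) {
        close(fd);
        return -1;
    }

    if (ioctl(fd, SIOCSIFADDR, &ifr) < 0) {   /* assign the peer's address */
        close(fd);
        return -1;
    }
    /* A gratuitous ARP would typically be sent next so that switches and
     * clients map the taken-over address to this SCM's MAC address. */
    close(fd);
    return 0;
}
```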
On the target channel side of the SCM, each SCM 108, 110 is connected to an array controller 202, 204 through its own host SCSI port 214. All volumes in each of the storage arrays 202, 204 are dual-ported through SCSI ports 216 so that access to any volume is available to both SCMs 108, 110.
The SCM 108, 110 is based on a general purpose computer (PC) such as a ProLiant 1850R manufactured by COMPAQ Computer Corporation. This product is a Pentium PC platform mounted in a 3U 19" rack-mount enclosure. The SCM comprises a plurality of network interface controllers 210, 212, a central processing unit (CPU) 218, a memory unit 220, support circuits 222 and SCSI ports 214. Communication amongst the SCM components is supported by a PCI bus 224. The SCM employs, as a support circuit 222, dual hot-pluggable power supplies with separate AC power connections and contains three fans (one fan resides in each of the two power supplies). The SCM is, for example, based on the Pentium III architecture running at 600 MHz and beyond. The PC has 4 horizontal mount 32-bit 33 MHz PCI slots. As part of the memory (MEM) unit 220, the PC comes equipped with 128 MB of 100 MHz SDRAM standard and is upgradable to 1 GB. A Symbios 53c8xx series chipset resides on the 1850R motherboard that can be used to access the boot drive.
The SCM boots off the internal hard drive (also part of the memory unit 220). The internal drive is, for example, a SCSI drive and provides at least 1 GB of storage. The internal boot device must be able to hold the SCSI executable image, a mountable file system with all the configuration files, HTML, documentation, and the storage administration application. This information may consume anywhere from 20 to 50 MB of disk space. One example of a disk array 116, 118 that can be used with the embodiment of the present invention is the Synchronix 2000 manufactured by ECCS, Inc. of Tinton Falls, New Jersey. The Synchronix 2000 provides disk storage, volume management and RAID capability. These functions may also be provided by the SCM through the use of custom PCI I/O cards.
Depending on the I/O card configuration, multiple Synchronix 2000 units can be employed in this storage system. In one illustrative implementation of the invention, each of the storage arrays 116, 118 uses 4 PCI slots in a 1 host/3 target configuration, so 6 SCSI target channels are available, allowing six Synchronix 2000 units each with thirty 50GB disk drives. As such, the 180 drives provide 9 TB of total storage. Each storage array 116, 118 can utilize RAID techniques through a RAID processor 226 such that data redundancy and disk drive fault tolerance is achieved.
The purpose of the RSCM 152, 156 is to support communication between the two SCMs over multiple ethernet interfaces, while providing the following two benefits: 1. Provide fault-tolerance by trying multiple channels and allowing modules to avoid known failed or congested network channels. 2. Provide performance enhancements by load-balancing multiple network channels.
When the RSCM 152, 156 starts, the module must build a data structure of logical channel information from available configuration information. The interfaces and IP addresses of the local system are not sufficient for this configuration, as the corresponding interfaces and IP addresses of the remote system are also needed. Additionally, the network interface card (NIC) information includes information on network masks and broadcast addresses. From this information, the RSCM 152, 156 builds a data structure of RSCM logical channels, i.e., an RSCM logical channel table. The RSCM logical channel table has a fixed size equal to the number of network interfaces the system can support, and so some channels may not be configured. FIG. 3 depicts a state diagram that depicts the states of each channel extending from the INIT state. The state transitions 301-311 are identified in FIG. 3 and described in Table I.
TABLE I
Transition Description
301 This transition occurs on initialization of the RSCM for each configured channel.
302 This transition occurs on initialization of the RSCM for each non-configured channel.
303 This loop-transition occurs whenever a channel error or timeout occurs in the initializing state 312. Until the system has had time to initialize, the channel state is not known and no channel will be failed due to timeouts.
304 This transition to the good state 313 occurs when any successful connection is made on the channel. This includes connections made by the status monitoring module.
305 This transition to the failed state 314 occurs when a certain number of timeouts or communication errors have occurred on any one channel.
306 This transition occurs when a successful connection is made by the status monitoring module on the channel. While in the failed state 314, the channel will not be used by other modules.
307 This transition occurs when the system transitions to degraded mode after the system controller determines that the remote has failed. After this transition successful communication has failed.
308 This transition occurs when the system transitions to Degraded mode after the HASC
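As a concrete illustration of the logical channel table and the per-channel states of FIG. 3 and Table I, a minimal C sketch might look like the following. The field names, the table layout, and the error counter are assumptions, since the patent describes the table only at a functional level.

```c
/* Sketch of an RSCM logical channel table; names and layout are assumed. */
#include <netinet/in.h>

#define RSCM_MAX_CHANNELS 8          /* fixed size: the most NICs the system supports */

enum rscm_chan_state {
    RSCM_NOT_CONFIGURED,             /* transition 302: no interface/peer address known */
    RSCM_INITIALIZING,               /* transition 301: configured but not yet proven   */
    RSCM_GOOD,                       /* transition 304: a connection has succeeded      */
    RSCM_FAILED                      /* transition 305: repeated errors or timeouts     */
};

struct rscm_channel {
    enum rscm_chan_state state;
    struct in_addr       local_addr;   /* address of the local NIC                    */
    struct in_addr       remote_addr;  /* corresponding address on the remote SCM     */
    struct in_addr       netmask;
    struct in_addr       broadcast;
    unsigned             error_count;  /* consecutive errors/timeouts on this channel */
};

struct rscm_channel_table {
    struct rscm_channel chan[RSCM_MAX_CHANNELS];
    int                 next;          /* round-robin cursor used when opening channels */
};
```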
The primary functional interface for the RSCM is described in TABLE II.
TABLE II
The principal logic that allows the system to load balance and provide fault tolerant communication is the capability of rscm_ClientOpenChannel() to try all RSCM Logical Channels in the Initializing or Good state. The RSCM keeps a pointer into the RSCM Logical Channel Table that allows rscm_ClientOpenChannel() to implement a round-robin algorithm over all RSCM Logical channels. If all Initializing or Good channels fail to make a connection, any configured failed channels are tried next. If this also fails, then rscm_ClientOpenChannel() returns -1 to indicate an error.
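A sketch of that round-robin selection, reusing the channel table structures sketched above, is shown below. rscm_connect() and RSCM_ERROR_LIMIT are hypothetical stand-ins; the patent does not disclose the module's internals at this level of detail.

```c
/* Round-robin channel selection over Initializing/Good channels, falling
 * back to configured-but-Failed channels, as described in the text above. */
#define RSCM_ERROR_LIMIT 3                    /* assumed failure threshold */

int rscm_connect(struct rscm_channel *c);     /* hypothetical: connect over one channel */

int rscm_ClientOpenChannel(struct rscm_channel_table *t)
{
    /* First pass: only channels in the Initializing or Good state. */
    for (int tries = 0; tries < RSCM_MAX_CHANNELS; tries++) {
        int i = (t->next + tries) % RSCM_MAX_CHANNELS;
        struct rscm_channel *c = &t->chan[i];
        if (c->state != RSCM_INITIALIZING && c->state != RSCM_GOOD)
            continue;
        int fd = rscm_connect(c);
        if (fd >= 0) {
            c->state = RSCM_GOOD;                      /* transition 304 */
            c->error_count = 0;
            t->next = (i + 1) % RSCM_MAX_CHANNELS;     /* advance the cursor */
            return fd;
        }
        if (++c->error_count >= RSCM_ERROR_LIMIT)
            c->state = RSCM_FAILED;                    /* transition 305 */
    }

    /* Second pass: any configured channel already marked Failed. */
    for (int i = 0; i < RSCM_MAX_CHANNELS; i++) {
        struct rscm_channel *c = &t->chan[i];
        if (c->state != RSCM_FAILED)
            continue;
        int fd = rscm_connect(c);
        if (fd >= 0) {
            c->state = RSCM_GOOD;                      /* channel recovered */
            return fd;
        }
    }
    return -1;                                         /* no channel could connect */
}
```

In FIG. 4 terms, the cursor starts at channel 0, and each successful connection both returns a socket and promotes the channel to the Good state.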
FIG. 4 depicts the progression of channel table entries as communications is established between the SCMs. The channel table 400 has eight entries 0-7, and there are four configured channels 0-3. After initialization, the four configured channels 0-3 are initializing, but have not yet successfully connected. The channel to try next is channel 0. For the first call to rscm_ClientOpenChannel(), the system tries channel 0 (as shown by pointer 406) and successfully connects. Channel 0 is now "Good", i.e., in the good state. Table 402 represents the current state of the system. If all channels connect and work, they will all be Good after the first four connections are made. These transitions can occur without any specific task checking the goodness before use. The table 404 represents the state of the system with all four configured channels operating.
In another scenario as depicted in FIGs. 5 and 6, the channel table 500 has eight entries 0-7 and there are three configured channels 0-2. All of the channels 0-2 are in the Good state, but a network switch 600 has just failed. The failure affects all the channels except one (channel 2) because the switch is zoned to create two separate LANs (zone 1 602 and zone 2 604).
Three separate tasks are attempting to make a connection using the failed switch: (a) a shared file manager (task 1), (b) a status monitor fault analyzer (task 2), and (c) a persistent shared object manager (task 3) using the RSCM when the failure first affects the system. These tasks and others that use the RSCM are described in U.S. patent application serial number , filed simultaneously herewith (Attorney Docket ECCS 005), which is incorporated herein by reference. For simplicity, this example assumes there are no connections when the failure first affects the system and the channel pointer 406 points to logical channel 0 at the failure. The three connecting tasks (tasks 1, 2, 3) may all try to connect at the same time, but each will try to connect through a different channel 0, 1, 2.
Each time channel 0 or 1 is tried, error counters identify the channel as non-operative. Meanwhile, task 2 registers a loss of status information and invokes a failure analysis as described in U.S. patent application serial number , filed simultaneously herewith (Attorney docket ECCS 006), which is hereby incorporated herein by reference. After some time, channels 0 and 1 will transition to a failed state.
In an alternative embodiment of the invention, a "light weight" version of the RSCM may be used in conjunction with the RSCM to provide procedure calls to a remotely located SCM. This adjunct module is referred to herein as a remote SCM Light Weight Procedure Call
(RSCMLWPC) module. The purpose of the RSCMLWPC module is to provide a transaction oriented module that uses the RSCM, so that one failed transaction initiated on the local system may be successfully retried. Furthermore, each connection may be handled in a separate light weight process, so that one remote transaction may block waiting for another to complete, and each may use normal, single- system synchronization calls.
The unit of transaction is the remote procedure call. The remote procedure call mechanism is not as elaborate as that used for single channel, remote procedure call techniques of the prior art. The remote system calls rscmlwpc_svcrun() to start the service for a particular port (similar to rscm_ServerOpenChannel()). This starts a task to accept concurrent connections to the procedure call service. The service will stop if rscmlwpc_svcstop() is called.
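A sketch of the accept loop implied by rscmlwpc_svcrun() follows, using POSIX sockets and threads. The per-call worker handle_call() and the shutdown handling are illustrative assumptions, and for brevity the loop runs inline rather than in a separately spawned task as the text describes.

```c
/* Hedged sketch of an rscmlwpc-style service: one thread per accepted
 * connection, so one remote call may block waiting on another to complete. */
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile int svc_running;

static void *handle_call(void *arg)      /* hypothetical per-call worker */
{
    int csock = *(int *)arg;
    free(arg);
    /* ... read a HEADER, dispatch the requested function, send the reply ... */
    close(csock);
    return NULL;
}

int rscmlwpc_svcrun(unsigned short port)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (lsock < 0 || bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 8) < 0)
        return -1;

    svc_running = 1;
    while (svc_running) {
        int csock = accept(lsock, NULL, NULL);
        if (csock < 0)
            continue;
        int *arg = malloc(sizeof(*arg));
        if (!arg) {
            close(csock);
            continue;
        }
        *arg = csock;
        pthread_t tid;
        /* Each procedure call runs in its own light weight process (thread). */
        if (pthread_create(&tid, NULL, handle_call, arg) == 0)
            pthread_detach(tid);
        else {
            free(arg);
            close(csock);
        }
    }
    close(lsock);   /* a real stop would also close this to unblock accept() */
    return 0;
}

void rscmlwpc_svcstop(void) { svc_running = 0; }
```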
The local system opens a socket connection to the server using the rscmlwpc_create() call. Thereafter, a remote function call can be made on that socket by calling rscmlwpc_call(). The task running the service calls the appropriate function using appropriate arguments once it has received all the data. Even a zero byte reply via rscmlwpc_reply() results in acknowledgement data being returned. An error occurs if the call is made with no reply. rscmlwpc_call() blocks until the reply is returned. rscmlwpc_svcrun() also has some stringent requirements on implementation. Each program call executed must be executed in a separate thread of execution. This is due to the way that the procedure call library will be used: rscmlwpc_svcstop() must cause the server (the SCM that is initiating the channel) to stop executing. When the server is stopped, the following should be true: (1) all current light weight processes running the server program have completed and exited, (2) all sockets including the listening file descriptor have been closed/shut. On both sides of the connection the HEADER described below is the minimum data that should cause select to mark the socket readable. The following protocol messages shall be used for communication:
1. START CLIENT := Open a socket connecting to the listening file descriptor on the server. Client will call rscmlwpc_create().
2. CLIENT CALL := HEADER DATA
   1. HEADER := MAGIC_NUMBER LENGTH ERROR CALL_ID TOKEN_ID FUNCTION
   2. MAGIC_NUMBER := UINT16 (0x7e1a)
   3. LENGTH := UINT16 # Length of data in the data portion
   4. CALL_ID := UINT32 # A client controlled sequence number that increases with each client call
   5. TOKEN_ID := UINT32 # A unique identifier for each client
   6. FUNCTION := SINT32
   7. DATA := BYTES (declared length) # Variable length data of the length declared
3. SERVER REPLY := HEADER DATA
4. END CLIENT := Close your end of the socket. Use rscmlwpc_destroy()
Fault tolerance is assured as follows. The rscmlwpc_create() routine assigns a unique token to each new client on the local system. Each client has an entry in a client reply cache on the remote system identified by the unique client token. Based on a sequence number incremented for each call on the client system, the remote server system can determine whether a call is a duplicate of a previous call, a duplicate of a call currently in service, or a new call that has not been seen before. FIG. 7 depicts the states 700 that each line in the server system's client reply cache goes through to facilitate communication. Table III describes the events that cause the state transitions to the states of FIG. 7.
TABLE III
When a new call is received on the server, the manner in which the call is satisfied depends on the system state shown in TABLE IV, i.e., the table shows various network states at which failure may occur and the procedure call that is being performed at that network state.
TABLE IV
State in Client Reply Cache    Remote Procedure Call Processing Description
Starting    Satisfy by calling the service function and allow the service function to reply.
Serving    The service function has been called, but has not yet returned. Change the socket for the reply and indicate to the thread serving the request that a duplicate call has been received.
Served    The service function has been called, and a reply has been sent. Respond with that same reply, but do not call the client function.
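Putting the HEADER fields and the client reply cache together, a sketch of the server-side duplicate detection might look like the following. The in-memory header layout, the cache sizing, and the classification helper are assumptions; only the field list and the Starting/Serving/Served behavior come from the text above.

```c
/* Sketch of lightweight-RPC duplicate detection via a client reply cache.
 * Layout, sizes, and helper names are illustrative assumptions. */
#include <stdint.h>

struct lwpc_header {
    uint16_t magic;      /* MAGIC_NUMBER (0x7e1a)                               */
    uint16_t length;     /* LENGTH of the data portion                          */
    uint32_t error;      /* ERROR                                               */
    uint32_t call_id;    /* CALL_ID: client sequence number, increases per call */
    uint32_t token_id;   /* TOKEN_ID: unique identifier assigned per client     */
    int32_t  function;   /* FUNCTION to invoke on the server                    */
};

enum reply_state { SLOT_EMPTY, STARTING, SERVING, SERVED };

struct reply_cache_entry {
    uint32_t         token_id;
    uint32_t         last_call_id;
    enum reply_state state;
    uint16_t         reply_len;
    char             reply[512];     /* assumed bound on a cached reply */
};

#define REPLY_CACHE_SLOTS 32         /* assumed cache size */
static struct reply_cache_entry cache[REPLY_CACHE_SLOTS];

enum { CALL_NEW, CALL_DUP_IN_SERVICE, CALL_DUP_SERVED, CALL_STALE };

/* Classify an incoming call: a new call runs the service function (Starting);
 * a duplicate of a call still in service redirects the reply socket (Serving);
 * a duplicate of an answered call is satisfied from the cached reply (Served). */
int classify_call(const struct lwpc_header *h)
{
    struct reply_cache_entry *e = &cache[h->token_id % REPLY_CACHE_SLOTS];

    if (e->state == SLOT_EMPTY || e->token_id != h->token_id ||
        h->call_id > e->last_call_id)
        return CALL_NEW;
    if (h->call_id < e->last_call_id)
        return CALL_STALE;
    return (e->state == SERVED) ? CALL_DUP_SERVED : CALL_DUP_IN_SERVICE;
}
```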
As such, the RSCMLWPC module provides a process for initiating communications between SCMs and supporting the communications using procedure calls on multiple communications channels.
Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.

Claims

What is claimed is:
1. Apparatus for providing fault tolerant communications between multiple network appliances: a first network appliance comprising a first communications module coupled to a plurality of network interface ports; a second network appliance comprising a second communications module coupled to a plurality of network interface ports; a multiple channel communications network coupled to said network interface ports of the first and second network appliances; where said first and second communications modules provide redundant communications paths between the first and second network appliances.
2. The apparatus of claim 1 wherein said first and second communications modules comprise: means for initializing each network interface port; means for transitioning to a good state upon said initialization being completed; and means for transitioning to a failed state upon said initialization not being completed.
3. The apparatus of claim 1 further comprising means for detecting a channel failure and means for transitioning said communications traffic to an operative network channel.
4. The apparatus of claim 1 further comprising means for providing load balancing of communications traffic across a plurality of network channels.
5. The apparatus of claim 1 further comprising a means for providing fault tolerant procedure calls.
6. The apparatus of claim 1 wherein said first and second communications modules are middleware software.
7. In a system having a plurality of network appliances, a method of communicating between network appliances comprising: initializing a plurality of network interface ports to form a plurality of communications channels between said network appliances; transitioning to a good state upon said initialization being completed for a network interface port; and transitioning to a failed state upon said initialization not being completed for a network interface port.
8. The method of claim 7 further comprising detecting a channel failure in said plurality of channels and transitioning said communications traffic to an operative network channel.
9. The method of claim 7 further comprising providing load balancing of communications traffic across a plurality of network channels.
10. The method of claim 7 further comprising transmitting procedure calls through said plurality of channels.
PCT/US2001/012864 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances WO2001082079A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001257132A AU2001257132A1 (en) 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US55278000A 2000-04-20 2000-04-20
US09/552,780 2000-04-20

Publications (3)

Publication Number Publication Date
WO2001082079A2 true WO2001082079A2 (en) 2001-11-01
WO2001082079A9 WO2001082079A9 (en) 2002-04-11
WO2001082079A3 WO2001082079A3 (en) 2002-07-18

Family

ID=24206775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/012864 WO2001082079A2 (en) 2000-04-20 2001-04-20 Method and apparatus for providing fault tolerant communications between network appliances

Country Status (2)

Country Link
AU (1) AU2001257132A1 (en)
WO (1) WO2001082079A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8670303B2 (en) * 2011-10-05 2014-03-11 Rockwell Automation Technologies, Inc. Multiple-fault-tolerant ethernet network for industrial control

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4692918A (en) * 1984-12-17 1987-09-08 At&T Bell Laboratories Reliable local data network arrangement
US5931916A (en) * 1994-12-09 1999-08-03 British Telecommunications Public Limited Company Method for retransmitting data packet to a destination host by selecting a next network address of the destination host cyclically from an address list
US5918021A (en) * 1996-06-03 1999-06-29 Intel Corporation System and method for dynamic distribution of data packets through multiple channels
EP0942554A2 (en) * 1998-01-27 1999-09-15 Moore Products Co. Network communications system manager

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1525682A1 (en) * 2002-06-28 2005-04-27 Harris Corporation System and method for supporting automatic protection switching between multiple node pairs using common agent architecture
EP1525689A1 (en) * 2002-06-28 2005-04-27 Harris Corporation Software fault tolerance between nodes
EP1525682A4 (en) * 2002-06-28 2006-04-12 Harris Corp System and method for supporting automatic protection switching between multiple node pairs using common agent architecture
EP1525689A4 (en) * 2002-06-28 2009-03-11 Harris Corp Software fault tolerance between nodes
WO2010073414A1 (en) * 2008-12-26 2010-07-01 Hitachi, Ltd. Storage system for optimally controlling a plurality of data transfer paths and method therefor
US8122151B2 (en) 2008-12-26 2012-02-21 Hitachi, Ltd. Storage system for optimally controlling a plurality of data transfer paths and method therefor

Also Published As

Publication number Publication date
WO2001082079A3 (en) 2002-07-18
AU2001257132A1 (en) 2001-11-07
WO2001082079A9 (en) 2002-04-11

Similar Documents

Publication Publication Date Title
US6701449B1 (en) Method and apparatus for monitoring and analyzing network appliance status information
US8443232B1 (en) Automatic clusterwide fail-back
US7380163B2 (en) Apparatus and method for deterministically performing active-active failover of redundant servers in response to a heartbeat link failure
US7565566B2 (en) Network storage appliance with an integrated switch
US8266472B2 (en) Method and system to provide high availability of shared data
US7028218B2 (en) Redundant multi-processor and logical processor configuration for a file server
US7401254B2 (en) Apparatus and method for a server deterministically killing a redundant server integrated within the same network storage appliance chassis
US20030158933A1 (en) Failover clustering based on input/output processors
US6718481B1 (en) Multiple hierarichal/peer domain file server with domain based, cross domain cooperative fault handling mechanisms
US6865157B1 (en) Fault tolerant shared system resource with communications passthrough providing high availability communications
EP1498816B1 (en) System and method for reliable peer communication in a clustered storage system
US20050125557A1 (en) Transaction transfer during a failover of a cluster controller
US8370494B1 (en) System and method for customized I/O fencing for preventing data corruption in computer system clusters
US7627780B2 (en) Apparatus and method for deterministically performing active-active failover of redundant servers in a network storage appliance
US7437423B1 (en) System and method for monitoring cluster partner boot status over a cluster interconnect
US20050108593A1 (en) Cluster failover from physical node to virtual node
US20030018927A1 (en) High-availability cluster virtual server system
US20140173330A1 (en) Split Brain Detection and Recovery System
US20050028028A1 (en) Method for establishing a redundant array controller module in a storage array network
JPH1185644A (en) System switching control method for redundancy system
WO2001082079A2 (en) Method and apparatus for providing fault tolerant communications between network appliances
JP2002344450A (en) High availability processing method, and executing system and processing program thereof
US11372553B1 (en) System and method to increase data center availability using rack-to-rack storage link cable
WO2001082080A2 (en) Network appliance
George et al. Analysis of a Highly Available Cluster Architecture

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/5-5/5, DRAWINGS, REPLACED BY NEW PAGES 1/5-5/5; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP