WO2001079962A2 - Fault-tolerant maintenance bus, protocol, and method for using the same - Google Patents

Info

Publication number
WO2001079962A2
Authority
WO
WIPO (PCT)
Prior art keywords
bridge
command
maintenance bus
string
bus
Prior art date
Application number
PCT/US2001/011804
Other languages
French (fr)
Other versions
WO2001079962A3 (en)
Inventor
A. Charles Suffin
Joseph S. Amato
Paul Joyce
Original Assignee
Stratus Technologies International, S.A.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/548,202 external-priority patent/US6691257B1/en
Priority claimed from US09/548,536 external-priority patent/US6633996B1/en
Application filed by Stratus Technologies International, S.A.R.L. filed Critical Stratus Technologies International, S.A.R.L.
Priority to AU2001251536A priority Critical patent/AU2001251536A1/en
Publication of WO2001079962A2 publication Critical patent/WO2001079962A2/en
Publication of WO2001079962A3 publication Critical patent/WO2001079962A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2002 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F11/2007 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media

Definitions

  • This invention relates to fault-tolerant computer systems and more particularly to a dedicated maintenance bus for use with such computer systems.
  • Fault-tolerant computer systems are employed in situations and environments that demand high reliability and minimal downtime. Such computer systems may be employed in the tracking of financial markets, the control and routing of telecommunications and in other mission-critical functions such as air traffic control.
  • a common technique for incorporating fault-tolerance into a computer system is to provide a degree of redundancy to various components. In other words, important components are often paired with one or more backup components of the same type. As such, two or more components may operate in a so-called lockstep mode in which each component performs the same task at the same time, while only one is typically called upon for delivery of information. Where data collisions, race conditions and other complications may limit the use of lockstep architecture, redundant components may be employed in a failover mode.
  • one component is selected as a primary component that operates under normal circumstances. If a failure in the primary component is detected, then the primary component is bypassed and the secondary (or tertiary) redundant component is brought on line.
  • a variety of initialization and switchover techniques are employed to make a transition from one component to another during runtime of the computer system. A primary goal of these techniques is to minimize downtime and corresponding loss of function and/or data.
  • Fault-tolerant computer systems are often costly to implement since many commercially available components are not specifically designed for use in redundant systems. It is desirable to adapt conventional components and their built-in architecture whenever possible. All modern computer systems have particular capabilities directed to control and monitoring of functions. For example, large microprocessor chips such as the Pentium III™, available from Intel Corporation of Santa Clara, California, are designed to operate within a specific temperature range that is monitored by a commercially available environmental/temperature-sensing chip.
  • One technique for interconnecting such an environmental monitor or other monitoring and control devices is to utilize a dedicated maintenance bus.
  • the maintenance bus is typically separate from the system's main data and control bus structure.
  • the maintenance bus generally connects to a single, centralized point of control, often implemented as a peripheral component interconnect (PCI) device.
  • a maintenance bus architecture that displays a high-degree of fault-tolerance and a protocol for use on this bus.
  • the maintenance bus architecture and associated protocol should be interoperable with commercially available components and should allow a fairly high degree of versatility in terms of monitoring and control of important computer system components.
  • the architecture includes two maintenance buses interconnecting each of a plurality of printed circuit boards, termed "parent" circuit boards.
  • the two maintenance buses are each connected to a pair of system management modules (SMMs) that are configured to perform a variety of maintenance bus activities.
  • SMM can comprise any acceptable device for driving commands according to the protocol on the maintenance bus arrangement.
  • the SMM has general knowledge of the circuit boards and their components.
  • the protocol is formatted to operate in accordance with Philips Semiconductors' I²C maintenance bus standard. Other standards are expressly contemplated.
  • Within each parent board are a pair of redundant bridges both having a unique address.
  • One bridge is connected to the first maintenance bus while a second bridge is connected to the second maintenance bus of the pair.
  • a child maintenance bus interconnects the two bridges through a "child" printed circuit board.
  • the introduction of a separate board to implement the child maintenance bus can be useful, but is not essential according to this invention.
  • the child maintenance bus is itself interconnected with a variety of monitor and control functions on maintenance bus-compatible subsystem components.
  • the SMMs can address components on each child printed circuit board individually and receive appropriate responses therefrom based upon appropriate response identifiers within the protocol. In the event of a bus or bridge failure, the SMM can still communicate with the child subsystem components via the redundant bus and bridge.
  • the protocol includes a unique data packet structure.
  • the command message initiated by the SMM includes a target bridge header, a command byte (wherein a non-zero byte code designates the message as a command rather than a response), the message size and a unique originator tag value.
  • the command message further includes one or more bytes of forwarding data for subordinate bridges on the child bus (leading to and from remote components/circuitry).
  • the command message has a response byte code to direct responses on the return trip through the bridge.
  • the command message also includes one or more bytes of data to identify, and be used by, subsystem components.
  • the command message includes one checksum byte meant to sum up all previous message bytes.
  • a similar message packet is provided by the bridge in response to the command message.
  • the response includes an SMM address byte and a zero-value command byte (indicating a response). Also provided is a byte indicating the overall message size in bytes and the identical tag originally provided in the command packet. The tags allow the SMM to verify that the response is to a particular transmitted command.
  • a one-byte status code field and one-byte error message field are also provided. Unique status codes and error messages are generated by the bridge if a formatted message is incorrect or commanded action was not (or may not have been) taken by the subsystem.
  • One or more bytes of response data delivered from the subsystem component or bridge is also provided in the response message. Finally, a checksum byte is provided for error checking.
  • Command message/data packets are transmitted by the SMM to be received by an appropriate component within a given time frame. If an expected response message/data packet is not returned from the component as expected, the SMM "times-out" and performs various error procedures that may include an alarm condition, system shut-down and/or retransmission of the packet.
  • the bridge can include an interconnection to a further bridge.
  • This remote bridge can, itself, be interconnected to additional microprocessors and associated memory.
  • the remote bridge is addressed through one of the parent board's bridges so the communication to and from the SMM can occur.
  • the forwarding data of the command packet enables the packets to be transferred through these further bridges, while stored response data in each subordinate bridge is used to route the return of a response back to the originating SMM.
  • the SMM can be interconnected with a variety of other computer system peripherals and components, and can be accessed over a local network or through an Internet-based communication network.
  • Fig. 1 is a block diagram showing an overview of a fault-tolerant maintenance bus architecture utilizing the maintenance bus protocol according to this invention
  • Fig. 2 is a more detailed block diagram showing one parent and child printed circuit board implementing a fault-tolerant maintenance bus according to this invention
  • Fig. 3 is the board of Fig. 2 including a bridge for accessing a remote microprocessor board according to an alternate embodiment
  • Fig. 4 is a block diagram of the protocol's command message/data packet according to an embodiment of this invention
  • Fig. 5 is a block diagram of the protocol's response message/data packet according to an embodiment of this invention.
  • Fig. 1 details a fault-tolerant maintenance bus architecture adapted to use a protocol according to a preferred embodiment of this invention. Before discussing the protocol in detail, the underlying maintenance bus architecture will be explained.
  • a pair of parent maintenance buses MBA and MBB are shown. These maintenance buses are identical in architecture and can be implemented as a combination of cables, circuitry and circuit board traces.
  • the buses MBA and MBB interconnect with a plurality of input/output (I/O) slots and pin locations within a cabinet that may contain a plurality of circuit boards.
  • the parent maintenance buses MBA and MBB can also jump between cabinets in a larger computer system. It is generally contemplated that the buses are implemented in a multi-cabinet fault-tolerant server system, but the architecture according to this invention can be utilized in a variety of fault-tolerant computing configurations.
  • the buses MBA and MBB are each two-wire buses designed to take advantage of integrated circuit components utilizing the I²C bus standard.
  • the I²C bus is a proprietary design of Philips Semiconductors of the Netherlands. This standard has become widely adopted for consumer electronics and various circuit applications, and is now supported by a large number of commercially available monitoring and control devices. Details on the use of the I²C bus can be found in The I²C-bus and how to use it (including specifications), April 1995 update, Chapter 3 by Philips Semiconductors. Typically, the bus is clocked at a speed of approximately 10 Kbytes/sec. While I²C is employed as the bus standard according to a preferred embodiment of this invention, it is expressly contemplated that other maintenance bus standards can be utilized according to the teachings of this invention with appropriate modifications.
  • the parent buses MBA and MBB are amplified to generate a signal at 5V x 30mA.
  • a variety of bus amplification circuits can be used.
  • commercially available hardware bus extenders are employed.
  • the amplified bus operates at a gain that is ten times the normal operating range for an I²C bus (5V x 3mA). As described further below, this difference in operating level between the parent bus and various circuit components is compensated-for (on both sides) using the extender hardware.
  • circuit board assemblies 102, 104 and 106 are shown. Each of these circuit board assemblies is interconnected with the bus pair (MBA and MBB). These board assemblies can represent a variety of computer system components. For example, the boards can together comprise a set of redundant identical boards or a set of separate functions including a central processing unit (CPU) board, "front panel" board and input/output (I/O) board. Each board assembly 102, 104 and 106 is defined functionally as a parent printed circuit board 112, 114 and 116 and an associated child printed circuit board 122, 124 and 126. As discussed above, while a separate board to implement the child maintenance bus can be useful according to an embodiment of this invention, it is not required.
  • the division between the parent board and child board is somewhat arbitrary, and the actual physical structure for one or more boards can be implemented as a single plug-in printed circuit card residing in a connector socket or slot on a larger cabinet-based motherboard.
  • SMMs redundant system management modules
  • Each SMM is a microprocessor-based component.
  • the SMMs 128 and 130 each reside on a PCI bus 132 and 134.
  • the SMM performs a variety of functions and includes both Ethernet and modem capabilities allowing it to interconnect with the computer operating system and other network communication structures (block 136).
  • the SMM may also include other unrelated system components such as a video driver chip.
  • the SMM is particularly based around a Motorola PowerPC™ 860T microprocessor utilizing the VxWorks real-time operating system available from Wind River Systems, Inc. of Alameda, California.
  • system management module (or SMM) is defined broadly to include any acceptable device for driving commands on the maintenance bus arrangement. While a microcontroller described above is used in a preferred embodiment, the SMM can be an application specific integrated circuit (ASIC), a programmable logic array, a microprocessor unit or any other command originator interconnected with the maintenance bus arrangement.
  • command module is also used to define the SMM in its various possible embodiments.
  • Each SMM includes a pair of I²C buses 138 and 140. Each pair is connected with a respective bus from the parent bus pair MBA and MBB.
  • the SMMs are configured to operate in failover mode. In other words, SMMA operates under normal circumstances. In this mode SMMB monitors and communicates with SMMA over the shared I²C bus, ready to take over for SMMA if it fails. SMMB is otherwise largely idle during normal run time, and takes over operation only if a failure is detected.
  • the function of the SMMs is described in further detail below.
  • the SMMs carry information about components on each of the board assemblies 102, 104 and 106. The SMMs use this information to monitor and generally control the board assemblies. This information may be transferred to other parts of the computer system and over a network via the PCI bus.
  • Each parent board 112, 114 and 116 includes various data processing, display and communication capabilities in accordance with its purpose.
  • Each board 112, 114 and 116 also includes a respective CPU (CPU1, CPU2 and CPU3) 152, 154 and 156, respectively.
  • Each CPU can comprise an Intel Pentium III™, Xeon™ or any other acceptable microprocessor having I²C or equivalent maintenance bus architecture.
  • Each board 112, 114 and 116 is interconnected with the parent buses MBA and MBB at various interconnection points, where appropriate. Since the parent bus is amplified by a gain of approximately ten times the normal I²C operating level, interconnections with the parent buses can be made via bus extenders to be described further below.
  • Each parent board 112, 114 and 116 also includes a pair of interconnections 160 and 162 with each of the respective parent maintenance buses MBA and MBB.
  • the interconnections 160 and 162 link to respective bridges 192 and 194.
  • These bridges interconnect with respective child maintenance buses CB1, CB2 and CB3 to interconnect child board components.
  • On each child board 122, 124 and 126 reside various control and monitoring subsystem components 172, 174 and 176, respectively.
  • the subsystem components are described in further detail below. These components are each in communication with the maintenance bus using the preferred I²C standard.
  • the interconnection between each parent board and child board occurs via a pair of bridges 192 and 194.
  • Each bridge is essentially identical in architecture, the two bridges 192 and 194 of a pair having the same address for communication with the SMMs.
  • the address of the bridge pair on each board differs so that the SMM can uniquely address a specific board. Addresses are established based upon the pin and socket arrangement for the respective bridge. It is useful to assign the same address to both bridges 192, 194 in the pair since they each reside on a different bus (MBA or MBB).
  • the SMM utilizes only one of the two bridges on a parent board to accomplish a task.
  • the other, unused bridge in the pair can be used if the SMM cannot complete the transaction with the original bridge.
  • the SMM uses the other parent maintenance bus to access the other, previously unused bridge.
  • bus extender hardware is employed.
  • the bus extender hardware is available from Philips Semiconductors under part number 82B715. Using amplified parent buses, approximately thirty or more loads can be carried.
  • the extender acts as a buffer for signals traversing the extender hardware providing the necessary amplification and deamplification.
  • Extender components 196 are provided between the parent buses MBA and MBB and corresponding bridge interconnections 160 and 162. While not shown, interconnections 160 and 162 can also include appropriate series resistors and FET triggers in line with the extender components 196 in accordance with the 82B715 hardware manufacturer's data sheet.
  • the bridges 192 and 194 each act as store-forward devices in the transfer of I²C signals into and out of the child board subsystem. In other words, the bridges receive packetized signals from the SMMs and transfer them to appropriate I²C-compatible maintenance bus ports on subsystem components. Likewise, the bridges receive signals from subsystem components and transfer them back to the SMMs.
  • two bridges 192 and 194 are employed, each communicating with one of the dual parent buses MBA and MBB. In this manner, the failure of a single bridge or parent bus does not cause a loss of connection between the subsystem components and SMMs. This is because each child bus CB1, CB2 and CB3 is interconnected with both bridges simultaneously.
  • a reset connection (R1, R2 and R3) and power connection (P1, P2 and P3) extend from each bridge in the pair.
  • a reset and/or power command from an SMM to the active bridge in the pair is used to power-up or reset the underlying board assembly.
  • the SMMs are configured to provide independent reset and power commands to the bridges 192 and 194 to allow powering and reset of each underlying board through the maintenance bus arrangement.
  • the active bridge performs power-up.
  • the bridges are configured to handshake, or otherwise communicate, to ensure that the board hardware is functioning properly before power-up occurs generally within the board.
  • each bridge 192, 194 comprises a commercially available Intel 87C54 microcontroller.
  • This circuit package includes a built-in programmable storage device (an erasable programmable read-only memory EPROM) and 256 bytes of random access memory (RAM). This package is relatively low-cost and complete. Data traveling over the I²C bus is buffered in the RAM while basic routing and power control functions are preprogrammed into the bridge microcontroller EPROM. Though the 87C54 is the preferred embodiment, any microcontroller with sufficient I/O ports to drive both parent and child maintenance buses could instantiate the bridges 192 and 194.
  • the exemplary parent board assembly 102 is shown in further detail. Particularly, the subsystem components 172 interconnected with the I²C bus are illustrated.
  • EEPROM electrically-erasable programmable read only memory
  • ID board identification
  • a light-emitting diode (LED) monitor 204 is provided. This LED provides a visible indication of the status of the board for an operator of the board.
  • an environmental monitor chip 206 having I²C compatibility is provided. This chip typically monitors temperature and other important functions and transmits appropriate data and/or alarms regarding environment.
  • Microprocessor information for CPU1 152 is also interconnected with the bus CB1 via an I²C interconnection.
  • the CPU support information 208 is transmitted over the I²C bus, as well as other important status data.
  • I²C interconnections with the dual inline memory module sockets (DIMMS) 210 of the board assembly are also provided by the child bus CB1.
  • other I/O ports 212 with I²C capabilities may be serviced by the child bus CB1.
  • each bridge and subsystem component contains its own unique address on the maintenance bus that makes it identifiable by the SMMs.
  • the SMMs have knowledge of the subsystem components on each board. Packets sent to and from the SMM have the bridge identification and the data within the packet is used to identify the particular subsystem device.
  • a variety of protocols and communication techniques can be used according to this invention.
  • I²C connections have operated using a highly simplified communication scheme without the benefit of addressing and protocol techniques.
  • a command message/data packet structure is shown schematically.
  • a command message/data packet is shown. While the illustrated message 400 is a command message initiated by the SMM, the packet structure to be described is generally a two-way message structure in which command versus response messages are differentiated by a unique command designator/identifier byte within the message header.
  • the command data packet 400 which is transferred between the SMM and the various subsystem components, includes an address header 402.
  • This header is typically limited to one byte of information.
  • the architecture is arranged so that one byte (six address bits, one parity bit and one read/write bit) is sufficient to direct the packet to an appropriate bridge and corresponding subcomponent.
  • the address specifies the target bridge through which data is transferred. As described further below, the final delivery of a command to a specific subcomponent is facilitated by the data bytes of the packet.
  • a command byte 404 is provided. Specifically, if the transferred packet is a command packet from the SMM, the command byte is enabled with a specific recognized command byte code.
  • the command bytes direct a particular subsystem component to perform a particular action, cause a bridge to perform a power or reset function, or cause a subordinate bridge to perform a forwarding operation (described further below).
  • the overall message size is indicated as a number of bytes by the one-byte string.
  • a tag byte 408 is provided to the header packet.
  • This tag byte is generated by the originator (the SMM) of the data packet. It is a unique one-byte number.
  • the tag byte is repeated in a response message, as described below.
  • the appropriate response sent by the subsystem component should include the same tag byte, indicating that the message was received properly and acted upon. If the tag byte is not received, then an error has occurred and appropriate action is taken by the SMM.
  • forwarding data 410 enables a hierarchy of bridge structures to be established within the child bus.
  • Fig. 3 which again illustrates the exemplary board assembly 102.
  • the subsystem 172 of this board includes the set of subsystem components described above with reference to Fig. 2.
  • another bridge 302 is interconnected to the child bus CB1.
  • This bridge is similar in configuration to the bridges 192 and 194 and can be constructed from the same type of microcontroller circuit.
  • the bridge 302 includes another discrete address that is recognized by the SMM so that data is transferred via the bridges 192, 194 to the subordinate bridge 302 as if it were travelling to any other subsystem component.
  • the packet structure according to this embodiment enables a large number of remote components to be accessed notwithstanding the relatively small (one-byte) command address 402.
  • the subordinate bridge 302 stores and forwards the message to the I²C-compatible ports on further computer circuitry 304.
  • the CPU information block 208 is connected through the subordinate bridge 302 according to Fig. 3.
  • the processor information is located behind the child bridge, accounting for the depicted arrangement.
  • the circuitry 304 includes another microprocessor (such as an Intel Xeon™) and/or associated memory and other peripherals.
  • the above-described protocol enables messages to be transferred from the child bus through bridges to additional, subordinate bridges (such as bridge 302). Further components, such as circuitry 304, can be accessed through these subordinate bridges.
  • the subordinate bridge acts as a hierarchy to access the remote microprocessor circuitry 304 given the limited addressing available for the protocol. In this sense, the circuitry 304 is invisible to this child bus CB1.
  • a series of subordinate bridges can be chained in series. Each subordinate bridge uses response data in the command message to route a received response back up the hierarchy of bridges to the originating SMM.
  • Each bridge in the bridge hierarchy stores knowledge of the message transferred therefrom.
  • the response field 412 is stripped and stored by each bridge along the pathway, and the bridge's own field is substituted. This enables responses to command messages to return to their source (SMM) after passing back from the subordinate location through the bridge system.
  • the structure includes zero or more bytes of data 414, following the response byte field 412.
  • data is provided as part of a particular transaction under the I²C protocol. This data is subcomponent-specific.
  • the data provides the identification of a particular component on the child bridge, or in a remote, subordinate bridge hierarchy (such as the circuitry 304).
  • the command packet 400 includes a checksum comprising a one-byte number.
  • the checksum indicates the sum of previous bytes in the message. If either the bridge, or an SMM discovers an incorrect checksum byte in a message, then the message is discarded.
  • An erroneous message can trigger a variety of actions, including resending of the message, an alarm condition or shut down as appropriate.
  • at least two error modes for the SMM exist: (1) where it receives no response following the transmission of a message, resulting in "time-out" state; and (2) an error state in which a message is returned that is properly formatted, but is not understood by the bridge or the SMM as applicable.
  • One possible corrective action is to retry sending of the messages for a predetermined number of times.
  • the response message/data packet 500 is shown in Fig. 5. It includes a one-byte address 502 designating the SMM. Where the command packet contains a command byte (404), the response packet contains a reserve value 504, generally equal to zero. This, in fact, indicates that a response is being transferred over the maintenance bus.
  • the response packet 500 also includes an overall message size 506 similar to the command message size 406.
  • the tag 508 of the packet 500 is the original tag (408) in the command message, now being returned by subsystem components.
  • a one-byte status code 510 is an indicator that a problem may exist, but the system does not have confirmation. For example, power may have been turned on, but no response is given. Different potential problems may be indicated by different status codes. In general, the status code is appended to the response by the active bridge. If the command is, however, misunderstood or improperly formatted, then an appropriate error code 512 is generated. The absence of an error is generally indicated by a zero value. Conversely, the specific errors are indicated by particular byte codes. Exemplary error codes are listed in the following table:
  • This signal is used by the reporting bridge to inform its peer subsystem bridge of the action it should take.
  • Several commands require the cooperation of both bridges in an I²C subsystem in order to carry out an action, and source signals are utilized in conjunction with a cross-interrupt mechanism to synchronize the bridges. Source signals are routed between two subsystem bridges only.
  • the stuck high error should be impossible given the bridge's microcontroller hardware, but the firmware tests the signal for this condition as a sanity check. This error would indicate a fault in the
  • the response packet 500 next includes zero or more bytes of response data 514.
  • This data can comprise environmental telemetry or a variety of other required response data from the selected subsystem components.
  • a checksum byte 516 is provided. This checksum again indicates the sum of the previous bytes in the response message. If the SMM or a bridge discovers an incorrect checksum, then the response is discarded and the command-response cycle is generally retried (or other appropriate action is taken).
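
The command and response packets described above can be sketched in code. The following Python is a minimal illustration under stated assumptions, not the patent's implementation: the function and class names are hypothetical, the checksum is assumed to be the byte sum truncated modulo 256, the size byte is assumed to count every byte of the message including the checksum, and the forwarding, response-routing and data fields of the command are passed as one opaque payload. The six address bits, parity bit and read/write bit of the address byte are not modeled.

```python
from dataclasses import dataclass


def checksum(body: bytes) -> int:
    """Sum of all preceding message bytes, kept to one byte (mod 256 is an assumption)."""
    return sum(body) & 0xFF


def build_command(bridge_addr: int, command: int, tag: int, payload: bytes = b"") -> bytes:
    """Assemble address, command, size, tag, payload and checksum into one packet."""
    if command == 0:
        raise ValueError("a zero command byte is reserved for responses")
    size = 4 + len(payload) + 1  # four header bytes + payload + checksum (assumed convention)
    if size > 0xFF:
        raise ValueError("message does not fit a one-byte size field")
    body = bytes([bridge_addr, command, size, tag]) + payload
    return body + bytes([checksum(body)])


@dataclass
class Response:
    smm_addr: int   # address 502 of the originating SMM
    tag: int        # tag 508, echoed from the command
    status: int     # status code field
    error: int      # error code 512, zero when no error occurred
    data: bytes     # zero or more bytes of response data 514


def parse_response(raw: bytes, expected_tag: int) -> Response:
    """Validate a response packet and split it into fields.

    Raises ValueError on any inconsistency, which corresponds to the
    SMM discarding the message.
    """
    if len(raw) < 7:
        raise ValueError("response too short")
    if checksum(raw[:-1]) != raw[-1]:
        raise ValueError("bad checksum")
    smm_addr, command, size, tag, status, error = raw[:6]
    if command != 0:
        raise ValueError("command byte is non-zero, so this is not a response")
    if size != len(raw):
        raise ValueError("size byte does not match message length")
    if tag != expected_tag:
        raise ValueError("tag does not match the transmitted command")
    return Response(smm_addr, tag, status, error, raw[6:-1])
```

In this sketch an SMM-side caller would build a command with a fresh tag, send it over whichever parent bus is in use, and pass the returned bytes to `parse_response` with the same tag. A `ValueError` stands in for the discard path, after which the caller could retry, possibly through the other bridge of the pair.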

Abstract

A fault-tolerant maintenance bus architecture and protocol for use therewith provides dual maintenance buses interconnecting each of a plurality of parent circuit boards. The two maintenance buses are each connected to a pair of system management modules (SMMs) that are configured to perform a variety of maintenance bus activities. Within each parent board are a pair of redundant bridges each having a unique address. One bridge is connected to the first maintenance bus while a second bridge is connected to the second maintenance bus of the pair. A child maintenance bus interconnects the two bridges on a child circuit board. The child maintenance bus is interconnected to a bridge assembly that itself directs messages formatted in the protocol between the subsystem components and the command module through the bridge. The protocol includes a command message structure that uniquely addresses the bridge assembly. It also includes a command string, a command data string for communicating with one of the subsystem components and a command error-checking string. A response message structure is generated by the bridge in response to a command message. The response message uniquely addresses the command module. It includes error and status strings with respect to execution of the command message, a response data string for communicating with the command module and a response error-checking string.

Description

FAULT-TOLERANT MAINTENANCE BUS, PROTOCOL, AND METHOD FOR
USING THE SAME
BACKGROUND OF THE INVENTION
Field of the Invention
This invention relates to fault-tolerant computer systems and more particularly to a dedicated maintenance bus for use with such computer systems.
Background Information
Fault-tolerant computer systems are employed in situations and environments that demand high reliability and minimal downtime. Such computer systems may be employed in the tracking of financial markets, the control and routing of telecommunications and in other mission-critical functions such as air traffic control. A common technique for incorporating fault-tolerance into a computer system is to provide a degree of redundancy to various components. In other words, important components are often paired with one or more backup components of the same type. As such, two or more components may operate in a so-called lockstep mode in which each component performs the same task at the same time, while only one is typically called upon for delivery of information. Where data collisions, race conditions and other complications may limit the use of lockstep architecture, redundant components may be employed in a failover mode. In failover mode, one component is selected as a primary component that operates under normal circumstances. If a failure in the primary component is detected, then the primary component is bypassed and the secondary (or tertiary) redundant component is brought on line. A variety of initialization and switchover techniques are employed to make a transition from one component to another during runtime of the computer system. A primary goal of these techniques is to minimize downtime and corresponding loss of function and/or data.
Fault-tolerant computer systems are often costly to implement since many commercially available components are not specifically designed for use in redundant systems. It is desirable to adapt conventional components and their built-in architecture whenever possible. All modern computer systems have particular capabilities directed to control and monitoring of functions. For example, large microprocessor chips such as the Pentium III™, available from Intel Corporation of Santa Clara, California, are designed to operate within a specific temperature range that is monitored by a commercially available environmental/temperature-sensing chip. One technique for interconnecting such an environmental monitor or other monitoring and control devices is to utilize a dedicated maintenance bus. The maintenance bus is typically separate from the system's main data and control bus structure. The maintenance bus generally connects to a single, centralized point of control, often implemented as a peripheral component interconnect (PCI) device.
However, as discussed above, conventional maintenance bus architecture is not specifically designed for redundant operation. Accordingly, prior fault-tolerant systems have utilized a customized architecture for transmitting monitor and control signals over the system's main buses (or dedicated proprietary buses) using, for example, a series of application specific integrated circuits (ASICs) mounted on each circuit board being monitored. To take advantage of current, commercially available maintenance bus architecture in a fault tolerant computing environment, a more comprehensive and cost-effective approach is needed.
Accordingly, it is an object of this invention to provide a maintenance bus architecture that displays a high degree of fault-tolerance and a protocol for use on this bus. The maintenance bus architecture and associated protocol should be interoperable with commercially available components and should allow a fairly high degree of versatility in terms of monitoring and control of important computer system components.
SUMMARY OF THE INVENTION
This invention overcomes the disadvantages of the prior art by providing a fault-tolerant maintenance bus architecture and bus protocol. The architecture includes two maintenance buses interconnecting each of a plurality of printed circuit boards, termed "parent" circuit boards. The two maintenance buses are each connected to a pair of system management modules (SMMs) that are configured to perform a variety of maintenance bus activities. The SMM can comprise any acceptable device for driving commands according to the protocol on the maintenance bus arrangement. The SMM has general knowledge of the circuit boards and their components. According to a preferred embodiment, the protocol is formatted to operate in accordance with Philips Semiconductors' I2C maintenance bus standard. Other standards are expressly contemplated. Within each parent board are a pair of redundant bridges both having a unique address. One bridge is connected to the first maintenance bus while a second bridge is connected to the second maintenance bus of the pair. A child maintenance bus interconnects the two bridges through a "child" printed circuit board. The introduction of a separate board to implement the child maintenance bus can be useful, but is not essential according to this invention. The child maintenance bus is itself interconnected with a variety of monitor and control functions on maintenance bus-compatible subsystem components. Using the protocol, the SMMs can address components on each child printed circuit board individually and receive appropriate responses therefrom based upon appropriate response identifiers within the protocol. In the event of a bus or bridge failure, the SMM can still communicate with the child subsystem components via the redundant bus and bridge. The protocol includes a unique data packet structure.
The command message initiated by the SMM includes a target bridge header, a command byte (wherein a non-zero byte code designates the message as a command rather than a response), the message size and a unique originator tag value. The command message further includes one or more bytes of forwarding data for subordinate bridges on the child bus (leading to and from remote components/circuitry). Next, the command message has a response byte code to direct responses on the return trip through the bridge. The command message also includes one or more bytes of data to identify, and be used by, subsystem components. Finally, the command message includes one checksum byte that sums all previous message bytes.
A similar message packet is provided by the bridge in response to the command message. The response includes an SMM address byte and a zero-value command byte (indicating a response). Also provided is a byte indicating the overall message size in bytes and the identical tag originally provided in the command packet. The tags allow the SMM to verify that the response is to a particular transmitted command. A one-byte status code field and one-byte error message field are also provided. Unique status codes and error messages are generated by the bridge if a formatted message is incorrect or a commanded action was not (or may not have been) taken by the subsystem. One or more bytes of response data delivered from the subsystem component or bridge are also provided in the response message. Finally, a checksum byte is provided for error checking.
Command message/data packets are transmitted by the SMM to be received by an appropriate component within a given time frame. If an expected response message/data packet is not returned from the component within that time frame, the SMM "times-out" and performs various error procedures that may include an alarm condition, system shut-down and/or retransmission of the packet.
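The time-out behavior just described can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `send_with_retry`, the `transact` callable (assumed to return the response bytes, or `None` when the time frame expires), and the default timeout and attempt count are all hypothetical.

```python
from typing import Callable, Optional


def send_with_retry(transact: Callable[[bytes, float], Optional[bytes]],
                    packet: bytes,
                    timeout_s: float = 0.5,
                    max_attempts: int = 3) -> bytes:
    """Retransmit a command packet until a response arrives or attempts run out.

    `transact` is a hypothetical bus driver hook: it sends `packet` and
    returns the response bytes, or None if nothing came back in `timeout_s`.
    """
    for _ in range(max_attempts):
        response = transact(packet, timeout_s)
        if response is not None:
            return response
    # The SMM has "timed out"; the caller decides between alarm and shut-down.
    raise TimeoutError(f"no response after {max_attempts} attempts")
```

Raising an exception leaves the choice among the other error procedures (alarm condition or system shut-down) to the caller.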
The bridge can include an interconnection to a further bridge. This remote bridge can, itself, be interconnected to additional microprocessors and associated memory. The remote bridge is addressed through one of the parent board's bridges so the communication to and from the SMM can occur. The forwarding data of the command packet enables the packets to be transferred through these further bridges, while stored response data in each subordinate bridge is used to route the return of a response back to the originating SMM. The SMM can be interconnected with a variety of other computer system peripherals and components, and can be accessed over a local network or through an Internet-based communication network.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects and advantages of the invention will become more clear with reference to the following detailed description as illustrated by the drawings in which:
Fig. 1 is a block diagram showing an overview of a fault-tolerant maintenance bus architecture utilizing the maintenance bus protocol according to this invention;
Fig. 2 is a more detailed block diagram showing one parent and child printed circuit board implementing a fault-tolerant maintenance bus according to this invention;
Fig. 3 is the board of Fig. 2 including a bridge for accessing a remote microprocessor board according to an alternate embodiment;
Fig. 4 is a block diagram of the protocol's command message/data packet according to an embodiment of this invention; and
Fig. 5 is a block diagram of the protocol's response message/data packet according to an embodiment of this invention.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
Fig. 1 details a fault-tolerant maintenance bus architecture adapted to use a protocol according to a preferred embodiment of this invention. Before discussing the protocol in detail, the underlying maintenance bus architecture will be explained.
A pair of parent maintenance buses MBA and MBB are shown. These maintenance buses are identical in architecture and can be implemented as a combination of cables, circuitry and circuit board traces. The buses MBA and MBB interconnect with a plurality of input/output (I/O) slots and pin locations within a cabinet that may contain a plurality of circuit boards. The parent maintenance buses MBA and MBB can also jump between cabinets in a larger computer system. It is generally contemplated that the buses are implemented in a multi-cabinet fault-tolerant server system, but the architecture according to this invention can be utilized in a variety of fault-tolerant computing configurations. According to a preferred embodiment, the buses MBA and MBB are each two-wire buses designed to take advantage of integrated circuit components utilizing the I2C bus standard. The I2C bus is a proprietary design of Philips Semiconductors of the Netherlands. This standard has become widely adopted for consumer electronics and various circuit applications, and is now supported by a large number of commercially available monitoring and control devices. Details on the use of the I2C bus can be found in The I2C-bus and how to use it (including specifications), April 1995 update, Chapter 3 by Philips Semiconductors. Typically, the bus is clocked at a speed of approximately 10 Kbytes/sec. While I2C is employed as the bus standard according to a preferred embodiment of this invention, it is expressly contemplated that other maintenance bus standards can be utilized according to the teachings of this invention with appropriate modifications.
To avoid signal loss over long distances, the parent buses MBA and MBB are amplified to generate a signal at 5V x 30mA. A variety of bus amplification circuits can be used. In particular, commercially available hardware bus extenders are employed. The amplified bus operates at a gain that is ten times the normal operating range for an I2C bus (5V x 3mA). As described further below, this difference in operating level between the parent bus and various circuit components is compensated for (on both sides) using the extender hardware.
In Fig. 1, three circuit board assemblies 102, 104 and 106 are shown. Each of these circuit board assemblies is interconnected with the bus pair (MBA and MBB). These board assemblies can represent a variety of computer system components. For example, the boards can together comprise a set of redundant identical boards or a set of separate functions including a central processing unit (CPU) board, "front panel" board and input/output (I/O) board. Each board assembly 102, 104 and 106 is defined functionally as a parent printed circuit board 112, 114 and 116 and an associated child printed circuit board 122, 124 and 126. As discussed above, while a separate board to implement the child maintenance bus can be useful according to an embodiment of this invention, it is not required. In general, the division between the parent board and child board is somewhat arbitrary, and the actual physical structure for one or more boards can be implemented as a single plug-in printed circuit card residing in a connector socket or slot on a larger cabinet-based motherboard. Also interconnected with the parent bus pair MBA and MBB are a pair of redundant system management modules (SMMs) identified herein as SMMA 128 and SMMB 130. Each SMM is a microprocessor-based component. The SMMs 128 and 130 each reside on a PCI bus 132 and 134. The SMM performs a variety of functions and includes both Ethernet and modem capabilities allowing it to interconnect with the computer operating system and other network communication structures (block 136). The SMM may also include other unrelated system components such as a video driver chip. The SMM is particularly based around a Motorola PowerPC™ 860T microprocessor utilizing the VxWorks real-time operating system available from Wind River Systems, Inc. of Alameda, California. The term "system management module" (or SMM) is defined broadly to include any acceptable device for driving commands on the maintenance bus arrangement.
While a microcontroller described above is used in a preferred embodiment, the SMM can be an application specific integrated circuit (ASIC), a programmable logic array, a microprocessor unit or any other command originator interconnected with the maintenance bus arrangement. The term "command module" is also used to define the SMM in its various possible embodiments.
Each SMM includes a pair of I2C buses 138 and 140. Each pair is connected with a respective bus from the parent bus pair MBA and MBB. The SMMs are configured to operate in failover mode. In other words, SMMA operates under normal circumstances. In this mode SMMB monitors and communicates with SMMA over the shared I2C bus, ready to take over for SMMA if it fails. If a failure is detected, then SMMB takes over operation; otherwise it is largely idle during normal run time. The function of the SMMs is described in further detail below. In summary, the SMMs carry information about components on each of the board assemblies 102, 104 and 106. The SMMs use this information to monitor and generally control the board assemblies. This information may be transferred to other parts of the computer system and over a network via the PCI bus.
Each parent board 112, 114 and 116 includes various data processing, display and communication capabilities in accordance with its purpose. Each board 112, 114 and 116 also includes a respective CPU (CPU1, CPU2 and CPU3) 152, 154 and 156, respectively. Each CPU can comprise an Intel Pentium III™, Xeon™ or any other acceptable microprocessor having I2C or equivalent maintenance bus architecture. Each board 112, 114 and 116 is interconnected with the parent buses MBA and MBB at various interconnection points, where appropriate. Since the parent bus is amplified by a gain of approximately ten times the normal I2C operating level, interconnections with the parent buses can be made via bus extenders to be described further below. Each parent board 112, 114 and 116 also includes a pair of interconnections 160 and 162 with each of the respective parent maintenance buses MBA and MBB. The interconnections 160 and 162 link to respective bridges 192 and 194. These bridges, in turn, interconnect with respective child maintenance buses CB1, CB2 and CB3 to interconnect child board components. On each child board 122, 124 and 126 reside various control and monitoring subsystem components 172, 174 and 176, respectively. The subsystem components are described in further detail below. These components are each in communication with the maintenance bus using the preferred I2C standard. The interconnection between each parent board and child board occurs via a pair of bridges 192 and 194. Each bridge is essentially identical in architecture, each pair of bridges 192 and 194 having the same address for communication with the SMMs. The address of the bridge pair on each board, however, differs so that the SMM can uniquely address a specific board. Addresses are established based upon the pin and socket arrangement for the respective bridge. It is useful to assign the same address to both bridges 192, 194 in the pair since they each reside on a different bus (MBA or MBB).
For a given transaction, the SMM utilizes only one of the two bridges on a parent board to accomplish a task. The other, unused bridge in the pair can be used if the SMM cannot complete the transaction with the original bridge. The SMM uses the other parent maintenance bus to access the other, previously unused bridge. As noted above, in order to provide an amplified parent bus signal and an appropriate signal level to the subsystem bridges, bus extender hardware is employed. The bus extender hardware is available from Philips Semiconductors under part number 82B715. Using amplified parent buses, approximately thirty or more loads can be carried. The extender acts as a buffer for signals traversing the extender hardware, providing the necessary amplification and deamplification. Extender components 196 are provided between the parent buses MBA and MBB and corresponding bridge interconnections 160 and 162. While not shown, interconnections 160 and 162 can also include appropriate series resistors and FET triggers in line with the extender components 196 in accordance with the 82B715 hardware manufacturer's data sheet.
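As an illustration of how an SMM might fall back to the second bridge, the following Python sketch tries each parent bus in turn. It assumes, as the description suggests, that both bridges of a pair answer to the same address on their respective buses. The function name, the `buses` mapping and the per-bus `send` callables are hypothetical stand-ins for real bus drivers.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical per-bus driver: returns response bytes, None on time-out,
# or raises OSError on a hard bus fault.
SendFn = Callable[[int, bytes], Optional[bytes]]


def send_via_redundant_buses(buses: Dict[str, SendFn],
                             bridge_addr: int,
                             packet: bytes) -> Tuple[str, bytes]:
    """Try the transaction on each parent bus in order; return the first reply."""
    failures = []
    for name, send in buses.items():
        try:
            reply = send(bridge_addr, packet)
        except OSError as exc:
            failures.append((name, str(exc)))
            continue
        if reply is not None:
            return name, reply
        failures.append((name, "timeout"))
    raise ConnectionError(
        f"bridge {bridge_addr:#04x} unreachable on all buses: {failures}")
```

Passing the buses as an ordered mapping such as `{"MBA": ..., "MBB": ...}` keeps the preferred bus first and the redundant bus as the fallback.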
The bridges 192 and 194 each act as store-forward devices in the transfer of I2C signals into and out of the child board subsystem. In other words, the bridges receive packetized signals from the SMMs and transfer them to appropriate I2C-compatible maintenance bus ports on subsystem components. Likewise, the bridges receive signals from subsystem components and transfer them back to the SMMs. In order to provide desired fault-tolerance, two bridges 192 and 194 are employed, each communicating with one of the dual parent buses MBA and MBB. In this manner, the failure of a single bridge or parent bus does not cause a loss of connection between the subsystem components and SMMs. This is because each child bus CB1, CB2 and CB3 is interconnected with both bridges simultaneously. The subsystem components are accessed via the child bus. A reset connection (R1, R2 and R3) and power connection (P1, P2 and P3) extend from each bridge in the pair. A reset and/or power command from an SMM to the active bridge in the pair is used to power-up or reset the underlying board assembly. The SMMs are configured to provide independent reset and power commands to the bridges 192 and 194 to allow powering and reset of each underlying board through the maintenance bus arrangement. In general, the active bridge performs power-up. However, the bridges are configured to handshake, or otherwise communicate, to ensure that the board hardware is functioning properly before power-up occurs generally within the board. According to a preferred embodiment, each bridge 192, 194 comprises a commercially available Intel 87C54 microcontroller. This circuit package includes a built-in programmable storage device (an erasable programmable read-only memory EPROM) and 256 bytes of random access memory (RAM). This package is relatively low-cost and complete.
Data traveling over the I2C bus is buffered in the RAM while basic routing and power control functions are preprogrammed into the bridge microcontroller EPROM. Though the 87C54 is the preferred embodiment, any microcontroller with sufficient I/O ports to drive both parent and child maintenance buses could instantiate the bridges 192 and 194.
With further reference to Fig. 2, the exemplary parent board assembly 102 is shown in further detail. Particularly, the subsystem components 172 interconnected with the I2C bus are illustrated. The electrically-erasable programmable read only memory (EEPROM) carrying the board identification (ID), generally termed the IDPROM 202, is provided on the bus. In addition, a light-emitting diode (LED) monitor 204 is provided. This LED provides a visible indication of the status of the board for an operator of the board. In addition, an environmental monitor chip 206 having I2C compatibility is provided. This chip typically monitors temperature and other important functions and transmits appropriate data and/or alarms regarding environment. Microprocessor information for CPU1 112 is also interconnected with the bus CB1 via an I2C interconnection. The CPU support information 208 is transmitted over the I2C bus, as well as other important status data. I2C interconnections with the dual inline memory module sockets (DIMMS) 210 of the board assembly are also provided by the child bus CB1. In addition, other I/O ports 212 with I2C capabilities may be serviced by the child bus CB1.
Having described the architecture of the maintenance bus arrangement, the protocol operating thereon is now described in detail. It is contemplated that the SMMs communicate with the bridges 192 and 194 via data packets, sent over the I2C bus with appropriate destination addresses. In general, each bridge and subsystem component contains its own unique address on the maintenance bus that makes it identifiable by the SMMs. The SMMs have knowledge of the subsystem components on each board. Packets sent to and from the SMM have the bridge identification and the data within the packet is used to identify the particular subsystem device. A variety of protocols and communication techniques can be used according to this invention. Heretofore, I2C connections have operated using a highly simplified communication scheme without the benefit of addressing and protocol techniques. Because of the fault-tolerant nature of the bus arrangement and bridge system, addressing of control and monitor functions between the SMMs and the appropriate bridge is highly desirable. Referring to Fig. 4, a command message/data packet structure is shown schematically. While the illustrated message 400 is a command message initiated by the SMM, the packet structure to be described is generally a two-way message structure in which command versus response messages are differentiated by a unique command designator/identifier byte within the message header.
The command data packet 400, which is transferred between the SMM and the various subsystem components, includes an address header 402. This header is typically limited to one byte of information. In general, the architecture is arranged so that one byte (six address bits, one parity bit and one read/write bit) is sufficient to direct the packet to an appropriate bridge and corresponding subcomponent. In particular, the address specifies the target bridge through which data is transferred. As described further below, the final delivery of a command to a specific subcomponent is facilitated by the data bytes of the packet.
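The one-byte address header can be sketched in Python as follows. The text gives only the bit budget (six address bits, one parity bit, one read/write bit). The bit positions used here (address in bits 7..2, parity in bit 1, read/write in bit 0) and the choice of even parity over the address bits are assumptions made purely for illustration, as are the function names.

```python
def _parity(address: int) -> int:
    """Even-parity bit over the six address bits (assumed parity scheme)."""
    return bin(address).count("1") & 1


def encode_address_byte(address: int, read: bool) -> int:
    """Pack a 6-bit bridge address, a parity bit and a read/write bit.

    Assumed layout: bits 7..2 = address, bit 1 = parity, bit 0 = R/W.
    """
    if not 0 <= address < 64:
        raise ValueError("address must fit in six bits")
    return (address << 2) | (_parity(address) << 1) | int(read)


def decode_address_byte(byte: int) -> tuple:
    """Unpack the header byte and reject it if the parity bit disagrees."""
    address = (byte >> 2) & 0x3F
    if (byte >> 1) & 1 != _parity(address):
        raise ValueError("address parity error")
    return address, bool(byte & 1)
```

With six address bits, at most 64 bridges can be addressed directly, which is why delivery to a specific subcomponent relies on the data bytes of the packet.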
Following the address header 402, a command byte 404 is provided. Specifically, if the transferred packet is a command packet from the SMM, the command byte is enabled with a specific recognized command byte code. The command bytes direct a particular subsystem component to perform a particular action, cause a bridge to perform a power or reset function, or cause a subordinate bridge to perform a forwarding operation (described further below). The following is an exemplary list of command codes:
Figure imgf000010_0001
Figure imgf000011_0001
The foregoing list of commands is only exemplary, and is specifically adapted to an I2C standard application. Where the standard and available subcomponents vary from those described herein, a different set of commands may be appropriate.
Following the command byte is a one-byte number designating the overall message size 406. The overall message size is indicated as a number of bytes by the one-byte string.
Next, a tag byte 408 is provided to the header packet. This tag byte is generated by the originator (the SMM) of the data packet. It is a unique one-byte number. The tag byte is repeated in a response message, as described below. In other words, when a message is originated and transferred to a subsystem, the appropriate response sent by the subsystem component should include the same tag byte, indicating that the message was received properly and acted upon. If the tag byte is not received, then an error has occurred and appropriate action is taken by the SMM.
Next, one or more bytes of forwarding data 410 are provided. The forwarding data enables a hierarchy of bridge structures to be established within the child bus. In connection with the forwarding data 410, reference is also made to Fig. 3, which again illustrates the exemplary board assembly 102. The subsystem 172 of this board includes the set of subsystem components described above with reference to Fig. 2. In addition, another bridge 302 is interconnected to the child bus CB1. This bridge is similar in configuration to the bridges 192 and 194 and can be constructed from the same type of microcontroller circuit. The bridge 302 includes another discrete address that is recognized by the SMM so that data is transferred via the bridges 192, 194 to the subordinate bridge 302 as if it were travelling to any other subsystem component. The packet structure according to this embodiment enables a large number of remote components to be accessed notwithstanding the relatively small (one-byte) address header 402. With reference again to Fig. 3, once a command message packet is received, the subordinate bridge 302 stores and forwards the message to the I2C-compatible ports on further computer circuitry 304. Note that the CPU information block 208 is connected through the subordinate bridge 302 according to Fig. 3. According to this embodiment, the processor information is located behind the child bridge, accounting for the depicted arrangement. In this example, the circuitry 304 includes another microprocessor (such as an Intel Xeon™) and/or associated memory and other peripherals. The above-described protocol enables messages to be transferred from the child bus through bridges to additional, subordinate bridges (such as bridge 302). Further components, such as circuitry 304, can be accessed through these subordinate bridges. The subordinate bridge acts as a hierarchy to access the remote microprocessor circuitry 304 given the limited addressing available for the protocol.
In this sense, the circuitry 304 is invisible to the child bus CB1. Based upon the above-described structure, a series of subordinate bridges can be chained in series. Each subordinate bridge uses response data in the command message to route a received response back up the hierarchy of bridges to the originating SMM. Each bridge in the bridge hierarchy stores knowledge of the message transferred therefrom. The response field 412 is stripped and stored by each bridge along the pathway, and the bridge's own field is substituted. This enables responses to command messages to return to their source (SMM) after passing back from the subordinate location through the bridge system.
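The strip, store and substitute handling of the response field can be sketched in Python as below. This is a simplified model, not the bridge firmware: the `Bridge` class, its method names, the one-byte response field and the use of the tag byte as the lookup key are all assumptions, and the addresses are placeholders.

```python
class Bridge:
    """Toy store-and-forward bridge that remembers where responses go."""

    def __init__(self, address: int):
        self.address = address
        self.pending = {}  # tag -> stored response field (assumed keying)

    def forward_command(self, tag: int, response_field: int) -> int:
        """Strip and store the incoming response field; substitute our own."""
        self.pending[tag] = response_field
        return self.address

    def route_response(self, tag: int) -> int:
        """Return the address the response must be sent to next."""
        return self.pending.pop(tag)


# Hypothetical addresses: SMM, a parent-board bridge, a subordinate bridge.
SMM_ADDR, PARENT_ADDR, SUB_ADDR = 0x01, 0x10, 0x20
parent, sub = Bridge(PARENT_ADDR), Bridge(SUB_ADDR)

# Outbound: each hop replaces the response field with its own address.
field = parent.forward_command(tag=0x07, response_field=SMM_ADDR)
field = sub.forward_command(tag=0x07, response_field=field)

# Return trip: each hop looks up where the command came from.
hop1 = sub.route_response(0x07)      # back to the parent-board bridge
hop2 = parent.route_response(0x07)   # back to the originating SMM
```

Because each bridge keeps only its own upstream hop, no bridge needs to know the full path back to the SMM.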
Referring again to the command data packet 400 of Fig. 4, the structure includes zero or more bytes of data 414, following the response byte field 412. In general, data is provided as part of a particular transaction under the I2C protocol. This data is subcomponent-specific. In addition, the data provides the identification of a particular component on the child bridge, or in a remote, subordinate bridge hierarchy (such as the circuitry 304).
Finally, the command packet 400 includes a checksum comprising a one-byte number. The checksum indicates the sum of previous bytes in the message. If either the bridge or an SMM discovers an incorrect checksum byte in a message, then the message is discarded. An erroneous message can trigger a variety of actions, including resending of the message, an alarm condition or shut down as appropriate. Note that at least two error modes for the SMM exist: (1) where it receives no response following the transmission of a message, resulting in a "time-out" state; and (2) an error state in which a message is returned that is properly formatted, but is not understood by the bridge or the SMM as applicable. One possible corrective action is to retry sending of the messages for a predetermined number of times. Certain subsystem devices that must return data require a positive response, and thus, a full-way communication is required for an error not to occur. Conversely, where a power-on command is sent by the SMM or another non-response-dependent action occurs within the subsystem, then failure to return a response may cause an indication that an error may be present, but the action is possibly completed. These conditions are described further with respect to the response phase that is now discussed in detail.
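To make the command packet layout concrete, here is a minimal Python sketch that assembles the fields in the order of Fig. 4. Several details are assumptions rather than statements from the text: the checksum is taken as the byte sum modulo 256, the size byte is taken to count every byte including the address and checksum, and the command code `0x10` and all addresses are placeholders (the actual command code table is only available as an image).

```python
def checksum8(data: bytes) -> int:
    """Sum of all bytes truncated to one byte (modulo 256 is an assumption)."""
    return sum(data) & 0xFF


def build_command(bridge_addr: int, command: int, tag: int,
                  forwarding: bytes, response_field: bytes,
                  data: bytes = b"") -> bytes:
    """Assemble address, command, size, tag, forwarding, response, data, checksum."""
    if command == 0:
        raise ValueError("a zero command byte marks a response, not a command")
    if not forwarding:
        raise ValueError("the description calls for one or more forwarding bytes")
    # Four fixed header bytes + variable fields + one checksum byte.
    size = 4 + len(forwarding) + len(response_field) + len(data) + 1
    if size > 0xFF:
        raise ValueError("message size must fit in one byte")
    body = bytes([bridge_addr & 0xFF, command & 0xFF, size, tag & 0xFF])
    body += forwarding + response_field + data
    return body + bytes([checksum8(body)])


# Placeholder values only; none of these codes come from the patent.
packet = build_command(0x42, 0x10, tag=0x07,
                       forwarding=b"\x00", response_field=b"\x20",
                       data=b"\x01\x02")
```

A receiver can validate such a packet by recomputing `checksum8(packet[:-1])` and comparing it with the final byte, discarding the message on a mismatch.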
The response message/data packet 500 is shown in Fig. 5. It includes a one-byte address 502 designating the SMM. Where the command packet contains a command byte (404), the response packet contains a reserve value 504, generally equal to zero. This, in fact, indicates that a response is being transferred over the maintenance bus.
The response packet 500 also includes an overall message size 506 similar to the command message size 406. The tag 508 of the packet 500 is the original tag (408) in the command message, now being returned by subsystem components. Next is provided a one-byte status code 510. The status code, as described above, is an indicator that a problem may exist, but the system does not have confirmation. For example, power may have been turned on, but no response is given. Different potential problems may be indicated by different status codes. In general, the status code is appended to the response by the active bridge. If the command is, however, misunderstood or improperly formatted, then an appropriate error code 512 is generated. The absence of an error is generally indicated by a zero value. Conversely, the specific errors are indicated by particular byte codes. Exemplary error codes are listed in the following table:
Figure imgf000013_0001
Figure imgf000014_0001
Figure imgf000015_0001
Figure imgf000016_0001
local_unknown_type (0x22): The firmware's local command state machine was passed a command which is an incorrect type. This error means that the reporting bridge has encountered a firmware bug. This event freezes the bridge's history log.
local_bad_struct_length (0x23): The write state structure, which contains a valid bridge local command, is not the correct length.
xint_source_signal_stuck_low (0x24): A transition from low to high was expected on a signal which the bridge released (i.e., it stopped grounding it), but the signal did not go high. This signal is used by the reporting bridge to inform its peer subsystem bridge of the action it should take. Several commands require the cooperation of both bridges in an I2C subsystem in order to carry out an action, and "source" signals are utilized in conjunction with a cross-interrupt mechanism to synchronize the bridges. Source signals are routed between two subsystem bridges only.
target_signal_stayed_low (0x25): A transition from low to high was expected on a signal which the bridge released (i.e., it stopped grounding it), but the signal didn't go high. A "target" signal is one of the signals that both subsystem bridges use to control surrounding hardware on an assembly - such as assembly reset, power enable, or NMI.
xint_source_signal_stuck_high (0x26): A transition from high to low was expected on a signal which the bridge is now grounding. This signal is used by the reporting bridge to inform its peer subsystem bridge of the action it should take. Several commands require the cooperation of both bridges in an I2C subsystem in order to carry out an action, and source signals are utilized in conjunction with a cross-interrupt mechanism to synchronize the bridges. Source signals are routed between two subsystem bridges only. The stuck high error should be impossible given the bridge's microcontroller hardware, but the firmware tests the signal for this condition as a sanity check. This error would indicate a fault in the
[Table continues only as images in the original publication: figures imgf000018_0001 and imgf000019_0001]
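As a rough illustration of how a command module might turn the one-byte error code into a readable name, the following sketch maps only the codes that survive as text in the table above. The dictionary, the function name describe_error, the no_error label for the zero value, and the fallback string are hypothetical and are not part of the described protocol; rows that exist only as images are deliberately not guessed at.

```python
# Hypothetical lookup table. Only codes 0x22-0x26 come from the table text;
# the "no_error" label is an assumed name for the zero value, which the
# description says generally indicates the absence of an error.
ERROR_CODES = {
    0x00: "no_error",
    0x22: "local_unknown_type",
    0x23: "local_bad_struct_length",
    0x24: "xint_source_signal_stuck_low",
    0x25: "target_signal_stayed_low",
    0x26: "xint_source_signal_stuck_high",
}


def describe_error(code: int) -> str:
    """Return a name for a one-byte error code, or a generic label if unlisted."""
    if not 0 <= code <= 0xFF:
        raise ValueError("error code must fit in one byte")
    # Codes not present in the table text are reported generically, not guessed.
    return ERROR_CODES.get(code, f"unlisted_error_0x{code:02X}")
```

A caller could log describe_error(0x24) alongside the raw byte so that a stuck cross-interrupt source signal is recognizable without consulting the table.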
Referring further to Fig. 5, the response packet 500 next includes zero or more bytes of response data 514. This data can comprise environmental telemetry or a variety of other required response data from the selected subsystem components. Finally, a checksum byte 516 is provided. This checksum, again, indicates the sum of the previous bytes in the response message. If the SMM or a bridge discovers an incorrect checksum, then the response is discarded and the command-response cycle is generally retried (or other appropriate action is taken).
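The checksum test and retry behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sketch starts at the size field and omits any addressing bytes that precede it; the size, tag, status code, and error code are each assumed to occupy one byte; the size is assumed to count every byte of the sketched packet including the checksum; the checksum is assumed to be the byte sum modulo 256; and discarding a response whose tag does not match is likewise an assumption. All function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HEADER_LEN = 4  # assumed: size, tag, status, error, one byte each


@dataclass
class Response:
    size: int
    tag: int
    status: int
    error: int
    data: bytes


def checksum(payload: bytes) -> int:
    """Sum of the preceding bytes, truncated to one byte (assumed modulo 256)."""
    return sum(payload) & 0xFF


def build_response(tag: int, status: int, error: int, data: bytes = b"") -> bytes:
    """Assemble a packet in the assumed layout, ending with the checksum byte."""
    size = HEADER_LEN + len(data) + 1
    body = bytes([size, tag, status, error]) + data
    return body + bytes([checksum(body)])


def parse_response(packet: bytes, expected_tag: int) -> Optional[Response]:
    """Return the parsed response, or None if the packet should be discarded."""
    if len(packet) < HEADER_LEN + 1:
        return None
    body, received = packet[:-1], packet[-1]
    if checksum(body) != received:
        return None  # incorrect checksum: discard
    size, tag, status, error = body[:HEADER_LEN]
    if size != len(packet) or tag != expected_tag:
        return None  # assumed extra checks on size and returned tag
    return Response(size, tag, status, error, bytes(body[HEADER_LEN:]))


def command_with_retry(send: Callable[[], bytes], expected_tag: int,
                       attempts: int = 3) -> Optional[Response]:
    """Repeat the command-response cycle until a valid response arrives."""
    for _ in range(attempts):
        response = parse_response(send(), expected_tag)
        if response is not None:
            return response
    return None
```

Here send stands in for whatever routine issues the command and returns the raw response bytes. The fixed attempt count is a placeholder for the "other appropriate action" the description leaves open.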
It should now be clear that the foregoing architecture and accompanying protocol enable an effective and low-cost technique for implementing a fault-tolerant maintenance bus within a number of separate computer components.
The foregoing has been a detailed description of a preferred embodiment. Various modifications and additions can be made without departing from the spirit and scope of the invention. For example, while the maintenance bus is implemented as an I2C standard, it can be implemented in any other acceptable standard and the number of lines in the bus can be varied from the two lines shown. While a serial maintenance bus is utilized, it is contemplated that a parallel bus can be employed according to an alternate embodiment. Various components such as bridges and SMMs can be implemented using a variety of commercially available and customized circuits. Accordingly, this description is meant to be taken only by way of example, and not to otherwise limit the scope of the invention.
What is claimed is:

Claims

1. A fault-tolerant maintenance bus protocol for communicating between a command module located on a parent maintenance bus and a plurality of subsystem components joined together on a child maintenance bus, wherein the child maintenance bus is interconnected to a bridge assembly that directs messages formatted in the protocol between the subsystem components and the command module through the bridge, the protocol comprising: a command message structure that uniquely addresses the bridge assembly, and that includes a command string, a command data string for communicating with one of the subsystem components and a command error-checking string; and a response message structure that uniquely addresses the command module, and that includes error and status strings with respect to execution of the command message, a response data string for communicating with the command module and a response error-checking string.
2. The protocol as set forth in claim 1 wherein the command string is positioned on the structure directly following a bridge address string and further comprising a response string directly following a command module address string, the response string having a unique value indicative of a response message structure.
3. The protocol as set forth in claim 2 wherein the plurality of subsystem components are contained on a circuit board having a plurality of components, the circuit board being located on the parent maintenance bus and the command string includes command information formatted to instruct the bridge to power-up the plurality of components on the circuit board.
4. The protocol as set forth in claim 3 wherein the command data string includes command instructions formatted for one of the subsystem components.
5. The protocol as set forth in claim 4 wherein the response data string includes information provided by one of the subsystem components.
6. The protocol as set forth in claim 5 wherein the status and error strings include codes specifying erroneous conditions in execution of the command string.
7. The protocol as set forth in claim 6 wherein the command error-checking string and the response error-checking string each include an identical tag string, the tag string being initially provided by the command module in the command message structure and being returned by the bridge in the response message structure.
8. The protocol as set forth in claim 7 wherein the command error-checking string and the response error-checking string each include a checksum string indicative of a total length of the command message structure and the response message structure respectively.
9. The protocol as set forth in claim 6 wherein the bridge address string is adapted to address one of either the bridge or another bridge residing on the circuit board thereon each of the bridge and the other bridge being joined to the child bus and each of the bridge and the other bridge having the same address.
10. The protocol as set forth in claim 9 wherein each of the command message structure and the response message structure are adapted to travel between the command module and the other bridge over another parent maintenance bus connected to the other bridge.
11. The protocol as set forth in claim 6 wherein the response string provides the bridge with address data for directing the response message structure.
12. The protocol as set forth in claim 11 wherein the command data string includes information for identifying a subordinate bridge in the child maintenance bus, whereby the command message structure is routed through the subordinate bridge to remote circuitry.
13. A method for using a fault-tolerant maintenance bus protocol for communicating between a command module located on a parent maintenance bus and a plurality of subsystem components joined together on a child maintenance bus, wherein the child maintenance bus is interconnected to a bridge assembly that directs messages formatted in the protocol between the subsystem components and the command module through the bridge, the method comprising the steps of: transmitting, from the command module, a command message structure that uniquely addresses the bridge assembly, and that includes a command string, a command data string for communicating with one of the subsystem components and a command error-checking string; and transmitting, from the bridge, a response message structure that uniquely addresses the command module, and that includes error and status strings with respect to execution of the command message, a response data string for communicating with the command module and a response error-checking string.
14. The method as set forth in claim 13 wherein the step of transmitting the command message structure and the step of transmitting the response message structure includes selectively transmitting the command message structure and transmitting the response message structure over one of either the parent maintenance bus to the bridge or to another parent maintenance bus to another bridge, each of the bridge and the other bridge being interconnected to the child maintenance bus.
15. The method as set forth in claim 14 wherein the step of transmitting the command message structure includes providing an address string that identifies each of the bridge and the other bridge.
16. The method as set forth in claim 15 further comprising providing information with respect to a selected one of the subsystem components in the command data string and the response data string.
17. The method as set forth in claim 16 wherein the step of transmitting the command message structure includes providing the bridge with address data in the response string for subsequently directing the response message structure.
18. The method as set forth in claim 17 further comprising providing, in the command data string, information for identifying a subordinate bridge in the child maintenance bus, whereby the command message structure is routed through the subordinate bridge to remote circuitry.
19. A maintenance bus architecture for a fault-tolerant computer system having a plurality of circuit board assemblies and maintenance bus-compatible subsystem components thereon comprising: a first parent maintenance bus and a second parent maintenance bus interconnecting to each of the plurality of circuit board assemblies; a command module interconnected with each of the first parent maintenance bus and the second parent maintenance bus, the command module being constructed and arranged to transmit and receive control and monitor data over the first parent maintenance bus and the second parent maintenance bus in a predetermined format; a first bridge and a second bridge associated with each of the plurality of circuit boards; each first bridge being interconnected with the first parent maintenance bus and each second bridge being connected with the second parent maintenance bus; a child maintenance bus interconnected between the first bridge and the second bridge, the child maintenance bus being interconnected to predetermined ports on each of the maintenance bus-compatible subsystem components; and wherein each of the first bridge and the second bridge are constructed and arranged to transfer the control and monitor data addressed thereto between the child maintenance bus and the first parent maintenance bus and second parent maintenance bus, respectively, only one of the first bridge and the second bridge being active to transfer the control and monitor data at a given time.
20. The maintenance bus architecture as set forth in claim 19 wherein the first bridge and the second bridge include reset and power connections therebetween and the command module is constructed and arranged to transmit reset and power signals directly to each of the first bridge and the second bridge to thereby control power and reset of components on the respective of the board assemblies.
21. The maintenance bus architecture as set forth in claim 20 wherein each of the first bridge and the second bridge and each of the subsystem components is constructed and arranged to be uniquely identified by the command module.
22. The maintenance bus architecture as set forth in claim 21 wherein each of the subsystem components includes maintenance bus ports arranged according to a two-wire I2C maintenance bus standard interconnected with the child maintenance bus and wherein each of the first parent maintenance bus, the second parent maintenance bus and the child maintenance bus are arranged according to the two-wire I2C maintenance bus standard.
23. The maintenance bus architecture as set forth in claim 22 further comprising another command module interconnected to each of the first parent maintenance bus and the second parent maintenance bus, the other command module being constructed and arranged to monitor the command module and to provide backup for the command module.
24. The maintenance bus architecture as set forth in claim 23 wherein the subsystem components include an environmental monitor, an IDPROM for the circuit board and an LED indicator.
25. The maintenance bus architecture as set forth in claim 23 further comprising a bus extender for amplifying the first parent bus and the second parent bus and for providing a deamplified signal to the first bridge and the second bridge.
26. The maintenance bus architecture as set forth in claim 23 further comprising a third bridge interconnected with the child bus constructed and arranged to transfer the monitor and control data from the child bus to a maintenance bus-compatible port of a remote circuit.
27. The maintenance bus architecture as set forth in claim 26 wherein the remote circuit comprises a microprocessor.
PCT/US2001/011804 2000-04-13 2001-04-11 Fault-tolerant maintenance bus, protocol, and method for using the same WO2001079962A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001251536A AU2001251536A1 (en) 2000-04-13 2001-04-11 Fault-tolerant maintenance bus, protocol, and method for using the same

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US09/548,202 US6691257B1 (en) 2000-04-13 2000-04-13 Fault-tolerant maintenance bus protocol and method for using the same
US09/548,202 2000-04-13
US09/548,536 2000-04-13
US09/548,536 US6633996B1 (en) 2000-04-13 2000-04-13 Fault-tolerant maintenance bus architecture

Publications (2)

Publication Number Publication Date
WO2001079962A2 true WO2001079962A2 (en) 2001-10-25
WO2001079962A3 WO2001079962A3 (en) 2003-02-27

Family

ID=27068796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/011804 WO2001079962A2 (en) 2000-04-13 2001-04-11 Fault-tolerant maintenance bus, protocol, and method for using the same

Country Status (2)

Country Link
AU (1) AU2001251536A1 (en)
WO (1) WO2001079962A2 (en)

Also Published As

Publication number Publication date
WO2001079962A3 (en) 2003-02-27
AU2001251536A1 (en) 2001-10-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP