WO1997022930A1 - Transparent fault tolerant computer system - Google Patents

Transparent fault tolerant computer system

Info

Publication number
WO1997022930A1
Authority
WO
WIPO (PCT)
Prior art keywords
replica
primary
backup
operating system
supervisor
Prior art date
Application number
PCT/US1996/018584
Other languages
French (fr)
Other versions
WO1997022930A9 (en
Inventor
Thomas C. Bressoud
John E. Ahern
Kenneth P. Birman
Robert C. B. Cooper
Bradford B. Glade
Fred B. Schneider
John D. Service
Original Assignee
Stratus Computer, Inc.
Priority date
Filing date
Publication date
Application filed by Stratus Computer, Inc. filed Critical Stratus Computer, Inc.
Priority to AU11211/97A priority Critical patent/AU1121197A/en
Publication of WO1997022930A1 publication Critical patent/WO1997022930A1/en
Publication of WO1997022930A9 publication Critical patent/WO1997022930A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques

Definitions

  • the invention relates to fault-tolerant computer systems and, in particular, to replica coordinators for such fault-tolerant systems.
  • each replica is to be a deterministic state machine that reads a sequence of commands, each command causing a state transition that is completely determined by the command and the current state of the machine.
  • the state transitions can produce outputs to an environment, e.g. I/O requests.
  • Each replica of the state machine starts in the same state and reads an identical sequence of commands.
  • Each replica, therefore, undergoes an identical sequence of state transitions and produces an identical sequence of outputs.
  • This approach ensures that, when a failure occurs, the spare is in the same state as the failed component at the time of, or just prior to, the failure.
  • the spare can therefore interact with the environment in a manner consistent with past interactions between the now-failed component and the environment.
  • Fault-tolerant systems mask replica failures by combining the output sequences from multiple replicas into a single output sequence that appears to have come from a single, non-faulty state machine.
  • each replica produces a sequence of outputs, but a replica coordination mechanism allows only one replica's sequence of outputs to reach the environment. For example, accepting outputs from 2t+1 replicas, a "majority voter" can selectively provide the outputs to the environment and thereby mask as many as t faulty replicas.
  • in a "primary/backup" method of replica coordination, all non-faulty replicas perform the same computations, but only one replica (the primary) interacts with the environment.
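The majority-voter idea mentioned above can be illustrated with a minimal sketch. The function name `majority_vote` and the string outputs are assumptions made for illustration; they do not come from the patent, which describes the voter only in general terms.

```python
from collections import Counter

def majority_vote(outputs, t):
    """Pick the output agreed on by at least t+1 of 2t+1 replicas."""
    if len(outputs) != 2 * t + 1:
        raise ValueError("expected exactly 2t+1 replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < t + 1:
        # More than t replicas disagree, so no output can be trusted.
        raise RuntimeError("no majority among replica outputs")
    return value

# With t = 1, one faulty replica out of three is masked.
voted = majority_vote(["write block 7", "write block 7", "garbage"], t=1)
```

Because any value produced by t+1 replicas must include at least one non-faulty replica, up to t faulty outputs cannot change the result.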
  • a computer system is composed of layers, as shown in Fig. 1.
  • An application program makes calls to application support routines (DLLs, RTLs, etc.), which in turn call operating system software.
  • the application software can bypass the support routines and call the operating system directly.
  • Lower layers include computer hardware (CPU, memory, bus, network, etc.), and I/O components (disks, user terminals, etc.)
  • Prior-art systems provide replica coordination by adding a layer or significantly modifying one or more of the existing layers. Each such approach poses problems, however.
  • the invention provides "transparent fault-tolerance" by interposing a new layer (a "replica supervisor") between the application/application support layers (hereinafter the "application") and the operating system layer without altering the operating system interface, as viewed by the application.
  • Transparent fault-tolerance means that application developers need not be aware of the replica supervisor when writing an application and an existing application can be made fault-tolerant without a developer rewriting the application.
  • An additional processor and operating system executes and provides services to, respectively, each replica of the application.
  • interposing the replica supervisor above the operating system enables the system to tolerate hardware faults and faults in the operating system that occur independently on the different processors, such as faults induced by other, possibly unrelated, applications.
  • the operating system can be replaced, e.g. with a newer version, on one processor at a time. If more than two processors are provided, continuous fault-tolerance is provided.
  • the replica supervisors intercept calls to the operating system made by the replicas and asynchronous events delivered by the operating system to the replicas.
  • the replica supervisors communicate with each other to coordinate the multiple replicas and ensure that only one replica, a "primary replica" being executed on a "primary processor," interacts with the environment.
  • the replica supervisors also ensure that a "backup replica" receives the same inputs, and thus undergoes the same state transitions, as the primary replica.
  • the replica supervisor being executed by the primary processor is known as a "primary supervisor," while the other replica supervisor is a "backup supervisor."
  • Application state includes attributes of a replica that are saved and restored during a process context switch. These attributes include, but are not limited to, the contents of registers and memory and operating system-maintained quantities, e.g. a scheduling wait-state resulting from an I/O request.
  • Events that can change (produce a "transformation" of) a replica's application state include the replica executing instructions, calling the operating system, and receiving notification of an asynchronous event from the operating system.
  • These asynchronous events include exceptions resulting from executing an instruction, e.g. dividing by zero, and completion of operations previously requested by the replica, e.g. disk I/O completion.
  • Corresponding events in different replicas can produce different transformations of the respective replica's application state.
  • the replica supervisors intercept events that can cause different transformations of the respective replica's application state.
  • the replica supervisors provide interfaces to the replicas that are the same as the interface provided by the operating system.
  • when the primary supervisor intercepts a call by the primary replica, the primary supervisor makes the call to the operating system on behalf of the replica and then delivers the results of the operating system call (the "values returned by the operating system") to the primary replica.
  • the primary supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system would have caused if the call had not been intercepted.
  • the primary supervisor also sends a message to the backup replica.
  • the message contains the values returned by the operating system.
  • when the backup supervisor intercepts a call to the operating system by the backup replica, the backup supervisor does not call the operating system on behalf of the backup replica. Instead, the backup supervisor uses the values sent by the primary supervisor, as a result of the corresponding call by the primary replica, to transform the state of the backup replica.
  • the replica supervisors ensure that the primary and backup replicas undergo equivalent transformations of their application state as a result of corresponding calls to the operating system.
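The call-interception scheme described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: the class names, the in-memory `deque` standing in for the message channel, and the `fake_os_time` function are all assumptions, and message tagging and acknowledgment are omitted.

```python
from collections import deque
import itertools

class PrimarySupervisor:
    def __init__(self, os_call, channel):
        self.os_call = os_call      # the real operating system entry point
        self.channel = channel      # messages to the backup supervisor

    def intercept(self, *args):
        result = self.os_call(*args)   # make the call on the replica's behalf
        self.channel.append(result)    # forward the returned values
        return result                  # the primary replica sees the real result

class BackupSupervisor:
    def __init__(self, channel):
        self.channel = channel

    def intercept(self, *args):
        # No operating system call is made; reuse the primary's values.
        return self.channel.popleft()

# A hypothetical nondeterministic call: each invocation returns a new value.
_ticks = itertools.count(1000)
def fake_os_time():
    return next(_ticks)

channel = deque()
primary = PrimarySupervisor(fake_os_time, channel)
backup = BackupSupervisor(channel)
primary_seen = [primary.intercept() for _ in range(3)]
backup_seen = [backup.intercept() for _ in range(3)]
```

Both replicas observe the same sequence of returned values even though only the primary side ever invoked the underlying call.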
  • the invention uses a technique, commonly known as "object-code editing," to enable the replica supervisors to intercept calls to the operating system.
  • An object-code editor examines an application's executable file, locates calls to the operating system and then replaces those calls with one or more instructions (an "instruction sequence") that, when executed by a processor, cause the processor to transfer control to the replica supervisor.
  • the replica supervisor receives control when the replica calls the operating system.
  • the invention can replace only calls to the operating system that can affect the environment or that can cause different transformations of the application state in different replicas. This selective replacement provides improved performance over the prior art because the replica supervisor is invoked only when necessary.
  • a processor can execute a mixture of replicated and non-replicated applications, so a user can choose which applications to replicate and avoid overhead associated with replicating non-critical applications.
  • the invention replaces or modifies an existing component that provides an interface between applications and the operating system.
  • applications call the Windows NT operating system through an interface provided by a dynamic-linked library (DLL), specifically Win32.
  • the DLL transfers control to the operating system.
  • the invention replaces the DLL with one that transfers control to the replica supervisor instead.
  • Other operating systems, such as the VMS operating system, have a well-known, invariant location (a "system service vector") for each possible call to the operating system.
  • Applications call these locations, which contain instructions that transfer to an appropriate routine within the operating system. In such a case, the invention modifies these locations to transfer to the replica supervisor instead.
  • the replica supervisors "virtualize" certain operating system data, such as a processor's network address, that might otherwise be returned differently to different replicas.
  • the replica supervisors return virtualized data to the replicas and to the environment. Once returned by the operating system on the primary processor, some data, such as time-of-day, can be used to satisfy the corresponding call by the backup replica despite the fact that clocks on different processors are never perfectly synchronized.
  • the clock on the backup ("newly-promoted primary") processor might require adjustment to make it consistent with values that were returned to the newly-promoted primary replica before it was promoted.
  • the newly-promoted primary supervisor can increase or decrease, as appropriate, all times passed between the newly-promoted primary replica and the newly-promoted primary processor to account for the difference between the two clocks.
  • replica supervisors fabricate a network address and supply it to both the primary and backup replicas.
  • the replica supervisors perform any necessary translations between the fabricated and actual network addresses when they handle calls to the operating system and values returned to the replicas.
  • the newly-promoted primary replica continues to use the same fabricated network address.
  • the replica supervisors thus create a "virtual process environment" that is independent of the primary processor and that remains invariant, even after a fail-over.
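The virtualization of clock values and network addresses described above might look like the following sketch. The class `VirtualEnvironment`, its method names, and the example addresses are hypothetical; the patent does not specify how the offset or the translation is stored.

```python
class VirtualEnvironment:
    """Keeps time and network address stable for a replica across fail-over."""

    def __init__(self, local_clock, real_address, virtual_address):
        self.local_clock = local_clock          # callable returning local time
        self.real_address = real_address        # this processor's actual address
        self.virtual_address = virtual_address  # fabricated, shared by all replicas
        self.clock_offset = 0.0

    def promote(self, last_time_seen_by_replica):
        # Align the local clock with values the replica has already observed.
        self.clock_offset = last_time_seen_by_replica - self.local_clock()

    def time_for_replica(self):
        return self.local_clock() + self.clock_offset

    def time_for_os(self, replica_time):
        # Times passed from the replica to the processor are shifted back.
        return replica_time - self.clock_offset

    def address_for_replica(self, address):
        return self.virtual_address if address == self.real_address else address

    def address_for_os(self, address):
        return self.real_address if address == self.virtual_address else address

# Hypothetical values: the backup's clock reads 95.0 while the replica last saw 100.0.
env = VirtualEnvironment(lambda: 95.0, "10.0.0.2", "10.0.0.100")
env.promote(last_time_seen_by_replica=100.0)
```

The same offset is applied in both directions, so absolute times the replica hands to the operating system (for example a wake-up deadline) are expressed in the local clock's terms.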
  • the replica supervisors also ensure that, as a result of asynchronous events, the same changes in application state occur at the same relative points in the execution of each replica.
  • the replica supervisors also virtualize operating system data that might otherwise be delivered differently to different replicas.
  • the primary supervisor intercepts asynchronous events, and their attendant state-transforming values, that the operating system attempts to deliver to the primary replica.
  • the primary supervisor subsequently delivers the results of the asynchronous event (the "values provided by the operating system") to the primary replica.
  • the primary supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system would have caused if the asynchronous event had not been intercepted.
  • the primary supervisor also sends a message to the backup replica. The message contains the values provided by the operating system.
  • the backup supervisor intercepts and discards asynchronous events the operating system attempts to deliver to the backup replica. Instead, the backup supervisor uses the values sent by the primary supervisor, as a result of the corresponding asynchronous event delivered to the primary replica, to transform the state of the backup replica.
  • the replica supervisors ensure that each asynchronous event is delivered to the replicas at the same application-relative time, i.e. at the same point in the dynamic instruction stream of each replica.
  • the invention uses object-code editing to locate loops in the application and to insert into each loop an instruction sequence that counts a number of times the processor executes each loop ("loop cycles") and transfers control to the replica supervisor every time the processor executes the loop a controllable number of times. (Locating instructions and inserting instructions in an application is commonly known as "instrumenting" the application.)
  • the replica supervisor receives control periodically throughout the execution of the replica. The supervisor counts the number of times each loop executes and this count forms at least part of the application-relative time.
  • the primary supervisor notes the application-relative time at which it delivers each asynchronous event to the primary replica and it sends this time to the backup replica along with the state-transforming values associated with the asynchronous event.
  • the backup supervisor waits until the backup replica reaches the same application-relative time before delivering the asynchronous event to the backup replica. This ensures that the backup replica has executed the same instructions as the primary replica before delivery of the asynchronous event. Of course, if the backup replica has already reached the same application-relative time, no wait is necessary.
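A minimal sketch of backup-side delivery at a matching application-relative time follows. It assumes, for illustration only, that application-relative time is a single loop-cycle counter and that the backup never runs past the time of an event it has not yet received; `BackupEventDelivery` and its methods are hypothetical names.

```python
class BackupEventDelivery:
    def __init__(self):
        self.app_time = 0      # loop cycles executed so far by the backup replica
        self.pending = []      # (event_time, values) from the primary, in FIFO order
        self.delivered = []    # events handed to the backup replica

    def receive(self, event_time, values):
        self.pending.append((event_time, values))
        self._deliver_due()    # no wait if the backup is already at that time

    def on_loop_cycle(self):
        # Called by the instruction sequence inserted into each loop.
        self.app_time += 1
        self._deliver_due()

    def _deliver_due(self):
        while self.pending and self.pending[0][0] == self.app_time:
            event_time, values = self.pending.pop(0)
            self.delivered.append((event_time, values))

b = BackupEventDelivery()
b.receive(3, "disk read complete")   # primary delivered this at time 3
b.on_loop_cycle()
b.on_loop_cycle()
delivered_early = list(b.delivered)  # still empty: backup is only at time 2
b.on_loop_cycle()                    # backup reaches time 3, event is delivered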
  • a “thread” is an execution of a sequence of instructions by a processor. Each thread is scheduled separately by the operating system.
  • a “process” includes an address space, an identity by which the process is known by other processes, and other operating system-specific quantities, such as privileges, e.g. exclusive access to an I/O device.
  • One process can comprise multiple threads. Since the threads of a process share one address space, they must use a synchronization technique, e.g. a critical section, mutual exclusion semaphore or test-and-branch instruction, to coordinate access to the shared address space. These techniques enable at most one thread at a time to acquire and then release access to a shared resource, e.g.
  • the invention can optionally be extended to provide fault-tolerance to threaded applications.
  • the replica supervisors intercept acquisitions and releases of the shared resources by the threads and ensure that the sequence of acquisitions and releases is the same on the primary and backup replicas.
  • the replica supervisors maintain an application-relative time for each thread, i.e. each replica supervisor maintains the count of loop cycles in thread-specific storage.
  • the primary supervisor intercepts acquisitions and releases and sends to the backup supervisor an identity of the thread, an identity of the shared resource and the thread-specific application-relative time of the acquisition or release.
  • the backup supervisor subsequently allows the threads on the backup replica to acquire and release the corresponding shared resources on the backup replica in the same order as they were acquired and released on the primary replica.
  • the invention uses object-code editing to locate acquisitions and releases of the shared resources.
  • the invention can intercept only acquisitions and releases that can affect the environment or that can cause different transformations of the application state in different replicas.
  • fault-tolerant support for threads can be provided by the same mechanism that provides fault-tolerant support for calls to the operating system.
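The ordering constraint for threaded replicas can be sketched as a replay of the primary's acquisition log. This is an assumption-laden simplification: the class `AcquisitionReplayer` is hypothetical, releases and the thread-specific application-relative time carried in each message are omitted, and a real supervisor would block the thread rather than return `False`.

```python
class AcquisitionReplayer:
    """Lets backup threads acquire resources only in the primary's order."""

    def __init__(self, primary_log):
        # Each entry is (thread_id, resource_id), in the order the primary saw them.
        self.log = list(primary_log)
        self.next_index = 0

    def may_acquire(self, thread_id, resource_id):
        if self.next_index >= len(self.log):
            return False   # nothing known yet: wait for the primary supervisor
        if self.log[self.next_index] == (thread_id, resource_id):
            self.next_index += 1
            return True
        return False       # another acquisition must happen first

replayer = AcquisitionReplayer([("T1", "lock"), ("T2", "lock"), ("T1", "lock")])
attempts = [
    replayer.may_acquire("T2", "lock"),  # too early: T1 went first on the primary
    replayer.may_acquire("T1", "lock"),
    replayer.may_acquire("T2", "lock"),
    replayer.may_acquire("T1", "lock"),
]
```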
  • Fig. 1 is a block diagram of a computer system showing its layers
  • Fig. 2 is a more detailed block diagram of a portion of the computer system of Fig. 1
  • Fig. 3 is a block diagram of a fault-tolerant computer system, having a primary and a backup processor, according to the present invention
  • Fig. 4 is a flow chart illustrating steps taken by the primary processor of Fig. 3 for processing calls to an operating system
  • Fig. 5 is a flow chart illustrating steps taken by the backup processor of Fig. 3 for processing calls to the operating system
  • Fig. 6 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 for processing asynchronous events
  • Fig. 7 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 at the end of an epoch (a unit of application-relative time)
  • Fig. 8 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 after a failure of the primary processor.
  • Fig. 1 shows a computer system 100 and its constituent layers, namely an application program 102, application support routines 104, operating system software 106, computer hardware 108, and I/O components 110.
  • the invention interposes a replica supervisor (not shown) between the operating system software 106 and the application program 102/application support routines 104.
  • an application 202 comprises an application program 102 and application support routines 104.
  • the application program 102 makes operating system calls 204, either directly or through the application support routines 104, to the operating system 106.
  • the operating system 106 interacts with the environment, including I/O devices, and returns results and asynchronous system events 206 to the application 202.
  • Fig. 3 is a block diagram of a fault-tolerant computer system 300 according to the invention in which a primary processor 302 executes a primary replica 304 and a backup processor 306 executes a backup replica 308.
  • a network and/or processor interconnect bus 309 interconnects the primary and backup processors 302 and 306 with common I/O components 310.
  • a replica supervisor 312 is interposed between the primary replica 304 and the operating system 106 on the primary processor 302.
  • a replica supervisor 312 is interposed between the backup replica 308 and the operating system 106 on the backup processor 306.
  • the replica supervisors 312 intercept calls to the operating system made by the primary and backup replicas 304 and 308 and asynchronous events delivered by the operating system 106 to the replicas.
  • the replica supervisors 312 communicate with each other over a replica coordination communication means 314 to coordinate the primary and backup replicas 304 and 308.
  • the replica coordination communication 314 can be carried over the network and/or processor interconnect bus 309, over a separate bus or network, or over any other message passing mechanism.
  • the replica coordination communication 314 between the replica supervisors 312 provides first-in-first-out (FIFO) message delivery, though not necessarily reliable message delivery.
  • the invention defines a protocol between the replica supervisors 312. This protocol is manifested in messages passed between the replica supervisors 312 over the replica coordination communication means 314.
  • Application-relative time advances for each replica 304 and 308 as each processor 302 and 306, respectively, executes the replica.
  • an object-code editor locates back-branch instructions in a file that stores the application program's executable module.
  • Back-branch instructions include conditional and unconditional branch instructions that, when executed by a processor, cause the processor to transfer control to an instruction whose address is lower than the address of the branch instruction, i.e. back-branch instructions implement loops.
  • a replica's application-relative time includes an offset "i" that is incremented each time the processor executes one cycle of one of the loops.
  • each replica 304 and 308 is partitioned into epochs "e" of application-relative time, where each epoch comprises one or more increments of the replica's offset i.
  • Corresponding epochs in the primary and backup replicas 304 and 308 comprise equivalent instruction sequences.
  • Each replica's offset i is reset to zero at the beginning of each epoch.
  • its application-relative time comprises an epoch and an offset from the beginning of the epoch, hereinafter represented as <e,i>.
  • the object-code editor inserts, prior to each back-branch instruction, an instruction sequence that counts a number of times the loop executes, i.e. the offset i.
  • the instruction sequence transfers control to the replica supervisor 312 whenever the loop executes a maximum number of times.
  • an epoch is defined as the maximum number of times a loop can execute.
  • the replica supervisor 312 receives control periodically throughout the execution of the replica, i.e. at the end of each epoch.
  • the replica supervisor can control the frequency with which it receives control, i.e. the epoch length, by changing the maximum number of times the loop can execute.
  • the epoch length can be increased to reduce the frequency with which the replica supervisor 312 is invoked and, therefore, reduce overhead imposed by the replica supervisor.
  • decreasing the epoch length decreases response time of the replica supervisor 312 in delivering asynchronous events to the replicas 304 and 308.
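The epoch and offset bookkeeping described above can be sketched as a counter that is bumped at every back-branch. The class `EpochCounter` and the callback `on_epoch_end` are hypothetical names; in the patent this logic is an instruction sequence inserted by the object-code editor, not a Python object.

```python
class EpochCounter:
    """Tracks application-relative time <e,i> for one replica."""

    def __init__(self, epoch_length, on_epoch_end):
        self.epoch_length = epoch_length   # maximum loop cycles per epoch
        self.on_epoch_end = on_epoch_end   # stands in for the replica supervisor
        self.e = 0                         # epoch number
        self.i = 0                         # offset within the epoch

    def back_branch(self):
        # Executed once per loop cycle, just before the back-branch.
        self.i += 1
        if self.i >= self.epoch_length:
            self.on_epoch_end(self.e, self.i)  # transfer control to the supervisor
            self.e += 1                        # begin a new epoch
            self.i = 0

    def now(self):
        return (self.e, self.i)

epoch_ends = []
counter = EpochCounter(epoch_length=4,
                       on_epoch_end=lambda e, i: epoch_ends.append((e, i)))
for _ in range(10):        # simulate ten loop cycles
    counter.back_branch()
```

Raising `epoch_length` means the callback fires less often (lower overhead); lowering it makes the supervisor run more often, which shortens the wait before buffered asynchronous events can be delivered.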
  • Fig. 4 is a flowchart illustrating steps taken by the primary and backup processors 302 and 306 for processing calls made by the replicas 304 and 308 to the operating system 106.
  • the primary replica 304 calls the operating system at application-relative time <e,i>.
  • the replica supervisor 312 on the primary processor 302 intercepts the call to the operating system.
  • the replica supervisor 312 ascertains whether the call can affect the environment, e.g. a call to one of the common I/O components 310. If the call can affect the environment, at step 408 the replica supervisor 312 waits for acknowledgment of all messages previously sent to the replica supervisor 312 on the backup processor 306 over the replica coordination communication means 314.
  • the replica supervisor 312 adds this call to its list of outstanding calls to the operating system that can affect the environment.
  • the replica supervisor 312 calls the operating system 106 on behalf of the primary replica 304. If necessary, the replica supervisor 312 virtualizes operating system data. A delay might occur at step 414 while the operating system 106 completes an operation, e.g. an I/O operation.
  • the operating system returns values to the replica supervisor 312 as a result of the call to the operating system.
  • the replica supervisor 312 returns the values to the primary replica 304, i.e. the supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system 106 would have caused if the call had not been intercepted.
  • the replica supervisor 312 sends a message over the replica coordination communication means 314 to the replica supervisor on the backup processor 306.
  • the message contains the application-relative time, i.e. <e,i>, and the state-transforming values returned by the operating system 106 as a result of the call.
  • the replica supervisor 312 removes this call from the list of outstanding calls that can affect the environment.
  • the messages for multiple calls to the operating system can be sent together as a single message.
  • the replica supervisor 312 on the backup processor 306 receives the message sent by the replica supervisor on the primary processor 302 at step 420.
  • the replica supervisor 312 on the backup processor 306 buffers the message and acknowledges receipt of it.
  • the replica supervisor 312 on the backup processor 306 removes a corresponding call from its list of outstanding calls that can affect the environment.
  • Fig. 5 is a flowchart illustrating steps taken by the replica supervisor 312 on the backup processor 306 for processing calls to the operating system 106 by the backup replica 308.
  • the backup replica 308 calls the operating system at application-relative time <e,i>.
  • the replica supervisor 312 on the backup processor 306 intercepts the call.
  • the replica supervisor 312 ascertains whether the call can affect the environment. If the call can affect the environment, at step 508 the replica supervisor 312 on the backup processor 306 adds the call to its list of outstanding calls that can affect the environment.
  • the replica supervisor 312 on the backup processor 306 waits for the message that the replica supervisor 312 on the primary processor 302 sends at step 420 (Fig. 4). In other words, the replica supervisor 312 on the backup processor 306 waits for a message that corresponds to the call to the operating system made by the backup replica 308 at step 502.
  • the replica supervisor 312 identifies the message of the corresponding call to the operating system by matching the application-relative time <e,i> at which the backup replica 308 called the operating system and the application-relative time at which the primary replica 304 called the operating system. If the message is received, at step 512 the replica supervisor 312 on the backup processor 306 returns to the backup replica 308 the state-
  • Fig. 6 is a flowchart showing steps taken by the replica supervisors 312 on the primary and backup processors 302 and 306 for processing asynchronous events.
  • the operating system 106 attempts to deliver an asynchronous event at application-relative time <e,i> to the primary replica 304.
  • the replica supervisor 312 intercepts the asynchronous event and at step 606 the replica supervisor buffers the event and the state-transforming values provided by the operating system. If necessary, the replica supervisor 312 virtualizes operating system data.
  • the replica supervisor 312 sends a message over the replica coordination communication means 314 to the replica supervisor on the backup processor 306.
  • the message contains the application-relative time of the asynchronous event and the state-transforming values provided by the operating system.
  • the replica supervisor 312 on the backup processor 306 receives the message, buffers it and acknowledges receipt of it.
  • the messages for multiple asynchronous events can be sent together as a single message.
  • the operating system 106 on the backup processor 306 attempts to deliver an asynchronous event to the backup replica 308.
  • the replica supervisor 312 on the backup processor 306 intercepts the asynchronous event and at step 616 the replica supervisor ignores the asynchronous event.
  • Fig. 7 is a flowchart showing steps taken by the replica supervisors 312 on the primary and backup processors 302 and 306 at the end of each epoch.
  • the primary replica 304 reaches the end of an epoch at application-relative time <e,i>.
  • the replica supervisor 312 sends an end-of-epoch message over the replica coordination communication means 314 to the replica supervisor 312 on the backup processor 306.
  • the replica supervisor 312 on the backup processor 306 receives the message, buffers it, and acknowledges receipt of the message.
  • the replica supervisor 312 on the primary processor 302 delivers all asynchronous events buffered during the epoch.
  • the replica supervisor 312 causes a transformation in the state of the primary replica 304 that is equivalent to the transformation the operating system would have caused if the asynchronous event had not been intercepted.
  • the replica supervisor 312 increments e and sets i to zero to begin a new epoch.
  • the primary replica 304 begins the new epoch.
  • the backup replica 308 reaches the end of an epoch at time <e,i>.
  • the replica supervisor 312 waits for the end-of-epoch message that the replica supervisor 312 on the primary processor 302 sends at step 704.
  • the replica supervisor 312 on the backup processor 306 delivers all asynchronous events buffered during the epoch.
  • the replica supervisor increments e and sets i to zero and at step 722 the backup replica 308 begins a new epoch.
  • if the replica supervisor 312 on the backup processor 306 fails to receive the end-of-epoch message sent in step 704 or receives notification that the primary processor 302 failed, at step 724 the replica supervisor 312 transfers to linkage point 802, at which the backup processor is promoted to the primary processor.
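The backup's decision at the end of an epoch can be summarized in a short sketch. The function `backup_end_of_epoch`, its parameters, and the string return values are hypothetical; detecting that the end-of-epoch message was not received would in practice involve a timeout or failure notification, which is reduced here to two booleans.

```python
def backup_end_of_epoch(end_of_epoch_received, primary_failed,
                        buffered_events, deliver):
    """Decide what the backup supervisor does when its replica ends an epoch."""
    if primary_failed or not end_of_epoch_received:
        return "promote"            # transfer to the fail-over path
    for event in buffered_events:   # events the primary reported for this epoch
        deliver(event)
    buffered_events.clear()
    return "next_epoch"             # increment e, reset i, resume the replica

delivered = []
buffer = ["timer expired", "disk read complete"]
normal = backup_end_of_epoch(True, False, buffer, delivered.append)
failed = backup_end_of_epoch(False, False, ["late event"], delivered.append)
```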
  • Fig. 8 is a flowchart showing steps taken by the backup processor 306 after a failure of the primary processor 302.
  • the replica supervisor 312 increments e and sets i to zero and at step 806 the backup replica 308 begins a new epoch.
  • After a failure, it is impossible for the replica supervisor 312 on the backup processor 306 to ascertain exactly which requests that affect the environment, e.g., I/O requests to the common I/O components 310, had been successfully completed by the primary processor 302. As previously described, operating system calls that can affect the environment are added to a list of such calls at step 508 (Fig. 5). Such requests completed on the primary processor 302 cause the replica supervisor 312 on the primary processor 302 to send a message at step 420 (Fig. 4) to the replica supervisor 312 on the backup processor 306. However, the replica coordination communication means 314 is not necessarily reliable and the message might not be received. Consequently, the replica supervisor 312 on the backup processor 306 might not remove the corresponding call from its list at step 426.
  • each outstanding call on the list falls into one of three categories. Calls in the first category are "idempotent", i.e., each call contains all information necessary for performing the operation and they can be repeated with impunity. For example, in a disk I/O request the address of the data on the disk, the address of the data in memory, the size and the transfer direction (read/write) are all included in the call. Performing such an I/O operation a second time does not introduce any errors. After a fail-over, for these types of calls, the replica supervisor 312 on the backup processor 306 simply repeats the operation at step 808 (Fig. 8).
  • the second category of outstanding calls is non-idempotent, but the designer of the fault-tolerant system also controls the design of the common I/O component.
  • sequence numbers or another mechanism can be used by the replica supervisor to enable the I/O device to ignore duplicate requests.
  • the replica supervisor 312 on the backup processor 306 simply retries the operation at step 808.
  • the replica supervisor 312 on the backup processor 306 returns an unsuccessful I/O operation status code to the replica 308.
  • the replica 308 should already be written to handle such return codes, i.e., the replica handles the event as a transient I/O device error.
  • the fault-tolerant computer system 300 can optionally restart the failed processor and synchronize the application state of its replica with that of the surviving replica.
  • the replica supervisor on the surviving processor sends messages to the replica supervisor on the restarted processor, the messages containing the application state of the surviving replica, possibly including the contents of all or part of the replica's address space, registers, etc.
  • the replica supervisor on the restarted processor sets the application state, including the contents of the address space, registers, etc., of the restarted replica.
  • Other embodiments and modifications are possible.
  • a source-code editor can be used instead of an object-code editor.
  • the object code can be edited at load time by a loader.
  • load-time editing provides fault-tolerance to applications that dynamically load executable modules.
  • an application might dynamically link in a DLL or might use object linking and embedding (OLE) to cause the processor to execute another module.
  • Load-time editing provides fault tolerance for such dynamically linked modules.
  • Replica management can also be used with non-fault-tolerant applications. For example, replicas of a database can be stored in multiple locations on a network. A replica supervisor can then intercept I/O requests to the database from client applications (or, in a three-tiered architecture, from middle-tier applications) and direct the requests to the nearest or fastest-responding replica of the database.
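The three categories of outstanding calls described above for fail-over can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Category`, `OutstandingCall`, `resolve_after_failover` and the example call names are hypothetical, and the retry and error paths are reduced to callbacks.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Category(Enum):
    IDEMPOTENT = auto()     # carries all needed information; safe to repeat
    DEDUPLICATED = auto()   # device ignores duplicates, e.g. by sequence number
    OTHER = auto()          # cannot be safely repeated


@dataclass
class OutstandingCall:
    name: str               # hypothetical label for the pending request
    category: Category
    sequence_number: int = 0


def resolve_after_failover(calls, retry, report_error):
    """Apply the per-category policy to each call still on the list."""
    for call in calls:
        if call.category in (Category.IDEMPOTENT, Category.DEDUPLICATED):
            retry(call)          # repeat the operation (step 808)
        else:
            report_error(call)   # replica sees a transient I/O error status


retried, failed = [], []
resolve_after_failover(
    [
        OutstandingCall("disk_write", Category.IDEMPOTENT),
        OutstandingCall("printer_job", Category.DEDUPLICATED, sequence_number=7),
        OutstandingCall("tape_append", Category.OTHER),
    ],
    retry=retried.append,
    report_error=failed.append,
)
```

In this sketch the first two calls are retried and the third is reported to the replica as failed, which relies on the replica already handling unsuccessful I/O status codes.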

Abstract

In a fault-tolerant computer system, a primary replica supervisor is interposed between an operating system and a primary replica of an application program being executed by a primary processor. An object-code editor locates calls to the operating system and loops in the application program and inserts instruction sequences that enable the replica supervisor to intercept the calls to the operating system, results returned by the operating system as a result of the calls and asynchronous events delivered by the operating system to the replica. A backup replica supervisor is similarly interposed between an operating system and a backup replica of the application program being executed by a backup processor. The primary replica interacts with an environment. The replica supervisors ensure that the backup replica undergoes state transformations, as a result of the calls to the operating system and asynchronous events, that are equivalent to state transformations that the primary replica undergoes as a result of corresponding calls and asynchronous events. Thus, after a failure in the primary processor, the backup replica can interact with the environment in a manner consistent with interactions between the primary replica and the environment prior to the failure.

Description

TRANSPARENT FAULT TOLERANT COMPUTER SYSTEM
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to fault-tolerant computer systems and, in particular, to replica coordinators for such fault-tolerant systems.
2. Description of the Related Art
A need exists for "fault-tolerant" computer systems that can continue to operate despite a failure of one of their components. Examples include interbank electronic funds-transfer systems and airline flight reservation systems. To achieve fault-tolerance, these systems typically employ redundancy in the components that are likely to fail, i.e., they employ "replicas" and replace a failed component with a non-faulty replica ("spare"). These systems coordinate the replicas so all replicas are in the same state and, therefore, the spare is able to take over for the failed component. Coordinating these replicas poses a key problem in the design of a fault-tolerant system.
One approach to replica coordination, known as the "state machine approach," calls for each replica to be a deterministic state machine that reads a sequence of commands, each command causing a state transition that is completely determined by the command and the current state of the machine. The state transitions can produce outputs to an environment, e.g. I/O requests. Each replica of the state machine starts in the same state and reads an identical sequence of commands. Each replica, therefore, undergoes an identical sequence of state transitions and produces an identical sequence of outputs. This approach ensures that, when a failure occurs, the spare is in the same state as the failed component at the time of, or just prior to, the failure. The spare can therefore interact with the environment in a manner consistent with past interactions between the now-failed component and the environment.
Fault-tolerant systems mask replica failures by combining the output sequences from multiple replicas into a single output sequence that appears to have come from a single, non-faulty state machine. According to this approach, each replica produces a sequence of outputs, but a replica coordination mechanism allows only one replica's sequence of outputs to reach the environment. For example, accepting outputs from 2t+1 replicas, a "majority voter" can selectively provide the outputs to the environment and thereby mask as many as t faulty replicas. In a "primary/backup" method of replica coordination, all non-faulty replicas perform the same computations, but only one replica (the primary) interacts with the environment. In either case the fault-tolerant system must ensure that each replica reads the same sequence of commands and the environment receives only a single output.
A computer system is composed of layers, as shown in Fig. 1. An application program makes calls to application support routines (DLLs, RTLs, etc.), which in turn call operating system software. Optionally, the application software can bypass the support routines and call the operating system directly. Lower layers include computer hardware (CPU, memory, bus, network, etc.), and I/O components (disks, user terminals, etc.).
Prior-art systems provide replica coordination by adding a layer or significantly modifying one or more of the existing layers. Each such approach poses problems, however. For example, in hardware-layer replica coordination, the CPU, memory, operating system and application are replicated and the hardware chooses which replica interacts with the environment. This requires no changes to the operating system or application software, but, quite problematically, each new hardware realization requires a separate design and consequently these systems lag behind the hardware cost/performance curve.
Adding replica coordination to an existing operating system is difficult because a developer must identify state transitions implemented in the operating system. This is difficult because mature operating systems are very complex. Furthermore, modifying an operating system and then maintaining the modified operating system is very costly. Some prior-art systems provide an additional application support layer between the application and the operating system. Problematically, this requires application developers to learn and use a new interface. Furthermore existing applications must be rewritten to be made fault-tolerant. Fault-tolerance can also be built into an application. However, this shifts the problem of replica coordination to the application's developers and the same problems must be solved anew for each application. Furthermore, application developers generally are not acquainted with the complexity and nuances of replica coordination. Some prior-art systems, such as a so-called "hypervisor" system, impose a heavy performance penalty due to a high frequency with which the replica coordinator must be invoked.
SUMMARY OF THE INVENTION
The invention provides "transparent fault-tolerance" by interposing a new layer (a "replica supervisor") between the application/application support layers (hereinafter the "application") and the operating system layer without altering the operating system interface, as viewed by the application. Transparent fault-tolerance means that application developers need not be aware of the replica supervisor when writing an application and an existing application can be made fault-tolerant without a developer rewriting the application. An additional processor and operating system executes and provides services to, respectively, each replica of the application. Advantageously, interposing the replica supervisor above the operating system enables the system to tolerate hardware faults and faults in the operating system that occur independently on the different processors, such as faults induced by other, possibly unrelated, applications. The operating system can be replaced, e.g. with a newer version, on one processor at a time. If more than two processors are provided, continuous fault-tolerance is provided. The replica supervisors intercept calls to the operating system made by the replicas and asynchronous events delivered by the operating system to the replicas. The replica supervisors communicate with each other to coordinate the multiple replicas and ensure that only one replica, a "primary replica" being executed on a "primary processor," interacts with the environment. The replica supervisors also ensure that a "backup replica" receives the same inputs, and thus undergoes the same state transitions, as the primary replica. The replica supervisor being executed by the primary processor is known as a "primary supervisor," while the other replica supervisor is a "backup supervisor." For simplicity, we describe a system comprising one backup processor, but the invention can easily be extended to multiple backup processors.
We define "application state" to include attributes of a replica that are saved and restored during a process context switch. These attributes include, but are not limited to, the contents of registers and memory and operating system-maintained quantities, e.g. a scheduling wait-state resulting from an I/O request. Events that can change (produce a "transformation" of) a replica's application state include the replica executing instructions, calling the operating system, and receiving notification of an asynchronous event from the operating system. These asynchronous events include exceptions resulting from executing an instruction, e.g. dividing by zero, and completion of operations previously requested by the replica, e.g. disk I/O completion. Corresponding events in different replicas can produce different transformations of the respective replica's application state. For example, calling the operating system to create a scratch file can result in different file names being returned to the respective replicas. The replica supervisors intercept events that can cause different transformations of the respective replica's application state. The replica supervisors provide interfaces to the replicas that are the same as the interface provided by the operating system. Thus, when one of the replicas makes a call to the operating system, the corresponding replica supervisor is invoked and the supervisor ensures that the effect of the intercepted call is the same regardless of whether the primary or backup performs the operation. When the primary supervisor intercepts a call by the primary replica, the primary supervisor makes the call to the operating system on behalf of the replica and then delivers the results of the operating system call (the "values returned by the operating system") to the primary replica.
In other words, the primary supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system would have caused if the call had not been intercepted.
The primary supervisor also sends a message to the backup replica. The message contains the values returned by the operating system. When the backup supervisor intercepts a call to the operating system by the backup replica, the backup supervisor does not call the operating system on behalf of the backup replica. Instead, the backup supervisor uses the values sent by the primary supervisor, as a result of the corresponding call by the primary replica, to transform the state of the backup replica. Thus, the replica supervisors ensure that the primary and backup replicas undergo equivalent transformations of their application state as a result of corresponding calls to the operating system.
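The division of labor described above (the primary supervisor calls the operating system and forwards the returned values; the backup supervisor replays them without calling the operating system) can be sketched as follows. This is only an illustration under stated assumptions: the class names, the `intercept` method, the in-memory `deque` standing in for the message channel, and the `fake_os` function are all hypothetical, and a real backup would wait for a message rather than assume one is already queued.

```python
from collections import deque
from itertools import count


class PrimarySupervisor:
    def __init__(self, os_call, channel):
        self.os_call = os_call    # entry point into the (stand-in) operating system
        self.channel = channel    # messages destined for the backup supervisor

    def intercept(self, name, *args):
        values = self.os_call(name, *args)    # call the OS on the replica's behalf
        self.channel.append((name, values))   # forward the returned values
        return values                         # deliver them to the primary replica


class BackupSupervisor:
    def __init__(self, channel):
        self.channel = channel

    def intercept(self, name, *args):
        # The backup never calls the OS; it replays the primary's values.
        sent_name, values = self.channel.popleft()
        if sent_name != name:
            raise RuntimeError("replicas diverged")
        return values


_serial = count(1)


def fake_os(name, *args):
    """Hypothetical OS call whose result differs on every invocation."""
    return f"/tmp/scratch{next(_serial)}"


channel = deque()
primary_name = PrimarySupervisor(fake_os, channel).intercept("create_scratch_file")
backup_name = BackupSupervisor(channel).intercept("create_scratch_file")
```

Both replicas observe the same scratch-file name even though a second call to `fake_os` would have produced a different one, which is the property the text requires of corresponding calls.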
Preferably, the invention uses a technique, commonly known as "object-code editing," to enable the replica supervisors to intercept calls to the operating system. An object-code editor examines an application's executable file, locates calls to the operating system and then replaces those calls with one or more instructions (an "instruction sequence") that, when executed by a processor, cause the processor to transfer control to the replica supervisor. Thus, the replica supervisor receives control when the replica calls the operating system. Advantageously, the invention can replace only calls to the operating system that can affect the environment or that can cause different transformations of the application state in different replicas. This selective replacement provides improved performance over the prior art because the replica supervisor is invoked only when necessary. Furthermore, unlike prior-art systems, a processor can execute a mixture of replicated and non-replicated applications, so a user can choose which applications to replicate and avoid overhead associated with replicating non-critical applications.
Alternatively, the invention replaces or modifies an existing component that provides an interface between applications and the operating system. For example, applications call the Windows NT operating system through an interface provided by a dynamic-linked library (DLL), specifically Win32. The DLL, in turn, transfers control to the operating system. In such a case, the invention replaces the DLL with one that transfers control to the replica supervisor instead. Other operating systems, such as the VMS operating system, have a well-known, invariant location (a "system service vector") for each possible call to the operating system. Applications call these locations, which contain instructions that transfer to an appropriate routine within the operating system. In such a case, the invention modifies these locations to transfer to the replica supervisor instead.
The replica supervisors "virtualize" certain operating system data, such as a processor's network address, that might otherwise be returned differently to different replicas. The replica supervisors return virtualized data to the replicas and to the environment. Once returned by the operating system on the primary processor, some data, such as time-of-day, can be used to satisfy the corresponding call by the backup replica despite the fact that clocks on different processors are never perfectly synchronized. After a failure of the primary processor and a take-over by the backup processor (a "fail-over"), the clock on the backup ("newly-promoted primary") processor might require adjustment to make it consistent with values that were returned to the newly-promoted primary replica before it was promoted. Alternatively, the newly-promoted primary supervisor can increase or decrease, as appropriate, all times passed between the newly-promoted primary replica and the newly-promoted primary processor to account for the difference between the two clocks.
Other operating system data, such as a processor's network address, cannot be changed, so the replica supervisors fabricate a network address and supply it to both the primary and backup replicas. The replica supervisors perform any necessary translations between the fabricated and actual network addresses when they handle calls to the operating system and values returned to the replicas. After a fail-over, the newly-promoted primary replica continues to use the same fabricated network address. The replica supervisors thus create a "virtual process environment" that is independent of the primary processor and that remains invariant, even after a fail-over.
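The address translation described above can be pictured with a small sketch. The class `AddressVirtualizer`, its method names, and the example addresses are hypothetical; the sketch assumes addresses are simple strings and that each supervisor knows its own processor's actual address.

```python
class AddressVirtualizer:
    """Translates between a fabricated address and one processor's real address."""

    def __init__(self, virtual_address, actual_address):
        self.virtual_address = virtual_address
        self.actual_address = actual_address

    def to_replica(self, address):
        # Applied to values the operating system returns to the replica.
        return self.virtual_address if address == self.actual_address else address

    def to_os(self, address):
        # Applied to arguments the replica passes to the operating system.
        return self.actual_address if address == self.virtual_address else address


VIRTUAL = "10.0.0.100"                              # fabricated address (assumed)
primary = AddressVirtualizer(VIRTUAL, "192.0.2.1")  # primary's real address (assumed)
backup = AddressVirtualizer(VIRTUAL, "192.0.2.2")   # backup's real address (assumed)
```

Because both replicas only ever see `VIRTUAL`, the newly-promoted primary replica can keep using it after a fail-over while its supervisor maps it to the surviving processor's real address.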
The replica supervisors also ensure that, as a result of asynchronous events, the same changes in application state occur at the same relative points in the execution of each replica. As with calls to the operating system, the replica supervisors also virtualize operating system data that might otherwise be delivered differently to different replicas. The primary supervisor intercepts asynchronous events, and their attendant state-transforming values, that the operating system attempts to deliver to the primary replica. The primary supervisor subsequently delivers the results of the asynchronous event (the "values provided by the operating system") to the primary replica. In other words, the primary supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system would have caused if the asynchronous event had not been intercepted. The primary supervisor also sends a message to the backup replica. The message contains the values provided by the operating system.
The backup supervisor intercepts and discards asynchronous events the operating system attempts to deliver to the backup replica. Instead, the backup supervisor uses the values sent by the primary supervisor, as a result of the corresponding asynchronous event delivered to the primary replica, to transform the state of the backup replica.
We define an "application-relative time" that advances as the application executes. The replica supervisors ensure that each asynchronous event is delivered to the replicas at the same application-relative time, i.e. at the same point in the dynamic instruction stream of each replica. Preferably, the invention uses object-code editing to locate loops in the application and to insert into each loop an instruction sequence that counts a number of times the processor executes each loop ("loop cycles") and transfers control to the replica supervisor every time the processor executes the loop a controllable number of times. (Locating instructions and inserting instructions in an application is commonly known as "instrumenting" the application.) Thus, the replica supervisor receives control periodically throughout the execution of the replica. The supervisor counts the number of times each loop executes and this count forms at least part of the application-relative time.
The primary supervisor notes the application-relative time at which it delivers each asynchronous event to the primary replica and it sends this time to the backup replica along with the state-transforming values associated with the asynchronous event. The backup supervisor waits until the backup replica reaches the same application-relative time before delivering the asynchronous event to the backup replica. This ensures that the backup replica has executed the same instructions as the primary replica before delivery of the asynchronous event. Of course, if the backup replica has already reached the same application-relative time, no wait is necessary.
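One way to picture the backup supervisor's rule (hold each asynchronous event until the backup replica reaches the application-relative time at which the primary delivered it) is the sketch below. It assumes application-relative time is a plain integer loop count; `BackupEventQueue`, `receive` and `due` are hypothetical names.

```python
import heapq
from itertools import count


class BackupEventQueue:
    def __init__(self):
        self._pending = []      # min-heap ordered by application-relative time
        self._arrival = count() # tie-breaker that preserves arrival order

    def receive(self, when, event):
        """Store an event sent by the primary together with its delivery time."""
        heapq.heappush(self._pending, (when, next(self._arrival), event))

    def due(self, now):
        """Return the events whose delivery time the backup replica has reached."""
        ready = []
        while self._pending and self._pending[0][0] <= now:
            ready.append(heapq.heappop(self._pending)[2])
        return ready


queue = BackupEventQueue()
queue.receive(200, "io_done")
queue.receive(100, "timer")
```

Calling `queue.due(now)` each time the supervisor gains control releases an event only once the replica's loop count has caught up with the time recorded by the primary; if the replica is already past that time, the event is released immediately.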
The description thus far has been limited to "single-threaded" applications. A "thread" is an execution of a sequence of instructions by a processor. Each thread is scheduled separately by the operating system. A "process" includes an address space, an identity by which the process is known by other processes, and other operating system-specific quantities, such as privileges, e.g. exclusive access to an I/O device. One process can comprise multiple threads. Since the threads of a process share one address space, they must use a synchronization technique, e.g. a critical section, mutual exclusion semaphore or test-and-branch instruction, to coordinate access to the shared address space. These techniques enable at most one thread at a time to acquire and then release access to a shared resource, e.g. a location in the shared address space. The invention can optionally be extended to provide fault-tolerance to threaded applications. The replica supervisors intercept acquisitions and releases of the shared resources by the threads and ensure that the sequence of acquisitions and releases is the same on the primary and backup replicas. The replica supervisors maintain an application-relative time for each thread, i.e. each replica supervisor maintains the count of loop cycles in thread-specific storage. Similar to the manner in which the replica supervisors handle calls to the operating system, the primary supervisor intercepts acquisitions and releases and sends to the backup supervisor an identity of the thread, an identity of the shared resource and the thread-specific application-relative time of the acquisition or release. The backup supervisor subsequently allows the threads on the backup replica to acquire and release the corresponding shared resources on the backup replica in the same order as they were acquired and released on the primary replica.
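The ordering rule for threaded applications can be sketched as a per-resource log that the backup consults before letting a thread acquire a shared resource. This is a simplified, single-threaded illustration: `AcquisitionLog` and its methods are hypothetical names, and blocking a real thread until its turn arrives is left out.

```python
from collections import defaultdict, deque


class AcquisitionLog:
    """Order of acquisitions observed on the primary, kept per shared resource."""

    def __init__(self):
        self._order = defaultdict(deque)

    def record(self, thread_id, resource_id, thread_time):
        # Called for each message from the primary supervisor.
        self._order[resource_id].append((thread_id, thread_time))

    def may_acquire(self, thread_id, resource_id, thread_time):
        # A backup thread may proceed only if it is next in the primary's order.
        waiting = self._order[resource_id]
        return bool(waiting) and waiting[0] == (thread_id, thread_time)

    def acquired(self, resource_id):
        # Advance past the acquisition that has just been replayed.
        self._order[resource_id].popleft()


log = AcquisitionLog()
log.record("T1", "lock_a", 10)   # on the primary, T1 acquired lock_a first
log.record("T2", "lock_a", 4)    # then T2 acquired it
```

On the backup, thread T2 would be held back even if it reached the lock first, because the log says T1 acquired the resource first on the primary.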
Preferably, the invention uses object-code editing to locate acquisitions and releases of the shared resources. Optionally, as with calls to the operating system, the invention can intercept only acquisitions and releases that can affect the environment or that can cause different transformations of the application state in different replicas. Furthermore, if the operating system provides acquisition and release functionality through calls to the operating system, fault-tolerant support for threads can be provided by the same mechanism that provides fault-tolerant support for calls to the operating system.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of a computer system showing its layers;
Fig. 2 is a more detailed block diagram of a portion of the computer system of Fig. 1;
Fig. 3 is a block diagram of a fault-tolerant computer system, having a primary and a backup processor, according to the present invention;
Fig. 4 is a flow chart illustrating steps taken by the primary processor of Fig. 3 for processing calls to an operating system;
Fig. 5 is a flow chart illustrating steps taken by the backup processor of Fig. 3 for processing calls to the operating system;
Fig. 6 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 for processing asynchronous events;
Fig. 7 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 at the end of an epoch (a unit of application-relative time); and
Fig. 8 is a flow chart showing steps taken by the fault-tolerant system of Fig. 3 after a failure of the primary processor.
DESCRIPTION OF THE PREFERRED EMBODIMENT
Fig. 1 shows a computer system 100 and its constituent layers, namely an application program 102, application support routines 104, operating system software 106, computer hardware 108, and I/O components 110. The invention interposes a replica supervisor (not shown) between the operating system software 106 and the application program 102/application support routines 104. As illustrated in Fig. 2, an application 202 comprises an application program 102 and application support routines 104. The application program 102 makes operating system calls 204, either directly or through the application support routines 104, to the operating system 106. As a result of the operating system calls 204, the operating system 106 interacts with the environment, including I/O devices, and returns results and asynchronous system events 206 to the application 202.
Fig. 3 is a block diagram of a fault-tolerant computer system 300 according to the invention in which a primary processor 302 executes a primary replica 304 and a backup processor 306 executes a backup replica 308. A network and/or processor interconnect bus 309 interconnects the primary and backup processors 302 and 306 with common I/O components 310. A replica supervisor 312 is interposed between the primary replica 304 and the operating system 106 on the primary processor 302. Similarly, a replica supervisor 312 is interposed between the backup replica 308 and the operating system 106 on the backup processor 306. The replica supervisors 312 intercept calls to the operating system made by the primary and backup replicas 304 and 308 and asynchronous events delivered by the operating system 106 to the replicas. The replica supervisors 312 communicate with each other over a replica coordination communication means 314 to coordinate the primary and backup replicas 304 and 308. The replica coordination communication means 314 can be carried over the network and/or processor interconnect bus 309, over a separate bus or network, or over any other message passing mechanism. The replica coordination communication means 314 between the replica supervisors 312 provides first-in-first-out (FIFO) message delivery, though not necessarily reliable message delivery. The invention defines a protocol between the replica supervisors 312. This protocol is manifested in messages passed between the replica supervisors 312 over the replica coordination communication means 314.
Application-relative time advances for each replica 304 and 308 as each processor 302 and 306, respectively, executes the replica. Preferably, an object-code editor locates back-branch instructions in a file that stores the application program's executable module. Back-branch instructions include conditional and unconditional branch instructions that, when executed by a processor, cause the processor to transfer control to an instruction whose address is lower than the address of the branch instruction, i.e. back-branch instructions implement loops. A replica's application-relative time includes an offset "i" that is incremented each time the processor executes one cycle of one of the loops. Execution of each replica 304 and 308 is partitioned into epochs "e" of application-relative time, where each epoch comprises one or more increments of the replica's offset i. Corresponding epochs in the primary and backup replicas 304 and 308 comprise equivalent instruction sequences. Each replica's offset i is reset to zero at the beginning of each epoch. Thus, for each replica 304 and 308, its application-relative time comprises an epoch and an offset from the beginning of the epoch, hereinafter represented as <e,i>. Preferably, the object-code editor inserts, prior to each back-branch instruction, i.e. into each loop, an instruction sequence that counts a number of times the loop executes, i.e. the offset i. The instruction sequence transfers control to the replica supervisor 312 whenever the loop executes a maximum number of times. We define an epoch as the maximum number of times a loop can execute. Thus, the replica supervisor 312 receives control periodically throughout the execution of the replica, i.e. at the end of each epoch. The replica supervisor can control the frequency with which it receives control, i.e. the epoch length, by changing the maximum number of times the loop can execute. 
Advantageously, the epoch length can be increased to reduce the frequency with which the replica supervisor 312 is invoked and, therefore, reduce overhead imposed by the replica supervisor. Alternatively, decreasing the epoch length decreases response time of the replica supervisor 312 in delivering asynchronous events to the replicas 304 and 308.
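The <e,i> bookkeeping can be sketched in a few lines. In this hypothetical sketch, `EpochCounter.back_branch` stands in for the instruction sequence that the object-code editor inserts before each back-branch, and `on_epoch_end` stands in for the transfer of control to the replica supervisor; neither name comes from the source.

```python
class EpochCounter:
    def __init__(self, epoch_length, on_epoch_end):
        self.epoch_length = epoch_length   # maximum loop cycles per epoch
        self.on_epoch_end = on_epoch_end   # stand-in for entering the supervisor
        self.e = 0                         # current epoch
        self.i = 0                         # offset within the epoch

    def back_branch(self):
        """Count one loop cycle; hand control to the supervisor at the limit."""
        self.i += 1
        if self.i >= self.epoch_length:
            self.on_epoch_end(self.e, self.i)
            self.e += 1                    # begin a new epoch
            self.i = 0                     # offset restarts at zero

    @property
    def time(self):
        return (self.e, self.i)            # application-relative time <e,i>


epoch_ends = []
counter = EpochCounter(4, lambda e, i: epoch_ends.append((e, i)))
for _ in range(10):                        # ten loop cycles of some instrumented loop
    counter.back_branch()
```

With an epoch length of 4, ten loop cycles end two epochs and leave the replica at <2,2>. Raising `epoch_length` makes the supervisor run less often, at the cost of slower delivery of asynchronous events, which is the trade-off the text describes.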
Fig. 4 is a flowchart illustrating steps taken by the primary and backup processors 302 and 306 for processing calls made by the replicas 304 and 308 to the operating system 106. At step 402 the primary replica 304 calls the operating system at application-relative time <e,i>. At step 404 the replica supervisor 312 on the primary processor 302 intercepts the call to the operating system. At step 406 the replica supervisor 312 ascertains whether the call can affect the environment, e.g. a call to one of the common I/O components 310. If the call can affect the environment, at step 408 the replica supervisor 312 waits for acknowledgment of all messages previously sent to the replica supervisor 312 on the backup processor 306 over the replica coordination communication means 314. At step 410 the replica supervisor 312 adds this call to its list of outstanding calls to the operating system that can affect the environment. At step 412 the replica supervisor 312 calls the operating system 106 on behalf of the primary replica 304. If necessary, the replica supervisor 312 virtualizes operating system data. A delay might occur at step 414 while the operating system 106 completes an operation, e.g. an I/O operation. At step 416 the operating system returns values to the replica supervisor 312 as a result of the call to the operating system. At step 418 the replica supervisor 312 returns the values to the primary replica 304, i.e. the supervisor causes a transformation in the state of the primary replica that is equivalent to the transformation the operating system 106 would have caused if the call had not been intercepted. At step 420 the replica supervisor 312 sends a message over the replica coordination communication means 314 to the replica supervisor on the backup processor 306. The message contains the application-relative time, i.e. <e,i>, and the state-transforming values returned by the operating system 106 as a result of the call.
At step 422 the replica supervisor 312 removes this call from the list of outstanding calls that can affect the environment. Optionally, the messages for multiple calls to the operating system can be sent together as a single message.
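The primary-side steps 406 through 422 can be summarized in a short sketch. All names here (`PrimaryCallPath`, `handle`, the three callbacks) are hypothetical, the operating system and message channel are replaced by functions that record a trace, and the final `return` stands for step 418 even though the flowchart places that step before the message is sent at step 420.

```python
class PrimaryCallPath:
    def __init__(self, os_call, send_to_backup, wait_for_acks):
        self.os_call = os_call
        self.send_to_backup = send_to_backup
        self.wait_for_acks = wait_for_acks
        self.outstanding = []                  # calls that can affect the environment

    def handle(self, time, name, affects_environment):
        if affects_environment:                # step 406
            self.wait_for_acks()               # step 408
            self.outstanding.append(name)      # step 410
        values = self.os_call(name)            # steps 412-416
        self.send_to_backup((time, values))    # step 420
        if affects_environment:
            self.outstanding.remove(name)      # step 422
        return values                          # step 418


trace = []


def traced_os_call(name):
    trace.append("os:" + name)
    return "ok"


def traced_send(message):
    trace.append(("send", message))


def traced_wait():
    trace.append("wait_for_acks")


path = PrimaryCallPath(traced_os_call, traced_send, traced_wait)
result = path.handle((3, 17), "disk_write", affects_environment=True)
```

The trace shows the key ordering constraint: the supervisor waits for the backup to acknowledge earlier messages before it lets a call reach the environment.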
At step 424 the replica supervisor 312 on the backup processor 306 receives the message sent by the replica supervisor on the primary processor 302 at step 420. The replica supervisor 312 on the backup processor 306 buffers the message and acknowledges receipt of it. At step 426 the replica supervisor 312 on the backup processor 306 removes a corresponding call from its list of outstanding calls that can affect the environment.
Fig. 5 is a flowchart illustrating steps taken by the replica supervisor 312 on the backup processor 306 for processing calls to the operating system 106 by the backup replica 308. At step 502 the backup replica 308 calls the operating system at application-relative time <e,i>. At step 504 the replica supervisor 312 on the backup processor 306 intercepts the call. At step 506 the replica supervisor 312 ascertains whether the call can affect the environment. If the call can affect the environment, at step 508 the replica supervisor 312 on the backup processor 306 adds the call to its list of outstanding calls that can affect the environment. At step 510 the replica supervisor 312 on the backup processor 306 waits for the message that the replica supervisor 312 on the primary processor 302 sends at step 420 (Fig. 4). In other words, the replica supervisor 312 on the backup processor 306 waits for a message that corresponds to the call to the operating system made by the backup replica 308 at step 502. The replica supervisor 312 identifies the message of the corresponding call to the operating system by matching the application-relative time <e,i> at which the backup replica 308 called the operating system and the application-relative time at which the primary replica 304 called the operating system. If the message is received, at step 512 the replica supervisor 312 on the backup processor 306 returns to the backup replica 308 the state-transforming values sent by the replica supervisor 312 on the primary processor 302 at step 420 (Fig. 4). In other words, the replica supervisor 312 on the backup processor 306 transforms the state of the backup replica 308, the transformation being equivalent to the transformation of the primary replica 304 at step 418 (Fig. 4). On the other hand, if the replica supervisor 312 on the backup processor 306 fails to receive the message or receives notification of a failure of the primary processor 302, at step 514 the replica supervisor 312 transfers to linkage point 802, at which the backup processor is promoted to the primary processor.
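The backup-side handling of Fig. 5 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and method names (`BackupSupervisor`, `receive`, `intercept_call`, `PrimaryFailed`) are hypothetical, the "wait" of step 510 is reduced to a lookup, and a call is listed as outstanding only while no matching message has arrived, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

AppTime = Tuple[int, int]  # application-relative time <e, i>


class PrimaryFailed(Exception):
    """Raised when no matching message is available; a caller would promote the backup."""


@dataclass
class BackupSupervisor:
    # Messages received from the primary supervisor, keyed by application-relative time.
    buffered: Dict[AppTime, object] = field(default_factory=dict)
    # Calls that can affect the environment and are not yet known to be complete.
    outstanding: List[AppTime] = field(default_factory=list)

    def receive(self, when: AppTime, values: object) -> None:
        """Steps 424-426: buffer the message and drop the matching outstanding call."""
        self.buffered[when] = values
        if when in self.outstanding:
            self.outstanding.remove(when)

    def intercept_call(self, when: AppTime, affects_environment: bool) -> object:
        """Steps 504-514: hand back the primary's values instead of calling the OS."""
        if when not in self.buffered:
            if affects_environment:
                self.outstanding.append(when)  # step 508
            raise PrimaryFailed(when)          # step 514 (no real waiting in this sketch)
        return self.buffered.pop(when)         # step 512
```

Matching on the pair `<e, i>` rather than on arrival order is what lets the backup run behind the primary and still pick up the right return values for each call.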
Fig. 6 is a flowchart showing steps taken by the replica supervisors 312 on the primary and backup processors 302 and 306 for processing asynchronous events. At step 602 the operating system 106 attempts to deliver an asynchronous event at application-relative time <e,i> to the primary replica 304. At step 604 the replica supervisor 312 intercepts the asynchronous event and at step 606 the replica supervisor buffers the event and the state-transforming values provided by the operating system. If necessary, the replica supervisor 312 virtualizes operating system data. At step 608 the replica supervisor 312 sends a message over the replica coordination communication means 314 to the replica supervisor on the backup processor 306. The message contains the application-relative time of the asynchronous event and the state-transforming values provided by the operating system. At step 610 the replica supervisor 312 on the backup processor 306 receives the message, buffers it and acknowledges receipt of it. Optionally, the messages for multiple asynchronous events can be sent together as a single message.
At step 612 the operating system 106 on the backup processor 306 attempts to deliver an asynchronous event to the backup replica 308. At step 614, the replica supervisor 312 on the backup processor 306 intercepts the asynchronous event and at step 616 the replica supervisor ignores the asynchronous event.
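As an illustration of the asymmetry in Fig. 6, the following Python sketch shows a primary supervisor that buffers and forwards each asynchronous event, and a backup supervisor that discards events from its own operating system and keeps only those forwarded by the primary. The `Supervisor` class, its fields, and the `send` callable standing in for the replica coordination communication means 314 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

AppTime = Tuple[int, int]          # application-relative time <e, i>
Event = Tuple[AppTime, str]        # (time of the event, state-transforming values)


@dataclass
class Supervisor:
    is_primary: bool
    send: Callable[[Event], None]  # hypothetical stand-in for communication means 314
    buffered: List[Event] = field(default_factory=list)

    def on_os_event(self, when: AppTime, values: str) -> None:
        """Called when the local operating system tries to deliver an event."""
        if not self.is_primary:
            return                      # step 616: the backup ignores local events
        event = (when, values)
        self.buffered.append(event)     # step 606: buffer for end-of-epoch delivery
        self.send(event)                # step 608: forward to the backup supervisor

    def on_message(self, event: Event) -> None:
        """Step 610: the backup buffers an event forwarded by the primary."""
        self.buffered.append(event)
```

Because the backup only ever buffers what the primary forwarded, both replicas end an epoch holding the same events, whatever their local operating systems happened to raise.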
Fig. 7 is a flowchart showing steps taken by the replica supervisors 312 on the primary and backup processors 302 and 306 at the end of each epoch. At step 702 the primary replica 304 reaches the end of an epoch at application-relative time <e,i>. At step 704 the replica supervisor 312 sends an end-of-epoch message over the replica coordination communication means 314 to the replica supervisor 312 on the backup processor 306. At step 706 the replica supervisor 312 on the backup processor 306 receives the message, buffers it, and acknowledges receipt of the message. At step 708 the replica supervisor 312 on the primary processor 302 delivers all asynchronous events buffered during the epoch. In other words, the replica supervisor 312 causes a transformation in the state of the primary replica 304 that is equivalent to the transformation the operating system would have caused if the asynchronous event had not been intercepted. At step 710 the replica supervisor 312 increments e and sets i to zero to begin a new epoch. At step 712 the primary replica 304 begins the new epoch. At step 714 the backup replica 308 reaches the end of an epoch at time <e,i>. At step 716 the replica supervisor 312 waits for the end-of-epoch message that the replica supervisor 312 on the primary processor 302 sends at step 704. If the message is received, at step 718 the replica supervisor 312 on the backup processor 306 delivers all asynchronous events buffered during the epoch. At step 720 the replica supervisor increments e and sets i to zero and at step 722 the backup replica 308 begins a new epoch.
On the other hand, if the replica supervisor 312 on the backup processor 306 fails to receive the end-of-epoch message sent in step 704 or receives notification that the primary processor 302 failed, at step 724 the replica supervisor 312 transfers to linkage point 802, at which the backup processor is promoted to the primary processor.
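The end-of-epoch exchange of Fig. 7 can be sketched as follows. This is a simplified illustration under stated assumptions: `EpochState`, `primary_end_of_epoch`, and `backup_end_of_epoch` are hypothetical names, the communication link is modeled as a plain list, and a missing message is reported immediately rather than after a timeout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EpochState:
    e: int = 0                                        # epoch number
    i: int = 0                                        # position within the epoch
    buffered: List[str] = field(default_factory=list)   # events held during the epoch
    delivered: List[str] = field(default_factory=list)  # events handed to the replica

    def roll_over(self) -> None:
        """Deliver all buffered events, then start a new epoch at <e+1, 0>."""
        self.delivered.extend(self.buffered)
        self.buffered.clear()
        self.e += 1
        self.i = 0


def primary_end_of_epoch(state: EpochState, link: List[Tuple[str, int]]) -> None:
    link.append(("end-of-epoch", state.e))  # step 704
    state.roll_over()                       # steps 708-712


def backup_end_of_epoch(state: EpochState, link: List[Tuple[str, int]]) -> bool:
    """Return True if a new epoch began, False if the backup must be promoted."""
    marker = ("end-of-epoch", state.e)
    if marker not in link:
        return False                        # step 724: transfer to linkage point 802
    link.remove(marker)
    state.roll_over()                       # steps 718-722
    return True
```

Delivering events only at epoch boundaries means both replicas observe each event at the same application-relative time, even though they reach that boundary at different wall-clock times.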
Fig. 8 is a flowchart showing steps taken by the backup processor 306 after a failure of the primary processor 302. At step 804 the replica supervisor 312 increments e and sets i to zero and at step 806 the backup replica 308 begins a new epoch.
After a failure, it is impossible for the replica supervisor 312 on the backup processor 306 to ascertain exactly which requests that affect the environment, e.g., I/O requests to the common I/O components 310, had been successfully completed by the primary processor 302. As previously described, operating system calls that can affect the environment are added to a list of such calls at step 508 (Fig. 5). Such requests completed on the primary processor 302 cause the replica supervisor 312 on the primary processor 302 to send a message at step 420 (Fig. 4) to the replica supervisor 312 on the backup processor 306. However, the replica coordination communication means 314 is not necessarily reliable and the message might not be received. Consequently, the replica supervisor 312 on the backup processor 306 might not remove the corresponding call from its list at step 426. Thus, all calls removed from the list in step 426 have been completed by the primary processor 302, but calls remaining on the list might or might not have been completed by the primary processor.
Each outstanding call on the list falls into one of three categories. Calls in the first category are "idempotent", i.e., each call contains all information necessary for performing the operation and they can be repeated with impunity. For example, in a disk I/O request the address of the data on the disk, the address of the data in memory, the size and the transfer direction (read/write) are all included in the call. Performing such an I/O operation a second time does not introduce any errors. After a fail-over, for these types of calls, the replica supervisor 312 on the backup processor 306 simply repeats the operation at step 808 (Fig. 8).
The second category of outstanding calls is non-idempotent, but the designer of the fault-tolerant system also controls the design of the common I/O component. In this case sequence numbers or another mechanism can be used by the replica supervisor to enable the I/O device to ignore duplicate requests. Here again, the replica supervisor 312 on the backup processor 306 simply retries the operation at step 808.
Otherwise, in the third category of outstanding calls, the replica supervisor 312 on the backup processor 306 returns an unsuccessful I/O operation status code to the replica 308. The replica 308 should already be written to handle such return codes, i.e., the replica handles the event as a transient I/O device error.
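The three-way treatment of outstanding calls after a fail-over can be summarized in a short Python sketch. The `Category` enumeration, the `IO_ERROR` status value, and the `recover_outstanding` function are illustrative assumptions; the description does not prescribe how a call's category is recorded or which status code is returned.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class Category(Enum):
    IDEMPOTENT = auto()     # first category: safe to repeat as-is
    DEDUPLICATED = auto()   # second category: the I/O device ignores duplicates
    OTHER = auto()          # third category: cannot be safely repeated


IO_ERROR = "transient-io-error"  # hypothetical unsuccessful status code


def recover_outstanding(
    calls: List[Tuple[str, Category]],
    retry: Callable[[str], str],
) -> Dict[str, str]:
    """Decide, per outstanding call, whether to retry it or report an I/O error."""
    results: Dict[str, str] = {}
    for call_id, category in calls:
        if category in (Category.IDEMPOTENT, Category.DEDUPLICATED):
            results[call_id] = retry(call_id)   # step 808: repeat the operation
        else:
            results[call_id] = IO_ERROR         # replica treats it as a transient error
    return results
```

For example, a disk read and a sequence-numbered request would both be retried, while a call in the third category would surface to the replica as an ordinary failed I/O operation.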
After a failure, the fault-tolerant computer system 300 can optionally restart the failed processor and synchronize the application state of its replica with that of the surviving replica. The replica supervisor on the surviving processor sends messages to the replica supervisor on the restarted processor, the messages containing the application state of the surviving replica, possibly including the contents of all or part of the replica's address space, registers, etc. The replica supervisor on the restarted processor then sets the application state, including the contents of the address space, registers, etc., of the restarted replica.
Other embodiments and modifications are possible. For example, a source-code editor can be used instead of an object-code editor. Alternatively, the object code can be edited at load time by a loader. Advantageously, load-time editing provides fault-tolerance to applications that dynamically load executable modules. For example, as a result of user commands an application might dynamically link in a DLL or might use object linking and embedding (OLE) to cause the processor to execute another module. Load-time editing provides fault tolerance for such dynamically linked modules.
Replica management can also be used with non-fault-tolerant applications. For example, replicas of a database can be stored in multiple locations on a network. A replica supervisor can then intercept I/O requests to the database from client applications or, in a three-tiered architecture, from middle-tier applications, and direct the requests to the nearest or fastest-responding replica of the database. In such a case, the application need not be written with the knowledge that the database is replicated, yet the application receives a performance benefit by accessing the nearest replica. Furthermore, all access requests from all concurrently executing applications are distributed among all the replicas of the database, so no single database is a bottleneck.
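As a rough illustration of directing a request to the fastest-responding database replica, the following Python sketch selects a replica by its most recently measured latency. The function name `pick_replica`, the replica names, and the use of a latency table are assumptions; the description does not specify how "nearest" or "fastest-responding" is determined.

```python
from typing import Dict


def pick_replica(latency_ms: Dict[str, float]) -> str:
    """Return the name of the replica with the lowest measured latency."""
    if not latency_ms:
        raise ValueError("no database replicas are known")
    return min(latency_ms, key=latency_ms.get)


# Hypothetical measurements gathered by a replica supervisor.
measured = {"db-east": 12.0, "db-west": 48.5, "db-central": 20.3}
chosen = pick_replica(measured)
```

The application issuing the request never sees this choice, which is the sense in which the replication stays transparent.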
It will therefore be seen that we have developed a transparent fault-tolerant computer system, which can be utilized with a variety of computer applications. The terms and expressions employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed.

Claims

1. A fault-tolerant computer system for executing a primary replica of a program (primary replica) and at least one backup replica of the program (backup replica), the fault-tolerant computer system comprising:
    (A) a primary processor, the primary processor:
        (I) executing the primary replica; and
        (II) comprising an operating system and a primary replica supervisor interposed between the operating system and the primary replica; wherein:
            (a) the primary replica has a process state and makes calls to the operating system;
            (b) the operating system, in response to calls, returns values for changing process state of a caller; and
            (c) the primary replica supervisor:
                (i) intercepts the calls to the operating system made by the primary replica;
                (ii) calls, in response to the intercepted calls, the operating system on behalf of the primary replica; and
                (iii) changes the process state of the primary replica in accordance with the values returned by the operating system;
    (B) at least one backup processor, each backup processor:
        (I) executing a different one of the at least one backup replica; and
        (II) comprising an operating system and a backup replica supervisor interposed between the operating system and the backup replica, wherein:
            (a) the backup replica has a process state and makes calls to the operating system and
            (b) the backup replica supervisor intercepts the calls to the operating system made by the backup replica; and
    (C) replica coordination communication means between the primary and backup replica supervisors, wherein after the primary replica supervisor calls the operating system on behalf of the primary replica, the primary replica supervisor sends to each backup replica supervisor the values returned by the operating system and each backup replica supervisor then changes the process state of the respective backup replica in accordance with the values sent by the primary replica supervisor, whereby:
        (I) the primary replica undergoes a series of changes of process state as a result of the calls to the operating system made by the primary replica (primary series of state-changes);
        (II) each at least one backup replica undergoes a series of changes of process state as a result of the calls to the operating system made by the backup replica (backup series of state-changes); and
        (III) if the primary processor fails, the backup series of state-changes is equivalent to the primary series of state-changes as of a time prior to the failure.
2. A fault-tolerant computer system for executing a primary replica of a program (primary replica) on a primary processor and at least one backup replica of the program (backup replica) on a respective backup processor; the primary and backup processors each comprising an operating system; each operating system, in response to calls, returning values that change process state of a caller; the primary and backup replicas each having a process state and making calls to the operating system, the fault-tolerant computer system comprising:
    (A) a primary replica supervisor for:
        (I) being interposed between the operating system and the primary replica;
        (II) intercepting the calls to the operating system made by the primary replica;
        (III) calling, in response to the intercepted calls, the operating system on behalf of the primary replica; and
        (IV) changing the process state of the primary replica in accordance with the values returned by the operating system;
    (B) a backup replica supervisor for:
        (I) being interposed between the operating system and the at least one backup replica; and
        (II) intercepting the calls to the operating system made by the backup replica; and
    (C) replica coordination communication means for:
        (I) facilitating communication between the primary and backup replica supervisors, wherein when the primary replica supervisor calls the operating system on behalf of the primary replica, the primary replica supervisor sends to each backup replica supervisor the values returned by the operating system and each backup replica supervisor then changes the process state of the respective at least one backup replica in accordance with the values sent by the primary replica supervisor, whereby:
            (a) the primary replica undergoes a series of changes of process state as a result of the calls to the operating system made by the primary replica (primary series of state-changes);
            (b) each at least one backup replica undergoes a series of changes of process state as a result of the calls to the operating system made by the backup replica (backup series of state-changes); and
            (c) if the primary processor fails, the backup series of state-changes is equivalent to the primary series of state-changes as of a time prior to the failure.
3. An object-code editor, comprising:
    (A) input means for reading, from a first file, an executable module;
    (B) first locating means for locating, in the read executable module, calls to an operating system;
    (C) first code-insertion means for inserting a first instruction sequence at each located call to the operating system, the first instruction sequence, when executed by a processor, causing the processor to transfer control to a replica supervisor associated with a fault-tolerant computer system; and
    (D) output means for writing, to a second file, the executable module as modified by the first code-insertion means; whereby when the processor executes the executable module, the replica supervisor receives control at least when the executable module calls the operating system.
4. The object-code editor defined in claim 3, further comprising:
    (E) second locating means for locating, in the read executable module, one or more back-branch instructions; and
    (F) second code-insertion means for inserting a second instruction sequence at each located back-branch instruction; the second instruction sequence, when executed by the processor, causing the processor to advance a counter and, if the counter reaches a predetermined loop-count value, to transfer control to the replica supervisor; and wherein
    (D1) the output means writes the executable module as modified by the second code-insertion means; whereby when the processor executes the executable module, the replica supervisor receives control at least after each time the processor executes one of the one or more back-branch instructions a number of times equal to the predetermined loop-count value.
5. A fault-tolerant computer system for executing a primary replica of a program (primary replica) and at least one backup replica of the program (backup replica), the fault-tolerant computer system comprising:
    (A) a primary processor, the primary processor:
        (I) executing the primary replica; and
        (II) comprising an operating system and a primary replica supervisor interposed between the operating system and the primary replica; wherein:
            (a) the primary replica has a process state;
            (b) the operating system delivers asynchronous events that change process state of a process; and
            (c) the primary replica supervisor:
                (i) intercepts the asynchronous events delivered by the operating system to the primary replica; and
                (ii) changes the process state of the primary replica in accordance with the intercepted asynchronous events;
    (B) at least one backup processor, each backup processor:
        (I) executing a different one of the at least one backup replica; and
        (II) comprising an operating system and a backup replica supervisor interposed between the operating system and the backup replica, wherein
            (a) the backup replica has a process state, and
            (b) the backup replica supervisor intercepts the asynchronous events delivered by the operating system to the backup replica, and
    (C) replica coordination communication means between the primary and backup replica supervisors, wherein when the primary replica supervisor intercepts one of the asynchronous events delivered by the operating system to the primary replica, the primary replica supervisor sends to each backup replica supervisor the intercepted asynchronous event and each backup replica supervisor then changes the process state of the respective backup replica in accordance with the intercepted asynchronous event sent by the primary replica supervisor, whereby
        (I) the primary replica undergoes a series of changes of process state as a result of the asynchronous events delivered by the operating system to the primary replica (primary series of state-changes),
        (II) each at least one backup replica undergoes a series of changes of process state as a result of the asynchronous events delivered by the operating system to the primary replica (backup series of state-changes), and
        (III) if the primary processor fails, the backup series of state-changes is equivalent to the primary series of state-changes as of a time prior to the failure.
6. A fault-tolerant computer system for executing a primary replica of a program (primary replica) and at least one backup replica of the program (backup replica), each replica comprising shared resources and at least two threads (threads), the fault-tolerant computer system comprising:
    (A) a primary processor, the primary processor:
        (I) executing the threads of the primary replica; and
        (II) comprising an operating system and a primary replica supervisor interposed between the operating system and the primary replica; wherein:
            (a) each thread of the primary replica performs acquisitions and releases of exclusive access to one or more shared resources of the primary replica (primary acquisitions and releases) and
            (b) the primary replica supervisor intercepts and permits at least some of the primary acquisitions and releases;
    (B) at least one backup processor, each backup processor:
        (I) executing the threads of a different one of the at least one backup replica; and
        (II) comprising an operating system and a backup replica supervisor interposed between the operating system and the backup replica, wherein:
            (a) each thread of the backup replica performs acquisitions and releases of exclusive access to one or more shared resources of the backup replica (backup acquisitions and releases) and
            (b) the backup replica supervisor intercepts at least some of the backup acquisitions and releases; and
    (C) replica coordination communication means between the primary and backup replica supervisors, wherein whenever the primary replica supervisor intercepts and permits one of the at least some of the primary acquisitions and releases, the primary replica supervisor identifies to each backup replica supervisor the thread of the primary replica and the one or more shared resources of the primary replica and each backup replica supervisor then permits the one of the backup acquisition and releases of exclusive access to the one or more shared resources of the backup replica that correspond to the identified one or more shared resources of the primary replica by the thread of the backup replica that corresponds to the identified thread of the primary replica, whereby:
        (I) each thread of the primary replica performs a series of acquisitions and releases (primary series of acquisitions and releases);
        (II) each thread of each of the at least one backup replica performs a series of acquisitions and releases (backup series of acquisitions and releases); and
        (III) if the primary processor fails, the backup series of acquisitions and releases is equivalent to the primary series of acquisitions and releases as of a time prior to the failure.
PCT/US1996/018584 1995-12-01 1996-11-19 Transparent fault tolerant computer system WO1997022930A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU11211/97A AU1121197A (en) 1995-12-01 1996-11-19 Transparent fault tolerant computer system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/565,145 US5802265A (en) 1995-12-01 1995-12-01 Transparent fault tolerant computer system
US08/565,145 1995-12-01

Publications (2)

Publication Number Publication Date
WO1997022930A1 true WO1997022930A1 (en) 1997-06-26
WO1997022930A9 WO1997022930A9 (en) 1997-10-16

Family

ID=24257383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/018584 WO1997022930A1 (en) 1995-12-01 1996-11-19 Transparent fault tolerant computer system

Country Status (3)

Country Link
US (2) US5802265A (en)
AU (1) AU1121197A (en)
WO (1) WO1997022930A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1495414A2 (en) * 2002-03-25 2005-01-12 Eternal Systems, Inc. Transparent consistent active replication of multithreaded application programs

Families Citing this family (81)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5802265A (en) * 1995-12-01 1998-09-01 Stratus Computer, Inc. Transparent fault tolerant computer system
GB9603582D0 (en) 1996-02-20 1996-04-17 Hewlett Packard Co Method of accessing service resource items that are for use in a telecommunications system
KR100195064B1 (en) * 1996-06-20 1999-06-15 유기범 Data network matching device
US6219801B1 (en) * 1996-06-20 2001-04-17 Fujitsu Limited Work inheriting system
JP3883647B2 (en) * 1997-06-10 2007-02-21 インターナショナル・ビジネス・マシーンズ・コーポレーション Message processing method, message processing apparatus, and storage medium for storing program for controlling message processing
US6038689A (en) * 1997-08-21 2000-03-14 Digital Equipment Corporation Fault notification system and process using local area network
US5923831A (en) * 1997-09-05 1999-07-13 International Business Machines Corporation Method for coordinating membership with asymmetric safety in a distributed system
US6658486B2 (en) * 1998-02-25 2003-12-02 Hewlett-Packard Development Company, L.P. System and method for efficiently blocking event signals associated with an operating system
US6223304B1 (en) * 1998-06-18 2001-04-24 Telefonaktiebolaget Lm Ericsson (Publ) Synchronization of processors in a fault tolerant multi-processor system
US6412079B1 (en) * 1998-10-09 2002-06-25 Openwave Systems Inc. Server pool for clustered system
US6367031B1 (en) * 1998-12-17 2002-04-02 Honeywell International Inc. Critical control adaption of integrated modular architecture
US7376864B1 (en) * 1998-12-30 2008-05-20 Oracle International Corporation Method and system for diagnostic preservation of the state of a computer system
US6735741B1 (en) 1999-07-30 2004-05-11 International Business Machines Corporation Method system, and program for dynamic resource linking when copies are maintained at different storage locations
KR100334912B1 (en) * 2000-01-28 2002-05-04 오길록 Method for remotely creating replication objects in a network computer system
GB2359384B (en) 2000-02-16 2004-06-16 Data Connection Ltd Automatic reconnection of partner software processes in a fault-tolerant computer system
US6735717B1 (en) * 2000-04-13 2004-05-11 Gnp Computers, Inc. Distributed computing system clustering model providing soft real-time responsiveness and continuous availability
US6687851B1 (en) 2000-04-13 2004-02-03 Stratus Technologies Bermuda Ltd. Method and system for upgrading fault-tolerant systems
US6820213B1 (en) 2000-04-13 2004-11-16 Stratus Technologies Bermuda, Ltd. Fault-tolerant computer system with voter delay buffer
US6691225B1 (en) 2000-04-14 2004-02-10 Stratus Technologies Bermuda Ltd. Method and apparatus for deterministically booting a computer system having redundant components
US6854051B2 (en) * 2000-04-19 2005-02-08 Hewlett-Packard Development Company, L.P. Cycle count replication in a simultaneous and redundantly threaded processor
GB0013336D0 (en) * 2000-06-01 2000-07-26 Sgs Thomson Microelectronics Forming an executable program
US6691250B1 (en) 2000-06-29 2004-02-10 Cisco Technology, Inc. Fault handling process for enabling recovery, diagnosis, and self-testing of computer systems
US7007190B1 (en) * 2000-09-06 2006-02-28 Cisco Technology, Inc. Data replication for redundant network components
US6854072B1 (en) 2000-10-17 2005-02-08 Continuous Computing Corporation High availability file server for providing transparent access to all data before and after component failover
US7146432B2 (en) * 2001-01-17 2006-12-05 International Business Machines Corporation Methods, systems and computer program products for providing failure recovery of network secure communications in a cluster computing environment
US6941366B2 (en) * 2001-01-17 2005-09-06 International Business Machines Corporation Methods, systems and computer program products for transferring security processing between processors in a cluster computing environment
US7340530B2 (en) * 2001-01-17 2008-03-04 International Business Machines Corporation Methods, for providing data from network secure communications in a cluster computing environment
US6591358B2 (en) * 2001-01-26 2003-07-08 Syed Kamal H. Jaffrey Computer system with operating system functions distributed among plural microcontrollers for managing device resources and CPU
US6845467B1 (en) 2001-02-13 2005-01-18 Cisco Systems Canada Co. System and method of operation of dual redundant controllers
US6853617B2 (en) 2001-05-09 2005-02-08 Chiaro Networks, Ltd. System and method for TCP connection protection switching
US7028217B2 (en) * 2001-06-04 2006-04-11 Lucent Technologies Inc. System and method of general purpose data replication between mated processors
US6965558B1 (en) * 2001-08-23 2005-11-15 Cisco Technology, Inc. Method and system for protecting a network interface
US20030065970A1 (en) * 2001-09-28 2003-04-03 Kadam Akshay R. System and method for creating fault tolerant applications
US6910158B2 (en) * 2001-10-01 2005-06-21 International Business Machines Corporation Test tool and methods for facilitating testing of duplexed computer functions
US6954877B2 (en) * 2001-11-29 2005-10-11 Agami Systems, Inc. Fault tolerance using logical checkpointing in computing systems
US7114095B2 (en) * 2002-05-31 2006-09-26 Hewlett-Packard Development Company, Lp. Apparatus and methods for switching hardware operation configurations
US7206964B2 (en) * 2002-08-30 2007-04-17 Availigent, Inc. Consistent asynchronous checkpointing of multithreaded application programs based on semi-active or passive replication
US7305582B1 (en) * 2002-08-30 2007-12-04 Availigent, Inc. Consistent asynchronous checkpointing of multithreaded application programs based on active replication
US7028218B2 (en) * 2002-12-02 2006-04-11 Emc Corporation Redundant multi-processor and logical processor configuration for a file server
US20050120081A1 (en) * 2003-09-26 2005-06-02 Ikenn Amy L. Building control system having fault tolerant clients
US7114109B2 (en) * 2004-03-11 2006-09-26 International Business Machines Corporation Method and apparatus for customizing and monitoring multiple interfaces and implementing enhanced fault tolerance and isolation features
FR2873464B1 (en) * 2004-07-23 2006-09-29 Thales Sa ARCHITECTURE OF HIGH AVAILABILITY, HIGH PERFORMANCE COMPONENT SOFTWARE COMPONENTS
US7496787B2 (en) * 2004-12-27 2009-02-24 Stratus Technologies Bermuda Ltd. Systems and methods for checkpointing
US20060212840A1 (en) * 2005-03-16 2006-09-21 Danny Kumamoto Method and system for efficient use of secondary threads in a multiple execution path processor
US20060222125A1 (en) * 2005-03-31 2006-10-05 Edwards John W Jr Systems and methods for maintaining synchronicity during signal transmission
US20060222126A1 (en) * 2005-03-31 2006-10-05 Stratus Technologies Bermuda Ltd. Systems and methods for maintaining synchronicity during signal transmission
US8135806B2 (en) * 2005-05-06 2012-03-13 Broadcom Corporation Virtual system configuration
US20070011499A1 (en) * 2005-06-07 2007-01-11 Stratus Technologies Bermuda Ltd. Methods for ensuring safe component removal
US20070008890A1 (en) * 2005-07-11 2007-01-11 Tseitlin Eugene R Method and apparatus for non-stop multi-node system synchronization
US20070028144A1 (en) * 2005-07-29 2007-02-01 Stratus Technologies Bermuda Ltd. Systems and methods for checkpointing
US20070038891A1 (en) * 2005-08-12 2007-02-15 Stratus Technologies Bermuda Ltd. Hardware checkpointing system
US8584145B1 (en) * 2010-08-06 2013-11-12 Open Invention Network, Llc System and method for dynamic transparent consistent application-replication of multi-process multi-threaded applications
US8621275B1 (en) 2010-08-06 2013-12-31 Open Invention Network, Llc System and method for event-driven live migration of multi-process applications
US7493512B2 (en) * 2005-10-04 2009-02-17 First Data Corporation System and method for providing data services via a network
US20070076228A1 (en) * 2005-10-04 2007-04-05 Jacob Apelbaum System and method for providing data services via a network
US7651593B2 (en) 2005-12-19 2010-01-26 Commvault Systems, Inc. Systems and methods for performing data replication
US7991865B2 (en) 2006-05-23 2011-08-02 Cisco Technology, Inc. Method and system for detecting changes in a network using simple network management protocol polling
KR101020016B1 (en) * 2006-08-28 2011-03-09 인터내셔널 비지네스 머신즈 코포레이션 A method for improving transfer of event logs for replication of executing programs
US20080178050A1 (en) * 2007-01-23 2008-07-24 International Business Machines Corporation Data backup system and method for synchronizing a replication of permanent data and temporary data in the event of an operational error
WO2008146091A1 (en) * 2007-05-25 2008-12-04 Freescale Semiconductor, Inc. Data processing system, data processing method, and apparatus
DE102007033885A1 (en) * 2007-07-20 2009-01-22 Siemens Ag Method for the transparent replication of a software component of a software system
CN101446909B (en) * 2007-11-30 2011-12-28 国际商业机器公司 Method and system for managing task events
CN101471833B (en) * 2007-12-29 2012-01-25 联想(北京)有限公司 Method and apparatus for processing data
US9495382B2 (en) 2008-12-10 2016-11-15 Commvault Systems, Inc. Systems and methods for performing discrete data replication
US8204859B2 (en) 2008-12-10 2012-06-19 Commvault Systems, Inc. Systems and methods for managing replicated database data
US8924963B2 (en) * 2009-03-31 2014-12-30 Microsoft Corporation In-process intermediary to create virtual processes
US8238538B2 (en) 2009-05-28 2012-08-07 Comcast Cable Communications, Llc Stateful home phone service
US20110083046A1 (en) * 2009-10-07 2011-04-07 International Business Machines Corporation High availability operator groupings for stream processing applications
US8504515B2 (en) 2010-03-30 2013-08-06 Commvault Systems, Inc. Stubbing systems and methods in a data replication environment
US8943510B2 (en) * 2010-12-17 2015-01-27 Microsoft Corporation Mutual-exclusion algorithms resilient to transient memory faults
JP5935439B2 (en) * 2012-03-28 2016-06-15 NEC Corporation Backup method for fault-tolerant servers
US9251002B2 (en) 2013-01-15 2016-02-02 Stratus Technologies Bermuda Ltd. System and method for writing checkpointing data
EP3090344B1 (en) 2013-12-30 2018-07-18 Stratus Technologies Bermuda Ltd. Dynamic checkpointing systems and methods
ES2652262T3 (en) 2013-12-30 2018-02-01 Stratus Technologies Bermuda Ltd. Method of delaying checkpoints by inspecting network packets
US9588844B2 (en) 2013-12-30 2017-03-07 Stratus Technologies Bermuda Ltd. Checkpointing systems and methods using data forwarding
US9495260B2 (en) 2014-07-01 2016-11-15 Sas Institute Inc. Fault tolerant communications
CA2967748A1 (en) 2014-11-13 2016-05-19 Virtual Software Systems, Inc. System for cross-host, multi-thread session alignment
US9619148B2 (en) 2015-07-27 2017-04-11 Sas Institute Inc. Distributed data set storage and retrieval
US9946718B2 (en) 2015-07-27 2018-04-17 Sas Institute Inc. Distributed data set encryption and decryption
US11042318B2 (en) 2019-07-29 2021-06-22 Commvault Systems, Inc. Block-level data replication
US11809285B2 (en) 2022-02-09 2023-11-07 Commvault Systems, Inc. Protecting a management database of a data storage management system to meet a recovery point objective (RPO)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE1269827B (en) * 1965-09-09 1968-06-06 Siemens Ag Method and additional device for the synchronization of data processing systems working in parallel
EP0478290A2 (en) * 1990-09-26 1992-04-01 Honeywell Inc. Method of maintaining synchronization of a free-running secondary processor
DE4104114A1 (en) * 1991-02-11 1992-08-13 Siemens Ag Redundant data processing system e.g. for control applications - uses program counter and count comparator for synchronising stand-by system following interrupt

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4590554A (en) * 1982-11-23 1986-05-20 Parallel Computers Systems, Inc. Backup fault tolerant computer system
US4608688A (en) * 1983-12-27 1986-08-26 At&T Bell Laboratories Processing system tolerant of loss of access to secondary storage
CA2003338A1 (en) * 1987-11-09 1990-06-09 Richard W. Cutts, Jr. Synchronization of fault-tolerant computer system having multiple processors
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
EP0444376B1 (en) * 1990-02-27 1996-11-06 International Business Machines Corporation Mechanism for passing messages between several processors coupled through a shared intelligent memory
DE69028517D1 (en) * 1990-05-11 1996-10-17 Ibm Method and device for deriving the state of a mirrored unit when reinitializing a system
US5157663A (en) * 1990-09-24 1992-10-20 Novell, Inc. Fault tolerant computer system
US5193180A (en) * 1991-06-21 1993-03-09 Pure Software Inc. System for modifying relocatable object code files to monitor accesses to dynamically allocated memory
US5440727A (en) * 1991-12-18 1995-08-08 International Business Machines Corporation Asynchronous replica management in shared nothing architectures
US5363503A (en) * 1992-01-22 1994-11-08 Unisys Corporation Fault tolerant computer system with provision for handling external events
US5555404A (en) * 1992-03-17 1996-09-10 Telenor As Continuously available database server having multiple groups of nodes with minimum intersecting sets of database fragment replicas
US5423037A (en) * 1992-03-17 1995-06-06 Teleserve Transaction Technology As Continuously available database server having multiple groups of nodes, each group maintaining a database copy with fragments stored on multiple nodes
GB2268817B (en) * 1992-07-17 1996-05-01 Integrated Micro Products Ltd A fault-tolerant computer system
US6233702B1 (en) * 1992-12-17 2001-05-15 Compaq Computer Corporation Self-checked, lock step processor pairs
JPH0773059A (en) * 1993-03-02 1995-03-17 Tandem Comput Inc Fault-tolerant computer system
US5812748A (en) * 1993-06-23 1998-09-22 Vinca Corporation Method for improving recovery performance from hardware and software errors in a fault-tolerant computer system
US5812757A (en) * 1993-10-08 1998-09-22 Mitsubishi Denki Kabushiki Kaisha Processing board, a computer, and a fault recovery method for the computer
US5590277A (en) * 1994-06-22 1996-12-31 Lucent Technologies Inc. Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications
US5630056A (en) * 1994-09-20 1997-05-13 Stratus Computer, Inc. Digital data processing methods and apparatus for fault detection and fault tolerance
US5621885A (en) * 1995-06-07 1997-04-15 Tandem Computers, Incorporated System and method for providing a fault tolerant computer program runtime support environment
US5802265A (en) * 1995-12-01 1998-09-01 Stratus Computer, Inc. Transparent fault tolerant computer system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JAMES R. LARUS: "Efficient Program Tracing", COMPUTER, vol. 26, no. 5, May 1993 (1993-05-01), LONG BEACH US, pages 52 - 61, XP000365282 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1495414A2 (en) * 2002-03-25 2005-01-12 Eternal Systems, Inc. Transparent consistent active replication of multithreaded application programs
EP1495571A4 (en) * 2002-03-25 2007-12-19 Availigent Inc Transparent consistent semi-active and passive replication of multithreaded application programs
EP1495414A4 (en) * 2002-03-25 2007-12-19 Availigent Inc Transparent consistent active replication of multithreaded application programs

Also Published As

Publication number Publication date
US5968185A (en) 1999-10-19
US5802265A (en) 1998-09-01
AU1121197A (en) 1997-07-14

Similar Documents

Publication Publication Date Title
US5968185A (en) Transparent fault tolerant computer system
WO1997022930A9 (en) Transparent fault tolerant computer system
Bressoud TFT: A software system for application-transparent fault tolerance
US8020041B2 (en) Method and computer system for making a computer have high availability
US6625751B1 (en) Software fault tolerant computer system
US7523344B2 (en) Method and apparatus for facilitating process migration
US7818736B2 (en) Dynamic update mechanisms in operating systems
JP3573463B2 (en) Method and system for reconstructing the state of a computation
KR0137406B1 (en) Fault tolerant computer system
EP1495571B1 (en) Transparent consistent semi-active and passive replication of multithreaded application programs
US5359730A (en) Method of operating a data processing system having a dynamic software update facility
US20080307265A1 (en) Method for Managing a Software Process, Method and System for Redistribution or for Continuity of Operation in a Multi-Computer Architecture
GB2515501A (en) Replication for on-line hot-standby database
Speirs et al. Using passive replicates in delta-4 to provide dependable distributed computing
WO2002091178A2 (en) Method and apparatus for upgrading managed application state for a java based application
US6161135A (en) Method and apparatus for software features synchronization between software systems
CN111400086A (en) Method and system for realizing fault tolerance of virtual machine
Xie et al. X10-FT: Transparent fault tolerance for APGAS language and runtime
Muller et al. Lessons from FTM: an experiment in design and implementation of a low-cost fault tolerant system
Bondavalli et al. State restoration in a COTS-based N-modular architecture
Schneider et al. Towards fault-tolerant process control software
US20070038849A1 (en) Computing system and method
Fu et al. Research on rtos-integrated tmr for fault tolerant systems
Petrick Stateful task recovery in the UNIX operating system
CN114296875A (en) Data synchronization method, system and computer readable medium based on fault tolerant system

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AU BB BG BR CA CN CZ EE GE HU IL IS JP KP KR LK LR LT LV MD MG MK MN MX NO NZ PL RO SG SI SK TR UA UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
COP Corrected version of pamphlet

Free format text: PAGES 1/7-7/7, DRAWINGS, REPLACED BY NEW PAGES BEARING THE SAME NUMBER; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 97522796

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA