US20050246581A1 - Error handling system in a redundant processor - Google Patents


Info

Publication number
US20050246581A1
US20050246581A1
Authority
US
United States
Prior art keywords
processor
disparity
pio
processor elements
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/045,401
Inventor
Robert Jardine
James Klecka
William Bruckert
James Smullen
David Garcia
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Priority to US11/045,401 priority Critical patent/US20050246581A1/en
Assigned to HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. Assignment of assignors interest (see document for details). Assignors: BRUCKERT, WILLIAM F.; KLECKA, JAMES S.; GARCIA, DAVID J.; JARDINE, ROBERT L.; SMULLEN, JAMES R.
Publication of US20050246581A1 publication Critical patent/US20050246581A1/en
Priority to CN 200610007181 priority patent/CN1811722A/en

Classifications

    • G: PHYSICS; G06: COMPUTING, CALCULATING OR COUNTING; G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/1658: Data re-synchronisation of a redundant component, or initial sync of replacement, additional or spare unit
    • G06F11/165: Error detection by comparing the output of redundant processing systems, with continued operation after detection of the error
    • G06F11/1687: Temporal synchronisation or re-synchronisation of redundant processing components at event level, e.g. by interrupt or result of polling
    • G06F11/3404: Recording or statistical evaluation of computer activity for parallel or distributed programming
    • G06F11/3476: Data logging
    • G06F11/3495: Performance evaluation by tracing or monitoring, for systems
    • G06F11/3636: Software debugging by tracing the execution of the program
    • G06F11/366: Software debugging using diagnostics
    • G06F9/52: Program synchronisation; mutual exclusion, e.g. by means of semaphores
    • G06F11/1641: Error detection by comparing the output of redundant processing systems, where the comparison is not performed by the redundant processing components
    • G06F11/1645: Error detection by comparing the output of redundant processing systems, where the comparison is not performed by the redundant processing components and the comparison itself uses redundant hardware
    • G06F11/1683: Temporal synchronisation or re-synchronisation of redundant processing components at instruction level
    • G06F11/184: Error detection or correction by passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components, where the redundant components implement processing functionality
    • G06F11/185: Error detection or correction by passive fault-masking of the redundant circuits by voting, where the redundant components implement processing functionality and the voting is itself performed redundantly
    • G06F2201/88: Monitoring involving counting

Definitions

  • System availability, scalability, and data integrity are fundamental characteristics of enterprise systems.
  • a continuous performance capability is required in financial, communication, and other fields that use enterprise systems for applications such as stock exchange transaction handling, credit and debit card systems, telephone networks, and the like.
  • Highly reliable systems are often implemented in applications with high financial or human costs, in circumstances of massive scaling, and in conditions where outages and data corruption cannot be tolerated.
  • Some systems combine multiple redundant processors running the same operations so that an error in a single processor can be detected and/or corrected.
  • Results attained for each of the processors can be mutually compared. If all results are the same, all processors are presumed, with high probability of correctness, to be functioning properly. However, if results differ analysis is performed to determine which processor is operating incorrectly. Results from the multiple processors can be “voted” with the “winning” result determined to be correct. For example, a system with three processor elements typically uses the result attained by two of the three processors.
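The strict-majority voting described above can be sketched as follows; the function and names are illustrative, not from the patent.

```python
from collections import Counter

def majority_vote(results):
    """Vote a list of per-processor-element results.

    Returns (winning_result, suspect_indices); the winner is the value
    supplied by a strict majority (more than half) of the elements.
    When no strict majority exists (e.g. a two-element tie), the winner
    is None and every element is suspect.
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 > len(results):
        # Elements that disagree with the majority are presumed faulty.
        suspects = [i for i, r in enumerate(results) if r != value]
        return value, suspects
    return None, list(range(len(results)))

# Triplex example: two of three elements agree, so their result wins
# and the third element is flagged as operating incorrectly.
winner, suspects = majority_vote([0xAB, 0xAB, 0xAC])
```

In a two-element (duplex) configuration no strict majority is possible when the results differ, which is the tie condition the remainder of the description addresses.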
  • an error handling method comprises detecting equivalent disparity among processor elements of a computing device and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
  • FIG. 1 is a schematic block diagram that illustrates an embodiment of a control apparatus for usage in a redundant-processor computing device and having capability to resolve a mutual disparity or tie condition;
  • FIG. 2 is a schematic block diagram depicting an embodiment of a computing system with capability to resolve disparity and break ties among a plurality of processor elements using a probation vector;
  • FIG. 3 is a schematic block diagram illustrating an embodiment of a computing system configured in a redundant-processor arrangement that imposes a selected-duration short delay in the event of a disparity or tie condition;
  • FIG. 4 is a schematic block diagram showing an embodiment of a processor complex within which the illustrative error handling system may be implemented;
  • FIG. 5 is a schematic block diagram showing an embodiment of a computing system capable of detecting equivalent disparity among the processor elements and responding by evaluating secondary considerations of processor fidelity;
  • FIG. 6 is a flow chart depicting an embodiment of an error handling method in a redundant-processor computing device that has tie-breaking capability during programmed input/output (PIO) voting in a duplex configuration;
  • FIG. 7 is a flow chart illustrating an embodiment of an error handling method in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration.
  • a processor may incorporate multiple, redundant, loosely-coupled processor elements for error detection.
  • a duplex arrangement using two processor elements is susceptible to a “voting tie” situation. Ties can be avoided by using an odd number of processor elements, but at the expense of fault detection capability if a single processor element is used, or at added cost if additional processor elements are incorporated.
  • the illustrative system and method may use other information to resolve conflicts and break ties. Accordingly, an effective processor may be configured using only two processor elements for voting or comparison.
  • Referring to FIG. 1, a schematic block diagram illustrates an embodiment of a control apparatus 100 for usage in a redundant-processor computing device 102 .
  • the control apparatus 100 is operative in a configuration with a plurality of processor elements 104 A and 104 B and can resolve a mutual disparity or “tie” condition among processor elements.
  • the control apparatus 100 can be used to break ties in the case of voting error, for example with an even number of processor elements, using other available information.
  • the control apparatus 100 includes a control element 106 that detects equivalent disparity among the processor elements 104 A, 104 B and responds by evaluating secondary considerations of processor fidelity.
  • the control element 106 determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements 104 A, 104 B and, if so, terminates computing device operations.
  • the computing device 102 can be a computer processor that uses multiple, redundant, loosely-synchronized processor elements 104 A, 104 B to detect and manage errors.
  • a configuration with an even number of processor elements 104 A, 104 B is susceptible to a voting “tie” condition in which actions or results from the processor elements differ.
  • a computing device 102 may have two processor elements 104 A, 104 B so that any disparity is equivalent and results in a tie condition.
  • an odd number of processor elements, for example three, can be used at added cost to avoid ties.
  • other information, which may be called secondary considerations of fidelity, may be available to resolve the disparity and break the tie.
  • the other information is heuristic data which is sufficiently predictive to be trusted for disparity resolution. If the tie cannot be broken by use of the other information, then the error is considered sufficiently serious that the processor is halted due to an inability to guarantee correctness of either of the unequal voted data items.
  • Some embodiments may include a control element 106 that evaluates the secondary considerations of processor fidelity while the processor elements 104 A, 104 B are executing, before equivalent disparity is detected, and sets a probation vector 108 according to the evaluation.
  • the probation vector 108 may be implemented in a voting unit 110 and used by the voting unit 110 to resolve disparities and break ties in predetermined conditions.
  • the probation vector 108 can have one bit of state per processor element 104 A, 104 B.
  • Control logic, such as software, executing in each processor element 104 A, 104 B can set the bit when the logic has accumulated information for usage in breaking future ties, or very recent ties. The control logic can periodically reset the probation vector bits.
  • the voting unit 110 , upon detecting a disparity or tie condition, may delay acting upon the condition or declaring a fatal error situation. Instead, the voting unit 110 can hold the compared values for a short time period before acting. Accordingly, the control element 106 can interject a delay between equivalent disparity detection and termination of computing device operation.
  • the delay enables control logic, for example software, to possibly detect other errors or gather information pertinent to resolving the disparity or breaking the tie.
  • the delay can also break potential race conditions. For example, if a self-detectable error occurs simultaneously with, or at nearly the same time as, a misvote, the delay enables further collection of information or analysis before the voter declares the misvote condition, enabling recognition of the error and resolution of the vote.
  • the voting unit 110 resolves the disparity or breaks the tie in favor of the processor element, either 104 A or 104 B, that is not on probation. While the condition remains an error condition, the error is made recoverable. If the control logic does not set the bit in the probation vector prior to or during the delay, the error is considered to be fatal to the computing device 102 , and operation is halted.
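A minimal sketch of the duplex tie-break just described, assuming the probation vector holds one boolean per processor element (True when that element is on probation); the function name and data representation are hypothetical, not from the patent.

```python
def resolve_duplex_tie(probation, values):
    """Resolve a two-element voting tie using a probation vector.

    `probation` holds one bit of state per processor element.  The tie
    is resolved in favor of the element that is NOT on probation; when
    neither (or both) bits are set, there is no basis for resolution
    and None is returned, signalling a fatal, device-halting error.
    """
    assert len(probation) == len(values) == 2
    if probation[0] != probation[1]:
        # Exactly one element is on probation: trust the other one.
        trusted = 1 if probation[0] else 0
        return values[trusted]
    return None  # tie cannot be broken: fatal error
```

The error remains an error in either case; the probation bit merely converts it from fatal to recoverable, matching the behavior described for the voting unit 110.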
  • the computing device 102 can include a logical synchronization unit 112 that contains the voting unit 110 and input/output interfaces 114 A and 114 B.
  • the interfaces 114 A and 114 B may include a programmed input/output (PIO) interface and a direct memory access (DMA) interface.
  • a possible equivalent disparity or tie condition is a condition in which a first processor element performs a programmed input/output (PIO) action while a second processor element does not.
  • a second example of an equivalent disparity or tie condition is a miscompare on voted data, whereby data supplied by the two processor elements 104 A, 104 B does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
  • Other equivalent disparity or tie conditions include a first processor element performing a PIO read while a second processor element performs a PIO write, or first and second processors reading or writing different addresses.
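The disparity cases listed above amount to comparing the PIO descriptors issued by the two processor elements. A sketch with hypothetical field names (the patent does not specify a descriptor layout):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PioOp:
    kind: str                   # "read" or "write"
    address: int
    data: Optional[int] = None  # write data, voted for writes

def classify_pio_pair(op_a, op_b):
    """Classify the PIO actions of two processor elements.

    Returns 'agree' or a string naming the equivalent-disparity case,
    mirroring the tie conditions described in the text.  Either
    argument may be None when that element performed no PIO action.
    """
    if op_a is None or op_b is None:
        return "one element performed a PIO action, the other did not"
    if op_a.kind != op_b.kind:
        return "PIO read versus PIO write"
    if op_a.address != op_b.address:
        return "different addresses"
    if op_a.data != op_b.data:
        return "miscompare on voted data"
    return "agree"
```

Each non-'agree' outcome is an equivalent disparity: with only two elements, nothing in the vote itself says which element is wrong.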
  • Referring to FIG. 2, a schematic block diagram depicts an embodiment of a computing system 200 with capability to resolve disparity among a plurality of processor elements 202 A, 202 B configured in a redundant-processor arrangement.
  • the computing system 200 further includes a probation vector 204 coupled to the processor elements 202 A, 202 B, with a signal allocated to each of the processor elements 202 A, 202 B.
  • a control element 206 is coupled to the processor elements 202 A, 202 B and evaluates processor fidelity, setting the probation vector 204 according to results of the evaluation.
  • the probation vector 204 is used to monitor secondary considerations of processor element fidelity before an error is detected, when more abundant information relating to processor element conditions and functionality is available. In contrast, a system that does not begin acquiring status information until an error is detected may have more limited functional capabilities, and a possible inability to perform actions diagnostic of processor element fidelity. Acquisition of status and operational information while the processor elements 202 A, 202 B are executing in due course simplifies operation because information is merely noted when available. Higher complexity algorithms that execute following error or disparity detection, and that require information to be evoked or stimulated as well as monitored, can be avoided.
  • the computing system 200 also includes a voter 208 that is coupled to the plurality of processor elements 202 A, 202 B and compares actions taken by the processor elements 202 A, 202 B to determine disparity in processor actions.
  • a control element 210 responds to disparity among the processor elements 202 A, 202 B based on the probation vector 204 .
  • the control element 210 can interject a delay between disparity detection and termination of computing device operation to allow monitoring of additional information that may be useful in resolving disparity and breaking ties.
  • the control element 210 can also determine whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements 202 A, 202 B and, if so, terminate computing system operations.
  • Referring to FIG. 3, a schematic block diagram illustrates an embodiment of a computing system 300 including a plurality of processor elements 302 A, 302 B configured in a redundant-processor arrangement, and control logic 304 that imposes a selected-duration short delay in the event of a disparity or tie condition.
  • the control logic 304 compares actions taken by the processor elements 302 A, 302 B and determines disparity in the actions, then waits the selected delay duration after equivalent disparity detection before initiating an action in response to the disparity condition.
  • the control logic 304 may respond after the delay according to evaluation of secondary considerations of processor fidelity.
  • the selected delay has duration sufficient to enable near-simultaneous arrival of information for usage in resolving the disparity condition.
  • the delay is imposed in case of processor disparity or tie, to enable simultaneous or near-simultaneous arrival of information that can be used in disparity resolution and/or tie-breaking.
  • the delay has suitable duration to enable logic to receive a high priority interrupt and perform a few computations, and is sufficiently long to enable information to be acquired at the same time as, or very close to, the occurrence of an error.
  • Typical delay duration, using current processor operating speeds and technology, is on the order of tens or hundreds of microseconds, sufficient to handle the interrupt and execute hundreds or thousands of instructions. The delay may assist in avoiding or breaking race conditions.
  • the selected delay also has an upper limit. A lengthy examination of error state or diagnostic execution may not be acceptable.
  • the programmed input/output (PIO) operation or direct memory access (DMA) operation is suspended, possibly disrupting communications with other processors as well as communications on the interprocessor network due to backpressure. Such disruption is generally not desirable and is avoided by imposing an upper limit on the delay duration.
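The bounded delay can be sketched as a poll with an upper time limit; the polling callback and the durations below are assumptions for illustration, not values specified by the patent.

```python
import time

def wait_for_resolving_info(poll_fn, max_delay_s=0.0005,
                            poll_interval_s=0.00005):
    """Hold a detected disparity for a bounded delay, polling for
    information (e.g. a self-signaling error report) that could
    resolve it.

    Returns the first non-None report from `poll_fn`, or None if the
    delay expires.  The upper bound keeps the suspended PIO/DMA
    operation from backpressuring the interprocessor network for
    too long.
    """
    deadline = time.monotonic() + max_delay_s
    while time.monotonic() < deadline:
        report = poll_fn()
        if report is not None:
            return report
        time.sleep(poll_interval_s)
    return None
```

With current processor speeds the text suggests a budget of tens to hundreds of microseconds, enough to take a high-priority interrupt and execute hundreds or thousands of instructions.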
  • Referring to FIG. 4, a schematic block diagram shows an embodiment of a processor complex 400 within which the illustrative error handling system may be implemented.
  • the processor complex 400 includes a plurality of logical processors 408 , each a computing engine capable of executing processes and implemented using one or more processor elements 402 , each in a different processor slice 410 , combined with one or more logical synchronization units (LSUs) 412 .
  • Each processor element 402 is a single microprocessor or microprocessor core capable of executing a single instruction stream.
  • a processor slice 410 typically comprises one or more processor elements 402 , each with a dedicated memory 414 or sharing a partitioned memory.
  • a processor complex 400 comprises one or more logical processors 408 .
  • a processor complex 400 comprises one or more processor slices 410 . Within a complex 400 , each slice 410 has the same number of processor elements.
  • a processor complex with one slice is called a simplex complex.
  • a two-slice processor complex is called a duplex, dual modular redundant, or DMR complex.
  • a three-slice processor complex is called a triplex, tri-modular redundant, or TMR complex.
  • a processor complex 400 includes both processor elements 402 and corresponding logical synchronization units 412 .
  • a computing system comprises one or more logical processors 408 .
  • the computing system also comprises one or more processor complexes 400 .
  • the processor complexes 400 are interconnected via a network, for example a System Area Network (SAN), a local area network (LAN), a wide area network (WAN), or the like, or simply a connection to a bus.
  • the network is used for connection to both other processors and to input/output (I/O) devices. Voting or output data comparison is performed for all data transfers between the logical processor and the network or the network I/O adapter.
  • In a logical processor 408 , one, two, three, or possibly more processor elements cooperate to perform logical processor operations.
  • Cooperative actions include coordinating or synchronizing mutually among the processor elements, exchanging data, replicating input data, and voting on operations and output data selection.
  • the various cooperative actions can be implemented within or supported by implementation of the logical synchronization units 412 .
  • Referring to FIG. 5, a schematic block diagram illustrates an embodiment of a computing system 500 comprising a plurality of processor elements 502 A, 502 B configured in a redundant-processor arrangement, and a voter 504 coupled to the plurality of processor elements 502 A, 502 B that compares actions taken by the processor elements and determines disparity in the actions.
  • the computing system 500 further includes a control element 506 coupled to the processor elements 502 A, 502 B and the voter 504 that detects equivalent disparity among the processor elements 502 A, 502 B and responds by evaluating secondary considerations of processor fidelity.
  • the illustrative computing system 500 may comprise a single logical processor among multiple logical processors in a complex or system.
  • the computing system 500 may also include a logical synchronization unit 512 comprising the voter 504 and a network interface 520 , such as a SAN interface that can be configured to connect to the network 518 by one or more ports.
  • the network connection may be made as shown in FIG. 5 via an X-fiber port and a Y-fiber port, although wire ports may otherwise be used.
  • the logical synchronization unit 512 maintains multiple bit sets that represent, at any point in time, the set of processor elements that are enabled to perform selected operations.
  • the voter 504 includes a plurality of multiple-bit configuration registers that indicate which processor elements are expected and enabled to participate in selected operations, for example programmed input/output (PIO) operations with the processor elements 502 A, 502 B and network interface 520 direct memory access (DMA) operations, including DMA output voting and DMA input replication.
  • the voter configuration bits represent which processor elements are meant to be “assigned” to a logical processor and are therefore eligible for performing various operations such as output voting operations.
  • Configured processor elements are defined, for any particular operation type, as the set of processor elements enabled by the configuration bits in the logical synchronization unit 512 to perform that operation type.
  • the logical synchronization unit 512 ignores operations of non-configured processor elements on an operation-type basis.
  • Configuration bits are set whenever processor elements “join” the configuration of a logical processor 508 . Configuration bits are reset or cleared when processor elements leave the configuration, for example by being voted out.
  • Participating processor elements are the set of configured processor elements, for a particular instance of an operation, that actually perform the operation at a particular time, within a timeout period.
  • the set of configured processor elements is related to a type of operation while the set of participating processor elements is related to an individual instance of an operation.
  • the configured processor elements have outbound data transfers voted.
  • Participating processor elements are the elements that actually issue a particular outbound data transfer.
  • For a particular voted operation, all configured processor elements are expected to participate. A processor element that does not participate times out on the operation.
  • voted operations can result in conditions including full agreement, timeout, simple miscompare, tie miscompare, and full miscompare.
  • In full agreement, all configured processor elements participate within a reasonable time and supply identical data for an identical operation, so that all voted data matches.
  • In a timeout condition, one or more configured processor elements do not participate in time.
  • In a simple miscompare, a strict majority of configured processor elements supplies identical data, and a strict minority supplies different data.
  • A strict majority is greater than half, for example one in simplex, two in duplex, and two or three in triplex.
  • A tie miscompare occurs when an even number of configured processor elements, for example two, supply data that does not compare.
  • In a full miscompare, all sets of voted data, for example all three in triplex, miscompare pair-wise so that no strict majority results.
  • the different types of conditions are mapped into one of three error conditions including no error, minor error, and major error.
  • Minor errors include timeouts or disagreements among voted data for which available information is sufficient to enable resolution. Examples of minor error conditions are triplex configurations in which two processors are in agreement, and duplex configurations in disagreement when other information is available to resolve the disagreement.
  • Major errors include timeouts and disagreements among voted data for which the condition cannot be resolved. Examples of major error conditions are triplex configurations with three-way disagreement or timeouts in two of the three processors, and duplex configurations with disagreeing processors and no available information for resolution.
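The outcome conditions above and their mapping to error levels can be sketched as follows. Whether "other information" can resolve a tie or timeout is passed in by the caller, since the text leaves that determination to control logic; all names are illustrative.

```python
from collections import Counter

def vote_outcome(configured, participated):
    """Classify one voted operation.

    `configured` is the set of processor elements enabled for the
    operation type; `participated` maps each element that responded
    within the timeout to the data it supplied.
    """
    if set(participated) != set(configured):
        return "timeout"
    (top_value, top_votes), = Counter(participated.values()).most_common(1)
    n = len(configured)
    if top_votes == n:
        return "full agreement"
    if top_votes * 2 > n:
        return "simple miscompare"   # strict majority agrees
    if top_votes * 2 == n:
        return "tie miscompare"      # even split, e.g. duplex disagreement
    return "full miscompare"         # pair-wise disagreement, no majority

def severity(outcome, resolvable_by_other_info=False):
    """Map a vote outcome to one of: no error, minor error, major error."""
    if outcome == "full agreement":
        return "no error"
    if outcome == "simple miscompare":
        return "minor error"
    if outcome in ("tie miscompare", "timeout"):
        return "minor error" if resolvable_by_other_info else "major error"
    return "major error"             # full miscompare
```

The `severity` mapping of timeouts is a simplification: per the text, a single timeout in triplex is resolvable (minor), while timeouts in two of three processors are not (major).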
  • Self-signaling errors are defined as errors detected by an explicit detection element or mechanism as distinguished from implicit detection techniques such as voting.
  • Various types of errors may be self-signaling. Examples of self-signaling errors include direct memory access (DMA) read timeouts, errors detected by parity or other error checking codes, loss-of-signal in optical signals, and loss of electrical continuity for electrical signals.
  • voting detects errors, but in a duplex configuration voting alone does not distinguish which voted data is correct and which is incorrect. Self-signaling errors designate which of the two data suppliers is incorrect.
  • the processor elements 502 A, 502 B each have a memory 514 or share a partitioned memory with other processor elements in the same slice.
  • the PIO operations are processor-initiated reads (loads) or writes (stores) to any part of the processor's address space other than “main memory”.
  • the address space may contain registers, pseudo-registers, and memory other than main memory.
  • PIO operations may be targeted to resources in either the voter 504 or to a network interface 520 . Operations targeted to the voter 504 may be either unvoted or voted (compared), depending on the address of the register being accessed. Operations to the network interface 520 may always be voted in some implementations.
  • the voter 504 captures the operation and read address and sets a timer. The voter 504 waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation and address are compared, for example a bit-by-bit comparison of the entire operation and address. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances in which the illustrative error handling and tie-breaking technique is not implemented or is disabled, operation can be described as follows. If all configured processor elements participate, then for the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data, if any.
  • a simple miscompare is handled as a minor error. Otherwise (not full agreement or simple miscompare), the operation is aborted, that is, not performed; a “bus error” is returned to all requesting processor elements, and a major error is reported. If the operation is not aborted, then if the PIO read is targeted to the network interface 520 , the voter 504 forwards the operation and address to the network interface 520 and waits for a response. If the operation is targeted to the voter 504 , then the voter 504 accesses data directly. The response data, when available, is replicated by the voter 504 and sent as a response to all participating processor elements at approximately the same instant.
  • the voter 504 captures the operation, write address, and write data, then sets a timer and waits for all configured processor elements to perform the same operation.
  • the operation, write address, and write data are bit-by-bit compared. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances in which the illustrative error handling and tie-breaking technique is not implemented or is disabled, the operation is as follows. If all configured processor elements participate, in the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data. A simple miscompare is handled as a minor error.
  • the operation is aborted, that is, not performed. No direct response is made to the processor element because no response or acknowledgement is normally made to write operations.
  • if the operation is aborted, a major error is reported.
  • An operation that is not aborted is handled according to the target address of the write operation.
  • the voter 504 forwards the operation, address, and data to the network interface 520 .
  • the voter 504 performs the write operation directly. No response is made to the processor element.
  • the voter 504 suspends all future PIO write operations, which are also aborted, until the software detects the error and re-enables PIO voting.
  • the error is detected by handling an error interrupt. Note, however, that in the triplex case for a simple miscompare, such as when one processor element writes and two processor elements time out, no requirement is made to abort all future programmed I/O write operations.
  • PIO operations are initiated by the processor elements, as contrasted to DMA operations which are initiated by the network interface 520 . Therefore, PIO timeouts are possible in two different circumstances. In a first circumstance, one or two processor elements, operating correctly, initiate a PIO operation, and one processor element, operating incorrectly or stopped, fails to perform the PIO operation. The error is detected when the timer expires. Without further information, the processor element operating incorrectly may be indeterminable, for example when two processor elements are configured and one times out.
  • one processor element operating incorrectly, initiates a PIO operation that should not occur, and the other processor element or processor elements, operating correctly, do not initiate a PIO operation.
  • the error is detected when the timer expires, although without further information the processor element operating incorrectly may be indeterminable, for example for two active processor elements, one of which times out. Accordingly, the processor elements that do not participate are not necessarily incorrect.
  • the strict majority is always trustworthy, so that if one processor element times out and the remaining two processor elements perform a PIO operation, then the processor element that times out is in error and ignored, while the PIO operation proceeds and the data is voted, with a minor error indicated. Also in the triplex case, if two processor elements time out when a single processor element performs a PIO operation, then the processor element that performs the PIO operation is in error and the PIO is ignored. The lone processor element can be considered a “rogue processor element” and the attempted PIO operation is called a “rogue operation”.
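The triplex PIO-timeout rule above can be captured in a short sketch. This is a hypothetical illustration with invented names: it only reports which elements are suspect and whether the voter proceeds with, or ignores, the operation.

```python
def triplex_pio_timeout(performed):
    """performed: set of processor-element ids in {0, 1, 2} that initiated
    the same PIO before the timeout. Returns (action, suspect_elements)."""
    performed = set(performed)
    timed_out = {0, 1, 2} - performed
    if len(performed) == 3:
        return "proceed", set()       # full participation; vote the data
    if len(performed) == 2:
        return "proceed", timed_out   # strict majority trusted; lone
                                      # timed-out element is in error
    if len(performed) == 1:
        return "ignore", performed    # lone "rogue" element; the attempted
                                      # rogue operation is ignored
    return "idle", set()              # no PIO pending at all
```

The key asymmetry is that a two-element majority wins over one timeout, while a lone initiator loses to two non-initiators, matching the strict-majority rule stated above.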
  • Some embodiments of the computing system 500 further comprise a two-processor elements configuration 502 A, 502 B, the programmed input/output (PIO) interface 522 , and a direct memory access (DMA) interface 524 coupled to the voter 504 .
  • An action disparity that is detectable by the control element 506 is a miscompare on voted data with non-matching data supplied by two processor elements 502 A, 502 B either on a PIO action or a DMA action.
  • Direct memory access (DMA) reads are outbound operations initiated by the network interface 520 .
  • the voter 504 replicates and forwards the DMA read operation and address to all configured processor elements at approximately the same instant, subject to congestion delays in the different slices that may cause an operation to arrive at the processor elements at slightly different times.
  • the voter 504 then starts a timer and waits for the responses.
  • Response data flows from the processor elements 502 A, 502 B, through the voter 504 , to the network interface 520 .
  • Responses arriving from the configured processor elements are buffered in the voter 504 rather than being sent immediately to the network interface 520 .
  • the later responses, upon arrival, are compared with the earlier responses saved in the data buffers.
  • Any processor element that does not respond within the timeout period is declared to have timed out. Unlike PIO timeouts, no rogue situations can occur with DMA timeouts because the DMA operation is initiated through the network interface 520 . Accordingly, if one processor element times out, that processor element is necessarily erroneous in both duplex and triplex cases. The condition is considered a minor error, and the DMA operation proceeds using data supplied by the processor element or processor elements that do not time out. If two processor elements time out, or the only processor element in a simplex case, then a major error is indicated and the DMA operation is aborted.
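The DMA-read timeout rule just described differs from the PIO case because the network interface initiates the operation, so a non-responding element is always the faulty one. A minimal sketch, with hypothetical names:

```python
def dma_read_timeout(configured, responded):
    """configured, responded: sets of processor-element ids.
    Returns (action, error_level)."""
    timed_out = set(configured) - set(responded)
    if not timed_out:
        return "forward_data", "no_error"       # all elements responded
    if len(timed_out) == 1 and len(responded) >= 1:
        # the lone timed-out element is necessarily erroneous; proceed
        # using data supplied by the element(s) that did respond
        return "forward_data", "minor_error"
    # two timeouts, or the only element in simplex: abort the DMA
    return "abort", "major_error"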
  • the voter 504 generates an error notification interrupt to all configured processor elements in the case of any disagreement or timeout, whether data is successfully forwarded to the network interface 520 or otherwise. The interrupt indicates which processor elements did time out, if any, and all comparison results.
  • Direct memory access (DMA) writes are inbound operations with data flowing from the network interface 520 through the voter 504 to the processor elements 502 A, 502 B. DMA write operations are initiated by the network interface 520 .
  • the voter 504 replicates and forwards the operation, address, and data to all configured slices at approximately the same instant. No response is made to the network interface 520 .
  • PIO and DMA timeout values are both configurable by control logic, such as software, and may be different.
  • the timer is started when the first PIO operation arrives.
  • the timer is restarted when each operation arrives, giving a full timeout period for the later arrivals relative to the earlier ones, a behavior used in all implementations due to the possibility of a PIO operation being initiated by a “rogue processor element”.
  • PIO operations can never time out in the simplex case because the operation originates from the processor element.
  • DMA operations can time out in simplex, duplex, and triplex configurations because the operation is originated from the network interface 520 .
  • the timer is started when the DMA request is forwarded by the voter 504 to memories of all processor elements.
  • the timer may optionally be restarted when each response arrives, giving a full timeout period for the later responses.
  • a single timeout interval may be applied to all configured processor elements. Either option is possible since no rogue DMA operations can occur.
  • duplex tie handling generally is inapplicable to DMA timeouts, and to simplex or triplex configurations.
  • Two disparity or tie conditions include a PIO timeout in which one processor element performs a PIO operation and the other processor element does not, and a miscompare on voted data in which data supplied by the two processor elements does not match, either on a PIO operation or a DMA read operation.
  • information regarding processor fidelity may be considered sufficiently strong, even if circumstantial, to implicate one of the two processor elements. For example, a recent logged history of other detected recoverable errors may be indicative of a degradation of processor element reliability.
  • a newly-reintegrated processor element or a new slice may be a more likely source of error than an element that has long been installed without a history of error.
  • Such early life problems are frequently discovered within a short time, on the order of minutes, following installation.
  • a further example of reliability information is inherent in the multiple-dimensional configuration of logical and physical processors, for example as shown in FIG. 4 .
  • Processor slices with multiple processor elements connected physically but not logically include hardware that is shared within a processor slice. Hardware may be shared among logical processors. When shared hardware ceases functioning correctly, errors such as intermittent errors can occur. If errors are detected in one logical processor but not another, information about the errors can be communicated between processor elements in a processor slice so that all processor elements within the slice have sufficient information to break any ties that occur.
  • Some embodiments of the computing system 500 further comprise a probation vector 526 coupled to the voter 504 and coupled to the processor elements 502 A, 502 B.
  • the probation vector 526 holds a signal for each processor element 502 A, 502 B.
  • a processor control policy executable on the processor elements 502 A, 502 B evaluates the secondary conditions of processor fidelity during processor element execution.
  • the processor control policy sets bits in the probation vector according to the evaluation of secondary considerations before determining whether a major or minor error has occurred.
  • the probation vector 526 enables implementations to supply the other information to the voter 504 in such a way that duplex tie conditions can be simply resolved.
  • the probation vector 526 comprises one bit for each processor element 502 A, 502 B.
  • the logical synchronization unit 512 uses the probation vector 526 as a tie-breaker in some conditions.
  • PIO writes to the probation vector 526 are not voted.
  • the initial or reset value of the probation vector 526 is all zero.
  • Each processor element 502 A, 502 B can set or reset the probation bit only for the processor slice associated with the processor element, and not for any other slice.
  • the probation vector 526 is not used in simplex and triplex modes; the bits may still be set by the control logic but are ignored.
  • the probation vector 526 is ignored if both processor slices agree, indicating no error.
  • the probation vector 526 is also ignored in the case of self-signaling errors, which indicate the error source so that tie-breaking is superfluous. Self-signaling errors are thus classified as minor errors.
  • the errors occur when the voter 504 can be certain that one slice should be ignored, for example in an outbound DMA read response, when one slice does not respond and times out, or when the slice supplies data marked as “known bad”.
  • “Known bad” data relates to another example of self-signaling error and is data returned from a processor element's memory or cache that generates a detected error, such as a parity or other detected error, when accessed.
  • Control logic in each processor element sets the associated probation bit independently of the other processor elements; reaching an agreement among processor elements is not required. Accordingly, both probation bits may be asserted at any time.
  • the control logic resets the probation bits after some amount of time. The time duration is an implementation-defined parameter or policy.
  • the probation bits may be set for all processor elements on a processor slice due to an error that places behavior of the entire slice in doubt. Accordingly, the control logic can propagate the probation bits from one processor element, where an error has been detected, to other processor elements in the slice by an implementation-defined technique. Examples of an implementation-defined system include exchanging probation bits via a register in a slice application-specific integrated circuit (ASIC), and/or using inter-processor, intra-slice interrupts.
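The slice-wide propagation just described can be sketched as follows. This is a hypothetical illustration (invented names); the patent leaves the mechanism implementation-defined, mentioning a shared slice-ASIC register or inter-processor, intra-slice interrupts as possibilities.

```python
def propagate_probation(probation, slice_members):
    """probation: dict mapping processor-element id -> probation bit.
    slice_members: the element ids sharing one physical slice.
    Returns an updated copy: if any member of the slice has its bit set,
    the bit is set for every member of that slice."""
    updated = dict(probation)
    if any(probation.get(pe, False) for pe in slice_members):
        for pe in slice_members:
            updated[pe] = True
    return updated
```

A slice-level fault detected by one logical processor thereby places every element on the slice on probation, without requiring agreement among processor elements.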
  • ASIC application-specific integrated circuit
  • the control element 506 interjects a delay between equivalent disparity detection and computing device operation termination.
  • a delay is imposed prior to declaring the situation an error condition.
  • operation is held in limbo. The operation is neither completed nor aborted.
  • the delay enables the tie to be broken by control logic setting the probation bit after the error but before the timeout elapses. The delay does not occur, and therefore does not add any latency, in full agreement cases, and in simplex and triplex cases.
  • a sufficiently small delay may be on the order of tens or hundreds of microseconds.
  • the logical synchronization unit 512 , when the delay period begins, sends an interrupt to all participating slices to inform the slices that a miscomparison has occurred, although an error has not yet been declared.
  • the interrupt is in addition to the interrupt that is generated at the end of the delay, when the voting error is final, either major or minor.
  • if both probation bits are enabled or both probation bits are disabled, then a major error exists and the operation is aborted. If the bits are in opposite states, then the slice with the probation bit off or reset is obeyed, and the slice with the probation bit on or set is ignored. If the probation bits are used to break the tie, then a minor error is declared.
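The duplex tie-break rule can be stated compactly. A minimal sketch, with illustrative names; 'A' and 'B' stand for the two slices:

```python
def break_duplex_tie(probation_a, probation_b):
    """probation_a, probation_b: probation bits of the two slices.
    Returns (winner, error_level); winner is None when the operation
    must be aborted."""
    if probation_a == probation_b:
        # both bits on, or both off: the tie cannot be broken
        return None, "major_error"
    # obey the slice that is NOT on probation; ignore the other
    winner = "B" if probation_a else "A"
    return winner, "minor_error"
```

Because each slice sets only its own bit, a single slice that has accumulated evidence of its own unreliability effectively concedes the vote in advance.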
  • the policy for setting probation bits and the duration that the bit setting is maintained are implementation-specific.
  • the logical synchronization unit 512 reports all errors, both major and minor, to control logic, such as software via an interrupt and status register.
  • status register bits indicate that the tie-break mechanism has been invoked and designate which processor element or slice is obeyed.
  • a flow chart depicts an embodiment of an error handling method 600 in a redundant-processor computing device during programmed input/output (PIO) voting in a duplex configuration.
  • the method 600 comprises detecting equivalent disparity 612 , and 610 , 614 among processor elements of the computing device, and responding to the detected equivalent disparity by evaluating 624 , 634 secondary considerations of processor fidelity.
  • the method can further comprise determining whether evaluation 624 , 634 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 626 and 628 , 636 and 638 . Delay can be inserted 620 and 622 , 630 and 632 between equivalent disparity detection and computer device operation termination 626 , 636 .
  • control logic receives 602 a programmed input/output (PIO) request from one processor element (PE), buffers 604 the request, and starts 606 a timer. If a second request is received 608 , the two requests are compared 610 . Otherwise, if the timer has not elapsed 612 , whether the second request is received is determined 608 . If the timer has elapsed 612 , analysis of secondary considerations of processor fidelity begins.
  • PIO: programmed input/output
  • the secondary conditions of processor fidelity can be evaluated during processor element execution, and a probation vector can be set according to the evaluation prior to determination of major or minor error. If probation bits are equal 624 , the operation is aborted 626 since the tie or disparity condition cannot be resolved and a major error interrupt is sent to the processor elements. The method is terminated 628 .
  • the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed.
  • a minor error interrupt is sent 650 to the processor elements, and the method completes 652 with a suitable minor error handling technique.
  • the minor error is addressed by marking the loser of the voting decision as no longer participating in the logical processor. Subsequent input/output operations or other accesses to the voter are ignored.
  • software processing in the loser is interrupted and software executing in remaining processor elements shuts down the offending processor element.
  • Control logic sends 630 a “tie break pending” interrupt to the processor elements, and waits 632 the configured time. If probation bits are equal 634 , the operation is aborted 636 since correct operation cannot be determined. A major error interrupt is sent to the processor elements. The method is terminated 638 .
  • control logic determines whether the processor element requesting the PIO operation is on probation 640 . If the processor element is on probation 640 , the control logic follows direction 642 of the processor element that is not on probation and ignores or aborts the PIO operation, sends 644 a minor error interrupt to the processor elements, handles the minor error, and terminates 646 the method. If the processor element requesting the PIO is not on probation 640 , the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method is complete 652 .
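The decision logic of method 600, once a duplex PIO disparity is detected, can be condensed into one function. This is a hypothetical sketch (invented names); the numeric references in the comments point back to the flowchart steps above.

```python
def resolve_pio_disparity(kind, initiator_on_probation, other_on_probation):
    """kind: 'timeout' (only one PE issued the PIO before the timer
    elapsed) or 'miscompare' (both issued PIOs that do not compare).
    The probation-bit arguments refer to the initiating PE and to the
    other PE. Returns (action, interrupt)."""
    if initiator_on_probation == other_on_probation:
        # equal bits: the tie cannot be resolved (624/634); abort (626/636)
        return "abort", "major_error"
    if initiator_on_probation:
        # timeout case: the initiator is the rogue, so its PIO is
        # ignored (642); miscompare case: the other PE's PIO is performed
        action = "ignore_pio" if kind == "timeout" else "perform_other_pio"
        return action, "minor_error"
    # initiator not on probation: its PIO is performed (648)
    return "perform_pio", "minor_error"
```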
  • a flow chart depicts an embodiment of an error handling method 700 in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration.
  • the method 700 comprises detecting equivalent disparity 726 and 728 among processor element memories of the computing device, and responding to the detected equivalent disparity by evaluating 738 secondary considerations of processor fidelity.
  • the method can further comprise determining whether evaluation 738 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 740 and 742 . Delay can be inserted 736 between equivalent disparity detection and computer device operation termination 740 , 742 .
  • control logic receives 702 a direct memory access (DMA) read request from a network agent such as a system area network (SAN) agent, replicates and forwards 704 the request to both processor elements, and starts 706 a timer. If a first response is received 708 , the first response is buffered 716 . In some embodiments, the timer may be restarted as the first response is buffered 716 . If the timer has not timed out 710 , whether the first response is received is again determined 708 . If the timer has timed out 710 , the operation is aborted 712 and a major error interrupt is sent to both processor elements. The major error interrupt is indicative of a double timeout. The method is then terminated 714 .
  • the first response is received 708 and is buffered 716 . If the second response has been received 718 , the first and second response data are compared 726 . Otherwise, if the timer has not timed out 720 , whether the second response has been received is again determined 718 . If the timer has timed out 720 , the operation is completed 722 using data from the first response and a minor error interrupt is sent to both processor elements. The single timeout condition is indicative of a “self-signaling error”. The method is then terminated 724 .
  • the second response is received 718 and data from the first and second responses is compared 726 . If the first and second response data are equal 728 , the data match and the operation is completed 730 with no error.
  • the method completes successfully 732 . Otherwise, the first and second response data are unequal 728 and a “tie break pending” interrupt 734 is sent to the processor elements. A delay is inserted 736 to wait for a configured time. Probation bits are read to determine whether the probation bits are equal 738 . If so, the operation is aborted 740 since the tie cannot be resolved using the probation bits and a major error interrupt is sent to the processor elements. The method terminates unsuccessfully 742 .
  • if the probation bits are not equal 738 , the operation is completed 744 using the response from the processor element that is not on probation.
  • a minor error interrupt is sent 746 to the processor elements, the minor error handled by marking the loser for removal or shutting down the offending processor element, and the method is terminated 748 .
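The outcomes of method 700 can be summarized in a compact sketch. This is a hypothetical illustration (invented names); the numeric comments refer to the flowchart steps above. Each response argument is the response data, or None if that element timed out.

```python
def resolve_dma_read(resp_a, resp_b, a_on_probation, b_on_probation):
    """Decide how the voter completes a replicated duplex DMA read.
    Returns (action, error_level, data)."""
    if resp_a is None and resp_b is None:
        return "abort", "major_error", None      # double timeout (712)
    if resp_a is None or resp_b is None:
        # single timeout: a self-signaling error; complete with the
        # surviving response (722)
        data = resp_a if resp_a is not None else resp_b
        return "complete", "minor_error", data
    if resp_a == resp_b:
        return "complete", "no_error", resp_a    # data match (730)
    if a_on_probation == b_on_probation:
        return "abort", "major_error", None      # unresolvable tie (740)
    # obey the element that is not on probation (744)
    data = resp_b if a_on_probation else resp_a
    return "complete", "minor_error", data
```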

Abstract

In a redundant-processor computing device, an error handling method comprises detecting equivalent disparity among processor elements of the computing device, and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.

Description

    BACKGROUND
  • System availability, scalability, and data integrity are fundamental characteristics of enterprise systems. A continuous performance capability is imposed in financial, communication, and other fields that use enterprise systems for applications such as stock exchange transaction handling, credit and debit card systems, telephone networks, and the like. Highly reliable systems are often implemented in applications with high financial or human costs, in circumstances of massive scaling, and in conditions in which outages and data corruption cannot be tolerated.
  • Some systems combine multiple redundant processors running the same operations so that an error in a single processor can be detected and/or corrected. Results attained for each of the processors can be mutually compared. If all results are the same, all processors are presumed, with high probability of correctness, to be functioning properly. However, if results differ, analysis is performed to determine which processor is operating incorrectly. Results from the multiple processors can be “voted” with the “winning” result determined to be correct. For example, a system with three processor elements typically uses the result attained by two of the three processors.
  • A difficulty arises for duplex systems with two executing processors since the even number of processor elements can result in a “voting tie” situation that may lead to aborted operation and outage. Ties can be avoided by running an odd number of processors, although a single processor does not have the fault detection capability provided by voting. A three or more processor system adds product cost.
  • SUMMARY
  • In accordance with an embodiment of a redundant-processor computing device, an error handling method comprises detecting equivalent disparity among processor elements of the computing device, and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings:
  • FIG. 1 is a schematic block diagram that illustrates an embodiment of a control apparatus for usage in a redundant-processor computing device and having capability to resolve a mutual disparity or tie condition;
  • FIG. 2 is a schematic block diagram depicting an embodiment of a computing system with capability to resolve disparity and break ties among a plurality of processor elements using a probation vector;
  • FIG. 3 is a schematic block diagram illustrating an embodiment of a computing system configured in a redundant-processor arrangement that imposes a selected-duration short delay in the event of a disparity or tie condition;
  • FIG. 4 is a schematic block diagram showing an embodiment of a processor complex within which the illustrative error handling system may be implemented;
  • FIG. 5 is a schematic block diagram showing an embodiment of a computing system capable of detecting equivalent disparity among the processor elements and responding by evaluating secondary considerations of processor fidelity;
  • FIG. 6 is a flow chart depicting an embodiment of an error handling method in a redundant-processor computing device that has tie-breaking capability during programmed input/output (PIO) voting in a duplex configuration; and
  • FIG. 7 is a flow chart illustrating an embodiment of an error handling method in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration.
  • DETAILED DESCRIPTION
  • A processor may incorporate multiple, redundant, loosely-coupled processor elements for error detection. A duplex arrangement using two processor elements is susceptible to a “voting tie” situation. Ties may be avoided by using an odd number of processors, at the expense of fault detection capability if a single processor element is used, or of the added cost of incorporating additional processor elements. The illustrative system and method may use other information to resolve conflicts and break ties. Accordingly, an effective processor may be configured using only two processor elements for voting or comparison.
  • Referring to FIG. 1, a schematic block diagram illustrates an embodiment of a control apparatus 100 for usage in a redundant-processor computing device 102. The control apparatus 100 is operative in a configuration with a plurality of processor elements 104A and 104B and can resolve a mutual disparity or “tie” condition among processor elements. The control apparatus 100 can be used to break ties in the case of voting error, for example with an even number of processor elements, using other available information. The control apparatus 100 includes a control element 106 that detects equivalent disparity among the processor elements 104A, 104B and responds by evaluating secondary considerations of processor fidelity.
  • The control element 106 determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements 104A, 104B and, if so, terminates computing device operations.
  • The computing device 102 can be a computer processor that uses multiple, redundant, loosely-synchronized processor elements 104A, 104B to detect and manage errors. A configuration with an even number of processor elements 104A, 104B is susceptible to a voting “tie” condition in which actions or results from the processor elements differ. For example, a computing device 102 may have two processor elements 104A, 104B so that any disparity is equivalent and results in a tie condition. Typically, an odd number of processor elements, for example three, can be used at added cost to avoid ties.
  • In some situations, other information which may be called secondary considerations of fidelity may be available to resolve the disparity and break the tie. The other information is heuristic data which is sufficiently predictive to be trusted for disparity resolution. If the tie cannot be broken by use of the other information, then the error is considered sufficiently serious that the processor is halted due to an inability to guarantee correctness of either of the unequal voted data items.
  • Some embodiments may include a control element 106 that evaluates the secondary conditions of processor fidelity while the processor elements 104A, 104B are executing before equivalent disparity is detected and sets a probation vector 108 according to the evaluation. For example, the probation vector 108 may be implemented in a voting unit 110 and used by the voting unit 110 to resolve disparities and break ties in predetermined conditions. In a particular example, the probation vector 108 can have one bit of state per processor element 104A, 104B. Control logic, such as software, executing in each processor element 104A, 104B can set the bit in conditions that the logic has accumulated information for usage in breaking future ties, or very recent ties. The control logic can periodically reset the probation vector bits.
  • The voting unit 110, upon detecting a disparity or tie condition, may delay acting upon the condition or declaring a fatal error situation. Instead, the voting unit 110 can hold the compared values for a short duration time period before acting. Accordingly, the control element 106 can interject a delay between equivalent disparity detection and termination of computer device operation. The delay enables control logic, for example software, to possibly detect other errors or gather information pertinent to resolving the disparity or breaking the tie. The delay can also break potential race conditions. For example, if a self-detectable error occurs simultaneously or nearly the same time as a misvote, the delay enables further collection of information or analysis before the voter declares the misvote condition, enabling recognition of the error and resolution of the vote. In a particular duplex embodiment, if the control logic sets one of the two bits in the probation vector 108 during the short delay imposed by the voting unit 110, then the voting unit 110 resolves the disparity or breaks the tie in favor of the processor element, either 104A or 104B, that is not on probation. While the condition remains an error condition, the error is made recoverable. If the control logic does not set the bit in the probation vector prior to or during the delay, the error is considered to be fatal to the computing device 102, and operation is halted.
  • In a particular embodiment, the computing device 102 can include a logical synchronization unit 112 that contains the voting unit 110 and input/output interfaces 114A and 114B. For example, the interfaces 114A and 114B may include a programmed input/output (PIO) interface and a direct memory access (DMA) interface. A possible equivalent disparity or tie condition may include a condition of a first processor element that performs a programmed input/output (PIO) action while a second processor element does not. A second example of an equivalent disparity or tie condition may be a miscompare on voted data whereby data supplied by the two processor elements 104A, 104B does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action. Other equivalent disparity or tie conditions include a first processor element performing a PIO read while a second processor element performs a PIO write, or first and second processors reading or writing different addresses.
  • Referring to FIG. 2, a schematic block diagram depicts an embodiment of a computing system 200 with capability to resolve disparity among a plurality of processor elements 202A, 202B configured in a redundant-processor arrangement. The computing system 200 further includes a probation vector 204 coupled to the processor elements 202A, 202B and has a signal allocated to each of the processor elements 202A, 202B. A control element 206 is coupled to the processor elements 202A, 202B and evaluates processor fidelity, setting the probation vector 204 according to results of the evaluation.
  • The probation vector 204 is used to monitor secondary considerations of processor element fidelity before an error is detected, when more abundant information relating to processor element conditions and functionality is available. In contrast, a system that does not begin acquiring status information until an error is detected may have more limited functional capabilities, and a possible inability to perform actions diagnostic of processor element fidelity. Acquisition of status and operational information while the processor elements 202A, 202B are executing in due course simplifies operation because information is merely noted when available. Higher complexity algorithms that execute following error or disparity detection and require information to be evoked or stimulated, as well as monitored, can be avoided.
  • The computing system 200 also includes a voter 208 that is coupled to the plurality of processor elements 202A, 202B and compares actions taken by the processor elements 202A, 202B to determine disparity in processor actions. A control element 210 responds to disparity among the processor elements 202A, 202B based on the probation vector 204. In some embodiments, the control element 210 can interject a delay between disparity detection and computer device operation termination to allow monitoring of additional information that may be useful in resolving disparity and breaking ties. The control element 210 can also determine whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements 202A, 202B and, if so, terminate computing system operations.
  • Referring to FIG. 3, a schematic block diagram illustrates an embodiment of a computing system 300 including a plurality of processor elements 302A, 302B configured in a redundant-processor arrangement, and control logic 304 that imposes a selected-duration short delay in the event of a disparity or tie condition. The control logic 304 compares actions taken by the processor elements 302A, 302B and determines disparity in the actions, then waits the selected delay duration after equivalent disparity detection before initiating an action in response to the disparity condition. The control logic 304 may respond after the delay according to evaluation of secondary considerations of processor fidelity.
  • The selected delay has a duration sufficient to enable simultaneous or near-simultaneous arrival of information for use in resolving the disparity condition and/or breaking the tie. The delay is long enough for logic to receive a high priority interrupt and perform a few computations, and long enough for information to be acquired at the same time as, or very close to, the occurrence of an error. Typical delay duration, using current processor operating speeds and technology, is on the order of tens or hundreds of microseconds, sufficient to handle the interrupt and execute hundreds or thousands of instructions. The delay may assist in avoiding or breaking race conditions.
  • The selected delay also has an upper limit. A lengthy examination of error state or diagnostic execution may not be acceptable. During the time between disparity and selection of the winning processor, the programmed input/output (PIO) operation or direct memory access (DMA) operation is suspended, possibly disrupting communications with other processors as well as communications on the interprocessor network due to backpressure. Such disruption is generally not desirable and is avoided by imposing an upper limit on the delay duration.
  • Referring to FIG. 4, a schematic block diagram shows an embodiment of a processor complex 400 within which the illustrative error handling system may be implemented. In a redundant-processor arrangement, the processor complex 400 includes a plurality of logical processors 408, each a computing engine capable of executing processes and implemented using one or more processor elements 402, each in a different processor slice 410, combined with one or more logical synchronization units (LSUs) 412. Each processor element 402 is a single microprocessor or microprocessor core capable of executing a single instruction stream. A processor slice 410 typically comprises one or more processor elements 402, each with a dedicated memory 414 or sharing a partitioned memory. A processor complex 400 comprises one or more logical processors 408 and one or more processor slices 410. Within a complex 400, each slice 410 has the same number of processor elements.
  • A processor complex with one slice is called a simplex complex. A two-slice processor complex is called a duplex, dual modular redundant, or DMR complex. A three-slice processor complex is called a triplex, tri-modular redundant, or TMR complex. A processor complex 400 includes both processor elements 402 and corresponding logical synchronization units 412.
  • A computing system comprises one or more logical processors 408. The computing system also comprises one or more processor complexes 400. The processor complexes 400 are interconnected via a network, for example a System Area Network (SAN), a local area network (LAN), a wide area network (WAN), or the like, or simply a connection to a bus. The network is used for connection to both other processors and to input/output (I/O) devices. Voting or output data comparison is performed for all data transfers between the logical processor and the network or the network I/O adapter.
  • In a logical processor 408, one, two, three, or possibly more processor elements cooperate to perform logical processor operations. Cooperative actions include coordinating or synchronizing mutually among the processor elements, exchanging data, replicating input data, and voting on operations and output data selection. In the illustrative embodiment, the various cooperative actions can be implemented within or supported by implementation of the logical synchronization units 412.
  • Referring to FIG. 5, a schematic block diagram illustrates an embodiment of a computing system 500 comprising a plurality of processor elements 502A, 502B configured in a redundant-processor arrangement, and a voter 504 coupled to the plurality of processor elements 502A, 502B that compares actions taken by the processor elements and determines disparity in the actions. The computing system 500 further includes a control element 506 coupled to the processor elements 502A, 502B and the voter 504 that detects equivalent disparity among the processor elements 502A, 502B and responds by evaluating secondary considerations of processor fidelity. The illustrative computing system 500 may comprise a single logical processor among multiple logical processors in a complex or system.
  • The computing system 500 may also include a logical synchronization unit 512 comprising the voter 504 and a network interface 520, such as a SAN interface that can be configured to connect to the network 518 by one or more ports. For example, the network connection may be made as shown in FIG. 5 via an X-fiber port and a Y-fiber port, although wire ports may otherwise be used.
  • The logical synchronization unit 512 maintains multiple bit sets that represent, at any point in time, the set of processor elements that are enabled to perform selected operations. The voter 504 includes a plurality of multiple-bit configuration registers that indicate which processor elements are expected and enabled to participate in selected operations, for example programmed input/output (PIO) operations by the processor elements 502A, 502B, and network interface 520 operations including direct memory access (DMA) output voting and DMA input replication.
  • The voter configuration bits represent which processor elements are meant to be “assigned” to a logical processor and are therefore eligible for performing various operations such as output voting operations. Configured processor elements are defined, for any particular operation type, as the set of processor elements enabled by the configuration bits in the logical synchronization unit 512 to perform that operation type. The logical synchronization unit 512 ignores operations of non-configured processor elements on an operation-type basis. Configuration bits are set whenever processor elements “join” the configuration of a logical processor 508. Configuration bits are reset or cleared when processor elements leave the configuration, for example by being voted out.
  • Participating processor elements are the set of configured processor elements, for a particular instance of an operation, that actually perform the operation at a particular time, within a timeout period. The set of configured processor elements is related to a type of operation while the set of participating processor elements is related to an individual instance of an operation. The configured processor elements have outbound data transfers voted. Participating processor elements are the elements that actually issue a particular outbound data transfer.
  • For a particular voted operation, all configured processor elements are expected to participate. A processor element that does not participate times out on the operation.
  • Generally, voted operations can result in conditions including full agreement, timeout, simple miscompare, tie miscompare, and full miscompare. In full agreement, all configured processor elements participate within a reasonable time and supply identical data for an identical operation so that all voted data matches. In a timeout condition, one or more configured processor elements do not participate in time. For a simple miscompare, a strict majority of configured processor elements supplies identical data, and a strict minority supplies different data. A strict majority is greater than half, for example one in simplex, two in duplex, and two or three in triplex. A tie miscompare occurs when an even number of processor elements, for example two, are configured and the data does not compare. For a full miscompare, all sets of voted data, for example all three in triplex, miscompare pair-wise so that no strict majority results. The different types of conditions are mapped into one of three error conditions including no error, minor error, and major error.
  • For no error, the operation proceeds and no error is reported. For a minor error, the operation proceeds but an error is reported. Minor errors include timeouts or disagreements among voted data for which available information is sufficient to enable resolution. Triplex configurations in which two processors are in agreement, and duplex configurations in disagreement when other information is available to resolve the disagreement, are examples of minor error conditions. For a major error, the operation does not proceed, but rather is aborted, and an error is reported. Major errors include timeouts and disagreements among voted data for which the condition cannot be resolved. Triplex configurations with three-way disagreement or timeouts in two of the three processors, and duplex configurations with disagreeing processors and no available information for resolution, are examples of major error conditions.
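As an illustrative sketch only, and not part of the claimed embodiment, the voting outcomes and their mapping to error severities described above might be expressed as follows; the function name, data shapes, and severity labels are hypothetical, and timeout resolution (which depends on context) is handled separately:

```python
from collections import Counter

def classify_vote(responses):
    """Classify one voted operation. `responses` maps a processor element id
    to the data it supplied, or None if that element timed out."""
    if any(d is None for d in responses.values()):
        return "timeout"
    counts = Counter(responses.values())
    top = counts.most_common(1)[0][1]          # size of largest agreeing set
    if top == len(responses):
        return "full agreement"
    if top > len(responses) // 2:              # strict majority (2 of 3, etc.)
        return "simple miscompare"
    if len(responses) % 2 == 0:
        return "tie miscompare"                # e.g. duplex one-versus-one
    return "full miscompare"                   # e.g. triplex pair-wise miscompare

# Mapping of vote conditions to the three error severities; the tie
# miscompare is major only absent other resolving information.
SEVERITY = {
    "full agreement":    "no error",
    "simple miscompare": "minor error",
    "tie miscompare":    "major error",
    "full miscompare":   "major error",
}
```

For example, a triplex vote of identical data from two elements and different data from the third classifies as a simple miscompare, which proceeds with a minor error reported.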
  • In addition to timeouts and disagreement among redundant processors, other errors that may be handled using the illustrative techniques include breaks in cabling, for example between the processor elements and voter. Such errors can often be self-signaling. Self-signaling errors are defined as errors detected by an explicit detection element or mechanism as distinguished from implicit detection techniques such as voting. Various types of errors may be self-signaling. Examples of self-signaling errors include direct memory access (DMA) read timeouts, errors detected by parity or other error checking codes, loss-of-signal in optical signals, and loss of electrical continuity for electrical signals. With respect to the various illustrative systems, voting detects errors, but in a duplex configuration voting alone does not distinguish which voted data is correct and which incorrect. Self-signaling errors designate which of the two data suppliers is incorrect.
  • The processor elements 502A, 502B each have a memory 514 or share a partitioned memory with other processor elements in the same slice.
  • The PIO operations are processor-initiated reads (loads) or writes (stores) to any part of the processor's address space other than “main memory”. The address space may contain registers, pseudo-registers, and memory other than main memory. PIO operations may be targeted to resources in either the voter 504 or to a network interface 520. Operations targeted to the voter 504 may be either unvoted or voted (compared), depending on the address of the register being accessed. Operations to the network interface 520 may always be voted in some implementations.
  • When any configured processor element initiates a voted PIO read, the voter 504 captures the operation and read address and sets a timer. The voter 504 waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation and address are compared, for example in a bit-by-bit comparison of the entire operation and address. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances in which the illustrative error handling and tie-breaking technique is not implemented or is disabled, operation can be described as follows. If all configured processor elements participate, then for the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data, if any. A simple miscompare is handled as a minor error. Otherwise (not full agreement or simple miscompare), the operation is aborted, meaning not performed, a “bus error” is returned to all requesting processor elements, and a major error is reported. If the operation is not aborted, then if the PIO read is targeted to the network interface 520, the voter 504 forwards the operation and address to the network interface 520 and waits for a response. If the operation is targeted to the voter 504, then the voter 504 accesses data directly. The response data, when available, is replicated by the voter 504 and sent as a response to all participating processor elements at approximately the same instant.
  • When any configured processor element initiates a voted PIO write operation, the voter 504 captures the operation, write address, and write data, then sets a timer and waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation, write address, and write data are bit-by-bit compared. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances in which the illustrative error handling and tie-breaking technique is not implemented or is disabled, the operation is as follows. If all configured processor elements participate, in the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data. A simple miscompare is handled as a minor error. Otherwise (not full agreement or simple miscompare), the operation is aborted, meaning not performed. No direct response is made to the processor element because no response or acknowledgement is normally made to write operations. When the operation is aborted, a major error is reported. An operation that is not aborted is handled according to the target address of the write operation. For a PIO write to the network interface 520, the voter 504 forwards the operation, address, and data to the network interface 520. For a PIO write to the voter 504, the voter 504 performs the write operation directly. No response is made to the processor element. Because of possible side-effects of PIO write operations, and because software does not necessarily verify the effect, or success, of each write operation, when a PIO write operation is aborted due to a major voting error, the voter 504 suspends all future PIO write operations, which are also aborted, until the software detects the error and re-enables PIO voting. Typically the error is detected by handling an error interrupt.
Note, however, that in the triplex case for a simple miscompare, such as when one processor element writes and two processor elements time out, no requirement is made to abort all future programmed I/O write operations.
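The suspend-until-re-enabled behavior for aborted PIO writes described above can be sketched as follows; the class and method names are hypothetical, and the vote outcome is assumed to have been determined elsewhere:

```python
class PioWriteVoter:
    """Tracks PIO-write suspension: after a write is aborted for a major
    voting error, all later writes are also aborted until control logic
    (software) handles the error interrupt and re-enables PIO voting."""

    def __init__(self):
        self.suspended = False

    def submit(self, vote_outcome):
        """vote_outcome is 'no error', 'minor', or 'major' for this write."""
        if vote_outcome == "major":
            self.suspended = True      # suspend all future PIO writes
            return "aborted"
        if self.suspended:
            return "aborted"           # still suspended from an earlier error
        return "performed"

    def reenable(self):
        """Called by software after detecting and handling the error."""
        self.suspended = False
```

The design choice reflected here is that write operations may have side-effects that software does not verify individually, so a single unresolved miscompare poisons all subsequent writes until software intervenes.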
  • PIO operations are initiated by the processor elements, as contrasted to DMA operations which are initiated by the network interface 520. Therefore, PIO timeouts are possible in two different circumstances. In a first circumstance, one or two processor elements, operating correctly, initiate a PIO operation, and one processor element, operating incorrectly or stopped, fails to perform the PIO operation. The error is detected when the timer expires. Without further information, the processor element operating incorrectly may be indeterminable, for example when two processor elements are configured and one times out.
  • In a second circumstance, one processor element, operating incorrectly, initiates a PIO operation that should not occur, and the other processor element or processor elements, operating correctly, do not initiate a PIO operation. Again the error is detected when the timer expires, although without further information the processor element operating incorrectly may be indeterminable, for example for two active processor elements, one of which times out. Accordingly, the processor elements that do not participate are not necessarily incorrect.
  • In the triplex case, the strict majority is always trustworthy, so that if one processor element times out and the remaining two processor elements perform a PIO operation, then the processor element that times out is in error and is ignored, while the PIO operation proceeds and the data is voted with a minor error indicated. Also in the triplex case, if two processor elements time out when a single processor element performs a PIO operation, then the processor element that performs the PIO operation is in error and the PIO is ignored. The lone processor element can be considered a “rogue processor element” and the attempted PIO operation is called a “rogue operation”.
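The triplex timeout outcomes above reduce to a simple majority rule, sketched here for illustration only (the function name and return labels are hypothetical):

```python
def triplex_pio_timeout(performers):
    """Outcome of a PIO timeout in a triplex (three-element) configuration.
    `performers` is how many of the three configured elements issued the PIO
    within the timeout period."""
    if performers == 3:
        return ("proceed", "no error")
    if performers == 2:
        # the strict majority is trustworthy; the timed-out element is in error
        return ("proceed", "minor error")
    # a lone "rogue processor element" issued a "rogue operation": ignore it
    return ("ignore", "rogue operation")
```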
  • In the duplex case, whether the processor element performing the PIO or the processor element that is not participating is in error cannot be determined, without other evidence. Accordingly a tie or disparity condition exists.
  • Some embodiments of the computing system 500 further comprise a two-processor-element configuration 502A, 502B, a programmed input/output (PIO) interface 522, and a direct memory access (DMA) interface 524 coupled to the voter 504. An action disparity that is detectable by the control element 506 is a miscompare on voted data with non-matching data supplied by the two processor elements 502A, 502B on either a PIO action or a DMA action.
  • Direct memory access (DMA) reads are outbound operations initiated by the network interface 520. The voter 504 replicates and forwards the DMA read operation and address to all configured processor elements at approximately the same instant, subject to congestion delays in the different slices that may cause an operation to arrive at the processor elements at slightly different times. The voter 504 then starts a timer and waits for the responses. Response data flows from the processor elements 502A, 502B, through the voter 504, to the network interface 520. Responses arriving from the configured processor elements are buffered in the voter 504 rather than being sent immediately to the network interface 520. The later responses, upon arrival, are compared with the earlier responses saved in the data buffers. When a strict majority of the responses agree, one copy of the data, from one of the agreeing responses, is communicated to the network interface 520. If a strict majority of the responses does not agree, then data is not sent to the network interface 520 in a manner that can be interpreted as valid data.
  • Any processor element that does not respond within the timeout period is declared to have timed out. Unlike PIO timeouts, no rogue situations can occur with DMA timeouts because the DMA operation is initiated through the network interface 520. Accordingly, if one processor element times out, that processor element is necessarily erroneous in both duplex and triplex cases. The condition is considered a minor error, and the DMA operation proceeds using data supplied by the processor element or processor elements that do not time out. If two processor elements time out, or the only processor element in a simplex case, then a major error is indicated and the DMA operation is aborted. The voter 504 generates an error notification interrupt to all configured processor elements in the case of any disagreement or timeout, whether or not data is successfully forwarded to the network interface 520. The interrupt indicates which processor elements timed out, if any, and all comparison results.
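The DMA read response handling above (buffering, majority comparison, and timeout treatment, without the separate probation tie-break) might be sketched as follows; the function name and severity labels are hypothetical:

```python
from collections import Counter

def vote_dma_read(responses):
    """responses: processor element id -> buffered response data, or None
    for a response timeout. Returns (data_forwarded, severity)."""
    arrived = [d for d in responses.values() if d is not None]
    timeouts = len(responses) - len(arrived)
    if timeouts >= 2 or not arrived:
        return (None, "major")     # abort; no valid data sent to the interface
    data, top = Counter(arrived).most_common(1)[0]
    if len(arrived) > 1 and top <= len(responses) // 2:
        return (None, "major")     # no strict majority among the responses
    severity = "none" if (timeouts == 0 and top == len(responses)) else "minor"
    return (data, severity)        # one copy of the agreeing data is forwarded
```

Note that a single DMA timeout is self-signaling: the timed-out element is necessarily in error, so the operation proceeds with the remaining data and only a minor error is indicated.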
  • Direct memory access (DMA) writes are inbound operations with data flowing from the network interface 520 through the voter 504 to the processor elements 502A, 502B. DMA write operations are initiated by the network interface 520. The voter 504 replicates and forwards the operation, address, and data to all configured slices at approximately the same instant. No response is made to the network interface 520.
  • The voted PIO operations and DMA responses are protected by timeouts. PIO and DMA timeout values are both configurable by control logic, such as software, and may be different.
  • For PIO timeouts, the timer is started when the first PIO operation arrives. The timer is restarted when each operation arrives, giving a full timeout period for the later arrivals relative to the earlier ones, a behavior used in all implementations due to the possibility of a PIO operation being initiated by a “rogue processor element”. PIO operations can never time out in the simplex case because the operation originates from the processor element.
  • For DMA read response timeouts, DMA operations can time out in simplex, duplex, and triplex configurations because the operation is originated from the network interface 520. The timer is started when the DMA request is forwarded by the voter 504 to memories of all processor elements. The timer may optionally be restarted when each response arrives, giving a full timeout period for the later responses. Alternatively, a single timeout interval may be applied to all configured processor elements. Either option is possible since no rogue DMA operations can occur.
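The restart-on-arrival timer behavior described for PIO (and optionally for DMA responses) can be sketched with logical time units; the class name, method names, and units are hypothetical:

```python
class VoteTimer:
    """Per-operation timeout with restart-on-arrival semantics. The period
    is a configurable parameter; time is modeled in abstract units."""

    def __init__(self, period):
        self.period = period
        self.deadline = None

    def arrival(self, now):
        # (re)start on each arrival, granting later arrivals a full
        # timeout period relative to the earlier ones
        self.deadline = now + self.period

    def expired(self, now):
        return self.deadline is not None and now > self.deadline
```

Restarting on each arrival matters for PIO because a rogue processor element may initiate an operation the others never will; the other elements still get a full period to participate after each arrival before a timeout is declared.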
  • In an illustrative embodiment, special case handling can be used when a PIO timeout occurs or a miscompare occurs on voted data in a duplex configuration. In the illustrative example, duplex tie handling generally is inapplicable to DMA timeouts, and to simplex or triplex configurations. Two disparity or tie conditions include a PIO timeout in which one processor element performs a PIO operation and the other processor element does not, and a miscompare on voted data in which data supplied by the two processor elements does not match, either on a PIO operation or a DMA read operation.
  • In the absence of any other information, a tie or disparity condition in a duplex configuration is ambiguous whereby the trustworthiness of each processor element is not obvious, leading to a typical policy of halting the logical processor.
  • However, occasionally, other information, termed secondary considerations of processor fidelity, may exist. The other information may be considered sufficiently strong, even if circumstantial, to implicate one of the two processor elements. For example, a recent logged history of other detected recoverable errors may be indicative of a degradation of processor element reliability.
  • Another example of pertinent reliability information is a recent history of processor replacement. A newly-reintegrated processor element or a new slice may be a more likely source of error than an element that has long been installed without a history of error. Such early life problems are frequently discovered within a short time, on the order of minutes, following installation.
  • A further example of reliability information is inherent in the multiple-dimensional configuration of logical and physical processors, for example as shown in FIG. 4. Processor slices with multiple processor elements connected physically but not logically include hardware that is shared within a processor slice. Hardware may be shared among logical processors. When shared hardware ceases functioning correctly, errors such as intermittent errors can occur. If errors are detected in one logical processor but not another, information about the errors can be communicated between processor elements in a processor slice so that all processor elements within the slice have sufficient information to break any ties that occur.
  • Some embodiments of the computing system 500 further comprise a probation vector 526 coupled to the voter 504 and coupled to the processor elements 502A, 502B. The probation vector 526 holds a signal for each processor element 502A, 502B. A processor control policy executable on the processor elements 502A, 502B evaluates the secondary considerations of processor fidelity during processor element execution. The processor control policy sets bits in the probation vector according to the evaluation of secondary considerations before determining whether a major or minor error has occurred. In a particular embodiment, the probation vector 526 enables implementations to supply the other information to the voter 504 in such a way that duplex tie conditions can be simply resolved. The probation vector 526 comprises one bit for each processor element 502A, 502B. The logical synchronization unit 512 uses the probation vector 526 as a tie-breaker in some conditions.
  • In an illustrative embodiment, PIO writes to the probation vector 526 are not voted. The initial or reset value of the probation vector 526 is all zero. Each processor element 502A, 502B can set or reset the probation bit only for the processor slice associated with the processor element, and not for any other slice. The probation vector 526 is not used in simplex and triplex modes for which the bits may still be set by the control logic, but are ignored.
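The probation vector's reset value and own-slice-only write rule described above can be sketched as follows; the class and method names are hypothetical, and the ignore-on-foreign-write behavior is an assumption consistent with each element controlling only its own bit:

```python
class ProbationVector:
    """One unvoted bit per processor element; reset value is all zero.
    Each element may set or reset only the bit for its own slice."""

    def __init__(self, num_elements):
        self.bits = [0] * num_elements

    def write(self, writer, target, value):
        """Unvoted PIO write of one probation bit. Writes targeting
        another slice's bit are rejected."""
        if writer != target:
            return False
        self.bits[target] = 1 if value else 0
        return True
```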
  • The probation vector 526 is ignored if both processor slices agree, indicating no error. The probation vector 526 is also ignored in the case of self-signaling errors, which indicate the error source so that tie-breaking is superfluous. Self-signaling errors are thus classified as minor errors. Such errors occur when the voter 504 can be certain that one slice should be ignored, for example in an outbound DMA read response, when one slice does not respond and times out, or when the slice supplies data marked as “known bad”. “Known bad” data relates to another example of self-signaling error and is data returned from a processor element's memory or cache that generates a detected error, such as a parity or other detected error, when accessed.
  • Control logic in each processor element sets the associated probation bit independently of the other processor elements; reaching an agreement among processor elements is not required. Accordingly, both probation bits may be asserted at any time. The control logic resets the probation bits after some amount of time. The time duration is an implementation-defined parameter or policy.
  • In some circumstances, the probation bits may be set for all processor elements on a processor slice due to an error that places behavior of the entire slice in doubt. Accordingly, the control logic can propagate the probation bits from one processor element, where an error has been detected, to other processor elements in the slice by an implementation-defined technique. Examples of an implementation-defined system include exchanging probation bits via a register in a slice application-specific integrated circuit (ASIC), and/or using inter-processor, intra-slice interrupts.
  • In some embodiments of the computing system 500, the control element 506 interjects a delay between equivalent disparity detection and computing device operation termination. When a duplex, non-self-signaling error occurs, a delay is imposed prior to declaring the situation an error condition. During the delay period, operation is held in limbo. The operation is neither completed nor aborted. The delay enables the tie to be broken by control logic setting the probation bit after the error but before the timeout elapses. The delay does not occur, and therefore does not add any latency, in full agreement cases, and in simplex and triplex cases.
  • The delay period in a tie-breaker situation is kept sufficiently small to avoid excessive network backpressure, thereby preventing an increase in congestion in the network. For current technology, a sufficiently small delay is on the order of tens or hundreds of microseconds.
  • In an illustrative embodiment, when the delay period begins, the logical synchronization unit 512 sends an interrupt to all participating slices to inform the slices that a miscomparison has occurred, although an error has not been declared. The interrupt is in addition to the interrupt that is generated at the end of the delay, when the voting error is final, either major or minor.
  • In a tie-breaker or disparity condition, if, after the delay interval has elapsed, both probation bits are enabled or both probation bits are disabled, then a major error exists and the operation is aborted. If the bits are in opposite states, then the slice with the probation bit off or reset is obeyed, and the slice with the probation bit on or set is ignored. If the probation bits are used to break the tie, then a minor error is declared.
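The tie-break rule above is small enough to state directly in code; this is an illustrative sketch with hypothetical names, where slices A and B hold probation bits a and b as observed after the delay interval:

```python
def resolve_duplex_tie(probation_a, probation_b):
    """Applied after the delay interval in a duplex tie or disparity.
    Returns (obeyed_slice_or_None, severity)."""
    if probation_a == probation_b:
        return (None, "major")     # both set or both clear: abort the operation
    obeyed = "B" if probation_a else "A"
    return (obeyed, "minor")       # obey the slice whose probation bit is off
```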
  • The policy for setting probation bits and duration that the bit setting is maintained is implementation-specific.
  • In the illustrative embodiment, the logical synchronization unit 512 reports all errors, both major and minor, to control logic, such as software via an interrupt and status register. In one implementation, status register bits indicate that the tie-break mechanism has been invoked and designate which processor element or slice is obeyed.
  • Referring to FIG. 6, a flow chart depicts an embodiment of an error handling method 600 in a redundant-processor computing device during programmed input/output (PIO) voting in a duplex configuration. The method 600 comprises detecting equivalent disparity 612, and 610, 614 among processor elements of the computing device, and responding to the detected equivalent disparity by evaluating 624, 634 secondary considerations of processor fidelity.
  • The method can further comprise determining whether evaluation 624, 634 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 626 and 628, 636 and 638. Delay can be inserted 620 and 622, 630 and 632 between equivalent disparity detection and computing device operation termination 626, 636.
  • In the illustrative method 600, control logic receives 602 a programmed input/output (PIO) request from one processor element (PE), buffers 604 the request, and starts 606 a timer. If a second request is received 608, the two requests are compared 610. Otherwise, if the timer has not elapsed 612, whether the second request is received is determined 608. If the timer has elapsed 612, analysis of secondary considerations of processor fidelity begins.
  • When a second request is received 608, following comparison 610 of the requests, if the requests match 614, the control logic performs the PIO operation 616, no error is indicated, and the method is complete 618. Otherwise, an equivalent disparity condition exists in the form of a miscompare on voted data, whereby the command, address, or data supplied by the two processor elements does not match on a programmed input/output (PIO) action. The operation, the address, and, for a write operation, the data are compared to determine a match condition. Secondary consideration analysis begins with the control logic sending 620 a “tie break pending” interrupt to the processor elements and waiting 622 the configured time. Generally, the secondary conditions of processor fidelity can be evaluated during processor element execution, and a probation vector can be set according to the evaluation prior to determination of major or minor error. If the probation bits are equal 624, the operation is aborted 626 since the tie or disparity condition cannot be resolved, and a major error interrupt is sent to the processor elements. The method is terminated 628.
  • Otherwise, if the probation bits are not equal 624, the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method completes 652 with a suitable minor error handling technique. In some embodiments, the minor error is addressed by marking the loser of the voting decision as no longer participating in the logical processor. Subsequent input/output operations or other accesses to the voter are ignored. In other embodiments, software processing in the loser is interrupted and software executing in remaining processor elements shuts down the offending processor element.
  • For analysis of secondary considerations of processor fidelity after the timer elapses 612, an equivalent disparity condition occurs in which a first processor element performs a programmed input/output (PIO) action while a second processor element does not. Control logic sends 630 a “tie break pending” interrupt to the processor elements, and waits 632 the configured time. If probation bits are equal 634, the operation is aborted 636 since correct operation cannot be determined. A major error interrupt is sent to the processor elements. The method is terminated 638.
  • Otherwise, if the probation bits are not equal 634, control logic determines whether the processor element requesting the PIO operation is on probation 640. If the processor element is on probation 640, the control logic follows direction 642 of the processor element that is not on probation and ignores or aborts the PIO operation, sends 644 a minor error interrupt to the processor elements, handles the minor error, and terminates 646 the method. If the processor element requesting the PIO is not on probation 640, the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method is complete 652.
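The decision logic of method 600 (blocks 608 through 652) can be condensed into a sketch, assuming the timer, request buffering, and interrupt delivery are handled elsewhere; the function name and outcome strings are illustrative, not from the disclosure:

```python
# Condensed sketch of the duplex PIO voting decision in method 600.
# req_a / req_b are (operation, address, data) tuples; req_b is None
# when the second request never arrived before the timer elapsed 612.

def pio_vote(req_a, req_b, probation_a: bool, probation_b: bool):
    """Return (action, error_class) for a duplex PIO vote."""
    if req_b is not None:
        if req_a == req_b:                    # 614: requests match
            return "perform", "none"          # 616/618: no error
        # Miscompare on voted data: fall through to the tie-break.
    # 624/634: tie-break on probation bits after the configured delay.
    if probation_a == probation_b:
        return "abort", "major"               # 626/636: unresolvable
    if req_b is None and probation_a:
        # 642: the sole requester is on probation; ignore its PIO.
        return "ignore", "minor"
    # 648: obey the processor element that is not on probation.
    return ("perform_a" if not probation_a else "perform_b"), "minor"
```

Here processor element A is taken as the requester in the single-request timeout path; the two miscompare and timeout branches of FIG. 6 collapse into the shared probation-bit tie-break.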
  • Referring to FIG. 7, a flow chart depicts an embodiment of an error handling method 700 in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration. The method 700 comprises detecting equivalent disparity 726 and 728 among processor element memories of the computing device, and responding to the detected equivalent disparity by evaluating 738 secondary considerations of processor fidelity.
  • The method can further comprise determining whether evaluation 738 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 740 and 742. Delay can be inserted 736 between equivalent disparity detection and computing device operation termination 740, 742.
  • In the illustrative method 700, control logic receives 702 a direct memory access (DMA) read request from a network agent such as a system area network (SAN) agent, replicates and forwards 704 the request to both processor elements, and starts 706 a timer. If a first response is received 708, the first response is buffered 716. In some embodiments, the timer may be restarted as the first response is buffered 716. If the timer has not timed out 710, whether the first response is received is again determined 708. If the timer has timed out 710, the operation is aborted 712 and a major error interrupt is sent to both processor elements. The major error interrupt is indicative of a double timeout. The method is then terminated 714.
  • When the first response has been received 708 and buffered 716, whether a second response has been received is determined 718. If the second response has been received 718, the first and second response data are compared 726. Otherwise, if the timer has not timed out 720, whether the second response has been received is again determined 718. If the timer has timed out 720, the operation is completed 722 using data from the first response and a minor error interrupt is sent to both processor elements. The single timeout condition is indicative of a “self-signaling error”. The method is then terminated 724.
  • When the second response is received 718 and data from the first and second responses is compared 726, if the first and second response data are equal 728, the data match and the operation is completed 730 with no error. The method completes successfully 732. Otherwise, the first and second response data are unequal 728 and a “tie break pending” interrupt 734 is sent to the processor elements. A delay is inserted 736 to wait for a configured time. Probation bits are read to determine whether the probation bits are equal 738. If so, the operation is aborted 740 since the tie cannot be resolved using the probation bits, and a major error interrupt is sent to the processor elements. The method terminates unsuccessfully 742. Otherwise, the probation bits are not equal 738 and the operation is completed 744 using the response from the processor element that is not on probation. A minor error interrupt is sent 746 to the processor elements, the minor error is handled by marking the loser for removal or shutting down the offending processor element, and the method is terminated 748.
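Method 700's outcome selection can likewise be sketched, with responses assumed already buffered and timer handling and interrupt delivery elided; the function name and outcome strings are illustrative assumptions:

```python
# Sketch of the duplex DMA read voting outcomes in method 700.
# resp_a / resp_b are buffered response payloads, or None when the
# corresponding processor element timed out.

def dma_read_vote(resp_a, resp_b, probation_a: bool, probation_b: bool):
    """Return (action, error_class) for a duplex DMA read vote."""
    if resp_a is None and resp_b is None:
        return "abort", "major"            # 712: double timeout
    if resp_a is None or resp_b is None:
        # 722: single timeout, a "self-signaling error"; complete
        # the operation using the one response that arrived.
        return ("use_a" if resp_a is not None else "use_b"), "minor"
    if resp_a == resp_b:
        return "complete", "none"          # 730: data match, no error
    # 738: miscompare resolved by probation bits after the delay.
    if probation_a == probation_b:
        return "abort", "major"            # 740: tie cannot be resolved
    # 744: use the response from the element not on probation.
    return ("use_a" if not probation_a else "use_b"), "minor"
```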
  • While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. For example, although the illustrative structures and methods are most applicable to multiple-processor systems in a duplex configuration, various aspects may be implemented in configurations with more or fewer processors. Furthermore, the illustrative embodiments depict particular arrangements of components. Any suitable arrangement of components may be used. The various operations performed may be implemented in any suitable manner, for example in hardware, software, firmware, or the like.

Claims (29)

1. A control apparatus for usage in a redundant-processor computing device including a plurality of processor elements, the control apparatus comprising:
a control element that detects equivalent disparity among the processor elements and responds by evaluating secondary considerations of processor fidelity.
2. The apparatus according to claim 1 further comprising:
a control element that determines whether evaluation of the secondary considerations is insufficient to resolve the equivalent disparity among the processor elements and, if so, terminates operations of the computing device.
3. The apparatus according to claim 2 further comprising:
a control element that interjects a delay between equivalent disparity detection and the evaluation of secondary considerations of processor fidelity.
4. The apparatus according to claim 1 further comprising:
a control element that determines whether the evaluation of the secondary considerations is sufficient to resolve the equivalent disparity among the processor elements and, if so, completes an operation according to the resolution.
5. The apparatus according to claim 1 further comprising:
a control element that evaluates the secondary conditions of processor fidelity and sets a probation vector according to the evaluation.
6. The apparatus according to claim 1 further comprising:
a processor element that evaluates the secondary conditions of processor fidelity and sets a probation vector according to the evaluation.
7. The apparatus according to claim 1 wherein an equivalent disparity condition comprises one or more of conditions including:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
8. An error handling method in a redundant-processor computing device comprising:
detecting equivalent disparity among processor elements of the computing device; and
responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
9. The method according to claim 8 further comprising:
determining whether the evaluation of the secondary considerations is insufficient to resolve the equivalent disparity among the processor elements; and
if so, terminating operations of the computing device.
10. The method according to claim 9 further comprising:
inserting a delay between equivalent disparity detection and termination of computing device operation.
11. The method according to claim 8 further comprising:
determining whether evaluation of the secondary considerations is sufficient to resolve the equivalent disparity among the processor elements; and
if so, completing an operation according to the resolution.
12. The method according to claim 8 further comprising:
evaluating the secondary conditions of processor fidelity; and
setting a probation vector according to the evaluation.
13. The method according to claim 8 wherein an equivalent disparity condition comprises one or more of conditions including:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
14. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement;
a voter coupled to the plurality of processor elements that compares actions taken by the processor elements and determines disparity in the actions; and
a control element coupled to the processor elements and the voter that detects equivalent disparity among the processor elements and responds by evaluating secondary considerations of processor fidelity.
15. The system according to claim 14 further comprising:
a two-processor element configuration; and
a programmed input/output (PIO) interface coupled to the voter whereby an action disparity that is detectable by the control element is a PIO timeout with one processor element performing a PIO action and one processor element not performing the PIO action.
16. The system according to claim 14 further comprising:
a two-processor element configuration; and
a programmed input/output (PIO) interface and a direct memory access (DMA) interface coupled to the voter whereby an action disparity that is detectable by the control element is a miscompare on voted data with non-matching data supplied by two processor elements either on a PIO action or a DMA action.
17. The system according to claim 14 further comprising:
a probation vector coupled to the voter and coupled to the processor elements and having a signal allocated to each of the processor elements; and
a control element that evaluates the secondary conditions of processor fidelity and sets the probation vector according to the secondary considerations of processor fidelity.
18. The system according to claim 14 further comprising:
a control element that determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements and, if so, terminates computing device operations.
19. The system according to claim 18 further comprising:
a control element that interjects a delay between equivalent disparity detection and evaluation of secondary considerations of processor fidelity.
20. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement;
a probation vector coupled to the processor elements and having a signal allocated to each of the processor elements; and
a logic that evaluates processor fidelity and sets the probation vector according to the evaluation.
21. The system according to claim 20 further comprising:
a control element that evaluates processor fidelity and sets a probation vector according to the evaluation.
22. The system according to claim 20 further comprising:
a processor element that evaluates processor fidelity and sets a probation vector according to the evaluation.
23. The system according to claim 20 further comprising:
a voter coupled to the plurality of processor elements that compares actions taken by the processor elements and determines disparity in the actions; and
a control element that responds to disparity among the processor elements based on the probation vector.
24. The system according to claim 23 further comprising:
a control element that interjects a delay between disparity detection and computing device operation termination.
25. The system according to claim 20 further comprising:
a control element that determines whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements and, if so, terminates computing system operations.
26. The system according to claim 20 wherein for a two-processor system a disparity condition comprises one or more conditions selected from a group consisting of:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
27. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement; and
control logic coupled to the processor element plurality that mutually compares actions taken by ones of the processor elements and determines equivalent disparity in the actions, and waits a selected delay after equivalent disparity detection before initiating an action responsive to the disparity condition.
28. The computing system according to claim 27 further comprising:
control logic that responds after the delay according to evaluation of secondary considerations of processor fidelity.
29. The computing system according to claim 27 wherein:
the selected delay has a duration sufficient to enable near-simultaneous arrival of information for usage in resolving the disparity condition.
US11/045,401 2004-03-30 2005-01-27 Error handling system in a redundant processor Abandoned US20050246581A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/045,401 US20050246581A1 (en) 2004-03-30 2005-01-27 Error handling system in a redundant processor
CN 200610007181 CN1811722A (en) 2005-01-27 2006-01-26 Error handling system in a redundant processor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US55781204P 2004-03-30 2004-03-30
US11/045,401 US20050246581A1 (en) 2004-03-30 2005-01-27 Error handling system in a redundant processor

Publications (1)

Publication Number Publication Date
US20050246581A1 true US20050246581A1 (en) 2005-11-03

Family

ID=35346428

Family Applications (5)

Application Number Title Priority Date Filing Date
US10/953,242 Abandoned US20050240806A1 (en) 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor
US10/990,151 Active 2025-09-20 US7890706B2 (en) 2004-03-30 2004-11-16 Delegated write for race avoidance in a processor
US11/042,981 Expired - Fee Related US7434098B2 (en) 2004-03-30 2005-01-25 Method and system of determining whether a user program has made a system level call
US11/045,401 Abandoned US20050246581A1 (en) 2004-03-30 2005-01-27 Error handling system in a redundant processor
US11/071,944 Abandoned US20050223275A1 (en) 2004-03-30 2005-03-04 Performance data access

Family Applications Before (3)

Application Number Title Priority Date Filing Date
US10/953,242 Abandoned US20050240806A1 (en) 2004-03-30 2004-09-28 Diagnostic memory dump method in a redundant processor
US10/990,151 Active 2025-09-20 US7890706B2 (en) 2004-03-30 2004-11-16 Delegated write for race avoidance in a processor
US11/042,981 Expired - Fee Related US7434098B2 (en) 2004-03-30 2005-01-25 Method and system of determining whether a user program has made a system level call

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/071,944 Abandoned US20050223275A1 (en) 2004-03-30 2005-03-04 Performance data access

Country Status (2)

Country Link
US (5) US20050240806A1 (en)
CN (2) CN1690970A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060020850A1 (en) * 2004-07-20 2006-01-26 Jardine Robert L Latent error detection
US7047440B1 (en) * 2004-07-27 2006-05-16 Freydel Lev R Dual/triple redundant computer system
US20060212763A1 (en) * 2005-03-17 2006-09-21 Fujitsu Limited Error notification method and information processing apparatus
US20070174746A1 (en) * 2005-12-20 2007-07-26 Juerg Haefliger Tuning core voltages of processors
US20070220369A1 (en) * 2006-02-21 2007-09-20 International Business Machines Corporation Fault isolation and availability mechanism for multi-processor system
US20070283061A1 (en) * 2004-08-06 2007-12-06 Robert Bosch Gmbh Method for Delaying Accesses to Date and/or Instructions of a Two-Computer System, and Corresponding Delay Unit
US20080040582A1 (en) * 2006-08-11 2008-02-14 Fujitsu Limited Data processing unit and data processing apparatus using data processing unit
US20080165521A1 (en) * 2007-01-09 2008-07-10 Kerry Bernstein Three-dimensional architecture for self-checking and self-repairing integrated circuits
US20100205607A1 (en) * 2009-02-11 2010-08-12 Hewlett-Packard Development Company, L.P. Method and system for scheduling tasks in a multi processor computing system
US20100275065A1 (en) * 2009-04-27 2010-10-28 Honeywell International Inc. Dual-dual lockstep processor assemblies and modules
US20120137163A1 (en) * 2009-08-19 2012-05-31 Kentaro Sasagawa Multi-core system, method of controlling multi-core system, and multiprocessor
US20120317576A1 (en) * 2009-12-15 2012-12-13 Bernd Mueller method for operating an arithmetic unit
US20130007513A1 (en) * 2010-03-23 2013-01-03 Adrian Traskov Redundant two-processor controller and control method
US20130124922A1 (en) * 2011-11-10 2013-05-16 Ge Aviation Systems Llc Method of providing high integrity processing
CN103645953A (en) * 2008-08-08 2014-03-19 亚马逊技术有限公司 Providing executing programs with reliable access to non-local block data storage
US20140373028A1 (en) * 2013-06-18 2014-12-18 Advanced Micro Devices, Inc. Software Only Inter-Compute Unit Redundant Multithreading for GPUs
US20180074888A1 (en) * 2016-09-09 2018-03-15 The Charles Stark Draper Laboratory, Inc. Methods and systems for achieving trusted fault tolerance of a system of untrusted subsystems
US20190034301A1 (en) * 2017-07-31 2019-01-31 Oracle International Corporation System recovery using a failover processor
US11372981B2 (en) 2020-01-09 2022-06-28 Rockwell Collins, Inc. Profile-based monitoring for dual redundant systems

Families Citing this family (70)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060020852A1 (en) * 2004-03-30 2006-01-26 Bernick David L Method and system of servicing asynchronous interrupts in multiple processors executing a user program
US20050240806A1 (en) * 2004-03-30 2005-10-27 Hewlett-Packard Development Company, L.P. Diagnostic memory dump method in a redundant processor
US7412545B2 (en) * 2004-07-22 2008-08-12 International Business Machines Corporation Apparatus and method for updating I/O capability of a logically-partitioned computer system
US7516359B2 (en) * 2004-10-25 2009-04-07 Hewlett-Packard Development Company, L.P. System and method for using information relating to a detected loss of lockstep for determining a responsive action
US7818614B2 (en) * 2004-10-25 2010-10-19 Hewlett-Packard Development Company, L.P. System and method for reintroducing a processor module to an operating system after lockstep recovery
US7627781B2 (en) * 2004-10-25 2009-12-01 Hewlett-Packard Development Company, L.P. System and method for establishing a spare processor for recovering from loss of lockstep in a boot processor
US7624302B2 (en) * 2004-10-25 2009-11-24 Hewlett-Packard Development Company, L.P. System and method for switching the role of boot processor to a spare processor responsive to detection of loss of lockstep in a boot processor
EP1807760B1 (en) * 2004-10-25 2008-09-17 Robert Bosch Gmbh Data processing system with a variable clock speed
US7502958B2 (en) * 2004-10-25 2009-03-10 Hewlett-Packard Development Company, L.P. System and method for providing firmware recoverable lockstep protection
US7383471B2 (en) * 2004-12-28 2008-06-03 Hewlett-Packard Development Company, L.P. Diagnostic memory dumping
JP4528144B2 (en) * 2005-01-26 2010-08-18 富士通株式会社 Memory dump program boot method, mechanism and program
US20060190700A1 (en) * 2005-02-22 2006-08-24 International Business Machines Corporation Handling permanent and transient errors using a SIMD unit
WO2006090407A1 (en) * 2005-02-23 2006-08-31 Hewlett-Packard Development Company, L.P. A method or apparatus for storing data in a computer system
US7590885B2 (en) * 2005-04-26 2009-09-15 Hewlett-Packard Development Company, L.P. Method and system of copying memory from a source processor to a target processor by duplicating memory writes
DE102005037246A1 (en) * 2005-08-08 2007-02-15 Robert Bosch Gmbh Method and device for controlling a computer system having at least two execution units and a comparison unit
US8694621B2 (en) * 2005-08-19 2014-04-08 Riverbed Technology, Inc. Capture, analysis, and visualization of concurrent system and network behavior of an application
JP4645837B2 (en) * 2005-10-31 2011-03-09 日本電気株式会社 Memory dump method, computer system, and program
US7627584B2 (en) * 2005-11-30 2009-12-01 Oracle International Corporation Database system configured for automatic failover with no data loss
US7668879B2 (en) * 2005-11-30 2010-02-23 Oracle International Corporation Database system configured for automatic failover with no data loss
US20070124522A1 (en) * 2005-11-30 2007-05-31 Ellison Brandon J Node detach in multi-node system
US7496786B2 (en) * 2006-01-10 2009-02-24 Stratus Technologies Bermuda Ltd. Systems and methods for maintaining lock step operation
US8127099B2 (en) * 2006-12-26 2012-02-28 International Business Machines Corporation Resource recovery using borrowed blocks of memory
US7743285B1 (en) * 2007-04-17 2010-06-22 Hewlett-Packard Development Company, L.P. Chip multiprocessor with configurable fault isolation
US20080263391A1 (en) * 2007-04-20 2008-10-23 International Business Machines Corporation Apparatus, System, and Method For Adapter Card Failover
US20080270653A1 (en) * 2007-04-26 2008-10-30 Balle Susanne M Intelligent resource management in multiprocessor computer systems
JP4838226B2 (en) * 2007-11-20 2011-12-14 富士通株式会社 Network logging processing program, information processing system, and network logging information automatic saving method
DE102007062974B4 (en) * 2007-12-21 2010-04-08 Phoenix Contact Gmbh & Co. Kg Signal processing device
JP5309703B2 (en) * 2008-03-07 2013-10-09 日本電気株式会社 Shared memory control circuit, control method, and control program
US7991933B2 (en) * 2008-06-25 2011-08-02 Dell Products L.P. Synchronizing processors when entering system management mode
JP5507830B2 (en) * 2008-11-04 2014-05-28 ルネサスエレクトロニクス株式会社 Microcontroller and automobile control device
US8429633B2 (en) * 2008-11-21 2013-04-23 International Business Machines Corporation Managing memory to support large-scale interprocedural static analysis for security problems
CN101782862B (en) * 2009-01-16 2013-03-13 鸿富锦精密工业(深圳)有限公司 Processor distribution control system and control method thereof
US8631208B2 (en) * 2009-01-27 2014-01-14 Intel Corporation Providing address range coherency capability to a device
TWI448847B (en) * 2009-02-27 2014-08-11 Foxnum Technology Co Ltd Processor distribution control system and control method
CN101840390B (en) * 2009-03-18 2012-05-23 中国科学院微电子研究所 Hardware synchronous circuit structure suitable for multiprocessor system and implement method thereof
US8364862B2 (en) * 2009-06-11 2013-01-29 Intel Corporation Delegating a poll operation to another device
US8479042B1 (en) * 2010-11-01 2013-07-02 Xilinx, Inc. Transaction-level lockstep
TWI447574B (en) * 2010-12-27 2014-08-01 Ibm Method,computer readable medium, appliance,and system for recording and prevevting crash in an appliance
US8635492B2 (en) * 2011-02-15 2014-01-21 International Business Machines Corporation State recovery and lockstep execution restart in a system with multiprocessor pairing
US8930752B2 (en) 2011-02-15 2015-01-06 International Business Machines Corporation Scheduler for multiprocessor system switch with selective pairing
US8671311B2 (en) 2011-02-15 2014-03-11 International Business Machines Corporation Multiprocessor switch with selective pairing
EP2701063A4 (en) * 2011-04-22 2014-05-07 Fujitsu Ltd Information processing device and information processing device processing method
US8554726B2 (en) * 2011-06-01 2013-10-08 Clustrix, Inc. Systems and methods for reslicing data in a relational database
DE102012010143B3 (en) 2012-05-24 2013-11-14 Phoenix Contact Gmbh & Co. Kg Analog signal input circuit with a number of analog signal acquisition channels
JP5601353B2 (en) * 2012-06-29 2014-10-08 横河電機株式会社 Network management system
GB2508344A (en) 2012-11-28 2014-06-04 Ibm Creating an operating system dump
JP6175958B2 (en) 2013-07-26 2017-08-09 富士通株式会社 MEMORY DUMP METHOD, PROGRAM, AND INFORMATION PROCESSING DEVICE
US9251014B2 (en) 2013-08-08 2016-02-02 International Business Machines Corporation Redundant transactions for detection of timing sensitive errors
JP6221702B2 (en) * 2013-12-05 2017-11-01 富士通株式会社 Information processing apparatus, information processing method, and information processing program
WO2015116057A1 (en) 2014-01-29 2015-08-06 Hewlett-Packard Development Company, L.P. Dumping resources
US9710273B2 (en) 2014-11-21 2017-07-18 Oracle International Corporation Method for migrating CPU state from an inoperable core to a spare core
CN104699550B (en) * 2014-12-05 2017-09-12 中国航空工业集团公司第六三一研究所 A kind of error recovery method based on lockstep frameworks
US9411363B2 (en) * 2014-12-10 2016-08-09 Intel Corporation Synchronization in a computing device
JP2016170463A (en) * 2015-03-11 2016-09-23 富士通株式会社 Information processing device, kernel dump method, and kernel dump program
DE102015218898A1 (en) * 2015-09-30 2017-03-30 Robert Bosch Gmbh Method for the redundant processing of data
US10067763B2 (en) * 2015-12-11 2018-09-04 International Business Machines Corporation Handling unaligned load operations in a multi-slice computer processor
US9971650B2 (en) 2016-06-06 2018-05-15 International Business Machines Corporation Parallel data collection and recovery for failing virtual computer processing system
US10579536B2 (en) * 2016-08-09 2020-03-03 Arizona Board Of Regents On Behalf Of Arizona State University Multi-mode radiation hardened multi-core microprocessors
US10521327B2 (en) 2016-09-29 2019-12-31 2236008 Ontario Inc. Non-coupled software lockstep
GB2555628B (en) * 2016-11-04 2019-02-20 Advanced Risc Mach Ltd Main processor error detection using checker processors
US10740167B2 (en) * 2016-12-07 2020-08-11 Electronics And Telecommunications Research Institute Multi-core processor and cache management method thereof
TW202301125A (en) 2017-07-30 2023-01-01 埃拉德 希提 Memory chip with a memory-based distributed processor architecture
JP7099050B2 (en) * 2018-05-29 2022-07-12 セイコーエプソン株式会社 Circuits, electronic devices and mobiles
US10901878B2 (en) * 2018-12-19 2021-01-26 International Business Machines Corporation Reduction of pseudo-random test case generation overhead
US11221899B2 (en) * 2019-09-24 2022-01-11 Arm Limited Efficient memory utilisation in a processing cluster having a split mode and a lock mode
US10977168B1 (en) * 2019-12-26 2021-04-13 Anthem, Inc. Automation testing tool framework
CN111123792B (en) * 2019-12-29 2021-07-02 苏州浪潮智能科技有限公司 Multi-main-system interactive communication and management method and device
US11645185B2 (en) * 2020-09-25 2023-05-09 Intel Corporation Detection of faults in performance of micro instructions
US20230066835A1 (en) * 2021-08-27 2023-03-02 Keysight Technologies, Inc. Methods, systems and computer readable media for improving remote direct memory access performance
KR20230034646A (en) * 2021-09-03 2023-03-10 에스케이하이닉스 주식회사 Memory system and operation method thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5423024A (en) * 1991-05-06 1995-06-06 Stratus Computer, Inc. Fault tolerant processing section with dynamically reconfigurable voting
US6199171B1 (en) * 1998-06-26 2001-03-06 International Business Machines Corporation Time-lag duplexing techniques
US6247143B1 (en) * 1998-06-30 2001-06-12 Sun Microsystems, Inc. I/O handling for a multiprocessor computer system
US20020152418A1 (en) * 2001-04-11 2002-10-17 Gerry Griffin Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
US20030149909A1 (en) * 2001-10-01 2003-08-07 International Business Machines Corporation Halting execution of duplexed commands
US6704887B2 (en) * 2001-03-08 2004-03-09 The United States Of America As Represented By The Secretary Of The Air Force Method and apparatus for improved security in distributed-environment voting
US6820213B1 (en) * 2000-04-13 2004-11-16 Stratus Technologies Bermuda, Ltd. Fault-tolerant computer system with voter delay buffer
US6948092B2 (en) * 1998-12-10 2005-09-20 Hewlett-Packard Development Company, L.P. System recovery from errors for processor and associated components
US7231543B2 (en) * 2004-01-14 2007-06-12 Hewlett-Packard Development Company, L.P. Systems and methods for fault-tolerant processing with processor regrouping based on connectivity conditions

Family Cites Families (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3665404A (en) 1970-04-09 1972-05-23 Burroughs Corp Multi-processor processing system having interprocessor interrupt apparatus
US4228496A (en) 1976-09-07 1980-10-14 Tandem Computers Incorporated Multiprocessor system
US4293921A (en) 1979-06-15 1981-10-06 Martin Marietta Corporation Method and signal processor for frequency analysis of time domain signals
US4481578A (en) * 1982-05-21 1984-11-06 Pitney Bowes Inc. Direct memory access data transfer system for use with plural processors
JPS61253572A (en) * 1985-05-02 1986-11-11 Hitachi Ltd Load distributing system for loose coupling multi-processor system
US4733353A (en) 1985-12-13 1988-03-22 General Electric Company Frame synchronization of multiply redundant computers
JP2695157B2 (en) * 1986-12-29 1997-12-24 松下電器産業株式会社 Variable pipeline processor
EP0306211A3 (en) * 1987-09-04 1990-09-26 Digital Equipment Corporation Synchronized twin computer system
AU616213B2 (en) * 1987-11-09 1991-10-24 Tandem Computers Incorporated Method and apparatus for synchronizing a plurality of processors
CA2003338A1 (en) 1987-11-09 1990-06-09 Richard W. Cutts, Jr. Synchronization of fault-tolerant computer system having multiple processors
JP2644780B2 (en) * 1987-11-18 1997-08-25 株式会社日立製作所 Parallel computer with processing request function
GB8729901D0 (en) * 1987-12-22 1988-02-03 Lucas Ind Plc Dual computer cross-checking system
JPH0797328B2 (en) 1988-10-25 1995-10-18 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Fault tolerant synchronization system
US4965717A (en) 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US5369767A (en) 1989-05-17 1994-11-29 International Business Machines Corp. Servicing interrupt requests in a data processing system without using the services of an operating system
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5317752A (en) * 1989-12-22 1994-05-31 Tandem Computers Incorporated Fault-tolerant computer system with auto-restart after power-fail
US5291608A (en) 1990-02-13 1994-03-01 International Business Machines Corporation Display adapter event handler with rendering context manager
US5111384A (en) * 1990-02-16 1992-05-05 Bull Hn Information Systems Inc. System for performing dump analysis
DK0532582T3 (en) * 1990-06-01 1996-01-29 Du Pont Composite orthopedic implant with varying modulus of elasticity
US5226152A (en) * 1990-12-07 1993-07-06 Motorola, Inc. Functional lockstep arrangement for redundant processors
US5295259A (en) * 1991-02-05 1994-03-15 Advanced Micro Devices, Inc. Data cache and method for handling memory errors during copy-back
US5339404A (en) 1991-05-28 1994-08-16 International Business Machines Corporation Asynchronous TMR processing system
JPH05128080A (en) * 1991-10-14 1993-05-25 Mitsubishi Electric Corp Information processor
US5613127A (en) 1992-08-17 1997-03-18 Honeywell Inc. Separately clocked processor synchronization improvement
US6233702B1 (en) 1992-12-17 2001-05-15 Compaq Computer Corporation Self-checked, lock step processor pairs
US5535397A (en) 1993-06-30 1996-07-09 Intel Corporation Method and apparatus for providing a context switch in response to an interrupt in a computer process
US5572620A (en) 1993-07-29 1996-11-05 Honeywell Inc. Fault-tolerant voter system for output data from a plurality of non-synchronized redundant processors
US5504859A (en) 1993-11-09 1996-04-02 International Business Machines Corporation Data processor with enhanced error recovery
EP0986007A3 (en) 1993-12-01 2001-11-07 Marathon Technologies Corporation Method of isolating I/O requests
JP3481737B2 (en) * 1995-08-07 2003-12-22 富士通株式会社 Dump collection device and dump collection method
US6449730B2 (en) 1995-10-24 2002-09-10 Seachange Technology, Inc. Loosely coupled mass storage computer cluster
US5999933A (en) * 1995-12-14 1999-12-07 Compaq Computer Corporation Process and apparatus for collecting a data structure of a memory dump into a logical table
US5850555A (en) 1995-12-19 1998-12-15 Advanced Micro Devices, Inc. System and method for validating interrupts before presentation to a CPU
US6141769A (en) 1996-05-16 2000-10-31 Resilience Corporation Triple modular redundant computer system and associated method
GB9617033D0 (en) * 1996-08-14 1996-09-25 Int Computers Ltd Diagnostic memory access
US5790397A (en) 1996-09-17 1998-08-04 Marathon Technologies Corporation Fault resilient/fault tolerant computing
US5796939A (en) 1997-03-10 1998-08-18 Digital Equipment Corporation High frequency sampling of processor performance counters
US5903717A (en) * 1997-04-02 1999-05-11 General Dynamics Information Systems, Inc. Fault tolerant computer system
US5896523A (en) 1997-06-04 1999-04-20 Marathon Technologies Corporation Loosely-coupled, synchronized execution
WO1999026133A2 (en) 1997-11-14 1999-05-27 Marathon Technologies Corporation Method for maintaining the synchronized execution in fault resilient/fault tolerant computer systems
US6173356B1 (en) * 1998-02-20 2001-01-09 Silicon Aquarius, Inc. Multi-port DRAM with integrated SRAM and systems and methods using the same
US6141635A (en) * 1998-06-12 2000-10-31 Unisys Corporation Method of diagnosing faults in an emulated computer system via a heterogeneous diagnostic program
US5991900A (en) * 1998-06-15 1999-11-23 Sun Microsystems, Inc. Bus controller
US6223304B1 (en) 1998-06-18 2001-04-24 Telefonaktiebolaget Lm Ericsson (Publ) Synchronization of processors in a fault tolerant multi-processor system
US6314501B1 (en) * 1998-07-23 2001-11-06 Unisys Corporation Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory
US6195715B1 (en) 1998-11-13 2001-02-27 Creative Technology Ltd. Interrupt control for multiple programs communicating with a common interrupt by associating programs to GP registers, defining interrupt register, polling GP registers, and invoking callback routine associated with defined interrupt register
US6263373B1 (en) * 1998-12-04 2001-07-17 International Business Machines Corporation Data processing system and method for remotely controlling execution of a processor utilizing a test access port
US6393582B1 (en) 1998-12-10 2002-05-21 Compaq Computer Corporation Error self-checking and recovery using lock-step processor pair architecture
US6449732B1 (en) 1998-12-18 2002-09-10 Triconex Corporation Method and apparatus for processing control using a multiple redundant processor control system
US6543010B1 (en) * 1999-02-24 2003-04-01 Hewlett-Packard Development Company, L.P. Method and apparatus for accelerating a memory dump
US6397365B1 (en) 1999-05-18 2002-05-28 Hewlett-Packard Company Memory error correction using redundant sliced memory and standard ECC mechanisms
US6658654B1 (en) 2000-07-06 2003-12-02 International Business Machines Corporation Method and system for low-overhead measurement of per-thread performance information in a multithreaded environment
EP1213650A3 (en) * 2000-08-21 2006-08-30 Texas Instruments France Priority arbitration based on current task and MMU
US6604177B1 (en) 2000-09-29 2003-08-05 Hewlett-Packard Development Company, L.P. Communication of dissimilar data between lock-stepped processors
US6604717B2 (en) * 2000-11-15 2003-08-12 Stanfield Mccoy J. Bag holder
US7017073B2 (en) * 2001-02-28 2006-03-21 International Business Machines Corporation Method and apparatus for fault-tolerance via dual thread crosschecking
US7065672B2 (en) * 2001-03-28 2006-06-20 Stratus Technologies Bermuda Ltd. Apparatus and methods for fault-tolerant computing using a switching fabric
US6971043B2 (en) 2001-04-11 2005-11-29 Stratus Technologies Bermuda Ltd Apparatus and method for accessing a mass storage device in a fault-tolerant server
US7207041B2 (en) * 2001-06-28 2007-04-17 Tranzeo Wireless Technologies, Inc. Open platform architecture for shared resource access management
US7076510B2 (en) * 2001-07-12 2006-07-11 Brown William P Software raid methods and apparatuses including server usage based write delegation
US6754763B2 (en) * 2001-07-30 2004-06-22 Axis Systems, Inc. Multi-board connection system for use in electronic design automation
US7194671B2 (en) * 2001-12-31 2007-03-20 Intel Corporation Mechanism handling race conditions in FRC-enabled processors
US6687799B2 (en) * 2002-01-31 2004-02-03 Hewlett-Packard Development Company, L.P. Expedited memory dumping and reloading of computer processors
US7076397B2 (en) 2002-10-17 2006-07-11 Bmc Software, Inc. System and method for statistical performance monitoring
US6983337B2 (en) 2002-12-18 2006-01-03 Intel Corporation Method, system, and program for handling device interrupts
US7526757B2 (en) 2004-01-14 2009-04-28 International Business Machines Corporation Method and apparatus for maintaining performance monitoring structures in a page table for use in monitoring performance of a computer program
JP2005259030A (en) 2004-03-15 2005-09-22 Sharp Corp Performance evaluation device, performance evaluation method, program, and computer-readable storage medium
US7162666B2 (en) 2004-03-26 2007-01-09 Emc Corporation Multi-processor system having a watchdog for interrupting the multiple processors and deferring preemption until release of spinlocks
US20050240806A1 (en) 2004-03-30 2005-10-27 Hewlett-Packard Development Company, L.P. Diagnostic memory dump method in a redundant processor
US7308605B2 (en) 2004-07-20 2007-12-11 Hewlett-Packard Development Company, L.P. Latent error detection
US7380171B2 (en) * 2004-12-06 2008-05-27 Microsoft Corporation Controlling software failure data reporting and responses
US7328331B2 (en) 2005-01-25 2008-02-05 Hewlett-Packard Development Company, L.P. Method and system of aligning execution point of duplicate copies of a user program by copying memory stores

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5423024A (en) * 1991-05-06 1995-06-06 Stratus Computer, Inc. Fault tolerant processing section with dynamically reconfigurable voting
US6199171B1 (en) * 1998-06-26 2001-03-06 International Business Machines Corporation Time-lag duplexing techniques
US6247143B1 (en) * 1998-06-30 2001-06-12 Sun Microsystems, Inc. I/O handling for a multiprocessor computer system
US6948092B2 (en) * 1998-12-10 2005-09-20 Hewlett-Packard Development Company, L.P. System recovery from errors for processor and associated components
US6820213B1 (en) * 2000-04-13 2004-11-16 Stratus Technologies Bermuda, Ltd. Fault-tolerant computer system with voter delay buffer
US6704887B2 (en) * 2001-03-08 2004-03-09 The United States Of America As Represented By The Secretary Of The Air Force Method and apparatus for improved security in distributed-environment voting
US20020152418A1 (en) * 2001-04-11 2002-10-17 Gerry Griffin Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
US6928583B2 (en) * 2001-04-11 2005-08-09 Stratus Technologies Bermuda Ltd. Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
US20030196025A1 (en) * 2001-10-01 2003-10-16 International Business Machines Corporation Synchronizing processing of commands invoked against duplexed coupling facility structures
US6615373B2 (en) * 2001-10-01 2003-09-02 International Business Machines Corporation Method, system and program products for resolving potential deadlocks
US20030149920A1 (en) * 2001-10-01 2003-08-07 International Business Machines Corporation Method, system and program products for resolving potential deadlocks
US20030149909A1 (en) * 2001-10-01 2003-08-07 International Business Machines Corporation Halting execution of duplexed commands
US7231543B2 (en) * 2004-01-14 2007-06-12 Hewlett-Packard Development Company, L.P. Systems and methods for fault-tolerant processing with processor regrouping based on connectivity conditions

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7308605B2 (en) * 2004-07-20 2007-12-11 Hewlett-Packard Development Company, L.P. Latent error detection
US20060020850A1 (en) * 2004-07-20 2006-01-26 Jardine Robert L Latent error detection
US7047440B1 (en) * 2004-07-27 2006-05-16 Freydel Lev R Dual/triple redundant computer system
US20070283061A1 (en) * 2004-08-06 2007-12-06 Robert Bosch Gmbh Method for Delaying Accesses to Date and/or Instructions of a Two-Computer System, and Corresponding Delay Unit
US20060212763A1 (en) * 2005-03-17 2006-09-21 Fujitsu Limited Error notification method and information processing apparatus
US7584388B2 (en) * 2005-03-17 2009-09-01 Fujitsu Limited Error notification method and information processing apparatus
US20070174746A1 (en) * 2005-12-20 2007-07-26 Juerg Haefliger Tuning core voltages of processors
US7516358B2 (en) 2005-12-20 2009-04-07 Hewlett-Packard Development Company, L.P. Tuning core voltages of processors
US20070220369A1 (en) * 2006-02-21 2007-09-20 International Business Machines Corporation Fault isolation and availability mechanism for multi-processor system
US7765383B2 (en) * 2006-08-11 2010-07-27 Fujitsu Semiconductor Limited Data processing unit and data processing apparatus using data processing unit
US20080040582A1 (en) * 2006-08-11 2008-02-14 Fujitsu Limited Data processing unit and data processing apparatus using data processing unit
US20080165521A1 (en) * 2007-01-09 2008-07-10 Kerry Bernstein Three-dimensional architecture for self-checking and self-repairing integrated circuits
CN103645953A (en) * 2008-08-08 2014-03-19 亚马逊技术有限公司 Providing executing programs with reliable access to non-local block data storage
US20100205607A1 (en) * 2009-02-11 2010-08-12 Hewlett-Packard Development Company, L.P. Method and system for scheduling tasks in a multi processor computing system
US8875142B2 (en) * 2009-02-11 2014-10-28 Hewlett-Packard Development Company, L.P. Job scheduling on a multiprocessing system based on reliability and performance rankings of processors and weighted effect of detected errors
US20100275065A1 (en) * 2009-04-27 2010-10-28 Honeywell International Inc. Dual-dual lockstep processor assemblies and modules
US7979746B2 (en) * 2009-04-27 2011-07-12 Honeywell International Inc. Dual-dual lockstep processor assemblies and modules
US20120137163A1 (en) * 2009-08-19 2012-05-31 Kentaro Sasagawa Multi-core system, method of controlling multi-core system, and multiprocessor
US8719628B2 (en) * 2009-08-19 2014-05-06 Nec Corporation Multi-core system, method of controlling multi-core system, and multiprocessor
US20120317576A1 (en) * 2009-12-15 2012-12-13 Bernd Mueller Method for operating an arithmetic unit
US8959392B2 (en) * 2010-03-23 2015-02-17 Continental Teves Ag & Co. Ohg Redundant two-processor controller and control method
US20130007513A1 (en) * 2010-03-23 2013-01-03 Adrian Traskov Redundant two-processor controller and control method
US9170907B2 (en) * 2011-11-10 2015-10-27 Ge Aviation Systems Llc Method of providing high integrity processing
US8924780B2 (en) * 2011-11-10 2014-12-30 Ge Aviation Systems Llc Method of providing high integrity processing
US20150106657A1 (en) * 2011-11-10 2015-04-16 Ge Aviation Systems Llc Method of providing high integrity processing
US20130124922A1 (en) * 2011-11-10 2013-05-16 Ge Aviation Systems Llc Method of providing high integrity processing
US20140373028A1 (en) * 2013-06-18 2014-12-18 Advanced Micro Devices, Inc. Software Only Inter-Compute Unit Redundant Multithreading for GPUs
US9274904B2 (en) * 2013-06-18 2016-03-01 Advanced Micro Devices, Inc. Software only inter-compute unit redundant multithreading for GPUs
US9367372B2 (en) 2013-06-18 2016-06-14 Advanced Micro Devices, Inc. Software only intra-compute unit redundant multithreading for GPUs
US20180074888A1 (en) * 2016-09-09 2018-03-15 The Charles Stark Draper Laboratory, Inc. Methods and systems for achieving trusted fault tolerance of a system of untrusted subsystems
US20190034301A1 (en) * 2017-07-31 2019-01-31 Oracle International Corporation System recovery using a failover processor
US10474549B2 (en) * 2017-07-31 2019-11-12 Oracle International Corporation System recovery using a failover processor
US11163654B2 (en) 2017-07-31 2021-11-02 Oracle International Corporation System recovery using a failover processor
US11599433B2 (en) 2017-07-31 2023-03-07 Oracle International Corporation System recovery using a failover processor
US11372981B2 (en) 2020-01-09 2022-06-28 Rockwell Collins, Inc. Profile-based monitoring for dual redundant systems

Also Published As

Publication number Publication date
US7890706B2 (en) 2011-02-15
US20050246587A1 (en) 2005-11-03
US20050223178A1 (en) 2005-10-06
CN1690970A (en) 2005-11-02
US20050240806A1 (en) 2005-10-27
US7434098B2 (en) 2008-10-07
US20050223275A1 (en) 2005-10-06
CN100472456C (en) 2009-03-25
CN1696903A (en) 2005-11-16

Similar Documents

Publication Publication Date Title
US20050246581A1 (en) Error handling system in a redundant processor
Bernick et al. NonStop® advanced architecture
US6948092B2 (en) System recovery from errors for processor and associated components
US6260159B1 (en) Tracking memory page modification in a bridge for a multi-processor system
US6496940B1 (en) Multiple processor system with standby sparing
US6802023B2 (en) Redundant controller data storage system having hot insertion system and method
US4916704A (en) Interface of non-fault tolerant components to fault tolerant system
US5226152A (en) Functional lockstep arrangement for redundant processors
US5239641A (en) Method and apparatus for synchronizing a plurality of processors
US5255367A (en) Fault tolerant, synchronized twin computer system with error checking of I/O communication
US6393582B1 (en) Error self-checking and recovery using lock-step processor pair architecture
US7296181B2 (en) Lockstep error signaling
US6587961B1 (en) Multi-processor system bridge with controlled access
US20020133740A1 (en) Redundant controller data storage system having system and method for handling controller resets
JP2500038B2 (en) Multiprocessor computer system, fault tolerant processing method and data processing system
US6223230B1 (en) Direct memory access in a bridge for a multi-processor system
US6173351B1 (en) Multi-processor system bridge
JPH0792765B2 (en) Input / output controller
CN1729456A (en) On-die mechanism for high-reliability processor
KR20000011835A (en) Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applicatons in a network
EP0779579B1 (en) Bus error handler on dual bus system
CN101714108A (en) Synchronization control apparatus, information processing apparatus, and synchronization management method
US7631226B2 (en) Computer system, bus controller, and bus fault handling method used in the same computer system and bus controller
US5905875A (en) Multiprocessor system connected by a duplicated system bus having a bus status notification line
US7243257B2 (en) Computer system for preventing inter-node fault propagation

Legal Events

Date Code Title Description
AS Assignment

Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JARDINE, ROBERT L.;KLECKA, JAMES S.;BRUCKERT, WILLIAM F.;AND OTHERS;REEL/FRAME:016236/0182;SIGNING DATES FROM 20050126 TO 20050127

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION