US20050246581A1 - Error handling system in a redundant processor - Google Patents
- Publication number
- US20050246581A1
- Authority
- US
- United States
- Prior art keywords
- processor
- disparity
- pio
- processor elements
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1658—Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/165—Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1675—Temporal synchronisation or re-synchronisation of redundant processing components
- G06F11/1687—Temporal synchronisation or re-synchronisation of redundant processing components at event level, e.g. by interrupt or result of polling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3404—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for parallel or distributed programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3495—Performance evaluation by tracing or monitoring for systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/362—Software debugging
- G06F11/3636—Software debugging by tracing the execution of the program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/362—Software debugging
- G06F11/366—Software debugging using diagnostics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1641—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/1641—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
- G06F11/1645—Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components and the comparison itself uses redundant hardware
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1675—Temporal synchronisation or re-synchronisation of redundant processing components
- G06F11/1683—Temporal synchronisation or re-synchronisation of redundant processing components at instruction level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/18—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
- G06F11/183—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
- G06F11/184—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/18—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
- G06F11/183—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components
- G06F11/184—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality
- G06F11/185—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits by voting, the voting not being performed by the redundant components where the redundant components implement processing functionality and the voting is itself performed redundantly
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/88—Monitoring involving counting
Definitions
- System availability, scalability, and data integrity are fundamental characteristics of enterprise systems.
- A continuous performance capability is demanded in financial, communication, and other fields that use enterprise systems for applications such as stock exchange transaction handling, credit and debit card systems, telephone networks, and the like.
- Highly reliable systems are often implemented in applications with high financial or human costs, in circumstances of massive scaling, and in conditions in which outages and data corruption cannot be tolerated.
- Some systems combine multiple redundant processors running the same operations so that an error in a single processor can be detected and/or corrected.
- Results attained for each of the processors can be mutually compared. If all results are the same, all processors are presumed, with high probability of correctness, to be functioning properly. However, if results differ, analysis is performed to determine which processor is operating incorrectly. Results from the multiple processors can be “voted”, with the “winning” result determined to be correct. For example, a system with three processor elements typically uses the result attained by two of the three processors.
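- The majority-vote comparison described above can be sketched as follows. This is an illustrative sketch only; the function name `vote` and the list-of-results representation are assumptions made for exposition, not part of the patent.

```python
from collections import Counter

def vote(results):
    """Majority-vote results from redundant processor elements.

    Returns (winner, suspects): the value supplied by a strict majority,
    plus the indices of elements that disagreed. Returns (None, all
    indices) when no strict majority exists, such as a duplex tie.
    """
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) / 2:  # strict majority: more than half agree
        suspects = [i for i, r in enumerate(results) if r != value]
        return value, suspects
    return None, list(range(len(results)))

# Triplex example: elements 0 and 1 agree, so element 2 is suspect.
winner, suspects = vote([0xAB, 0xAB, 0xFF])
```

A duplex disagreement, for example `vote([0xAB, 0xFF])`, yields no winner, which is the tie condition the patent addresses.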
- An error handling method comprises detecting equivalent disparity among processor elements of an operating computing device and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
- FIG. 1 is a schematic block diagram that illustrates an embodiment of a control apparatus for usage in a redundant-processor computing device and having capability to resolve a mutual disparity or tie condition;
- FIG. 2 is a schematic block diagram depicting an embodiment of a computing system with capability to resolve disparity and break ties among a plurality of processor elements using a probation vector;
- FIG. 3 is a schematic block diagram illustrating an embodiment of a computing system configured in a redundant-processor arrangement that imposes a selected-duration short delay in the event of a disparity or tie condition;
- FIG. 4 is a schematic block diagram showing an embodiment of a processor complex within which the illustrative error handling system may be implemented;
- FIG. 5 is a schematic block diagram showing an embodiment of a computing system capable of detecting equivalent disparity among the processor elements and responding by evaluating secondary considerations of processor fidelity;
- FIG. 6 is a flow chart depicting an embodiment of an error handling method in a redundant-processor computing device that has tie-breaking capability during programmed input/output (PIO) voting in a duplex configuration;
- FIG. 7 is a flow chart illustrating an embodiment of an error handling method in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration.
- DMA direct memory access
- a processor may incorporate multiple, redundant, loosely-coupled processor elements for error detection.
- A duplex arrangement using two processor elements is susceptible to a “voting tie” situation. Ties may be avoided by using an odd number of processor elements, but at the expense of fault detection capability if only a single processor element is used, or at added cost if additional processor elements are incorporated.
- the illustrative system and method may use other information to resolve conflicts and break ties. Accordingly, an effective processor may be configured using only two processor elements for voting or comparison.
- Referring to FIG. 1, a schematic block diagram illustrates an embodiment of a control apparatus 100 for usage in a redundant-processor computing device 102 .
- the control apparatus 100 is operative in a configuration with a plurality of processor elements 104 A and 104 B and can resolve a mutual disparity or “tie” condition among processor elements.
- the control apparatus 100 can be used to break ties in the case of voting error, for example with an even number of processor elements, using other available information.
- the control apparatus 100 includes a control element 106 that detects equivalent disparity among the processor elements 104 A, 104 B and responds by evaluating secondary considerations of processor fidelity.
- the control element 106 determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements 104 A, 104 B and, if so, terminates computing device operations.
- the computing device 102 can be a computer processor that uses multiple, redundant, loosely-synchronized processor elements 104 A, 104 B to detect and manage errors.
- a configuration with an even number of processor elements 104 A, 104 B is susceptible to a voting “tie” condition in which actions or results from the processor elements differ.
- a computing device 102 may have two processor elements 104 A, 104 B so that any disparity is equivalent and results in a tie condition.
- An odd number of processor elements, for example three, can be used at added cost to avoid ties.
- other information which may be called secondary considerations of fidelity may be available to resolve the disparity and break the tie.
- the other information is heuristic data which is sufficiently predictive to be trusted for disparity resolution. If the tie cannot be broken by use of the other information, then the error is considered sufficiently serious that the processor is halted due to an inability to guarantee correctness of either of the unequal voted data items.
- Some embodiments may include a control element 106 that evaluates the secondary considerations of processor fidelity while the processor elements 104 A, 104 B are executing, before equivalent disparity is detected, and sets a probation vector 108 according to the evaluation.
- the probation vector 108 may be implemented in a voting unit 110 and used by the voting unit 110 to resolve disparities and break ties in predetermined conditions.
- the probation vector 108 can have one bit of state per processor element 104 A, 104 B.
- Control logic, such as software executing in each processor element 104 A, 104 B, can set the bit in conditions in which the logic has accumulated information for use in breaking future, or very recent, ties. The control logic can periodically reset the probation vector bits.
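- A minimal sketch of such a probation vector, assuming a simple bitmask with one bit per processor element (the class and method names are hypothetical):

```python
class ProbationVector:
    """One bit of state per processor element.

    Control logic sets an element's bit when it has accumulated
    information implicating that element (for use in breaking future
    ties), and periodically clears the bits so stale suspicion does
    not linger.
    """

    def __init__(self, num_elements):
        self.bits = 0
        self.num_elements = num_elements

    def set_bit(self, element):
        self.bits |= 1 << element

    def reset_all(self):
        self.bits = 0

    def on_probation(self, element):
        return bool(self.bits & (1 << element))
```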
- Upon detecting a disparity or tie condition, the voting unit 110 may delay acting upon the condition or declaring a fatal error situation. Instead, the voting unit 110 can hold the compared values for a short duration before acting. Accordingly, the control element 106 can interject a delay between equivalent disparity detection and termination of computing device operation.
- the delay enables control logic, for example software, to possibly detect other errors or gather information pertinent to resolving the disparity or breaking the tie.
- The delay can also break potential race conditions. For example, if a self-detectable error occurs simultaneously with, or at nearly the same time as, a misvote, the delay enables further collection of information or analysis before the voter declares the misvote condition, enabling recognition of the error and resolution of the vote.
- the voting unit 110 resolves the disparity or breaks the tie in favor of the processor element, either 104 A or 104 B, that is not on probation. While the condition remains an error condition, the error is made recoverable. If the control logic does not set the bit in the probation vector prior to or during the delay, the error is considered to be fatal to the computing device 102 , and operation is halted.
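- Resolution in favor of the processor element that is not on probation might look like the following sketch (names and data shapes are assumptions; per the text above, an unresolvable tie is treated as fatal):

```python
def break_tie(values, probation):
    """Resolve a duplex voting tie using probation information.

    values: {element_id: voted_value} for the disagreeing elements.
    probation: set of element ids currently on probation.
    Returns the value from the single element not on probation;
    otherwise the tie is unresolvable and operation is halted.
    """
    trusted = [e for e in values if e not in probation]
    if len(trusted) == 1:  # exactly one element is above suspicion
        return values[trusted[0]]
    # No probation bit was set before or during the delay: fatal error.
    raise RuntimeError("unresolvable tie: halting computing device")
```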
- the computing device 102 can include a logical synchronization unit 112 that contains the voting unit 110 and input/output interfaces 114 A and 114 B.
- the interfaces 114 A and 114 B may include a programmed input/output (PIO) interface and a direct memory access (DMA) interface.
- a possible equivalent disparity or tie condition may include a condition of a first processor element that performs a programmed input/output (PIO) action while a second processor element does not.
- A second example of an equivalent disparity or tie condition may be a miscompare on voted data, whereby data supplied by the two processor elements 104 A, 104 B does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
- Other equivalent disparity or tie conditions include a first processor element performing a PIO read while a second processor element performs a PIO write, or first and second processors reading or writing different addresses.
- a schematic block diagram depicts an embodiment of a computing system 200 with capability to resolve disparity among a plurality of processor elements 202 A, 202 B configured in a redundant-processor arrangement.
- the computing system 200 further includes a probation vector 204 coupled to the processor elements 202 A, 202 B and has a signal allocated to each of the processor elements 202 A, 202 B.
- a control element 206 is coupled to the processor elements 202 A, 202 B and evaluates processor fidelity, setting the probation vector 204 according to results of the evaluation.
- The probation vector 204 is used to monitor secondary considerations of processor element fidelity before an error is detected, when more abundant information relating to processor element conditions and functionality is available. In contrast, a system that does not begin acquiring status information until an error is detected may have more limited functional capabilities, and a possible inability to perform actions diagnostic of processor element fidelity. Acquisition of status and operational information while the processor elements 202 A, 202 B are executing in due course simplifies operation because information is merely noted when available. Higher-complexity algorithms that execute following error or disparity detection and require information to be evoked or stimulated, as well as monitored, can be avoided.
- the computing system 200 also includes a voter 208 that is coupled to the plurality of processor elements 202 A, 202 B and compares actions taken by the processor elements 202 A, 202 B to determine disparity in processor actions.
- a control element 210 responds to disparity among the processor elements 202 A, 202 B based on the probation vector 204 .
- the control element 210 can interject a delay between disparity detection and computer device operation termination to allow monitoring of additional information that may be useful in resolving disparity and breaking ties.
- the control element 210 can also determine whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements 202 A, 202 B and, if so, terminate computing system operations.
- a schematic block diagram illustrates an embodiment of a computing system 300 including a plurality of processor elements 302 A, 302 B configured in a redundant-processor arrangement, and control logic 304 that imposes a selected-duration short delay in the event of a disparity or tie condition.
- the control logic 304 compares actions taken by the processor elements 302 A, 302 B and determines disparity in the actions, then waits the selected delay duration after equivalent disparity detection before initiating an action in response to the disparity condition.
- the control logic 304 may respond after the delay according to evaluation of secondary considerations of processor fidelity.
- the selected delay has duration sufficient to enable near-simultaneous arrival of information for usage in resolving the disparity condition.
- the delay is imposed in case of processor disparity or tie, to enable simultaneous or near-simultaneous arrival of information that can be used in disparity resolution and/or tie-breaking.
- the delay has suitable duration to enable logic to receive a high priority interrupt and perform a few computations and is sufficiently long to enable information to be acquired at the same time or very close to occurrence of an error.
- Typical delay duration, using current processor operating speeds and technology, is on the order of tens or hundreds of microseconds, sufficient to handle the interrupt and execute hundreds or thousands of instructions. The delay may assist in avoiding or breaking race conditions.
- the selected delay also has an upper limit. A lengthy examination of error state or diagnostic execution may not be acceptable.
- During the delay, the programmed input/output (PIO) operation or direct memory access (DMA) operation is suspended, possibly disrupting communications with other processors as well as communications on the interprocessor network due to backpressure. Such disruption is generally not desirable and is avoided by imposing an upper limit on the delay duration.
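- The bounded delay could be sketched as a polling loop with a deadline. The 200-microsecond default and the function names here are assumptions, chosen to be consistent with the "tens or hundreds of microseconds" figure above.

```python
import time

def hold_for_evidence(has_evidence, max_delay_us=200):
    """Hold a detected miscompare for a short, bounded delay.

    Polls has_evidence (for example, "has a self-signaling error
    arrived?") until it returns True or the deadline passes. The upper
    bound keeps suspended PIO/DMA traffic from backing up the
    interprocessor network for long.
    """
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while time.monotonic() < deadline:
        if has_evidence():
            return True   # evidence arrived in time; resolve the vote
    return False          # deadline passed; act on the miscompare alone
```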
- a schematic block diagram shows an embodiment of a processor complex 400 within which the illustrative error handling system may be implemented.
- the processor complex 400 includes a plurality of logical processors 408 , each a computing engine capable of executing processes and implemented using one or more processor elements 402 , each in a different processor slice 410 , combined with one or more logical synchronization units (LSUs) 412 .
- Each processor element 402 is a single microprocessor or microprocessor core capable of executing a single instruction stream.
- a processor slice 410 typically comprises one or more processor elements 402 , each with a dedicated memory 414 or sharing a partitioned memory.
- a processor complex 400 comprises one or more logical processors 408 .
- a processor complex 400 comprises one or more processor slices 410 . Within a complex 400 , each slice 410 has the same number of processor elements.
- a processor complex with one slice is called a simplex complex.
- a two-slice processor complex is called a duplex, dual modular redundant, or DMR complex.
- a three-slice processor complex is called a triplex, tri-modular redundant, or TMR complex.
- a processor complex 400 includes both processor elements 402 and corresponding logical synchronization units 412 .
- a computing system comprises one or more logical processors 408 .
- the computing system also comprises one or more processor complexes 400 .
- the processor complexes 400 are interconnected via a network, for example a System Area Network (SAN), a local area network (LAN), a wide area network (WAN), or the like, or simply a connection to a bus.
- the network is used for connection to both other processors and to input/output (I/O) devices. Voting or output data comparison is performed for all data transfers between the logical processor and the network or the network I/O adapter.
- In a logical processor 408 , one, two, three, or possibly more processor elements cooperate to perform logical processor operations.
- Cooperative actions include coordinating or synchronizing mutually among the processor elements, exchanging data, replicating input data, and voting on operations and output data selection.
- the various cooperative actions can be implemented within or supported by implementation of the logical synchronization units 412 .
- a schematic block diagram illustrates an embodiment of a computing system 500 comprising a plurality of processor elements 502 A, 502 B configured in a redundant-processor arrangement, and a voter 504 coupled to the plurality of processor elements 502 A, 502 B that compares actions taken by the processor elements and determines disparity in the actions.
- the computing system 500 further includes a control element 506 coupled to the processor elements 502 A, 502 B and the voter 504 that detects equivalent disparity among the processor elements 502 A, 502 B and responds by evaluating secondary considerations of processor fidelity.
- The illustrative computing system 500 may comprise a single logical processor among multiple logical processors in a complex or system.
- The computing system 500 may also include a logical synchronization unit 512 comprising the voter 504 and a network interface 520 , such as a SAN interface that can be configured to connect to the network 518 by one or more ports.
- the network connection may be made as shown in FIG. 5 via an X-fiber port and a Y-fiber port, although wire ports may otherwise be used.
- the logical synchronization unit 512 maintains multiple bit sets that represent, at any point in time, the set of processor elements that are enabled to perform selected operations.
- the voter 504 includes a plurality of multiple-bit configuration registers to indicate which processor elements are expected and enabled to participate in selected operations, for example including programmed input/output (PIO) operations with the processor elements 502 A, 502 B, and network interface 520 DMA operations including direct memory access (DMA) output voting and DMA input replication.
- the voter configuration bits represent which processor elements are meant to be “assigned” to a logical processor and are therefore eligible for performing various operations such as output voting operations.
- Configured processor elements are defined, for any particular operation type, as the set of processor elements enabled by the configuration bits in the logical synchronization unit 512 to perform that operation type.
- the logical synchronization unit 512 ignores operations of non-configured processor elements on an operation-type basis.
- Configuration bits are set whenever processor elements “join” the configuration of a logical processor 508 . Configuration bits are reset or cleared when processor elements leave the configuration, for example by being voted out.
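- The per-operation-type configuration bits might be modeled as follows. The class name, method names, and operation-type labels are hypothetical, used only to illustrate the join/vote-out behavior described above.

```python
class VoterConfig:
    """Configuration bits per operation type.

    A set bit marks a processor element as expected and enabled to
    participate in that operation type. Bits are set when an element
    joins the logical processor and cleared when it leaves, for
    example by being voted out.
    """

    def __init__(self, op_types):
        self.bits = {op: 0 for op in op_types}

    def join(self, element):
        for op in self.bits:          # joining enables all operation types
            self.bits[op] |= 1 << element

    def vote_out(self, element):
        for op in self.bits:          # leaving clears every type's bit
            self.bits[op] &= ~(1 << element)

    def configured(self, op):
        """List the element ids whose bit is set for this type."""
        out, b, e = [], self.bits[op], 0
        while b:
            if b & 1:
                out.append(e)
            b >>= 1
            e += 1
        return out
```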
- Participating processor elements are the set of configured processor elements, for a particular instance of an operation, that actually perform the operation at a particular time, within a timeout period.
- the set of configured processor elements is related to a type of operation while the set of participating processor elements is related to an individual instance of an operation.
- the configured processor elements have outbound data transfers voted.
- Participating processor elements are the elements that actually issue a particular outbound data transfer.
- For a particular voted operation, all configured processor elements are expected to participate. A processor element that does not participate times out on the operation.
- voted operations can result in conditions including full agreement, timeout, simple miscompare, tie miscompare, and full miscompare.
- In full agreement, all configured processor elements participate within a reasonable time and supply identical data for an identical operation, so that all voted data matches.
- In a timeout condition, one or more configured processor elements do not participate in time.
- In a simple miscompare, a strict majority of configured processor elements supplies identical data, and a strict minority supplies different data.
- A strict majority is greater than half, for example one in simplex, two in duplex, and two or three in triplex.
- A tie miscompare occurs when an even number of processor elements, for example two, is configured and the data does not compare.
- In a full miscompare, all sets of voted data, for example all three in triplex, miscompare pair-wise so that no strict majority results.
- the different types of conditions are mapped into one of three error conditions including no error, minor error, and major error.
- Minor errors include timeouts or disagreements among voted data for which available information is sufficient to enable resolution. Examples of minor error conditions are triplex configurations in which two processors are in agreement, and duplex configurations in disagreement when other information is available to resolve the disagreement.
- Major errors include timeouts and disagreements among voted data for which the condition cannot be resolved. Examples of major error conditions are triplex configurations with three-way disagreement or timeouts in two of the three processors, and duplex configurations with disagreeing processors and no information available for resolution.
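- The five voted-operation conditions and the mapping onto the three error severities can be sketched together. This is an assumption-laden illustration: tie resolution via secondary information is not modeled here, so a tie miscompare and a duplex timeout are reported as major, whereas the source allows them to become recoverable minor errors when secondary information resolves them.

```python
from collections import Counter

def classify_vote(configured, submissions):
    """Classify one voted operation and map it to an error severity.

    configured:  ids of processor elements expected to participate.
    submissions: {element_id: data} from elements that arrived in time.
    Returns (condition, severity).
    """
    if set(submissions) != set(configured):
        # Timeout: minor only if a strict majority still participated.
        majority = len(submissions) * 2 > len(configured)
        return "timeout", "minor" if majority else "major"
    top, n = Counter(submissions.values()).most_common(1)[0]
    if n == len(configured):
        return "full agreement", "no error"
    if n * 2 > len(configured):       # strict majority agrees
        return "simple miscompare", "minor"
    if len(configured) % 2 == 0 and n * 2 == len(configured):
        return "tie miscompare", "major"
    return "full miscompare", "major"
```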
- Self-signaling errors are defined as errors detected by an explicit detection element or mechanism as distinguished from implicit detection techniques such as voting.
- Various types of errors may be self-signaling. Examples of self-signaling errors include direct memory access (DMA) read timeouts, errors detected by parity or other error checking codes, loss-of-signal in optical signals, and loss of electrical continuity for electrical signals.
- Voting detects errors, but in a duplex configuration voting alone does not distinguish which voted data is correct and which is incorrect. Self-signaling errors designate which of the two data suppliers is incorrect.
- the processor elements 502 A, 502 B each have a memory 514 or share a partitioned memory with other processor elements in the same slice.
- the PIO operations are processor-initiated reads (loads) or writes (stores) to any part of the processor's address space other than “main memory”.
- the address space may contain registers, pseudo-registers, and memory other than main memory.
- PIO operations may be targeted to resources in either the voter 504 or to a network interface 520 . Operations targeted to the voter 504 may be either unvoted or voted (compared), depending on the address of the register being accessed. Operations to the network interface 520 may always be voted in some implementations.
- For a PIO read, the voter 504 captures the operation and read address and sets a timer. The voter 504 waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation and address are compared, for example in a bit-by-bit comparison of the entire operation and address. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances in which the illustrative error handling and tie-breaking technique is not implemented or is disabled, operation can be described as follows. If all configured processor elements participate, then for the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data, if any.
- a simple miscompare is handled as a minor error. Otherwise (not full agreement or simple miscompare), the operation is aborted—not performed, and a “bus error” is returned to all requesting processor elements, and a major error is reported. If the operation is not aborted, then if the PIO read is targeted to the network interface 520 , the voter 504 forwards the operation and address to the network interface 520 and waits for a response. If the operation is targeted to the voter 504 , then the voter 504 accesses data directly. The response data, when available, is replicated by the voter 504 and sent as a response to all participating processor elements at approximately the same instant.
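- The PIO read voting flow above might be summarized as in the following sketch. A timeout is represented by a `None` entry, and the response-replication step is omitted; the function name and data shapes are assumptions.

```python
from collections import Counter

def vote_pio_read(requests):
    """Vote one PIO read instance across configured elements.

    requests: {element_id: (operation, address)} captured by the voter;
    None marks an element that failed to initiate within the timeout.
    Full agreement or a simple miscompare lets the read proceed with
    the majority request; otherwise the read is aborted and a "bus
    error" is returned to the requesting elements.
    """
    live = [r for r in requests.values() if r is not None]
    if not live:
        return "abort", None
    top, n = Counter(live).most_common(1)[0]
    if n * 2 > len(requests):   # majority of configured elements agree
        return "proceed", top   # miscompared minority, if any, ignored
    return "abort", None        # tie, full miscompare, or too many timeouts
```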
- the voter 504 captures the operation, write address, and write data, then sets a timer and waits for all configured processor elements to perform the same operation.
- the operation, write address, and write data are bit-by-bit compared. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. In circumstances that the illustrative error handling and tie-breaking technique is not implemented or is disabled, the operation is as follows. If all configured processor elements participate, in the case of full agreement or simple miscompare, the operation proceeds, ignoring miscompared data. A simple miscompare is handled as a minor error.
- the operation is aborted (not performed). No direct response is made to the processor element because no response or acknowledgement is normally made to write operations.
- the operation is aborted, a major error is reported.
- An operation that is not aborted is handled according to the target address of the write operation.
- the voter 504 forwards the operation, address, and data to the network interface 520 .
- the voter 504 performs the write operation directly. No response is made to the processor element.
- the voter 504 suspends all future PIO write operations, which are also aborted, until the software detects the error and re-enables PIO voting.
- the error is detected by handling an error interrupt. Note, however, that in the triplex case, for a simple miscompare such as when one processor element writes and two processor elements time out, no requirement is made to abort all future programmed I/O write operations.
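The write-abort and suspend behavior described above can be sketched as follows, restricted to a duplex configuration. The class and method names are hypothetical, and the sketch models only full agreement versus miscompare, not timeouts: once a write is aborted, later writes are also aborted until software re-enables voting.

```python
# Hypothetical sketch of duplex PIO-write voting with suspend-on-error.

class PioWriteVoter:
    def __init__(self):
        self.suspended = False

    def submit(self, writes):
        """writes: one (op, address, data) tuple per configured element."""
        if self.suspended:
            return "abort"           # held until PIO voting is re-enabled
        if len(set(writes)) == 1:
            return "proceed"         # full agreement: perform the write
        self.suspended = True        # major error: suspend future writes
        return "abort"

    def reenable(self):
        """Called by software after handling the error interrupt."""
        self.suspended = False
```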
- PIO operations are initiated by the processor elements, as contrasted to DMA operations which are initiated by the network interface 520 . Therefore, PIO timeouts are possible in two different circumstances. In a first circumstance, one or two processor elements, operating correctly, initiate a PIO operation, and one processor element, operating incorrectly or stopped, fails to perform the PIO operation. The error is detected when the timer expires. Without further information, the processor element operating incorrectly may be indeterminable, for example when two processor elements are configured and one times out.
- one processor element operating incorrectly, initiates a PIO operation that should not occur, and the other processor element or processor elements, operating correctly, do not initiate a PIO operation.
- the error is detected when the timer expires, although without further information the processor element operating incorrectly may be indeterminable, for example for two active processor elements, one of which times out. Accordingly, the processor elements that do not participate are not necessarily incorrect.
- in the triplex case, the strict majority is always trustworthy, so that if one processor element times out and the remaining two processor elements perform a PIO operation, then the processor element that times out is in error and ignored, while the PIO operation proceeds and the data is voted, with a minor error indicated. Also in the triplex case, if two processor elements time out when a single processor element performs a PIO operation, then the processor element that performs the PIO operation is in error and the PIO is ignored. The lone processor element can be considered a “rogue processor element” and the attempted PIO operation is called a “rogue operation”.
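The triplex timeout rule above can be modeled as a small sketch. The function name is hypothetical, and the "minor" classification of the rogue case is an assumption, since the text states only that the rogue PIO is ignored:

```python
# Hypothetical sketch of the triplex PIO-timeout rule: elements that
# initiated the PIO within the timeout are True, timed-out ones False.

def triplex_pio_timeout(initiated):
    """initiated: three booleans, one per configured processor element."""
    n = sum(initiated)
    if n == 3:
        return "proceed", None      # no timeouts
    if n == 2:
        return "proceed", "minor"   # lone timed-out element is in error
    if n == 1:
        # Rogue element issued a rogue operation; the PIO is ignored.
        # (Classifying this as "minor" is an assumption.)
        return "ignore", "minor"
    return "none", None             # no operation was initiated at all
```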
- Some embodiments of the computing system 500 further comprise a two-processor-element configuration 502A, 502B, the programmed input/output (PIO) interface 522, and a direct memory access (DMA) interface 524 coupled to the voter 504.
- An action disparity that is detectable by the control element 506 is a miscompare on voted data with non-matching data supplied by two processor elements 502A, 502B, either on a PIO action or a DMA action.
- Direct memory access (DMA) reads are outbound operations initiated by the network interface 520 .
- the voter 504 replicates and forwards the DMA read operation and address to all configured processor elements at approximately the same instant, subject to congestion delays in the different slices that may cause an operation to arrive at the processor elements at slightly different times.
- the voter 504 then starts a timer and waits for the responses.
- Response data flows from the processor elements 502A, 502B, through the voter 504, to the network interface 520.
- Responses arriving from the configured processor elements are buffered in the voter 504 rather than being sent immediately to the network interface 520.
- the later responses, upon arrival, are compared with the earlier responses saved in the data buffers.
- Any processor element that does not respond within the timeout period is declared to have timed out. Unlike PIO timeouts, no rogue situations can occur with DMA timeouts because the DMA operation is initiated through the network interface 520. Accordingly, if one processor element times out, that processor element is necessarily erroneous in both duplex and triplex cases. The condition is considered a minor error, and the DMA operation proceeds using data supplied by the processor element or processor elements that do not time out. If two processor elements time out, or if the only processor element times out in a simplex case, then a major error is indicated and the DMA operation is aborted.
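A minimal sketch of these DMA-read timeout outcomes, under the assumption that a response of None stands for a timed-out element (names are illustrative):

```python
# Hypothetical sketch of DMA-read timeout handling. Because the network
# interface initiates DMA, a non-responder is always the faulty element;
# there is no rogue case.

def dma_read_timeout(responses):
    """responses: per-element response data; None marks a timed-out element."""
    timed_out = responses.count(None)
    if timed_out == 0:
        return "proceed", None        # all responded; data is then compared
    if timed_out == 1 and len(responses) >= 2:
        # Minor error: proceed using data from the element(s) that responded.
        return "proceed", "minor"
    return "abort", "major"           # two timeouts, or a simplex timeout
```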
- the voter 504 generates an error notification interrupt to all configured processor elements in the case of any disagreement or timeout, whether data is successfully forwarded to the network interface 520 or otherwise. The interrupt indicates which processor elements timed out, if any, and all comparison results.
- Direct memory access (DMA) writes are inbound operations with data flowing from the network interface 520 through the voter 504 to the processor elements 502A, 502B. DMA write operations are initiated by the network interface 520.
- the voter 504 replicates and forwards the operation, address, and data to all configured slices at approximately the same instant. No response is made to the network interface 520 .
- PIO and DMA timeout values are both configurable by control logic, such as software, and may be different.
- the timer is started when the first PIO operation arrives.
- the timer is restarted when each operation arrives, giving a full timeout period for the later arrivals relative to the earlier ones, a behavior used in all implementations due to the possibility of a PIO operation being initiated by a “rogue processor element”.
- PIO operations can never time out in the simplex case because the operation originates from the processor element.
- DMA operations can time out in simplex, duplex, and triplex configurations because the operation is originated from the network interface 520 .
- the timer is started when the DMA request is forwarded by the voter 504 to memories of all processor elements.
- the timer may optionally be restarted when each response arrives, giving a full timeout period for the later responses.
- a single timeout interval may be applied to all configured processor elements. Either option is possible since no rogue DMA operations can occur.
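The two timer policies, restart-on-each-arrival for PIO versus an optionally single interval for DMA, can be contrasted with a small helper. The function is hypothetical; arrival times are offsets from the first event:

```python
# Hypothetical comparison of the two timer policies. With
# restart_on_each=True (the PIO policy), every later arrival gets a full
# timeout measured from the previous arrival; with False (one DMA option),
# a single interval from the first event applies to everyone.

def all_within_timeout(arrivals, timeout, restart_on_each):
    """arrivals: sorted event offsets; returns True if nothing times out."""
    previous = arrivals[0]
    for t in arrivals[1:]:
        deadline = (previous + timeout) if restart_on_each else (arrivals[0] + timeout)
        if t > deadline:
            return False
        previous = t
    return True
```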
- duplex tie handling generally is inapplicable to DMA timeouts, and to simplex or triplex configurations.
- Two disparity or tie conditions include a PIO timeout in which one processor element performs a PIO operation and the other processor element does not, and a miscompare on voted data in which data supplied by the two processor elements does not match, either on a PIO operation or a DMA read operation.
- processor fidelity may be considered sufficiently strong, even if circumstantial, to implicate one of the two processor elements. For example, a recent logged history of other detected recoverable errors may be indicative of a degradation of processor element reliability.
- a newly-reintegrated processor element or a new slice may be a more likely source of error than an element that has long been installed without a history of error.
- Such early life problems are frequently discovered within a short time, on the order of minutes, following installation.
- a further example of reliability information is inherent in the multiple-dimensional configuration of logical and physical processors, for example as shown in FIG. 4 .
- Processor slices with multiple processor elements connected physically but not logically include hardware that is shared within a processor slice. Hardware may be shared among logical processors. When shared hardware ceases functioning correctly, errors such as intermittent errors can occur. If errors are detected in one logical processor but not another, information about the errors can be communicated between processor elements in a processor slice so that all processor elements within the slice have sufficient information to break any ties that occur.
- Some embodiments of the computing system 500 further comprise a probation vector 526 coupled to the voter 504 and coupled to the processor elements 502A, 502B.
- the probation vector 526 holds a signal for each processor element 502A, 502B.
- a processor control policy executable on the processor elements 502A, 502B evaluates the secondary conditions of processor fidelity during processor element execution.
- the processor control policy sets bits in the probation vector according to the evaluation of secondary considerations before determining whether a major or minor error has occurred.
- the probation vector 526 enables implementations to supply the other information to the voter 504 in such a way that duplex tie conditions can be simply resolved.
- the probation vector 526 comprises one bit for each processor element 502A, 502B.
- the logical synchronization unit 512 uses the probation vector 526 as a tie-breaker in some conditions.
- PIO writes to the probation vector 526 are not voted.
- the initial or reset value of the probation vector 526 is all zero.
- Each processor element 502A, 502B can set or reset the probation bit only for the processor slice associated with the processor element, and not for any other slice.
- the probation vector 526 is not used in simplex and triplex modes, for which the bits may still be set by the control logic but are ignored.
- the probation vector 526 is ignored if both processor slices agree, indicating no error.
- the probation vector 526 is also ignored in the case of self-signaling errors, which indicate the error source so that tie-breaking is superfluous. Self-signaling errors are thus classified as minor errors.
- the errors occur when the voter 504 can be certain that one slice should be ignored, for example in an outbound DMA read response, when one slice does not respond and times out, or when the slice supplies data marked as “known bad”.
- “Known bad” data relates to another example of self-signaling error and is data returned from a processor element's memory or cache that generates a detected error, such as a parity or other detected error, when accessed.
- Control logic in each processor element sets the associated probation bit independently of the other processor elements; reaching an agreement among processor elements is not required. Accordingly, both probation bits may be asserted at any time.
- the control logic resets the probation bits after some amount of time. The time duration is an implementation-defined parameter or policy.
- the probation bits may be set for all processor elements on a processor slice due to an error that places behavior of the entire slice in doubt. Accordingly, the control logic can propagate the probation bits from one processor element, where an error has been detected, to other processor elements in the slice by an implementation-defined technique. Examples of an implementation-defined system include exchanging probation bits via a register in a slice application-specific integrated circuit (ASIC), and/or using inter-processor, intra-slice interrupts.
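A probation vector with independently set, time-limited, slice-propagated bits might be modeled as below. The class is hypothetical; timestamps and a fixed hold interval stand in for the implementation-defined reset policy, and the element indexing is simplified:

```python
# Hypothetical probation-vector sketch: each element sets only its own bit,
# a slice-level error propagates bits to every element of the doubted
# slice, and bits expire after a policy-defined interval.

class ProbationVector:
    def __init__(self, n_elements, hold_time):
        self.bits = [0.0] * n_elements   # per-element expiry time; 0 = clear
        self.hold_time = hold_time

    def set_bit(self, element, now):
        self.bits[element] = now + self.hold_time

    def propagate_slice(self, elements, now):
        """Mark every element of a doubted slice (e.g. shared-hardware error)."""
        for e in elements:
            self.set_bit(e, now)

    def is_on_probation(self, element, now):
        return now < self.bits[element]
```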
- the control element 506 interjects a delay between equivalent disparity detection and computing device operation termination.
- a delay is imposed prior to declaring the situation an error condition.
- operation is held in limbo. The operation is neither completed nor aborted.
- the delay enables the tie to be broken by control logic setting the probation bit after the error but before the timeout elapses. The delay does not occur, and therefore does not add any latency, in full agreement cases, and in simplex and triplex cases.
- a sufficiently small delay may be on the order of tens or hundreds of microseconds.
- the logical synchronization unit 512, when the delay period begins, sends an interrupt to all participating slices to inform the slices that a miscomparison has occurred, although an error has not been declared.
- the interrupt is in addition to the interrupt that is generated at the end of the delay, when the voting error is final, either major or minor.
- if both probation bits are enabled or both probation bits are disabled, then a major error exists and the operation is aborted. If the bits are in opposite states, then the slice with the probation bit off or reset is obeyed, and the slice with the probation bit on or set is ignored. If the probation bits are used to break the tie, then a minor error is declared.
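The duplex tie-break rule in this paragraph reduces to a two-input decision, sketched here with hypothetical names (bit_a and bit_b are the probation bits of the two slices at the end of the delay):

```python
# Hypothetical sketch of the duplex tie-break rule.

def break_duplex_tie(bit_a, bit_b):
    """Return (obeyed_slice, error): 'a' or 'b' whose data is used, or
    None with a major error when the bits cannot break the tie."""
    if bit_a == bit_b:                   # both set or both clear
        return None, "major"             # tie unresolved: abort the operation
    obeyed = "a" if not bit_a else "b"   # obey the slice NOT on probation
    return obeyed, "minor"
```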
- the policy for setting probation bits, and the duration for which the bit setting is maintained, is implementation-specific.
- the logical synchronization unit 512 reports all errors, both major and minor, to control logic, such as software via an interrupt and status register.
- status register bits indicate that the tie-break mechanism has been invoked and designate which processor element or slice is obeyed.
- a flow chart depicts an embodiment of an error handling method 600 in a redundant-processor computing device during programmed input/output (PIO) voting in a duplex configuration.
- the method 600 comprises detecting equivalent disparity 610, 612, 614 among processor elements of the computing device, and responding to the detected equivalent disparity by evaluating 624, 634 secondary considerations of processor fidelity.
- the method can further comprise determining whether evaluation 624, 634 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 626 and 628, 636 and 638. Delay can be inserted 620 and 622, 630 and 632 between equivalent disparity detection and computing device operation termination 626, 636.
- control logic receives 602 a programmed input/output (PIO) request from one processor element (PE), buffers 604 the request, and starts 606 a timer. If a second request is received 608 , the two requests are compared 610 . Otherwise, if the timer has not elapsed 612 , whether the second request is received is determined 608 . If the timer has elapsed 612 , analysis of secondary considerations of processor fidelity begins.
- the secondary conditions of processor fidelity can be evaluated during processor element execution, and a probation vector can be set according to the evaluation prior to determination of major or minor error. If probation bits are equal 624, the operation is aborted 626, since the tie or disparity condition cannot be resolved, and a major error interrupt is sent to the processor elements. The method is terminated 628.
- the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed.
- a minor error interrupt is sent 650 to the processor elements, and the method completes 652 with a suitable minor error handling technique.
- the minor error is addressed by marking the loser of the voting decision as no longer participating in the logical processor. Subsequent input/output operations or other accesses to the voter are ignored.
- software processing in the loser is interrupted and software executing in remaining processor elements shuts down the offending processor element.
- Control logic sends 630 a “tie break pending” interrupt to the processor elements, and waits 632 for the configured time. If probation bits are equal 634, the operation is aborted 636 since correct operation cannot be determined. A major error interrupt is sent to the processor elements. The method is terminated 638.
- control logic determines whether the processor element requesting the PIO operation is on probation 640. If the processor element is on probation 640, the control logic follows direction 642 of the processor element that is not on probation and ignores or aborts the PIO operation, sends 644 a minor error interrupt to the processor elements, handles the minor error, and terminates 646 the method. If the processor element requesting the PIO is not on probation 640, the control logic follows direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method is complete 652.
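The duplex PIO flow of method 600 can be condensed into one decision function, a hedged sketch with illustrative names in which a request of None means the element never issued the PIO before the timeout:

```python
# Hypothetical condensation of the duplex PIO tie-break flow (method 600).
# req_a/req_b: the captured PIO requests, or None for a timed-out element.
# prob_a/prob_b: the probation bits read after the delay.

def pio_duplex_vote(req_a, req_b, prob_a, prob_b):
    if req_a is not None and req_a == req_b:
        return "perform", None              # agreement: no error
    if prob_a == prob_b:
        return "abort", "major"             # tie cannot be resolved
    winner = req_a if prob_b else req_b     # obey the non-probation element
    if winner is None:                      # the requester was on probation
        return "ignore", "minor"            # its PIO is ignored or aborted
    return "perform", "minor"               # perform the winner's PIO
```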
- a flow chart depicts an embodiment of an error handling method 700 in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration.
- the method 700 comprises detecting equivalent disparity 726 and 728 among processor element memories of the computing device, and responding to the detected equivalent disparity by evaluating 738 secondary considerations of processor fidelity.
- the method can further comprise determining whether evaluation 738 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 740 and 742. Delay can be inserted 736 between equivalent disparity detection and computing device operation termination 740, 742.
- control logic receives 702 a direct memory access (DMA) read request from a network agent such as a system area network (SAN) agent, replicates and forwards 704 the request to both processor elements, and starts 706 a timer. If a first response is received 708 , the first response is buffered 716 . In some embodiments, the timer may be restarted as the first response is buffered 716 . If the timer has not timed out 710 , whether the first response is received is again determined 708 . If the timer has timed out 710 , the operation is aborted 712 and a major error interrupt is sent to both processor elements. The major error interrupt is indicative of a double timeout. The method is then terminated 714 .
- the first response is received 708 and the first response is buffered 716. If the second response has been received 718, the first and second response data are compared 726. Otherwise, if the timer has not timed out 720, whether the second response has been received is again determined 718. If the timer has timed out 720, the operation is completed 722 using data from the first response and a minor error interrupt is sent to both processor elements. The single timeout condition is indicative of a “self-signaling error”. The method is then terminated 724.
- the second response is received 718 and data from the first and second responses is compared 726. If the first and second response data are equal 728, the data match and the operation is completed 730 with no error.
- the method completes successfully 732 . Otherwise, the first and second response data are unequal 728 and a “tie break pending” interrupt 734 is sent to the processor elements. A delay is inserted 736 to wait for a configured time. Probation bits are read to determine whether the probation bits are equal 738 . If so, the operation is aborted 740 since the tie cannot be resolved using the probation bits and a major error interrupt is sent to the processor elements. The method terminates unsuccessfully 742 .
- otherwise, the probation bits are not equal 738 and the operation is completed 744 using the response from the processor element that is not on probation.
- a minor error interrupt is sent 746 to the processor elements, the minor error is handled by marking the loser for removal or shutting down the offending processor element, and the method is terminated 748.
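The DMA-read flow of method 700 similarly condenses to one decision function (hypothetical names; a response of None marks a timed-out element):

```python
# Hypothetical condensation of the duplex DMA-read flow (method 700).
# resp_a/resp_b: buffered response data, or None for a timed-out element.

def dma_duplex_vote(resp_a, resp_b, prob_a, prob_b):
    if resp_a is None and resp_b is None:
        return None, "major"              # double timeout: abort
    if resp_a is None or resp_b is None:  # self-signaling single timeout
        return (resp_a if resp_a is not None else resp_b), "minor"
    if resp_a == resp_b:
        return resp_a, None               # data match: complete, no error
    if prob_a == prob_b:
        return None, "major"              # probation bits cannot break tie
    return (resp_a if prob_b else resp_b), "minor"   # obey non-probation side
```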
Abstract
In a redundant-processor computing device, an error handling method comprises detecting equivalent disparity among processor elements of the computing device and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
Description
- System availability, scalability, and data integrity are fundamental characteristics of enterprise systems. A continuous performance capability is imposed in financial, communication, and other fields that use enterprise systems for applications such as stock exchange transaction handling, credit and debit card systems, telephone networks, and the like. Highly reliable systems are often implemented in applications with high financial or human costs, in circumstances of massive scaling, and in conditions in which outages and data corruption cannot be tolerated.
- Some systems combine multiple redundant processors running the same operations so that an error in a single processor can be detected and/or corrected. Results attained by each of the processors can be mutually compared. If all results are the same, all processors are presumed, with high probability of correctness, to be functioning properly. However, if results differ, analysis is performed to determine which processor is operating incorrectly. Results from the multiple processors can be “voted”, with the “winning” result determined to be correct. For example, a system with three processor elements typically uses the result attained by two of the three processors.
- A difficulty arises for duplex systems with two executing processors since the even number of processor elements can result in a “voting tie” situation that may lead to aborted operation and outage. Ties can be avoided by running an odd number of processors, although a single processor does not have the fault detection capability provided by voting, and a system of three or more processors adds product cost.
- In accordance with an embodiment of a redundant-processor computing device, an error handling method comprises detecting equivalent disparity among processor elements of the computing device and responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
- Embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings:
-
FIG. 1 is a schematic block diagram that illustrates an embodiment of a control apparatus for usage in a redundant-processor computing device and having capability to resolve a mutual disparity or tie condition; -
FIG. 2 is a schematic block diagram depicting an embodiment of a computing system with capability to resolve disparity and break ties among a plurality of processor elements using a probation vector; -
FIG. 3 is a schematic block diagram illustrating an embodiment of a computing system configured in a redundant-processor arrangement that imposes a selected-duration short delay in the event of a disparity or tie condition; -
FIG. 4 is a schematic block diagram showing an embodiment of a processor complex within which the illustrative error handling system may be implemented; -
FIG. 5 is a schematic block diagram showing an embodiment of a computing system capable of detecting equivalent disparity among the processor elements and responding by evaluating secondary considerations of processor fidelity; -
FIG. 6 is a flow chart depicting an embodiment of an error handling method in a redundant-processor computing device that has tie-breaking capability during programmed input/output (PIO) voting in a duplex configuration; and -
FIG. 7 is a flow chart illustrating an embodiment of an error handling method in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration. - A processor may incorporate multiple, redundant, loosely-coupled processor elements for error detection. A duplex arrangement using two processor elements is susceptible to a “voting tie” situation. Ties may be avoided by using an odd number of processors, at the expense of fault detection capability if a single processor element is used, or of the added cost of incorporating additional processor elements. The illustrative system and method may use other information to resolve conflicts and break ties. Accordingly, an effective processor may be configured using only two processor elements for voting or comparison.
- Referring to
FIG. 1 , a schematic block diagram illustrates an embodiment of a control apparatus 100 for usage in a redundant-processor computing device 102. The control apparatus 100 is operative in a configuration with a plurality of processor elements 104A, 104B. The control apparatus 100 can be used to break ties in the case of voting error, for example with an even number of processor elements, using other available information. The control apparatus 100 includes a control element 106 that detects equivalent disparity among the processor elements 104A, 104B. - The
control element 106 determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements 104A, 104B and, if so, terminates computing device operations. - The
computing device 102 can be a computer processor that uses multiple, redundant, loosely-synchronized processor elements 104A, 104B. In a duplex configuration, the computing device 102 may have two processor elements 104A, 104B. - In some situations, other information, which may be called secondary considerations of fidelity, may be available to resolve the disparity and break the tie. The other information is heuristic data which is sufficiently predictive to be trusted for disparity resolution. If the tie cannot be broken by use of the other information, then the error is considered sufficiently serious that the processor is halted due to an inability to guarantee correctness of either of the unequal voted data items.
- Some embodiments may include a
control element 106 that evaluates the secondary conditions of processor fidelity while the processor elements 104A, 104B execute and sets a probation vector 108 according to the evaluation. For example, the probation vector 108 may be implemented in a voting unit 110 and used by the voting unit 110 to resolve disparities and break ties in predetermined conditions. In a particular example, the probation vector 108 can have one bit of state per processor element 104A, 104B. - The
voting unit 110, upon detecting a disparity or tie condition, may delay acting upon the condition or declaring a fatal error situation. Instead, the voting unit 110 can hold the compared values for a short-duration time period before acting. Accordingly, the control element 106 can interject a delay between equivalent disparity detection and termination of computing device operation. The delay enables control logic, for example software, to possibly detect other errors or gather information pertinent to resolving the disparity or breaking the tie. The delay can also break potential race conditions. For example, if a self-detectable error occurs simultaneously with or at nearly the same time as a misvote, the delay enables further collection of information or analysis before the voter declares the misvote condition, enabling recognition of the error and resolution of the vote. In a particular duplex embodiment, if the control logic sets one of the two bits in the probation vector 108 during the short delay imposed by the voting unit 110, then the voting unit 110 resolves the disparity or breaks the tie in favor of the processor element, either 104A or 104B, that is not on probation. While the condition remains an error condition, the error is made recoverable. If the control logic does not set the bit in the probation vector prior to or during the delay, the error is considered to be fatal to the computing device 102, and operation is halted. - In a particular embodiment, the
computing device 102 can include a logical synchronization unit 112 that contains the voting unit 110 and input/output interfaces coupling the voting unit 110 to the processor elements 104A, 104B. - Referring to
FIG. 2 , a schematic block diagram depicts an embodiment of a computing system 200 with capability to resolve disparity among a plurality of processor elements. The computing system 200 further includes a probation vector 204 coupled to the processor elements and holding a signal for each of the processor elements. A control element 206 is coupled to the processor elements and the probation vector 204, evaluates secondary considerations of processor fidelity during processor element execution, and sets the probation vector 204 according to results of the evaluation. - The
probation vector 204 is used to monitor secondary considerations of processor element fidelity before an error is detected, when more abundant information relating to processor element conditions and functionality is available. In contrast, a system that does not begin acquiring status information until an error is detected may have more limited functional capabilities, and a possible inability to perform actions diagnostic of processor element fidelity. Acquisition of status and operational information while the processor elements execute supplies information that can later be used to resolve disparities and break ties. - The
computing system 200 also includes a voter 208 that is coupled to the plurality of processor elements and compares actions taken by the processor elements. A control element 210 responds to disparity among the processor elements based on the probation vector 204. In some embodiments, the control element 210 can interject a delay between disparity detection and computing device operation termination to allow monitoring of additional information that may be useful in resolving disparity and breaking ties. The control element 210 can also determine whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements and, if so, terminate computing device operations. - Referring to
FIG. 3 , a schematic block diagram illustrates an embodiment of a computing system 300 including a plurality of processor elements and control logic 304 that imposes a selected-duration short delay in the event of a disparity or tie condition. The control logic 304 compares actions taken by the processor elements. In the event of a disparity, the control logic 304 may respond after the delay according to evaluation of secondary considerations of processor fidelity. - The selected delay has duration sufficient to enable near-simultaneous arrival of information for usage in resolving the disparity condition. The delay is imposed in case of processor disparity or tie, to enable simultaneous or near-simultaneous arrival of information that can be used in disparity resolution and/or tie-breaking. The delay has suitable duration to enable logic to receive a high-priority interrupt and perform a few computations, and is sufficiently long to enable information to be acquired at the same time as, or very close to, occurrence of an error. Typical delay duration, using current processor operating speeds and technology, is on the order of tens or hundreds of microseconds, sufficient to handle the interrupt and execute hundreds or thousands of instructions. The delay may assist in avoiding or breaking race conditions.
- The selected delay also has an upper limit. A lengthy examination of error state or diagnostic execution may not be acceptable. During the time between disparity and selection of the winning processor, the programmed input/output (PIO) operation or direct memory access (DMA) operation is suspended, possibly disrupting communications with other processors as well as communications on the interprocessor network due to backpressure. Such disruption is generally not desirable and is avoided by imposing an upper limit on the delay duration.
- Referring to
FIG. 4 , a schematic block diagram shows an embodiment of a processor complex 400 within which the illustrative error handling system may be implemented. In a redundant-processor arrangement, the processor complex 400 includes a plurality of logical processors 408, each a computing engine capable of executing processes and implemented using one or more processor elements 402, each in a different processor slice 410, combined with one or more logical synchronization units (LSUs) 412. Each processor element 402 is a single microprocessor or microprocessor core capable of executing a single instruction stream. A processor slice 410 typically comprises one or more processor elements 402, each with a dedicated memory 414 or sharing a partitioned memory. A processor complex 400 comprises one or more logical processors 408 and one or more processor slices 410. Within a complex 400, each slice 410 has the same number of processor elements. - A processor complex with one slice is called a simplex complex. A two-slice processor complex is called a duplex, dual modular redundant, or DMR complex. A three-slice processor complex is called a triplex, tri-modular redundant, or TMR complex. A processor complex 400 includes both processor elements 402 and corresponding logical synchronization units 412. - A computing system comprises one or more
logical processors 408. The computing system also comprises one or more processor complexes 400. The processor complexes 400 are interconnected via a network, for example a System Area Network (SAN), a local area network (LAN), a wide area network (WAN), or the like, or simply a connection to a bus. The network is used for connection both to other processors and to input/output (I/O) devices. Voting or output data comparison is performed for all data transfers between the logical processor and the network or the network I/O adapter. - In a
logical processor 408, one, two, three, or possibly more processor elements cooperate to perform logical processor operations. Cooperative actions include coordinating or synchronizing mutually among the processor elements, exchanging data, replicating input data, and voting on operations and output data selection. In the illustrative embodiment, the various cooperative actions can be implemented within, or supported by, the logical synchronization units 412. - Referring to
FIG. 5 , a schematic block diagram illustrates an embodiment of a computing system 500 comprising a plurality of processor elements and a voter 504 coupled to the plurality of processor elements. The computing system 500 further includes a control element 506 coupled to the processor elements and to the voter 504 that detects equivalent disparity among the processor elements. The illustrative computing system 500 may comprise a single logical processor of multiple logical processors in a complex or system. - The
computing system 500 may also include a logical synchronization unit 512 comprising the voter 504 and a network interface 520, such as a SAN interface, that can be configured to connect to the network 518 by one or more ports. For example, the network connection may be made, as shown in FIG. 5 , via an X-fiber port and a Y-fiber port, although wire ports may otherwise be used. - The
logical synchronization unit 512 maintains multiple bit sets that represent, at any point in time, the set of processor elements that are enabled to perform selected operations. The voter 504 includes a plurality of multiple-bit configuration registers that indicate which processor elements are expected and enabled to participate in selected operations, for example programmed input/output (PIO) operations with the processor elements, and network interface 520 DMA operations, including direct memory access (DMA) output voting and DMA input replication. - The voter configuration bits represent which processor elements are meant to be “assigned” to a logical processor and are therefore eligible to perform various operations such as output voting operations. Configured processor elements are defined, for any particular operation type, as the set of processor elements enabled by the configuration bits in the
logical synchronization unit 512 to perform that operation type. The logical synchronization unit 512 ignores operations of non-configured processor elements on an operation-type basis. Configuration bits are set whenever processor elements “join” the configuration of a logical processor 508. Configuration bits are reset or cleared when processor elements leave the configuration, for example by being voted out. - Participating processor elements are the subset of configured processor elements that, for a particular instance of an operation, actually perform the operation within the timeout period. The set of configured processor elements relates to a type of operation, while the set of participating processor elements relates to an individual instance of an operation. Outbound data transfers from configured processor elements are voted; the participating processor elements are the elements that actually issue a particular outbound data transfer.
- For a particular voted operation, all configured processor elements are expected to participate. A processor element that does not participate times out on the operation.
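The distinction between configured and participating processor elements can be sketched as one bit per processor element (PE). This illustrative Python fragment is not part of the patent disclosure; all names are hypothetical.

```python
# Hypothetical sketch: configured PEs are a per-operation-type bitmask;
# participating PEs are the configured PEs that actually issued a given
# operation instance. Operations from non-configured PEs are ignored.

def participating(configured: int, issued_by) -> int:
    """Bitmask of configured PEs that issued this operation instance."""
    mask = 0
    for pe in issued_by:
        mask |= 1 << pe
    return mask & configured  # non-configured issuers are masked away

def timed_out(configured: int, participants: int) -> int:
    """PEs that were expected (configured) but did not participate in time."""
    return configured & ~participants
```

For example, in a triplex configuration (mask 0b111) where only PEs 0 and 2 issue the operation, PE 1 is reported as timed out.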
- Generally, voted operations can result in conditions including full agreement, timeout, simple miscompare, tie miscompare, and full miscompare. In full agreement, all configured processor elements participate within a reasonable time and supply identical data for an identical operation, so that all voted data matches. In a timeout condition, one or more configured processor elements do not participate in time. For a simple miscompare, a strict majority of configured processor elements supplies identical data and a strict minority supplies different data. A strict majority is greater than half, for example one in simplex, two in duplex, and two or three in triplex. A tie miscompare occurs when an even number of processor elements, for example two, is configured and the voted data does not compare. For a full miscompare, all sets of voted data, for example all three in triplex, miscompare pair-wise so that no strict majority results. The different types of conditions are mapped into one of three error conditions: no error, minor error, and major error.
- For no error, the operation proceeds and no error is reported. For a minor error, the operation proceeds but an error is reported. Minor errors include timeouts or disagreements among voted data for which the available information is sufficient to enable resolution; examples are triplex configurations in which two processors are in agreement, and duplex configurations in disagreement when other information is available to resolve the disagreement. For a major error, the operation does not proceed but rather is aborted, and an error is reported. Major errors include timeouts and disagreements among voted data for which the condition cannot be resolved; examples are triplex configurations with a three-way disagreement or with timeouts in two of the three processors, and duplex configurations with disagreeing processors and no available information for resolution.
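The mapping from voted-operation outcomes to the three error conditions can be sketched as follows. This is an illustrative simplification, not the patent's implementation: it covers voted PIO-style operations without the tie-breaking extension, so a duplex miscompare or duplex timeout is treated as major.

```python
from collections import Counter

# Hypothetical sketch: `responses` maps PE index -> supplied data, with None
# for a PE that timed out. Returns one of the three error conditions above.

def classify(responses: dict) -> str:
    """Return 'no_error', 'minor', or 'major' for one voted operation."""
    n = len(responses)  # number of configured PEs
    supplied = [d for d in responses.values() if d is not None]
    if len(supplied) == n and len(set(supplied)) == 1:
        return "no_error"  # full agreement
    counts = Counter(supplied)
    majority = counts.most_common(1)[0][1] if counts else 0
    if majority > n // 2:  # strict majority of configured PEs agrees
        return "minor"  # e.g. simple miscompare, or single timeout in triplex
    return "major"  # tie miscompare, full miscompare, or too many timeouts
```

Note how the strict-majority test (`> n // 2`) yields two in duplex and two or three in triplex, matching the definition above.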
- In addition to timeouts and disagreement among redundant processors, other errors that may be handled using the illustrative techniques include breaks in cabling, for example between the processor elements and voter. Such errors can often be self-signaling. Self-signaling errors are defined as errors detected by an explicit detection element or mechanism as distinguished from implicit detection techniques such as voting. Various types of errors may be self-signaling. Examples of self-signaling errors include direct memory access (DMA) read timeouts, errors detected by parity or other error checking codes, loss-of-signal in optical signals, and loss of electrical continuity for electrical signals. With respect to the various illustrative systems, voting detects errors, but in a duplex configuration voting alone does not distinguish which voted data is correct and which incorrect. Self-signaling errors designate which of the two data suppliers is incorrect.
- The
processor elements may each have a dedicated memory 514 or share a partitioned memory with other processor elements in the same slice. - PIO operations are processor-initiated reads (loads) or writes (stores) to any part of the processor's address space other than “main memory”. The address space may contain registers, pseudo-registers, and memory other than main memory. PIO operations may be targeted to resources in either the
voter 504 or the network interface 520. Operations targeted to the voter 504 may be either unvoted or voted (compared), depending on the address of the register being accessed. Operations to the network interface 520 may always be voted in some implementations. - When any configured processor element initiates a voted PIO read, the
voter 504 captures the operation and read address and sets a timer. The voter 504 waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation and address are compared, for example by a bit-by-bit comparison of the entire operation and address. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. When the illustrative error handling and tie-breaking technique is not implemented or is disabled, operation can be described as follows. If all configured processor elements participate, then in the case of full agreement or simple miscompare the operation proceeds, ignoring miscompared data, if any; a simple miscompare is handled as a minor error. Otherwise (neither full agreement nor simple miscompare), the operation is aborted and not performed, a “bus error” is returned to all requesting processor elements, and a major error is reported. If the operation is not aborted, then if the PIO read is targeted to the network interface 520, the voter 504 forwards the operation and address to the network interface 520 and waits for a response. If the operation is targeted to the voter 504, the voter 504 accesses the data directly. The response data, when available, is replicated by the voter 504 and sent as a response to all participating processor elements at approximately the same instant. - When any configured processor element initiates a voted PIO write operation, the
voter 504 captures the operation, write address, and write data, then sets a timer and waits for all configured processor elements to perform the same operation. When all configured processor elements initiate the operation, the operation, write address, and write data are bit-by-bit compared. If one or more of the configured processor elements fails to initiate the operation within a configurable timeout period, a PIO timeout condition exists. When the illustrative error handling and tie-breaking technique is not implemented or is disabled, operation is as follows. If all configured processor elements participate, then in the case of full agreement or simple miscompare the operation proceeds, ignoring miscompared data; a simple miscompare is handled as a minor error. Otherwise (neither full agreement nor simple miscompare), the operation is aborted and not performed. No direct response is made to the processor element, because no response or acknowledgement is normally made to write operations. When the operation is aborted, a major error is reported. An operation that is not aborted is handled according to the target address of the write operation. For a PIO write to the network interface 520, the voter 504 forwards the operation, address, and data to the network interface 520. For a PIO write to the voter 504, the voter 504 performs the write operation directly. No response is made to the processor element. Because of the possible side-effects of PIO write operations, and because software does not necessarily verify the effect, or success, of each write operation, when a PIO write operation is aborted due to a major voting error the voter 504 suspends all future PIO write operations, which are also aborted, until the software detects the error and re-enables PIO voting. Typically the error is detected by handling an error interrupt.
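The write-suspension rule described above can be sketched as a small latch. This illustrative Python fragment is not from the patent; class and method names are hypothetical.

```python
# Hypothetical sketch: once a PIO write is aborted on a major voting error,
# the voter aborts all later PIO writes until software re-enables voting.

class PioWriteGate:
    def __init__(self):
        self.suspended = False

    def submit(self, vote_result: str) -> bool:
        """vote_result is 'agree', 'minor', or 'major'.
        Return True if the write proceeds, False if it is aborted."""
        if self.suspended:
            return False  # still suspended: abort without performing the write
        if vote_result == "major":
            self.suspended = True  # abort this write and latch the suspension
            return False
        return True  # full agreement or simple miscompare: write proceeds

    def reenable(self):
        """Called by software after handling the error interrupt."""
        self.suspended = False
```

The latch reflects the rationale above: a write may have side-effects, so after one aborted write the voter refuses further writes until software has observed the error.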
Note, however, that in the triplex case, when one processor element attempts a PIO write and the other two processor elements time out, no requirement is made to abort all future programmed I/O write operations. - PIO operations are initiated by the processor elements, as contrasted with DMA operations, which are initiated by the
network interface 520. Therefore, PIO timeouts are possible in two different circumstances. In the first circumstance, one or two processor elements, operating correctly, initiate a PIO operation, and one processor element, operating incorrectly or stopped, fails to perform the PIO operation. The error is detected when the timer expires. Without further information, which processor element is operating incorrectly may be indeterminable, for example when two processor elements are configured and one times out.
- In the triplex case, the strict majority is always trustworthy so that if one processor element times out and the remaining two processor elements perform a PIO operation, then the processor element that times out is in error and ignored, while the PIO operation proceeds and the data voted with a minor error indicated. Also in the triplex case, if two processor elements time out when a single processor element performs a PIO operation, then the processor element that performs the PIO operation is in error and the PIO is ignored. The lone processor element can be considered a “rogue processor element” and the attempted PIO operation is called a “rogue operation”.
- In the duplex case, whether the processor element performing the PIO or the processor element that is not participating is in error cannot be determined, without other evidence. Accordingly a tie or disparity condition exists.
- Some embodiments of the
computing system 500 further comprise a two-processor-element configuration interface 522 and a direct memory access (DMA) interface 524 coupled to the voter 504. An action disparity that is detectable by the control element 506 is a miscompare on voted data in which non-matching data is supplied by two processor elements. - Direct memory access (DMA) reads are outbound operations initiated by the
network interface 520. The voter 504 replicates and forwards the DMA read operation and address to all configured processor elements at approximately the same instant, subject to congestion delays in the different slices that may cause the operation to arrive at the processor elements at slightly different times. The voter 504 then starts a timer and waits for the responses. Response data flows from the processor elements, through the voter 504, to the network interface 520. Responses arriving from the configured processor elements are buffered in the voter 504 rather than being sent immediately to the network interface 520. Later responses, upon arrival, are compared with the earlier responses saved in the data buffers. When a strict majority of the responses agree, one copy of the data, from one of the agreeing responses, is communicated to the network interface 520. If a strict majority of the responses do not agree, then data is not sent to the network interface 520 in a manner that can be interpreted as valid data. - Any processor element that does not respond within the timeout period is declared to have timed out. Unlike PIO timeouts, no rogue situations can occur with DMA timeouts because the DMA operation is initiated through the
network interface 520. Accordingly, if one processor element times out, that processor element is necessarily erroneous in both the duplex and triplex cases. The condition is considered a minor error, and the DMA operation proceeds using data supplied by the processor element or elements that do not time out. If two processor elements time out, or if the only processor element times out in the simplex case, a major error is indicated and the DMA operation is aborted. The voter 504 generates an error notification interrupt to all configured processor elements in the case of any disagreement or timeout, whether or not data is successfully forwarded to the network interface 520. The interrupt indicates which processor elements, if any, timed out, together with all comparison results. - Direct memory access (DMA) writes are inbound operations with data flowing from the
network interface 520 through the voter 504 to the processor elements. The DMA write operation is initiated by the network interface 520. The voter 504 replicates and forwards the operation, address, and data to all configured slices at approximately the same instant. No response is made to the network interface 520.
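The DMA read voting and timeout behavior described above can be sketched as follows. This illustrative Python fragment is not from the patent; names and data values are hypothetical. A timed-out PE is self-signaling and therefore excluded before the majority test.

```python
from collections import Counter

# Hypothetical sketch: responses are buffered and compared, and one copy of
# strictly-majority data is forwarded to the network interface.

def vote_dma_read(responses: dict):
    """responses: PE index -> data, or None for a timed-out PE.
    Returns (data_to_network_or_None, severity)."""
    n = len(responses)
    supplied = [d for d in responses.values() if d is not None]
    timeouts = n - len(supplied)
    if not supplied or timeouts >= 2:
        return None, "major"  # simplex or double timeout: abort the operation
    data, votes = Counter(supplied).most_common(1)[0]
    if votes <= len(supplied) // 2:
        return None, "major"  # no strict majority among the responders
    if timeouts == 0 and votes == n:
        return data, "none"   # full agreement
    return data, "minor"      # lone (self-signaling) timeout or simple miscompare
```

A duplex miscompare with both PEs responding falls into the "major" branch here; resolving that case is the role of the tie-breaking mechanism described later.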
- For PIO timeouts, the timer is started when the first PIO operation arrives. The timer is restarted when each operation arrives, giving a full timeout period for the later arrivals relative to the earlier ones, a behavior used in all implementations due to the possibility of a PIO operation being initiated by a “rogue processor element”. PIO operations can never time out in the simplex case because the operation originates from the processor element.
- For DMA read response timeouts, DMA operations can time out in simplex, duplex, and triplex configurations because the operation is originated from the
network interface 520. The timer is started when the DMA request is forwarded by the voter 504 to the memories of all processor elements. The timer may optionally be restarted as each response arrives, giving a full timeout period for later responses; alternatively, a single timeout interval may be applied to all configured processor elements. Either option is possible because no rogue DMA operations can occur. - In an illustrative embodiment, special-case handling can be used when a PIO timeout or a miscompare on voted data occurs in a duplex configuration. In the illustrative example, duplex tie handling is generally inapplicable to DMA timeouts and to simplex or triplex configurations. The two disparity or tie conditions are a PIO timeout, in which one processor element performs a PIO operation and the other processor element does not, and a miscompare on voted data, in which data supplied by the two processor elements does not match, either on a PIO operation or on a DMA read operation.
- In the absence of any other information, a tie or disparity condition in a duplex configuration is ambiguous whereby the trustworthiness of each processor element is not obvious, leading to a typical policy of halting the logical processor.
- However, occasionally, other information, termed secondary considerations of processor fidelity, may exist. The other information may be considered sufficiently strong, even if circumstantial, to implicate one of the two processor elements. For example, a recent logged history of other detected recoverable errors may be indicative of a degradation of processor element reliability.
- Another example of pertinent reliability information is a recent history of processor replacement. A newly-reintegrated processor element or a new slice may be a more likely source of error than an element that has long been installed without a history of error. Such early life problems are frequently discovered within a short time, on the order of minutes, following installation.
- A further example of reliability information is inherent in the multiple-dimensional configuration of logical and physical processors, for example as shown in
FIG. 4 . Processor slices with multiple processor elements connected physically but not logically include hardware that is shared within a processor slice. Hardware may be shared among logical processors. When shared hardware ceases functioning correctly, errors such as intermittent errors can occur. If errors are detected in one logical processor but not another, information about the errors can be communicated between processor elements in a processor slice so that all processor elements within the slice have sufficient information to break any ties that occur. - Some embodiments of the
computing system 500 further comprise a probation vector 526 coupled to the voter 504 and to the processor elements. The probation vector 526 holds a signal for each processor element indicating whether the fidelity of that processor element is in doubt. The probation vector 526 enables implementations to supply the other information to the voter 504 in such a way that duplex tie conditions can be simply resolved. The probation vector 526 comprises one bit for each processor element, and the logical synchronization unit 512 uses the probation vector 526 as a tie-breaker in some conditions. - In an illustrative embodiment, PIO writes to the
probation vector 526 are not voted. The initial or reset value of the probation vector 526 is all zeros. Each processor element sets its own probation bit. The probation vector 526 is not used in simplex and triplex modes, for which the bits may still be set by the control logic but are ignored. - The
probation vector 526 is ignored if both processor slices agree, indicating no error. The probation vector 526 is also ignored in the case of self-signaling errors, which indicate the error source and therefore make tie-breaking superfluous; self-signaling errors are thus classified as minor errors. These errors occur when the voter 504 can be certain that one slice should be ignored, for example in an outbound DMA read response when one slice does not respond and times out, or when a slice supplies data marked as “known bad”. “Known bad” data, another example of a self-signaling error, is data returned from a processor element's memory or cache that generates a detected error, such as a parity or other detected error, when accessed.
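A register-level view of the probation vector can be sketched as follows. This illustrative Python fragment is not from the patent; class and method names are hypothetical.

```python
# Hypothetical sketch of the probation vector: one unvoted bit per processor
# element, reset to all zeros, written by each PE's own control logic.

class ProbationVector:
    def __init__(self, num_pes: int = 2):
        self.bits = 0  # initial/reset value: all zeros
        self.num_pes = num_pes

    def set(self, pe: int) -> None:
        self.bits |= 1 << pe  # control logic marks the PE's fidelity in doubt

    def clear(self, pe: int) -> None:
        self.bits &= ~(1 << pe)  # cleared after a policy-defined interval

    def on_probation(self, pe: int) -> bool:
        return bool(self.bits & (1 << pe))
```

Because writes are unvoted and independent, both bits may legitimately be set at the same time, as noted below.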
- In some circumstances, the probation bits may be set for all processor elements on a processor slice due to an error that places behavior of the entire slice in doubt. Accordingly, the control logic can propagate the probation bits from one processor element, where an error has been detected, to other processor elements in the slice by an implementation-defined technique. Examples of an implementation-defined system include exchanging probation bits via a register in a slice application-specific integrated circuit (ASIC), and/or using inter-processor, intra-slice interrupts.
- In some embodiments of the
computing system 500, the control element 506 interjects a delay between equivalent disparity detection and computing device operation termination. When a duplex, non-self-signaling error occurs, a delay is imposed prior to declaring the situation an error condition. During the delay period, the operation is held in limbo: neither completed nor aborted. The delay enables the tie to be broken by control logic setting a probation bit after the error occurs but before the timeout elapses. The delay does not occur, and therefore does not add any latency, in full agreement cases and in simplex and triplex cases.
- In an illustrative embodiment, when the delay period begins, the
logical synchronization unit 512 sends an interrupt to all participating slices to inform the slices that a miscomparison has occurred, although an error has not been declared. The interrupt is in addition to the interrupt that is generated at the end of the delay, when the voting error is final, either major or minor. - In a tie-breaker or disparity condition, if after the delay interval has elapsed, both probation bits are enabled or both probation bits are disabled, then a major error exists and the operation is aborted. If the bits are in opposite states, then the slice with the probation bit off or reset is obeyed, and the slice with the probation bit is on or set is ignored. If the probation bits are used to break the tie, then a minor error is declared.
- The policy for setting probation bits and duration that the bit setting is maintained is implementation-specific.
- In the illustrative embodiment, the
logical synchronization unit 512 reports all errors, both major and minor, to control logic, such as software, via an interrupt and a status register. In one implementation, status register bits indicate that the tie-break mechanism has been invoked and designate which processor element or slice is obeyed. - Referring to
FIG. 6 , a flow chart depicts an embodiment of an error handling method 600 in a redundant-processor computing device during programmed input/output (PIO) voting in a duplex configuration. The method 600 comprises detecting equivalent disparity among the processor elements. - The method can further comprise determining whether
evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements and, if so, terminating computing device operation. - In the
illustrative method 600, control logic receives 602 a programmed input/output (PIO) request from one processor element (PE), buffers 604 the request, and starts 606 a timer. If a second request is received 608, the two requests are compared 610. Otherwise, if the timer has not elapsed 612, whether the second request is received is determined 608. If the timer has elapsed 612, analysis of secondary considerations of processor fidelity begins. - In conditions that a second request is received 608, following
comparison 610 of the requests, if the requests match 614, the control logic performs thePIO operation 616 and no error is indicated, and the method is complete 618. Otherwise, an equivalent disparity condition exists in the form of a miscompare on voted data whereby command, address, or data supplied by two processor elements does not match on a programmed input/output (PIO) action. Operation, address, and, for a write operation, data are compared to determine a match condition. Secondary consideration analysis begins with the control logic sending 620 a “tie break pending” interrupt to the processor elements, and waiting 622 the configured time. Generally, the secondary conditions of processor fidelity can be evaluated during processor element execution, and a probation vector can be set according to the evaluation prior to determination of major or minor error. If probation bits are equal 624, the operation is aborted 626 since the tie or disparity condition cannot be resolved and a major error interrupt is sent to the processor elements. The method is terminated 628. - Otherwise, if the probation bits are not equal 624, the control logic follows
direction 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method completes 652 with a suitable minor error handling technique. In some embodiments, the minor error is addressed by marking the loser of the voting decision as no longer participating in the logical processor. Subsequent input/output operations or other accesses to the voter are ignored. In other embodiments, software processing in the loser is interrupted and software executing in remaining processor elements shuts down the offending processor element. - For analysis of secondary considerations of processor fidelity after the
timer elapses 612, an equivalent disparity condition occurs in which a first processor element performs a programmed input/output (PIO) action while a second processor element does not. Control logic sends 630 a “tie break pending” interrupt to the processor elements, and waits 632 the configured time. If probation bits are equal 634, the operation is aborted 636 since correct operation cannot be determined. A major error interrupt is sent to the processor elements. The method is terminated 638. - Otherwise, if the probation bits are not equal 634, control logic determines whether the processor element requesting the PIO operation is on
probation 640. If the processor element is onprobation 640, the control logic followsdirection 642 of the processor element that is not on probation and ignores or aborts the PIO operation, sends 644 a minor error interrupt to the processor elements, handles the minor error, and terminates 646 the method. If the processor element requesting the PIO is not onprobation 640, the control logic followsdirection 648 of the processor element that is not on probation, and the PIO operation specified by the non-probation processor element is performed. A minor error interrupt is sent 650 to the processor elements, and the method is complete 652. - Referring to
FIG. 7 , a flow chart depicts an embodiment of an error handling method 700 in a redundant-processor computing device during direct memory access (DMA) read voting in a duplex configuration. The method 700 comprises detecting equivalent disparity among the processor elements. - The method can further comprise determining whether
evaluation 738 of the secondary considerations is insufficient to resolve the disparity among the processor elements. If so, computing device operations are terminated 740 and 742. Delay can be inserted 736 between equivalent disparity detection and computing device operation termination. - In the
illustrative method 700, control logic receives 702 a direct memory access (DMA) read request from a network agent such as a system area network (SAN) agent, replicates and forwards 704 the request to both processor elements, and starts 706 a timer. If a first response is received 708, the first response is buffered 716. In some embodiments, the timer may be restarted as the first response is buffered 716. If the timer has not timed out 710, whether the first response is received is again determined 708. If the timer has timed out 710, the operation is aborted 712 and a major error interrupt is sent to both processor elements. The major error interrupt is indicative of a double timeout. The method is then terminated 714. - In conditions that the first response is received 708 and the first response is buffered 716, whether a second response has been received is determined 718. If the second response has been received 718, the first and second response data are compared 726. Otherwise, if the timer has not timed out 720, whether the second response has been received is again determined 718. If the timer has timed out 720, the operation is completed 722 using data from the first response and a minor error interrupt is sent to both processor elements. The single timeout condition is indicative of a “self-signaling error”. The method is then terminated 724.
- In conditions that the second response is received 718 and data from the first and second responses is compared 726, if the first and second response data are equal 728 the data match so that the operation is completed 730 with no error. The method completes successfully 732. Otherwise, the first and second response data are unequal 728 and a “tie break pending” interrupt 734 is sent to the processor elements. A delay is inserted 736 to wait for a configured time. Probation bits are read to determine whether the probation bits are equal 738. If so, the operation is aborted 740 since the tie cannot be resolved using the probation bits and a major error interrupt is sent to the processor elements. The method terminates unsuccessfully 742. Otherwise, the probation bits are not equal 738 and the operation is completed 744 using the response from the processor element that is not on probation. A minor error interrupt is sent 746 to the processor elements, the minor error is handled by marking the loser for removal or shutting down the offending processor element, and the method is terminated 748.
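The compare-and-tie-break step can be sketched the same way. Assuming hypothetical names for illustration: equal response data completes with no error, unequal data with equal probation bits is an unresolvable tie (abort, major error), and otherwise the operation completes with the response from the element that is not on probation (minor error).

```python
def vote_dma_read(data_a, data_b, probation_a, probation_b):
    """Compare buffered responses; break a mismatch with the probation bits."""
    if data_a == data_b:
        # Data match: complete the operation with no error.
        return ("complete", data_a, None)
    if probation_a == probation_b:
        # Equal probation bits cannot break the tie: abort.
        return ("abort", None, "major_error")
    # Complete using the response from the element that is NOT on probation.
    winner = data_b if probation_a else data_a
    return ("complete", winner, "minor_error")
```

Note that the probation bits serve here as the "secondary consideration of processor fidelity": the vote itself cannot say which element is wrong, so prior evidence of fidelity decides.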
- While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. For example, although the illustrative structures and methods are most applicable to multiple-processor systems in a duplex configuration, various aspects may be implemented in configurations with more or fewer processors. Furthermore, the illustrative embodiments depict particular arrangements of components. Any suitable arrangement of components may be used. The various operations performed may be implemented in any suitable manner, for example in hardware, software, firmware, or the like.
Claims (29)
1. A control apparatus for usage in a redundant-processor computing device including a plurality of processor elements, the control apparatus comprising:
a control element that detects equivalent disparity among the processor elements and responds by evaluating secondary considerations of processor fidelity.
2. The apparatus according to claim 1 further comprising:
a control element that determines whether evaluation of the secondary considerations is insufficient to resolve the equivalent disparity among the processor elements and, if so, terminates operations of the computing device.
3. The apparatus according to claim 2 further comprising:
a control element that interjects a delay between equivalent disparity detection and the evaluation of secondary considerations of processor fidelity.
4. The apparatus according to claim 1 further comprising:
a control element that determines whether the evaluation of the secondary considerations is sufficient to resolve the equivalent disparity among the processor elements and, if so, completes an operation according to the resolution.
5. The apparatus according to claim 1 further comprising:
a control element that evaluates the secondary conditions of processor fidelity and sets a probation vector according to the evaluation.
6. The apparatus according to claim 1 further comprising:
a processor element that evaluates the secondary conditions of processor fidelity and sets a probation vector according to the evaluation.
7. The apparatus according to claim 1 wherein an equivalent disparity condition comprises one or more of conditions including:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
8. An error handling method in a redundant-processor computing device comprising:
detecting equivalent disparity among processor elements of the computing device; and
responding to the detected equivalent disparity by evaluating secondary considerations of processor fidelity.
9. The method according to claim 8 further comprising:
determining whether the evaluation of the secondary considerations is insufficient to resolve the equivalent disparity among the processor elements; and
if so, terminating operations of the computing device.
10. The method according to claim 9 further comprising:
inserting a delay between equivalent disparity detection and termination of computer device operation.
11. The method according to claim 8 further comprising:
determining whether evaluation of the secondary considerations is sufficient to resolve the equivalent disparity among the processor elements; and
if so, completing an operation according to the resolution.
12. The method according to claim 8 further comprising:
evaluating the secondary conditions of processor fidelity; and
setting a probation vector according to the evaluation.
13. The method according to claim 8 wherein an equivalent disparity condition comprises one or more of conditions including:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
14. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement;
a voter coupled to the plurality of processor elements that compares actions taken by the processor elements and determines disparity in the actions; and
a control element coupled to the processor elements and the voter that detects equivalent disparity among the processor elements and responds by evaluating secondary considerations of processor fidelity.
15. The system according to claim 14 further comprising:
a two-processor element configuration; and
a programmed input/output (PIO) interface coupled to the voter whereby an action disparity that is detectable by the control element is a PIO timeout with one processor element performing a PIO action and one processor element not performing the PIO action.
16. The system according to claim 14 further comprising:
a two-processor element configuration; and
a programmed input/output (PIO) interface and a direct memory access (DMA) interface coupled to the voter whereby an action disparity that is detectable by the control element is a miscompare on voted data with non-matching data supplied by two processor elements either on a PIO action or a DMA action.
17. The system according to claim 14 further comprising:
a probation vector coupled to the voter and coupled to the processor elements and having a signal allocated to each of the processor elements; and
a control element that evaluates the secondary conditions of processor fidelity and sets the probation vector according to the secondary considerations of processor fidelity.
18. The system according to claim 14 further comprising:
a control element that determines whether evaluation of the secondary considerations is insufficient to resolve the disparity among the processor elements and, if so, terminates computing device operations.
19. The system according to claim 18 further comprising:
a control element that interjects a delay between equivalent disparity detection and evaluation of secondary considerations of processor fidelity.
20. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement;
a probation vector coupled to the processor elements and having a signal allocated to each of the processor elements; and
a logic that evaluates processor fidelity and sets the probation vector according to the evaluation.
21. The system according to claim 20 further comprising:
a control element that evaluates processor fidelity and sets a probation vector according to the evaluation.
22. The system according to claim 20 further comprising:
a processor element that evaluates processor fidelity and sets a probation vector according to the evaluation.
23. The system according to claim 20 further comprising:
a voter coupled to the plurality of processor elements that compares actions taken by the processor elements and determines disparity in the actions; and
a control element that responds to disparity among the processor elements based on the probation vector.
24. The system according to claim 23 further comprising:
a control element that interjects a delay between disparity detection and computer device operation termination.
25. The system according to claim 20 further comprising:
a control element that determines whether evaluation of processor fidelity is insufficient to resolve the disparity among the processor elements and, if so, terminates computing system operations.
26. The system according to claim 20 wherein for a two-processor system a disparity condition comprises one or more conditions selected from a group consisting of:
a condition of a first processor element performing a programmed input/output (PIO) action while a second processor element does not;
a condition of a first processor element performing a PIO read while a second processor element performs a PIO write;
a condition of a first processor and a second processor reading different addresses;
a condition of a first processor and a second processor writing different addresses; and
a miscompare on voted data whereby data supplied by two processor elements does not match on a programmed input/output (PIO) action or a direct memory access (DMA) action.
27. A computing system comprising:
a plurality of processor elements configured in a redundant-processor arrangement; and
control logic coupled to the processor element plurality that mutually compares actions taken by ones of the processor elements and determines equivalent disparity in the actions, and waits a selected delay after equivalent disparity detection before initiating an action responsive to the disparity condition.
28. The computing system according to claim 27 further comprising:
control logic that responds after the delay according to evaluation of secondary considerations of processor fidelity.
29. The computing system according to claim 27 wherein:
the selected delay has a duration sufficient to enable near-simultaneous arrival of information for usage in resolving the disparity condition.
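The delayed-response behavior recited in claims 27-29 can be illustrated with a short sketch. The names and the polling callback below are invented for illustration only: after equivalent disparity is detected, the control logic waits a configured interval so that near-simultaneous tie-breaking information (for example, probation bits or error self-reports) can arrive before it chooses between resolving the disparity and terminating operations.

```python
import time

def respond_to_disparity(poll_tie_break_info, delay_seconds):
    """Wait out the configured window, then act on whatever information arrived.

    poll_tie_break_info: callable returning secondary-consideration data,
    or None if nothing usable arrived during the delay.
    """
    time.sleep(delay_seconds)        # window for near-simultaneous arrival
    info = poll_tie_break_info()     # evaluate secondary considerations
    return "resolve" if info is not None else "terminate"
```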
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/045,401 US20050246581A1 (en) | 2004-03-30 | 2005-01-27 | Error handling system in a redundant processor |
CN 200610007181 CN1811722A (en) | 2005-01-27 | 2006-01-26 | Error handling system in a redundant processor |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US55781204P | 2004-03-30 | 2004-03-30 | |
US11/045,401 US20050246581A1 (en) | 2004-03-30 | 2005-01-27 | Error handling system in a redundant processor |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050246581A1 true US20050246581A1 (en) | 2005-11-03 |
Family
ID=35346428
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/953,242 Abandoned US20050240806A1 (en) | 2004-03-30 | 2004-09-28 | Diagnostic memory dump method in a redundant processor |
US10/990,151 Active 2025-09-20 US7890706B2 (en) | 2004-03-30 | 2004-11-16 | Delegated write for race avoidance in a processor |
US11/042,981 Expired - Fee Related US7434098B2 (en) | 2004-03-30 | 2005-01-25 | Method and system of determining whether a user program has made a system level call |
US11/045,401 Abandoned US20050246581A1 (en) | 2004-03-30 | 2005-01-27 | Error handling system in a redundant processor |
US11/071,944 Abandoned US20050223275A1 (en) | 2004-03-30 | 2005-03-04 | Performance data access |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/953,242 Abandoned US20050240806A1 (en) | 2004-03-30 | 2004-09-28 | Diagnostic memory dump method in a redundant processor |
US10/990,151 Active 2025-09-20 US7890706B2 (en) | 2004-03-30 | 2004-11-16 | Delegated write for race avoidance in a processor |
US11/042,981 Expired - Fee Related US7434098B2 (en) | 2004-03-30 | 2005-01-25 | Method and system of determining whether a user program has made a system level call |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/071,944 Abandoned US20050223275A1 (en) | 2004-03-30 | 2005-03-04 | Performance data access |
Country Status (2)
Country | Link |
---|---|
US (5) | US20050240806A1 (en) |
CN (2) | CN1690970A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060020850A1 (en) * | 2004-07-20 | 2006-01-26 | Jardine Robert L | Latent error detection |
US7047440B1 (en) * | 2004-07-27 | 2006-05-16 | Freydel Lev R | Dual/triple redundant computer system |
US20060212763A1 (en) * | 2005-03-17 | 2006-09-21 | Fujitsu Limited | Error notification method and information processing apparatus |
US20070174746A1 (en) * | 2005-12-20 | 2007-07-26 | Juerg Haefliger | Tuning core voltages of processors |
US20070220369A1 (en) * | 2006-02-21 | 2007-09-20 | International Business Machines Corporation | Fault isolation and availability mechanism for multi-processor system |
US20070283061A1 (en) * | 2004-08-06 | 2007-12-06 | Robert Bosch Gmbh | Method for Delaying Accesses to Date and/or Instructions of a Two-Computer System, and Corresponding Delay Unit |
US20080040582A1 (en) * | 2006-08-11 | 2008-02-14 | Fujitsu Limited | Data processing unit and data processing apparatus using data processing unit |
US20080165521A1 (en) * | 2007-01-09 | 2008-07-10 | Kerry Bernstein | Three-dimensional architecture for self-checking and self-repairing integrated circuits |
US20100205607A1 (en) * | 2009-02-11 | 2010-08-12 | Hewlett-Packard Development Company, L.P. | Method and system for scheduling tasks in a multi processor computing system |
US20100275065A1 (en) * | 2009-04-27 | 2010-10-28 | Honeywell International Inc. | Dual-dual lockstep processor assemblies and modules |
US20120137163A1 (en) * | 2009-08-19 | 2012-05-31 | Kentaro Sasagawa | Multi-core system, method of controlling multi-core system, and multiprocessor |
US20120317576A1 (en) * | 2009-12-15 | 2012-12-13 | Bernd Mueller | method for operating an arithmetic unit |
US20130007513A1 (en) * | 2010-03-23 | 2013-01-03 | Adrian Traskov | Redundant two-processor controller and control method |
US20130124922A1 (en) * | 2011-11-10 | 2013-05-16 | Ge Aviation Systems Llc | Method of providing high integrity processing |
CN103645953A (en) * | 2008-08-08 | 2014-03-19 | 亚马逊技术有限公司 | Providing executing programs with reliable access to non-local block data storage |
US20140373028A1 (en) * | 2013-06-18 | 2014-12-18 | Advanced Micro Devices, Inc. | Software Only Inter-Compute Unit Redundant Multithreading for GPUs |
US20180074888A1 (en) * | 2016-09-09 | 2018-03-15 | The Charles Stark Draper Laboratory, Inc. | Methods and systems for achieving trusted fault tolerance of a system of untrusted subsystems |
US20190034301A1 (en) * | 2017-07-31 | 2019-01-31 | Oracle International Corporation | System recovery using a failover processor |
US11372981B2 (en) | 2020-01-09 | 2022-06-28 | Rockwell Collins, Inc. | Profile-based monitoring for dual redundant systems |
Families Citing this family (70)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060020852A1 (en) * | 2004-03-30 | 2006-01-26 | Bernick David L | Method and system of servicing asynchronous interrupts in multiple processors executing a user program |
US20050240806A1 (en) * | 2004-03-30 | 2005-10-27 | Hewlett-Packard Development Company, L.P. | Diagnostic memory dump method in a redundant processor |
US7412545B2 (en) * | 2004-07-22 | 2008-08-12 | International Business Machines Corporation | Apparatus and method for updating I/O capability of a logically-partitioned computer system |
US7516359B2 (en) * | 2004-10-25 | 2009-04-07 | Hewlett-Packard Development Company, L.P. | System and method for using information relating to a detected loss of lockstep for determining a responsive action |
US7818614B2 (en) * | 2004-10-25 | 2010-10-19 | Hewlett-Packard Development Company, L.P. | System and method for reintroducing a processor module to an operating system after lockstep recovery |
US7627781B2 (en) * | 2004-10-25 | 2009-12-01 | Hewlett-Packard Development Company, L.P. | System and method for establishing a spare processor for recovering from loss of lockstep in a boot processor |
US7624302B2 (en) * | 2004-10-25 | 2009-11-24 | Hewlett-Packard Development Company, L.P. | System and method for switching the role of boot processor to a spare processor responsive to detection of loss of lockstep in a boot processor |
EP1807760B1 (en) * | 2004-10-25 | 2008-09-17 | Robert Bosch Gmbh | Data processing system with a variable clock speed |
US7502958B2 (en) * | 2004-10-25 | 2009-03-10 | Hewlett-Packard Development Company, L.P. | System and method for providing firmware recoverable lockstep protection |
US7383471B2 (en) * | 2004-12-28 | 2008-06-03 | Hewlett-Packard Development Company, L.P. | Diagnostic memory dumping |
JP4528144B2 (en) * | 2005-01-26 | 2010-08-18 | 富士通株式会社 | Memory dump program boot method, mechanism and program |
US20060190700A1 (en) * | 2005-02-22 | 2006-08-24 | International Business Machines Corporation | Handling permanent and transient errors using a SIMD unit |
WO2006090407A1 (en) * | 2005-02-23 | 2006-08-31 | Hewlett-Packard Development Company, L.P. | A method or apparatus for storing data in a computer system |
US7590885B2 (en) * | 2005-04-26 | 2009-09-15 | Hewlett-Packard Development Company, L.P. | Method and system of copying memory from a source processor to a target processor by duplicating memory writes |
DE102005037246A1 (en) * | 2005-08-08 | 2007-02-15 | Robert Bosch Gmbh | Method and device for controlling a computer system having at least two execution units and a comparison unit |
US8694621B2 (en) * | 2005-08-19 | 2014-04-08 | Riverbed Technology, Inc. | Capture, analysis, and visualization of concurrent system and network behavior of an application |
JP4645837B2 (en) * | 2005-10-31 | 2011-03-09 | 日本電気株式会社 | Memory dump method, computer system, and program |
US7627584B2 (en) * | 2005-11-30 | 2009-12-01 | Oracle International Corporation | Database system configured for automatic failover with no data loss |
US7668879B2 (en) * | 2005-11-30 | 2010-02-23 | Oracle International Corporation | Database system configured for automatic failover with no data loss |
US20070124522A1 (en) * | 2005-11-30 | 2007-05-31 | Ellison Brandon J | Node detach in multi-node system |
US7496786B2 (en) * | 2006-01-10 | 2009-02-24 | Stratus Technologies Bermuda Ltd. | Systems and methods for maintaining lock step operation |
US8127099B2 (en) * | 2006-12-26 | 2012-02-28 | International Business Machines Corporation | Resource recovery using borrowed blocks of memory |
US7743285B1 (en) * | 2007-04-17 | 2010-06-22 | Hewlett-Packard Development Company, L.P. | Chip multiprocessor with configurable fault isolation |
US20080263391A1 (en) * | 2007-04-20 | 2008-10-23 | International Business Machines Corporation | Apparatus, System, and Method For Adapter Card Failover |
US20080270653A1 (en) * | 2007-04-26 | 2008-10-30 | Balle Susanne M | Intelligent resource management in multiprocessor computer systems |
JP4838226B2 (en) * | 2007-11-20 | 2011-12-14 | 富士通株式会社 | Network logging processing program, information processing system, and network logging information automatic saving method |
DE102007062974B4 (en) * | 2007-12-21 | 2010-04-08 | Phoenix Contact Gmbh & Co. Kg | Signal processing device |
JP5309703B2 (en) * | 2008-03-07 | 2013-10-09 | 日本電気株式会社 | Shared memory control circuit, control method, and control program |
US7991933B2 (en) * | 2008-06-25 | 2011-08-02 | Dell Products L.P. | Synchronizing processors when entering system management mode |
JP5507830B2 (en) * | 2008-11-04 | 2014-05-28 | ルネサスエレクトロニクス株式会社 | Microcontroller and automobile control device |
US8429633B2 (en) * | 2008-11-21 | 2013-04-23 | International Business Machines Corporation | Managing memory to support large-scale interprocedural static analysis for security problems |
CN101782862B (en) * | 2009-01-16 | 2013-03-13 | 鸿富锦精密工业(深圳)有限公司 | Processor distribution control system and control method thereof |
US8631208B2 (en) * | 2009-01-27 | 2014-01-14 | Intel Corporation | Providing address range coherency capability to a device |
TWI448847B (en) * | 2009-02-27 | 2014-08-11 | Foxnum Technology Co Ltd | Processor distribution control system and control method |
CN101840390B (en) * | 2009-03-18 | 2012-05-23 | 中国科学院微电子研究所 | Hardware synchronous circuit structure suitable for multiprocessor system and implement method thereof |
US8364862B2 (en) * | 2009-06-11 | 2013-01-29 | Intel Corporation | Delegating a poll operation to another device |
US8479042B1 (en) * | 2010-11-01 | 2013-07-02 | Xilinx, Inc. | Transaction-level lockstep |
TWI447574B (en) * | 2010-12-27 | 2014-08-01 | Ibm | Method,computer readable medium, appliance,and system for recording and prevevting crash in an appliance |
US8635492B2 (en) * | 2011-02-15 | 2014-01-21 | International Business Machines Corporation | State recovery and lockstep execution restart in a system with multiprocessor pairing |
US8930752B2 (en) | 2011-02-15 | 2015-01-06 | International Business Machines Corporation | Scheduler for multiprocessor system switch with selective pairing |
US8671311B2 (en) | 2011-02-15 | 2014-03-11 | International Business Machines Corporation | Multiprocessor switch with selective pairing |
EP2701063A4 (en) * | 2011-04-22 | 2014-05-07 | Fujitsu Ltd | Information processing device and information processing device processing method |
US8554726B2 (en) * | 2011-06-01 | 2013-10-08 | Clustrix, Inc. | Systems and methods for reslicing data in a relational database |
DE102012010143B3 (en) | 2012-05-24 | 2013-11-14 | Phoenix Contact Gmbh & Co. Kg | Analog signal input circuit with a number of analog signal acquisition channels |
JP5601353B2 (en) * | 2012-06-29 | 2014-10-08 | 横河電機株式会社 | Network management system |
GB2508344A (en) | 2012-11-28 | 2014-06-04 | Ibm | Creating an operating system dump |
JP6175958B2 (en) | 2013-07-26 | 2017-08-09 | 富士通株式会社 | MEMORY DUMP METHOD, PROGRAM, AND INFORMATION PROCESSING DEVICE |
US9251014B2 (en) | 2013-08-08 | 2016-02-02 | International Business Machines Corporation | Redundant transactions for detection of timing sensitive errors |
JP6221702B2 (en) * | 2013-12-05 | 2017-11-01 | 富士通株式会社 | Information processing apparatus, information processing method, and information processing program |
WO2015116057A1 (en) | 2014-01-29 | 2015-08-06 | Hewlett-Packard Development Company, L.P. | Dumping resources |
US9710273B2 (en) | 2014-11-21 | 2017-07-18 | Oracle International Corporation | Method for migrating CPU state from an inoperable core to a spare core |
CN104699550B (en) * | 2014-12-05 | 2017-09-12 | 中国航空工业集团公司第六三一研究所 | A kind of error recovery method based on lockstep frameworks |
US9411363B2 (en) * | 2014-12-10 | 2016-08-09 | Intel Corporation | Synchronization in a computing device |
JP2016170463A (en) * | 2015-03-11 | 2016-09-23 | 富士通株式会社 | Information processing device, kernel dump method, and kernel dump program |
DE102015218898A1 (en) * | 2015-09-30 | 2017-03-30 | Robert Bosch Gmbh | Method for the redundant processing of data |
US10067763B2 (en) * | 2015-12-11 | 2018-09-04 | International Business Machines Corporation | Handling unaligned load operations in a multi-slice computer processor |
US9971650B2 (en) | 2016-06-06 | 2018-05-15 | International Business Machines Corporation | Parallel data collection and recovery for failing virtual computer processing system |
US10579536B2 (en) * | 2016-08-09 | 2020-03-03 | Arizona Board Of Regents On Behalf Of Arizona State University | Multi-mode radiation hardened multi-core microprocessors |
US10521327B2 (en) | 2016-09-29 | 2019-12-31 | 2236008 Ontario Inc. | Non-coupled software lockstep |
GB2555628B (en) * | 2016-11-04 | 2019-02-20 | Advanced Risc Mach Ltd | Main processor error detection using checker processors |
US10740167B2 (en) * | 2016-12-07 | 2020-08-11 | Electronics And Telecommunications Research Institute | Multi-core processor and cache management method thereof |
TW202301125A (en) | 2017-07-30 | 2023-01-01 | 埃拉德 希提 | Memory chip with a memory-based distributed processor architecture |
JP7099050B2 (en) * | 2018-05-29 | 2022-07-12 | セイコーエプソン株式会社 | Circuits, electronic devices and mobiles |
US10901878B2 (en) * | 2018-12-19 | 2021-01-26 | International Business Machines Corporation | Reduction of pseudo-random test case generation overhead |
US11221899B2 (en) * | 2019-09-24 | 2022-01-11 | Arm Limited | Efficient memory utilisation in a processing cluster having a split mode and a lock mode |
US10977168B1 (en) * | 2019-12-26 | 2021-04-13 | Anthem, Inc. | Automation testing tool framework |
CN111123792B (en) * | 2019-12-29 | 2021-07-02 | 苏州浪潮智能科技有限公司 | Multi-main-system interactive communication and management method and device |
US11645185B2 (en) * | 2020-09-25 | 2023-05-09 | Intel Corporation | Detection of faults in performance of micro instructions |
US20230066835A1 (en) * | 2021-08-27 | 2023-03-02 | Keysight Technologies, Inc. | Methods, systems and computer readable media for improving remote direct memory access performance |
KR20230034646A (en) * | 2021-09-03 | 2023-03-10 | 에스케이하이닉스 주식회사 | Memory system and operation method thereof |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5423024A (en) * | 1991-05-06 | 1995-06-06 | Stratus Computer, Inc. | Fault tolerant processing section with dynamically reconfigurable voting |
US6199171B1 (en) * | 1998-06-26 | 2001-03-06 | International Business Machines Corporation | Time-lag duplexing techniques |
US6247143B1 (en) * | 1998-06-30 | 2001-06-12 | Sun Microsystems, Inc. | I/O handling for a multiprocessor computer system |
US20020152418A1 (en) * | 2001-04-11 | 2002-10-17 | Gerry Griffin | Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep |
US20030149909A1 (en) * | 2001-10-01 | 2003-08-07 | International Business Machines Corporation | Halting execution of duplexed commands |
US6704887B2 (en) * | 2001-03-08 | 2004-03-09 | The United States Of America As Represented By The Secretary Of The Air Force | Method and apparatus for improved security in distributed-environment voting |
US6820213B1 (en) * | 2000-04-13 | 2004-11-16 | Stratus Technologies Bermuda, Ltd. | Fault-tolerant computer system with voter delay buffer |
US6948092B2 (en) * | 1998-12-10 | 2005-09-20 | Hewlett-Packard Development Company, L.P. | System recovery from errors for processor and associated components |
US7231543B2 (en) * | 2004-01-14 | 2007-06-12 | Hewlett-Packard Development Company, L.P. | Systems and methods for fault-tolerant processing with processor regrouping based on connectivity conditions |
Family Cites Families (73)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3665404A (en) | 1970-04-09 | 1972-05-23 | Burroughs Corp | Multi-processor processing system having interprocessor interrupt apparatus |
US4228496A (en) | 1976-09-07 | 1980-10-14 | Tandem Computers Incorporated | Multiprocessor system |
US4293921A (en) | 1979-06-15 | 1981-10-06 | Martin Marietta Corporation | Method and signal processor for frequency analysis of time domain signals |
US4481578A (en) * | 1982-05-21 | 1984-11-06 | Pitney Bowes Inc. | Direct memory access data transfer system for use with plural processors |
JPS61253572A (en) * | 1985-05-02 | 1986-11-11 | Hitachi Ltd | Load distributing system for loose coupling multi-processor system |
US4733353A (en) | 1985-12-13 | 1988-03-22 | General Electric Company | Frame synchronization of multiply redundant computers |
JP2695157B2 (en) * | 1986-12-29 | 1997-12-24 | 松下電器産業株式会社 | Variable pipeline processor |
EP0306211A3 (en) * | 1987-09-04 | 1990-09-26 | Digital Equipment Corporation | Synchronized twin computer system |
AU616213B2 (en) * | 1987-11-09 | 1991-10-24 | Tandem Computers Incorporated | Method and apparatus for synchronizing a plurality of processors |
CA2003338A1 (en) | 1987-11-09 | 1990-06-09 | Richard W. Cutts, Jr. | Synchronization of fault-tolerant computer system having multiple processors |
JP2644780B2 (en) * | 1987-11-18 | 1997-08-25 | 株式会社日立製作所 | Parallel computer with processing request function |
GB8729901D0 (en) * | 1987-12-22 | 1988-02-03 | Lucas Ind Plc | Dual computer cross-checking system |
JPH0797328B2 (en) | 1988-10-25 | 1995-10-18 | インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン | False tolerant synchronization system |
US4965717A (en) | 1988-12-09 | 1990-10-23 | Tandem Computers Incorporated | Multiple processor system having shared memory with private-write capability |
US5369767A (en) | 1989-05-17 | 1994-11-29 | International Business Machines Corp. | Servicing interrupt requests in a data processing system without using the services of an operating system |
US5295258A (en) * | 1989-12-22 | 1994-03-15 | Tandem Computers Incorporated | Fault-tolerant computer system with online recovery and reintegration of redundant components |
US5317752A (en) * | 1989-12-22 | 1994-05-31 | Tandem Computers Incorporated | Fault-tolerant computer system with auto-restart after power-fall |
US5291608A (en) | 1990-02-13 | 1994-03-01 | International Business Machines Corporation | Display adapter event handler with rendering context manager |
US5111384A (en) * | 1990-02-16 | 1992-05-05 | Bull Hn Information Systems Inc. | System for performing dump analysis |
DK0532582T3 (en) * | 1990-06-01 | 1996-01-29 | Du Pont | Composite orthopedic implant with varying modulus of elasticity |
US5226152A (en) * | 1990-12-07 | 1993-07-06 | Motorola, Inc. | Functional lockstep arrangement for redundant processors |
US5295259A (en) * | 1991-02-05 | 1994-03-15 | Advanced Micro Devices, Inc. | Data cache and method for handling memory errors during copy-back |
US5339404A (en) | 1991-05-28 | 1994-08-16 | International Business Machines Corporation | Asynchronous TMR processing system |
JPH05128080A (en) * | 1991-10-14 | 1993-05-25 | Mitsubishi Electric Corp | Information processor |
US5613127A (en) | 1992-08-17 | 1997-03-18 | Honeywell Inc. | Separately clocked processor synchronization improvement |
US6233702B1 (en) | 1992-12-17 | 2001-05-15 | Compaq Computer Corporation | Self-checked, lock step processor pairs |
US5535397A (en) | 1993-06-30 | 1996-07-09 | Intel Corporation | Method and apparatus for providing a context switch in response to an interrupt in a computer process |
US5572620A (en) | 1993-07-29 | 1996-11-05 | Honeywell Inc. | Fault-tolerant voter system for output data from a plurality of non-synchronized redundant processors |
US5504859A (en) | 1993-11-09 | 1996-04-02 | International Business Machines Corporation | Data processor with enhanced error recovery |
EP0986007A3 (en) | 1993-12-01 | 2001-11-07 | Marathon Technologies Corporation | Method of isolating I/O requests |
JP3481737B2 (en) * | 1995-08-07 | 2003-12-22 | 富士通株式会社 | Dump collection device and dump collection method |
US6449730B2 (en) | 1995-10-24 | 2002-09-10 | Seachange Technology, Inc. | Loosely coupled mass storage computer cluster |
US5999933A (en) * | 1995-12-14 | 1999-12-07 | Compaq Computer Corporation | Process and apparatus for collecting a data structure of a memory dump into a logical table |
US5850555A (en) | 1995-12-19 | 1998-12-15 | Advanced Micro Devices, Inc. | System and method for validating interrupts before presentation to a CPU |
US6141769A (en) | 1996-05-16 | 2000-10-31 | Resilience Corporation | Triple modular redundant computer system and associated method |
GB9617033D0 (en) * | 1996-08-14 | 1996-09-25 | Int Computers Ltd | Diagnostic memory access |
US5790397A (en) | 1996-09-17 | 1998-08-04 | Marathon Technologies Corporation | Fault resilient/fault tolerant computing |
US5796939A (en) | 1997-03-10 | 1998-08-18 | Digital Equipment Corporation | High frequency sampling of processor performance counters |
US5903717A (en) * | 1997-04-02 | 1999-05-11 | General Dynamics Information Systems, Inc. | Fault tolerant computer system |
US5896523A (en) | 1997-06-04 | 1999-04-20 | Marathon Technologies Corporation | Loosely-coupled, synchronized execution |
WO1999026133A2 (en) | 1997-11-14 | 1999-05-27 | Marathon Technologies Corporation | Method for maintaining the synchronized execution in fault resilient/fault tolerant computer systems |
US6173356B1 (en) * | 1998-02-20 | 2001-01-09 | Silicon Aquarius, Inc. | Multi-port DRAM with integrated SRAM and systems and methods using the same |
US6141635A (en) * | 1998-06-12 | 2000-10-31 | Unisys Corporation | Method of diagnosing faults in an emulated computer system via a heterogeneous diagnostic program |
US5991900A (en) * | 1998-06-15 | 1999-11-23 | Sun Microsystems, Inc. | Bus controller |
US6223304B1 (en) | 1998-06-18 | 2001-04-24 | Telefonaktiebolaget Lm Ericsson (Publ) | Synchronization of processors in a fault tolerant multi-processor system |
US6314501B1 (en) * | 1998-07-23 | 2001-11-06 | Unisys Corporation | Computer system and method for operating multiple operating systems in different partitions of the computer system and for allowing the different partitions to communicate with one another through shared memory |
US6195715B1 (en) | 1998-11-13 | 2001-02-27 | Creative Technology Ltd. | Interrupt control for multiple programs communicating with a common interrupt by associating programs to GP registers, defining interrupt register, polling GP registers, and invoking callback routine associated with defined interrupt register |
US6263373B1 (en) * | 1998-12-04 | 2001-07-17 | International Business Machines Corporation | Data processing system and method for remotely controlling execution of a processor utilizing a test access port |
US6393582B1 (en) | 1998-12-10 | 2002-05-21 | Compaq Computer Corporation | Error self-checking and recovery using lock-step processor pair architecture |
US6449732B1 (en) | 1998-12-18 | 2002-09-10 | Triconex Corporation | Method and apparatus for processing control using a multiple redundant processor control system |
US6543010B1 (en) * | 1999-02-24 | 2003-04-01 | Hewlett-Packard Development Company, L.P. | Method and apparatus for accelerating a memory dump |
US6397365B1 (en) | 1999-05-18 | 2002-05-28 | Hewlett-Packard Company | Memory error correction using redundant sliced memory and standard ECC mechanisms |
US6658654B1 (en) | 2000-07-06 | 2003-12-02 | International Business Machines Corporation | Method and system for low-overhead measurement of per-thread performance information in a multithreaded environment |
EP1213650A3 (en) * | 2000-08-21 | 2006-08-30 | Texas Instruments France | Priority arbitration based on current task and MMU |
US6604177B1 (en) | 2000-09-29 | 2003-08-05 | Hewlett-Packard Development Company, L.P. | Communication of dissimilar data between lock-stepped processors |
US6604717B2 (en) * | 2000-11-15 | 2003-08-12 | Stanfield Mccoy J. | Bag holder |
US7017073B2 (en) * | 2001-02-28 | 2006-03-21 | International Business Machines Corporation | Method and apparatus for fault-tolerance via dual thread crosschecking |
US7065672B2 (en) * | 2001-03-28 | 2006-06-20 | Stratus Technologies Bermuda Ltd. | Apparatus and methods for fault-tolerant computing using a switching fabric |
US6971043B2 (en) | 2001-04-11 | 2005-11-29 | Stratus Technologies Bermuda Ltd | Apparatus and method for accessing a mass storage device in a fault-tolerant server |
US7207041B2 (en) * | 2001-06-28 | 2007-04-17 | Tranzeo Wireless Technologies, Inc. | Open platform architecture for shared resource access management |
US7076510B2 (en) * | 2001-07-12 | 2006-07-11 | Brown William P | Software raid methods and apparatuses including server usage based write delegation |
US6754763B2 (en) * | 2001-07-30 | 2004-06-22 | Axis Systems, Inc. | Multi-board connection system for use in electronic design automation |
US7194671B2 (en) * | 2001-12-31 | 2007-03-20 | Intel Corporation | Mechanism handling race conditions in FRC-enabled processors |
US6687799B2 (en) * | 2002-01-31 | 2004-02-03 | Hewlett-Packard Development Company, L.P. | Expedited memory dumping and reloading of computer processors |
US7076397B2 (en) | 2002-10-17 | 2006-07-11 | Bmc Software, Inc. | System and method for statistical performance monitoring |
US6983337B2 (en) | 2002-12-18 | 2006-01-03 | Intel Corporation | Method, system, and program for handling device interrupts |
US7526757B2 (en) | 2004-01-14 | 2009-04-28 | International Business Machines Corporation | Method and apparatus for maintaining performance monitoring structures in a page table for use in monitoring performance of a computer program |
JP2005259030A (en) | 2004-03-15 | 2005-09-22 | Sharp Corp | Performance evaluation device, performance evaluation method, program, and computer-readable storage medium |
US7162666B2 (en) | 2004-03-26 | 2007-01-09 | Emc Corporation | Multi-processor system having a watchdog for interrupting the multiple processors and deferring preemption until release of spinlocks |
US20050240806A1 (en) | 2004-03-30 | 2005-10-27 | Hewlett-Packard Development Company, L.P. | Diagnostic memory dump method in a redundant processor |
US7308605B2 (en) | 2004-07-20 | 2007-12-11 | Hewlett-Packard Development Company, L.P. | Latent error detection |
US7380171B2 (en) * | 2004-12-06 | 2008-05-27 | Microsoft Corporation | Controlling software failure data reporting and responses |
US7328331B2 (en) | 2005-01-25 | 2008-02-05 | Hewlett-Packard Development Company, L.P. | Method and system of aligning execution point of duplicate copies of a user program by copying memory stores |
- 2004
- 2004-09-28 US US10/953,242 patent/US20050240806A1/en not_active Abandoned
- 2004-11-16 US US10/990,151 patent/US7890706B2/en active Active
- 2005
- 2005-01-25 US US11/042,981 patent/US7434098B2/en not_active Expired - Fee Related
- 2005-01-27 US US11/045,401 patent/US20050246581A1/en not_active Abandoned
- 2005-03-04 US US11/071,944 patent/US20050223275A1/en not_active Abandoned
- 2005-03-30 CN CN200510079205.4A patent/CN1690970A/en active Pending
- 2005-03-30 CN CN200510079206.9A patent/CN100472456C/en active Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5423024A (en) * | 1991-05-06 | 1995-06-06 | Stratus Computer, Inc. | Fault tolerant processing section with dynamically reconfigurable voting |
US6199171B1 (en) * | 1998-06-26 | 2001-03-06 | International Business Machines Corporation | Time-lag duplexing techniques |
US6247143B1 (en) * | 1998-06-30 | 2001-06-12 | Sun Microsystems, Inc. | I/O handling for a multiprocessor computer system |
US6948092B2 (en) * | 1998-12-10 | 2005-09-20 | Hewlett-Packard Development Company, L.P. | System recovery from errors for processor and associated components |
US6820213B1 (en) * | 2000-04-13 | 2004-11-16 | Stratus Technologies Bermuda, Ltd. | Fault-tolerant computer system with voter delay buffer |
US6704887B2 (en) * | 2001-03-08 | 2004-03-09 | The United States Of America As Represented By The Secretary Of The Air Force | Method and apparatus for improved security in distributed-environment voting |
US20020152418A1 (en) * | 2001-04-11 | 2002-10-17 | Gerry Griffin | Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep |
US6928583B2 (en) * | 2001-04-11 | 2005-08-09 | Stratus Technologies Bermuda Ltd. | Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep |
US20030196025A1 (en) * | 2001-10-01 | 2003-10-16 | International Business Machines Corporation | Synchronizing processing of commands invoked against duplexed coupling facility structures |
US6615373B2 (en) * | 2001-10-01 | 2003-09-02 | International Business Machines Corporation | Method, system and program products for resolving potential deadlocks |
US20030149920A1 (en) * | 2001-10-01 | 2003-08-07 | International Business Machines Corporation | Method, system and program products for resolving potential deadlocks |
US20030149909A1 (en) * | 2001-10-01 | 2003-08-07 | International Business Machines Corporation | Halting execution of duplexed commands |
US7231543B2 (en) * | 2004-01-14 | 2007-06-12 | Hewlett-Packard Development Company, L.P. | Systems and methods for fault-tolerant processing with processor regrouping based on connectivity conditions |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7308605B2 (en) * | 2004-07-20 | 2007-12-11 | Hewlett-Packard Development Company, L.P. | Latent error detection |
US20060020850A1 (en) * | 2004-07-20 | 2006-01-26 | Jardine Robert L | Latent error detection |
US7047440B1 (en) * | 2004-07-27 | 2006-05-16 | Freydel Lev R | Dual/triple redundant computer system |
US20070283061A1 (en) * | 2004-08-06 | 2007-12-06 | Robert Bosch Gmbh | Method for delaying accesses to data and/or instructions of a two-computer system, and corresponding delay unit |
US20060212763A1 (en) * | 2005-03-17 | 2006-09-21 | Fujitsu Limited | Error notification method and information processing apparatus |
US7584388B2 (en) * | 2005-03-17 | 2009-09-01 | Fujitsu Limited | Error notification method and information processing apparatus |
US20070174746A1 (en) * | 2005-12-20 | 2007-07-26 | Juerg Haefliger | Tuning core voltages of processors |
US7516358B2 (en) | 2005-12-20 | 2009-04-07 | Hewlett-Packard Development Company, L.P. | Tuning core voltages of processors |
US20070220369A1 (en) * | 2006-02-21 | 2007-09-20 | International Business Machines Corporation | Fault isolation and availability mechanism for multi-processor system |
US7765383B2 (en) * | 2006-08-11 | 2010-07-27 | Fujitsu Semiconductor Limited | Data processing unit and data processing apparatus using data processing unit |
US20080040582A1 (en) * | 2006-08-11 | 2008-02-14 | Fujitsu Limited | Data processing unit and data processing apparatus using data processing unit |
US20080165521A1 (en) * | 2007-01-09 | 2008-07-10 | Kerry Bernstein | Three-dimensional architecture for self-checking and self-repairing integrated circuits |
CN103645953A (en) * | 2008-08-08 | 2014-03-19 | 亚马逊技术有限公司 | Providing executing programs with reliable access to non-local block data storage |
US20100205607A1 (en) * | 2009-02-11 | 2010-08-12 | Hewlett-Packard Development Company, L.P. | Method and system for scheduling tasks in a multi processor computing system |
US8875142B2 (en) * | 2009-02-11 | 2014-10-28 | Hewlett-Packard Development Company, L.P. | Job scheduling on a multiprocessing system based on reliability and performance rankings of processors and weighted effect of detected errors |
US20100275065A1 (en) * | 2009-04-27 | 2010-10-28 | Honeywell International Inc. | Dual-dual lockstep processor assemblies and modules |
US7979746B2 (en) * | 2009-04-27 | 2011-07-12 | Honeywell International Inc. | Dual-dual lockstep processor assemblies and modules |
US20120137163A1 (en) * | 2009-08-19 | 2012-05-31 | Kentaro Sasagawa | Multi-core system, method of controlling multi-core system, and multiprocessor |
US8719628B2 (en) * | 2009-08-19 | 2014-05-06 | Nec Corporation | Multi-core system, method of controlling multi-core system, and multiprocessor |
US20120317576A1 (en) * | 2009-12-15 | 2012-12-13 | Bernd Mueller | Method for operating an arithmetic unit |
US8959392B2 (en) * | 2010-03-23 | 2015-02-17 | Continental Teves Ag & Co. Ohg | Redundant two-processor controller and control method |
US20130007513A1 (en) * | 2010-03-23 | 2013-01-03 | Adrian Traskov | Redundant two-processor controller and control method |
US9170907B2 (en) * | 2011-11-10 | 2015-10-27 | Ge Aviation Systems Llc | Method of providing high integrity processing |
US8924780B2 (en) * | 2011-11-10 | 2014-12-30 | Ge Aviation Systems Llc | Method of providing high integrity processing |
US20150106657A1 (en) * | 2011-11-10 | 2015-04-16 | Ge Aviation Systems Llc | Method of providing high integrity processing |
US20130124922A1 (en) * | 2011-11-10 | 2013-05-16 | Ge Aviation Systems Llc | Method of providing high integrity processing |
US20140373028A1 (en) * | 2013-06-18 | 2014-12-18 | Advanced Micro Devices, Inc. | Software Only Inter-Compute Unit Redundant Multithreading for GPUs |
US9274904B2 (en) * | 2013-06-18 | 2016-03-01 | Advanced Micro Devices, Inc. | Software only inter-compute unit redundant multithreading for GPUs |
US9367372B2 (en) | 2013-06-18 | 2016-06-14 | Advanced Micro Devices, Inc. | Software only intra-compute unit redundant multithreading for GPUs |
US20180074888A1 (en) * | 2016-09-09 | 2018-03-15 | The Charles Stark Draper Laboratory, Inc. | Methods and systems for achieving trusted fault tolerance of a system of untrusted subsystems |
US20190034301A1 (en) * | 2017-07-31 | 2019-01-31 | Oracle International Corporation | System recovery using a failover processor |
US10474549B2 (en) * | 2017-07-31 | 2019-11-12 | Oracle International Corporation | System recovery using a failover processor |
US11163654B2 (en) | 2017-07-31 | 2021-11-02 | Oracle International Corporation | System recovery using a failover processor |
US11599433B2 (en) | 2017-07-31 | 2023-03-07 | Oracle International Corporation | System recovery using a failover processor |
US11372981B2 (en) | 2020-01-09 | 2022-06-28 | Rockwell Collins, Inc. | Profile-based monitoring for dual redundant systems |
Also Published As
Publication number | Publication date |
---|---|
US7890706B2 (en) | 2011-02-15 |
US20050246587A1 (en) | 2005-11-03 |
US20050223178A1 (en) | 2005-10-06 |
CN1690970A (en) | 2005-11-02 |
US20050240806A1 (en) | 2005-10-27 |
US7434098B2 (en) | 2008-10-07 |
US20050223275A1 (en) | 2005-10-06 |
CN100472456C (en) | 2009-03-25 |
CN1696903A (en) | 2005-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20050246581A1 (en) | Error handling system in a redundant processor | |
Bernick et al. | NonStop® Advanced Architecture |
US6948092B2 (en) | System recovery from errors for processor and associated components | |
US6260159B1 (en) | Tracking memory page modification in a bridge for a multi-processor system | |
US6496940B1 (en) | Multiple processor system with standby sparing | |
US6802023B2 (en) | Redundant controller data storage system having hot insertion system and method | |
US4916704A (en) | Interface of non-fault tolerant components to fault tolerant system | |
US5226152A (en) | Functional lockstep arrangement for redundant processors | |
US5239641A (en) | Method and apparatus for synchronizing a plurality of processors | |
US5255367A (en) | Fault tolerant, synchronized twin computer system with error checking of I/O communication | |
US6393582B1 (en) | Error self-checking and recovery using lock-step processor pair architecture | |
US7296181B2 (en) | Lockstep error signaling | |
US6587961B1 (en) | Multi-processor system bridge with controlled access | |
US20020133740A1 (en) | Redundant controller data storage system having system and method for handling controller resets | |
JP2500038B2 (en) | Multiprocessor computer system, fault tolerant processing method and data processing system | |
US6223230B1 (en) | Direct memory access in a bridge for a multi-processor system | |
US6173351B1 (en) | Multi-processor system bridge | |
JPH0792765B2 (en) | Input / output controller | |
CN1729456A (en) | On-die mechanism for high-reliability processor | |
KR20000011835A (en) | Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applicatons in a network | |
EP0779579B1 (en) | Bus error handler on dual bus system | |
CN101714108A (en) | Synchronization control apparatus, information processing apparatus, and synchronization management method | |
US7631226B2 (en) | Computer system, bus controller, and bus fault handling method used in the same computer system and bus controller | |
US5905875A (en) | Multiprocessor system connected by a duplicated system bus having a bus status notification line | |
US7243257B2 (en) | Computer system for preventing inter-node fault propagation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JARDINE, ROBERT L.;KLECKA, JAMES S.;BRUCKERT, WILLIAM F.;AND OTHERS;REEL/FRAME:016236/0182;SIGNING DATES FROM 20050126 TO 20050127 |
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |