US20020152425A1 - Distributed restart in a multiple processor system - Google Patents

Distributed restart in a multiple processor system

Info

Publication number
US20020152425A1
US20020152425A1 US09/834,524 US83452401A
Authority
US
United States
Prior art keywords
processor
restart
node
failed
processors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US09/834,524
Inventor
David Chaiken
Mark Foster
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agile TV Corp
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US09/834,524 priority Critical patent/US20020152425A1/en
Assigned to AGILE TV CORPORATION reassignment AGILE TV CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHAIKEN, DAVID, FOSTER, MARK J.
Assigned to AGILETV CORPORATION reassignment AGILETV CORPORATION REASSIGNMENT AND RELEASE OF SECURITY INTEREST Assignors: INSIGHT COMMUNICATIONS COMPANY, INC.
Publication of US20020152425A1 publication Critical patent/US20020152425A1/en
Assigned to LAUDER PARTNERS LLC, AS AGENT reassignment LAUDER PARTNERS LLC, AS AGENT SECURITY AGREEMENT Assignors: AGILETV CORPORATION
Assigned to AGILETV CORPORATION reassignment AGILETV CORPORATION REASSIGNMENT AND RELEASE OF SECURITY INTEREST Assignors: LAUDER PARTNERS LLC AS COLLATERAL AGENT FOR ITSELF AND CERTAIN OTHER LENDERS
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1438Restarting or rejuvenating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0724Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2002Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F11/2005Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication controllers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2015Redundant power supplies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration

Abstract

Software or hardware on one node or processor in a system with multiple processors or nodes performs a cold or a warm restart on one or more other processors. Fault tolerance mechanisms are provided in a computing architecture to allow it to continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies fail. One aspect of the invention provides multiple processors having self-contained operating systems. Each processor preferably comprises any of redundant network links; redundant power supplies; redundant links to input/output devices; and software fault detection, adaptation, and recovery algorithms. Once a processor in the system has failed, the system attempts to recover from the failure by restarting a failed processor. Because the preferred system is constructed as a set of self-contained processing units, it is possible to restart the system at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors, or an entire engine. Each of these restarts can be any of a cold, warm, and/or software restart, and can be invoked by any of hardware, e.g. by a watchdog timer, software, e.g. by fault recovery algorithms, or by a human operator.

Description

    BACKGROUND OF THE INVENTION
  • 1. Technical Field [0001]
  • The invention relates to restarting a computer system after a system failure. More particularly, the invention relates to a mechanism for distributed restart in a multiple processor computer system. [0002]
  • 2. Description of the Prior Art [0003]
  • Transient failures in computing systems result from component level hardware or software failures. It is often the case that a system can recover from such failures. Recovering from a failure might require toggling the power supply, i.e. a cold restart, toggling a reset signal, i.e. a warm restart, or terminating and rebooting system software, i.e. a software restart. [0004]
  • There are many well known techniques for automatically restarting computer systems, including hardware watchdogs, software power-down and reset mechanisms, and physical switches. Smart, uninterruptible power supplies that have the capability for remote control of computer system power are also known. [0005]
  • None of the known restart mechanisms provide for intelligent intercession by an operating node in a multiprocessor system. It would be advantageous to allow software or hardware on one node in a system with multiple processors to perform a cold or a warm restart on another processor. [0006]
  • SUMMARY OF THE INVENTION
  • The invention provides a technique that allows software or hardware on one node or processor in a system with multiple processors or nodes to perform a cold or a warm restart on one or more other processors or nodes. [0007]
  • In the presently preferred embodiment of the invention, fault tolerance mechanisms are provided in a computer architecture to allow it to continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies fail. [0008]
  • One aspect of the invention provides multiple processors having self-contained operating systems. Each processor preferably comprises any of redundant network links; redundant power supplies; redundant links to input/output devices; and software fault detection, adaptation, and recovery algorithms. Once a processor in the system has failed, the system attempts to recover from the failure by restarting the failed processor. Because the preferred system is constructed as a set of self-contained processing units, it is possible to restart the system at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors (PLEX), or an entire architecture. Each of these restarts can be any of a cold, warm, and/or software restart, and can be invoked by any of hardware, e.g. by a watchdog timer, software, e.g. by fault recovery algorithms, or by a human operator.[0009]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for processors within an individual system according to the invention; [0010]
  • FIG. 2 is a block schematic diagram that shows a distributed restart mechanism in accordance with the invention; [0011]
  • FIG. 3 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for nodes within an architecture according to the invention; and [0012]
  • FIGS. 4a and 4b are block schematic diagrams that show the implementation of a distributed restart in a multiprocessor system for processors contained within nodes that are arranged within a backplane according to the invention. [0013]
  • DETAILED DESCRIPTION OF THE INVENTION
  • A multiprocessor computing architecture, such as the AgileTV engine developed by AgileTV of Menlo Park, Calif. (see, for example, [inventor, title], U.S. patent application Ser. No. ______, filed ______, attorney docket no. AGLE0003), is comprised of multiple processors. Such a multiprocessor computing architecture is designed to continue operation when individual components suffer hardware or software failures. [0014]
  • In the presently preferred embodiment of the invention, the following fault tolerance mechanisms are provided in the computer architecture to allow it to continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies fail. Thus, one aspect of the invention provides multiple processors having self-contained operating systems. [0015]
  • Each processor preferably comprises: [0016]
  • Redundant network links; [0017]
  • Redundant power supplies; [0018]
  • Redundant links to input/output devices; and [0019]
  • Software fault detection (for example, a ‘ping’ and corresponding application level diagnostics), adaptation (for example, rerouting in the network and reassigning tasks), and recovery algorithms (for example, replicating job, i.e. software, state so that jobs can be restarted on correctly functioning processors or restarting failed processors or nodes). [0020]
  • While the invention is described herein in terms of a presently preferred embodiment, i.e. a multiprocessor, fault tolerant computer architecture, those skilled in the art will appreciate that the invention is readily applied to other system architectures and that the system described herein is provided only for purposes of example and not to limit the scope of the invention. [0021]
  • Once a processor in the system has failed, it is important to attempt to recover from the failure by restarting the failed processor. Because the preferred system is constructed as a set of self-contained processing units, it is possible to restart the system at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors (PLEX), or an entire architecture. Each of these restarts can be any of a cold, warm, and/or software restart, and can be invoked by any of hardware, e.g. by a watchdog timer; software, e.g. by fault recovery algorithms (discussed above); or a human operator. Watchdog timers are known to those skilled in the art; for example, a watchdog timer can be used to detect the liveness of the system and causes a full system reset if software does not prove its correct operation by regularly resetting the timer. [0022]
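For illustration only, the watchdog behavior described above might be modeled as in the following sketch. This is not the patent's implementation; the `Watchdog` class, the timeout value, and the `full_system_reset` hook are hypothetical stand-ins for a hardware timer and its reset output.

```python
import threading
import time

WATCHDOG_TIMEOUT_S = 2.0       # assumed hardware timer period

class Watchdog:
    """Toy model of a hardware watchdog timer: it fires a reset callback unless
    software proves its liveness by kicking it before the timeout expires."""
    def __init__(self, timeout_s, on_expire):
        self._timeout = timeout_s
        self._on_expire = on_expire
        self._last_kick = time.monotonic()
        threading.Thread(target=self._monitor, daemon=True).start()

    def kick(self):
        self._last_kick = time.monotonic()

    def _monitor(self):
        while True:
            time.sleep(self._timeout / 4)
            if time.monotonic() - self._last_kick > self._timeout:
                self._on_expire()
                return

def full_system_reset():
    # Hypothetical hook; on real hardware the timer itself drives the reset line.
    print("watchdog expired: asserting full system reset")

wd = Watchdog(WATCHDOG_TIMEOUT_S, full_system_reset)
for _ in range(10):
    time.sleep(0.1)            # stands in for the node's normal work
    wd.kick()                  # regular kicks keep the reset from firing
```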
  • Thus, the variety of alternatives for restarting a failed processor provides considerable flexibility in applying restart strategies that are characterized by the type of restart (cold, warm, software), the granularity of restart (chip, node, PLEX, engine), and the source of the restart (software, hardware, human). [0023]
  • For example, connecting a reset signal from each node to one or more other nodes permits fault recovery software to warm restart one or more failed nodes. This distributed restart technique eliminates the need for human intervention after the occurrence of any of many types of software and hardware faults, such as livelock in the operating system scheduler, failure at one or more communication links, transistor-level lockup in the processor, or failure of software reset. [0024]
  • Connecting a power supply enable signal to one node from another allows system software at a node to cold restart another node (e.g. turn the power supply of the failed processor off and back on), thereby allowing recovery from classes of failures not covered by a warm restart, such as transistor-level lockup in the processor or errors in state machines in the processor. Such control over the power supply also allows selective shutdown of processors or nodes, which can be used to adapt to elevated temperature conditions (see, for example, [inventor, title], U.S. patent application Ser. No. _______, filed ______, attorney docket no. AGLE0023). [0025]
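As a rough sketch of the warm- and cold-restart signals described in the two preceding paragraphs, one node's view of another node's reset and power-enable lines could look roughly like this. The `gpio_write` helper, the line names, and the pulse timings are assumptions for illustration, not details taken from the patent.

```python
import time

class RemoteRestart:
    """Sketch of one node driving another node's reset and power-enable lines.
    gpio_write is a hypothetical stand-in for whatever register or I/O expander
    actually drives these signals in a given board design."""
    def __init__(self, reset_line, power_enable_line, gpio_write):
        self.reset_line = reset_line
        self.power_enable_line = power_enable_line
        self.gpio_write = gpio_write

    def warm_restart(self, pulse_ms=100):
        # Toggle the failed node's reset signal: a processor-level reset.
        self.gpio_write(self.reset_line, 0)
        time.sleep(pulse_ms / 1000.0)
        self.gpio_write(self.reset_line, 1)

    def cold_restart(self, off_ms=500):
        # Turn the failed node's power supply off and back on: a power-supply-level reset.
        self.gpio_write(self.power_enable_line, 0)
        time.sleep(off_ms / 1000.0)
        self.gpio_write(self.power_enable_line, 1)

# Example wiring: processor B holds handles to processor A's lines.
def fake_gpio_write(line, value):
    print(f"gpio {line} <= {value}")

node_a = RemoteRestart("A_RESET_N", "A_PWR_EN", fake_gpio_write)
node_a.warm_restart()          # try the cheaper, processor-level reset first
node_a.cold_restart()          # escalate if the warm restart does not recover the node
```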
  • FIG. 1 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for processors within an individual system according to the invention. [0026]
  • The embodiment of the invention that is shown in FIG. 1 is a system comprised of one or more nodes 10, 11, 12 having multiple processing units, e.g. a dual processing node comprises two processors 13, 14. Each of these processing units is responsible for some part of the functioning of the system. [0027]
  • The processing units are interconnected by a network, such as Ethernet. Each processing unit is preferably assigned a task within the system, such as controlling disks or communicating with another network. For purposes of the discussion herein, a network can include bi-directional, i.e. FIFO, interfaces. [0028]
  • In the preferred embodiment of the invention, the processors can talk to each other. With regard to FIG. 1, consider the three processors: processor A 20, processor B 21, and processor C 22. Processor A is connected, for example, to a disk (not shown), processor B could also be connected to the disk, and processors A, B, and C are configured to communicate with each other. For purposes of the discussion herein, such an arrangement is referred to as a fault-tolerant configuration. Thus, if processor A fails, i.e. it suffers either a hardware fault or a software fault, then processor B can assume its functions. From the perspective of a user in the external world, such a failure goes unnoticed and is of no great concern as long as there is enough aggregate performance in the system that the user does not notice it. Thus, the preferred embodiment of the invention is especially well suited for multiple processor systems that provide redundancy for fault tolerance. [0029]
  • However, in such system if processor A fails and does not restart, and then processor B also fails, resulting system performance degradation is likely to be noticed by, and objectionable to, a user. This is of particular concern in a consumer installation, such as a Web server, telecommunications system, or cable television application, where degradation in performance of the system results in diminished user satisfaction, i.e. loss of sales. [0030]
  • A key aspect of the invention herein is the ability for one processor, e.g. processor B to reset another processor, e.g. processor A, i.e. to say “start yourself from scratch.” As discussed above, the invention provides a number of different ways to do that. For example, with regard to the power supply that is supplying power to processor A, it is possible to turn that power supply off and then turn the power supply back on to reboot the processor. [0031]
  • In the presently preferred embodiment of the invention, it is contemplated that the multiprocessor system incorporates two or more processing units that are running the same operating system. Thus, a failure in one processor is likely to eventually occur in another processor. In this embodiment of the invention, the ability to get back to an initial condition is very important. One aspect of this embodiment of the invention provides that each processor includes internal circuits, e.g. chips, that have reset signals. In particular, each processor includes reset lines that provide different levels of reset: for the purposes of the discussion herein, a cold reset is a power-supply-level reset and a warm reset is a processor-level reset. [0032]
  • In this embodiment, there is direct communication from one processor to another processor's reset line. For example, if two processors are on the same node or on the same card, and if both of these processors fail, then a third processor, e.g. processor C, could reset the whole card. [0033]
  • Thus, the invention is preferably comprised of a system that already has fault tolerance built in so that when one processor fails, a mechanism is invoked that allows that processor to recover from that failure, i.e. the fault tolerance is intended to allow continued operation in the event of a failure of one processor, but in the invention the fault tolerance is surprisingly used to provide an opportunity for an operating processor to reset a failed processor before additional processors can fail. Thus, software on a node that is operable can reset software on a node that has failed. [0034]
  • One presently preferred technique for restarting a failed processor works by toggling a signal. For example, processor B can toggle a signal on processor A. In an alternative and equally preferred embodiment, there is a unique power supply or power regulator for processor A, and there is a unique power supply for processor B. In the event of a failure in processor A, processor B could turn the power off for processor A and then turn the power back on, thereby effecting a cold restart of processor A. [0035]
  • FIG. 2 is a block schematic diagram that shows a distributed restart mechanism in accordance with the invention. The fault tolerance mechanism 22 establishes a communicative interconnection between the processors or nodes within the architecture. A fault detection module 21 executes a fault detection strategy and, upon detection of a faulty processor or node, communicates with a restart module 20 to initiate a restart procedure, as discussed herein. [0036]
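A minimal software sketch of how the three blocks of FIG. 2 might be composed is shown below. The class and method names are invented for illustration; the patent does not prescribe this structure.

```python
class FaultToleranceMechanism:
    """Element 22: the communicative interconnection between processors/nodes."""
    def __init__(self, peers):
        self.peers = peers                      # identifiers of the nodes being watched

class RestartModule:
    """Element 20: carries out the actual restart of a failed processor or node."""
    def restart(self, node_id, kind="warm"):
        print(f"restarting {node_id} ({kind} restart)")   # placeholder for real reset logic

class FaultDetectionModule:
    """Element 21: runs a detection strategy and invokes the restart module."""
    def __init__(self, fabric, restarter, is_healthy):
        self.fabric = fabric
        self.restarter = restarter
        self.is_healthy = is_healthy            # callable: node id -> bool

    def scan(self):
        for node_id in self.fabric.peers:
            if not self.is_healthy(node_id):
                self.restarter.restart(node_id)

# Hypothetical wiring: watch nodes A and C with a stub health check.
fabric = FaultToleranceMechanism(peers=["node-A", "node-C"])
FaultDetectionModule(fabric, RestartModule(), is_healthy=lambda n: n != "node-A").scan()
```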
  • To operate, the system must first determine that a processor failure has, in fact, occurred. In the preferred embodiment, the system knows that there is a processor failure because the faulty processor, for example, has stopped responding to requests. Thus, one way to detect a failure is to provide a heartbeat in the system, such that the processors all talk to each other over time. Each processor pings each other processor in a predetermined way. If a processor does not return a ping, or alternatively, does not issue a ping, then a fault is reported for that processor by its corresponding processor(s). Such a heartbeat mechanism may be implemented in any appropriate way, as is known in the art. [0037]
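A toy version of such a heartbeat monitor might look like the following; the ping transport, the miss threshold, and the reporting hook are all assumed details.

```python
MISSED_PINGS_BEFORE_FAULT = 3          # assumed policy; the patent leaves this open

class HeartbeatMonitor:
    """Each processor pings its peers; a peer that stops answering (or stops
    pinging) is reported as faulty.  send_ping is a hypothetical transport hook
    returning True when the peer answered."""
    def __init__(self, peers, send_ping, report_fault):
        self.missed = {p: 0 for p in peers}
        self.send_ping = send_ping
        self.report_fault = report_fault

    def tick(self):
        for peer in self.missed:
            if self.send_ping(peer):
                self.missed[peer] = 0
            else:
                self.missed[peer] += 1
                if self.missed[peer] == MISSED_PINGS_BEFORE_FAULT:
                    self.report_fault(peer)

# Stub usage: processor B keeps answering, processor A never does.
monitor = HeartbeatMonitor(
    peers=["A", "B"],
    send_ping=lambda p: p == "B",
    report_fault=lambda p: print(f"fault reported for processor {p}"))
for _ in range(MISSED_PINGS_BEFORE_FAULT):
    monitor.tick()
```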
  • Another method for detecting a processor failure is that of running application diagnostics. For example, the system takes a known sample and feeds it into a processor that is running a speech recognition application. The diagnostic supplies a known input and determines if a correct result is returned by the processor, e.g. making sure that the right text comes back for an input phrase, such as “Take me to your leader.” Thus, every processor in this embodiment has a built-in diagnostic routine, which is typically a simple routine (as is known to those skilled in the art) that sends out a string or a piece of data, for example, and then looks for a sum to come back to show that the processor is operable and can therefore provide a correct response. [0038]
  • Each processor periodically runs this routine, as with a heartbeat, only it is an intelligent heartbeat. If a processor is primarily executing, for example, speech recognition, it is important that the processor use a diagnostic which is representative of the work that it is actually performing. In this way, it is possible to catch the vast majority of failures that are relevant to the specific task being performed by the processor. As another example, if the processor is a Web browser, the diagnostics send it an HTTP stream and make sure that the processor outputs the right data. [0039]
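The application-level diagnostics described above might be sketched as follows, assuming hypothetical `recognize` and `serve` handles to the applications under test and an invented health-check request.

```python
def speech_diagnostic(recognize):
    """Feed a known audio sample into the recognizer and check the transcript.
    'recognize' is a hypothetical handle to the speech application under test."""
    expected = "take me to your leader"
    return recognize("known_sample.wav").strip().lower() == expected

def http_diagnostic(serve):
    """For a node serving web content, send a known request and compare the reply.
    The request and expected reply here are invented for the example."""
    return serve("GET /health HTTP/1.0") == "HTTP/1.0 200 OK\r\n\r\nok"

def run_intelligent_heartbeat(diagnostic, report_fault, node_id):
    # The diagnostic should mirror the node's real workload, per the text above.
    if not diagnostic():
        report_fault(node_id)

# Stub usage with a deliberately broken recognizer:
run_intelligent_heartbeat(
    diagnostic=lambda: speech_diagnostic(lambda wav: "garbled output"),
    report_fault=lambda n: print(f"diagnostic failed on {n}"),
    node_id="speech-node-3")
```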
  • Another method for detecting a processor failure is the detection of excessive communication errors on any one link beyond some threshold, which indicates non-functionality of the processor. Those skilled in the art will appreciate that other methods of fault detection may be applied in connection with the operation and practice of the invention. For example, in an I/O processor, where the system is pinging a gateway, communications failures beyond a threshold are a good indication of processor failure. [0040]
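A link-error threshold check of the kind just described could be as simple as the sketch below; the threshold value and link naming are assumptions.

```python
LINK_ERROR_THRESHOLD = 50      # errors per monitoring window; value is an assumption

def check_link_health(error_counts, report_fault):
    """error_counts maps a link name to the communication errors seen in the
    current window; a count beyond the threshold marks the attached processor
    as non-functional."""
    for link, errors in error_counts.items():
        if errors > LINK_ERROR_THRESHOLD:
            report_fault(link)

check_link_health(
    {"node-A/eth0": 3, "node-B/eth0": 120},
    report_fault=lambda link: print(f"excessive errors on {link}: flagging its processor"))
```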
  • For purposes of implementing the invention, it is important that the processors be able to talk to each other through a network. It does not matter what the network is—for purposes of the discussion herein, a network is some mechanism that allows the processors to communicate, i.e. they are communicatively coupled. Thus, a network could be a backplane, an Ethernet, a serial link, a token ring, or any other such arrangement. A key point is that the processors must be able to talk to each other and that every processor be aware of at least one other processor, and sometimes even more processors, depending on how they are configured, so that processors can check with each other to determine proper operation of a corresponding processor(s). [0041]
  • Another important aspect of the invention is that of reset hierarchy. In a preferred system there are nodes which correspond to printed circuit boards, and each node has multiple processors on it, e.g. two or more in the presently preferred system. These nodes, or rather the printed circuit cards on which they reside, are plugged into a backplane. [0042]
  • If one chip on a node fails, then a functioning chip on a node may be able to reset the non-functioning chip, and vice versa (see, for example, the node 11 and corresponding processors 15, 16, 17 in FIG. 1). Communications are still active to one of the chips on the node, and the functioning processor can reset the faulty processor. However, in the event of a communications interruption, for example, on a bus lockup on the node card, all of the chips on that node card may fail. If all of the chips fail for some reason, then the invention provides the ability to effectively reset the entire node, such that a processor on one node can reset the other node. [0043]
  • In connection with this aspect of the invention, it is desirable not to use too many signals to implement this strategy, as well as not to introduce too many failure paths. Thus, an important design constraint is to reduce the number of faulty paths. For example, in an important I/O node where there is one processor that is a special purpose processor and two other processors that are more general purpose processors, the special purpose processor is more likely to have a connection to the outside world, and the two other processors only have connections within the node itself. In this situation, only the special purpose processor can reset the generic processors. If that special purpose processor fails, then it can be reset by a neighboring node. Thus, it is important to maintain a reset hierarchy in a way that does not propagate undue management options, i.e. by adding only the minimum number of new failure modes. One tradeoff to this approach is that an I/O node, for example, might fail locally and it could be reset locally, but it is deemed better to go up to the network in that case and get a peer to reset it, and in that way restrict access to the outside bus. Thus, at each level within the hierarchy there is preferably a master processor that is the subject of higher level fault correction, while the other processors in the node (or at this level) are reset by this processor. [0044]
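One way to express such a reset hierarchy is as an explicit authority table, as in this sketch; the node and processor names and the specific wiring are hypothetical.

```python
# Hypothetical reset-authority table for an I/O node: the special-purpose
# processor may reset the generic processors on its own node; it is itself
# reset only from a neighboring node, which keeps new failure paths to a minimum.
RESET_AUTHORITY = {
    "io-node-0/special": ["io-node-0/generic-1", "io-node-0/generic-2"],
    "node-1/special":    ["io-node-0/special"],   # neighbor resets the local master
}

def can_reset(requester, target):
    return target in RESET_AUTHORITY.get(requester, [])

assert can_reset("io-node-0/special", "io-node-0/generic-1")
assert not can_reset("io-node-0/generic-1", "io-node-0/special")   # no upward reset
assert can_reset("node-1/special", "io-node-0/special")
```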
  • In the invention, it is not necessarily true that every chip or every node must be able to reset every other node. In the preferred embodiment, each node can reset two other nodes and the resets are linked in a chain that goes through the system, which corresponds to the way that cards are plugged into the system. If one plugs in a basic four-card array, then all four processors are able to reset each other, and if more cards are plugged in then they are linked in a chain. Thus, at this level the notion of hierarchy may be less important than the notion of extensibility. Both aspects of a system design are considered when determining the appropriate level of interprocessor communication and reset capability. [0045]
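The chained reset arrangement might be captured by a small mapping like the one below, showing one plausible wiring in which each card can reset the next two cards in backplane order; the exact wiring in the preferred embodiment is not specified at this level of detail.

```python
def build_reset_chain(card_ids):
    """Wire each card so it can reset the next two cards in backplane order.
    A basic four-card array covers itself; adding cards simply extends the chain."""
    n = len(card_ids)
    return {card_ids[i]: [card_ids[(i + 1) % n], card_ids[(i + 2) % n]]
            for i in range(n)}

# Six cards plugged into the backplane:
print(build_reset_chain(["N0", "N1", "N2", "N3", "N4", "N5"]))
# {'N0': ['N1', 'N2'], 'N1': ['N2', 'N3'], ..., 'N5': ['N0', 'N1']}
```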
  • Finally, the invention provides a reporting mechanism or supervisor function, such that if there is a failure the failure is logged and reported. For example, if processor B resets processor A even once a day, then the system places a service call because processor A probably needs to be replaced. [0046]
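A supervisor of the kind described might be sketched as follows; the once-a-day threshold comes from the example in the preceding paragraph, while the logging and service-call hooks are hypothetical.

```python
import collections
import time

RESETS_PER_DAY_BEFORE_SERVICE = 1   # from the "even once a day" example above

class RestartSupervisor:
    """Logs every restart and places a service call when one processor is
    being reset often enough that it probably needs replacement."""
    def __init__(self, place_service_call):
        self.history = collections.defaultdict(list)   # target -> reset timestamps
        self.place_service_call = place_service_call

    def record_restart(self, by, target):
        now = time.time()
        self.history[target].append(now)
        print(f"log: {by} reset {target}")
        last_day = [t for t in self.history[target] if now - t < 24 * 3600]
        if len(last_day) > RESETS_PER_DAY_BEFORE_SERVICE:
            self.place_service_call(target)

sup = RestartSupervisor(lambda target: print(f"service call: replace {target}"))
sup.record_restart("processor B", "processor A")
sup.record_restart("processor B", "processor A")   # second reset within a day escalates
```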
  • FIG. 3 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for nodes A-P within an architecture according to the invention. [0047]
  • FIGS. 4a and 4b are block schematic diagrams that show the implementation of a distributed restart in a multiprocessor system for processors contained within nodes that are arranged within a backplane according to the invention. FIG. 4a shows several nodes and, for example, a first node A 30, while FIG. 4b shows node A arranged in a backplane. [0048]
  • Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the claims included below. [0049]

Claims (29)

1. An apparatus for restarting a failed processor or node, comprising:
a restart module associated with a node or processor in a system having multiple processors or nodes for performing any of a cold or a warm restart on one or more other processors or nodes; and
a fault detection module associated with said restart module for detecting failure of one or more of said other processors or nodes, wherein said restart module is invoked when a failure is detected by said fault detection module to restart said one or more other processors or nodes that have failed.
2. The apparatus of claim 1, wherein said system further comprises:
a fault tolerance mechanism for allowing said system to continue functioning when individual components of said system fail.
3. The apparatus of claim 1, wherein said system further comprises:
a multiple processor architecture in which each processor has a self-contained operating system.
4. The apparatus of claim 3, wherein each processor comprises any of redundant network links, redundant power supplies, redundant links to input/output devices, and software fault detection, adaptation, and recovery algorithms.
5. The apparatus of claim 1, wherein said restart module attempts to recover from a failure by restarting a failed system at any of a number of granularities which may comprise a chip, a printed circuit board (node), a subset of processors, or an entire processing system.
6. The apparatus of claim 1, wherein said restart module attempts to recover from a failure by any of a cold, warm, and/or software restart.
7. The apparatus of claim 1, wherein said restart module is invoked by any of hardware, software, or a human operator.
8. In a multiprocessor computing architecture comprised of multiple processors, an apparatus for restarting a failed component thereof, comprising:
a restart module associated with a node or processor in said architecture for performing any of a cold or a warm restart on one or more failed components;
a fault detection module associated with said restart module for detecting failure of one or more of said failed components, wherein said restart module is invoked when a failure is detected by said fault detection module to restart said one or more failed components; and
a fault tolerance mechanism for allowing said architecture to continue functioning when individual components thereof fail.
9. The apparatus of claim 8, wherein said fault tolerance mechanism comprises any of multiple processors having self-contained operating systems, redundant network links, redundant power supplies, redundant links to input/output devices, and software fault detection, adaptation, and recovery algorithms.
10. The apparatus of claim 8, wherein said restart module is adapted to restart a system at a number of granularities.
11. The apparatus of claim 10, wherein said restart module can perform any of a cold, warm, and/or software restart.
12. The apparatus of claim 10, wherein said restart module can be invoked by any of hardware, software, or a human operator.
13. The apparatus of claim 10, wherein said restart module comprises:
a distributed restart mechanism that eliminates a need for human intervention after occurrence of a software and/or hardware fault by connecting a reset signal of at least one node in said architecture to one or more other nodes therein;
wherein said distributed restart mechanism performs a warm restart of said one or more failed nodes.
14. The apparatus of claim 10, wherein said restart module comprises:
a distributed restart mechanism for connecting a power supply enable signal to a failed node in said architecture from another node therein;
wherein said distributed restart mechanism performs a cold restart of a failed node, thereby allowing recovery from classes of failures not covered by a warm restart.
15. The apparatus of claim 10, wherein said restart module comprises:
a distributed restart mechanism that allows one processor or node in said architecture to reset another processor or node therein.
16. The apparatus of claim 10, wherein said restart module comprises:
a distributed restart mechanism that allows one processor or node in said architecture to turn a failed processor or node's power supply off and then turn said power supply back on to reboot said processor or node.
17. The apparatus of claim 10, wherein each processor includes any of cold and warm reset lines which provide different levels of reset; and
wherein there is a direct communication from at least one processor to another processor's reset line.
18. An apparatus for allowing a component within a fault tolerant, multiprocessor system to recover from a failure, comprising:
a fault tolerance mechanism for allowing continued system operation in the event of a failure of one processor or node; and
said fault tolerance mechanism further comprising a fault recovery module for resetting a failed processor before additional processors can fail;
wherein a node that is operable can reset a node that has failed.
19. The apparatus of claim 18, further comprising:
a fault detection module for sending requests to processors or nodes within said system, wherein said fault detection module identifies a processor or node failure when a processor or node stops responding to said requests.
20. The apparatus of claim 19, said fault detection module comprising a heartbeat in said system, wherein each processor or node pings each other processor or node in a predetermined way, and wherein if a processor or node does not return a ping, or alternatively, does not issue a ping, then a fault is reported for that processor or node by a corresponding processor or node.
21. The apparatus of claim 19, said fault detection module comprising an application diagnostic routine within every processor or node that sends an application-level input out and then looks for an application-level response to come back to show that the processor or node is operable and can therefore provide a correct response.
22. The apparatus of claim 21, wherein said application diagnostic routine is representative of work that said processor or node performs.
23. The apparatus of claim 19, said fault detection module comprising a mechanism for detecting excessive communication errors on any one link beyond a predetermined threshold, which indicates non-functionality of a processor or node associated with said link.
24. The apparatus of claim 18, said fault recovery module further comprising:
a reset hierarchy wherein, at each level within the hierarchy, there is preferably a master processor or node that is the subject of higher level fault correction, while other processors or nodes at this level are reset by said processor or node.
25. The apparatus of claim 18, further comprising:
any of a reporting mechanism and supervisor function, wherein if there is a failure said failure is logged and reported.
26. A method for restarting a failed processor or node, comprising the steps of:
providing a restart module associated with a node or processor in a system having multiple processors or nodes for performing any of a cold or a warm restart on one or more other processors or nodes; and
providing a fault detection module associated with said restart module for detecting failure of one or more of said other processors or nodes, wherein said restart module is invoked when a failure is detected by said fault detection module to restart said one or more other processors or nodes that have failed.
27. The method of claim 26, further comprising the step of:
providing a fault tolerance mechanism for allowing said system to continue functioning when individual components of said system fail.
28. In a multiprocessor computing architecture comprised of multiple processors, a method for restarting a failed component thereof, comprising the steps of:
performing any of a cold or a warm restart on one or more failed components;
detecting failure of one or more of said failed components, wherein said cold or warm restart is invoked when a failure is detected to restart said one or more failed components; and
allowing said architecture to continue functioning when individual components thereof fail.
29. A method for allowing a component within a fault tolerant, multiprocessor system to recover from a failure, comprising the steps of:
allowing continued system operation in the event of a failure of one processor or node; and
resetting a failed processor before additional processors can fail;
wherein a node that is operable can reset a node that has failed.
US09/834,524 2001-04-12 2001-04-12 Distributed restart in a multiple processor system Abandoned US20020152425A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/834,524 US20020152425A1 (en) 2001-04-12 2001-04-12 Distributed restart in a multiple processor system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09/834,524 US20020152425A1 (en) 2001-04-12 2001-04-12 Distributed restart in a multiple processor system

Publications (1)

Publication Number Publication Date
US20020152425A1 true US20020152425A1 (en) 2002-10-17

Family

ID=25267122

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/834,524 Abandoned US20020152425A1 (en) 2001-04-12 2001-04-12 Distributed restart in a multiple processor system

Country Status (1)

Country Link
US (1) US20020152425A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4377000A (en) * 1980-05-05 1983-03-15 Westinghouse Electric Corp. Automatic fault detection and recovery system which provides stability and continuity of operation in an industrial multiprocessor control
US5898828A (en) * 1995-12-29 1999-04-27 Emc Corporation Reduction of power used by transceivers in a data transmission loop
US5746203A (en) * 1996-09-26 1998-05-05 Johnson & Johnson Medical, Inc. Failsafe supervisor system for a patient monitor
US5790850A (en) * 1996-09-30 1998-08-04 Intel Corporation Fault resilient booting for multiprocessor computer systems
US6622261B1 (en) * 1998-04-09 2003-09-16 Compaq Information Technologies Group, L.P. Process pair protection for complex applications
US6266781B1 (en) * 1998-07-20 2001-07-24 Academia Sinica Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network
US6581166B1 (en) * 1999-03-02 2003-06-17 The Foxboro Company Network fault detection and recovery
US6392990B1 (en) * 1999-07-23 2002-05-21 Glenayre Electronics, Inc. Method for implementing interface redundancy in a computer network

Cited By (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928585B2 (en) * 2001-05-24 2005-08-09 International Business Machines Corporation Method for mutual computer process monitoring and restart
US20020184295A1 (en) * 2001-05-24 2002-12-05 IBM Corporation Method for mutual computer process monitoring and restart
US6854054B1 (en) * 2001-12-06 2005-02-08 Ciena Corporation System and method of memory management for providing data storage across a reboot
US20040030881A1 (en) * 2002-08-08 2004-02-12 International Business Machines Corp. Method, system, and computer program product for improved reboot capability
US7802147B1 (en) 2002-12-16 2010-09-21 Nvidia Corporation Method and apparatus for system status monitoring, testing and restoration
US7444551B1 (en) * 2002-12-16 2008-10-28 Nvidia Corporation Method and apparatus for system status monitoring, testing and restoration
US7627787B1 (en) 2002-12-16 2009-12-01 Nvidia Corporation Method and apparatus for system status monitoring, testing and restoration
US7791889B2 (en) * 2005-02-16 2010-09-07 Hewlett-Packard Development Company, L.P. Redundant power beneath circuit board
US20060181857A1 (en) * 2005-02-16 2006-08-17 Belady Christian L Redundant power beneath circuit board
US20060224728A1 (en) * 2005-04-04 2006-10-05 Hitachi, Ltd. Failover method in a cluster computer system
US7467322B2 (en) * 2005-04-04 2008-12-16 Hitachi, Ltd. Failover method in a cluster computer system
US7861106B2 (en) * 2005-08-19 2010-12-28 A. Avizienis And Associates, Inc. Hierarchical configurations in error-correcting computer systems
US20070067673A1 (en) * 2005-08-19 2007-03-22 Algirdas Avizienis Hierarchical configurations in error-correcting computer systems
EP1814031A3 (en) * 2005-12-22 2009-10-14 NCR International Inc. Power control interface for a self-service apparatus
US20070277070A1 (en) * 2006-01-13 2007-11-29 Infineon Technologies Ag Apparatus and method for checking an error detection functionality of a data processor
US8918679B2 (en) * 2006-01-13 2014-12-23 Infineon Technologies Ag Apparatus and method for checking an error detection functionality of a data processor
US20090059810A1 (en) * 2006-03-10 2009-03-05 Fujitsu Limited Network system
US8018867B2 (en) * 2006-03-10 2011-09-13 Fujitsu Limited Network system for monitoring operation of monitored node
US20090293072A1 (en) * 2006-07-21 2009-11-26 Sony Service Centre (Europe) N.V. System having plurality of hardware blocks and method of operating the same
US20090327676A1 (en) * 2006-07-21 2009-12-31 Sony Service Centre (Europe) N.V. Demodulator device and method of operating the same
US8161276B2 (en) * 2006-07-21 2012-04-17 Sony Service Centre (Europe) N.V. Demodulator device and method of operating the same
US20080059783A1 (en) * 2006-09-01 2008-03-06 Benq Corporation Multimedia player and auto recovery method therefor
US7676693B2 (en) * 2006-09-14 2010-03-09 Fujitsu Limited Method and apparatus for monitoring power failure
US20080082850A1 (en) * 2006-09-14 2008-04-03 Fujitsu Limited Method and apparatus for monitoring power failure
US20090013221A1 (en) * 2007-06-25 2009-01-08 Hitachi Industrial Equipment Systems Co., Ltd. Multi-component system
US7861115B2 (en) * 2007-06-25 2010-12-28 Hitachi Industrial Equipment Systems Co., Ltd. Multi-component system
US11506912B2 (en) 2008-01-02 2022-11-22 Mentor Acquisition One, Llc Temple and ear horn assembly for headworn computer
US20100011242A1 (en) * 2008-07-10 2010-01-14 Hitachi, Ltd. Failover method and system for a computer system having clustering configuration
US20110179307A1 (en) * 2008-07-10 2011-07-21 Tsunehiko Baba Failover method and system for a computer system having clustering configuration
US7925922B2 (en) * 2008-07-10 2011-04-12 Hitachi, Ltd. Failover method and system for a computer system having clustering configuration
US20100088542A1 (en) * 2008-10-06 2010-04-08 Texas Instruments Incorporated Lockup recovery for processors
US8495422B2 (en) 2010-02-12 2013-07-23 Research In Motion Limited Method and system for resetting a subsystem of a communication device
US20110202797A1 (en) * 2010-02-12 2011-08-18 Evgeny Mezhibovsky Method and system for resetting a subsystem of a communication device
US20130003310A1 (en) * 2011-06-28 2013-01-03 Oracle International Corporation Chip package to support high-frequency processors
US8982563B2 (en) * 2011-06-28 2015-03-17 Oracle International Corporation Chip package to support high-frequency processors
EP2642390A1 (en) * 2012-03-20 2013-09-25 BlackBerry Limited Fault recovery
US20130254586A1 (en) * 2012-03-20 2013-09-26 Research In Motion Limited Fault recovery
US9026842B2 (en) * 2012-03-20 2015-05-05 Blackberry Limited Selective fault recovery of subsystems
US10933486B2 (en) 2013-02-28 2021-03-02 Illinois Tool Works Inc. Remote master reset of machine
US9898360B1 (en) * 2014-02-25 2018-02-20 Google Llc Preventing unnecessary data recovery
US10732434B2 (en) 2014-04-25 2020-08-04 Mentor Acquisition One, Llc Temple and ear horn assembly for headworn computer
US10466492B2 (en) 2014-04-25 2019-11-05 Mentor Acquisition One, Llc Ear horn assembly for headworn computer
US11474360B2 (en) 2014-04-25 2022-10-18 Mentor Acquisition One, Llc Speaker assembly for headworn computer
US11809022B2 (en) 2014-04-25 2023-11-07 Mentor Acquisition One, Llc Temple and ear horn assembly for headworn computer
US10634922B2 (en) 2014-04-25 2020-04-28 Mentor Acquisition One, Llc Speaker assembly for headworn computer
US11880041B2 (en) 2014-04-25 2024-01-23 Mentor Acquisition One, Llc Speaker assembly for headworn computer
US10101588B2 (en) 2014-04-25 2018-10-16 Osterhout Group, Inc. Speaker assembly for headworn computer
US10120760B2 (en) * 2014-07-17 2018-11-06 Continental Automotive Gmbh Vehicle infotainment system
US10197801B2 (en) 2014-12-03 2019-02-05 Osterhout Group, Inc. Head worn computer display systems
US11809628B2 (en) 2014-12-03 2023-11-07 Mentor Acquisition One, Llc See-through computer display systems
US11262846B2 (en) 2014-12-03 2022-03-01 Mentor Acquisition One, Llc See-through computer display systems
US20160161743A1 (en) * 2014-12-03 2016-06-09 Osterhout Group, Inc. See-through computer display systems
US10684687B2 (en) * 2014-12-03 2020-06-16 Mentor Acquisition One, Llc See-through computer display systems
TWI561026B (en) * 2014-12-17 2016-12-01 Wistron Neweb Corp Electronic device with reset function and reset method thereof
US20160283336A1 (en) * 2015-03-27 2016-09-29 Facebook, Inc. Power fail circuit for multi-storage-device arrays
US10229019B2 (en) 2015-03-27 2019-03-12 Facebook, Inc. Power fail circuit for multi-storage-device arrays
US9710343B2 (en) * 2015-03-27 2017-07-18 Facebook, Inc. Power fail circuit for multi-storage-device arrays
US10353785B2 (en) 2015-09-10 2019-07-16 Manufacturing Resources International, Inc. System and method for systemic detection of display errors
EP3347793A4 (en) * 2015-09-10 2019-03-06 Manufacturing Resources International, Inc. System and method for systemic detection of display errors
US11093355B2 (en) 2015-09-10 2021-08-17 Manufacturing Resources International, Inc. System and method for detection of display errors
US10013299B2 (en) * 2015-09-16 2018-07-03 Microsoft Technology Licensing, Llc Handling crashes of a device's peripheral subsystems
US20170075745A1 (en) * 2015-09-16 2017-03-16 Microsoft Technology Licensing, Llc Handling crashes of a device's peripheral subsystems
US10690936B2 (en) 2016-08-29 2020-06-23 Mentor Acquisition One, Llc Adjustable nose bridge assembly for headworn computer
US11409128B2 (en) 2016-08-29 2022-08-09 Mentor Acquisition One, Llc Adjustable nose bridge assembly for headworn computer
USD840395S1 (en) 2016-10-17 2019-02-12 Osterhout Group, Inc. Head-worn computer
US10606702B2 (en) * 2016-11-17 2020-03-31 Ricoh Company, Ltd. System, information processing apparatus, and method for rebooting a part corresponding to a cause identified
US20180137007A1 (en) * 2016-11-17 2018-05-17 Ricoh Company, Ltd. Reboot system, information processing apparatus, and method for rebooting
USD947186S1 (en) 2017-01-04 2022-03-29 Mentor Acquisition One, Llc Computer glasses
USD918905S1 (en) 2017-01-04 2021-05-11 Mentor Acquisition One, Llc Computer glasses
USD864959S1 (en) 2017-01-04 2019-10-29 Mentor Acquisition One, Llc Computer glasses
CN107402834A (en) * 2017-06-20 2017-11-28 公牛集团有限公司 Power-on start-up self-test method and device for an embedded system
US10908863B2 (en) 2018-07-12 2021-02-02 Manufacturing Resources International, Inc. System and method for providing access to co-located operations data for an electronic display
US11614911B2 (en) 2018-07-12 2023-03-28 Manufacturing Resources International, Inc. System and method for providing access to co-located operations data for an electronic display
US11243733B2 (en) 2018-07-12 2022-02-08 Manufacturing Resources International, Inc. System and method for providing access to co-located operations data for an electronic display
US11928380B2 (en) 2018-07-12 2024-03-12 Manufacturing Resources International, Inc. System and method for providing access to co-located operations data for an electronic display
US11455138B2 (en) 2018-07-12 2022-09-27 Manufacturing Resources International, Inc. System and method for providing access to co-located operations data for an electronic display
US11188421B2 (en) * 2018-07-30 2021-11-30 Honeywell International Inc. Method and apparatus for detecting and remedying single event effects
GB2580727A (en) * 2018-07-30 2020-07-29 Honeywell Int Inc Method and apparatus for detecting and remedying single event effects
GB2580727B (en) * 2018-07-30 2022-08-31 Honeywell Int Inc Method and apparatus for detecting and remedying single event effects
US11137847B2 (en) 2019-02-25 2021-10-05 Manufacturing Resources International, Inc. Monitoring the status of a touchscreen
US11402940B2 (en) 2019-02-25 2022-08-02 Manufacturing Resources International, Inc. Monitoring the status of a touchscreen
US11644921B2 (en) 2019-02-25 2023-05-09 Manufacturing Resources International, Inc. Monitoring the status of a touchscreen
US11669385B2 (en) * 2019-08-30 2023-06-06 Intel Corporation Power error monitoring and reporting within a system on chip for functional safety
US20190391868A1 (en) * 2019-08-30 2019-12-26 Intel Corporation Power error monitoring and reporting within a system on chip for functional safety
CN110630552A (en) * 2019-09-21 2019-12-31 苏州浪潮智能科技有限公司 System, method and device for detecting fan link fault
US11921010B2 (en) 2021-07-28 2024-03-05 Manufacturing Resources International, Inc. Display assemblies with differential pressure sensors

Similar Documents

Publication Publication Date Title
US20020152425A1 (en) Distributed restart in a multiple processor system
US7222268B2 (en) System resource availability manager
US6691244B1 (en) System and method for comprehensive availability management in a high-availability computer system
US7787388B2 (en) Method of and a system for autonomously identifying which node in a two-node system has failed
EP1703401A2 (en) Information processing apparatus and control method therefor
US7093013B1 (en) High availability system for network elements
WO2002003195A2 (en) Method for upgrading a computer system
TWI529624B (en) Method and system of fault tolerance for multiple servers
CN100362481C (en) Main-standby protection method for multi-processor device units
JP4655718B2 (en) Computer system and control method thereof
US20030177224A1 (en) Clustered/fail-over remote hardware management system
EP1782202A2 (en) Computing system redundancy and fault tolerance
US7627774B2 (en) Redundant manager modules to perform management tasks with respect to an interconnect structure and power supplies
JPH11261663A (en) Communication processing control means and information processor having the control means
Hunter et al. Availability modeling and analysis of a two node cluster
Hughes-Fenchel A flexible clustered approach to high availability
JP2003186578A (en) Method and apparatus for supplying redundant power
JP6654662B2 (en) Server device and server system
JP2839664B2 (en) Computer system
US20230244550A1 (en) Computer device and management method
US11042443B2 (en) Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string
KR100388965B1 (en) Apparatus for cross-duplication of processor boards in an exchange
KR960010879B1 (en) Bus duplexing control of multiple processors
JPH0630069B2 (en) Multiplexing system
KR100249800B1 (en) Management method for fault diagnosis utilities

Legal Events

Date Code Title Description
AS Assignment

Owner name: AGILE TV CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHAIKEN, DAVID;FOSTER, MARK J.;REEL/FRAME:011709/0945

Effective date: 20010412

AS Assignment

Owner name: AGILETV CORPORATION, CALIFORNIA

Free format text: REASSIGNMENT AND RELEASE OF SECURITY INTEREST;ASSIGNOR:INSIGHT COMMUNICATIONS COMPANY, INC.;REEL/FRAME:012747/0141

Effective date: 20020131

AS Assignment

Owner name: LAUDER PARTNERS LLC, AS AGENT, NEW YORK

Free format text: SECURITY AGREEMENT;ASSIGNOR:AGILETV CORPORATION;REEL/FRAME:014782/0717

Effective date: 20031209

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: AGILETV CORPORATION, CALIFORNIA

Free format text: REASSIGNMENT AND RELEASE OF SECURITY INTEREST;ASSIGNOR:LAUDER PARTNERS LLC AS COLLATERAL AGENT FOR ITSELF AND CERTAIN OTHER LENDERS;REEL/FRAME:015991/0795

Effective date: 20050511