US20020152425A1 - Distributed restart in a multiple processor system - Google Patents
Distributed restart in a multiple processor system
- Publication number
- US20020152425A1 (application US09/834,524)
- Authority
- US
- United States
- Prior art keywords
- processor
- restart
- node
- failed
- processors
- Prior art date
- 2001-04-12
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
- G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2002—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
- G06F11/2005—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication controllers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2015—Redundant power supplies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/203—Failover techniques using migration
Abstract
Description
- 1. Technical Field
- The invention relates to restarting a computer system after a system failure. More particularly, the invention relates to a mechanism for distributed restart in a multiple processor computer system.
- 2. Description of the Prior Art
- Transient failures in computing systems result from component level hardware or software failures. It is often the case that a system can recover from such failures. Recovering from a failure might require toggling the power supply, i.e. a cold restart, toggling a reset signal, i.e. a warm restart, or terminating and rebooting system software, i.e. a software restart.
- There are many well-known techniques for automatically restarting computer systems, including hardware watchdogs, software power-down and reset mechanisms, and physical switches. Smart, uninterruptible power supplies that provide remote control of computer system power are also known.
- None of the known restart mechanisms provide for intelligent intercession by an operating node in a multiprocessor system. It would be advantageous to allow software or hardware on one node in a system with multiple processors to perform a cold or a warm restart on another processor.
- The invention provides a technique that allows software or hardware on one node or processor in a system with multiple processors or nodes to perform a cold or a warm restart on one or more other processors or nodes.
- In the presently preferred embodiment of the invention, fault tolerance mechanisms are provided in a computer architecture to allow it to continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies fail.
- One aspect of the invention provides multiple processors having self-contained operating systems. Each processor preferably comprises any of redundant network links; redundant power supplies; redundant links to input/output devices; and software fault detection, adaptation, and recovery algorithms. Once a processor in the system has failed, the system attempts to recover from the failure by restarting the failed processor. Because the preferred system is constructed as a set of self-contained processing units, it is possible to restart the system at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors (PLEX), or an entire architecture. Each of these restarts can be a cold, warm, or software restart, and can be invoked by hardware, e.g. a watchdog timer; by software, e.g. fault recovery algorithms; or by a human operator.
- FIG. 1 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for processors within an individual system according to the invention;
- FIG. 2 is a block schematic diagram that shows a distributed restart mechanism in accordance with the invention;
- FIG. 3 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for nodes within an architecture according to the invention; and
- FIGS. 4a and 4b are block schematic diagrams that show the implementation of a distributed restart in a multiprocessor system for processors contained within nodes that are arranged within a backplane according to the invention.
- A multiprocessor computing architecture, such as the AgileTV engine developed by AgileTV of Menlo Park, Calif. (see, for example, [inventor, title], U.S. patent application Ser. No. ______, filed ______, attorney docket no. AGLE0003), is a computing architecture comprised of multiple processors. Such a multiprocessor computing architecture is designed to continue operation when individual components suffer hardware or software failures.
- In the presently preferred embodiment of the invention, the following fault tolerance mechanisms are provided in the computer architecture to allow it to continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies fail. Thus, one aspect of the invention provides multiple processors having self-contained operating systems.
- Each processor preferably comprises:
- Redundant network links;
- Redundant power supplies;
- Redundant links to input/output devices; and
- Software fault detection (for example, a ‘ping’ and corresponding application level diagnostics), adaptation (for example, rerouting in the network and reassigning tasks), and recovery algorithms (for example, replicating job, i.e. software, state so that jobs can be restarted on correctly functioning processors or restarting failed processors or nodes).
- While the invention is described herein in terms of a presently preferred embodiment, i.e. a multiprocessor, fault tolerant computer architecture, those skilled in the art will appreciate that the invention is readily applied to other system architectures and that the system described herein is provided only for purposes of example and not to limit the scope of the invention.
- Once a processor in the system has failed, it is important to attempt to recover from the failure by restarting the failed processor. Because the preferred system is constructed as a set of self-contained processing units, it is possible to restart the system at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors (PLEX), or an entire architecture. Each of these restarts can be a cold, warm, or software restart, and can be invoked by hardware, e.g. by a watchdog timer; by software, e.g. by the fault recovery algorithms discussed above; or by a human operator. Watchdog timers are known to those skilled in the art; for example, a watchdog timer can be used to detect the liveness of the system and causes a full system reset if software does not prove its correct operation by regularly resetting the timer.
- Thus, the variety of alternatives for restarting a failed processor provides considerable flexibility in applying restart strategies that are characterized by the type of restart (cold, warm, software), the granularity of restart (chip, node, PLEX, engine), and the source of the restart (software, hardware, human).
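- The restart-strategy space just described (type × granularity × source) can be made concrete with a small sketch. The following Python enumerations are purely illustrative; the patent defines no API, and all names here are assumptions.

```python
# A minimal sketch of the restart-strategy space described above.
from dataclasses import dataclass
from enum import Enum


class RestartType(Enum):
    COLD = "cold"          # toggle the power supply
    WARM = "warm"          # toggle a reset signal
    SOFTWARE = "software"  # terminate and reboot system software


class Granularity(Enum):
    CHIP = "chip"
    NODE = "node"          # a printed circuit board
    PLEX = "plex"          # a subset of processors
    ENGINE = "engine"      # the entire architecture


class Source(Enum):
    HARDWARE = "hardware"  # e.g. a watchdog timer
    SOFTWARE = "software"  # e.g. fault recovery algorithms
    HUMAN = "human"        # an operator


@dataclass(frozen=True)
class RestartStrategy:
    type: RestartType
    granularity: Granularity
    source: Source


# Example: fault-recovery software warm-restarts a single node.
strategy = RestartStrategy(RestartType.WARM, Granularity.NODE, Source.SOFTWARE)
print(strategy)
```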
- For example, connecting a reset signal from each node to one or more other nodes permits fault recovery software to warm restart one or more failed nodes. This distributed restart technique eliminates the need for human intervention after the occurrence of any of many types of software and hardware faults, such as livelock in the operating system scheduler, failure at one or more communication links, transistor-level lockup in the processor, or failure of software reset.
- Connecting a power supply enable signal to one node from another allows system software at a node to cold restart another node (e.g. turn the power supply of the failed processor off and back on), thereby allowing recovery from classes of failures not covered by a warm restart, such as transistor-level lockup in the processor or errors in state machines in the processor. Such control over the power supply also allows selective shutdown of processors or nodes, which can be used to adapt to elevated temperature conditions (see, for example, [inventor, title], U.S. patent application Ser. No. _______, filed ______, attorney docket no. AGLE0023).
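- The two inter-node restart paths just described, pulsing a reset line for a warm restart and cycling a power-supply enable for a cold restart, might be driven by software along the following lines. This is a minimal sketch: the `ControlLine` class, the signal names, and the timing constants are hypothetical stand-ins for whatever backplane or GPIO signals a real implementation would drive.

```python
import time


class ControlLine:
    """Stand-in for a physical signal wired from this node to another node."""

    def __init__(self, name: str):
        self.name = name
        self.level = 1  # assume active-low signals that idle high

    def set(self, level: int) -> None:
        self.level = level
        print(f"{self.name} driven to {level}")


def warm_restart(reset_line: ControlLine, pulse_s: float = 0.1) -> None:
    """Toggle the failed node's reset signal."""
    reset_line.set(0)
    time.sleep(pulse_s)
    reset_line.set(1)


def cold_restart(power_enable: ControlLine, off_s: float = 2.0) -> None:
    """Turn the failed node's power supply off and back on."""
    power_enable.set(0)
    time.sleep(off_s)   # let the supply rails fully discharge
    power_enable.set(1)


# Node B recovering node A:
warm_restart(ControlLine("node_A_reset_n"))
cold_restart(ControlLine("node_A_power_enable"))
```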
- FIG. 1 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for processors within an individual system according to the invention.
- The embodiment of the invention that is shown in FIG. 1 is a system comprised of one or more nodes 10, 11, 12 having multiple processing units, e.g. a dual processing node comprises two processors 13, 14. Each of these processing units is responsible for some part of the functioning of the system.
- The processing units are interconnected by a network, such as Ethernet. Each processing unit is preferably assigned a task within the system, such as controlling disks or communicating with another network. For purposes of the discussion herein, a network can include bi-directional, i.e. FIFO, interfaces.
- In the preferred embodiment of the invention, the processors can talk to each other. With regard to FIG. 1, consider the three processors: processor A 20, processor B 21, and processor C 22. Processor A is connected, for example, to a disk (not shown), processor B could also be connected to the disk, and processors A, B, and C are configured to communicate with each other. For purposes of the discussion herein, such an arrangement is referred to as a fault-tolerant configuration. Thus, if processor A fails, i.e. it suffers either a hardware fault or a software fault, then processor B can assume its functions. From the perspective of a user in the external world, such a failure goes unnoticed and is of no great concern as long as there is enough aggregate performance in the system that the user does not notice the failure. Thus, the preferred embodiment of the invention is especially well suited for multiple processor systems that provide redundancy for fault tolerance.
- However, in such a system, if processor A fails and does not restart, and then processor B also fails, the resulting system performance degradation is likely to be noticed by, and objectionable to, a user. This is of particular concern in a consumer installation, such as a Web server, telecommunications system, or cable television application, where degradation in performance of the system results in diminished user satisfaction, i.e. loss of sales.
- A key aspect of the invention is the ability of one processor, e.g. processor B, to reset another processor, e.g. processor A, i.e. to say “start yourself from scratch.” As discussed above, the invention provides a number of different ways to do that. For example, with regard to the power supply that is supplying power to processor A, it is possible to turn that power supply off and then back on to reboot the processor.
- In the presently preferred embodiment of the invention, it is contemplated that the multiprocessor system incorporates two or more processing units that are running the same operating system. Thus, a failure in one processor is likely to eventually occur in another processor. In this embodiment of the invention, the ability to get back to an initial condition is very important. One aspect of this embodiment provides that each processor includes internal circuits, e.g. chips, that have reset signals. For the purposes of the discussion herein, a cold reset is a power-supply-level reset and a warm reset is a processor-level reset. Reset lines provide different levels of reset.
- In this embodiment, there is direct communication from one processor to another processor's reset line. For example, if two processors are on the same node or on the same card, and if both of these processors fail, then a third processor, e.g. processor C, could reset the whole card.
- Thus, the invention preferably comprises a system that already has fault tolerance built in, so that when one processor fails, a mechanism is invoked that allows recovery from that failure. That is, while the fault tolerance is intended to allow continued operation in the event of a failure of one processor, in the invention the fault tolerance is also used to provide an opportunity for an operating processor to reset a failed processor before additional processors can fail. Thus, software on a node that is operable can reset software on a node that has failed.
- One presently preferred technique for restarting a failed processor works by toggling a signal. For example, processor B can toggle a signal on processor A. In an alternative and equally preferred embodiment, there is a unique power supply or power regulator for processor A, and there is a unique power supply for processor B. In the event of a failure in processor A, processor B could turn the power off for processor A and then turn the power back on, thereby effecting a cold restart of processor A.
- FIG. 2 is a block schematic diagram that shows a distributed restart mechanism in accordance with the invention. The fault tolerance mechanism 22 establishes a communicative interconnection between the processors or nodes within the architecture. A fault detection module 21 executes a fault detection strategy and, upon detection of a faulty processor or node, communicates with a restart module 20 to initiate a restart procedure, as discussed herein.
- To operate, the system must first determine that a processor failure, in fact, has occurred. In the preferred embodiment, the system knows that there is a processor failure because the faulty processor, for example, has stopped responding to requests. Thus, one way to detect a failure is to provide a heartbeat in the system, such that the processors all talk to each other over time. Each processor pings each other processor in a predetermined way. If a processor does not return a ping, or alternatively, does not issue a ping, then a fault is reported for that processor by its corresponding processor(s). Such a heartbeat mechanism may be implemented in any appropriate way, as is known in the art.
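- A heartbeat of the kind just described could be sketched as follows. The transport call `ping_peer` and the missed-ping threshold are assumptions; the patent leaves both unspecified.

```python
import random
import time
from collections import defaultdict

MISSED_PING_LIMIT = 3  # assumed threshold; the patent does not specify one


def ping_peer(peer: str) -> bool:
    """Hypothetical transport call; True means the peer answered the ping."""
    return random.random() > 0.2  # simulate an occasionally silent peer


def heartbeat_round(peers, missed, report_fault):
    """One pass of the heartbeat: ping every peer, report persistent silence."""
    for peer in peers:
        if ping_peer(peer):
            missed[peer] = 0  # peer is alive; clear its miss counter
        else:
            missed[peer] += 1
            if missed[peer] >= MISSED_PING_LIMIT:
                report_fault(peer)


missed = defaultdict(int)
for _ in range(20):
    heartbeat_round(
        ["processor_A", "processor_C"],
        missed,
        lambda peer: print(f"fault reported for {peer}"),
    )
    time.sleep(0.01)  # stand-in for the real heartbeat period
```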
- Another method for detecting a processor failure is that of running application diagnostics. For example, the system takes a known sample and feeds it into a processor that is running a speech recognition application. The diagnostic supplies a known input and determines if a correct result is returned by the processor, e.g. making sure that the right text comes back for an input phrase, such as “Take me to your leader.” Thus, every processor in this embodiment has a diagnostic routine built in, which is typically a simple routine (as is known to those skilled in the art) that sends out a string or a piece of data, for example, and then looks for a sum to come back to show that the processor is operable and can therefore provide a correct response.
- Each processor periodically runs this routine, as with a heartbeat, only it is an intelligent heartbeat. Importantly, if a processor is primarily executing, for example, speech recognition, then the processor should use a diagnostic that is representative of the work it is actually performing. In this way, it is possible to catch the vast majority of failures that are relevant to the specific task being performed by the processor. As another example, if the processor is a Web browser, the diagnostic sends it an HTTP stream and makes sure that the processor outputs the right data.
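- The intelligent heartbeat reduces to feeding a processor a known, workload-representative input and checking the answer. In the sketch below, `run_workload` is a hypothetical stand-in for the real task (speech recognition, serving HTTP, and so on), and the expected output is invented for illustration.

```python
KNOWN_PHRASE = "Take me to your leader"      # known input from the text's example
EXPECTED_TEXT = "take me to your leader"     # invented expected output


def run_workload(sample: str) -> str:
    """Hypothetical stand-in for the recognizer or server under test."""
    return sample.lower()


def application_diagnostic() -> bool:
    """Return True if the processor produced the expected output."""
    return run_workload(KNOWN_PHRASE) == EXPECTED_TEXT


if application_diagnostic():
    print("diagnostic passed")
else:
    print("diagnostic failed: report a fault and schedule a restart")
```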
- Another method for detecting a processor failure is the detection of excessive communication errors on any one link beyond some threshold, which indicates non-functionality of the processor. Those skilled in the art will appreciate that other methods of fault detection may be applied in connection with the operation and practice of the invention. For example, in an I/O processor, where the system is pinging a gateway, communications failures beyond a threshold are a good indication of processor failure.
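- Link-error detection of this kind amounts to a counter and a threshold, roughly as sketched below; the threshold value is an assumption, since the text says only “some threshold.”

```python
from collections import Counter

ERROR_THRESHOLD = 100  # assumed; the patent does not give a value

link_errors: Counter = Counter()


def record_link_error(link: str) -> None:
    """Count an error on a link; past the threshold, declare the processor failed."""
    link_errors[link] += 1
    if link_errors[link] > ERROR_THRESHOLD:
        print(f"{link}: error count exceeds {ERROR_THRESHOLD}; "
              f"treating the attached processor as failed")


for _ in range(101):
    record_link_error("node_A_eth0")
```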
- For purposes of implementing the invention, it is important that the processors be able to talk to each other through a network. It does not matter what the network is; for purposes of the discussion herein, a network is any mechanism that allows the processors to communicate, i.e. they are communicatively coupled. Thus, a network could be a backplane, an Ethernet, a serial link, a token ring, or any other such arrangement. A key point is that the processors must be able to talk to each other and that every processor be aware of at least one other processor, and sometimes more, depending on how they are configured, so that processors can check with each other to determine proper operation of their corresponding processor(s).
- Another important aspect of the invention is that of reset hierarchy. In a preferred system there are nodes which correspond to printed circuit boards, and each node has multiple processors on it, e.g. two or more in the presently preferred system. These nodes, or rather the printed circuit cards on which they reside, are plugged into a backplane.
- If one chip on a node fails, then a functioning chip on that node may be able to reset the non-functioning chip, and vice versa (see, for example, the node 11 and corresponding processors 15, 16, 17 in FIG. 1). Communications are still active to one of the chips on the node, and the functioning processor can reset the faulty processor. However, in the event of a communications interruption, for example, on a bus lockup on the node card, all of the chips on that node card may fail. If all of the chips fail for some reason, then the invention provides the ability to reset effectively the entire node, such that a processor on one node can reset the other node.
- In connection with this aspect of the invention, it is desirable not to use too many signals to implement this strategy, as well as not to introduce too many failure paths. Thus, an important design constraint is to reduce the number of faulty paths. For example, in an important I/O node where there is one processor that is a special purpose processor and two other processors that are more general purpose processors, the special purpose processor is more likely to have a connection to the outside world, and the two other processors only have connections within the node itself. In this situation, only the special purpose processor can reset the generic processors. If that special purpose processor fails, then it can be reset by a neighboring node. Thus, it is important to maintain a reset hierarchy in a way that does not propagate undue management options, i.e. by adding only the minimum number of new failure modes. One tradeoff to this approach is that an I/O node, for example, might fail locally and could be reset locally, but it is deemed better to go up to the network in that case and get a peer to reset it, and in that way restrict access to the outside bus. Thus, at each level within the hierarchy there is preferably a master processor that is the subject of higher level fault correction, while the other processors in the node (or at this level) are reset by this processor.
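- The reset hierarchy can be summarized as a routing question: given a failed processor, who resets it? The sketch below encodes the I/O-node example from the text; the topology and all names are illustrative, not taken from the patent's figures.

```python
NODE_MASTER = {  # node -> its master (special purpose) processor
    "io_node": "special_cpu",
}
PROC_NODE = {    # processor -> node it lives on
    "special_cpu": "io_node",
    "generic_cpu_1": "io_node",
    "generic_cpu_2": "io_node",
}
NEIGHBOR_NODE = {"io_node": "compute_node"}  # who resets a node's master


def who_resets(failed: str) -> str:
    """Resolve the reset authority for a failed processor."""
    node = PROC_NODE[failed]
    master = NODE_MASTER[node]
    if failed != master:
        return master                        # local master resets its node's CPUs
    return f"peer on {NEIGHBOR_NODE[node]}"  # a neighboring node resets the master


for cpu in PROC_NODE:
    print(f"{cpu} is reset by {who_resets(cpu)}")
```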
- In the invention, it is not necessarily true that every chip or every node must be able to reset every other node. In the preferred embodiment, each node can reset two other nodes and the resets are linked in a chain that runs through the system, corresponding to the way that cards are plugged into the system. If one plugs in a basic four-card array, then all four processors are able to reset each other, and if more cards are plugged in, then they are linked into the chain. Thus, at this level the notion of hierarchy may be less important than the notion of extensibility. Both aspects of a system design are considered when determining the appropriate level of interprocessor communication and reset capability.
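- The chained arrangement might look like the following sketch, where each node can reset the next two nodes in backplane slot order. The fan-out of two comes from the text; the wrap-around closing of the chain is an assumption about how the chain is completed.

```python
def reset_chain(slots):
    """Map each node to the two nodes it can reset, following slot order."""
    n = len(slots)
    return {
        slots[i]: [slots[(i + 1) % n], slots[(i + 2) % n]]
        for i in range(n)
    }


# A basic four-card array; adding cards simply extends the chain.
for node, targets in reset_chain(["node0", "node1", "node2", "node3"]).items():
    print(f"{node} can reset {targets}")
```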
- Finally, the invention provides a reporting mechanism or supervisor function, such that if there is a failure the failure is logged and reported. For example, if processor B resets processor A even once a day, then the system places a service call because processor A probably needs to be replaced.
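- Such a supervisor reduces to a reset log plus a frequency check, roughly as sketched below. The once-per-day figure comes from the text's example; the timestamps and the service-call hook are illustrative.

```python
from collections import defaultdict

SERVICE_WINDOW_S = 24 * 3600  # one day, per the once-a-day example in the text

reset_log = defaultdict(list)  # processor -> timestamps of its resets


def record_restart(processor: str, now: float) -> None:
    """Log a reset; place a service call if resets recur within the window."""
    reset_log[processor].append(now)
    recent = [t for t in reset_log[processor] if now - t < SERVICE_WINDOW_S]
    if len(recent) > 1:  # reset more than once within a single day
        print(f"service call: {processor} probably needs to be replaced")


record_restart("processor_A", now=0.0)
record_restart("processor_A", now=3600.0)  # second reset an hour later
```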
- FIG. 3 is a block schematic diagram that shows the implementation of a distributed restart in a multiprocessor system for nodes A-P within an architecture according to the invention.
- FIGS. 4a and 4b are block schematic diagrams that show the implementation of a distributed restart in a multiprocessor system for processors contained within nodes that are arranged within a backplane according to the invention. FIG. 4a shows several nodes and, for example, a first node A 30, while FIG. 4b shows node A arranged in a backplane.
- Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the claims included below.
Claims (29)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/834,524 US20020152425A1 (en) | 2001-04-12 | 2001-04-12 | Distributed restart in a multiple processor system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/834,524 US20020152425A1 (en) | 2001-04-12 | 2001-04-12 | Distributed restart in a multiple processor system |
Publications (1)
Publication Number | Publication Date |
---|---|
US20020152425A1 true US20020152425A1 (en) | 2002-10-17 |
Family
ID=25267122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/834,524 Abandoned US20020152425A1 (en) | 2001-04-12 | 2001-04-12 | Distributed restart in a multiple processor system |
Country Status (1)
Country | Link |
---|---|
US (1) | US20020152425A1 (en) |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020184295A1 (en) * | 2001-05-24 | 2002-12-05 | Ibm Corporation | Method for mutual computer process monitoring and restart |
US20040030881A1 (en) * | 2002-08-08 | 2004-02-12 | International Business Machines Corp. | Method, system, and computer program product for improved reboot capability |
US6854054B1 (en) * | 2001-12-06 | 2005-02-08 | Ciena Corporation | System and method of memory management for providing data storage across a reboot |
US20060181857A1 (en) * | 2005-02-16 | 2006-08-17 | Belady Christian L | Redundant power beneath circuit board |
US20060224728A1 (en) * | 2005-04-04 | 2006-10-05 | Hitachi, Ltd. | Failover method in a cluster computer system |
US20070067673A1 (en) * | 2005-08-19 | 2007-03-22 | Algirdas Avizienis | Hierarchical configurations in error-correcting computer systems |
US20070277070A1 (en) * | 2006-01-13 | 2007-11-29 | Infineon Technologies Ag | Apparatus and method for checking an error detection functionality of a data processor |
US20080059783A1 (en) * | 2006-09-01 | 2008-03-06 | Benq Corporation | Multimedia player and auto recovery method therefor |
US20080082850A1 (en) * | 2006-09-14 | 2008-04-03 | Fujitsu Limited | Method and apparatus for monitoring power failure |
US7444551B1 (en) * | 2002-12-16 | 2008-10-28 | Nvidia Corporation | Method and apparatus for system status monitoring, testing and restoration |
US20090013221A1 (en) * | 2007-06-25 | 2009-01-08 | Hitachi Industrial Equipment System Co., Ltd. | Multi-component system |
US20090059810A1 (en) * | 2006-03-10 | 2009-03-05 | Fujitsu Limited | Network system |
EP1814031A3 (en) * | 2005-12-22 | 2009-10-14 | NRC International Inc. | Power control interface for a self-service apparatus |
US20090293072A1 (en) * | 2006-07-21 | 2009-11-26 | Sony Service Centre (Europe) N.V. | System having plurality of hardware blocks and method of operating the same |
US20100011242A1 (en) * | 2008-07-10 | 2010-01-14 | Hitachi, Ltd. | Failover method and system for a computer system having clustering configuration |
US20100088542A1 (en) * | 2008-10-06 | 2010-04-08 | Texas Instruments Incorporated | Lockup recovery for processors |
US20110202797A1 (en) * | 2010-02-12 | 2011-08-18 | Evgeny Mezhibovsky | Method and system for resetting a subsystem of a communication device |
US20130003310A1 (en) * | 2011-06-28 | 2013-01-03 | Oracle International Corporation | Chip package to support high-frequency processors |
EP2642390A1 (en) * | 2012-03-20 | 2013-09-25 | BlackBerry Limited | Fault recovery |
US20130254586A1 (en) * | 2012-03-20 | 2013-09-26 | Research In Motion Limited | Fault recovery |
US20160161743A1 (en) * | 2014-12-03 | 2016-06-09 | Osterhout Group, Inc. | See-through computer display systems |
US20160283336A1 (en) * | 2015-03-27 | 2016-09-29 | Facebook, Inc. | Power fail circuit for multi-storage-device arrays |
TWI561026B (en) * | 2014-12-17 | 2016-12-01 | Wistron Neweb Corp | Electronic device with reset function and reset method thereof |
US20170075745A1 (en) * | 2015-09-16 | 2017-03-16 | Microsoft Technology Licensing, Llc | Handling crashes of a device's peripheral subsystems |
CN107402834A (en) * | 2017-06-20 | 2017-11-28 | 公牛集团有限公司 | A kind of embedded system electrifying startup self checking method and device |
US9898360B1 (en) * | 2014-02-25 | 2018-02-20 | Google Llc | Preventing unnecessary data recovery |
US20180137007A1 (en) * | 2016-11-17 | 2018-05-17 | Ricoh Company, Ltd. | Reboot system, information processing apparatus, and method for rebooting |
US10101588B2 (en) | 2014-04-25 | 2018-10-16 | Osterhout Group, Inc. | Speaker assembly for headworn computer |
US10120760B2 (en) * | 2014-07-17 | 2018-11-06 | Continental Automotive Gmbh | Vehicle infotainment system |
US10197801B2 (en) | 2014-12-03 | 2019-02-05 | Osterhout Group, Inc. | Head worn computer display systems |
USD840395S1 (en) | 2016-10-17 | 2019-02-12 | Osterhout Group, Inc. | Head-worn computer |
EP3347793A4 (en) * | 2015-09-10 | 2019-03-06 | Manufacturing Resources International, Inc. | System and method for systemic detection of display errors |
USD864959S1 (en) | 2017-01-04 | 2019-10-29 | Mentor Acquisition One, Llc | Computer glasses |
US10466492B2 (en) | 2014-04-25 | 2019-11-05 | Mentor Acquisition One, Llc | Ear horn assembly for headworn computer |
US20190391868A1 (en) * | 2019-08-30 | 2019-12-26 | Intel Corporation | Power error monitoring and reporting within a system on chip for functional safety |
CN110630552A (en) * | 2019-09-21 | 2019-12-31 | 苏州浪潮智能科技有限公司 | System, method and device for detecting fan link fault |
US10690936B2 (en) | 2016-08-29 | 2020-06-23 | Mentor Acquisition One, Llc | Adjustable nose bridge assembly for headworn computer |
GB2580727A (en) * | 2018-07-30 | 2020-07-29 | Honeywell Int Inc | Method and apparatus for detecting and remedying single event effects |
US10732434B2 (en) | 2014-04-25 | 2020-08-04 | Mentor Acquisition One, Llc | Temple and ear horn assembly for headworn computer |
US10908863B2 (en) | 2018-07-12 | 2021-02-02 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US10933486B2 (en) | 2013-02-28 | 2021-03-02 | Illinois Tool Works Inc. | Remote master reset of machine |
US11137847B2 (en) | 2019-02-25 | 2021-10-05 | Manufacturing Resources International, Inc. | Monitoring the status of a touchscreen |
US11402940B2 (en) | 2019-02-25 | 2022-08-02 | Manufacturing Resources International, Inc. | Monitoring the status of a touchscreen |
US11921010B2 (en) | 2021-07-28 | 2024-03-05 | Manufacturing Resources International, Inc. | Display assemblies with differential pressure sensors |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4377000A (en) * | 1980-05-05 | 1983-03-15 | Westinghouse Electric Corp. | Automatic fault detection and recovery system which provides stability and continuity of operation in an industrial multiprocessor control |
US5898828A (en) * | 1995-12-29 | 1999-04-27 | Emc Corporation | Reduction of power used by transceivers in a data transmission loop |
US5746203A (en) * | 1996-09-26 | 1998-05-05 | Johnson & Johnson Medical, Inc. | Failsafe supervisor system for a patient monitor |
US5790850A (en) * | 1996-09-30 | 1998-08-04 | Intel Corporation | Fault resilient booting for multiprocessor computer systems |
US6622261B1 (en) * | 1998-04-09 | 2003-09-16 | Compaq Information Technologies Group, L.P. | Process pair protection for complex applications |
US6266781B1 (en) * | 1998-07-20 | 2001-07-24 | Academia Sinica | Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network |
US6581166B1 (en) * | 1999-03-02 | 2003-06-17 | The Foxboro Company | Network fault detection and recovery |
US6392990B1 (en) * | 1999-07-23 | 2002-05-21 | Glenayre Electronics, Inc. | Method for implementing interface redundancy in a computer network |
Cited By (86)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6928585B2 (en) * | 2001-05-24 | 2005-08-09 | International Business Machines Corporation | Method for mutual computer process monitoring and restart |
US20020184295A1 (en) * | 2001-05-24 | 2002-12-05 | Ibm Corporation | Method for mutual computer process monitoring and restart |
US6854054B1 (en) * | 2001-12-06 | 2005-02-08 | Ciena Corporation | System and method of memory management for providing data storage across a reboot |
US20040030881A1 (en) * | 2002-08-08 | 2004-02-12 | International Business Machines Corp. | Method, system, and computer program product for improved reboot capability |
US7802147B1 (en) | 2002-12-16 | 2010-09-21 | Nvidia Corporation | Method and apparatus for system status monitoring, testing and restoration |
US7444551B1 (en) * | 2002-12-16 | 2008-10-28 | Nvidia Corporation | Method and apparatus for system status monitoring, testing and restoration |
US7627787B1 (en) | 2002-12-16 | 2009-12-01 | Nvidia Corporation | Method and apparatus for system status monitoring, testing and restoration |
US7791889B2 (en) * | 2005-02-16 | 2010-09-07 | Hewlett-Packard Development Company, L.P. | Redundant power beneath circuit board |
US20060181857A1 (en) * | 2005-02-16 | 2006-08-17 | Belady Christian L | Redundant power beneath circuit board |
US20060224728A1 (en) * | 2005-04-04 | 2006-10-05 | Hitachi, Ltd. | Failover method in a cluster computer system |
US7467322B2 (en) * | 2005-04-04 | 2008-12-16 | Hitachi, Ltd. | Failover method in a cluster computer system |
US7861106B2 (en) * | 2005-08-19 | 2010-12-28 | A. Avizienis And Associates, Inc. | Hierarchical configurations in error-correcting computer systems |
US20070067673A1 (en) * | 2005-08-19 | 2007-03-22 | Algirdas Avizienis | Hierarchical configurations in error-correcting computer systems |
EP1814031A3 (en) * | 2005-12-22 | 2009-10-14 | NRC International Inc. | Power control interface for a self-service apparatus |
US20070277070A1 (en) * | 2006-01-13 | 2007-11-29 | Infineon Technologies Ag | Apparatus and method for checking an error detection functionality of a data processor |
US8918679B2 (en) * | 2006-01-13 | 2014-12-23 | Infineon Technologies Ag | Apparatus and method for checking an error detection functionality of a data processor |
US20090059810A1 (en) * | 2006-03-10 | 2009-03-05 | Fujitsu Limited | Network system |
US8018867B2 (en) * | 2006-03-10 | 2011-09-13 | Fujitsu Limited | Network system for monitoring operation of monitored node |
US20090293072A1 (en) * | 2006-07-21 | 2009-11-26 | Sony Service Centre (Europe) N.V. | System having plurality of hardware blocks and method of operating the same |
US20090327676A1 (en) * | 2006-07-21 | 2009-12-31 | Sony Service Centre (Europe) N.V. | Demodulator device and method of operating the same |
US8161276B2 (en) * | 2006-07-21 | 2012-04-17 | Sony Service Centre (Europe) N.V. | Demodulator device and method of operating the same |
US20080059783A1 (en) * | 2006-09-01 | 2008-03-06 | Benq Corporation | Multimedia player and auto recovery method therefor |
US7676693B2 (en) * | 2006-09-14 | 2010-03-09 | Fujitsu Limited | Method and apparatus for monitoring power failure |
US20080082850A1 (en) * | 2006-09-14 | 2008-04-03 | Fujitsu Limited | Method and apparatus for monitoring power failure |
US20090013221A1 (en) * | 2007-06-25 | 2009-01-08 | Hitachi Industrial Equipment System Co., Ltd. | Multi-component system |
US7861115B2 (en) * | 2007-06-25 | 2010-12-28 | Hitachi Industrial Equipment Systems Co., Ltd. | Multi-component system |
US11506912B2 (en) | 2008-01-02 | 2022-11-22 | Mentor Acquisition One, Llc | Temple and ear horn assembly for headworn computer |
US20100011242A1 (en) * | 2008-07-10 | 2010-01-14 | Hitachi, Ltd. | Failover method and system for a computer system having clustering configuration |
US20110179307A1 (en) * | 2008-07-10 | 2011-07-21 | Tsunehiko Baba | Failover method and system for a computer system having clustering configuration |
US7925922B2 (en) * | 2008-07-10 | 2011-04-12 | Hitachi, Ltd. | Failover method and system for a computer system having clustering configuration |
US20100088542A1 (en) * | 2008-10-06 | 2010-04-08 | Texas Instruments Incorporated | Lockup recovery for processors |
US8495422B2 (en) | 2010-02-12 | 2013-07-23 | Research In Motion Limited | Method and system for resetting a subsystem of a communication device |
US20110202797A1 (en) * | 2010-02-12 | 2011-08-18 | Evgeny Mezhibovsky | Method and system for resetting a subsystem of a communication device |
US20130003310A1 (en) * | 2011-06-28 | 2013-01-03 | Oracle International Corporation | Chip package to support high-frequency processors |
US8982563B2 (en) * | 2011-06-28 | 2015-03-17 | Oracle International Corporation | Chip package to support high-frequency processors |
EP2642390A1 (en) * | 2012-03-20 | 2013-09-25 | BlackBerry Limited | Fault recovery |
US20130254586A1 (en) * | 2012-03-20 | 2013-09-26 | Research In Motion Limited | Fault recovery |
US9026842B2 (en) * | 2012-03-20 | 2015-05-05 | Blackberry Limited | Selective fault recovery of subsystems |
US10933486B2 (en) | 2013-02-28 | 2021-03-02 | Illinois Tool Works Inc. | Remote master reset of machine |
US9898360B1 (en) * | 2014-02-25 | 2018-02-20 | Google Llc | Preventing unnecessary data recovery |
US10732434B2 (en) | 2014-04-25 | 2020-08-04 | Mentor Acquisition One, Llc | Temple and ear horn assembly for headworn computer |
US10466492B2 (en) | 2014-04-25 | 2019-11-05 | Mentor Acquisition One, Llc | Ear horn assembly for headworn computer |
US11474360B2 (en) | 2014-04-25 | 2022-10-18 | Mentor Acquisition One, Llc | Speaker assembly for headworn computer |
US11809022B2 (en) | 2014-04-25 | 2023-11-07 | Mentor Acquisition One, Llc | Temple and ear horn assembly for headworn computer |
US10634922B2 (en) | 2014-04-25 | 2020-04-28 | Mentor Acquisition One, Llc | Speaker assembly for headworn computer |
US11880041B2 (en) | 2014-04-25 | 2024-01-23 | Mentor Acquisition One, Llc | Speaker assembly for headworn computer |
US10101588B2 (en) | 2014-04-25 | 2018-10-16 | Osterhout Group, Inc. | Speaker assembly for headworn computer |
US10120760B2 (en) * | 2014-07-17 | 2018-11-06 | Continental Automotive GmbH | Vehicle infotainment system |
US10197801B2 (en) | 2014-12-03 | 2019-02-05 | Osterhout Group, Inc. | Head worn computer display systems |
US11809628B2 (en) | 2014-12-03 | 2023-11-07 | Mentor Acquisition One, Llc | See-through computer display systems |
US11262846B2 (en) | 2014-12-03 | 2022-03-01 | Mentor Acquisition One, Llc | See-through computer display systems |
US20160161743A1 (en) * | 2014-12-03 | 2016-06-09 | Osterhout Group, Inc. | See-through computer display systems |
US10684687B2 (en) * | 2014-12-03 | 2020-06-16 | Mentor Acquisition One, Llc | See-through computer display systems |
TWI561026B (en) * | 2014-12-17 | 2016-12-01 | Wistron Neweb Corp | Electronic device with reset function and reset method thereof |
US20160283336A1 (en) * | 2015-03-27 | 2016-09-29 | Facebook, Inc. | Power fail circuit for multi-storage-device arrays |
US10229019B2 (en) | 2015-03-27 | 2019-03-12 | Facebook, Inc. | Power fail circuit for multi-storage-device arrays |
US9710343B2 (en) * | 2015-03-27 | 2017-07-18 | Facebook, Inc. | Power fail circuit for multi-storage-device arrays |
US10353785B2 (en) | 2015-09-10 | 2019-07-16 | Manufacturing Resources International, Inc. | System and method for systemic detection of display errors |
EP3347793A4 (en) * | 2015-09-10 | 2019-03-06 | Manufacturing Resources International, Inc. | System and method for systemic detection of display errors |
US11093355B2 (en) | 2015-09-10 | 2021-08-17 | Manufacturing Resources International, Inc. | System and method for detection of display errors |
US10013299B2 (en) * | 2015-09-16 | 2018-07-03 | Microsoft Technology Licensing, Llc | Handling crashes of a device's peripheral subsystems |
US20170075745A1 (en) * | 2015-09-16 | 2017-03-16 | Microsoft Technology Licensing, Llc | Handling crashes of a device's peripheral subsystems |
US10690936B2 (en) | 2016-08-29 | 2020-06-23 | Mentor Acquisition One, Llc | Adjustable nose bridge assembly for headworn computer |
US11409128B2 (en) | 2016-08-29 | 2022-08-09 | Mentor Acquisition One, Llc | Adjustable nose bridge assembly for headworn computer |
USD840395S1 (en) | 2016-10-17 | 2019-02-12 | Osterhout Group, Inc. | Head-worn computer |
US10606702B2 (en) * | 2016-11-17 | 2020-03-31 | Ricoh Company, Ltd. | System, information processing apparatus, and method for rebooting a part corresponding to a cause identified |
US20180137007A1 (en) * | 2016-11-17 | 2018-05-17 | Ricoh Company, Ltd. | Reboot system, information processing apparatus, and method for rebooting |
USD947186S1 (en) | 2017-01-04 | 2022-03-29 | Mentor Acquisition One, Llc | Computer glasses |
USD918905S1 (en) | 2017-01-04 | 2021-05-11 | Mentor Acquisition One, Llc | Computer glasses |
USD864959S1 (en) | 2017-01-04 | 2019-10-29 | Mentor Acquisition One, Llc | Computer glasses |
CN107402834A (en) * | 2017-06-20 | 2017-11-28 | Bull Group Co., Ltd. | Embedded system power-on startup self-test method and device |
US10908863B2 (en) | 2018-07-12 | 2021-02-02 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US11614911B2 (en) | 2018-07-12 | 2023-03-28 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US11243733B2 (en) | 2018-07-12 | 2022-02-08 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US11928380B2 (en) | 2018-07-12 | 2024-03-12 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US11455138B2 (en) | 2018-07-12 | 2022-09-27 | Manufacturing Resources International, Inc. | System and method for providing access to co-located operations data for an electronic display |
US11188421B2 (en) * | 2018-07-30 | 2021-11-30 | Honeywell International Inc. | Method and apparatus for detecting and remedying single event effects |
GB2580727A (en) * | 2018-07-30 | 2020-07-29 | Honeywell Int Inc | Method and apparatus for detecting and remedying single event effects |
GB2580727B (en) * | 2018-07-30 | 2022-08-31 | Honeywell Int Inc | Method and apparatus for detecting and remedying single event effects |
US11137847B2 (en) | 2019-02-25 | 2021-10-05 | Manufacturing Resources International, Inc. | Monitoring the status of a touchscreen |
US11402940B2 (en) | 2019-02-25 | 2022-08-02 | Manufacturing Resources International, Inc. | Monitoring the status of a touchscreen |
US11644921B2 (en) | 2019-02-25 | 2023-05-09 | Manufacturing Resources International, Inc. | Monitoring the status of a touchscreen |
US11669385B2 (en) * | 2019-08-30 | 2023-06-06 | Intel Corporation | Power error monitoring and reporting within a system on chip for functional safety |
US20190391868A1 (en) * | 2019-08-30 | 2019-12-26 | Intel Corporation | Power error monitoring and reporting within a system on chip for functional safety |
CN110630552A (en) * | 2019-09-21 | 2019-12-31 | Suzhou Inspur Intelligent Technology Co., Ltd. | System, method and device for detecting fan link faults |
US11921010B2 (en) | 2021-07-28 | 2024-03-05 | Manufacturing Resources International, Inc. | Display assemblies with differential pressure sensors |
Similar Documents
Publication | Title |
---|---|
US20020152425A1 (en) | Distributed restart in a multiple processor system | |
US7222268B2 (en) | System resource availability manager | |
US6691244B1 (en) | System and method for comprehensive availability management in a high-availability computer system | |
US7787388B2 (en) | Method of and a system for autonomously identifying which node in a two-node system has failed | |
EP1703401A2 (en) | Information processing apparatus and control method therefor | |
US7093013B1 (en) | High availability system for network elements | |
WO2002003195A2 (en) | Method for upgrading a computer system | |
TWI529624B (en) | Method and system of fault tolerance for multiple servers | |
CN100362481C (en) | Main-standby protection method for multi-processor device units | |
JP4655718B2 (en) | Computer system and control method thereof | |
US20030177224A1 (en) | Clustered/fail-over remote hardware management system | |
EP1782202A2 (en) | Computing system redundancy and fault tolerance | |
US7627774B2 (en) | Redundant manager modules to perform management tasks with respect to an interconnect structure and power supplies | |
JPH11261663A (en) | Communication processing control means and information processor having the control means | |
Hunter et al. | Availability modeling and analysis of a two node cluster | |
Hughes-Fenchel | A flexible clustered approach to high availability | |
JP2003186578A (en) | Method and apparatus for supplying redundant power | |
JP6654662B2 (en) | Server device and server system | |
JP2839664B2 (en) | Computer system | |
US20230244550A1 (en) | Computer device and management method | |
US11042443B2 (en) | Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string | |
KR100388965B1 (en) | Apparatus for cross duplication of each processor board in an exchange | |
KR960010879B1 (en) | Bus duplexing control of multiple processors | |
JPH0630069B2 (en) | Multiplexing system | |
KR100249800B1 (en) | Management method for fault diagnosis utilities |
Legal Events
Date | Code | Title | Description
---|---|---|---|
 | AS | Assignment | Owner name: AGILE TV CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHAIKEN, DAVID;FOSTER, MARK J.;REEL/FRAME:011709/0945. Effective date: 20010412
 | AS | Assignment | Owner name: AGILETV CORPORATION, CALIFORNIA. Free format text: REASSIGNMENT AND RELEASE OF SECURITY INTEREST;ASSIGNOR:INSIGHT COMMUNICATIONS COMPANY, INC.;REEL/FRAME:012747/0141. Effective date: 20020131
 | AS | Assignment | Owner name: LAUDER PARTNERS LLC, AS AGENT, NEW YORK. Free format text: SECURITY AGREEMENT;ASSIGNOR:AGILETV CORPORATION;REEL/FRAME:014782/0717. Effective date: 20031209
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
 | AS | Assignment | Owner name: AGILETV CORPORATION, CALIFORNIA. Free format text: REASSIGNMENT AND RELEASE OF SECURITY INTEREST;ASSIGNOR:LAUDER PARTNERS LLC AS COLLATERAL AGENT FOR ITSELF AND CERTAIN OTHER LENDERS;REEL/FRAME:015991/0795. Effective date: 20050511