US8086910B1 - Monitoring software thread execution - Google Patents


Info

Publication number
US8086910B1
Authority
US
United States
Prior art keywords
software thread
execution
information
thread
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US12/826,102
Other versions
US20110320858A1 (en)
Inventor
Toby Koktan
Andre Poulin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alcatel Lucent SAS
Original Assignee
Alcatel Lucent SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alcatel Lucent SAS
Priority to US12/826,102
Assigned to ALCATEL-LUCENT CANADA, INC. Assignment of assignors interest (see document for details). Assignors: KOKTAN, TOBY; POULIN, ANDRE
Assigned to ALCATEL LUCENT. Assignment of assignors interest (see document for details). Assignors: ALCATEL-LUCENT CANADA INC.
Application granted
Publication of US8086910B1
Publication of US20110320858A1
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G06F11/3423 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time where the assessed time is active or idle time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865 Monitoring of software

Definitions

  • the invention is directed to monitoring execution of software threads, particularly in a network processor, by detecting lockup or stalling of such execution.
  • Network processors execute what is commonly referred to as microcode to perform data path packet processing functions.
  • a network processor typically has a set of software threads (also referred to as tasks) which are spawned to perform packet processing operations by executing specific pieces of microcode.
  • Memory content corruption, for example a soft-error causing a memory bit to invert or “flip”, in a memory device used by the network processor may cause execution of one or more threads to lock up if the error corrupts a microcode instruction or a data structure used by the network processor. Additionally, a software bug or component defect in the network processor could interfere with normal processing, which could lead to thread execution lockups.
  • The result of thread execution lockup is that the locked up thread will no longer continue to process data path traffic, which can lead to a communication service outage or silent failure of the network communications device.
  • ECC: error correction coding
  • Hardware based ECC is not always feasible for various reasons, such as one or a combination of the following: added expense, insufficient space on the network processor to accommodate the extra hardware logic required for ECC codes, and performance degradation associated with the ECC hardware.
  • Embodiments of the invention are directed to monitoring execution of software threads, particularly by detecting a lockup or stall in execution of a software thread and initiating a remedial action in response.
  • Some embodiments of the invention automatically detect a lockup or stall in execution of a software thread by periodically sampling information corresponding to the thread, and, in accordance with a determination made using the information, initiate an attempt to recover from such a condition in execution without the need for manual intervention.
  • Some embodiments of the invention provide a method executed on a microprocessor to automatically detect a lockup or stall in execution of a software thread of a network processor, and initiate a remedial action to mitigate undesirable effects caused by the lockup or stall such as a prolonged communication service outage in data traffic carried by a communication system employing the network processor.
  • Other embodiments provide the method written in microcode and executed on the network processor, while other embodiments execute some steps of the method on the microprocessor and the remainder of the steps on the network processor.
  • some embodiments of the invention can be deployed in communications systems already in service by a software upgrade in the field, thereby avoiding the expense of hardware replacements in cases where ECC hardware was required.
  • a method of monitoring software thread execution comprises the steps of: detecting that execution of a software thread has timed-out; recording information corresponding to the software thread at a plurality of intervals over a duration of time; determining if some information so recorded remained unchanged for all of the plurality of intervals; and taking an action in accordance with the determination.
  • the step of recording may include recording microcode program counter information, software thread busy indication and originator information of the software thread.
  • the originator is typically the entity that initiated the software thread, e.g. a port or serial interface on the network processor. There may be a sequence number associated with the originator to prevent false positive thread stall detection in situations where the originator is often the same.
  • the method may additionally comprise: reporting a condition of the software thread responsive to an unchanging busy indication, the microcode program counter information having changed and the originator information having remained the same for all of the plurality of intervals.
  • the step of taking an action includes determining if the steps of recording and determining have already been performed twice with respect to the time-out interrupt; and resetting, responsive to said steps having already been performed twice, the network processor executing the software thread.
  • FIG. 1 is a flow chart illustrating a method of monitoring execution of software threads according to an embodiment of the invention.
  • FIG. 2 is a flow chart illustrating the steps in the method of FIG. 1 in greater detail.
  • a lockup is defined as follows: lockup is when a timer for thread execution has expired, such that a time-out interrupt or similar indication has been asserted or expiration of an allotted thread execution time has been otherwise detected, e.g. via polling elapsed execution time of the thread, an execution status of the thread indicates that execution is ongoing i.e. in a busy state, and a program counter of the thread's microcode is not changing.
  • a stall in execution is defined as when, like a lockup, an allotted time for thread execution has expired and the execution status of the thread is in a busy state; however unlike a lockup, the thread's microcode program counter is changing and the originator of the thread remains the same.
  • Embodiments of the invention are directed to reliably detecting software threads whose execution has locked up (i.e. stopped processing microcode) or stalled (i.e. not terminating but still appearing to be executing microcode) while monitoring execution of software threads in a manner that is not intrusive to time critical processing functions carried out by the software threads. Indeed, to address the latter, more intense monitoring of a given software thread starts when an initial indication of a potential, or an existing, lockup or stall condition in execution is detected with respect to the thread via a time-out indication corresponding thereto, such as a thread time-out interrupt.
  • staged escalation of monitoring states and examination of numerous pieces of information related to a software thread having a time-out indication are utilized before making a determination with regard to the condition of the thread's execution.
  • a debounce (i.e. double-check) monitoring pass may be performed.
  • memory bus utilization of the network processor executing the thread may be determined before proceeding to a lockup determination.
  • FIG. 1 is a flow chart of a method 100 of monitoring execution of software threads according to an embodiment of the invention.
  • the method 100 is executed on a general purpose microprocessor with access to the network processor.
  • the method 100 proceeds to detecting 104 that execution of a software thread has exceeded a predetermined execution time interval. Such detection could be facilitated by receiving a time-out interrupt with respect to the software thread.
  • other ways of determining that the software thread has exceeded its allotted execution time may be used instead of a time-out interrupt.
  • information corresponding to the timed-out thread, so indicated by the interrupt, is periodically sampled 106 over a duration of time.
  • a determination 108 is made whether or not some of the information samples remained unchanged over the entire duration. In other words, the information samples are compared to each other to determine if any are different from the others. There are many ways to do this, however this embodiment simply compares the first information sample to each of the other information samples to detect if any are different from it.
  • An action is then taken 110 in accordance with the determination 108 , such as to initiate a remedial response if execution of the software thread is determined to be locked up or stalled.
  • FIG. 2 is a flow chart that shows the steps of the method 100 in greater detail.
  • the step of detecting 104 includes a polling function that periodically polls the status of interrupts.
  • the detecting step 104 comprises determining 202 if a thread time-out interrupt has been received, and in the negative case, i.e. a thread time-out interrupt has not been received, the detecting step 104 waits 204 for a predetermined interval of time, in this case five milliseconds, and returns to the step of determining 202 if a thread time-out interrupt has been received.
  • This detecting step 104 monitors multiple software threads being executed on the network processor for an occurrence of a time-out interrupt with respect to any one of these threads. In the affirmative case, i.e. a thread time-out interrupt has been received or otherwise detected, a debounce flag is set 206 to false, the detecting step 104 exits and execution of the method 100 proceeds to the step of sampling 106 information corresponding to the timed-out software thread so indicated by the time-out interrupt.
  • the step 106 of sampling information corresponding to the timed-out software thread includes determining 208 if execution status of the timed-out software thread is busy. If it is not busy, the method 100 is terminated, and would typically start up again so as to continually monitor for software threads that have locked up or stalled in their execution. Otherwise, if execution status of the timed-out software thread is busy, the sampling step 106 proceeds to record 210 the value of the microcode program counter of the timed-out software thread. An indication of the originator of the timed-out software thread is also recorded 212. These two recording steps 210, 212 can be carried out in either order or could be done as one step. A timer is then checked to determine 214 if the duration, in this case four seconds, has elapsed. If the duration has not elapsed, the sampling step 106 waits for the predetermined interval of time, in this case five milliseconds, and then returns to the step of determining 208 if execution status of the timed-out software thread is still busy.
  • the step 108 of determining if some of the information samples remained unchanged over the entire duration includes determining 218 if all recorded values of the microcode program counter are the same over the duration, i.e. if the microcode program counter of the timed-out software thread remained unchanged over the entire duration. If the microcode program counter did not change over the duration the step 108 of determining proceeds to the step 110 of taking an action, otherwise a determination 220 is made whether or not the originator of the timed-out software thread was the same for the entire duration. If the originator of the timed-out software thread changed at any time over the duration the method 100 is terminated, and as before, it may restart so as to continually monitor for locked-up or stalled software threads. If the originator of the timed-out software thread did not change for the duration, the step 108 of determining proceeds to the step 110 of taking an action, which in this case includes reporting 236 the condition, e.g. to an operator or another software program.
  • the software thread cannot be locked-up or stalled and the microcode program counter and originator information is ignored. If the microcode program counter changed at some point in the duration but the originator of the timed-out software thread remained the same, execution of the timed-out software thread may have stalled, e.g. as executing in an endless loop, but could also be operating normally.
  • the action taken under this logical combination is to report 236 the condition (e.g. potential stalled thread) to allow for further action such as analysis to be taken.
  • the step 108 of determining should exit and proceed to the step 110 of taking an action. That is because if the microcode program counter has not changed over the duration, execution of the timed-out software thread could be locked up, and remedial action may be necessary.
  • the step 110 of taking an action includes, in the case that the microcode program counter was the same over the duration, checking 224 the memory bus utilization of the network processor executing the timed-out software thread.
  • a determination 226 is made whether or not that utilization is at zero, or alternatively below some threshold that accommodates any small inaccuracy in determining the utilization but still indicates that there has been no memory bus utilization by the network processor of the timed-out software thread over the duration. If the network processor executing the timed-out software thread has not utilized the memory bus over the duration, crash debug information pertaining to the timed-out software thread is dumped 230, or otherwise recorded, the network processor executing the timed-out software thread is reset 232 and its software is restarted. The method 100 terminates, and as before may restart automatically.
  • a determination 228 is made whether or not the debounce flag is true. In the affirmative case, i.e. the debounce flag is true indicating that the steps of sampling 106 and determining 108 have already been performed twice on this time-out interrupt for this timed-out software thread, then the steps of dumping 230 crash debug information and resetting 232 the network processor executing the timed-out software thread and restarting all of the network processor's software are performed. Otherwise in the negative case, i.e. the debounce flag is false, the debounce flag is set 234 to true, the step 110 of taking action ends and the method 100 proceeds to the step 106 of sampling information corresponding to the timed-out software thread.
  • checking 224 the memory bus utilization and determining 226 if the memory bus utilization is at zero or below a threshold are not necessary steps and may be omitted in some embodiments. These embodiments would be useful in cases where an indication of memory bus utilization is not available on a network processor. In embodiments where these steps of checking 224 and determining 226 are omitted, an affirmative determination 218 that the microcode program counter has remained unchanged is followed by the step of determining 228 if the debounce flag is true. The remainder of the method in these embodiments is the same as previously described.
  • this software mechanism will effectively eliminate very undesirable customer service outages or “silent failures” that may occur due to network processor lockups from multiple possible causes with minimal network downtime.
  • ECC is typically not implemented across all memories in use by the network processor and therefore memory corruption and a subsequent partial (one or a few software threads) or complete (all threads) lockup is always a possibility.
  • This software solution can be applied to existing products which are already deployed in customer networks (no new hardware needed). The solution provides an effective mitigation against worst-case network equipment failure scenarios and helps reduce potential product returns (and damage to customer perceived quality) following a network outage or silent failure. Overall, embodiments of the invention improve the robustness of telecom products by increasing the reliability of network processor based architectures.
  • embodiments of the invention have broad applicability in telecom and other high-reliability applications that are likely to use network processors whether or not ECC protection is a viable option. Such embodiments can improve on existing solutions.
  • This software upgradeable solution increases the reliability of communications systems in existing and future customer deployments without costly hardware swapping and/or re-designs.
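
The action step described in the list above (checking 224 memory bus utilization, determining 226 whether it is idle, and determining 228 the debounce flag) can be sketched as a small decision function. This is a minimal illustration in Python and not the patented implementation. The names `Action` and `lockup_action`, the representation of utilization as a number, and the `idle_threshold` parameter are assumptions made for the example. The sketch also assumes that the debounce check 228 is reached when the memory bus shows activity, or when no utilization indication is available.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    RESET = "dump crash debug information and reset the network processor"
    RESAMPLE = "repeat the sampling pass with the debounce flag set"


def lockup_action(debounce: bool,
                  bus_utilization: Optional[float] = None,
                  idle_threshold: float = 0.0) -> Tuple[Action, bool]:
    """Decide what to do once the program counter was found unchanged.

    bus_utilization is None when the processor offers no such indication,
    in which case the check (steps 224 and 226) is skipped.
    Returns the chosen action and the new value of the debounce flag.
    """
    # Steps 224/226: an idle memory bus leads directly to a reset.
    if bus_utilization is not None and bus_utilization <= idle_threshold:
        return Action.RESET, debounce          # steps 230 and 232
    # Step 228: a second unchanged pass also leads to a reset.
    if debounce:
        return Action.RESET, debounce          # steps 230 and 232
    # Step 234: first pass, so set the flag and sample again (step 106).
    return Action.RESAMPLE, True
```

Calling `lockup_action(False, bus_utilization=0.3)` asks for one more sampling pass, while the same call with `debounce=True` selects the reset.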

Abstract

The invention is directed to monitoring execution of software threads, particularly by detecting a lockup or stall in execution of a software thread and initiating a remedial action in response. Advantageously, some embodiments of the invention automatically detect a lockup or stall in execution of a software thread by periodically sampling information corresponding to the thread, and, in accordance with a determination made using the information, initiate an attempt to recover from such a condition in execution without the need for manual intervention.

Description

FIELD OF THE INVENTION
The invention is directed to monitoring execution of software threads, particularly in a network processor, by detecting lockup or stalling of such execution.
BACKGROUND OF THE INVENTION
Network processors (NPs) are employed in many of today's communications products, as opposed to traditional application specific integrated circuits (ASICs) or field programmable gate array (FPGA) fixed hardware, primarily due to the fact that the architecture of these processors provides the flexibility of a software based feature set solution with the high performance of ASICs. Network processors utilize parallel processing or serial pipelines and are programmable like general purpose microprocessors, but are optimized for packet processing operations required by data packet network communication devices.
Network processors execute what is commonly referred to as microcode to perform data path packet processing functions. A network processor typically has a set of software threads (also referred to as tasks) which are spawned to perform packet processing operations by executing specific pieces of microcode.
Memory content corruption, for example a soft-error causing a memory bit to invert or “flip”, in a memory device used by the network processor may cause execution of one or more threads to lock up if the error corrupts a microcode instruction or a data structure used by the network processor. Additionally, a software bug or component defect in the network processor could interfere with normal processing, which could lead to thread execution lockups.
The result of thread execution lockup is that the locked up thread will no longer continue to process data path traffic, which can lead to a communication service outage or silent failure of the network communications device.
Soft-errors (single bit flips) can be mitigated effectively with hardware based error correction coding (ECC) protection. However, in many cases it is not practical or even feasible to have 100% ECC coverage across all memories of a given network processor. Furthermore, ECC does not protect against multi-bit corruption or microcode software defects that can also lead to memory corruption and subsequent network processor thread execution lockup.
Hardware based ECC is not always feasible for various reasons, such as one or a combination of the following: added expense, insufficient space on the network processor to accommodate the extra hardware logic required for ECC codes, and performance degradation associated with the ECC hardware.
Good hardware design and component quality can reduce but cannot completely eliminate the possibility of memory corruption due to soft-errors. Similarly, good software development practices can reduce but cannot completely eliminate the possibility of software bugs that escape development testing.
Therefore, a way of mitigating the undesirable effects of network processor thread execution lockups that does not require ECC hardware is desired.
SUMMARY
Embodiments of the invention are directed to monitoring execution of software threads, particularly by detecting a lockup or stall in execution of a software thread and initiating a remedial action in response.
Some embodiments of the invention automatically detect a lockup or stall in execution of a software thread by periodically sampling information corresponding to the thread, and, in accordance with a determination made using the information, initiate an attempt to recover from such a condition in execution without the need for manual intervention.
Some embodiments of the invention provide a method executed on a microprocessor to automatically detect a lockup or stall in execution of a software thread of a network processor, and initiate a remedial action to mitigate undesirable effects caused by the lockup or stall such as a prolonged communication service outage in data traffic carried by a communication system employing the network processor. Other embodiments provide the method written in microcode and executed on the network processor, while other embodiments execute some steps of the method on the microprocessor and the remainder of the steps on the network processor.
Advantageously, some embodiments of the invention can be deployed in communications systems already in service by a software upgrade in the field, thereby avoiding the expense of hardware replacements in cases where ECC hardware was required.
According to an aspect of the invention a method of monitoring software thread execution is provided. The method comprises the steps of: detecting that execution of a software thread has timed-out; recording information corresponding to the software thread at a plurality of intervals over a duration of time; determining if some information so recorded remained unchanged for all of the plurality of intervals; and taking an action in accordance with the determination.
Advantageously, the step of recording may include recording microcode program counter information, software thread busy indication and originator information of the software thread. The originator is typically the entity that initiated the software thread, e.g. a port or serial interface on the network processor. There may be a sequence number associated with the originator to prevent false positive thread stall detection in situations where the originator is often the same. Furthermore, the method may additionally comprise: reporting a condition of the software thread responsive to an unchanging busy indication, the microcode program counter information having changed and the originator information having remained the same for all of the plurality of intervals.
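
The role of the sequence number can be illustrated with a short sketch. It is a hypothetical example in Python rather than the disclosed implementation. The `Originator` record, its field names, and the assumption that the sequence number advances each time the same source initiates a new thread are all introduced here for illustration only.

```python
from typing import NamedTuple, Sequence


class Originator(NamedTuple):
    """Identity of the entity that initiated a thread (hypothetical layout)."""
    source: str    # e.g. the name of a port or serial interface
    sequence: int  # assumed to advance whenever the source starts a new thread


def originator_unchanged(samples: Sequence[Originator]) -> bool:
    """Return True if every sample matches the first one, field by field."""
    return all(sample == samples[0] for sample in samples[1:])
```

With the source name alone, a busy port that keeps starting new threads would look like one unchanging originator. Comparing the pair of source and sequence number lets the samples differ in that situation, so no stall is reported.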
Advantageously, the step of taking an action includes determining if the steps of recording and determining have already been performed twice with respect to the time-out interrupt; and resetting, responsive to said steps having already been performed twice, the network processor executing the software thread.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of the preferred embodiments, as illustrated in the appended drawings, where:
FIG. 1 is a flow chart illustrating a method of monitoring execution of software threads according to an embodiment of the invention.
FIG. 2 is a flow chart illustrating the steps in the method of FIG. 1 in greater detail.
In the figures, like features are denoted by like reference characters.
DETAILED DESCRIPTION
The foregoing has referred to the detection of a lockup or stall in execution of a software thread. Herein, a lockup is defined as the condition in which all of the following hold: a timer for thread execution has expired, such that a time-out interrupt or similar indication has been asserted, or expiration of an allotted thread execution time has been otherwise detected, e.g. by polling the elapsed execution time of the thread; an execution status of the thread indicates that execution is ongoing, i.e. the thread is in a busy state; and a program counter of the thread's microcode is not changing. A stall in execution is defined as the condition in which, as with a lockup, an allotted time for thread execution has expired and the execution status of the thread is busy; however, unlike a lockup, the thread's microcode program counter is changing and the originator of the thread remains the same. When these conditions indicating a stall are met, the thread may be executing in an endless loop, but it may also be executing properly. Therefore, when a stall is detected according to the foregoing conditions, an action is initiated that accounts for both possibilities, as will be described later.
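
The two definitions can be summarized as a decision over a series of observations. The following is a minimal sketch in Python under stated assumptions, not the disclosed implementation: the names `Sample`, `Condition`, and `classify` are hypothetical, and the samples are assumed to have been collected after a time-out indication for the thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Condition(Enum):
    NORMAL = "normal"
    LOCKUP = "lockup"
    POSSIBLE_STALL = "possible stall"


@dataclass(frozen=True)
class Sample:
    """One observation of a timed-out thread (hypothetical record layout)."""
    busy: bool
    program_counter: int
    originator: str


def classify(timed_out: bool, samples: Sequence[Sample]) -> Condition:
    """Apply the lockup and stall definitions to a series of samples."""
    if not timed_out or not samples:
        return Condition.NORMAL
    # Both conditions require the thread to report busy throughout.
    if not all(s.busy for s in samples):
        return Condition.NORMAL
    first = samples[0]
    if all(s.program_counter == first.program_counter for s in samples):
        return Condition.LOCKUP
    if all(s.originator == first.originator for s in samples):
        # May be an endless loop, but may also be proper execution.
        return Condition.POSSIBLE_STALL
    return Condition.NORMAL
```

The stall result is deliberately named `POSSIBLE_STALL`, since the definition admits that a thread meeting these conditions may still be executing properly.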
Embodiments of the invention are directed to reliably detecting software threads whose execution has locked up (i.e. stopped processing microcode) or stalled (i.e. not terminating but still appearing to be executing microcode) while monitoring execution of software threads in a manner that is not intrusive to time critical processing functions carried out by the software threads. Indeed, to address the latter, more intense monitoring of a given software thread starts when an initial indication of a potential, or an existing, lockup or stall condition in execution is detected with respect to the thread via a time-out indication corresponding thereto, such as a thread time-out interrupt.
In order to reduce the possibility of a false positive lockup declaration, staged escalation of monitoring states and examination of numerous pieces of information related to a software thread having a time-out indication are utilized before making a determination with regard to the condition of the thread's execution. Additionally, to mitigate the risk of a false positive lockup declaration, a debounce (i.e. double-check) monitoring pass may be performed. Additionally, memory bus utilization of the network processor executing the thread may be determined before proceeding to a lockup determination. Although embodiments of the invention find advantageous use in network processors, they can also be used in more general processing devices.
FIG. 1 is a flow chart of a method 100 of monitoring execution of software threads according to an embodiment of the invention. The method 100 is executed on a general purpose microprocessor with access to the network processor. After starting 102, the method 100 proceeds to detecting 104 that execution of a software thread has exceeded a predetermined execution time interval. Such detection could be facilitated by receiving a time-out interrupt with respect to the software thread. As mentioned previously, other ways of determining that the software thread has exceeded its allotted execution time may be used instead of a time-out interrupt. After the thread time-out interrupt is received, information corresponding to the timed-out thread, so indicated by the interrupt, is periodically sampled 106 over a duration of time. Several information samples are taken in a periodic manner, although other embodiments may sample the information in a non-periodic manner. The duration is long enough to allow several information samples to be taken; for example, in this embodiment the information is sampled every five milliseconds for a duration of four seconds. A determination 108 is made whether or not some of the information samples remained unchanged over the entire duration. In other words, the information samples are compared to each other to determine if any are different from the others. There are many ways to do this; this embodiment simply compares the first information sample to each of the other information samples to detect if any are different from it. An action is then taken 110 in accordance with the determination 108, such as initiating a remedial response if execution of the software thread is determined to be locked up or stalled.
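The sampling arithmetic and the first-versus-rest comparison of this embodiment can be sketched in Python as follows. This is an illustrative sketch only: the function names `sample_count` and `all_unchanged` are hypothetical and do not appear in the patent, and treating an empty sample list as "not unchanged" is an assumption. With the example values given above (a five millisecond interval over a four second duration), on the order of 800 samples are taken.

```python
from typing import List


def sample_count(duration_ms: int, interval_ms: int) -> int:
    """Number of whole sampling intervals that fit in the duration."""
    return duration_ms // interval_ms


def all_unchanged(samples: List[object]) -> bool:
    """Compare the first sample to each of the others, as this embodiment does."""
    if not samples:
        # Assumption: with no samples, nothing can be declared unchanged.
        return False
    first = samples[0]
    return all(sample == first for sample in samples[1:])


# Example values from the embodiment: 5 ms interval, 4 s duration.
print(sample_count(4000, 5))
```

Comparing against the first sample is sufficient because any change at all during the duration makes at least one later sample differ from the first or return to it only after differing, and the latter case is still caught at the sample where it differed.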
FIG. 2 is a flow chart that shows the steps of the method 100 in greater detail. The step of detecting 104 includes a polling function that periodically polls the status of interrupts. The detecting step 104 comprises determining 202 if a thread time-out interrupt has been received, and in the negative case, i.e. a thread time-out interrupt has not been received, the detecting step 104 waits 204 for a predetermined interval of time, in this case five milliseconds, and returns to the step of determining 202 if a thread time-out interrupt has been received. This detecting step 104 monitors multiple software threads being executed on the network processor for an occurrence of a time-out interrupt with respect to any one of these threads. In the affirmative case, i.e. a thread time-out interrupt has been received or otherwise detected, a debounce flag is set 206 to false, the detecting step 104 exits, and execution of the method 100 proceeds to the step of sampling 106 information corresponding to the timed-out software thread so indicated by the time-out interrupt.
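The detecting step 104 can be sketched as a simple polling loop. The sketch below is not the patented implementation; `poll_interrupts` is a hypothetical hook that is assumed to return the identifier of a timed-out thread, or `None` when no time-out interrupt is pending, and the returned dictionary is merely one convenient way to carry the debounce flag into the next step.

```python
import time
from typing import Callable, Optional


def wait_for_timeout(
    poll_interrupts: Callable[[], Optional[int]],
    poll_interval_s: float = 0.005,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until some thread reports a time-out, then begin a monitoring pass."""
    while True:
        thread_id = poll_interrupts()      # determining 202
        if thread_id is not None:
            # Setting 206: this is the first pass, so the debounce flag is false.
            return {"thread_id": thread_id, "debounce": False}
        sleep(poll_interval_s)             # waiting 204 (five milliseconds here)
```

Passing `sleep` as a parameter is a design choice made only so the loop can be exercised without real delays.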
The step 106 of sampling information corresponding to the timed-out software thread includes determining 208 if execution status of the timed-out software thread is busy. If it is not busy the method 100 is terminated, and would typically start up again so as to continually monitor for software threads that have locked up or stalled in their execution. Otherwise, if execution status of the timed-out software thread is busy, the sampling step 106 proceeds to record 210 the value of the microcode program counter of the timed-out software thread. An indication of the originator of the timed-out software thread is also recorded 212. These two recording steps 210, 212 can be carried out in either order or could be done as one step. A timer is then checked to determine 214 if the duration, in this case four seconds, has elapsed. If the duration has not elapsed, the sampling step 106 waits for the predetermined interval of time, in this case five milliseconds, and then returns to the step of determining 208 if execution status of the timed-out software thread is still busy.
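A minimal sketch of the sampling step 106 follows. The accessor callables `is_busy`, `read_pc` and `read_originator` are hypothetical stand-ins for whatever register or status reads a given network processor exposes; they are assumptions, as is the choice to return `None` when the busy indication clears.

```python
import time
from typing import Callable, List, Optional, Tuple

Sample = Tuple[int, int]  # (microcode program counter, originator)


def sample_thread(
    is_busy: Callable[[], bool],
    read_pc: Callable[[], int],
    read_originator: Callable[[], int],
    duration_s: float = 4.0,
    interval_s: float = 0.005,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[Sample]]:
    """Return the recorded samples, or None if the thread stopped being busy."""
    samples: List[Sample] = []
    start = clock()
    while True:
        if not is_busy():                  # determining 208
            return None                    # not busy: the method terminates
        # Recording 210 and 212, done here as a single step.
        samples.append((read_pc(), read_originator()))
        if clock() - start >= duration_s:  # determining 214
            return samples
        sleep(interval_s)                  # wait, then re-check busy status
```

A monotonic clock is used so that wall-clock adjustments cannot shorten or lengthen the sampling duration.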
The step 108 of determining if some of the information samples remained unchanged over the entire duration includes determining 218 if all recorded values of the microcode program counter are the same over the duration, i.e. if the microcode program counter of the timed-out software thread remained unchanged over the entire duration. If the microcode program counter did not change over the duration the step 108 of determining proceeds to the step 110 of taking an action, otherwise a determination 220 is made whether or not the originator of the timed-out software thread was the same for the entire duration. If the originator of the timed-out software thread changed at any time over the duration the method 100 is terminated, and as before, it may restart so as to continually monitor for locked-up or stalled software threads. If the originator of the timed-out software thread did not change for the duration, the step 108 of determining proceeds to the step 110 of taking an action, which in this case includes reporting 236 the condition, e.g. to an operator or another software program.
The steps 218, 220, 208 of determining if any or all of the thread busy indication, microcode program counter and originator remained the same for the duration over which the samples thereof were taken could be performed in any order, as long as the actions taken on the logical combination of the results of the determinations 218, 220, 208 are the same as in the foregoing description. That is, if the thread busy indication cleared, or if both the microcode program counter and the originator of the timed-out software thread changed, at some point in the duration, then the step 108 of determining should exit and the method 100 should terminate. That is because these logical combinations indicate that execution of the timed-out software thread is neither locked up nor stalled. If the busy indication of the thread cleared at some point in the duration then the software thread cannot be locked up or stalled, and the microcode program counter and originator information is ignored. If the microcode program counter changed at some point in the duration but the originator of the timed-out software thread remained the same, execution of the timed-out software thread may have stalled, e.g. by executing in an endless loop, but could also be operating normally. The action taken under this logical combination is to report 236 the condition (e.g. potential stalled thread) to allow for further action, such as analysis, to be taken. However, if the microcode program counter and busy indication remained the same for the duration, irrespective of changes or the lack thereof in the originator over the duration, then the step 108 of determining should exit and proceed to the step 110 of taking an action. That is because if the microcode program counter has not changed over the duration, execution of the timed-out software thread could be locked up, and remedial action may be necessary.
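The logical combinations described above can be summarized as a small decision function. This is a sketch under stated assumptions rather than the patented implementation: the `Outcome` names are hypothetical, and a `None` (or empty) sample list is assumed to represent the case where the busy indication cleared during sampling.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Outcome(Enum):
    TERMINATE = "neither locked up nor stalled"
    REPORT_STALL = "potential stall, report 236"
    CHECK_LOCKUP = "potential lockup, proceed to step 110"


def evaluate(samples: Optional[List[Tuple[int, int]]]) -> Outcome:
    """Map recorded (program counter, originator) samples to an outcome."""
    if not samples:
        # Busy indication cleared: counter and originator are ignored.
        return Outcome.TERMINATE
    program_counters = {pc for pc, _ in samples}
    originators = {originator for _, originator in samples}
    if len(program_counters) == 1:
        # Determination 218 affirmative: the originator is irrelevant.
        return Outcome.CHECK_LOCKUP
    if len(originators) == 1:
        # Counter changed but originator did not (determination 220).
        return Outcome.REPORT_STALL
    # Both counter and originator changed: the thread is making progress.
    return Outcome.TERMINATE
```

Because the function depends only on the recorded samples, the order in which the three determinations are evaluated does not affect the result, which mirrors the remark that the steps may be performed in any order.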
The step 110 of taking an action includes, in the case that the microcode program counter was the same over the duration, checking 224 the memory bus utilization of the network processor executing the timed-out software thread. A determination 226 is made whether or not that utilization is at zero, or alternatively below some threshold that accommodates any small inaccuracy in determining the utilization but still indicates that there has been no memory bus utilization by the network processor of the timed-out software thread over the duration. If the network processor executing the timed-out software thread has not utilized the memory bus over the duration, crash debug information pertaining to the timed-out software thread is dumped 230, or otherwise recorded, the network processor executing the timed-out software thread is reset 232, and its software is restarted. The method 100 terminates, and as before may restart automatically. If the memory bus utilization of the network processor executing the timed-out software thread is greater than zero, or greater than or equal to the threshold such that some bus utilization over the duration is indicated, a determination 228 is made whether or not the debounce flag is true. In the affirmative case, i.e. the debounce flag is true, indicating that the steps of sampling 106 and determining 108 have already been performed twice on this time-out interrupt for this timed-out software thread, the steps of dumping 230 crash debug information, resetting 232 the network processor executing the timed-out software thread, and restarting all of the network processor's software are performed. Otherwise, in the negative case, i.e. the debounce flag is false, the debounce flag is set 234 to true, the step 110 of taking action ends, and the method 100 proceeds to the step 106 of sampling information corresponding to the timed-out software thread.
It should be noted that checking 224 the memory bus utilization and determining 226 if the memory bus utilization is at zero or below a threshold are not necessary steps and may be omitted in some embodiments. These embodiments would be useful in cases where an indication of memory bus utilization is not available on a network processor. In embodiments where these steps of checking 224 and determining 226 are omitted, an affirmative determination 218 that the microcode program counter has remained unchanged is followed by the step of determining 228 if the debounce flag is true. The remainder of the method in these embodiments is the same as previously described.
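The action taken once the microcode program counter is found unchanged, including the debounce pass and the optional memory bus check, can be sketched as follows. The function name `lockup_action`, the action strings, and the convention of passing `None` when no bus utilization indication is available are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def lockup_action(
    bus_utilization: Optional[float],
    debounce: bool,
    threshold: float = 0.0,
) -> Tuple[str, bool]:
    """Return (action, new debounce flag) for a thread whose counter is unchanged.

    bus_utilization is None in embodiments that omit steps 224 and 226.
    """
    if bus_utilization is not None:        # checking 224
        # Determining 226: at zero, or below a small tolerance threshold.
        bus_idle = bus_utilization == 0 or bus_utilization < threshold
        if bus_idle:
            return ("dump_and_reset", debounce)   # steps 230 and 232
    if debounce:                           # determining 228
        # Sampling and determining have already been performed twice.
        return ("dump_and_reset", debounce)       # steps 230 and 232
    # Setting 234: mark the debounce pass and return to sampling 106.
    return ("resample", True)
```

In use, a caller would rerun the sampling and determining steps whenever the action is `"resample"`, carrying the returned flag into the second pass so that a reset is only declared after the double check.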
Advantageously, this software mechanism will effectively eliminate very undesirable customer service outages or "silent failures" that may occur due to network processor lockups from multiple possible causes, with minimal network downtime. Error-correcting code (ECC) protection is typically not implemented across all memories in use by the network processor, and therefore memory corruption and a subsequent partial (one or a few software threads) or complete (all threads) lockup is always a possibility. This software solution can be applied to existing products which are already deployed in customer networks (no new hardware needed). The solution provides an effective mitigation against worst-case network equipment failure scenarios and helps reduce potential product returns (and damage to customer perceived quality) following a network outage or silent failure. Overall, embodiments of the invention improve the robustness of telecom products by increasing the reliability of network processor based architectures.
Further advantageously, embodiments of the invention have broad applicability in telecom and other high-reliability applications that are likely to use network processors whether or not ECC protection is a viable option. Such embodiments can improve on existing solutions. This software upgradeable solution increases the reliability of communications systems in existing and future customer deployments without costly hardware swapping and/or re-designs.
Numerous modifications, variations and adaptations may be made to the embodiments of the invention described above without departing from the scope of the invention, which is defined in the claims.

Claims (20)

1. A method of monitoring software thread execution, comprising the steps of:
detecting that execution of a software thread has timed-out;
recording information corresponding to the software thread indicative of an execution problem at a plurality of intervals over a duration of time;
determining if some information so recorded remained unchanged for all of the plurality of intervals;
performing a debounce monitoring pass to prevent a false positive lockup declaration; and
taking an action in accordance with the determination.
2. The method of claim 1, wherein the step of recording comprises:
determining at each interval if execution of the software thread is in a busy state; and
terminating the method responsive to the execution of the software thread not being in the busy state.
3. The method of claim 2, wherein the step of recording further comprises:
recording microcode program counter information and originator information of the software thread, and
the action taken is responsive to the microcode program counter information having changed and the originator information having remained unchanged for all of the plurality of intervals.
4. The method of claim 3, wherein the action taken comprises:
reporting a condition of the software thread.
5. The method of claim 1, wherein the step of determining comprises:
terminating the method responsive to none of the information so recorded having remained unchanged for all of the plurality of intervals.
6. The method of claim 1, further comprising:
sampling the information corresponding to the software thread in a non-periodic manner.
7. The method of claim 1, further comprising:
sampling the information corresponding to the software thread for a duration long enough to allow several information samples to be taken.
8. The method of claim 1, further comprising:
periodically polling a status of interrupts.
9. The method of claim 1, further comprising:
setting a debounce flag to false after detection of a software thread time-out interrupt.
10. The method of claim 1, further comprising:
setting a debounce flag to true after the recording step and the determining step have been performed twice.
11. The method of claim 10, further comprising:
after setting the debounce flag to true, resetting the network processor executing the timed-out software thread.
12. A method of monitoring software thread execution, comprising the steps of:
detecting that execution of a software thread has timed-out;
recording information corresponding to the software thread indicative of an execution problem at a plurality of intervals over a duration of time;
determining if some information so recorded remained unchanged for all of the plurality of intervals;
performing a debounce monitoring pass to prevent a false positive lockup declaration; and
taking an action in accordance with the determination, wherein taking an action comprises:
checking if memory bus utilization of a network processor executing the software thread is below a threshold; and
resetting, responsive to said memory bus utilization being below the threshold, the network processor.
13. The method of claim 12, wherein the step of recording further comprises:
recording microcode program counter information of the software thread; and
the step of checking is responsive to the microcode program counter information having remained unchanged.
14. The method of claim 13, wherein the step of resetting comprises:
dumping crash debug information of the software thread.
15. The method of claim 12, wherein the step of determining comprises:
ascertaining, responsive to said memory bus utilization not being below the threshold, if the steps of recording and determining have been performed more than once with respect to the software thread; and
resetting, responsive to said steps having been performed more than once, the network processor.
16. The method of claim 15, further comprising:
repeating the steps of recording and determining if said steps have not been performed more than once with respect to the software thread.
17. The method of claim 12, wherein the step of detecting comprises:
receiving a time-out interrupt with respect to the software thread.
18. The method of claim 12, comprising:
executing the method on a microprocessor to automatically detect either a lockup or a stall in execution of the software thread being executed by the network processor.
19. A method of monitoring software thread execution, comprising:
detecting that execution of a software thread has timed-out;
recording information corresponding to the software thread indicative of an execution problem at a plurality of intervals over a duration of time;
determining if some information so recorded remained unchanged for all of the plurality of intervals; and
checking if memory bus utilization of a network processor executing the software thread is below a threshold.
20. A method of monitoring software thread execution, comprising:
detecting that execution of a software thread has timed-out;
recording information corresponding to the software thread indicative of an execution problem at a plurality of intervals over a duration of time;
determining if some information so recorded remained unchanged for all of the plurality of intervals; and
resetting a network processor that was executing the timed-out software thread.
US12/826,102 2010-06-29 2010-06-29 Monitoring software thread execution Expired - Fee Related US8086910B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/826,102 US8086910B1 (en) 2010-06-29 2010-06-29 Monitoring software thread execution

Publications (2)

Publication Number Publication Date
US8086910B1 true US8086910B1 (en) 2011-12-27
US20110320858A1 US20110320858A1 (en) 2011-12-29

Family

ID=45349940

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/826,102 Expired - Fee Related US8086910B1 (en) 2010-06-29 2010-06-29 Monitoring software thread execution

Country Status (1)

Country Link
US (1) US8086910B1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9063906B2 (en) * 2012-09-27 2015-06-23 International Business Machines Corporation Thread sparing between cores in a multi-threaded processor
EP3121724A1 (en) * 2015-07-24 2017-01-25 Thomson Licensing Method for monitoring a software program and corresponding electronic device, communication system, computer readable program product and computer readable storage medium
DE102016200777A1 (en) * 2016-01-21 2017-07-27 Robert Bosch Gmbh Method and apparatus for monitoring and controlling quasi-parallel execution threads in an event-oriented operating system
US10705843B2 (en) 2017-12-21 2020-07-07 International Business Machines Corporation Method and system for detection of thread stall

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6418542B1 (en) * 1998-04-27 2002-07-09 Sun Microsystems, Inc. Critical signal thread
US20020162053A1 (en) * 1999-03-10 2002-10-31 Os Ron Van User transparent software malfunction detection and reporting
US6457142B1 (en) * 1999-10-29 2002-09-24 Lucent Technologies Inc. Method and apparatus for target application program supervision
US7624352B2 (en) * 2000-04-06 2009-11-24 Microsoft Corporation Responsive user interface to manage a non-responsive application
US20030120896A1 (en) * 2001-06-29 2003-06-26 Jason Gosior System on chip architecture
US20080046782A1 (en) * 2003-07-31 2008-02-21 Michel Betancourt Automated Hang Detection in Java Thread Dumps
US7739689B1 (en) * 2004-02-27 2010-06-15 Symantec Operating Corporation Internal monitoring of applications in a distributed management framework
US20060200702A1 (en) * 2005-03-01 2006-09-07 Microsoft Corporation Method and system for recovering data from a hung application
US7823021B2 (en) * 2005-05-26 2010-10-26 United Parcel Service Of America, Inc. Software process monitor
US20070220513A1 (en) * 2006-03-15 2007-09-20 International Business Machines Corporation Automatic detection of hang, bottleneck and deadlock
US7530072B1 (en) * 2008-05-07 2009-05-05 International Business Machines Corporation Method to segregate suspicious threads in a hosted environment to prevent CPU resource exhaustion from hung threads
US7526682B1 (en) * 2008-06-20 2009-04-28 International Business Machines Corporation Effective diagnosis of software hangs
US20100077258A1 (en) * 2008-09-22 2010-03-25 International Business Machines Corporation Generate diagnostic data for overdue thread in a data processing system

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110173483A1 (en) * 2010-01-14 2011-07-14 Juniper Networks Inc. Fast resource recovery after thread crash
US8365014B2 (en) * 2010-01-14 2013-01-29 Juniper Networks, Inc. Fast resource recovery after thread crash
US20130132773A1 (en) * 2010-01-14 2013-05-23 Juniper Networks, Inc. Fast resource recovery after thread crash
US8627142B2 (en) * 2010-01-14 2014-01-07 Juniper Networks, Inc. Fast resource recovery after thread crash
CN102968352A (en) * 2012-12-14 2013-03-13 杨晓松 System and method for process monitoring and multi-stage recovery
CN102968352B (en) * 2012-12-14 2015-07-22 杨晓松 System and method for process monitoring and multi-stage recovery
US9558058B2 (en) * 2014-07-07 2017-01-31 International Business Machines Corporation Technology for stall detection
US20160259678A1 (en) * 2014-07-07 2016-09-08 International Business Machines Corporation Technology for stall detection
US9400701B2 (en) * 2014-07-07 2016-07-26 International Business Machines Corporation Technology for stall detection
US10019392B2 (en) 2015-03-20 2018-07-10 International Business Machines Corporation Preventing software thread blocking due to interrupts
US10019391B2 (en) 2015-03-20 2018-07-10 International Business Machines Corporation Preventing software thread blocking due to interrupts
US10572411B2 (en) 2015-03-20 2020-02-25 International Business Machines Corporation Preventing software thread blocking due to interrupts
US20180018240A1 (en) * 2016-07-18 2018-01-18 American Megatrends, Inc. Obtaining state information of threads of a device
US10802901B2 (en) * 2016-07-18 2020-10-13 American Megatrends International, Llc Obtaining state information of threads of a device
CN109716730A (en) * 2016-09-09 2019-05-03 微软技术许可有限责任公司 The automation performance adjustment of production application
US20190196937A1 (en) * 2016-09-09 2019-06-27 Microsoft Technology Licensing, Llc Automated Performance Debugging of Production Applications
US10915425B2 (en) * 2016-09-09 2021-02-09 Microsoft Technology Licensing, Llc Automated performance debugging of production applications
CN109716730B (en) * 2016-09-09 2021-10-22 微软技术许可有限责任公司 Method and computing device for automated performance debugging of production applications

Also Published As

Publication number Publication date
US20110320858A1 (en) 2011-12-29

Legal Events

Date Code Title Description
AS Assignment

Owner name: ALCATEL-LUCENT CANADA, INC., CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KOKTAN, TOBY;POULIN, ANDRE;SIGNING DATES FROM 20100628 TO 20100629;REEL/FRAME:024610/0540

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: ALCATEL LUCENT, FRANCE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ALCATEL-LUCENT CANADA INC.;REEL/FRAME:027068/0967

Effective date: 20111013

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20191227