US20090164565A1 - Redundant systems management frameworks for network environments - Google Patents

Redundant systems management frameworks for network environments

Info

Publication number
US20090164565A1
Authority
US
United States
Prior art keywords
central server
active agent
agent
active
meta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/242,796
Other versions
US8112518B2 (en)
Inventor
William Roy Underhill
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: UNDERHILL, WILLIAM ROY
Publication of US20090164565A1
Application granted
Publication of US8112518B2
Expired - Fee Related
Adjusted expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/40 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/046 Network management architectures or arrangements comprising network management agents or mobile agents therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L41/0661 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities

Definitions

  • FIG. 2 shows a schematic block diagram of an illustrative framework 200 for systems management in accordance with an embodiment of the invention.
  • The systems management framework 200 of FIG. 2 may include a target system 202 having a plurality of meta-agents 204A-204N and an active agent 206.
  • Active agent 206 may be connected to a redirection layer 210 configured to connect active agent 206 to a plurality of central servers 220A-220N through redundant network paths.
  • Redirection layer 210 may be implemented in many different ways, and may employ proxies to coordinate the redirection between various alternative paths.
  • The purpose of the redirection layer 210 is to make it possible for the active agent 206 to connect to the central servers 220A-220N in more than one way: by changing its connection to another one of the central servers 220A-220N via their respective gateways 230A-230N, by connecting over more than one network path in the redirection layer 210, or both.
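  • As an illustration only, the path selection described above might be sketched as follows. The source does not say how redirection layer 210 chooses among paths, so the ordered-list strategy, the function name connect_via_redirection, the try_connect callback, and the path labels are all assumptions made for this sketch.

```python
from typing import Callable, Iterable, Optional


def connect_via_redirection(
    paths: Iterable[str], try_connect: Callable[[str], bool]
) -> Optional[str]:
    """Return the first path on which try_connect succeeds, else None."""
    for path in paths:
        if try_connect(path):
            return path
    return None


# Hypothetical path labels: two network paths to gateway 230A, one to 230B.
paths = ["230A/path-1", "230A/path-2", "230B/path-1"]
unreachable = {"230A/path-1"}  # simulated network failure (assumption)
chosen = connect_via_redirection(paths, lambda p: p not in unreachable)
print(chosen)
```

  • In this sketch a failed first path falls through to the second path of the same gateway before a different gateway is tried; a real redirection layer could order or weight the alternatives differently.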
  • The meta-agents 204A-204N on the target system 202 are each configured to monitor the active agent 206 and to revive or restart the active agent 206.
  • The meta-agents 204A-204N may be implemented using endpoints that correspond to the central servers 220A-220N with a one-to-one relationship. This allows each of the central servers 220A-220N to have a dedicated agent on the target system 202 for monitoring both the connection to that central server and the active agent 206.
  • The active agent 206 is configured to be the only agent on the target system 202 that is capable of taking direct management action on the target system 202. All other agents (i.e. the meta-agents 204A-204N) on the target system 202 can only monitor the active agent 206, monitor the connection from the target system 202 to their respective central servers 220A-220N, and pass instructions to the active agent 206 for execution of specific tasks that change which of the central servers 220A-220N the active agent 206 is logged into.
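  • The division of roles described above can be pictured with a minimal sketch. The class and method names (ActiveAgent, MetaAgent, perform, log_into, relay_login) are hypothetical and do not come from the source; the point is only that the meta-agent class exposes no way to act on the target system directly.

```python
class ActiveAgent:
    """The only agent allowed to take management action on the target system."""

    def __init__(self, server):
        self.server = server      # central server currently logged into
        self.performed = []       # management tasks carried out so far

    def log_into(self, server):
        self.server = server

    def perform(self, task):
        self.performed.append(task)


class MetaAgent:
    """Monitors and relays; deliberately has no perform() method."""

    def __init__(self, server, active_agent):
        self.server = server      # one-to-one with a central server
        self.active_agent = active_agent

    def relay_login(self):
        # Instruct the active agent to log into this meta-agent's server.
        self.active_agent.log_into(self.server)


agent = ActiveAgent("220A")
meta_agents = {s: MetaAgent(s, agent) for s in ("220A", "220B", "220N")}
meta_agents["220B"].relay_login()
print(agent.server)
```

  • Keeping perform() off the meta-agent class is one way to make the "single acting agent" rule structural rather than a convention; other designs are equally possible.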
  • The systems management framework provides redundancy at the level of the central servers 220A-220N.
  • The plurality of central servers 220A-220N introduced earlier in FIG. 2 may be configured to monitor each other for proper operation.
  • Each central server 220A-220N may contain a sufficient number of endpoints 310A-310N to monitor every other central server 220A-220N.
  • Each central server 220A-220N may monitor every other central server 220A-220N by using a sufficient number of endpoints on each central server 220A-220N, each endpoint corresponding to another central server 220A-220N.
  • central server 220A includes endpoints 310B and 310N so that it can be monitored by central servers 220B and 220N;
  • central server 220B includes endpoints 310A and 310N so that it can be monitored by central servers 220A and 220N;
  • central server 220N includes endpoints 310A and 310B so that it can be monitored by central servers 220A and 220B.
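  • The full-mesh arrangement listed above can be generated mechanically. The following is a small sketch, assuming servers are identified by their reference numerals as strings; the function name endpoint_plan is hypothetical.

```python
def endpoint_plan(servers):
    """Map each central server to the peers that hold an endpoint on it."""
    return {s: [peer for peer in servers if peer != s] for s in servers}


plan = endpoint_plan(["220A", "220B", "220N"])
for server, monitors in plan.items():
    print(server, "is monitored by", ", ".join(monitors))
```

  • With three servers each one hosts two endpoints, matching the listing above; with n servers the full mesh needs n - 1 endpoints per server.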
  • Alternatively, each central server 220A-220N may monitor only one or perhaps several designated central servers 220A-220N such that failure or improper operation of a central server 220A-220N will be detected by one or more monitoring central servers 220A-220N. Any change of activity to a central server 220A-220N may then be noted and suitably reflected by also changing the central server into which the active agent 206 is logged, by sending an appropriate instruction through the meta-agent 204A-204N that corresponds to the new active central server.
  • As an example, suppose central server 220A has failed.
  • Central server 220B may then notify its meta-agent 204B on the target system 202 to reconfigure the active agent 206 to receive instructions from newly active central server 220B rather than the non-operational central server 220A.
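  • The failover example above might be sketched as follows. The source does not specify how the newly active central server is chosen, so picking the first healthy server in list order is an assumption, as are the names fail_over and ActiveAgent and the trick of deriving the meta-agent label from the server label.

```python
from dataclasses import dataclass


@dataclass
class ActiveAgent:
    logged_into: str  # central server the active agent currently uses


def fail_over(failed, servers, healthy, agent):
    """Pick a healthy replacement and relay the new login via its meta-agent."""
    for server in servers:
        if server != failed and server in healthy:
            meta_agent = "204" + server[-1]  # one-to-one pairing, e.g. 220B -> 204B
            agent.logged_into = server       # the meta-agent relays this login
            return server, meta_agent
    raise RuntimeError("no healthy central server available")


agent = ActiveAgent(logged_into="220A")
result = fail_over("220A", ["220A", "220B", "220N"], {"220B", "220N"}, agent)
print(result, agent.logged_into)
```

  • Only the login target of the active agent changes; no second agent starts performing management tasks.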
  • The active agent 206 remains the only agent on the target system 202 that is actively performing management tasks.
  • The systems management framework of the present invention also provides redundancy at the level of the active agent 206.
  • Any problems with the active agent 206 itself may be quickly detected by one or more of the meta-agents 204A-204N, and the active agent 206 may be revived or restarted as necessary to address the problem.
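  • A minimal sketch of this watchdog behaviour follows, assuming a simple alive flag stands in for whatever health check a real meta-agent would use. The names ActiveAgent, restart, and check_and_revive are hypothetical, and the checks run one after another rather than concurrently.

```python
class ActiveAgent:
    def __init__(self):
        self.alive = True
        self.restarts = 0

    def restart(self):
        self.alive = True
        self.restarts += 1


def check_and_revive(agent):
    """Restart the active agent if it is down; report whether a restart happened."""
    if agent.alive:
        return False
    agent.restart()
    return True


agent = ActiveAgent()
agent.alive = False  # simulated failure of the active agent
# Each meta-agent performs the same check in turn.
revived = [check_and_revive(agent) for _ in ("204A", "204B", "204N")]
print(revived, agent.restarts)
```

  • Because the check is repeated by every meta-agent, the first one to notice the failure restarts the agent and the later checks find it healthy. Concurrent meta-agents would need some form of coordination that this sketch does not model.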
  • By providing a single active agent 206 embodying a single endpoint that is capable of actively performing management tasks on the target system 202, the possibility that two or more agents may execute conflicting or duplicate management tasks on target system 202 is removed.
  • The basic approach is to provide redundancy at the server level, as well as at the agent level, for performing management tasks on a target system.
  • At the agent level, a hierarchical arrangement is provided with a single endpoint, called the active agent, which performs all management tasks on the target system.
  • Other agents, called meta-agents, monitor the active agent and the connectivity to their respective central servers, but do not directly manage the target system themselves.
  • The meta-agents on the target system may perform only limited monitoring of the central servers, and may not perform general purpose monitoring.
  • Any particular meta-agent monitors its connection to its central server only sufficiently to ascertain whether it can currently receive management instructions from its central server. Any change in the identity of the active central server thus results in a corresponding change at the meta-agent level such that the active agent continues to receive instructions from the new active central server.
  • the systems management framework as described above removes the single points of failure at the central server level and at the agent level, while avoiding duplication of agents performing management tasks on the target server. Redundancy may also be provided at the redirection layer to provide the necessary network availability that is required for the business purpose of the target system. Because the active agent is supported and monitored by multiple meta-agents, failure of the active agent itself brings prompt corrective action from one of the monitoring meta-agents. In addition, because there can be multiple network paths in the redirection layer, network failures will not blind the central servers or the meta-agents provided on the target system.
  • FIG. 4 shows a schematic flowchart 400 of an illustrative method in accordance with an embodiment.
  • Method 400 of FIG. 4 may begin at block 402 by setting up an active agent configured to take management actions on a target device or system. Method 400 may then proceed to block 404, where method 400 may configure a plurality of meta-agents to monitor their respective central servers and the active agent.
  • Method 400 may then proceed to block 405, where endpoints 310A-310N on the central servers 220A-220N are configured to monitor the central servers 220A-220N.
  • Method 400 may then proceed to block 406, where method 400 may monitor the active agent with at least one meta-agent.
  • Method 400 may then proceed to block 408, where method 400 may send management instructions to the active agent from the currently active central server.
  • Method 400 may then proceed to decision block 410, where method 400 checks for any indication of a failure or improper operation of the active agent. If no, method 400 may proceed directly to decision block 414. If yes, method 400 may proceed to block 412, where method 400 may revive or restart the active agent as necessary using one of the meta-agents. Method 400 may then proceed to decision block 414.
  • At decision block 414, method 400 may try to detect any indication of a failure or improper operation of the central server. If no, method 400 loops back to block 406 to continue. If yes, method 400 may proceed to block 416 to initiate a transfer of management operations from the failed central server to a new central server. Method 400 may then proceed to block 418, where the meta-agent corresponding to the new active central server may be selected to relay new central server login instructions to the active agent on the target device. Method 400 then loops back to block 406 to continue.
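  • One pass through blocks 406 to 418 of method 400 can be sketched as a single function. The dictionary layout of state, the key names, and the rule of promoting the first standby server are assumptions made for illustration; the source describes the flow but not a data model.

```python
def method_400_pass(state):
    """Run blocks 406-418 once and return a log of the blocks visited."""
    log = ["406: monitor active agent",
           "408: instructions from " + state["active_server"]]
    if not state["agent_ok"]:                  # decision block 410
        state["agent_ok"] = True               # block 412: revive or restart
        log.append("412: restart active agent")
    if not state["server_ok"]:                 # decision block 414
        new_server = state["standby"].pop(0)   # block 416: transfer operations
        state["active_server"] = new_server
        state["server_ok"] = True
        log.append("416: transfer to " + new_server)
        log.append("418: meta-agent relays login for " + new_server)
    return log                                 # the caller loops back to block 406


state = {"active_server": "220A", "standby": ["220B", "220N"],
         "agent_ok": False, "server_ok": False}
for entry in method_400_pass(state):
    print(entry)
```

  • Calling the function again on the updated state visits only blocks 406 and 408, which corresponds to the "loops back to block 406" path when no failure is detected.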
  • the meta-agents may be configured to relay other types of information, including management instructions.
  • the meta-agents may act as a go-between for all instructions between the active agent and the currently active central server. It will be appreciated, however, that this may require some additional overhead to operate the meta-agents.

Abstract

A redundant systems management framework and method for managing a target system includes an active agent configured to receive instructions from an active central server to perform management tasks on the target system and a plurality of meta-agents provided on the target system. Each meta-agent is an endpoint on the target system and is configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server. The central servers may also monitor each other for proper operation. Each meta-agent also is configured to monitor the active agent and to revive or restart the active agent upon detecting a failure of the active agent.

Description

  • CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is based upon and claims the benefit of priority under 35 USC §119(a) from Canadian Patent Application No. 2616229, filed Dec. 21, 2007, the content of which is incorporated herein in its entirety.
  • COPYRIGHT NOTICE
  • A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
  • BACKGROUND
  • The present invention relates to systems management frameworks for network environments.
  • Heretofore, various solutions have been proposed for management of systems connected in a network environment. In some systems management infrastructures, an agent is placed on a target system to perform various management tasks with a central server instructing the agent. However, in this configuration, the agent on the target system is left as a single point of failure. As well, the network path between the target system and the central servers may sometimes be a single point of failure. If communications are lost between the agent and the instructing central server, the central server may not be aware of the loss, or may not be able to restore communication with the target system. If there is a problem with the agent, the central server may not be able to detect the problem or fix it.
  • SUMMARY
  • The present invention relates to systems management frameworks for network environments.
  • In one aspect, the invention provides a systems management framework for managing a target system. The systems management framework includes an active agent configured to receive instructions from an active central server to perform management tasks on the target system and a plurality of meta-agents provided on the target system. Each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
  • In one embodiment, each meta-agent is configured to monitor the active agent, and is further configured to revive or restart the active agent upon detecting a failure of the active agent.
  • In another embodiment, each meta-agent is further configured to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
  • In another embodiment, each central server includes at least one endpoint for monitoring the operation of a central server, each endpoint corresponding to another central server.
  • In another embodiment, the central servers are configured to transfer management operations from a failed central server to a newly active central server.
  • In another embodiment, the active agent is configured to log in to and receive management instructions from the newly active central server.
  • In another embodiment, the systems management framework further includes a redundant redirection layer between the active agent and the central servers such that communication between the active agent and the central servers may take place over alternate paths.
  • In another aspect, the invention is a method for managing a target system including providing an active agent configured to receive instructions from an active central server to perform management tasks on the target system and providing a plurality of meta-agents on the target system, where each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
  • In one embodiment, the method further includes configuring each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
  • In another embodiment, the method further includes configuring each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
  • In another embodiment, the method further includes providing at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
  • In another embodiment, the method further includes configuring the central servers to transfer management operations from a failed central server to a newly active central server.
  • In another embodiment, the method further includes configuring the active agent to log into and receive management instructions from the newly active central server.
  • In another embodiment, the method further includes providing a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
  • In another aspect, there is provided a data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method for managing a target system that includes configuring an active agent to receive instructions from an active central server to perform management tasks on the target system and configuring a plurality of meta-agents on the target system. Each meta-agent is an endpoint on the target system and is configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
  • In an embodiment, the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
  • In another embodiment, the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
  • In another embodiment, the data processor readable medium further includes code that adapts the device to provide at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
  • In another embodiment, the data processor readable medium further includes code that adapts the device to configure the central servers to transfer management operations from a failed central server to a newly active central server.
  • In another embodiment, the data processor readable medium further includes code that adapts the device to configure the active agent to log into and receive management instructions from the newly active central server.
  • In another embodiment, the data processor readable medium further includes code that adapts the device to provide a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
  • These and other aspects of the invention will become apparent from the following more particular descriptions of exemplary embodiments.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
  • In the figures which illustrate exemplary embodiments of the invention:
  • FIG. 1 shows a generic data processing system that may provide a suitable operating environment;
  • FIG. 2 shows a schematic block diagram of an illustrative topology for a systems management framework in accordance with an embodiment;
  • FIG. 3 shows a more detailed block diagram of a plurality of central servers in the illustrative topology in FIG. 2; and
  • FIG. 4 shows a schematic flowchart of an illustrative method in accordance with an embodiment.
  • DETAILED DESCRIPTION
  • As noted above, the present invention relates to a systems management framework for networked environments.
  • The invention may be practiced in various embodiments. A suitably configured data processing system, and associated communications networks, devices, software and firmware may provide a platform for enabling one or more embodiments. By way of example, FIG. 1 shows a generic data processing system 100 that may include a central processing unit (“CPU”) 102 connected to a storage unit 104 and to a random access memory 106. The CPU 102 may process an operating system 101, application program 103, and data 123. The operating system 101, application program 103, and data 123 may be stored in storage unit 104 and loaded into memory 106, as may be required. An operator 107 may interact with the data processing system 100 using a video display 108 connected by a video interface 105, and various input/output devices such as a keyboard 110, mouse 112, and disk drive 114 connected by an I/O interface 109.
  • In known manner, the mouse 112 may be configured to control movement of a cursor in the video display 108, and to operate various graphical user interface (GUI) controls appearing in the video display 108 with a mouse button. The disk drive 114 may be configured to accept data processing system readable media 116. The data processing system 100 may form part of a network via a network interface 111, allowing the data processing system 100 to communicate with other suitably configured data processing systems (not shown). The particular configurations shown by way of example in this specification are not meant to be limiting.
  • As noted above, providing a single agent on a target system to perform management functions may result in a single point of failure that may lead to situations where the failure cannot be easily repaired by the systems management framework. In prior art systems, this single point of failure may have been tolerated as an acceptable risk, or the single point of failure may have been removed by providing a redundant second agent on the target system. Commonly, this second agent is configured to duplicate the function of the first agent. This approach may have some drawbacks including duplication of management functions and a doubling of the resources consumed by the redundant agents for managing a target system. In addition, there is a potential that uncoordinated or poorly coordinated redundant agents may take conflicting or duplicate management actions to correct a problem that may have unintended consequences, potentially resulting in instability.
  • The invention presents a novel framework for systems management involving a hierarchical arrangement of agents on a target system. The hierarchical arrangement of agents may include an active agent and a plurality of meta-agents that are configured to monitor and pass instructions to the active agent. Each of the meta-agents is associated in a one-to-one configuration with a central server and may be configured to monitor its respective central server for ongoing operation. An illustrative systems management framework in accordance with an embodiment of the invention will now be described in more detail.
  • Referring to FIG. 2, shown is a schematic block diagram of an illustrative framework 200 for systems management in accordance with an embodiment of the invention. The FIG. 2 systems management framework 200 may include a target system 202 having a plurality of meta-agents 204A-204N, and an active agent 206. Active agent 206 may be connected to a redirection layer 210 configured to connect active agent 206 to a plurality of central servers 220A-220N through redundant network paths. Redirection layer 210 may be implemented in many different ways, and may employ proxies to coordinate the redirection between various alternative paths. Regardless of the particular redirection implementation employed, the purpose of the redirection layer is to make it possible for the active agent 206 to connect to the central servers 220A-220N in more than one way, either by being able to change its connection to another one of the central servers 220A-220N via their respective gateways 230A-230N, or by being able to connect over more than one network path in the redirection layer 210, or both.
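  • By way of illustration only, the redirection behaviour described above might be sketched as a search over candidate paths. The function name, the path labels, and the try_connect callback are assumptions made for this sketch and do not appear in the specification, which leaves the redirection implementation open.

```python
from typing import Callable, Iterable, Optional


def connect_over_any_path(
    paths: Iterable[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first path on which try_connect succeeds, or None."""
    for path in paths:
        if try_connect(path):
            return path
    return None


# Hypothetical labels: two gateways, each reachable over two network paths.
paths = ["230A/net1", "230A/net2", "230B/net1", "230B/net2"]
down = {"230A/net1", "230A/net2"}  # pretend gateway 230A is unreachable

chosen = connect_over_any_path(paths, lambda p: p not in down)
print(chosen)
```

In this sketch the active agent simply falls through to the next gateway or network path when one is unavailable; an actual redirection layer may instead use proxies, as noted above.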
  • In an embodiment, meta-agents 204A-204N on the target system 202 are each configured to monitor the active agent 206 and to revive or restart the active agent 206 when necessary. The meta-agents 204A-204N may be implemented using endpoints that correspond to the central servers 220A-220N with a one-to-one relationship. This allows each of the central servers 220A-220N to have a dedicated agent on the target system 202 for monitoring the connection to the central server 220A-220N and the active agent 206.
  • The active agent 206 is configured to be the only agent on the target system 202 that is capable of taking direct management action on the target system 202. All other agents (i.e. the meta-agents 204A-204N) on the target system 202 can only monitor the active agent 206, monitor the connection from the target system 202 to their respective central servers 220A-220N, and pass instructions to the active agent 206 for execution of specific tasks to change which of the central servers 220A-220N the active agent 206 is logged into.
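  • The division of roles described above may be sketched as follows. This is a minimal illustration under stated assumptions: the class names, method names, and the task string are hypothetical and are not taken from the specification. The point shown is that only the active agent exposes a method for management action, while a meta-agent can only observe the active agent and relay a login instruction for its own central server.

```python
class ActiveAgent:
    """The only agent allowed to take direct management action."""

    def __init__(self, server):
        self.server = server   # central server currently logged into
        self.running = True
        self.actions = []      # record of management tasks performed

    def execute(self, task):
        if not self.running:
            raise RuntimeError("active agent is not running")
        self.actions.append(task)

    def login(self, server):
        self.server = server


class MetaAgent:
    """Paired one-to-one with a central server; never manages the target."""

    def __init__(self, server, active):
        self.server = server
        self.active = active

    def active_agent_ok(self):
        return self.active.running

    def relay_login(self):
        # The only instruction passed on: switch to this meta-agent's server.
        self.active.login(self.server)


active = ActiveAgent(server="220A")
meta_agents = {s: MetaAgent(s, active) for s in ("220A", "220B", "220N")}

meta_agents["220B"].relay_login()   # central server 220B becomes active
active.execute("apply-patch")       # hypothetical management task
```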
  • In an embodiment, the systems management framework provides redundancy at the level of the central servers 220A-220N. By way of illustration, as shown in FIG. 3, the plurality of central servers 220A-220N introduced earlier in FIG. 2 may be configured to monitor each other for proper operation. For this purpose, each central server 220A-220N may contain a sufficient number of endpoints 310A-310N to monitor every other central server 220A-220N.
  • In an embodiment, each central server 220A-220N may monitor every other central server 220A-220N by using a sufficient number of endpoints on each central server 220A-220N, each endpoint corresponding to another central server 220A-220N. Thus, in the illustrative example shown in FIG. 3, central server 220A includes endpoints 310B and 310N so that it can be monitored by central servers 220B and 220N; central server 220B includes endpoints 310A and 310N so that it can be monitored by central servers 220A and 220N; and central server 220N includes endpoints 310A and 310B so that it can be monitored by central servers 220A and 220B.
  • Alternatively, each central server 220A-220N may monitor only one or perhaps several designated central servers 220A-220N such that failure or improper operation of a central server 220A-220N will be detected by one or more monitoring central servers 220A-220N. Any change of activity to a central server 220A-220N may then be noted and suitably reflected by also changing the central server into which the active agent 206 is logged by sending appropriate instruction through the meta-agent 204A-204N that corresponds to the new active central server.
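  • The two monitoring arrangements described above (every central server monitoring every other, or each central server monitoring only designated peers) may be sketched with a single helper. The function name, the parameter k, and the choice of a circular ordering for designated peers are assumptions made for illustration; the specification does not prescribe how designated peers are selected.

```python
def monitoring_plan(servers, k=None):
    """Map each central server to the peers it monitors.

    k is the number of designated peers per server, taken in circular
    order; k=None means every other server (the full arrangement).
    """
    n = len(servers)
    if k is None:
        k = n - 1
    if not 0 < k < n:
        raise ValueError("k must be between 1 and len(servers) - 1")
    return {
        servers[i]: [servers[(i + j) % n] for j in range(1, k + 1)]
        for i in range(n)
    }


servers = ["220A", "220B", "220N"]
full = monitoring_plan(servers)        # each server monitors both others
ring = monitoring_plan(servers, k=1)   # each server monitors one peer

for server, peers in full.items():
    print(server, "monitors", peers)
```

With a circular assignment every central server is monitored by exactly k peers, so a failure of any one server is still detected by at least one monitoring server, at the cost of fewer endpoints than the full arrangement.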
  • As an illustrative example, referring back to FIG. 2, assume that central server 220A has failed. Upon detection by central server 220B that the afflicted central server 220A is not operational, central server 220B may notify its meta-agent 204B on the target system 202 to reconfigure the active agent 206 to receive instructions from newly active central server 220B rather than the non-operational central server 220A. The active agent 206 remains the only agent on the target system 202 that is actively performing management tasks.
  • In another embodiment, the systems management framework of the present invention also provides redundancy at the level of the active agent 206. Referring back to the illustrative example in FIG. 2, by providing a plurality of meta-agents 204A-204N to monitor the status of the active agent 206, any problems with the active agent 206 itself may be quickly detected by one or more of the meta-agents 204A-204N, and the active agent 206 may be revived or restarted as necessary to address the problem. By providing a single active agent 206 embodying a single endpoint that is capable of actively performing management tasks on the target system 202, the possibility that two or more agents may execute conflicting or duplicate management tasks on target system 202 is removed.
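  • The agent-level redundancy described above may be sketched as a simple watchdog pass. The names below are hypothetical, and the sketch assumes the meta-agents check the active agent one after another; meta-agents that run concurrently would need some form of coordination so that only one of them performs the restart.

```python
class ActiveAgent:
    def __init__(self):
        self.running = True
        self.restarts = 0

    def restart(self):
        self.running = True
        self.restarts += 1


def watchdog_pass(active, meta_agent_ids):
    """Let each meta-agent check the active agent once.

    Returns the id of the meta-agent that revived the active agent,
    or None if no revival was needed.
    """
    revived_by = None
    for meta_id in meta_agent_ids:
        if not active.running:
            active.restart()
            revived_by = meta_id
    return revived_by


active = ActiveAgent()
active.running = False  # simulate a failure of the active agent

first = watchdog_pass(active, ["204A", "204B", "204N"])
second = watchdog_pass(active, ["204A", "204B", "204N"])
```

Because the first meta-agent to observe the failure restarts the active agent, the remaining meta-agents see a healthy agent and take no action, so the restart happens once even though several meta-agents are watching.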
  • In the illustrative examples described above with respect to FIG. 2 and FIG. 3, the basic approach is to provide redundancy at the server level, as well as at the agent level for performing management tasks on a target system. In summary, at the agent level, a hierarchical arrangement is provided with a single endpoint, called the active agent, which performs all management tasks on the target system. Other agents, called meta-agents, monitor the active agent and the connectivity to their respective central servers, but do not directly manage the target system themselves.
  • In an embodiment, the meta-agents on the target system may perform only limited monitoring of the central servers, and may not perform general purpose monitoring. Thus, any particular meta-agent monitors its connection to its central server only sufficiently to ascertain whether it can currently receive management instructions from its central server. Any change in the identity of the active central server thus results in a corresponding change at the meta-agent level such that the active agent continues to receive instructions from the new active central server.
  • As will be appreciated, the systems management framework as described above removes the single points of failure at the central server level and at the agent level, while avoiding duplication of agents performing management tasks on the target system. Redundancy may also be provided at the redirection layer to provide the necessary network availability that is required for the business purpose of the target system. Because the active agent is supported and monitored by multiple meta-agents, failure of the active agent itself brings prompt corrective action from one of the monitoring meta-agents. In addition, because there can be multiple network paths in the redirection layer, network failures will not blind the central servers or the meta-agents provided on the target system.
  • While there may initially be some increased effort required at implementation to install and set up the meta-agents 204A-204N, active agent 206, and associated monitoring endpoints on each of the central servers 220A-220N, the ongoing efforts required to manage the systems management framework of the present invention are thought to be not significantly greater than the effort required for prior art systems. For example, updating of agent code and maintenance of profiles may be substantially the same because there is only one active agent 206 taking management action on the target system 202. As the systems management requirements for the target system 202 change, the requirements and implementation of the active agent 206 may also change, but the requirements and the implementation of the meta-agents 204A-204N need not change.
  • Now referring to FIG. 4, shown is a schematic flowchart 400 of an illustrative method in accordance with an embodiment. The FIG. 4 method 400 may begin, and at block 402 may set up an active agent configured to take management actions on a target device or system. Method 400 may then proceed to block 404, where method 400 may configure a plurality of meta-agents to monitor their respective central servers and the active agent.
  • Method 400 may then proceed to block 405, where endpoints 310A-310N on the central servers 220A-220N are configured to monitor the central servers 220A-220N.
  • Method 400 may then proceed to block 406, where method 400 may monitor the active agent with at least one meta-agent. Method 400 may then proceed to block 408, where method 400 may send management instructions to the active agent from the currently active central server. Method 400 may then proceed to decision block 410, where method 400 checks for any indication of a failure or improper operation of the active agent. If no, method 400 may proceed directly to decision block 414. If yes, method 400 may proceed to block 412, where method 400 may revive or restart the active agent as necessary using one of the meta-agents. Method 400 may then proceed to block 414.
  • At decision block 414, method 400 may try to detect any indication of a failure or improper operation of the central server. If no, method 400 loops back to block 406 to continue. If yes, method 400 may proceed to block 416 to initiate a transfer of management operations from the failed central server to a new central server. Method 400 may then proceed to block 418, where the meta-agent corresponding to the new active central server may be selected to relay new central server login instructions to the active agent on the target device. Method 400 then loops back to block 406 to continue.
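  • One pass through blocks 406 to 418 of method 400 may be sketched as a single function. The State container, the field names, and the rule for choosing the new central server (the first one reported as operational) are assumptions made for this illustration; the flowchart does not specify how the new central server is selected.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    servers_up: dict            # server name -> True if operational
    active_server: str
    agent_running: bool = True
    log: list = field(default_factory=list)


def step(state, pending_task=None):
    """One pass through blocks 406-418 of the flowchart."""
    # Blocks 406/408: the active central server instructs the active agent.
    server_ok = state.servers_up[state.active_server]
    if pending_task is not None and state.agent_running and server_ok:
        state.log.append(f"{state.active_server}: {pending_task}")

    # Blocks 410/412: a meta-agent revives a failed active agent.
    if not state.agent_running:
        state.agent_running = True
        state.log.append("restart active agent")

    # Blocks 414/416/418: transfer to a new central server, whose
    # meta-agent relays the login instruction to the active agent.
    if not server_ok:
        new = next((s for s, up in state.servers_up.items() if up), None)
        if new is not None:
            state.log.append(f"relogin via meta-agent for {new}")
            state.active_server = new
    return state


state = State({"220A": False, "220B": True}, active_server="220A",
              agent_running=False)
step(state)                      # recovers the agent and fails over
step(state, "collect-metrics")   # hypothetical management task
```

Calling step repeatedly corresponds to the loop back to block 406; each call either performs routine management or carries out the recovery actions of blocks 412 through 418.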
  • While various illustrative embodiments of the invention have been described above, it will be appreciated by those skilled in the art that variations and modifications may be made.
  • For example, rather than limiting the role of the meta-agents to that of relaying new central server login instructions to the active agent, the meta-agents may be configured to relay other types of information, including management instructions. In this case, the meta-agents may act as a go-between for all instructions between the active agent and the currently active central server. It will be appreciated, however, that this may require some additional overhead to operate the meta-agents.

Claims (21)

1. A systems management framework for managing a target system, the systems management framework comprising:
an active agent configured to receive instructions over a connection from an active central server to perform management tasks on the target system; and
a plurality of meta-agents provided on the target system, each meta-agent an endpoint on the target system configured to monitor a corresponding central server, and each meta-agent configured to reassign the active agent to receive instructions from another central server upon failure of the active central server.
2. The systems management framework of claim 1, wherein each meta-agent is configured to monitor the active agent, and is further configured to revive or restart the active agent upon detecting a failure of the active agent.
3. The systems management framework of claim 1, wherein each meta-agent is also configured to reassign the active agent to receive instructions from another central server upon failure of the connection from the target system to the active central server.
4. The systems management framework of claim 1, wherein each central server includes at least one endpoint for monitoring the operation of a central server, each endpoint corresponding to another central server.
5. The systems management framework of claim 4, wherein the central servers are configured to transfer management operations from a failed central server to a newly active central server by notifying a meta-agent on the target system.
6. The systems management framework of claim 5, wherein the active agent is configured to log into and receive management instructions from the newly active central server.
7. The systems management framework of claim 1, further comprising a redundant redirection layer between the active agent and the central servers such that communication between the active agent and the central servers may take place over alternate paths.
8. A method for managing a target system, the method comprising:
providing an active agent configured to receive instructions from an active central server to perform management tasks on the target system; and
providing a plurality of meta-agents on the target system, each meta-agent an endpoint on the target system configured to monitor a corresponding central server, and each meta-agent configured to reassign the active agent to another central server upon failure of the active central server.
9. The method of claim 8, further comprising configuring each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
10. The method of claim 9, further comprising configuring each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
11. The method of claim 8, further comprising providing at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
12. The method of claim 11, further comprising configuring the central servers to transfer management operations from a failed central server to a newly active central server.
13. The method of claim 12, further comprising configuring the active agent to log into and receive management instructions from the newly active central server.
14. The method of claim 8, further comprising providing a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
15. A data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method for managing a target system, the method comprising steps of:
configuring an active agent to receive instructions from an active central server to perform management tasks on the target system; and
configuring a plurality of meta-agents on the target system, each meta-agent an endpoint on the target system configured to monitor a corresponding central server, and each meta-agent configured to reassign the active agent to another central server upon failure of the active central server.
16. The data processor readable medium of claim 15, further comprising configuring each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
17. The data processor readable medium of claim 16, further comprising configuring each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
18. The data processor readable medium of claim 15, further comprising providing at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
19. The data processor readable medium of claim 18, further comprising configuring the central servers to transfer management operations from a failed central server to a newly active central server.
20. The data processor readable medium of claim 19, further comprising configuring the active agent to log into and receive management instructions from the newly active central server.
21. The data processor readable medium of claim 15, further comprising providing a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
US12/242,796 2007-12-21 2008-09-30 Redundant systems management frameworks for network environments Expired - Fee Related US8112518B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CACA2616229 2007-12-21
CA002616229A CA2616229A1 (en) 2007-12-21 2007-12-21 Redundant systems management frameworks for network environments
CA2616229 2007-12-21

Publications (2)

Publication Number Publication Date
US20090164565A1 true US20090164565A1 (en) 2009-06-25
US8112518B2 US8112518B2 (en) 2012-02-07

Family

ID=40789923

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/242,796 Expired - Fee Related US8112518B2 (en) 2007-12-21 2008-09-30 Redundant systems management frameworks for network environments

Country Status (2)

Country Link
US (1) US8112518B2 (en)
CA (1) CA2616229A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8607158B2 (en) 2010-12-09 2013-12-10 International Business Machines Corporation Content presentation in remote monitoring sessions for information technology systems
US8806360B2 (en) 2010-12-22 2014-08-12 International Business Machines Corporation Computing resource management in information technology systems
US10225135B2 (en) 2013-01-30 2019-03-05 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Provision of management information and requests among management servers within a computing network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5655081A (en) * 1995-03-08 1997-08-05 Bmc Software, Inc. System for monitoring and managing computer resources and applications across a distributed computing environment using an intelligent autonomous agent architecture
US6014686A (en) * 1996-06-21 2000-01-11 Telcordia Technologies, Inc. Apparatus and methods for highly available directory services in the distributed computing environment
US20040010586A1 (en) * 2002-07-11 2004-01-15 International Business Machines Corporation Apparatus and method for distributed monitoring of endpoints in a management region
US20040010716A1 (en) * 2002-07-11 2004-01-15 International Business Machines Corporation Apparatus and method for monitoring the health of systems management software components in an enterprise
US20040049572A1 (en) * 2002-09-06 2004-03-11 Hitachi, Ltd. Event notification in storage networks
US6996502B2 (en) * 2004-01-20 2006-02-07 International Business Machines Corporation Remote enterprise management of high availability systems
US7130909B2 (en) * 2003-10-07 2006-10-31 Hitachi, Ltd. Storage path control method
US20070180077A1 (en) * 2005-11-15 2007-08-02 Microsoft Corporation Heartbeat Heuristics
US7779109B2 (en) * 2007-01-31 2010-08-17 International Business Machines Corporation Facilitating synchronization of servers in a coordinated timing network

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9292547B1 (en) * 2010-01-26 2016-03-22 Hewlett Packard Enterprise Development Lp Computer data archive operations
US20120166957A1 (en) * 2010-12-22 2012-06-28 International Business Machines Corporation Content presentation in management sessions for information technology systems
US11165639B2 (en) * 2011-01-10 2021-11-02 Snowflake Inc. Fail-over in cloud services
US11736345B2 (en) 2011-01-10 2023-08-22 Snowflake Inc. System and method for extending cloud services into the customer premise
US11750452B2 (en) 2011-01-10 2023-09-05 Snowflake Inc. Fail-over in cloud services
EP2645635A1 (en) * 2012-03-29 2013-10-02 Nec Corporation Cluster monitor, method for monitoring a cluster, and computer-readable recording medium
US9049101B2 (en) 2012-03-29 2015-06-02 Nec Corporation Cluster monitor, method for monitoring a cluster, and computer-readable recording medium
US20210392059A1 (en) * 2015-06-05 2021-12-16 Cisco Technology, Inc. Auto update of sensor configuration
US11936663B2 (en) 2015-06-05 2024-03-19 Cisco Technology, Inc. System for monitoring and managing datacenters
US11042443B2 (en) * 2018-10-17 2021-06-22 California Institute Of Technology Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string

Also Published As

Publication number Publication date
US8112518B2 (en) 2012-02-07
CA2616229A1 (en) 2009-06-21

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION,NEW YO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021758/0025

Effective date: 20080922

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021758/0025

Effective date: 20080922

AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION,NEW YO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021620/0755

Effective date: 20080922

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021620/0755

Effective date: 20080922

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20160207