US20090164565A1 - Redundant systems management frameworks for network environments - Google Patents
- Publication number
- US20090164565A1 (application No. 12/242,796)
- Authority
- US
- United States
- Prior art keywords
- central server
- active agent
- agent
- active
- meta
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/04—Network management architectures or arrangements
- H04L41/042—Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/40—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/04—Network management architectures or arrangements
- H04L41/046—Network management architectures or arrangements comprising network management agents or mobile agents therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0659—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
- H04L41/0661—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
Definitions
- the present invention relates to systems management frameworks for network environments.
- an agent is placed on a target system to perform various management tasks with a central server instructing the agent.
- the agent on the target system is left as a single point of failure.
- the network path between the target system and the central servers may sometimes be a single point of failure. If communications are lost between the agent and the instructing central server, the central server may not be aware of the loss, or may not be able to restore communication with the target system. If there is a problem with the agent, the central server may not be able to detect the problem or fix it.
- the invention provides a systems management framework for managing a target system.
- the systems management framework includes an active agent configured to receive instructions from an active central server to perform management tasks on the target system and a plurality of meta-agents provided on the target system.
- Each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- each meta-agent is configured to monitor the active agent, and is further configured to revive or restart the active agent upon detecting a failure of the active agent.
- each meta-agent is further configured to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- each central server includes at least one endpoint for monitoring the operation of a central server, each endpoint corresponding to another central server.
- the central servers are configured to transfer management operations from a failed central server to a newly active central server.
- the active agent is configured to log in to and receive management instructions from the newly active central server.
- the systems management framework further includes a redundant redirection layer between the active agent and the central servers such that communication between the active agent and the central servers may take place over alternate paths.
- the invention is a method for managing a target system including providing an active agent configured to receive instructions from an active central server to perform management tasks on the target system and providing a plurality of meta-agents on the target system, where each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- the method further includes configuring each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
- the method further includes configuring each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- the method further includes providing at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
- the method further includes configuring the central servers to transfer management operations from a failed central server to a newly active central server.
- the method further includes configuring the active agent to log into and receive management instructions from the newly active central server.
- the method further includes providing a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
- a data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method for managing a target system that includes configuring an active agent to receive instructions from an active central server to perform management tasks on the target system and configuring a plurality of meta-agents on the target system.
- Each meta-agent is an endpoint on the target system and is configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
- the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- the data processor readable medium further includes code that adapts the device to provide at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
- the data processor readable medium further includes code that adapts the device to configure the central servers to transfer management operations from a failed central server to a newly active central server.
- the data processor readable medium further includes code that adapts the device to configure the active agent to log into and receive management instructions from the newly active central server.
- the data processor readable medium further includes code that adapts the device to provide a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
- FIG. 1 shows a generic data processing system that may provide a suitable operating environment
- FIG. 2 shows a schematic block diagram of an illustrative topology for a systems management framework in accordance with an embodiment
- FIG. 3 shows a more detailed block diagram of a plurality of central servers in the illustrative topology in FIG. 2 ;
- FIG. 4 shows a schematic flowchart of an illustrative method in accordance with an embodiment.
- the present invention relates to a systems management framework for networked environments.
- FIG. 1 shows a generic data processing system 100 that may include a central processing unit (“CPU”) 102 connected to a storage unit 104 and to a random access memory 106 .
- the CPU 102 may process an operating system 101 , application program 103 , and data 123 .
- the operating system 101 , application program 103 , and data 123 may be stored in storage unit 104 and loaded into memory 106 , as may be required.
- An operator 107 may interact with the data processing system 100 using a video display 108 connected by a video interface 105 , and various input/output devices such as a keyboard 110 , mouse 112 , and disk drive 114 connected by an I/O interface 109 .
- the mouse 112 may be configured to control movement of a cursor in the video display 108 , and to operate various graphical user interface (GUI) controls appearing in the video display 108 with a mouse button.
- the disk drive 114 may be configured to accept data processing system readable media 116 .
- the data processing system 100 may form part of a network via a network interface 111 , allowing the data processing system 100 to communicate with other suitably configured data processing systems (not shown).
- the particular configurations shown by way of example in this specification are not meant to be limiting.
- this single point of failure may lead to situations where the failure cannot be easily repaired by the systems management framework.
- this single point of failure may have been tolerated as an acceptable risk, or the single point of failure may have been removed by providing a redundant second agent on the target system.
- this second agent is configured to duplicate the function of the first agent.
- This approach may have some drawbacks including duplication of management functions and a doubling of the resources consumed by the redundant agents for managing a target system.
- uncoordinated or poorly coordinated redundant agents may take conflicting or duplicate management actions to correct a problem that may have unintended consequences, potentially resulting in instability.
- the invention presents a novel framework for systems management involving a hierarchical arrangement of agents on a target system.
- the hierarchical arrangement of agents may include an active agent and a plurality of meta-agents that are configured to monitor and pass instructions to the active agent.
- Each of the meta-agents is associated in a one-to-one configuration with a central server and may be configured to monitor its respective central server for ongoing operation.
- FIG. 2 shows a schematic block diagram of an illustrative framework 200 for systems management in accordance with an embodiment of the invention.
- the FIG. 2 systems management framework 200 may include a target system 202 having a plurality of meta-agents 204 A- 204 N, and an active agent 206 .
- Active agent 206 may be connected to a redirection layer 210 configured to connect active agent 206 to a plurality of central servers 220 A- 220 N through redundant network paths.
- Redirection layer 210 may be implemented in many different ways, and may employ proxies to coordinate the redirection between various alternative paths.
- the purpose of the redirection layer is to make it possible for the active agent 206 to connect to the central servers 220 A- 220 N in more than one way, either by being able to change its connection to another one of the central servers 220 A- 220 N via their respective gateways 230 A- 230 N, or by being able to connect over more than one network path in the redirection layer 210 , or both.
- meta-agents 204 A- 204 N on the target system 202 are each configured to monitor the active agent 206 to revive or restart the active agent 206 .
- the meta-agents 204 A- 204 N may be implemented using endpoints that correspond to the central servers 220 A- 220 N with a one-to-one relationship. This allows each of the central servers 220 A- 220 N to have a dedicated agent on the target system 202 for monitoring the connection to the central server 220 A- 220 N and the active agent 206 .
- the active agent 206 is configured to be the only agent on the target system 202 that is capable of taking direct management action on the target system 202 . All other agents (i.e. the meta-agents 204 A- 204 N) on the target system 202 can only monitor the active agent 206 , monitor the connection from the target system 202 to their respective central servers 220 A- 220 N, and pass instructions to the active agent 206 for execution of specific tasks to change which of the central servers 220 A- 220 N the active agent 206 is logged into.
- the systems management framework provides redundancy at the level of the central servers 220 A- 220 N.
- the plurality of central servers 220 A- 220 N introduced earlier in FIG. 2 may be configured to monitor each other for proper operation.
- each central server 220 A- 220 N may contain a sufficient number of endpoints 310 A- 310 N to monitor every other central server 220 A- 220 N.
- each central server 220 A- 220 N may monitor every other central server 220 A- 220 N by using a sufficient number of endpoints on each central server 220 A- 220 N, each endpoint corresponding to another central server 220 A- 220 N.
- central server 220 A includes endpoints 310 B and 310 N so that it can be monitored by central servers 220 B and 220 N;
- central server 220 B includes endpoints 310 A and 310 N so that it can be monitored by central servers 220 A and 220 N;
- central server 220 N includes endpoints 310 A and 310 B so that it can be monitored by central servers 220 A and 220 B.
- each central server 220 A- 220 N may monitor only one or perhaps several designated central servers 220 A- 220 N such that failure or improper operation of a central server 220 A- 220 N will be detected by one or more monitoring central servers 220 A- 220 N. Any change of activity to a central server 220 A- 220 N may then be noted and suitably reflected by also changing the central server into which the active agent 206 is logged by sending appropriate instruction through the meta-agent 204 A- 204 N that corresponds to the new active central server.
- central server 220 A has failed.
- central server 220 B may notify its meta-agent 204 B on the target system 202 to reconfigure the active agent 206 to receive instructions from newly active central server 220 B rather than the non-operational central server 220 A.
- the active agent 206 remains the only agent on the target system 202 that is actively performing management tasks.
- the systems management framework of the present invention also provides redundancy at the level of the active agent 206 .
- any problems with the active agent 206 itself may be quickly detected by one or more of the meta-agents 204 A- 204 N, and the active agent 206 may be revived or restarted as necessary to address the problem.
- By providing a single active agent 206 embodying a single endpoint that is capable of actively performing management tasks on the target system 202 , the possibility that two or more agents may execute conflicting or duplicate management tasks on target system 202 is removed.
- the basic approach is to provide redundancy at the server level, as well as at the agent level for performing management tasks on a target system.
- at the agent level, a hierarchical arrangement is provided with a single endpoint, called the active agent, which performs all management tasks on the target system.
- Other agents, called meta-agents, monitor the active agent and the connectivity to their respective central servers, but do not directly manage the target system themselves.
- the meta-agents on the target system may perform only limited monitoring of the central servers, and may not perform general purpose monitoring.
- any particular meta-agent monitors its connection to its central server only sufficiently to ascertain whether it can currently receive management instructions from its central server. Any change in the identity of the active central server thus results in a corresponding change at the meta-agent level such that the active agent continues to receive instructions from the new active central server.
- the systems management framework as described above removes the single points of failure at the central server level and at the agent level, while avoiding duplication of agents performing management tasks on the target server. Redundancy may also be provided at the redirection layer to provide the necessary network availability that is required for the business purpose of the target system. Because the active agent is supported and monitored by multiple meta-agents, failure of the active agent itself brings prompt corrective action from one of the monitoring meta-agents. In addition, because there can be multiple network paths in the redirection layer, network failures will not blind the central servers or the meta-agents provided on the target system.
- FIG. 4 shows a schematic flowchart 400 of an illustrative method in accordance with an embodiment.
- the FIG. 4 method 400 may begin, and at block 402 may set up an active agent configured to take management actions on a target device or system. Method 400 may then proceed to block 404 , where method 400 may configure a plurality of meta-agents to monitor their respective central servers and the active agent.
- Method 400 may then proceed to block 405 , where endpoints 310 A- 310 N on the central servers 220 A- 220 N are configured to monitor the central servers 220 A- 220 N.
- Method 400 may then proceed to block 406 , where method 400 may monitor the active agent with at least one meta-agent.
- Method 400 may then proceed to block 408 , where method 400 may send management instructions to the active agent from the currently active central server.
- Method 400 may then proceed to decision block 410 , where method 400 checks for any indication of a failure or improper operation of the active agent. If no, method 400 may proceed directly to decision block 414 . If yes, method 400 may proceed to block 412 , where method 400 may revive or restart the active agent as necessary using one of the meta-agents. Method 400 may then proceed to block 414 .
- At decision block 414 , method 400 may try to detect any indication of a failure or improper operation of the central server. If no, method 400 loops back to block 406 to continue. If yes, method 400 may proceed to block 416 to initiate a transfer of management operations from the failed central server to a new central server. Method 400 may then proceed to block 418 , where the meta-agent corresponding to the new active central server may be selected to relay new central server login instructions to the active agent on the target device. Method 400 then loops back to block 406 to continue.
- the meta-agents may be configured to relay other types of information, including management instructions.
- the meta-agents may act as a go-between for all instructions between the active agent and the currently active central server. It will be appreciated, however, that this may require some additional overhead to operate the meta-agents.
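The control flow of method 400 described above (blocks 406 to 418) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names, the `run_cycle` function, the boolean health flags, and the "first healthy server" selection rule are all hypothetical, since the text does not say how failures are detected or how the new central server is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class CentralServer:
    name: str
    healthy: bool = True  # assumption: health is reduced to a simple flag


@dataclass
class ActiveAgent:
    running: bool = True
    logged_into: str = ""
    executed: list = field(default_factory=list)

    def execute(self, instruction: str) -> None:
        # Only the active agent takes management actions on the target.
        self.executed.append(instruction)


def run_cycle(agent, servers, active_name, instruction):
    """One pass through blocks 408-418; returns (active server name, event log)."""
    log = []
    # Block 408: the currently active central server sends an instruction,
    # which only arrives if both the server and the agent are operating.
    if servers[active_name].healthy and agent.running:
        agent.execute(instruction)
    # Blocks 410/412: a meta-agent revives or restarts a failed active agent.
    if not agent.running:
        agent.running = True
        log.append("restarted active agent")
    # Blocks 414/416: on central server failure, transfer management
    # operations to another server (hypothetical rule: first healthy one).
    if not servers[active_name].healthy:
        candidates = [n for n, s in servers.items() if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy central server available")
        active_name = candidates[0]
        log.append(f"failover to {active_name}")
    # Block 418: the meta-agent paired with the active server relays
    # login instructions to the active agent.
    agent.logged_into = active_name
    return active_name, log
```

Calling `run_cycle` repeatedly corresponds to looping back to block 406; a caller would flip `healthy` or `running` to simulate failures between cycles.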
Abstract
Description
- This application is based upon and claims the benefit of priority under 35 USC §119(a) from Canadian Patent Application No. 2616229, filed Dec. 21, 2007, the content of which is incorporated herein in its entirety.
- A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
- The present invention relates to systems management frameworks for network environments.
- Heretofore, various solutions have been proposed for management of systems connected in a network environment. In some systems management infrastructures, an agent is placed on a target system to perform various management tasks with a central server instructing the agent. However, in this configuration, the agent on the target system is left as a single point of failure. As well, the network path between the target system and the central servers may sometimes be a single point of failure. If communications are lost between the agent and the instructing central server, the central server may not be aware of the loss, or may not be able to restore communication with the target system. If there is a problem with the agent, the central server may not be able to detect the problem or fix it.
- The present invention relates to systems management frameworks for network environments.
- In one aspect, the invention provides a systems management framework for managing a target system. The systems management framework includes an active agent configured to receive instructions from an active central server to perform management tasks on the target system and a plurality of meta-agents provided on the target system. Each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- In one embodiment, each meta-agent is configured to monitor the active agent, and is further configured to revive or restart the active agent upon detecting a failure of the active agent.
- In another embodiment, each meta-agent is further configured to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- In another embodiment, each central server includes at least one endpoint for monitoring the operation of a central server, each endpoint corresponding to another central server.
- In another embodiment, the central servers are configured to transfer management operations from a failed central server to a newly active central server.
- In another embodiment, the active agent is configured to log in to and receive management instructions from the newly active central server.
- In another embodiment, the systems management framework further includes a redundant redirection layer between the active agent and the central servers such that communication between the active agent and the central servers may take place over alternate paths.
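As a rough illustration of the redundant redirection layer in the embodiment above, the following sketch tries a list of alternate paths in order and returns the first one that works. The function name `connect_via_redirection`, the path strings, and the `try_connect` callback are hypothetical; the text does not prescribe how paths are represented or probed.

```python
from typing import Callable, Optional, Sequence


def connect_via_redirection(
    paths: Sequence[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first path on which the connection attempt succeeds, else None."""
    for path in paths:
        # Assumption: try_connect reports success or failure for one path.
        if try_connect(path):
            return path
    return None
```

For example, with paths `["gateway-a/primary", "gateway-a/backup", "gateway-b/primary"]` and a probe that rejects the first entry, the active agent would fall back to the second path rather than losing contact with the central servers.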
- In another aspect, the invention is a method for managing a target system including providing an active agent configured to receive instructions from an active central server to perform management tasks on the target system and providing a plurality of meta-agents on the target system, where each meta-agent is an endpoint on the target system configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- In one embodiment, the method further includes configuring each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
- In another embodiment, the method further includes configuring each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- In another embodiment, the method further includes providing at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
- In another embodiment, the method further includes configuring the central servers to transfer management operations from a failed central server to a newly active central server.
- In another embodiment, the method further includes configuring the active agent to log into and receive management instructions from the newly active central server.
- In another embodiment, the method further includes providing a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
- In another aspect, there is provided a data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method for managing a target system that includes configuring an active agent to receive instructions from an active central server to perform management tasks on the target system and configuring a plurality of meta-agents on the target system. Each meta-agent is an endpoint on the target system and is configured to monitor a corresponding central server, and to reassign the active agent to another central server upon failure of the active central server.
- In an embodiment, the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor the active agent, and to revive or restart the active agent upon detecting a failure of the active agent.
- In another embodiment, the data processor readable medium further includes code that adapts the device to configure each meta-agent to monitor its corresponding central server, and upon detection of a failure of the active central server, to transfer the function of issuing management instructions to another central server.
- In another embodiment, the data processor readable medium further includes code that adapts the device to provide at least one endpoint in each central server for monitoring the operation of the central server, each endpoint corresponding to another central server.
- In another embodiment, the data processor readable medium further includes code that adapts the device to configure the central servers to transfer management operations from a failed central server to a newly active central server.
- In another embodiment, the data processor readable medium further includes code that adapts the device to configure the active agent to log into and receive management instructions from the newly active central server.
- In another embodiment, the data processor readable medium further includes code that adapts the device to provide a redundant redirection layer between the active agent and the central servers such that communication between the active agent and central servers may take place over alternative paths.
- These and other aspects of the invention will become apparent from the following more particular descriptions of exemplary embodiments.
- In the figures which illustrate exemplary embodiments of the invention:
- FIG. 1 shows a generic data processing system that may provide a suitable operating environment;
- FIG. 2 shows a schematic block diagram of an illustrative topology for a systems management framework in accordance with an embodiment;
- FIG. 3 shows a more detailed block diagram of a plurality of central servers in the illustrative topology in FIG. 2; and
- FIG. 4 shows a schematic flowchart of an illustrative method in accordance with an embodiment.
- As noted above, the present invention relates to a systems management framework for networked environments.
- The invention may be practiced in various embodiments. A suitably configured data processing system, and associated communications networks, devices, software and firmware may provide a platform for enabling one or more embodiments. By way of example, FIG. 1 shows a generic data processing system 100 that may include a central processing unit (“CPU”) 102 connected to a storage unit 104 and to a random access memory 106. The CPU 102 may process an operating system 101, application program 103, and data 123. The operating system 101, application program 103, and data 123 may be stored in storage unit 104 and loaded into memory 106, as may be required. An operator 107 may interact with the data processing system 100 using a video display 108 connected by a video interface 105, and various input/output devices such as a keyboard 110, mouse 112, and disk drive 114 connected by an I/O interface 109.
- In known manner, the mouse 112 may be configured to control movement of a cursor in the video display 108, and to operate various graphical user interface (GUI) controls appearing in the video display 108 with a mouse button. The disk drive 114 may be configured to accept data processing system readable media 116. The data processing system 100 may form part of a network via a network interface 111, allowing the data processing system 100 to communicate with other suitably configured data processing systems (not shown). The particular configurations shown by way of example in this specification are not meant to be limiting.
- As noted above, providing a single agent on a target system to perform management functions may result in a single point of failure that may lead to situations where the failure cannot be easily repaired by the systems management framework. In prior art systems, this single point of failure may have been tolerated as an acceptable risk, or the single point of failure may have been removed by providing a redundant second agent on the target system. Commonly, this second agent is configured to duplicate the function of the first agent. This approach may have some drawbacks including duplication of management functions and a doubling of the resources consumed by the redundant agents for managing a target system. In addition, there is a potential that uncoordinated or poorly coordinated redundant agents may take conflicting or duplicate management actions to correct a problem that may have unintended consequences, potentially resulting in instability.
- The invention presents a novel framework for systems management involving a hierarchical arrangement of agents on a target system. The hierarchical arrangement of agents may include an active agent and a plurality of meta-agents that are configured to monitor and pass instructions to the active agent. Each of the meta-agents is associated in a one-to-one configuration with a central server and may be configured to monitor its respective central server for ongoing operation. An illustrative systems management framework in accordance with an embodiment of the invention will now be described in more detail.
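The hierarchical arrangement in the preceding paragraph can be pictured with a small sketch. It assumes, purely for illustration, that the active agent and meta-agents are in-process objects; the class and method names (`ActiveAgent.perform`, `MetaAgent.check_active_agent`, `MetaAgent.relay_login`) are hypothetical and do not come from the patent.

```python
class ActiveAgent:
    """The only agent allowed to take management action on the target system."""

    def __init__(self, server: str) -> None:
        self.server = server  # central server currently logged into
        self.alive = True
        self.actions = []

    def perform(self, task: str) -> None:
        if not self.alive:
            raise RuntimeError("active agent is down")
        self.actions.append(f"{self.server}:{task}")


class MetaAgent:
    """Paired one-to-one with a central server; monitors but never manages."""

    def __init__(self, server: str, agent: ActiveAgent) -> None:
        self.server = server
        self.agent = agent

    def check_active_agent(self) -> bool:
        """Revive the active agent if it has failed; return True if revived."""
        if self.agent.alive:
            return False
        self.agent.alive = True
        return True

    def relay_login(self) -> None:
        """Pass the instruction to log into this meta-agent's central server."""
        self.agent.server = self.server
```

The design point the sketch tries to capture is that `MetaAgent` deliberately has no `perform` method: meta-agents can revive the active agent or redirect it to their own central server, but management actions only ever flow through the single `ActiveAgent`.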
- Referring to FIG. 2, shown is a schematic block diagram of an illustrative framework 200 for systems management in accordance with an embodiment of the invention. The FIG. 2 systems management framework 200 may include a target system 202 having a plurality of meta-agents 204A-204N, and an active agent 206. Active agent 206 may be connected to a redirection layer 210 configured to connect active agent 206 to a plurality of central servers 220A-220N through redundant network paths. Redirection layer 210 may be implemented in many different ways, and may employ proxies to coordinate the redirection between various alternative paths. Regardless of the particular redirection implementation employed, the purpose of the redirection layer is to make it possible for the active agent 206 to connect to the central servers 220A-220N in more than one way, either by being able to change its connection to another one of the central servers 220A-220N via their respective gateways 230A-230N, or by being able to connect over more than one network path in the redirection layer 210, or both.
- In an embodiment, meta-agents 204A-204N on the target system 202 are each configured to monitor the active agent 206 and, where necessary, to revive or restart the active agent 206. The meta-agents 204A-204N may be implemented using endpoints that correspond to the central servers 220A-220N with a one-to-one relationship. This allows each of the central servers 220A-220N to have a dedicated agent on the target system 202 for monitoring the connection to the central server 220A-220N and the active agent 206.
- The active agent 206 is configured to be the only agent on the target system 202 that is capable of taking direct management action on the target system 202. All other agents (i.e. the meta-agents 204A-204N) on the target system 202 can only monitor the active agent 206, monitor the connection from the target system 202 to their respective central servers 220A-220N, and pass instructions to the active agent 206 for execution of specific tasks to change which of the central servers 220A-220N the active agent 206 is logged into.
- In an embodiment, the systems management framework provides redundancy at the level of the central servers 220A-220N. By way of illustration, as shown in FIG. 3, the plurality of central servers 220A-220N introduced earlier in FIG. 2 may be configured to monitor each other for proper operation. For this purpose, each central server 220A-220N may contain a sufficient number of endpoints 310A-310N to monitor every other central server 220A-220N.
- In an embodiment, each central server 220A-220N may monitor every other central server 220A-220N by using a sufficient number of endpoints on each central server 220A-220N, each endpoint corresponding to another central server 220A-220N. Thus, in the illustrative example shown in FIG. 3, central servers 220A, 220B, and 220N each include endpoints for monitoring the other central servers.
- Alternatively, each central server 220A-220N may monitor only one or perhaps several designated central servers 220A-220N such that failure or improper operation of a central server 220A-220N will be detected by one or more monitoring central servers 220A-220N. Any change of activity to a central server 220A-220N may then be noted and suitably reflected by also changing the central server into which the active agent 206 is logged by sending appropriate instruction through the meta-agent 204A-204N that corresponds to the new active central server.
- As an illustrative example, referring back to FIG. 2, assume that central server 220A has failed. Upon detection by central server 220B that the afflicted central server 220A is not operational, central server 220B may notify its meta-agent 204B on the target system 202 to reconfigure the active agent 206 to receive instructions from newly active central server 220B rather than the non-operational central server 220A. The active agent 206 remains the only agent on the target system 202 that is actively performing management tasks.
- In another embodiment, the systems management framework of the present invention also provides redundancy at the level of the active agent 206. Referring back to the illustrative example in FIG. 2, by providing a plurality of meta-agents 204A-204N to monitor the status of the active agent 206, any problems with the active agent 206 itself may be quickly detected by one or more of the meta-agents 204A-204N, and the active agent 206 may be revived or restarted as necessary to address the problem. By providing a single active agent 206 embodying a single endpoint that is capable of actively performing management tasks on the target system 202, the possibility that two or more agents may execute conflicting or duplicate management tasks on target system 202 is removed.
- In the illustrative examples described above with respect to FIG. 2 and FIG. 3, the basic approach is to provide redundancy at the server level, as well as at the agent level for performing management tasks on a target system. In summary, at the agent level, a hierarchical arrangement is provided with a single endpoint, called the active agent, which performs all management tasks on the target system. Other agents, called meta-agents, monitor the active agent and the connectivity to their respective central servers, but do not directly manage the target system themselves.
- In an embodiment, the meta-agents on the target system may perform only limited monitoring of the central servers, and may not perform general purpose monitoring. Thus, any particular meta-agent monitors its connection to its central server only sufficiently to ascertain whether it can currently receive management instructions from its central server. Any change in the identity of the active central server thus results in a corresponding change at the meta-agent level such that the active agent continues to receive instructions from the new active central server.
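The server-level failover described above, in which a surviving central server notifies its own meta-agent to redirect the active agent, might be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the specification: the classes, the `operational` flag, the `relay_login` and `fail_over` names, and the use of list order as the choice of replacement server are all hypothetical. The specification does not say how the new active central server is chosen.

```python
class CentralServer:
    def __init__(self, name):
        self.name = name
        self.operational = True


class ActiveAgent:
    def __init__(self, server_name):
        self.logged_into = server_name


class MetaAgent:
    def __init__(self, server):
        self.server = server              # the one server this agent pairs with

    def relay_login(self, agent):
        # Only relay if this meta-agent can currently reach its own server.
        if self.server.operational:
            agent.logged_into = self.server.name
            return True
        return False


def fail_over(servers, meta_agents, agent):
    """Switch the active agent away from a failed central server.

    servers: list of CentralServer, in an assumed priority order.
    meta_agents: dict mapping server name -> MetaAgent.
    Returns the name of the server the agent is logged into afterwards.
    """
    by_name = {s.name: s for s in servers}
    if by_name[agent.logged_into].operational:
        return agent.logged_into          # current server is fine; no change
    for peer in servers:
        # A surviving peer detects the failure and notifies its meta-agent.
        if peer.operational and meta_agents[peer.name].relay_login(agent):
            return agent.logged_into
    raise RuntimeError("no operational central server available")
```

Note that only the login target of the active agent changes; no second agent ever begins managing the target system, which mirrors the stated goal of avoiding conflicting or duplicate management actions.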
- As will be appreciated, the systems management framework as described above removes the single points of failure at the central server level and at the agent level, while avoiding duplication of agents performing management tasks on the target server. Redundancy may also be provided at the redirection layer to provide the necessary network availability that is required for the business purpose of the target system. Because the active agent is supported and monitored by multiple meta-agents, failure of the active agent itself brings prompt corrective action from one of the monitoring meta-agents. In addition, because there can be multiple network paths in the redirection layer, network failures will not blind the central servers or the meta-agents provided on the target system.
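The role of multiple network paths in the redirection layer can be illustrated with a short route-selection sketch. The specification leaves the redirection implementation open, so everything here is an assumption: the function name `choose_route`, the representation of a route as a `(gateway, path)` tuple, and the `is_reachable` probe are hypothetical stand-ins.

```python
def choose_route(routes, is_reachable):
    """Return the first usable (gateway, path) pair, or None.

    routes: iterable of (gateway, path) tuples, in preference order.
    is_reachable: callable taking (gateway, path) and returning bool;
        a stand-in for whatever probe a real redirection layer would use.
    """
    for gateway, path in routes:
        if is_reachable(gateway, path):
            return gateway, path
    return None


# Example: the first path to gateway 230A is down, so the second is used.
routes = [("230A", "path-1"), ("230A", "path-2"), ("230B", "path-1")]
down = {("230A", "path-1")}
selected = choose_route(routes, lambda g, p: (g, p) not in down)
```

Because the route list mixes alternative paths to the same gateway with paths to other gateways, a single ordered scan covers both forms of redundancy mentioned in the text.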
- While there may initially be some increased effort required at implementation to install and set up the meta-agents 204A-204N, active agent 206, and associated monitoring endpoints on each of the central servers 220A-220N, the ongoing efforts required to manage the systems management framework of the present invention are thought to be not significantly greater than the effort required for prior art systems. For example, updating of agent code and maintenance of profiles may be substantially the same because there is only one active agent 206 taking management action on the target system 202. As the systems management requirements for the target system 202 change, the requirements and implementation of the active agent 206 may also change, but the requirements and the implementation of the meta-agents 204A-204N need not change.
- Now referring to FIG. 4, shown is a schematic flowchart 400 of an illustrative method in accordance with an embodiment. The FIG. 4 method 400 may begin, and at block 402 may set up an active agent configured to take management actions on a target device or system. Method 400 may then proceed to block 404, where method 400 may configure a plurality of meta-agents to monitor their respective central servers and the active agent.
- Method 400 may then proceed to block 405, where endpoints 310A-310N on the central servers 220A-220N are configured to monitor the central servers 220A-220N.
- Method 400 may then proceed to block 406, where method 400 may monitor the active agent with at least one meta-agent. Method 400 may then proceed to block 408, where method 400 may send management instructions to the active agent from the currently active central server. Method 400 may then proceed to decision block 410, where method 400 checks for any indication of a failure or improper operation of the active agent. If no, method 400 may proceed directly to decision block 414. If yes, method 400 may proceed to block 412, where method 400 may revive or restart the active agent as necessary using one of the meta-agents. Method 400 may then proceed to decision block 414.
- At decision block 414, method 400 may try to detect any indication of a failure or improper operation of the central server. If no, method 400 loops back to block 406 to continue. If yes, method 400 may proceed to block 416 to initiate a transfer of management operations from the failed central server to a new central server. Method 400 may then proceed to block 418, where the meta-agent corresponding to the new active central server may be selected to relay new central server login instructions to the active agent on the target device. Method 400 then loops back to block 406 to continue.
- While various illustrative embodiments of the invention have been described above, it will be appreciated by those skilled in the art that variations and modifications may be made.
- For example, rather than limiting the role of the meta-agents to that of relaying new central server login instructions to the active agent, the meta-agents may be configured to relay other types of information, including management instructions. In this case, the meta-agents may act as a go-between for all instructions between the active agent and the currently active central server. It will be appreciated, however, that this may require some additional overhead to operate the meta-agents.
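The monitoring loop of method 400 (blocks 406 through 418 of FIG. 4) can be sketched as a single pass that is called repeatedly. This is a hedged illustration only: the function name `run_iteration`, the `state` dictionary, the two health-check callables, and the choice of the first healthy server as the replacement are assumptions made for the example and do not appear in the specification.

```python
def run_iteration(state, agent_healthy, server_healthy):
    """Perform one pass through blocks 406-418 and return a log of steps.

    state: dict with keys "active_server" (str) and "servers" (list of str).
    agent_healthy: callable() -> bool, the meta-agents' check of the agent.
    server_healthy: callable(name) -> bool, the peer check of a server.
    """
    log = ["406: monitor active agent", "408: send management instructions"]
    if not agent_healthy():                          # decision block 410
        log.append("412: revive or restart active agent")
    if not server_healthy(state["active_server"]):   # decision block 414
        candidates = [s for s in state["servers"]
                      if s != state["active_server"] and server_healthy(s)]
        if not candidates:
            raise RuntimeError("no operational central server")
        state["active_server"] = candidates[0]       # block 416
        log.append("416: transfer management to " + candidates[0])
        log.append("418: meta-agent relays login to active agent")
    return log
```

Calling `run_iteration` in a loop corresponds to the flowchart looping back to block 406; set-up blocks 402 through 405 are assumed to have run once beforehand.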
Claims (21)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CACA2616229 | 2007-12-21 | ||
CA002616229A CA2616229A1 (en) | 2007-12-21 | 2007-12-21 | Redundant systems management frameworks for network environments |
CA2616229 | 2007-12-21 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090164565A1 (en) | 2009-06-25 |
US8112518B2 (en) | 2012-02-07 |
Family
ID=40789923
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/242,796 Expired - Fee Related US8112518B2 (en) | 2007-12-21 | 2008-09-30 | Redundant systems management frameworks for network environments |
Country Status (2)
Country | Link |
---|---|
US (1) | US8112518B2 (en) |
CA (1) | CA2616229A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120166957A1 (en) * | 2010-12-22 | 2012-06-28 | International Business Machines Corporation | Content presentation in management sessions for information technology systems |
EP2645635A1 (en) * | 2012-03-29 | 2013-10-02 | Nec Corporation | Cluster monitor, method for monitoring a cluster, and computer-readable recording medium |
US9292547B1 (en) * | 2010-01-26 | 2016-03-22 | Hewlett Packard Enterprise Development Lp | Computer data archive operations |
US11042443B2 (en) * | 2018-10-17 | 2021-06-22 | California Institute Of Technology | Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string |
US11165639B2 (en) * | 2011-01-10 | 2021-11-02 | Snowflake Inc. | Fail-over in cloud services |
US20210392059A1 (en) * | 2015-06-05 | 2021-12-16 | Cisco Technology, Inc. | Auto update of sensor configuration |
US11936663B2 (en) | 2015-06-05 | 2024-03-19 | Cisco Technology, Inc. | System for monitoring and managing datacenters |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8607158B2 (en) | 2010-12-09 | 2013-12-10 | International Business Machines Corporation | Content presentation in remote monitoring sessions for information technology systems |
US8806360B2 (en) | 2010-12-22 | 2014-08-12 | International Business Machines Corporation | Computing resource management in information technology systems |
US10225135B2 (en) | 2013-01-30 | 2019-03-05 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Provision of management information and requests among management servers within a computing network |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5655081A (en) * | 1995-03-08 | 1997-08-05 | Bmc Software, Inc. | System for monitoring and managing computer resources and applications across a distributed computing environment using an intelligent autonomous agent architecture |
US6014686A (en) * | 1996-06-21 | 2000-01-11 | Telcordia Technologies, Inc. | Apparatus and methods for highly available directory services in the distributed computing environment |
US20040010586A1 (en) * | 2002-07-11 | 2004-01-15 | International Business Machines Corporation | Apparatus and method for distributed monitoring of endpoints in a management region |
US20040010716A1 (en) * | 2002-07-11 | 2004-01-15 | International Business Machines Corporation | Apparatus and method for monitoring the health of systems management software components in an enterprise |
US20040049572A1 (en) * | 2002-09-06 | 2004-03-11 | Hitachi, Ltd. | Event notification in storage networks |
US6996502B2 (en) * | 2004-01-20 | 2006-02-07 | International Business Machines Corporation | Remote enterprise management of high availability systems |
US7130909B2 (en) * | 2003-10-07 | 2006-10-31 | Hitachi, Ltd. | Storage path control method |
US20070180077A1 (en) * | 2005-11-15 | 2007-08-02 | Microsoft Corporation | Heartbeat Heuristics |
US7779109B2 (en) * | 2007-01-31 | 2010-08-17 | International Business Machines Corporation | Facilitating synchronization of servers in a coordinated timing network |
- 2007-12-21: CA application CA002616229A (CA2616229A1), status: not active, Abandoned
- 2008-09-30: US application US12/242,796 (US8112518B2), status: not active, Expired - Fee Related
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9292547B1 (en) * | 2010-01-26 | 2016-03-22 | Hewlett Packard Enterprise Development Lp | Computer data archive operations |
US20120166957A1 (en) * | 2010-12-22 | 2012-06-28 | International Business Machines Corporation | Content presentation in management sessions for information technology systems |
US11165639B2 (en) * | 2011-01-10 | 2021-11-02 | Snowflake Inc. | Fail-over in cloud services |
US11736345B2 (en) | 2011-01-10 | 2023-08-22 | Snowflake Inc. | System and method for extending cloud services into the customer premise |
US11750452B2 (en) | 2011-01-10 | 2023-09-05 | Snowflake Inc. | Fail-over in cloud services |
EP2645635A1 (en) * | 2012-03-29 | 2013-10-02 | Nec Corporation | Cluster monitor, method for monitoring a cluster, and computer-readable recording medium |
US9049101B2 (en) | 2012-03-29 | 2015-06-02 | Nec Corporation | Cluster monitor, method for monitoring a cluster, and computer-readable recording medium |
US20210392059A1 (en) * | 2015-06-05 | 2021-12-16 | Cisco Technology, Inc. | Auto update of sensor configuration |
US11936663B2 (en) | 2015-06-05 | 2024-03-19 | Cisco Technology, Inc. | System for monitoring and managing datacenters |
US11042443B2 (en) * | 2018-10-17 | 2021-06-22 | California Institute Of Technology | Fault tolerant computer systems and methods establishing consensus for which processing system should be the prime string |
Also Published As
Publication number | Publication date |
---|---|
US8112518B2 (en) | 2012-02-07 |
CA2616229A1 (en) | 2009-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION,NEW YO Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021758/0025 Effective date: 20080922 Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021758/0025 Effective date: 20080922 |
|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION,NEW YO Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021620/0755 Effective date: 20080922 Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:UNDERHILL, WILLIAM ROY;REEL/FRAME:021620/0755 Effective date: 20080922 |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20160207 |