US8479048B2 - Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained - Google Patents

Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Info

Publication number
US8479048B2
Authority
US
United States
Prior art keywords
information
event
event type
root cause
management
Prior art date
Legal status
Active
Application number
US13/211,694
Other versions
US20110302305A1 (en)
Inventor
Tomohiro Morimura
Takayuki Nagai
Kiminori Sugauchi
Takaki Kuroda
Yoshihiro Arato
Current Assignee
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date
Filing date
Publication date
Application filed by Hitachi Ltd
Priority to US13/211,694
Publication of US20110302305A1
Application granted
Publication of US8479048B2
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3495 Performance evaluation by tracing or monitoring for systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/86 Event-based monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • a technology disclosed in this specification relates to a system management method, an apparatus, a system, and a program for managing an operation of an information processing system which includes a server computer, a network apparatus, and a storage apparatus, and to a medium that includes the program, and an apparatus for delivering the program.
  • IT is an abbreviation for Information Technology, and hereinafter, an IT system is also referred to as an information processing system
  • various IT apparatuses (hereinafter also referred to as information processing apparatuses)
  • Faults may affect the various IT apparatuses via the network.
  • Patent Document 1 discloses an event correlation technology of analyzing a fault location and a cause by using event information used by an IT apparatus to notify fault contents.
  • the event correlation technology is also called a technology of estimating a root cause by using the correlation of events sent from computers when faults occur.
  • Non-Patent Document 2 discloses a technology in which a rule is made from a combination of the technology disclosed in Patent Document 1 and events occurring at the time of faults, and an estimated root cause, handled as a pair, thereby quickly determining a root cause by using an inference engine made based on an expert system.
  • Since a system management server that performs processing required for operation management cannot obtain events of all IT apparatuses coupled to the network, the system management server limits the number of IT apparatuses from which event information is received (or obtained) and displays an analysis result by using a root cause analysis technology.
  • event information can be obtained from all IT apparatuses coupled to the network.
  • an event (for example, a fault)
  • an IT apparatus from which the system management server obtains event information is affected by this fault
  • a rule is not applied thereto and the root cause of the fault cannot be identified.
  • the present invention provides an apparatus, a system, a method, a program, and a storage medium which are related to analysis of events occurring in a plurality of information processing apparatuses in an information processing system that includes the plurality of information processing apparatuses, a screen output apparatus, and a system management server which has a processor and a memory.
  • the system management server stores identification information of a server apparatus which is included in the plurality of information processing apparatuses and which is an access target of each of the plurality of information processing apparatuses for using a network service as a client, in configuration information held by the memory; registers a plurality of monitored apparatuses which are included in the plurality of information processing apparatuses and from which the system management server obtains event information, in the configuration information held by the memory; stores in the memory correlation analysis rule information indicating that, when an event that includes a first event type related to the network service and an event that includes a second event type, different from the first event type, related to the network service, both occurring in the plurality of information processing apparatuses, are detected, an event corresponding to the first event type can occur due to an event corresponding to the second event type; stores in the memory a plurality of the event information obtained from the plurality of monitored apparatuses; identifies first event information which includes the first event type from among the plurality of the event information stored
  • the correlation analysis rule information may include topology condition information indicating a topology condition between a first information processing apparatus which is one of the plurality of information processing apparatuses and in which the first event type has occurred and a second information processing apparatus which is one of the plurality of information processing apparatuses and in which the second event type has occurred; and the fault cause apparatus may be identified based on the topology condition information in the cause identifying step.
  • an event-related information processing apparatus which is a server apparatus for the plurality of monitored apparatuses and which is included in the plurality of information processing apparatuses but is not included in the plurality of monitored apparatuses, may be identified based on the correlation analysis rule information and the configuration information; whether event information can be obtained from the event-related information processing apparatus may be checked; and information identifying the event-related information processing apparatus may be sent to the screen output apparatus, based on a result of the checking, when event information can be obtained from the event-related information processing apparatus; thereby information indicating that event information can be obtained from the event-related information processing apparatus may be displayed on the screen output apparatus.
  • the event-information acquisition permission/inhibition checking may be performed based on a result of an access by the system management server, according to a predetermined procedure to an information processing apparatus that is included in the plurality of information processing apparatuses and that has an IP address included in an IP address range specified in advance as a checking range.
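
The range-based check described above can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names and the `can_obtain_events` probe are hypothetical, and the "predetermined procedure" used to access each apparatus is left to the caller because the text does not specify it here.

```python
import ipaddress
from typing import Callable, Iterator, List


def addresses_in_range(first: str, last: str) -> Iterator[str]:
    """Yield every IPv4 address from first to last, inclusive."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    for value in range(start, end + 1):
        yield str(ipaddress.IPv4Address(value))


def find_obtainable(
    first: str,
    last: str,
    can_obtain_events: Callable[[str], bool],
) -> List[str]:
    """Return the addresses in the checking range that pass the probe.

    can_obtain_events is a hypothetical stand-in for the predetermined
    access procedure; it should return True when event information can
    be obtained from the apparatus at the given address.
    """
    return [
        address
        for address in addresses_in_range(first, last)
        if can_obtain_events(address)
    ]
```

A caller would pass the first and last addresses of the checking range specified in advance, together with a probe that performs the actual access attempt.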
  • the fault cause apparatus may be a storage apparatus which has a controller and provides a logical volume;
  • the network service may be a service providing the logical volume by a block access protocol;
  • the second event type may be the occurrence of a fault in the controller and the first event type may be a failure in accessing the logical volume.
  • second event information which includes the second event type and which has been obtained from the fault cause apparatus, may be identified from among the plurality of the event information, and information identifying the first monitored apparatus, the first event information, the fault cause apparatus, and the second event information may be sent to the screen output apparatus based on the correlation analysis rule information and the configuration information; thereby a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus is caused by an event corresponding to the second event information that occurred in the fault cause apparatus may be displayed on the screen output apparatus.
  • an information processing apparatus that is an event-information acquisition target is registered as a monitored apparatus in configuration information; event information that complies with a rule stored in advance is identified from among a plurality of event information stored in the system management server; a server apparatus for a network service related to the event information is identified; and a message is displayed which indicates that the cause of the event that occurred in a client information processing apparatus which has generated event information is an event related to the network service, which occurred in the server apparatus.
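
The flow summarized above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the claimed implementation: the class names, the `server_of` mapping (standing in for the configuration information), the event type strings, and the message wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    apparatus: str   # apparatus from which the event information was obtained
    event_type: str


@dataclass(frozen=True)
class Rule:
    service: str            # network service the rule relates to
    first_event_type: str   # symptom observed on the client
    second_event_type: str  # cause expected on the server


def analyze(
    events: List[Event],
    rule: Rule,
    server_of: Dict[Tuple[str, str], str],  # (client, service) -> server
    monitored: Set[str],
) -> List[str]:
    """Build display messages linking client events to their server."""
    messages = []
    for event in events:
        if event.event_type != rule.first_event_type:
            continue
        server = server_of.get((event.apparatus, rule.service))
        if server is None:
            continue
        # Look for event information of the second type from the server.
        cause_seen = any(
            e.apparatus == server and e.event_type == rule.second_event_type
            for e in events
        )
        if cause_seen:
            messages.append(
                f"{event.event_type} on {event.apparatus} is caused by "
                f"{rule.second_event_type} on {server}"
            )
        elif server not in monitored:
            # Assumption: with no event information from the server,
            # the cause is reported as a candidate only.
            messages.append(
                f"{event.event_type} on {event.apparatus} may be caused by "
                f"{rule.second_event_type} on {server} (not monitored)"
            )
    return messages
```

The second branch reflects the case in the title, where the server apparatus is one from which event information is not obtained, so the server is found through the configuration information rather than through its own events.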
  • FIG. 1 shows an entire configuration diagram of an operation management system according to the present invention.
  • FIG. 2 schematically shows an entire processing flow of fault analysis according to one embodiment of the present invention.
  • FIG. 3 schematically shows one representative configuration example of an IT system which is a target of the present invention.
  • FIG. 4 schematically shows correlation analysis rule information used in the operation management system of the present invention.
  • FIG. 5 schematically shows topologies specified as application targets in the correlation analysis rule information shown in FIG. 4.
  • FIG. 6 schematically shows a rule-application-destination management table which is one example of a table data structure for managing lists of IT apparatuses to which rules are applied.
  • FIG. 7 is a processing flow of generating application information of the correlation analysis rule information according to one embodiment of the present invention.
  • FIG. 8 schematically shows connection information of IP-SAN storage apparatuses, obtained from computers serving as IP-SAN clients, in a first embodiment of the present invention.
  • FIG. 9 schematically shows configuration information related to an IP-SAN storage that is a management-target IT apparatus, the configuration information being held in configuration management, in the first embodiment of the present invention.
  • FIG. 10 is an example screen display which proposes that a user set a not-managed IT apparatus as a management target, in the first embodiment of the present invention.
  • FIG. 11 schematically shows a not-managed IT-apparatus management table which is one example of a table data structure for managing not-managed IT apparatuses, in the first embodiment of the present invention.
  • FIG. 12 schematically shows the rule-application-destination management table, holding lists of IT apparatuses to which rules are applied, in the first embodiment of the present invention.
  • FIG. 13 schematically shows connection information of FC-SAN storage apparatuses, obtained from computers serving as FC-SAN clients, in the first embodiment of the present invention.
  • FIG. 14 schematically shows information related to an FC-SAN storage that is a management-target IT apparatus, the information being held in the configuration management, in the first embodiment of the present invention.
  • FIG. 15 schematically shows identification information and public names related to file servers, which can be obtained from computers serving as the file servers, in the first embodiment of the present invention.
  • FIG. 16 schematically shows a processing flow of displaying a fault analysis result on a screen, in the first embodiment of the present invention.
  • FIG. 17 schematically shows an example of fault analysis result data in a case where a not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
  • FIG. 18 schematically shows an example screen display configuration for a fault analysis result in the case where the not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
  • FIG. 19 schematically shows screen display for a fault analysis result in the case where the not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
  • FIG. 20 schematically shows an entire processing flow of fault analysis, in a second embodiment of the present invention.
  • FIG. 21 is the processing flow of generating application information of the correlation analysis rule information, according to one embodiment of the present invention.
  • FIG. 1 is an overview showing one configuration of an information processing system for implementing the present invention.
  • the information processing system includes an operation management system and a system management server.
  • the system management server N 0 monitors and manages, as management targets, computers, a network switch (NW switch), and a storage apparatus which constitute the IT system.
  • the system management server N 0 of the present invention includes an event reception part C 0 for receiving event information such as a status change in a management-target IT apparatus, fault information, and notification information; a rule engine C 1 for performing fault analysis based on the received event information according to a rule R 0 defined in advance; configuration management C 3 for managing configuration information of management-target IT apparatuses; and a screen display part C 2 for outputting information required for operation management to a screen.
  • the operation management system includes a screen output apparatus M 1 for displaying information used for operation management on the screen based on output data and the control of the screen display part.
  • the screen output apparatus M 1 is coupled to the system management server N 0 .
  • a first candidate for the screen output apparatus M 1 is a display apparatus coupled to the system management server; however, another apparatus can be used instead if the apparatus can display analysis result information for the administrator of the operation management system.
  • Examples of the screen output apparatus M 1 include a mobile terminal which can receive electronic mail sent from the system management server N 0 and display it; and a computer having a display unit, which provides the administrator with information based on analysis result information sent by the system management server N 0 , receives an input from the administrator, and sends it to the system management server N 0 .
  • the rule engine C 1 includes a rule application part C 11 that reads analysis rule information R 0 (hereinafter also referred to as correlation analysis rule information) used for event correlation analysis, obtains configuration information T 0 from the configuration management C 3 , and performs processing to apply a rule to IT apparatuses in the IT system; a rule memory C 13 , serving as a working memory, for managing a rule-application-destination management table C 130 in which application information used by the rule application part to apply a rule to IT apparatuses is managed and for performing rule analysis processing; and an event analysis processing part C 12 that receives event information received by the event reception part C 0 and performs event correlation analysis.
  • the rule-application-destination management table C 130 need not be stored in the rule memory C 13 , but it must be stored in a memory of the system management server N 0 .
  • the correlation analysis rule information may be generated and stored by the administrator of the system management server N 0 , may be included in a program of the present invention, to be described later, and stored in the memory, or may be stored in the memory through initializing processing of the program of the present invention.
  • hardware items constituting the system management server N 0 include a processor, the memory (including secondary storage devices typical of which are a semiconductor memory and an HDD), and a network port. Those hardware items are coupled to each other by an internal network such as a bus.
  • the event reception part C 0 , the rule engine C 1 , the screen display part C 2 , and the configuration management C 3 are stored in the memory of the system management server N 0 and realized by a program executed by the processor; however, part or all of those functions may be realized by hardware.
  • the program which includes the event reception part C 0 , the rule engine C 1 , the screen display part C 2 , and the configuration management C 3 is referred to as an event analysis program in the following description.
  • the correlation analysis rule information R 0 , the configuration information T 0 , and the rule-application-destination management table C 130 are stored in the memory of the system management server N 0 .
  • the configuration information T 0 includes at least one of the following: connection information of IP-SAN storage apparatuses ( FIG. 8 ); information related to an IP-SAN storage ( FIG. 9 ); connection information of FC-SAN storage apparatuses ( FIG. 13 ); information related to an FC-SAN storage ( FIG. 14 ); and identification information and public names related to file servers ( FIG. 15 ), all of which will be described later. Further, a description will be given in which a not-managed IT-apparatus management table ( FIG.
  • the not-managed IT-apparatus management table is also included in the configuration information; however, as long as the not-managed IT-apparatus management table is stored in the memory of the system management server N 0 , it may instead be stored as information which is not included in the configuration information T 0 .
  • the correlation analysis rule information R 0 , the configuration information T 0 , the rule-application-destination management table C 130 , the connection information of IP-SAN storage apparatuses, the information related to an IP-SAN storage, the connection information of FC-SAN storage apparatuses, the information related to an FC-SAN storage, the identification information and the public names related to file servers, and the not-managed IT-apparatus management table are not necessarily stored in a text file, in a table, in a specific format such as that having a queue structure, or in a data structure; they just need to include information to be described later.
  • the correlation analysis rule information R 0 , the configuration information T 0 , the rule-application-destination management table C 130 , the connection information of IP-SAN storage apparatuses, the connection information of FC-SAN storage apparatuses, the information related to an IP-SAN storage, the information related to an FC-SAN storage, the identification information and the public names related to file servers, and the not-managed IT-apparatus management table are also referred to as correlation analysis rule information, configuration information, rule-application-destination management information, connection information of IP-SAN storage apparatuses, connection information of FC-SAN storage apparatuses, information related to an IP-SAN storage, information related to an FC-SAN storage, information of identification and public names related to file servers, and not-managed IT-apparatus management information, respectively.
  • the system management server stores, as event entries, event information received from various management-target IT apparatuses in an event database defined in the memory, although this is not shown.
  • the event database may have any data structure as long as one or more event entries are included therein.
  • event information includes event contents, and it may also include an event occurrence time. Further, in the event database, past event information may be left as a history according to a specified condition.
  • the program (in particular, the configuration management C 3 ) of the system management server may associate the event information with the identification information of an IT apparatus from which the event information has been obtained and with the time at which the system management server has received the event information, and may include them all together.
  • the event contents include at least the type of an event, and, depending on the situation, the event contents may also include information identifying hardware and software in the IT apparatus, in which the event has occurred.
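
One possible shape for an event entry and the event database is sketched below. The text states that any data structure is acceptable, so this is only an assumed layout; the field and class names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class EventEntry:
    event_type: str                         # always present in the event contents
    source_apparatus: str                   # identification of the originating IT apparatus
    received_at: datetime                   # time the management server received it
    occurred_at: Optional[datetime] = None  # occurrence time, if reported
    component: Optional[str] = None         # hardware or software where it occurred


@dataclass
class EventDatabase:
    entries: List[EventEntry] = field(default_factory=list)

    def store(self, entry: EventEntry) -> None:
        """Append an entry; past entries are kept as history."""
        self.entries.append(entry)

    def by_type(self, event_type: str) -> List[EventEntry]:
        """Return all stored entries of the given event type."""
        return [e for e in self.entries if e.event_type == event_type]
```

Optional fields mirror the wording above: only the event type is required, while the occurrence time and the affected component are included depending on the situation.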
  • the operation state of the IT apparatus enters a predetermined state (for example, the occurrence of a hardware fault or a software fault is included in this type).
  • a predetermined health-check result is obtained (for example, a case where no health-check response is obtained for a given period of time is included in this type).
  • the processing speed and the amount of used resources such as a processor, a memory, and an HDD, which are components constituting the IT apparatus satisfy a predetermined condition (for example, a case where the remaining capacity of the HDD falls below 10% is included in this type).
  • the IT apparatus receives network access which satisfies a predetermined condition (for example, a case where the IT apparatus received requests more than a predetermined number of times, a case where a network packet which is identified as a requested DoS attack is received a predetermined number of times, and a case where a request is received from an IT apparatus other than a specified IT apparatus are included in this type).
  • To store the event analysis program in the memory, it is conceivable to use a method in which the program is installed or copied from a medium, such as a DVD-ROM or a CD-ROM, which has stored the program, or to use a method in which the program (or information from which the program can be generated on the memory) is received from a program distribution server that can communicate with the system management server N 0 ; however, other methods can also be used.
  • the system management server N 0 may be distributed.
  • the above-described system management server N 0 analyzes the root cause of faults in the information processing system.
  • management-target IT apparatuses are specified in advance, event information is used as an analysis target of correlation analysis, and necessary information is received from the IT apparatuses. If all IT apparatuses coupled to the network are managed, the processor, the memory, and the storage device, such as a hard disk, of the management server need to be used very heavily for the management, thereby making practical monitoring difficult. Therefore, the management-target IT apparatuses from which information is received are narrowed down in the operation management system to avoid such difficulty. Further, when a management tool is a commercially available tool, the number of licenses is limited based on the types and the number of IT apparatuses to be managed, in almost all cases.
  • the IT system includes an IT apparatus from which the system management server N 0 obtains or is allowed to obtain event information for event information analysis (hereinafter, such an IT apparatus is also expressed as monitored IT apparatus, managed IT apparatus, management IT apparatus, in-management IT apparatus, or monitored apparatus; and such expressions apply to a computer, a switch, a router, and a storage apparatus, which are specific examples of an IT apparatus), and an IT apparatus from which the system management server N 0 does not obtain or is prevented from obtaining event information (hereinafter, such an IT apparatus is also expressed as not-monitored IT apparatus, not-managed IT apparatus, IT apparatus that is out of management, out-of-management IT apparatus, or event-related information processing apparatus; and such expressions apply to a computer, a switch, a router, and a storage apparatus which are specific examples of an IT apparatus).
  • IT apparatuses that are not monitored or managed in the system management server N 0 are further classified into an IT apparatus that was once found, confirmed, or managed in the system management server N 0 , and an IT apparatus that has never been found, confirmed, or managed in the system management server N 0 .
  • configuration information (for example, the IP address, the host name, or the fully qualified domain name (FQDN) of the IT apparatus) obtained when the IT apparatus is found or confirmed may be held therein and managed, though not always in the same manner as for an IT apparatus that is monitored and managed.
  • a non-management-target IT apparatus for which corresponding configuration information is not held in the system management server N 0 and a non-management-target IT apparatus for which part or all of corresponding configuration information has been stored in the system management server N 0 are also defined as non-management-target IT apparatuses.
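
The distinction drawn above between monitored apparatuses and the two kinds of non-management-target apparatuses can be sketched as a small classifier. The enum, its member names, and the `held_config` mapping are hypothetical and only illustrate the classification.

```python
from enum import Enum
from typing import Dict, Set


class ManagementState(Enum):
    MONITORED = "monitored"
    NOT_MONITORED_KNOWN = "not monitored, configuration held"
    NOT_MONITORED_UNKNOWN = "not monitored, no configuration held"


def classify(
    apparatus: str,
    monitored: Set[str],
    held_config: Dict[str, dict],  # apparatus -> e.g. IP address, host name, FQDN
) -> ManagementState:
    """Classify an apparatus by monitoring status and held configuration."""
    if apparatus in monitored:
        return ManagementState.MONITORED
    if apparatus in held_config:
        # Once found or confirmed: part or all of its configuration is held.
        return ManagementState.NOT_MONITORED_KNOWN
    return ManagementState.NOT_MONITORED_UNKNOWN
```

Both non-monitored states count as non-management-target IT apparatuses in the sense defined above; the split only records whether any configuration information is available for them.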
  • Example cases in which an IT apparatus is out of management of the operation management system include a case where a management-target IT apparatus uses a globally-provided service such as a DNS server, and a case where the operation management system cannot obtain sufficient information for management due to circumstances such as a firewall, an access-right problem, a network configuration, or an access-means defect.
  • the present invention relates to analysis of the correlation among a plurality of IT apparatuses existing in the network.
  • the clocks in the individual apparatuses are offset from one another, and the timing at which event information is transferred also varies. Therefore, the system management server N 0 analyzes event information that occurred or was received within a duration (a period of time) predetermined by a program developer or within a period of time specified by the administrator.
  • events related to the cause may occur at different timing (for example, in a case where a predetermined network service such as a Web service or a DNS service is received through caching processing from a server computer).
  • analysis needs to be performed for a period of time instead of at a particular time.
  • it is desirable that the events be items occurring dynamically to some extent. Further, it is more preferable that the time difference between the time at which an event occurs in the IT apparatus serving as the cause, because a predetermined cause arises (or the time at which the system management server receives that event), and the time at which, due to the cause, an event occurs in another IT apparatus (or the time at which the system management server receives that event) fall within the above-mentioned period of time.
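  • As a minimal sketch of analysis over a period of time rather than at a particular time, the following Python selects the events received within a window. The names `Event` and `events_in_window`, the event type name, and the 300-second window are illustrative assumptions and do not come from the specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    source: str         # identification information of the sending IT apparatus
    event_type: str     # hypothetical event type name
    received_at: float  # reception time at the management server, in seconds

def events_in_window(events, window_end, window_seconds):
    """Return the events received in (window_end - window_seconds, window_end]."""
    window_start = window_end - window_seconds
    return [e for e in events if window_start < e.received_at <= window_end]

# Two apparatuses report related errors 10 s apart; a much later event is unrelated.
events = [
    Event("N10", "DiskAccessError", 120.0),
    Event("N11", "DiskAccessError", 130.0),
    Event("N10", "DiskAccessError", 900.0),
]
recent = events_in_window(events, window_end=400.0, window_seconds=300.0)
```

    Because the two related events fall in the same window, they can be correlated even though their timestamps differ.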
  • information conceivable as configuration information includes the types and the number of hardware items constituting an IT apparatus, and the communication identification information and the name which are necessary to communicate with the IT apparatus; it is quasi-static information which can be partially changed by the administrator of the IT apparatus.
  • FIG. 2 shows a flow of general processing based on the above-described configuration, according to one embodiment of the present invention.
  • the rule engine C 1 reads the correlation analysis rule information R 0 in advance, obtains the configuration information T 0 of management targets from the configuration management C 3 , searches T 0 for the identification information of IT apparatuses to which the rule group R 0 is applied, and stores the identification information in the rule-application-destination management table C 130 .
  • the process of S 1 is a preparation process for fault analysis processing using events, to be performed later, and needs to be performed prior to the analysis processing. In the first embodiment, which is one of the embodiments, it is assumed that the analysis processing is performed prior to the start of the operation, and the rule-application-destination management table C 130 is held in advance in the rule memory C 13 .
  • the event reception part C 0 waits to receive events sent from the management-target IT apparatuses in the operation management system.
  • S 3 is related to a system operation of the operation management system.
  • S 3 is a step that determines whether a halt process has been instructed, and is used to halt the operation.
  • the identified fault cause is output to the screen display part C 14 .
  • the screen display part C 14 sends analysis information based on received analysis result output data, thereby outputting and displaying a screen necessary for the operation management on the screen output apparatus M 1 .
  • received event information may be temporarily stored in the event database, instead of in the processes of S 2 and S 4 .
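  • The general flow of FIG. 2 (preparation in S 1, event reception in S 2, halt check in S 3, analysis in S 4, then output to the screen) might be sketched as follows. This is an illustration only: the function name `run_management_loop`, the callables passed in, and the use of the string "HALT" as a halt instruction are assumptions, and a real server would block waiting for events instead of draining a queue.

```python
from collections import deque

def run_management_loop(incoming, prepare, analyze, display):
    """Sketch of the S1-S4 flow; `incoming` is a deque standing in for event reception."""
    table = prepare()                 # S1: build the rule-application table in advance
    while incoming:
        item = incoming.popleft()     # S2: receive the next event
        if item == "HALT":            # S3: stop when a halt has been instructed
            break
        cause = analyze(table, item)  # S4: fault analysis using the event
        if cause is not None:
            display(cause)            # output the identified fault cause

shown = []
run_management_loop(
    deque(["event-1", "HALT", "event-2"]),
    prepare=lambda: {"R1": [("N10", "N40")]},
    analyze=lambda table, event: f"cause of {event}",
    display=shown.append,
)
```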
  • One advantage of the present invention is to allow fault cause analysis for an IT apparatus that is not a management target, by changing the process of the rule application part in this general processing flow, without largely changing the configuration and the subsequent processing flow.
  • FIG. 3 is an overview showing one configuration of an IT system assumed in the embodiments of the present invention.
  • the IT system of FIG. 3 includes: an operation management system that is the target of operation management, including a computer N 10 , a computer N 11 , and a computer N 12 , which are operated and managed by the management server N 0 , an IP switch N 21 and an FC switch N 31 , which are network switches, a storage apparatus N 40 , and a storage apparatus N 41 ; a storage apparatus U 2 and a computer U 5 , which are non-management-target IT apparatuses that are not managed by the management server N 0 ; and a storage apparatus U 1 , a computer U 3 , and a computer U 4 , which are coupled to a network G 0 via a router N 20 .
  • the number of the IT apparatuses such as computers, switches, routers, and storage apparatuses which are individually shown, is an example; the operation management system just needs to include at least an IT apparatus serving as a server which provides a network service and an IT apparatus serving as a client which receives the network service.
  • the storage apparatus U 1 which is a non-management-target IT apparatus, includes an IP-SAN interface and provides the management-target computer N 10 with a logical volume.
  • the storage apparatus U 2 which is a non-management-target IT apparatus, includes an FC-SAN interface and provides a management-target computer N 13 with a logical volume via the management-target FC switch N 31 .
  • the computer U 3 or the computer U 5 which is a non-management-target IT apparatus, is a file server and makes a file system available to both of the management-target computers N 10 and N 11 .
  • the computer U 3 belongs to a network segment different from that of the operation management system, and detailed information related to the computer U 3 cannot be obtained through the network.
  • the computer U 5 serving as a file server, belongs to the same network segment as the operation management system, and can be automatically found by the operation management system.
  • the computer U 5 is an IT apparatus that was found at the time of the operation but was not set to a management target.
  • the computer U 4 which is a non-management-target IT apparatus, is a DNS server and applies a name solution function to all the IT apparatuses included in the IT system of FIG. 3 .
  • FIG. 4 shows example rules suggesting that a fault in the controller of a storage apparatus is the root cause, for the IT system shown in FIG. 1 .
  • in a rule for identifying the root cause in fault analysis, a combination of events predicted to occur based on an event correlation and a fault serving as the root cause are, in many cases, described as a pair in an IF-THEN format.
  • in the IF-THEN format, a rule is expressed such that “when a condition described in the IF part is established, the THEN part is true”.
  • a rule is described in the IF-THEN format in the same way as general rules in expert systems, and information related to IT apparatuses to which the rule is applied is defined in advance in the IF condition part.
  • a rule need not be described in the IF-THEN format, but a topology needs to be defined in advance as connection and relation information which can identify the IT apparatuses to which the rule is applied.
  • the correlation analysis rule information includes one or more rule entries. More abstractly, it can be said that a rule entry includes the following information.
  • condition entry indicating a condition that includes an event type to which the rule is applied. As described above, this condition entry may include a topology as a condition.
  • the following rules are defined in advance, as shown in FIG. 4 : a rule R 1 in which the root cause is a fault in the controller of an IP-SAN storage apparatus that uses iSCSI; a rule R 2 in which the root cause is a fault in the controller of an FC-SAN storage apparatus that uses Fibre Channel; a rule R 3 in which the root cause is a fault in a file server; and a rule R 4 in which the root cause is that the network does not reach the DNS server.
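  • One possible in-memory form of IF-THEN rule entries such as R 1 to R 4 is sketched below. The class `Rule`, the function `conclude`, and every event type name are hypothetical placeholders; the actual IF parts of FIG. 4 are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    rule_id: str
    topology: str         # connection and relation information the rule applies to
    if_events: frozenset  # IF part: event types expected to occur together
    then_cause: str       # THEN part: root cause concluded when the IF part holds

# Event type names are assumptions made for illustration only.
RULES = [
    Rule("R1", "computer - IP switch - IP-SAN storage (iSCSI)",
         frozenset({"IscsiAccessError", "StorageControllerError"}),
         "fault in the controller of an IP-SAN storage apparatus"),
    Rule("R2", "computer - FC switch - FC-SAN storage (Fibre Channel)",
         frozenset({"FcAccessError", "StorageControllerError"}),
         "fault in the controller of an FC-SAN storage apparatus"),
    Rule("R3", "file client - IP switch - file server",
         frozenset({"FileShareAccessError"}),
         "fault in a file server"),
    Rule("R4", "DNS client - DNS server",
         frozenset({"NameResolutionError"}),
         "the network does not reach the DNS server"),
]

def conclude(rule, observed_event_types):
    """Return the THEN part if every event type in the IF part was observed."""
    if rule.if_events <= set(observed_event_types):
        return rule.then_cause
    return None
```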
  • FIG. 6 shows the rule-application-destination management table that is information holding, for each rule, IT apparatuses to which the rule is applied.
  • the rule-application-destination management table includes a column C 101 for identification information indicating a rule, and a column C 102 for the list of application-destination IT apparatuses, storing the identification information of IT apparatuses to which the rule is applied.
  • the rule-application-destination management table does not need to be in a database. Note that this table data structure may be divided into a plurality of table data structures by normalizing the table, and the plurality of table data structures may be managed.
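  • As a sketch of the table with columns C 101 and C 102 and of its normalized form, the following Python keeps the table as a dictionary and flattens it into one row per apparatus. The dictionary layout and the function `normalize` are assumptions; only the pairing of the computer N 10 and the storage apparatus N 40 under the rule R 1 is taken from the description.

```python
# Column C 101 (rule) mapped to column C 102 (list of application-destination groups).
rule_application_table = {
    "R1": [("N10", "N40")],  # other rows omitted
}

def normalize(table):
    """Flatten the table into (rule, group index, apparatus) rows."""
    return [
        (rule_id, group_index, apparatus)
        for rule_id, groups in table.items()
        for group_index, group in enumerate(groups)
        for apparatus in group
    ]

rows = normalize(rule_application_table)
```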
  • FIG. 5 shows topology patterns to which the rules R 1 to R 4 , shown in FIG. 4 , are applied.
  • FIG. 5 ( 1 ) shows a topology of connection and relation information suggested by the IF part of the rule R 1 .
  • FIG. 5 ( 1 ) indicates that Computer indicating a computer has iScsiInitiator and is coupled to iScsiTarget of Storage indicating a storage apparatus via Ipswitch indicating an IP switch.
  • iScsiTarget is an iSCSI name identifying the connection destination of iScsiInitiator.
  • the rule R 1 is applied to a combination of a computer and a storage apparatus in which connection-destination iScsiTarget held by the computer matches the iSCSI name of an iScsi port of the storage apparatus.
  • Rows L 101 and L 102 of FIG. 6 show IT apparatuses to which the rule R 1 is applied in the IT system of FIG. 3 .
  • FIG. 5 ( 2 ) indicates that a computer has FcHba and FcHba is coupled to FcPort of a storage apparatus via FcSwitch, as suggested by the IF part of the rule R 2 .
  • when a connection-destination port WWN (WWN: World Wide Name) held by FcHba matches FcPortWWN, which is the WWN of FcPort serving as a Fibre Channel port of the storage apparatus, they have a connection relation and the rule R 2 is applied to them.
  • a row L 103 of FIG. 6 shows IT apparatuses to which the rule R 2 is applied, as a combination of the computer and the storage apparatus, in the IT system of FIG. 3 .
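  • The matching used for the rules R 1 and R 2 (a connection-destination identifier held by the client equals an identifier of a port on the storage apparatus) can be illustrated as follows. The function `pair_by_identifier` and the iSCSI names are hypothetical; the same sketch would apply to WWNs.

```python
def pair_by_identifier(client_destinations, server_ports):
    """Pair each client with the server that owns a matching port identifier.

    client_destinations: {client id: [connection-destination identifiers]}
    server_ports:        {server id: [identifiers of its ports]}
    """
    owner = {ident: server
             for server, idents in server_ports.items()
             for ident in idents}
    return [
        (client, owner[ident])
        for client, idents in client_destinations.items()
        for ident in idents
        if ident in owner
    ]

# Hypothetical iSCSI names, for illustration only.
client_destinations = {"N10": ["iqn.2009-01.com.example:target0"]}
server_ports = {
    "N40": ["iqn.2009-01.com.example:target0"],
    "N41": ["iqn.2009-01.com.example:target1"],
}
pairs = pair_by_identifier(client_destinations, server_ports)
```

    A destination identifier with no owner among the known servers yields no pair, which corresponds to the case where the server is not found among the management targets.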
  • FIG. 5 ( 3 ) shows a case where the IF part of the rule R 3 indicates a topology of a file server and a client.
  • a computer T 31 having information of ImportedFileShare which indicates that a file system of the file server is mounted and a computer T 33 having information of ExportedFileShare which indicates that the file system is made available to the outside have the relation of a client and a file server via an IP switch T 32 .
  • ImportedFileShare T 311 includes, as information related to the file server of the mount source, the identification information (the IP address, the FQDN (Fully Qualified Domain Name), etc.) of the file server, and the public name of the file system made available to the outside.
  • ExportedFileShare T 331 includes the location of the file system made available to the outside and the public name (also called share name) thereof.
  • the rule R 3 is applied to those computers, as a pair, as the topology of the file client and the file server.
  • a row L 104 of FIG. 6 shows IT apparatuses to which the rule R 3 is applied, as a combination satisfying the above condition, in the IT system of FIG. 3 .
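  • A file client and a file server might be paired for the rule R 3 as sketched below. Matching on both the server identification information and the public name is an assumption of this sketch, as are the function name `file_share_pairs`, the tuple layouts, and the share name.

```python
def file_share_pairs(imported, exported):
    """Pair clients and file servers whose share information agrees.

    imported: list of (client, file-server identifier, public name)
    exported: list of (server, set of identifiers of that server, public name)
    """
    pairs = []
    for client, server_ident, share in imported:
        for server, idents, exported_share in exported:
            if server_ident in idents and share == exported_share:
                pairs.append((client, server))
    return pairs

# "share1" is a hypothetical public name.
imported = [("N10", "exportfs.domain2.com", "share1")]
exported = [("U3", {"exportfs.domain2.com"}, "share1")]
pairs = file_share_pairs(imported, exported)
```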
  • FIG. 5 ( 4 ) shows a topology of a DNS server and a client suggested by the rule R 4 .
  • a computer T 42 serving as the DNS server, which provides a name solution service, and a computer T 41 serving as the client, which solves an IP address and an FQDN name with the DNS server, are stored as a pair in the application-destination management table shown in FIG. 6 .
  • the application-destination management table of FIG. 6 for IT apparatuses to which each rule is applied is provided. Therefore, when events occur, it is possible, by referring to the table, to judge a rule to which the events are related and to select a rule to be applied.
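  • Selecting the rules related to received events by referring to the table could look like the following sketch. The function `select_rules` and the table contents are illustrative assumptions.

```python
def select_rules(table, event_sources):
    """Return the ids of rules whose application destinations include an event source."""
    selected = []
    for rule_id, groups in table.items():
        if any(source in group for group in groups for source in event_sources):
            selected.append(rule_id)
    return selected

# Hypothetical table contents.
table = {
    "R1": [("N10", "N40")],
    "R3": [("N11", "U5")],
}
selected = select_rules(table, {"N10"})
```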
  • the method of applying a rule to management-target IT apparatuses has been described above.
  • FIGS. 7 and 21 show details of Step S 1 of FIG. 2 performed in the rule application part C 11 , according to one embodiment of the present invention.
  • the first embodiment will be described with the IT system shown in FIG. 3 and the rules R 1 to R 4 shown in FIG. 4 .
  • the entire processing shown in FIGS. 7 and 21 is performed in the rule application part. It is assumed that the operation management system stores in advance an IT apparatus once found, and can judge that the IT apparatus has been found. Alternatively, when the operation management system does not have a function of automatically finding an IT apparatus in the IT system, or even if the operation management system has the function of automatically finding an IT apparatus in the IT system, when it does not have a function of storing the found IT apparatus, the processing of FIGS. 7 and 21 is performed as if the found IT apparatus did not exist.
  • in S 101 , it is judged whether a rule to be read, that is, a rule that has not been read yet, is included in the correlation analysis rule information R 0 .
  • the flow advances to S 102 . Otherwise (NO), the flow ends. Since rules to be read, which are the rules R 1 to R 4 , are included (YES), the flow advances to S 102 .
  • one rule is read.
  • the rule is marked or is stored as a read rule, for example, so that it can be recognized to be one that has been read.
  • the rule R 1 is read and is stored as a read rule.
  • the flow advances to S 103 .
  • a search condition for IT apparatuses corresponding to the topology information described in the rule is obtained and the flow advances to S 104 .
  • a search condition is obtained for IT apparatuses which include a computer having iScsiInitiator, a storage apparatus having an iSCSI port identified by iScsiTarget, and an IP switch coupled to them and to which the rule R 1 is applied as in the topology information of the rule R 1 . It is assumed that the search condition is defined in advance with respect to the description of the rule.
  • the configuration information of management-target IT apparatuses is searched for the IT apparatus serving as a client in the topology information.
  • when the configuration information is stored in a database, the database is searched.
  • the configuration information is stored in a file
  • the configuration information is searched for the computer having iScsiInitiator, serving as a client in the topology of the rule R 1 .
  • the identification information of the computer N 10 and the computer N 11 is found through the search.
  • in S 105 , it is judged whether an IT apparatus that has not been selected is included in the IT apparatuses found through the search, because the processes of S 106 and the subsequent steps are performed for a plurality of computers.
  • the flow advances to S 106 .
  • one of the IT apparatuses that have not been selected is selected and regarded as a selected IT apparatus.
  • the computer N 10 is selected and regarded as a selected IT apparatus.
  • the flow advances to S 107 .
  • the information of an IT apparatus serving as a server includes: information identifying the IT apparatus serving as the server (such as the IP address, the host name, or the FQDN); and information related to a service to be provided (the public name (also called share name) of an available file system of the file server, the LUN number identifying a disk volume of the storage apparatus, the iSCSI name of a connection destination, or the WWN of an FC Port).
  • ConnectedIscsiTarget which is the iSCSI name of a connection destination shown in FIG. 8 is obtained as the information of storage apparatuses serving as servers, which are opposed to the computer N 10 .
  • in S 108 , it is judged whether information corresponding to an IT apparatus that has not been searched for is included in the information related to IT apparatuses serving as servers, obtained in S 107 .
  • the flow advances to S 109 .
  • the flow returns to S 105 .
  • the flow advances to S 109 .
  • the information includes the identification information indicating an IT apparatus (more specifically, a computer) and the identification information, in iSCSI, of a storage apparatus to which the IT apparatus is coupled.
  • one piece of information which has not been searched for is selected from the information related to IT apparatuses serving as servers, obtained in S 107 .
  • the configuration information of management targets is searched for the IT apparatus serving as a server.
  • the configuration information of management targets is searched for a storage apparatus having, as iScsiTarget, an iSCSI name indicated in a row L 201 of ConnectedIscsiTarget shown in FIG. 8 , obtained from the computer N 10 .
  • FIG. 9 shows configuration information about iScsiTarget of a management-target storage apparatus. Since the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L 201 of FIG. 8 is not found in the management target as shown in FIG. 9 , the flow advances to S 111 .
  • the information includes the identification information indicating a storage apparatus and the identification information, in iSCSI, held by the storage apparatus.
  • the configuration information T 0 includes, for each of one or more IT apparatuses that have been found, event-acquisition permission/inhibition information which indicates whether the apparatus is an event acquisition target (specifically, whether the apparatus is monitored; in other words, whether event acquisition from the apparatus is permitted or inhibited).
  • the judgment of S 110 is performed by referring to this data.
  • in S 111 , it is judged whether the IT apparatus has already been found in the operation management system. Specifically, it is judged whether the IT apparatus was once found, confirmed, or managed in the operation management system and the static configuration information of the IT apparatus is partially held in the operation management system. In this embodiment, since there is no configuration information related to the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L 201 of FIG. 8 , it is assumed that the IT apparatus is not a found resource (NO). Then, the flow advances to S 112 .
  • the judgment of S 111 can be performed by judging whether information related to the apparatus (for example, the event-acquisition permission/inhibition information) is included in the configuration information.
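  • The judgments of S 110 and S 111 based on the configuration information can be sketched as follows. The function `classify`, the key `event_acquisition_permitted`, and the sample contents are assumptions made for illustration.

```python
def classify(config_info, apparatus_id):
    """Classify an apparatus using the event-acquisition permission/inhibition information."""
    entry = config_info.get(apparatus_id)
    if entry is None:
        return "not found"              # S111: NO, so discovery is attempted
    if entry["event_acquisition_permitted"]:
        return "managed"                # S110: YES
    return "found but not managed"      # S111: YES

# Hypothetical configuration information.
config_info = {
    "N40": {"event_acquisition_permitted": True},
    "U5": {"event_acquisition_permitted": False},
}
states = {name: classify(config_info, name) for name in ("N40", "U5", "U1")}
```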
  • an attempt is made to find the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L 201 of FIG. 8 , from not-managed IT apparatuses.
  • a request to receive a service related to the target resource is sent to a communication identifier such as the FQDN or the IP address corresponding to the target resource, obtained from the configuration information or input by the user; or a communication identifier such as the FQDN or the IP address in the network address, which is the IP address corresponding to the network segment that includes the target resource, obtained from the configuration information or input by the user.
  • the presence of the target resource is confirmed.
  • an attempt is made to find the storage apparatus from the IT system shown in FIG. 3 .
  • in S 114 , it is judged whether the IT apparatus found in S 113 can be set to a management target of the operation management system. Whether the IT apparatus can be set to a management target is judged depending on whether the information required by the operation management system for monitoring and management can be obtained from the target IT apparatus. Although the information required for monitoring and management differs for each operation management system, information identifying the IT apparatus is required in common and includes at least one of the following: the IP address, the WWN (World Wide Name), some unique identification information (number), an apparatus name (host name), and the FQDN.
  • the system management server N 0 holds a predetermined criterion and this judgment is performed based on the criterion.
  • the storage apparatus has an iSCSI port and information of iScsiTarget can be obtained as the iSCSI name of the iSCSI port.
  • the IT apparatus has been judged to be able to be set to a management target. The flow advances to S 115 .
  • this apparatus may be set to a management target in a process to be performed later, the processing may be configured such that it is confirmed in this step that event information can be received from this IT apparatus, and only when it is confirmed that event information can be received from this IT apparatus, the flow advances to S 115 .
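  • A predetermined criterion for S 114 might be expressed as in the sketch below: at least one identifier must be available, plus whatever system-specific items are required. The function `can_be_managed`, the key names, and the iSCSI name are hypothetical.

```python
# Identifier kinds listed in the description; the key names are assumptions.
IDENTIFIER_KEYS = ("ip_address", "wwn", "unique_id", "host_name", "fqdn")

def can_be_managed(info, required=()):
    """Judge whether an apparatus can be set to a management target.

    info:     information obtained from the apparatus (dict)
    required: extra keys a particular operation management system needs
    """
    has_identifier = any(info.get(key) for key in IDENTIFIER_KEYS)
    has_required = all(info.get(key) for key in required)
    return has_identifier and has_required

storage_info = {
    "ip_address": "192.168.100.15",
    "iscsi_target": "iqn.2009-01.com.example:target9",  # hypothetical iSCSI name
}
manageable = can_be_managed(storage_info, required=("iscsi_target",))
```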
  • in S 115 , a prompt asking whether to set the IT apparatus found in S 113 to a management target is presented to the user.
  • the fact that the storage apparatus U 1 has been found as a storage server for the computer N 10 is presented, together with a prompt asking whether to add the storage apparatus U 1 to management targets.
  • the indication screen is shown in FIG. 10 .
  • the system management server N 0 (in particular, the rule engine) receives an input from the management screen output apparatus.
  • in S 117 , it is judged whether the user has set the found IT apparatus to a management target.
  • the flow advances to S 118 . Otherwise (NO), the flow advances to S 119 .
  • the not-managed IT-apparatus management table TL 3 includes the following information for each of not-managed IT apparatuses that have been found.
  • the identification information of the not-managed IT apparatus is marked such that it can be recognized that the IT apparatus is not managed, and then the identification information is stored in the rule-application-destination management table TL 1 as shown in FIG. 12 .
  • the identification information is stored in the rule-application-destination management table TL 1 , based on the information related to the storage apparatus U 1 included in the not-managed IT-apparatus management table.
  • the flow returns to S 108 , in which it is judged whether search information related to an IT apparatus serving as a server opposed to the selected IT apparatus serving as a client is included.
  • the storage apparatus corresponding to L 202 is searched for in the configuration management.
  • the storage apparatus corresponding to L 202 exists as shown in FIG. 9 , it is recognized that the IT apparatus corresponding to L 202 is a management target. Therefore, it is judged in S 110 that the IT apparatus is a management-target IT apparatus, and the flow advances to S 120 .
  • the list of the storage apparatus N 40 and the computer N 10 which are management-target IT apparatuses, is stored in L 101 of the rule-application-destination management table of FIG. 11 , as IT apparatuses to which the rule R 1 is applied.
  • the rule R 1 can be applied also to the non-management-target storage apparatus U 1 , which provides the computer N 10 with a logical volume.
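  • The overall preparation flow of S 101 to S 120 can be condensed into the following sketch, which registers not-managed servers in the rule-application table with a flag. It is an approximation under stated assumptions: the function and parameter names are hypothetical, servers are referred to directly by apparatus id, an apparatus that cannot be managed is treated like one the user declined, and an apparatus that cannot be discovered is skipped.

```python
def build_rule_application_table(rules, find_clients, find_servers,
                                 config, discover, can_manage, ask_user):
    table = {}        # rule id -> list of (client, server, server_is_managed)
    not_managed = {}  # stands in for the not-managed IT-apparatus management table
    for rule in rules:                                   # S101, S102
        for client in find_clients(rule):                # S103 to S106
            for server in find_servers(rule, client):    # S107 to S109
                entry = config.get(server)
                if entry is not None and entry["managed"]:   # S110
                    table.setdefault(rule, []).append((client, server, True))  # S120
                    continue
                # S111 to S113: reuse a found resource, else attempt discovery
                info = entry if entry is not None else discover(server)
                if info is None:
                    continue  # assumption: nothing is registered
                if can_manage(info) and ask_user(server):    # S114 to S117
                    config[server] = dict(info, managed=True)  # S118
                    table.setdefault(rule, []).append((client, server, True))
                else:
                    not_managed[server] = info               # S119
                    table.setdefault(rule, []).append((client, server, False))
    return table, not_managed

config = {"N40": {"managed": True}}
table, not_managed = build_rule_application_table(
    rules=["R1"],
    find_clients=lambda rule: ["N10"],
    find_servers=lambda rule, client: ["U1", "N40"],
    config=config,
    discover=lambda server: {"ip_address": "192.168.100.15"},
    can_manage=lambda info: True,
    ask_user=lambda server: False,  # the user declines to manage U1
)
```

    With these inputs the rule R 1 is registered for both the managed pair (N 10 , N 40 ) and the pair that includes the not-managed storage apparatus U 1 .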
  • the screen display part C 2 obtains, from the rule engine C 1 , fault analysis result data D 1 shown in FIG. 17 which indicates a fault analysis result obtained in the rule engine.
  • the rule engine C 1 (in particular, the event processing analysis part C 12 ) performs processes described with reference to S 4 of FIG. 2 , and FIGS. 4 and 5 .
  • the fault analysis result data D 1 includes fault-cause IT-apparatus information which is information related to a fault-cause IT apparatus and a received-event list which is information related to an event in a management-target IT apparatus, received by the operation management system.
  • the fault-cause IT-apparatus information D 11 includes information indicating the fault-cause IT apparatus and information related to a component at the fault location. Acquisition of the information related to a component at the fault location depends on how much fault information can be obtained from the fault-cause IT apparatus that is a non-management-target IT apparatus. When fault information cannot be obtained at all, “unknown” is indicated as in FIG. 17 .
  • the received-event list, which is information related to correlated received events in the rule defining this fault, includes a received-event transmission source, which is information related to the transmission source of the received event, and an event type, which indicates information related to the contents of the event.
  • the not-managed IT-apparatus management table of FIG. 11 is searched based on the fault-cause IT-apparatus information of the fault analysis result data D 11 , and information related to this not-managed IT apparatus is obtained. Then, the flow advances to S 606 .
  • information related to the storage apparatus U 1 is obtained from L 401 of FIG. 11 .
  • a message indicating that the root cause of the fault that occurred is a not-managed IT apparatus is displayed on the screen, together with the information obtained in S 605 .
  • an example structure of the screen displayed at this time includes a message notifying that the not-managed IT apparatus is the root cause of the fault; a fault analysis result which is the result obtained through analysis of the cause of the fault; and fault information detected by the operation management system for the fault that occurred, such as a received event.
  • a screen display such as a window or a dialog that includes the above items is output to the screen output apparatus M 1 .
  • FIG. 19 shows an example screen display in a case where the fault in the storage apparatus U 1 , which is a not-managed IT apparatus, is the root cause, according to this embodiment.
  • the screen display includes information indicating that the fault-cause IT apparatus is a non-management target, and the type of the IT apparatus.
  • the screen display shows that the IT apparatus is an IP-SAN storage apparatus, and the IP address, which is an example of the identification information, of the IT apparatus is 192.168.100.15.
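  • Building the displayed message from the fault analysis result data and the not-managed IT-apparatus management table might look like the sketch below. The dictionary layout, the function `format_fault_message`, the message wording, and the event type name are assumptions; the apparatus type and IP address follow the example in the description.

```python
def format_fault_message(result, not_managed_table):
    """Compose message lines from analysis result data (layout is an assumption)."""
    cause = result["fault_cause"]["apparatus"]
    component = result["fault_cause"].get("component") or "unknown"
    info = not_managed_table.get(cause)
    lines = []
    if info is not None:
        lines.append(
            f"Root cause is a not-managed IT apparatus: {cause} "
            f"({info['type']}, {info['ip_address']})"
        )
    else:
        lines.append(f"Root cause: {cause}")
    lines.append(f"Fault location: {component}")
    for event in result["received_events"]:
        lines.append(f"Received event: {event['source']} {event['event_type']}")
    return "\n".join(lines)

result = {
    "fault_cause": {"apparatus": "U1", "component": None},
    "received_events": [{"source": "N10", "event_type": "DiskAccessError"}],
}
not_managed_table = {
    "U1": {"type": "IP-SAN storage apparatus", "ip_address": "192.168.100.15"},
}
message = format_fault_message(result, not_managed_table)
```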
  • in S 101 , since the rule R 2 is included, the flow advances to S 102 .
  • the rule R 2 is read and R 2 is marked to indicate that it has been read.
  • in S 103 , as the topology information described in the rule R 2 , the FC-SAN topology of FIG. 5 ( 2 ) is defined in the search condition: a computer T 21 serving as a client and having a Fibre-Channel host bus adapter, i.e., FcHba T 211 , is coupled via an FC switch T 22 to a storage apparatus T 23 serving as a server and having FcPort T 231 , which is a Fibre-Channel port.
  • ConnectedFcPortWWN C 502 indicating the WWN of an FC Port, which is a Fibre-Channel port, of the storage apparatus serving as a server to which the computer N 13 is coupled is obtained from the computer N 13 as shown in FIG. 13 .
  • connection information of FC-SAN storage apparatuses shown in FIG. 13 is described.
  • the connection information includes, as information for each IT apparatus, the communication identification information of FibreChannel held by a storage apparatus to which the IT apparatus is coupled.
  • the information includes the identification information indicating a storage apparatus and the communication identification information used in FibreChannel held by the storage apparatus.
  • FIG. 10 shows an example screen display used for the rule R 1 , but the structure of screen display is basically the same and just the message contents are replaced with those for the actual IT apparatus.
  • in S 118 , the information that needs to be obtained for a management-target IT apparatus is obtained for the storage apparatus U 2 , which has been added as a new management target.
  • the information to be obtained as that for a management target includes event information and configuration management information.
  • the storage apparatus U 2 serving as a management-target IT apparatus and the computer N 13 are registered in the rule-application-destination management table as IT apparatuses to which the rule R 2 is applied. In this example case, they are registered in the table data structure formed of the column C 101 for a rule and the column C 102 for storing the list of IT apparatuses to which the rule is applied, shown in FIG. 12 .
  • fault analysis for an FC-SAN storage apparatus that is a non-management-target IT apparatus can be performed through the conventional rule-based event correlation.
  • processing of displaying a message indicating that the FC-SAN storage that is a non-management-target IT apparatus is the root cause of the fault, on the screen based on the fault analysis result data is performed through the steps of FIG. 16 in the same way as the processing of displaying on the screen a message indicating that the non-management-target IP-SAN storage is the root cause of the fault, performed for the rule R 1 .
  • in S 101 , since the rule R 3 is included, the flow advances to S 102 .
  • the rule R 3 is read and R 3 is marked to indicate that it has been read.
  • in S 103 , as the topology information described in the rule R 3 , the topology of a file server and a client shown in FIG. 5 ( 3 ) is defined in the search condition: the computer T 31 serving as a client and having ImportedFileShare T 311 , which indicates that a file system made available is mounted, is coupled via an IP switch T 32 to the computer T 33 serving as a server and having ExportedFileShare T 331 , which indicates that the computer T 33 has the file system made available to the other computers.
  • the computer N 10 is the client IT apparatus that has been searched for and that has not been selected. Thus, the flow advances to S 106 .
  • the computer N 10 shown in FIG. 3 is selected as the client IT apparatus that has not been selected, and is marked as a selected IT apparatus.
  • information of ImportedFileShare indicating the file server from which the file system made available is mounted is obtained as search information for the computer serving as a server IT apparatus opposed to the computer N 10 in the topology of FIG. 5 ( 3 ).
  • Information related to the file server, obtained from the client is managed in a table of FIG. 15 .
  • the table has a data structure which includes a column C 701 for a client computer, a column C 702 for the identification information related to a file server for the client computer, and a column C 703 for the public name of the file server.
  • the information related to a file server, obtained from the client, may be obtained in advance as configuration information in the table of FIG. 15 , or may be obtained from the client IT apparatus in the process of S 107 . In other words, the acquisition of such information needs to be performed before the process of S 107 is completed.
  • the information includes the following information for each file server.
  • an attempt is made to find the computer having exportfs.domain2.com.
  • the attempt is made such that an IP address is solved by making an inquiry to the DNS server, the presence thereof is confirmed by sending a ping to the IP address, and the computer is accessed through a remote connection of telnet, ssh, or Windows (registered trademark).
  • the flow advances to S 114 .
  • the computer having exportfs.domain2.com is registered in the not-managed IT-apparatus management table of FIG. 11 .
  • the information obtained from the client is stored in file-server identification information and service identification information.
  • rule application information is generated for the pair of the client computer N 10 and the computer U 3 having exportfs.domain2.com. Specifically, as shown in L 107 of FIG. 12, the computer N 10 and the computer U 3 that is a not-managed IT apparatus are registered in the list of application-destination IT apparatuses for the rule R 3.
  • fault analysis can also be performed for the computer U 3 that is a not-managed IT apparatus serving as a file server for the computer N 10 .
  • In Steps S 105 to S 107, information specified in a row L 703 of FIG. 15 and related to a file server for the computer N 11 is obtained.
  • In S 109, since the file server specified in the row L 703 of FIG. 15 is not found among the management-target IT apparatuses, the flow advances to S 111.
  • In S 111, the computer U 5 having the IP address specified in the row L 703 of FIG. 15 is found among the found resources. Thus, the flow advances to S 115.
  • a message proposing to add the computer U 5 to management targets is displayed on the screen.
  • a user instruction to set the computer U 5 to a management target is received as a user input.
  • monitoring information that includes configuration information, the operation state, and performance information of a device coupled to the computer U 5 is obtained in addition to the identification information of the IT apparatus, held as that of a found resource, and information used for access.
  • the obtained information is stored in the configuration information T 0 of management targets, in the configuration management C 3 .
  • the data structure shown in a row L 108 of FIG. 12 is stored in the rule memory, so that the rule R 3 can be applied to a topology which includes the computer N 11 that is a managed IT apparatus serving as a client and the computer U 5 serving as a file server.
  • In Steps S 101 to S 104, the computer N 10 is found as a client IT apparatus in the rule R 4.
  • In Steps S 105 to S 107, as search information of a DNS server for the computer N 10, the IP address 192.168.100.1 of the DNS server is obtained from the computer N 10.
  • In Steps S 108 to S 110, it is confirmed, by using the obtained IP address 192.168.100.1, that the DNS server is not included in the configuration information T 0 of management targets in the configuration management C 3.
  • the flow advances to S 111 .
  • In S 111, it is judged that the DNS server is not a found IT apparatus.
  • the flow advances to S 112 .
  • In S 112, an attempt is made to access the node having the IP address 192.168.100.1 from the actual IT system. As a result, network connectivity is confirmed using a ping, but login to the node fails because authentication information is not held.
  • In S 114, it is judged that the DNS server cannot be set as a management target.
  • the flow advances to S 119 .
  • In S 119, as shown in L 404 of FIG. 11, information of the computer having the IP address 192.168.100.1 is stored and managed as that of a non-management-target IT apparatus serving as a DNS server, with identification information U 4.
  • the flow advances to S 120 .
  • the computer N 10 serving as a client and the computer U 4 that is a not-managed IT apparatus serving as a DNS server are stored in the list of application-destination IT apparatuses for the rule R 4, as shown in a row L 109 of FIG. 12.
  • the rule R 4 can be similarly applied to another IT apparatus shown in FIG. 3 by generating application information for the computer U 4 that is a not-managed DNS server.
  • the processing procedure of the entire fault-analysis processing flow shown in FIG. 2 in the first embodiment is performed in a manner such that Step S 4 b of generating application information in the rule application part C 11 is performed after Step S 3 b of receiving events and before Step S 5 b of event analysis processing performed in the event analysis part C 12 , as shown in FIG. 20 .
  • the only difference between the first embodiment and the second embodiment is the timing of generating rule application information.
  • a program that implements, in the system management server which has the processor and the memory and which is coupled to a plurality of information processing apparatuses and the screen output apparatus, analysis of events occurring in the plurality of information processing apparatuses includes a part or all of the following processes.
  • the correlation analysis rule information may include topology condition information indicating a topology condition between a first information processing apparatus which is one of the plurality of information processing apparatuses and in which the first event type has occurred and a second information processing apparatus which is one of the plurality of information processing apparatuses and in which the second event type has occurred; and the fault cause apparatus may be identified based on the topology condition information in the cause identifying step.
  • the system management server may further include the following processes.
  • a related-apparatus identifying process of identifying an event-related information processing apparatus which is a server apparatus for the plurality of monitored apparatuses and which is included in the plurality of information processing apparatuses but is not included in the plurality of monitored apparatuses, based on the correlation analysis rule information and the configuration information.
  • registration into the system management server can be prompted quickly and without omission after event monitoring with the system management server becomes newly required or allowed because of a change in the management method or in the administrator of an information processing apparatus.
  • the event-information acquisition permission/inhibition checking process may be performed based on a result obtained when the system management server accesses, according to a predetermined procedure, an information processing apparatus that is included in the plurality of information processing apparatuses and that has an IP address included in an IP address range specified in advance as a checking range.
  • an information processing apparatus in particular, a server computer accessed via the Internet
  • accesses from the outside to this information processing apparatus are monitored in some cases.
  • the access may also be recognized as an unauthorized access or a fraudulent attack, by the access monitoring.
  • the range of IP addresses of information processing apparatuses that are obviously not targets of event monitoring or the range of IP addresses of information processing apparatuses that can be targets of event monitoring is identified, thereby suppressing such a communication that is falsely recognized as an unauthorized access or a fraudulent attack.
  • the fault cause apparatus may be a storage apparatus which has a controller and provides a logical volume;
  • the network service may be a service providing the logical volume by a block access protocol (such as FibreChannel or iSCSI);
  • the second event type may be the occurrence of a fault in the storage apparatus and the first event type may be a failure in accessing the logical volume.
  • the fault cause apparatus may be a computer which provides a DNS as the network service
  • the first event type may be a failure of a DNS request
  • the first event type may be a disconnection of communication with a DNS server.
  • the fault cause apparatus may be a file server computer which has an NIC to receive data from at least one of the plurality of information processing apparatuses and which provides a stored file for at least one of the plurality of information processing apparatuses;
  • the network service may be a network file-sharing service for sharing the file stored by the file server computer;
  • the second event type may be the occurrence of a fault in the file server (for example, the occurrence of a fault in the NIC, the occurrence of a failure in software executed by the processor held by the file server, or the occurrence of a fault in which the communication function of the file server is stopped), and the first event type may be a failure in accessing the file provided by the network file-sharing service.
  • second event information which includes the second event type and which has been obtained from the fault cause apparatus may be identified from among the plurality of pieces of the event information; and information identifying the first monitored apparatus, the first event information, the fault cause apparatus, and the second event information may be sent to the screen output apparatus, thereby causing the screen output apparatus to display a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus was caused by an event corresponding to the second event information that occurred in the fault cause apparatus.
  • the first information processing apparatus may be a computer
  • the second information processing apparatus may be a storage apparatus
  • the topology condition information may include a combination of communication identification information corresponding to the computer and communication identification information corresponding to the storage apparatus, the combination indicating a connection relation of a topology in which the computer is coupled to the storage apparatus.
  • at least one of an iSCSI name, an IP address, and a WWN used in FibreChannel is a candidate for the communication identification information.
  • the first information processing apparatus may be a computer
  • the second information processing apparatus may be a file server computer which provides a stored file for the plurality of information processing apparatuses by a file-sharing service
  • the topology condition information may include a combination of communication identification information corresponding to the computer, and communication identification information corresponding to the file server computer or an export name used to make the file available, the combination indicating a connection relation of a topology in which the computer is coupled to the file server computer.
  • the first information processing apparatus may be a computer
  • the second information processing apparatus may be a DNS server computer which provides a DNS, as a network-sharing service, for the plurality of information processing apparatuses
  • the topology condition information may include a combination of communication identification information corresponding to the computer and communication identification information corresponding to the DNS server computer, the combination indicating a connection relation of a topology in which the computer is coupled to the DNS server computer.
  • an IP address or an FQDN is a candidate for each of the communication identification information corresponding to the computer and the communication identification information corresponding to the DNS server computer.
  • system management server may be configured by one or more computers.

Abstract

In the system management server, an information processing apparatus that is an event-information acquisition target is registered as a monitored apparatus in configuration information; event information that complies with a rule stored in advance is identified from among a plurality of pieces of event information stored in the system management server; a server apparatus for a network service related to the event information is identified; and a message is displayed which indicates that the cause of the event that occurred in a client information processing apparatus which has generated event information is an event related to the network service, which occurred in the server apparatus.

Description

CLAIM OF PRIORITY
The present application claims priority from Japanese application 2008-252093 filed on Sep. 30, 2008 and is a continuation application of U.S. application Ser. No. 12/444,398, filed Apr. 6, 2009, now U.S. Pat. No. 8,020,045 which is a 371 application of PCT/JP2009/000285, filed Jan. 26, 2009, the contents of which are hereby incorporated by reference into this application.
TECHNICAL FIELD
A technology disclosed in this specification relates to a system management method, an apparatus, a system, and a program for managing an operation of an information processing system which includes a server computer, a network apparatus, and a storage apparatus, and to a medium that includes the program, and an apparatus for delivering the program.
BACKGROUND ART
In recent years, IT systems (IT is an abbreviation for Information Technology; hereinafter, an IT system is also referred to as an information processing system) have become complex and large in scale because various IT apparatuses (hereinafter also referred to as information processing apparatuses) are coupled to them via a network. A fault may affect various IT apparatuses via the network. As an example of a root cause analysis technology for identifying the location and cause of a fault, Patent Citation 1 discloses an event correlation technology that analyzes a fault location and a cause by using the event information with which an IT apparatus reports the contents of a fault. The event correlation technology estimates a root cause by using the correlation among events sent from computers when faults occur. Non Patent Citation 1 discloses a technology which, in combination with the technology disclosed in Patent Citation 1, expresses as a rule a pair consisting of a combination of events occurring at the time of a fault and an estimated root cause, thereby quickly determining a root cause by using an inference engine based on an expert system.
  • [Patent Citation 1] U.S. Pat. No. 6,249,755 Specification
  • [Non Patent Citation 1] “Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem”, ARTIFICIAL INTELLIGENCE, Vol. 19, no. 1, 1982, pp. 17-37.
DISCLOSURE OF INVENTION
Technical Problem
Since a system management server that performs processing required for operation management cannot obtain events of all IT apparatuses coupled to the network, the system management server limits the number of IT apparatuses from which event information is received (or obtained) and displays an analysis result by using a root cause analysis technology.
However, in the analysis technology, it is premised that event information can be obtained from all IT apparatuses coupled to the network. As a result, when an event (for example, a fault) occurs in an IT apparatus from which the system management server does not obtain event information, and an IT apparatus from which the system management server obtains event information is affected by this fault, since the IT apparatus in which the fault has occurred is not an analysis target, a rule is not applied thereto and the root cause of the fault cannot be identified.
Technical Solution
The present invention provides an apparatus, a system, a method, a program, and a storage medium which are related to analysis of events occurring in a plurality of information processing apparatuses in an information processing system that includes the plurality of information processing apparatuses, a screen output apparatus, and a system management server which has a processor and a memory.
According to an embodiment of the present invention, the system management server stores, in configuration information held by the memory, identification information of a server apparatus which is included in the plurality of information processing apparatuses and which is an access target of each of the plurality of information processing apparatuses for using a network service as a client; registers, in the configuration information held by the memory, a plurality of monitored apparatuses which are included in the plurality of information processing apparatuses and from which the system management server obtains event information; stores in the memory correlation analysis rule information indicating that, when an event that includes a first event type related to the network service and an event that includes a second event type, different from the first event type and related to the network service, both occurring in the plurality of information processing apparatuses, are detected, an event corresponding to the first event type can occur due to an event corresponding to the second event type; stores in the memory a plurality of pieces of the event information obtained from the plurality of monitored apparatuses; identifies, based on the correlation analysis rule information, first event information which includes the first event type from among the plurality of pieces of the event information stored in the memory; identifies, based on the configuration information, a first monitored apparatus which is the monitored apparatus that sent the first event information, and a fault cause apparatus which serves as a server apparatus of the network service for the first monitored apparatus corresponding to the first event type; and, in a case where the fault cause apparatus is not included in the plurality of monitored apparatuses, sends information identifying the first monitored apparatus, the first event type, the fault cause apparatus, and the second event type to the screen output apparatus, based on the correlation analysis rule information and the configuration information, thereby causing the screen output apparatus to display a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus is estimated to have been caused by an event of the second event type that occurred in the fault cause apparatus.
Note that the correlation analysis rule information may include topology condition information indicating a topology condition between a first information processing apparatus which is one of the plurality of information processing apparatuses and in which the first event type has occurred and a second information processing apparatus which is one of the plurality of information processing apparatuses and in which the second event type has occurred; and the fault cause apparatus may be identified based on the topology condition information in the cause identifying step.
Further, an event-related information processing apparatus which is a server apparatus for the plurality of monitored apparatuses and which is included in the plurality of information processing apparatuses but is not included in the plurality of monitored apparatuses, may be identified based on the correlation analysis rule information and the configuration information; whether event information can be obtained from the event-related information processing apparatus may be checked; and information identifying the event-related information processing apparatus may be sent to the screen output apparatus, based on a result of the checking, when event information can be obtained from the event-related information processing apparatus; thereby information indicating that event information can be obtained from the event-related information processing apparatus may be displayed on the screen output apparatus.
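The identification and checking described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `ServiceLink`, `propose_monitoring_candidates`, and `can_obtain_events`, as well as the service labels, are hypothetical, and the access check is passed in as a callback so that no particular access procedure is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class ServiceLink:
    """One client/server relation taken from configuration information (hypothetical shape)."""
    client_id: str  # monitored apparatus acting as a client
    server_id: str  # server apparatus that the client accesses
    service: str    # illustrative label such as "file-share" or "dns"


def propose_monitoring_candidates(
    links: Iterable[ServiceLink],
    monitored: Set[str],
    can_obtain_events: Callable[[str], bool],
) -> List[str]:
    """Return unmonitored server apparatuses from which event information can be obtained."""
    candidates: List[str] = []
    for link in links:
        if link.client_id not in monitored:
            continue  # only servers of monitored clients are of interest
        if link.server_id in monitored or link.server_id in candidates:
            continue  # already monitored, or already proposed
        if can_obtain_events(link.server_id):
            candidates.append(link.server_id)
    return candidates


links = [
    ServiceLink("N10", "U3", "file-share"),
    ServiceLink("N11", "U5", "file-share"),
    ServiceLink("N10", "U4", "dns"),
]
accessible = {"U5"}  # assumed outcome of the access check, for illustration only
proposals = propose_monitoring_candidates(links, {"N10", "N11"}, lambda s: s in accessible)
print(proposals)
```

Only the apparatuses returned by the function would be shown on the screen output apparatus as candidates for monitoring; the others remain not-managed apparatuses.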
Further, the event-information acquisition permission/inhibition checking may be performed based on a result of an access by the system management server, according to a predetermined procedure to an information processing apparatus that is included in the plurality of information processing apparatuses and that has an IP address included in an IP address range specified in advance as a checking range.
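As a rough illustration of restricting the check to a range specified in advance, the following Python sketch tests an address against a list of checking ranges using the standard `ipaddress` module. The function name `in_checking_range` and the example range are assumptions made for illustration; the specification does not define how the range is expressed.

```python
import ipaddress
from typing import Iterable


def in_checking_range(address: str, checking_ranges: Iterable[str]) -> bool:
    """Return True if the address falls inside any range specified in advance."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(r) for r in checking_ranges)


# Hypothetical checking range; only addresses inside it would be accessed.
ranges = ["192.168.100.0/24"]
print(in_checking_range("192.168.100.1", ranges))
print(in_checking_range("10.0.0.5", ranges))
```

An apparatus whose address is outside every range would simply not be accessed, which is one way to avoid accesses that access monitoring could mistake for an attack.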
Further, the fault cause apparatus may be a storage apparatus which has a controller and provides a logical volume; the network service may be a service providing the logical volume by a block access protocol; and the second event type may be the occurrence of a fault in the controller and the first event type may be a failure in accessing the logical volume.
Further, when the fault cause apparatus is one of the plurality of monitored apparatuses, second event information which includes the second event type and which has been obtained from the fault cause apparatus, may be identified from among the plurality of the event information, and information identifying the first monitored apparatus, the first event information, the fault cause apparatus, and the second event information may be sent to the screen output apparatus based on the correlation analysis rule information and the configuration information; thereby a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus is caused by an event corresponding to the second event information that occurred in the fault cause apparatus may be displayed on the screen output apparatus.
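The two cases described in this section, a fault cause apparatus that is not monitored and one that is monitored, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Event` and `CorrelationRule` classes, the `analyze` function, the `server_of` mapping, and the event type names are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    apparatus_id: str
    event_type: str


@dataclass(frozen=True)
class CorrelationRule:
    service: str
    first_event_type: str   # symptom observed on the client
    second_event_type: str  # cause expected on the server apparatus


def analyze(
    rule: CorrelationRule,
    events: List[Event],
    server_of: Dict[Tuple[str, str], str],  # (client id, service) -> server id
    monitored: Set[str],
) -> List[str]:
    """Build display messages for events matching the first event type of the rule."""
    messages: List[str] = []
    for ev in events:
        if ev.event_type != rule.first_event_type or ev.apparatus_id not in monitored:
            continue
        server = server_of.get((ev.apparatus_id, rule.service))
        if server is None:
            continue
        if server in monitored:
            # Monitored server: require that the second event was actually received.
            if any(e.apparatus_id == server and e.event_type == rule.second_event_type
                   for e in events):
                messages.append(
                    f"{ev.event_type} on {ev.apparatus_id} was caused by "
                    f"{rule.second_event_type} on {server}")
        else:
            # Unmonitored server: no event can be received, so the cause is estimated.
            messages.append(
                f"{ev.event_type} on {ev.apparatus_id} is estimated to be caused by "
                f"{rule.second_event_type} on {server}")
    return messages


rule = CorrelationRule("file-share", "FileAccessFailure", "FileServerFault")
result = analyze(rule, [Event("N10", "FileAccessFailure")],
                 {("N10", "file-share"): "U3"}, {"N10"})
print(result)
```

The only difference between the branches is whether second event information must be present in the stored events; for an unmonitored apparatus the message is phrased as an estimate.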
According to another embodiment of the present invention, in the system management server, an information processing apparatus that is an event-information acquisition target is registered as a monitored apparatus in configuration information; event information that complies with a rule stored in advance is identified from among a plurality of event information stored in the system management server; a server apparatus for a network service related to the event information is identified; and a message is displayed which indicates that the cause of the event that occurred in a client information processing apparatus which has generated event information is an event related to the network service, which occurred in the server apparatus.
Advantageous Effects
According to the present invention, even when an event has occurred in an IT apparatus from which event information is not obtained, an analysis result can be displayed.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 shows an entire configuration diagram of an operation management system according to the present invention.
FIG. 2 schematically shows an entire processing flow of fault analysis according to one embodiment of the present invention.
FIG. 3 schematically shows one representative configuration example of an IT system which is a target of the present invention.
FIG. 4 schematically shows correlation analysis rule information used in the operation management system of the present invention.
FIG. 5 schematically shows topologies specified as application targets in the correlation analysis rule information shown in FIG. 4.
FIG. 6 schematically shows a rule-application-destination management table which is one example of a table data structure for managing lists of IT apparatuses to which rules are applied.
FIG. 7 is a processing flow of generating application information of the correlation analysis rule information according to one embodiment of the present invention.
FIG. 8 schematically shows connection information of IP-SAN storage apparatuses, obtained from computers serving as IP-SAN clients, in a first embodiment of the present invention.
FIG. 9 schematically shows configuration information related to an IP-SAN storage that is a management-target IT apparatus, the configuration information being held in configuration management, in the first embodiment of the present invention.
FIG. 10 is an example screen display which proposes that a user set a not-managed IT apparatus as a management target, in the first embodiment of the present invention.
FIG. 11 schematically shows a not-managed IT-apparatus management table which is one example of a table data structure for managing not-managed IT apparatuses, in the first embodiment of the present invention.
FIG. 12 schematically shows the rule-application-destination management table, holding lists of IT apparatuses to which rules are applied, in the first embodiment of the present invention.
FIG. 13 schematically shows connection information of FC-SAN storage apparatuses, obtained from computers serving as FC-SAN clients, in the first embodiment of the present invention.
FIG. 14 schematically shows information related to an FC-SAN storage that is a management-target IT apparatus, the information being held in the configuration management, in the first embodiment of the present invention.
FIG. 15 schematically shows identification information and public names related to file servers, which can be obtained from computers serving as the file servers, in the first embodiment of the present invention.
FIG. 16 schematically shows a processing flow of displaying a fault analysis result on a screen, in the first embodiment of the present invention.
FIG. 17 schematically shows an example of fault analysis result data in a case where a not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
FIG. 18 schematically shows an example screen display configuration for a fault analysis result in the case where the not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
FIG. 19 schematically shows screen display for a fault analysis result in the case where the not-managed IT apparatus causes a fault, in the first embodiment of the present invention.
FIG. 20 schematically shows an entire processing flow of fault analysis, in a second embodiment of the present invention.
FIG. 21 is the processing flow of generating application information of the correlation analysis rule information, according to one embodiment of the present invention.
EXPLANATION OF REFERENCE
    • N0: system management server
    • N1 to N3: computer
    • N4: network (NW) switch
    • N5: storage apparatus
    • O1: computer
    • O2: NW switch
    • O3: storage apparatus
    • M1: screen output apparatus
BEST MODE FOR CARRYING OUT THE INVENTION
Embodiments of the present invention will be described below.
First Embodiment
FIG. 1 is an overview showing one configuration of an information processing system for implementing the present invention.
The information processing system includes an operation management system and a system management server. In the operation management system, the system management server N0 monitors and manages, as management targets, computers, a network switch (NW switch), and a storage apparatus which constitute the IT system.
The system management server N0 of the present invention includes an event reception part C0 for receiving event information such as a status change in a management-target IT apparatus, fault information, and notification information; a rule engine C1 for performing fault analysis based on the received event information according to a rule R0 defined in advance; configuration management C3 for managing configuration information of management-target IT apparatuses; and a screen display part C2 for outputting information required for operation management to a screen.
Further, the operation management system includes a screen output apparatus M1 for displaying information used for operation management on the screen based on output data and the control of the screen display part. The screen output apparatus M1 is coupled to the system management server N0. Note that a first candidate for the screen output apparatus M1 is a display apparatus coupled to the system management server; however, another apparatus can be used instead if the apparatus can display analysis result information for the administrator of the operation management system. Other examples of the screen output apparatus M1 include a mobile terminal which can receive electronic mail sent from the system management server N0 and display it, as a screen output apparatus; and a computer having a display unit, which provides the administrator with information based on analysis result information sent by the system management server N0, receives an input from the administrator, and sends it to the system management server N0.
The rule engine C1 includes a rule application part C11 that reads analysis rule information R0 (hereinafter also referred to as correlation analysis rule information) used for event correlation analysis, obtains configuration information T0 from the configuration management C3, and performs processing to apply a rule to IT apparatuses in the IT system; a rule memory C13, serving as a working memory, for managing a rule-application-destination management table C130 in which application information used by the rule application part to apply a rule to IT apparatuses is managed and for performing rule analysis processing; and an event analysis processing part C12 that receives event information received by the event reception part C0 and performs event correlation analysis. Note that the rule-application-destination management table C130 may not be stored in the rule memory C13, but it needs to be stored in a memory of the system management server N0.
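One way to picture the rule-application-destination management information is as a mapping from each rule to the list of apparatus pairs it applies to. The Python sketch below is only an assumed shape: the function name `build_rule_application_table`, the service labels, and the tuple layout of the configuration entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_rule_application_table(
    rules: Dict[str, str],                       # rule id -> service the rule covers
    configuration: List[Tuple[str, str, str]],   # (client id, service, server id)
) -> Dict[str, List[Tuple[str, str]]]:
    """List, for each rule, the (client, server) pairs the rule is applied to."""
    table: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for rule_id, service in rules.items():
        for client, svc, server in configuration:
            if svc == service:
                table[rule_id].append((client, server))
    return dict(table)


rules = {"R3": "file-share", "R4": "dns"}
configuration = [
    ("N10", "file-share", "U3"),
    ("N11", "file-share", "U5"),
    ("N10", "dns", "U4"),
]
table = build_rule_application_table(rules, configuration)
print(table)
```

The event analysis processing part would then consult such a table to decide which received events belong to the same rule instance.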
Note that the correlation analysis rule information may be generated and stored by the administrator of the system management server N0, may be included in a program of the present invention, to be described later, and stored in the memory, or may be stored in the memory through initializing processing of the program of the present invention.
Note that hardware items constituting the system management server N0 include a processor, the memory (including secondary storage devices typical of which are a semiconductor memory and an HDD), and a network port. Those hardware items are coupled to each other by an internal network such as a bus. Typically, the event reception part C0, the rule engine C1, the screen display part C2, and the configuration management C3 are stored in the memory of the system management server N0 and realized by a program executed by the processor; however, part or all of those functions may be realized by hardware. Note that the program which includes the event reception part C0, the rule engine C1, the screen display part C2, and the configuration management C3 is referred to as an event analysis program in the following description.
Further, the correlation analysis rule information R0, the configuration information T0, and the rule-application-destination management table C130 are stored in the memory of the system management server N0. Further, the configuration information T0 includes at least one of the following: connection information of IP-SAN storage apparatuses (FIG. 8); information related to an IP-SAN storage (FIG. 9); connection information of FC-SAN storage apparatuses (FIG. 13); information related to an FC-SAN storage (FIG. 14); and identification information and public names related to file servers (FIG. 15), all of which will be described later. Further, a description will be given in which a not-managed IT-apparatus management table (FIG. 11), to be described later, is also included in the configuration information; however, as long as the not-managed IT-apparatus management table is stored in the memory of the system management server N0, it may be stored as information which is not included in the configuration information T0.
Further, the correlation analysis rule information R0, the configuration information T0, the rule-application-destination management table C130, the connection information of IP-SAN storage apparatuses, the information related to an IP-SAN storage, the connection information of FC-SAN storage apparatuses, the information related to an FC-SAN storage, the identification information and the public names related to file servers, and the not-managed IT-apparatus management table need not be stored in any specific format or data structure, such as a text file, a table, or a queue structure; they only need to include the information described later. In order to clarify that they are more general information in the following description and claims, the correlation analysis rule information R0, the configuration information T0, the rule-application-destination management table C130, the connection information of IP-SAN storage apparatuses, the connection information of FC-SAN storage apparatuses, the information related to an IP-SAN storage, the information related to an FC-SAN storage, the identification information and the public names related to file servers, and the not-managed IT-apparatus management table are also referred to as correlation analysis rule information, configuration information, rule-application-destination management information, connection information of IP-SAN storage apparatuses, connection information of FC-SAN storage apparatuses, information related to an IP-SAN storage, information related to an FC-SAN storage, information of identification and public names related to file servers, and not-managed IT-apparatus management information, respectively.
In addition, the system management server stores, as event entries, event information received from various management-target IT apparatuses in an event database defined in the memory although that is not shown. Note that the event database may have any data structure if one or more event entries are included therein.
Note that event information includes event contents, and it may also include an event occurrence time. Further, in the event database, past event information may be left as a history according to a specified condition. When the event information is included in the event database and stored in the memory, the program (in particular, the configuration management C3) of the system management server may associate the event information with the identification information of an IT apparatus from which the event information has been obtained and with the time at which the system management server has received the event information, and may include them all together. Note that the event contents include at least the type of an event, and, depending on the situation, the event contents may also include information identifying hardware and software in the IT apparatus, in which the event has occurred.
The following items are conceivable as example event types, but there may be event types other than those items.
(A) The operation state of the IT apparatus enters a predetermined state (for example, the occurrence of a hardware fault or a software fault is included in this type).
(B) A predetermined health-check result is obtained (for example, a case where no health-check response is obtained for a given period of time is included in this type).
(C) The processing speed and the amount of used resources, such as a processor, a memory, and an HDD, which are components constituting the IT apparatus satisfy a predetermined condition (for example, a case where the remaining capacity of the HDD falls below 10% is included in this type).
(D) The IT apparatus receives network access which satisfies a predetermined condition (for example, a case where the IT apparatus received requests more than a predetermined number of times, a case where a network packet which is identified as a requested DoS attack is received a predetermined number of times, and a case where a request is received from an IT apparatus other than a specified IT apparatus are included in this type).
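As an illustration of the event entries and event types described above, the following is a minimal sketch, not the implementation of the embodiments. The names EventType, EventEntry, and hdd_capacity_event are hypothetical, and the 10% threshold is taken from the example given for type (C).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(Enum):
    # Hypothetical labels for the example event types (A) to (D).
    OPERATION_STATE = "A"
    HEALTH_CHECK = "B"
    RESOURCE_THRESHOLD = "C"
    NETWORK_ACCESS = "D"


@dataclass(frozen=True)
class EventEntry:
    event_type: EventType                # event contents include at least the type
    apparatus_id: str                    # IT apparatus the event was obtained from
    received_at: float                   # time the management server received it
    occurred_at: Optional[float] = None  # occurrence time, if reported
    component: Optional[str] = None      # hardware/software where the event occurred


def hdd_capacity_event(apparatus_id, remaining_ratio, now):
    """Return a type (C) event when remaining HDD capacity falls below 10%."""
    if remaining_ratio < 0.10:
        return EventEntry(EventType.RESOURCE_THRESHOLD, apparatus_id, now,
                          component="HDD")
    return None
```

The entry keeps the reception time as a required field and the occurrence time as optional, mirroring the statement that event information may, but need not, include an occurrence time.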
Note that in order to store the event analysis program in the memory, it is conceivable to use a method in which the program is installed or copied from a medium, such as a DVD-ROM or a CD-ROM, which has stored the program, or to use a method in which the program (or information from which the program can be generated on the memory) is received from a program distribution server that can communicate with the system management server N0; however, other methods can also be used. Alternatively, after the program is stored in the system management server N0 in advance, the system management server N0 may be distributed.
The above-described system management server N0 analyzes the root cause of faults in the information processing system.
In the operation management system, management-target IT apparatuses are specified in advance, event information is used as an analysis target of correlation analysis, and necessary information is received from the IT apparatuses. If all IT apparatuses coupled to the network were managed, the processor, the memory, and the storage device, such as a hard disk, of the management server would need to be used very heavily for the management, thereby making practical monitoring difficult. Therefore, the management-target IT apparatuses, from which information is received, are narrowed down in the operation management system to avoid such difficulty. Further, when a management tool is a commercially-available tool, the number of licenses is limited based on the types and the number of IT apparatuses to be managed, in almost all cases. Therefore, the IT system includes an IT apparatus from which the system management server N0 obtains or is allowed to obtain event information for event information analysis (hereinafter, such an IT apparatus is also expressed as monitored IT apparatus, managed IT apparatus, management IT apparatus, in-management IT apparatus, or monitored apparatus; and such expressions apply to a computer, a switch, a router, and a storage apparatus, which are specific examples of an IT apparatus), and an IT apparatus from which the system management server N0 does not obtain or is prevented from obtaining event information (hereinafter, such an IT apparatus is also expressed as not-monitored IT apparatus, not-managed IT apparatus, IT apparatus that is out of management, out-of-management IT apparatus, or event-related information processing apparatus; and such expressions apply to a computer, a switch, a router, and a storage apparatus, which are specific examples of an IT apparatus).
IT apparatuses that are not monitored or managed in the system management server N0 are further classified into an IT apparatus that was once found, confirmed, or managed in the system management server N0, and an IT apparatus that has never been found, confirmed, or managed in the system management server N0. In some system management servers N0, for an IT apparatus that was once managed, found, or confirmed, configuration information obtained when it was found or confirmed, for example, the IP address, the host name, or the fully qualified domain name (FQDN) of the IT apparatus, may be held and managed, though not always in the same manner as for an IT apparatus that is monitored and managed. In the present invention, both a non-management-target IT apparatus for which corresponding configuration information is not held in the system management server N0 and a non-management-target IT apparatus for which part or all of the corresponding configuration information has been stored in the system management server N0 are defined as non-management-target IT apparatuses.
Example cases to be out of management of the operation management system include a case where a management-target IT apparatus uses a globally-provided service such as a DNS server, and a case where the operation management system cannot sufficiently obtain information used for management due to circumstances such as a firewall, an access-right problem, a network configuration, and an access-means defect.
The present invention relates to analysis of the correlation among a plurality of IT apparatuses existing in the network. However, even when events simultaneously occur due to a cause in a plurality of apparatuses which are correlated with each other, the clock signals in the individual apparatuses are shifted, and further, the timing to transfer event information is also shifted. Therefore, the system management server N0 analyzes event information that occurred or was received for the duration (a period of time) predetermined by a program developer or for a period of time specified by the administrator. Further, even when a cause arises, events related to the cause may occur at different timing (for example, in a case where a predetermined network service such as a Web service or a DNS service is received through caching processing from a server computer). Thus, analysis needs to be performed for a period of time instead of at a particular time.
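The need to analyze a period of time rather than a single instant can be sketched as a simple filter over received events. This is only an illustrative sketch; the function name events_in_window, the dictionary key received_at, and the use of seconds are assumptions, not part of the described system.

```python
def events_in_window(events, window_end, window_seconds):
    """Return the events received in (window_end - window_seconds, window_end].

    events: iterable of dicts with a hypothetical "received_at" timestamp.
    window_seconds: the analysis period, e.g. predetermined by the program
    developer or specified by the administrator.
    """
    start = window_end - window_seconds
    return [e for e in events if start < e["received_at"] <= window_end]
```

Because clocks and transfer timing differ between apparatuses, filtering on the reception time at the management server gives a single consistent time base for the window.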
It is preferable that desired events be items occurring dynamically to some extent. Further, it is more preferable that the time difference between the time at which, because a predetermined cause arises, an event occurs in the IT apparatus serving as the cause (or the time at which the system management server receives the event) and the time at which, due to the cause, an event occurs in another IT apparatus (or the time at which the system management server receives the event) fall within the above-mentioned period of time.
It is preferable that information conceivable as one piece of configuration information include the types and the number of hardware items constituting an IT apparatus, and the communication identification information and the name which are necessary to communicate with the IT apparatus, and be quasi-static information which can be partially changed by the administrator of the IT apparatus.
FIG. 2 shows a flow of general processing based on the above-described configuration, according to one embodiment of the present invention.
In S1, the rule engine C1 reads the correlation analysis rule information R0 in advance, obtains the configuration information T0 of management targets from the configuration management C3, searches T0 for the identification information of IT apparatuses to which the rule group R0 is applied, and stores the identification information in the rule-application-destination management table C130. The process of S1 is a preparation process for fault analysis processing using events, to be performed later, and needs to be performed prior to the analysis processing. In the first embodiment, which is one of the embodiments, it is assumed that the analysis processing is performed prior to the start of the operation, and the rule-application-destination management table C130 is held in advance in the rule memory C13.
In S2, the event reception part C0 waits to receive events sent from the management-target IT apparatuses in the operation management system.
S3 is related to a system operation of the operation management system. S3 is a step that determines whether a halt process has been instructed, and is used to halt the operation.
In S4, it is judged whether events have been received by the event reception part C0. When it is judged that events have been received, the events received by the event reception part C0 are input to the event analysis processing part C12, a corresponding rule is determined based on the rule-application-destination management table C130, and a fault cause is identified according to the rule, in S5.
In S5, the identified fault cause is output to the screen display part C14. The screen display part C14 sends analysis information based on received analysis result output data, thereby outputting and displaying a screen necessary for the operation management on the screen output apparatus M1.
Note that received event information may be temporarily stored in the event database, instead of in the processes of S2 and S4.
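The general flow of FIG. 2 after the preparation step S1 can be sketched as a polling loop. This is a minimal sketch under stated assumptions: the function run_event_loop, its parameters, and the use of a queue as the event reception part are hypothetical, and the analysis and display steps are passed in as callables because their internals are described elsewhere.

```python
import queue


def run_event_loop(event_queue, rule_table, analyze, display,
                   halt_requested, poll_seconds=0.1):
    """Sketch of the loop over S2 to S5; rule_table is assumed prepared in S1."""
    while True:
        try:
            # S2: wait for events from management-target IT apparatuses.
            events = event_queue.get(timeout=poll_seconds)
        except queue.Empty:
            events = None
        # S3: stop when a halt has been instructed.
        if halt_requested():
            break
        # S4: were events received?
        if events is not None:
            # S5: pick the rule via the table, identify the cause, display it.
            display(analyze(events, rule_table))
```

Polling with a timeout is one design choice among several; it lets the halt check in S3 run even when no events arrive.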
One advantage of the present invention is to allow fault cause analysis for an IT apparatus that is not a management target, by changing the process of the rule application part in this general processing flow, without largely changing the configuration and the subsequent processing flow.
FIG. 3 is an overview showing one configuration of an IT system assumed in the embodiments of the present invention. The IT system of FIG. 3 includes: an operation management system that is the target of operation management by the management server N0, including a computer N10, a computer N11, and a computer N12, an IP switch N21 and an FC switch N31 which are network switches, a storage apparatus N40, and a storage apparatus N41; a storage apparatus U2 and a computer U5, which are non-management-target IT apparatuses that are not managed by the management server N0; and a storage apparatus U1, a computer U3, and a computer U4, which are coupled to a network G0 via a router N20. Note that the number of the IT apparatuses, such as computers, switches, routers, and storage apparatuses which are individually shown, is an example; the operation management system just needs to include at least an IT apparatus serving as a server which provides a network service and an IT apparatus serving as a client which receives the network service.
The storage apparatus U1, which is a non-management-target IT apparatus, includes an IP-SAN interface and provides the management-target computer N10 with a logical volume. The storage apparatus U2, which is a non-management-target IT apparatus, includes an FC-SAN interface and provides a management-target computer N13 with a logical volume via the management-target FC switch N31. The computer U3 or the computer U5, which is a non-management-target IT apparatus, is a file server and makes a file system available to both of the management-target computers N10 and N11. The computer U3 belongs to a network segment different from that of the operation management system, and detailed information related to the computer U3 cannot be obtained through the network.
On the other hand, the computer U5, serving as a file server, belongs to the same network segment as the operation management system, and can be automatically found by the operation management system. The computer U5 is an IT apparatus that was found at the time of the operation but was not set as a management target. The computer U4, which is a non-management-target IT apparatus, is a DNS server and provides a name resolution function to all the IT apparatuses included in the IT system of FIG. 3.
To provide better understanding, a description will be given of how to apply a rule of an event correlation technology to management-target IT apparatuses, before the first embodiment is described.
FIG. 4 shows example rules suggesting that a fault in the controller of a storage apparatus is the root cause, for the IT system shown in FIG. 1. In a rule for identifying the root cause in fault analysis, a combination of events predicted to occur based on an event correlation and a fault serving as the root cause are described as a pair in an IF-THEN format, in many cases. In the IF-THEN format, a rule is expressed such that “when a condition described in the IF part is established, the THEN part is true”.
In the embodiments, it is assumed that a rule is described in the IF-THEN format in the same way as general rules in expert systems, and information related to IT apparatuses to which the rule is applied is defined in advance in the IF condition part. Note that a rule may not be described in the IF-THEN format, but a topology needs to be defined in advance as any connection and relation information which can identify IT apparatuses to which the rule is applied.
In addition, information for actually storing each rule is called a rule entry. The correlation analysis rule information includes one or more rule entries. More abstractly, it can be said that a rule entry includes the following information.
(A) A condition entry indicating a condition that includes an event type to which the rule is applied. As described above, this condition entry may include a topology as a condition.
(B) A cause entry indicating an event serving as a cause and the location of an IT apparatus related to the event or its hardware and software, when the condition is satisfied.
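The rule entry described above, with its condition entry and cause entry, can be sketched as follows. This is an illustrative sketch only; the class names, field names, and the event type strings used with it are hypothetical, and the subset test is one simple way to read the IF-THEN format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConditionEntry:
    event_types: FrozenSet[str]     # event types to which the rule is applied
    topology: Optional[str] = None  # optional topology condition


@dataclass(frozen=True)
class CauseEntry:
    event_type: str                 # event serving as the cause
    location: str                   # IT apparatus, hardware, or software location


@dataclass(frozen=True)
class RuleEntry:
    rule_id: str
    condition: ConditionEntry
    cause: CauseEntry

    def conclude(self, observed_event_types):
        """IF every condition event type was observed THEN return the cause."""
        if self.condition.event_types <= set(observed_event_types):
            return self.cause
        return None
```

A rule such as R1 would then carry the event types expected on the computer and the storage apparatus in its condition entry and the controller fault in its cause entry.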
In the first embodiment, it is assumed that the following rules are defined in advance as shown in FIG. 4: a rule R1 in which the root cause is a fault in the controller of an IP-SAN storage apparatus that uses iSCSI; a rule R2 in which the root cause is a fault in the controller of an FC-SAN storage apparatus that uses Fibre Channel; a rule R3 in which the root cause is a fault in a file server; and a rule R4 in which the root cause is that the network does not reach the DNS server. FIG. 6 shows the rule-application-destination management table that is information holding, for each rule, IT apparatuses to which the rule is applied. The rule-application-destination management table includes a column C101 for identification information indicating a rule, and a column C102 for the list of application-destination IT apparatuses, storing the identification information of IT apparatuses to which the rule is applied. The rule-application-destination management table does not need to be in a database. Note that this table data structure may be divided into a plurality of table data structures by normalizing the table, and the plurality of table data structures may be managed.
FIG. 5 shows topology patterns to which the rules R1 to R4, shown in FIG. 4, are applied. FIG. 5(1) shows a topology of connection and relation information suggested by the IF part of the rule R1. FIG. 5(1) indicates that Computer indicating a computer has iScsiInitiator and is coupled to iScsiTarget of Storage indicating a storage apparatus via Ipswitch indicating an IP switch. iScsiTarget is an iSCSI name identifying the connection destination of iScsiInitiator. The rule R1 is applied to a combination of a computer and a storage apparatus in which connection-destination iScsiTarget held by the computer matches the iSCSI name of an iScsi port of the storage apparatus. Rows L101 and L102 of FIG. 6 show IT apparatuses to which the rule R1 is applied in the IT system of FIG. 3.
Similarly, FIG. 5(2) indicates that a computer has FcHba and FcHba is coupled to FcPort of a storage apparatus via FcSwitch, as suggested by the IF part of the rule R2. When a connection-destination port WWN (WWN: World Wide Name) held by FcHba matches FcPortWWN, which is WWN of FcPort serving as a Fibre Channel port of the storage apparatus, it means that they have a connection relation and the rule R2 is applied to them. A row L103 of FIG. 6 shows IT apparatuses to which the rule R2 is applied, as a combination of the computer and the storage apparatus, in the IT system of FIG. 3.
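The matching used for the rules R1 and R2 has the same shape: a client holds connection-destination identifiers (iSCSI names or port WWNs), and a storage apparatus exposes identifiers on its ports. A minimal sketch, assuming hypothetical dictionary inputs and a hypothetical function name match_by_identifier:

```python
def match_by_identifier(clients, servers):
    """Pair each client with every server exposing an identifier it connects to.

    clients: {client_id: set of connection-destination identifiers}
             (e.g. iScsiTarget names for R1, port WWNs for R2)
    servers: {server_id: set of identifiers exposed by the server's ports}
    """
    pairs = []
    for client_id, wanted in sorted(clients.items()):
        for server_id, exposed in sorted(servers.items()):
            if wanted & exposed:  # at least one identifier matches
                pairs.append((client_id, server_id))
    return pairs
```

Each returned pair corresponds to one row of application-destination IT apparatuses for the rule in question.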
FIG. 5(3) shows a case where the IF part of the rule R3 indicates a topology of a file server and a client. A computer T31 having information of ImportedFileShare which indicates that a file system of the file server is mounted and a computer T33 having information of ExportedFileShare which indicates that the file system is made available to the outside have the relation of a client and a file server via an IP switch T32. ImportedFileShare T311 includes, as information related to the file server of the mount source, the identification information (the IP address, the FQDN (Fully Qualified Domain Name), etc.) of the file server, and the public name of the file system made available to the outside. ExportedFileShare T331 includes the location of the file system made available to the outside and the public name (also called share name) thereof.
When the computer indicated by the identification information of the file server specified by ImportedFileShare has information of ExportedFileShare, and the public name in ExportedFileShare matches the public name specified by ImportedFileShare of the computer T31, the rule R3 is applied to those computers, as a pair, as the topology of the file client and the file server. A row L104 of FIG. 6 shows IT apparatuses to which the rule R3 is applied, as a combination satisfying the above condition, in the IT system of FIG. 3.
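The pairing condition for the rule R3 can be sketched in the same spirit. The function name match_file_shares and the tuple and dictionary layouts are assumptions made for illustration; the apparatus identifiers in the usage are only examples.

```python
def match_file_shares(imports, exports):
    """Return (client, server) pairs whose share names match.

    imports: list of (client_id, server_id, public_name) taken from each
             client's ImportedFileShare information.
    exports: {server_id: set of public names} taken from each server's
             ExportedFileShare information.
    """
    return [(client, server)
            for (client, server, name) in imports
            if name in exports.get(server, set())]
```

A client whose mount source is a server with no known ExportedFileShare information simply yields no pair here; the handling of such servers as not-managed IT apparatuses is a separate step.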
FIG. 5(4) shows a topology of a DNS server and a client suggested by the rule R4. A computer T42 serving as the DNS server, which provides a name resolution service, and a computer T41 serving as the client, which resolves IP addresses and FQDNs with the DNS server, are stored as a pair in the application-destination management table shown in FIG. 6.
It is assumed that the configuration corresponding to topology information related to such connections and relations described in the rules is defined in advance in the system, and is uniquely determined by the description of each rule.
The application-destination management table of FIG. 6 for IT apparatuses to which each rule is applied is provided. Therefore, when events occur, it is possible, by referring to the table, to judge a rule to which the events are related and to select a rule to be applied. The method of applying a rule to management-target IT apparatuses has been described above.
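Selecting the rules related to an event by referring to the table can be sketched as a lookup keyed on the apparatus in which the event occurred. This is a hedged sketch: the function name rules_for_apparatus and the table layout are assumptions, and the rows used in any example are illustrative rather than the actual contents of FIG. 6.

```python
def rules_for_apparatus(rule_table, apparatus_id):
    """Return the identifiers of rules whose application destinations
    include the given apparatus.

    rule_table: {rule_id: list of tuples of apparatus identifiers},
                i.e. column C101 mapped to the rows of column C102.
    """
    return sorted(rule_id
                  for rule_id, groups in rule_table.items()
                  if any(apparatus_id in group for group in groups))
```

For example, with a table in which R1 lists (N10, N40) and R3 lists (N10, U5), an event from N10 would select both R1 and R3 for evaluation.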
FIGS. 7 and 21 show details of Step S1 of FIG. 2 performed in the rule application part C11, according to one embodiment of the present invention. With reference to the processing flow, the first embodiment will be described with the IT system shown in FIG. 3 and the rules R1 to R4 shown in FIG. 4. The entire processing shown in FIGS. 7 and 21 is performed in the rule application part. It is assumed that the operation management system stores in advance an IT apparatus once found, and can judge that the IT apparatus has been found. Alternatively, when the operation management system does not have a function of automatically finding an IT apparatus in the IT system, or even if the operation management system has the function of automatically finding an IT apparatus in the IT system, when it does not have a function of storing the found IT apparatus, the processing of FIGS. 7 and 21 is performed as if the found IT apparatus did not exist.
(Description of a General Flow and a Case where the Rule R1 is Applied)
In S101, it is judged whether a rule to be read, that is, a rule that has not been read yet, is included in the correlation analysis rule information R0. When it is judged that a rule to be read is included (YES), the flow advances to S102. Otherwise (NO), the flow ends. Since rules to be read, which are the rules R1 to R4, are included (YES), the flow advances to S102.
In S102, one rule is read. The rule is marked or is stored as a read rule, for example, so that it can be recognized to be one that has been read. In the embodiment, the rule R1 is read and is stored as a read rule. The flow advances to S103.
In S103, a search condition for IT apparatuses corresponding to the topology information described in the rule is obtained and the flow advances to S104. In the embodiment, a search condition is obtained for IT apparatuses which include a computer having iScsiInitiator, a storage apparatus having an iSCSI port identified by iScsiTarget, and an IP switch coupled to them and to which the rule R1 is applied as in the topology information of the rule R1. It is assumed that the search condition is defined in advance with respect to the description of the rule.
In S104, the configuration information of management-target IT apparatuses is searched for the IT apparatus serving as a client in the topology information. When the configuration information is stored in a database, the database is searched. When the configuration information is stored in a file, the file is searched. A storage medium, a device, or the like to be searched does not matter. In the embodiment, the configuration information is searched for the computer having iScsiInitiator, serving as a client in the topology of the rule R1. In this embodiment, when it is assumed that the computer N10 or the computer N11 has iScsiInitiator, the identification information of the computer N10 and the computer N11 is found through the search.
In S105, it is judged whether an IT apparatus that has not been selected is included in the IT apparatuses found through the search, because processes of S106 and the subsequent steps are performed for a plurality of computers. In this embodiment, since the computer N10 and the computer N11 are IT apparatuses that have not been selected, the flow advances to S106.
In S106, one of the IT apparatuses that have not been selected is selected and regarded as a selected IT apparatus. In this embodiment, the computer N10 is selected and regarded as a selected IT apparatus. The flow advances to S107.
In S107, information of IT apparatuses serving as servers which are opposed, in the topology, to the IT apparatus selected in S106 is obtained. The information of an IT apparatus serving as a server includes: information identifying the IT apparatus serving as the server (such as the IP address, the host name, or the FQDN); and information related to a service to be provided (the public name (also called share name) of an available file system of the file server, the LUN number identifying a disk volume of the storage apparatus, the iSCSI name of a connection destination, or the WWN of an FC Port). In this embodiment, ConnectedIscsiTarget which is the iSCSI name of a connection destination shown in FIG. 8 is obtained as the information of storage apparatuses serving as servers, which are opposed to the computer N10.
In S108, it is judged whether information corresponding to an IT apparatus that has not been searched for is included in the information related to IT apparatuses serving as servers, obtained in S107. When it is judged that such information is included (YES), the flow advances to S109. When it is judged that such information is not included (NO), the flow returns to S105. In this embodiment, since at least three pieces of information which have not been searched for are included as shown in FIG. 8 (YES), the flow advances to S109.
Information shown in FIG. 8 is described. The information includes the identification information indicating an IT apparatus (more specifically, a computer) and the identification information, in iSCSI, of a storage apparatus to which the IT apparatus is coupled.
In S109, one piece of information which has not been searched for is selected from the information related to IT apparatuses serving as servers, obtained in S107. Based on the selected information, the configuration information of management targets is searched for the IT apparatus serving as a server. In this embodiment, the configuration information of management targets is searched for a storage apparatus having, as iScsiTarget, an iSCSI name indicated in a row L201 of ConnectedIscsiTarget shown in FIG. 8, obtained from the computer N10.
In S110, when the corresponding storage apparatus is not included in management-target IT apparatuses (NO) through the search in S109, the flow advances to S111. On the other hand, when the corresponding storage apparatus is included in management-target IT apparatuses (YES), usual rule application processing will be performed and the flow advances to S121. In this embodiment, FIG. 9 shows configuration information about iScsiTarget of a management-target storage apparatus. Since the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L201 of FIG. 8 is not found in the management target as shown in FIG. 9, the flow advances to S111.
Information shown in FIG. 9 is described. The information includes the identification information indicating a storage apparatus and the identification information, in iSCSI, held by the storage apparatus.
Note that the configuration information T0 includes, for each of one or more IT apparatuses that have been found, event-acquisition permission/inhibition information which indicates whether the apparatus is an event acquisition target (specifically, whether the apparatus is monitored; in other words, whether event acquisition from the apparatus is permitted or inhibited). The judgment of S110 is performed by referring to this data.
In S111, it is judged whether the IT apparatus has been already found in the operation management system. Specifically, it is judged whether the IT apparatus was once found, confirmed, or managed in the operation management system and the static configuration information of the IT apparatus is partially held in the operation management system. In this embodiment, since there is no configuration information related to the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L201 of FIG. 8, it is assumed that the IT apparatus is not a found resource (NO). Then, the flow advances to S112.
Note that the judgment of S111 can be performed by judging whether information related to the apparatus (for example, the event-acquisition permission/inhibition information) is included in the configuration information.
In S112, an attempt is made to find the storage apparatus having iScsiTarget identical to ConnectedIscsiTarget in the row L201 of FIG. 8, from not-managed IT apparatuses. There is an example method of searching for the not-managed IT apparatus, to be used in S112. In the method, a request to receive a service related to the target resource is sent to a communication identifier such as the FQDN or the IP address corresponding to the target resource, obtained from the configuration information or input by the user; or a communication identifier such as the FQDN or the IP address in the network address, which is the IP address corresponding to the network segment that includes the target resource, obtained from the configuration information or input by the user. Depending on whether a response to the request is returned, the presence of the target resource is confirmed. In this embodiment, an attempt is made to find the storage apparatus from the IT system shown in FIG. 3.
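One possible way to confirm the presence of a target resource by whether a response is returned is a TCP connection attempt to the service port. The following sketch is an assumption for illustration, not the method of the embodiments: the function name service_responds is hypothetical, and a successful TCP connection is only a coarse stand-in for "a request to receive a service".

```python
import socket


def service_responds(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    host may be an IP address or an FQDN obtained from the configuration
    information or input by the user.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, name resolution failure, timeout
        return False
```

For an iSCSI target, the port would typically be 3260, the registered iSCSI target port; a fuller probe would follow the connection with a protocol-level request.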
In S113, it is judged whether the attempt made in S112 has succeeded. When it has succeeded (YES), the flow advances to S114. Otherwise (NO), the flow advances to S116. In this embodiment, it is assumed that a storage apparatus U3 shown in FIG. 3 has been found as the corresponding storage apparatus, and the flow advances to S114.
In S114, it is judged whether the IT apparatus found in S113 can be set to a management target of the operation management system. Whether the IT apparatus can be set to a management target is judged depending on whether information required by the operation management system for monitoring and management can be obtained from the target IT apparatus. Although information required for monitoring and management is different for each operation management system, information identifying the IT apparatus is required in common that includes at least one of the following: the IP address, the WWN (World Wide Name), some unique identification information (number), an apparatus name (host name), and the FQDN.
It is preferred that one or more pieces of information related to the types or the number of hardware items constituting the IT apparatus be able to be obtained to some extent. In the present invention, it is assumed that the system management server N0 holds a predetermined criterion and this judgment is performed based on the criterion. In this embodiment, it is assumed that, as information related to the storage apparatus U3, the storage apparatus has an iSCSI port and information of iScsiTarget can be obtained as the iSCSI name of the iSCSI port. It is also assumed that the IT apparatus has been judged to be able to be set to a management target. The flow advances to S115. Note that, since this apparatus may be set to a management target in a process to be performed later, the processing may be configured such that it is confirmed in this step that event information can be received from this IT apparatus, and only when it is confirmed that event information can be received from this IT apparatus, the flow advances to S115.
In S115, the user is asked whether the IT apparatus found in S113 is to be set as a management target. In this embodiment, the fact that the storage apparatus U3 has been found as a storage server for the computer N10 is presented, together with the choice of whether to add the storage apparatus U3 to the management targets. The indication screen is shown in FIG. 10.
In S116, the system management server N0 (in particular, the rule engine) receives an input from the management screen output apparatus.
In S117, it is judged whether the user has set the found IT apparatus to a management target. When the user has set the found IT apparatus to a management target (YES), the flow advances to S118. Otherwise (NO), the flow advances to S119. In this embodiment, it is assumed that the user did not set the storage apparatus U3 to a management target, and the flow advances to S119.
In S118, information for the IT apparatus which the user has determined to add to management targets is obtained and is stored in the configuration management as information of a management-target IT apparatus. In this embodiment, this branch is not taken at this point.
In S119, information which can be obtained for the server opposed to the client and handled as a not-managed IT apparatus is stored and managed in the not-managed IT-apparatus management table. The flow advances to S120. In this embodiment, it is assumed that the FQDN and iScsiTarget which is the iSCSI name of the IP port of the storage apparatus can be obtained as information identifying the storage apparatus U3 and are stored in the not-managed IT-apparatus management table TL3 shown in FIG. 11.
A description is given with reference to FIG. 11. The not-managed IT-apparatus management table TL3 includes the following information for each of not-managed IT apparatuses that have been found.
(A) The identification information of the not-managed IT apparatus
(B) The type C401 of the not-managed IT apparatus
(C) The communication identification information C402 of the not-managed IT apparatus
(D) The identification information C403 required to access a service of the not-managed IT apparatus
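The four items listed above can be pictured as one record per not-managed IT apparatus. The following is a minimal sketch only: the class names, field names, and the sample FQDN and iSCSI name are hypothetical and are not taken from FIG. 11; only the column labels C401 to C403 and the apparatus identifier U1 come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NotManagedApparatus:
    """One row of the not-managed IT-apparatus management table TL3 (illustrative)."""
    apparatus_id: str           # (A) identification information of the apparatus
    apparatus_type: str         # (B) type, column C401
    comm_id: str                # (C) communication identification information, column C402
    service_id: Optional[str]   # (D) identification needed to access a service, column C403


class NotManagedTable:
    """Hypothetical in-memory stand-in for TL3, keyed by apparatus_id."""

    def __init__(self) -> None:
        self._rows: Dict[str, NotManagedApparatus] = {}

    def register(self, row: NotManagedApparatus) -> None:
        # Corresponds to storing the obtained information in S119.
        self._rows[row.apparatus_id] = row

    def lookup(self, apparatus_id: str) -> Optional[NotManagedApparatus]:
        # Corresponds to the search performed later in S605.
        return self._rows.get(apparatus_id)


table = NotManagedTable()
# The FQDN and iSCSI name below are made-up placeholder values.
table.register(
    NotManagedApparatus("U1", "IP-SAN storage", "storage.example.com", "iqn.2001-04.com.example:target0")
)
```

Keying the table by the apparatus identifier keeps the later lookup from the fault analysis result a single dictionary access.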
In S120, the identification information of the not-managed IT apparatus is marked such that it can be recognized that the IT apparatus is not managed, and then the identification information is stored in the rule-application-destination management table TL1 as shown in FIG. 12. In this embodiment, the identification information is stored in the rule-application-destination management table TL1, based on the information related to the storage apparatus U1 included in the not-managed IT-apparatus management table. After the identification information is stored, the flow returns to S108, in which it is judged whether search information related to an IT apparatus serving as a server opposed to the selected IT apparatus serving as a client is included.
In this embodiment, when the flow returns to S108, it is judged whether information that has not been searched for is included in the search information related to storage apparatuses serving as servers, obtained in S107. Since there is search information related to a storage serving as a server for the computer N10, as in the row L202 of FIG. 8, the flow advances to S109.
In S109, the storage apparatus corresponding to L202 is searched for in the configuration management. In the embodiment, since the storage apparatus corresponding to L202 exists as shown in FIG. 9, it is recognized that the IT apparatus corresponding to L202 is a management target. Therefore, it is judged in S110 that the IT apparatus is a management-target IT apparatus, and the flow advances to S120. In S120, the list of the storage apparatus N40 and the computer N10, which are management-target IT apparatuses, is stored in L101 of the rule-application-destination management table of FIG. 12, as IT apparatuses to which the rule R1 is applied.
Through the above-described steps, the rule R1 can be applied also to the non-management-target storage apparatus U1, which provides the computer N10 with a logical volume.
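The branching walked through above (S109 to S121) can be summarized in a short sketch. It is an illustration under assumptions, not the claimed implementation: the function and parameter names are hypothetical, the three callables stand in for the discovery attempt, the predetermined criterion of S114, and the user's answer in S116, and the outcome when an apparatus cannot be found at all is not described in this passage, so the sketch simply treats it as not managed.

```python
from enum import Enum
from typing import Callable, Set


class Outcome(Enum):
    MANAGED = "already a management target"   # S110: YES
    ADDED = "added to management targets"     # S117: YES, then S118 and S121
    NOT_MANAGED = "kept as not-managed"       # S119 and S120


def classify_server(
    server_id: str,
    managed: Set[str],
    found: Set[str],
    discover: Callable[[str], bool],
    can_manage: Callable[[str], bool],
    user_accepts: Callable[[str], bool],
) -> Outcome:
    """Decide how a server opposed to a client is registered for a rule."""
    if server_id in managed:                # S109, S110: search configuration management
        return Outcome.MANAGED
    if server_id not in found:              # S111: not among found resources
        if not discover(server_id):         # S112, S113 (outcome on failure is assumed)
            return Outcome.NOT_MANAGED
        found.add(server_id)
        if not can_manage(server_id):       # S114: cannot be set to a management target
            return Outcome.NOT_MANAGED      # S119, S120
    if user_accepts(server_id):             # S115 to S117: propose and read the answer
        managed.add(server_id)              # S118, S121
        return Outcome.ADDED
    return Outcome.NOT_MANAGED              # S119, S120
```

With this shape, the rule R1 walkthrough corresponds to a newly discovered, manageable apparatus that the user declines to add, which yields `Outcome.NOT_MANAGED`.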
Next, referring to the rule-application-destination management table of FIG. 12, a description will be given of an example case of S6 of FIG. 2. Specifically, a description will be given of screen display processing in which, when a fault occurs in the storage apparatus U1 that is not managed, the storage apparatus U1 is displayed on the screen as the root cause of the fault.
When a controller fault event occurs in the storage apparatus U1, and the fault-cause location is identified in the event analysis processing part C12 shown in FIG. 1 through event correlation according to a rule based on the rule-application-destination management table of FIG. 12, information of an analysis result is sent to the screen display part C2. According to the flow of FIG. 16, the screen display part C2 judges whether the IT apparatus serving as the root cause is a management target, and causes the screen display apparatus M1 to display a proper screen.
In Steps 601 to 603 of FIG. 16, the screen display part C2 obtains, from the rule engine C1, fault analysis result data D1 shown in FIG. 17 which indicates a fault analysis result obtained in the rule engine. Note that the rule engine C1 (in particular, the event analysis processing part C12) performs processes described with reference to S4 of FIG. 2, and FIGS. 4 and 5.
The fault analysis result data D1 includes fault-cause IT-apparatus information which is information related to a fault-cause IT apparatus and a received-event list which is information related to an event in a management-target IT apparatus, received by the operation management system. The fault-cause IT-apparatus information D11 includes information indicating the fault-cause IT apparatus and information related to a component at the fault location. Acquisition of the information related to a component at the fault location depends on how much fault information can be obtained from the fault-cause IT apparatus that is a non-management-target IT apparatus. When fault information cannot be obtained at all, “unknown” is indicated as in FIG. 17. The received-event list includes a received-event transmission source which is information related to the transmission source of the received event which is information related to a correlated received-event in the rule defining this fault; and an event type indicating information related to the contents of the event.
In S604, it is judged whether the fault-cause IT apparatus is a management target or a non-management target, from the fault-cause IT-apparatus information D11 of the obtained fault analysis result data D1. In this embodiment, since the fault-cause IT apparatus is a non-management-target IT apparatus, the flow advances to S605.
In S605, the not-managed IT-apparatus management table of FIG. 11 is searched based on the fault-cause IT-apparatus information D11 of the fault analysis result data D1, and information related to this not-managed IT apparatus is obtained. Then, the flow advances to S606. In this embodiment, information related to the storage apparatus U1 is obtained from L401 of FIG. 11.
In S606, a message indicating that the root cause of the fault that occurred is a not-managed IT apparatus is displayed on the screen, together with the information obtained in S605. As shown in FIG. 18, an example structure of the screen displayed at this time includes a message notifying that the not-managed IT apparatus is the root cause of the fault; a fault analysis result which is the result obtained through analysis of the cause of the fault; and fault information detected by the operation management system for the fault that occurred, such as a received event. A screen display such as a window or a dialog that includes the above items is output to the screen output apparatus M1. FIG. 19 shows an example screen display in a case where the fault in the storage U1 that is a not-managed IT apparatus is the root cause, according to this embodiment. The screen display includes information indicating that the fault-cause IT apparatus is a non management target, and the type of the IT apparatus. For example, the screen display shows that the IT apparatus is an IP-SAN storage apparatus, and the IP address, which is an example of the identification information, of the IT apparatus is 192.168.100.15.
Through the above-described steps, when a fault occurs in the storage apparatus U1, which is a non-management-target IT apparatus, it is possible to handle a case where a fault of an IP-SAN storage, as defined in the rule R1, occurs in a non management target. It is also possible to display a message indicating that the root cause is a non-management-target IP-SAN storage, on the screen.
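The display steps S604 to S606 can be sketched as follows. This is a hedged illustration: the class, function, and dictionary key names are hypothetical, the message wording is invented, the event type string is a placeholder, and the branch for a management-target root cause is not detailed in this passage, so it is reduced to a single header line. Only the "unknown" fault location, the apparatus type, and the address 192.168.100.15 come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class FaultAnalysisResult:
    """Illustrative shape of the fault analysis result data D1."""
    cause_apparatus: str                # part of fault-cause IT-apparatus information D11
    cause_component: str = "unknown"    # "unknown" when no fault information is obtainable
    received_events: List[Tuple[str, str]] = field(default_factory=list)  # (source, event type)


def render_root_cause(
    result: FaultAnalysisResult,
    managed: Set[str],
    not_managed: Dict[str, Dict[str, str]],
) -> str:
    if result.cause_apparatus in managed:                       # S604
        header = f"Root cause: {result.cause_apparatus}"
    else:
        info = not_managed.get(result.cause_apparatus, {})      # S605
        header = (
            f"Root cause is a not-managed IT apparatus: {result.cause_apparatus} "
            f"(type: {info.get('type', 'unknown')}, address: {info.get('address', 'unknown')})"
        )
    lines = [header, f"Fault location: {result.cause_component}"]
    lines += [f"Received event: {src} / {etype}" for src, etype in result.received_events]
    return "\n".join(lines)                                     # S606: text sent to the screen


result = FaultAnalysisResult("U1", received_events=[("N10", "volume access failure")])
message = render_root_cause(
    result,
    managed={"N10"},
    not_managed={"U1": {"type": "IP-SAN storage", "address": "192.168.100.15"}},
)
```

The returned text carries the three parts named for the screen of FIG. 18: the notice that a not-managed apparatus is the root cause, the analysis result, and the received events.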
(Processing Flow for Rule R2)
For the rule R2, the flow will be described according to the embodiment in which the IT system of FIG. 3 is a target.
In S101, since the rule R2 is included, the flow advances to S102. In S102, the rule R2 is read and R2 is marked to indicate that it has been read. In S103, as topology information described in the rule R2 and as the FC-SAN topology of FIG. 4(2), a topology in which a computer T21 serving as a client and having a Fibre-Channel Host Bus Adapter, i.e., FcHba T211, is coupled via an FC switch T22 to a storage apparatus T23 serving as a server and having FcPort T231 which is a Fibre-Channel port is defined in the search condition.
In S104, it is assumed that the computer N13 having FcHba is found as a client IT apparatus.
In S105, since the computer N13 is an IT apparatus that has not been selected, the flow advances to S106.
In S106, the computer N13 is selected and is marked to indicate that it has been selected.
In S107, ConnectedFcPortWWN C502 indicating the WWN of an FC Port, which is a Fibre-Channel port, of the storage apparatus serving as a server to which the computer N13 is coupled is obtained from the computer N13 as shown in FIG. 13.
The connection information of FC-SAN storage apparatuses shown in FIG. 13 is described. The connection information includes, as information for each IT apparatus, the communication identification information of FibreChannel held by a storage apparatus to which the IT apparatus is coupled.
In S108, since ConnectedFcPortWWN which is search information related to the storage apparatus coupled to the computer N13 has not been searched for, the flow advances to S109.
In S109, by using a value specified in C502 in a row L501 as ConnectedFcPortWWN obtained from the computer N13, the storage apparatus having this WWN as an FcPort WWN is searched for.
In S110, as a result of the search in S109, the storage having the value specified in C502 in the row L501 of FIG. 13 as an FcPort WWN was not found in the configuration information of a management target as shown in FIG. 14. Thus, the flow advances to S111.
Information shown in FIG. 14 is described. The information includes the identification information indicating a storage apparatus and the communication identification information used in FibreChannel held by the storage apparatus.
In S111, the storage apparatus U2 having the value specified in C502 in the row L501 of FIG. 13 as an FcPort WWN is found among storage apparatuses that have been found. Thus, the flow advances to S115.
In S115, a message proposing to add the found storage apparatus U2 to the managed IT apparatuses is displayed on the screen. FIG. 10 shows an example screen display used for the rule R1, but the structure of screen display is basically the same and just the message contents are replaced with those for the actual IT apparatus.
In S116, the identification information of the storage apparatus U2 and instruction information to add this apparatus to the management targets are received from the administrator.
In S117, it is judged whether the user added the apparatus to the management targets. In this embodiment, since the user added the apparatus to the management targets, the flow advances to S118.
In S118, information that needs to be obtained as that for a management-target IT apparatus is obtained for the storage apparatus U2 added as a new management target. The information to be obtained as that for a management target includes event information and configuration management information.
In S121, the storage apparatus U2 serving as a management-target IT apparatus and the computer N13 are registered in the rule-application-destination management table as IT apparatuses to which the rule R2 is applied. In this example case, they are registered in the table data structure formed of the column C101 for a rule and the column C102 for storing the list of IT apparatuses to which the rule is applied, shown in FIG. 12.
As described above, with respect to the rule R2, fault analysis for an FC-SAN storage apparatus that is a non-management-target IT apparatus can be performed through the conventional rule-based event correlation.
Note that processing of displaying a message indicating that the FC-SAN storage that is a non-management-target IT apparatus is the root cause of the fault, on the screen based on the fault analysis result data is performed through the steps of FIG. 16 in the same way as the processing of displaying on the screen a message indicating that the non-management-target IP-SAN storage is the root cause of the fault, performed for the rule R1.
Through the process steps described above, when a fault occurs in the storage apparatus U2 that is a non-management-target IT apparatus, also in the rule R2, it is possible to handle the case where a fault of an FC-SAN storage, as defined in the rule R2, occurs in a non management target. It is also possible to display on the screen a message indicating that the root cause is a non-management-target FC-SAN storage.
(Processing Flow for Rule R3)
For the rule R3, the flow will be described according to the embodiment in which the IT system of FIG. 3 is a target.
In S101, since the rule R3 is included, the flow advances to S102. In S102, the rule R3 is read and R3 is marked to indicate that it has been read. In S103, as topology information described in the rule R3 and as the topology of a file server and a client shown in FIG. 4(3), a topology in which the computer T31 serving as a client and having ImportedFileShare T311 which indicates that a file system made available is mounted is coupled via an IP switch T32 to the computer T33 serving as a server and having ExportedFileShare T331 which indicates that the computer T33 has the file system made available to the other computers is defined in the search condition.
In S104, it is assumed that the computer N10 shown in FIG. 3 is found as the client IT apparatus in the topology of FIG. 4(3).
In S105, the computer N10 is the client IT apparatus that has been searched for and that has not been selected. Thus, the flow advances to S106.
In S106, the computer N10 shown in FIG. 3 is selected as the client IT apparatus that has not been selected, and is marked as a selected IT apparatus.
In S107, information of ImportedFileShare indicating the file server from which the file system made available is mounted is obtained as search information for the computer serving as a server IT apparatus opposed to the computer N10 in the topology of FIG. 4(3). Information related to the file server, obtained from the client, is managed in a table of FIG. 15. The table has a data structure which includes a column C701 for a client computer, a column C702 for the identification information related to a file server for the client computer, and a column C703 for the public name of the file server. Note that the information related to a file server, obtained from the client, may be obtained in advance as configuration information in the table of FIG. 15, or may be obtained from the client IT apparatus in the process of S107. In other words, the acquisition of such information needs to be performed before the process of S107 is completed.
Information shown in FIG. 15 is described. The information includes the following information for each file server.
(A) The identification information of the file-server IT apparatus
(B) The identification information and the public names of one or more file servers
In S108, the information related to the file server for the client, obtained in S107, is included in a row L701 of FIG. 15 and has not been searched for. Thus, the flow advances to S109.
In S109, an IT apparatus having the value specified in the column C702, for the identification information of a file server, in the row L701 of FIG. 15, that is, an FQDN of exportfs.domain2.com, is searched for.
In S110, the computer having the FQDN of exportfs.domain2.com is not included in the configuration information T0 of management targets. Thus, the flow advances to S111.
In S111, the computer having the FQDN of exportfs.domain2.com is not included in found resources. Thus, the flow advances to S112.
In S112, an attempt is made to find the computer having exportfs.domain2.com. In the attempt, an IP address is resolved by making an inquiry to the DNS server, the presence of the computer is confirmed by sending a ping to the IP address, and access to the computer is attempted through a remote connection of telnet, ssh, or Windows (registered trademark). In this embodiment, it is assumed that the ping to the IP address corresponding to exportfs.domain2.com returns “success” and the presence thereof is confirmed, but, since authentication information about the server is not held, other accesses fail, thereby preventing login. The flow advances to S114.
In S114, the found computer having exportfs.domain2.com cannot be set to a management target because, although it returns the ping response, information other than the response cannot be obtained therefrom. Thus, the flow advances to S119.
In S119, the computer having exportfs.domain2.com is registered in the not-managed IT-apparatus management table of FIG. 11. Specifically, as shown in L403 of FIG. 11, the information obtained from the client is stored in file-server identification information and service identification information.
In S120, rule application information is generated for the pair of the client computer N10 and the computer U3 having exportfs.domain2.com. Specifically, as shown in L107 of FIG. 12, the computer N10 and the computer U3 that is a not-managed IT apparatus are registered in the list of application-destination IT apparatuses for the rule R3.
As described above, fault analysis can also be performed for the computer U3 that is a not-managed IT apparatus serving as a file server for the computer N10.
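The discovery attempt of S112 and the judgments of S113 and S114 described above can be sketched with the network operations injected as callables, so that the logic can be exercised without a real network. All names here are hypothetical, and treating a successful login as the criterion for S114 is an assumption made for illustration; the description only states that the system management server holds a predetermined criterion.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ip_address: Optional[str]
    reachable: bool
    logged_in: bool

    @property
    def found(self) -> bool:
        # S113: the apparatus is confirmed to exist on the network.
        return self.reachable

    @property
    def manageable(self) -> bool:
        # S114 (assumed criterion): information beyond a ping response is obtainable.
        return self.reachable and self.logged_in


def probe(
    fqdn: str,
    resolve: Callable[[str], Optional[str]],
    ping: Callable[[str], bool],
    login: Callable[[str], bool],
) -> ProbeResult:
    """S112: resolve the name, confirm presence by ping, then try a remote login."""
    ip = resolve(fqdn)
    if ip is None:
        return ProbeResult(None, False, False)
    if not ping(ip):
        return ProbeResult(ip, False, False)
    return ProbeResult(ip, True, login(ip))


# Stubs reproducing the embodiment: the ping succeeds, but login fails because
# no authentication information is held. The IP address is a placeholder.
outcome = probe(
    "exportfs.domain2.com",
    resolve=lambda name: "192.0.2.10",
    ping=lambda ip: True,
    login=lambda ip: False,
)
```

In this case `outcome.found` is true while `outcome.manageable` is false, which matches the path from S114 to S119 taken for the file server.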
Similarly, a description will be given of the processing flow according to the embodiment, in a case where the computer N11 is found as a client IT apparatus in the rule R3 through Steps S101 to S104. Through Steps S105 to S107, information specified in a row L703 of FIG. 15 and related to a file server for the computer N11 is obtained. In S109, since the file server specified in the row L703 of FIG. 15 is not found in the management-target IT apparatuses, the flow advances to S111. In S111, the computer U5 having the IP address specified in the row L703 of FIG. 15 is found in the found resources. Thus, the flow advances to S115.
In S115, a message proposing to add the computer U5 to management targets is displayed on the screen. In S116, a user instruction to set the computer U5 to a management target is received as a user input.
In S117, since the user instruction to set the computer U5 to a management target has been received in S116, the flow advances to S118.
In S118, as information required to set the computer U5 to a management target, monitoring information that includes configuration information, the operation state, and performance information of a device coupled to the computer U5 is obtained in addition to the identification information of the IT apparatus, held as that of a found resource, and information used for access. The obtained information is stored in the configuration information T0 of management targets, in the configuration management C3.
In S121, the data structure shown in a row L108 of FIG. 12 is stored in the rule memory, so that the rule R3 can be applied to a topology which includes the computer N11 that is a managed IT apparatus serving as a client and the computer U5 serving as a file server.
As described above, it is possible to perform fault analysis for the computer U5 serving as a file server, which was a found IT apparatus but was not a management target, according to the flow of FIG. 2. Further, when the flow of FIG. 16 is performed in the screen display part C2, the fault cause can be output to the screen display apparatus M1.
(Processing Flow for Rule R4)
For the rule R4, the flow will be described according to the embodiment in which the IT system of FIG. 3 is a target.
Through Steps S101 to S104, the computer N10 is found as a client IT apparatus in the rule R4. Through Steps S105 to S107, as search information of a DNS server for the computer N10, the IP address 192.168.100.1 of the DNS server is obtained from the computer N10. Through Steps S108 to S110, it is confirmed that the DNS server is not included in the configuration information T0 of management targets in the configuration management C3, by using the obtained IP address 192.168.100.1. The flow advances to S111. In S111, it is judged that the DNS server is not a found IT apparatus. The flow advances to S112. In S112, an attempt is made to access the node having the IP address 192.168.100.1 from the actual IT system. As a result of the access, network connection is confirmed using a ping, but login to the node fails because authentication information is not held. In S114, it is judged that the DNS server cannot be set to a management target. The flow advances to S119. In S119, as shown in L404 of FIG. 11, information of the computer having the IP address 192.168.100.1 is stored and managed as that of a non-management-target IT apparatus and as that of a DNS server with identification information U4. The flow advances to S120. In S120, the computer N10 serving as a client and the computer U4 that is a not-managed IT apparatus serving as a DNS server are stored in the list of application-destination IT apparatuses for the rule R4, as shown in a row L109 of FIG. 12.
Through the above-described steps, it is possible to perform fault analysis for the computer U4, which is a not-managed DNS server, through the conventional rule-based event correlation. It is also possible to identify the not-managed DNS server as the root cause.
The rule R4 can be similarly applied to another IT apparatus shown in FIG. 3 by generating application information for the computer U4 that is a not-managed DNS server.
In the same way as for the other rules in the embodiment, when the flow of FIG. 16 is performed in the screen display part C2, a message indicating that the DNS server that is a not-managed IT apparatus is the root cause of the fault can be displayed on the screen.
Second Embodiment
In a second embodiment of the present invention, the processing procedure of the entire fault-analysis processing flow shown in FIG. 2 in the first embodiment is performed in a manner such that Step S4 b of generating application information in the rule application part C11 is performed after Step S3 b of receiving events and before Step S5 b of event analysis processing performed in the event analysis part C12, as shown in FIG. 20.
The only difference between the first embodiment and the second embodiment is the timing of generating rule application information.
As described above, even when the present invention is implemented with the timing of generating rule application information changed, the advantages are still provided and a message indicating that a non-management-target IT apparatus is the root cause apparatus of a fault can be displayed on the screen.
According to the first and second embodiments, described in the specification of this application, a program that implements, in the system management server which has the processor and the memory and which is coupled to a plurality of information processing apparatuses and the screen output apparatus, analysis of events occurring in the plurality of information processing apparatuses includes a part or all of the following processes.
(a) A configuration information storing process of storing identification information of a server apparatus which is included in the plurality of information processing apparatuses and which is an access target of each of the plurality of information processing apparatuses in order to use a network service as a client, in configuration information held by the memory.
(b) A registration process of registering a plurality of monitored apparatuses which are included in the plurality of information processing apparatuses and from which the system management server obtains event information, in the configuration information held by the memory.
(c) A rule storing process of storing in the memory, when an event that includes a first event type related to the network service and an event that includes a second event type related to the network service, different from the first event type, both occurring in the plurality of information processing apparatuses are detected, correlation analysis rule information indicating that an event corresponding to the first event type can occur due to an event corresponding to the second event type.
(d) An event storing process of storing in the memory, a plurality of pieces of the event information obtained from the plurality of monitored apparatuses.
(e) An event information identifying process of identifying first event information which includes the first event type from among the plurality of pieces of the event information stored in the memory, based on the correlation analysis rule information.
(f) A cause identifying process of identifying, based on the configuration information, a first monitored apparatus which is one of monitored apparatuses that have sent the first event information and a fault cause apparatus which serves as a server apparatus for the first monitored apparatus in the network service corresponding to the first event type.
(g) An analysis result sending process of sending, when the fault cause apparatus is not included in the plurality of monitored apparatuses based on the correlation analysis rule information and the configuration information, information identifying the first monitored apparatus, the first event type, the fault cause apparatus, and the second event type to the screen output apparatus, thereby causing the screen output apparatus to display a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus is estimated to be caused by the fact that an event of the second event type occurred in the fault cause apparatus.
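Processes (c) to (g) can be illustrated with a compact sketch. It is written under assumptions: the class and function names, the event type strings, and the service label are hypothetical, the configuration information is reduced to a mapping from (client, service) to server, and the handling of a monitored server follows the variation described later in this section in a simplified form.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """Correlation analysis rule information of process (c)."""
    service: str
    first_event_type: str    # symptom observed on the client side
    second_event_type: str   # cause expected on the server side


@dataclass(frozen=True)
class Event:
    source: str
    event_type: str


def analyze(
    rule: Rule,
    events: List[Event],
    server_of: Dict[Tuple[str, str], str],
    monitored: Set[str],
) -> List[str]:
    messages = []
    for ev in events:
        if ev.event_type != rule.first_event_type:           # (e) pick first event information
            continue
        server = server_of.get((ev.source, rule.service))    # (f) server for this client
        if server is None:
            continue
        if server in monitored:
            # Monitored server: require its own event of the second event type.
            confirmed = any(
                e.source == server and e.event_type == rule.second_event_type for e in events
            )
            if not confirmed:
                continue
            verdict = "caused by"
        else:
            # (g) no events can be obtained from the server, so the cause is estimated.
            verdict = "estimated to be caused by"
        messages.append(
            f"{ev.event_type} on {ev.source} is {verdict} {rule.second_event_type} on {server}"
        )
    return messages


rule = Rule("iSCSI", "VolumeAccessFailure", "ControllerFault")
report = analyze(
    rule,
    events=[Event("N10", "VolumeAccessFailure")],
    server_of={("N10", "iSCSI"): "U1"},
    monitored={"N10"},
)
```

The point of the sketch is that the server is derived from configuration information about the client, so an apparatus that never sends events can still be named as the probable cause.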
Further, the correlation analysis rule information may include topology condition information indicating a topology condition between a first information processing apparatus which is one of the plurality of information processing apparatuses and in which the first event type has occurred and a second information processing apparatus which is one of the plurality of information processing apparatuses and in which the second event type has occurred; and the fault cause apparatus may be identified based on the topology condition information in the cause identifying process. Through this process, it is possible to present estimation only for an information processing apparatus that is actually used by an information processing apparatus in which an event has occurred, thereby providing a higher level of convenience for the user of the system management server.
The system management server may further include the following processes.
(h) A related-apparatus identifying process of identifying an event-related information processing apparatus which is a server apparatus for the plurality of monitored apparatuses and which is included in the plurality of information processing apparatuses but is not included in the plurality of monitored apparatuses, based on the correlation analysis rule information and the configuration information.
(i) An event-information acquisition permission/inhibition checking process of checking whether event information can be obtained from the event-related information processing apparatus.
(j) An event-information-acquisition-target addition proposing process of sending, when event information can be obtained from the event-related information processing apparatus as a result of the checking, information identifying the event-related information processing apparatus to the screen output apparatus, thereby causing the screen output apparatus to display a message indicating that event information can be obtained from the event-related information processing apparatus.
Through those processes, when event monitoring by the system management server is newly required or permitted because of a change in the management method or in the administrator of an information processing apparatus, registration of that apparatus into the system management server can be proposed promptly, so that the registration is not overlooked.
Further, the event-information acquisition permission/inhibition checking process may be performed based on a result obtained when the system management server accesses, according to a predetermined procedure, an information processing apparatus that is included in the plurality of information processing apparatuses and that has an IP address included in an IP address range specified in advance as a checking range. In order to avoid unauthorized accesses or fraudulent attacks to an information processing apparatus (in particular, a server computer accessed via the Internet), accesses from the outside to this information processing apparatus are monitored in some cases. When an access is made by this checking process, the access may also be recognized as an unauthorized access or a fraudulent attack, by the access monitoring. Therefore, the range of IP addresses of information processing apparatuses that are obviously not targets of event monitoring or the range of IP addresses of information processing apparatuses that can be targets of event monitoring is identified, thereby suppressing such a communication that is falsely recognized as an unauthorized access or a fraudulent attack.
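Restricting the checking process to a range of IP addresses specified in advance can be sketched with Python's standard `ipaddress` module. The function name, the range value, and the second candidate address are hypothetical; 192.168.100.15 is the address used as an example earlier in the description.

```python
import ipaddress
from typing import Iterable


def within_checking_range(address: str, ranges: Iterable[str]) -> bool:
    """Return True when the address falls inside any configured checking range."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(r) for r in ranges)


# Hypothetical checking range; only addresses inside it are ever probed.
CHECK_RANGES = ["192.168.100.0/24"]
candidates = ["192.168.100.15", "203.0.113.7"]
to_probe = [a for a in candidates if within_checking_range(a, CHECK_RANGES)]
```

Filtering before any access is made means that apparatuses outside the range never see a probe, which is how communication that could be mistaken for an unauthorized access is suppressed.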
Further, the fault cause apparatus may be a storage apparatus which has a controller and provides a logical volume; the network service may be a service providing the logical volume by a block access protocol (such as FibreChannel or iSCSI); and the second event type may be the occurrence of a fault in the storage apparatus and the first event type may be a failure in accessing the logical volume.
Further, the fault cause apparatus may be a computer which provides a DNS as the network service, the first event type may be a failure in requesting a DNS, and the second event type may be a disconnection of communication with a DNS server.
Further, the fault cause apparatus may be a file server computer which has an NIC to receive data from at least one of the plurality of information processing apparatuses and which provides a stored file for at least one of the plurality of information processing apparatuses; the network service may be a network file-sharing service for sharing the file stored by the file server computer; and the second event type may be the occurrence of a fault in the file server (for example, the occurrence of a fault in the NIC, the occurrence of a failure in software executed by the processor held by the file server, or the occurrence of a fault in which the communication function of the file server is stopped), and the first event type may be a failure in accessing the file provided by the network file-sharing service.
Further, when the fault cause apparatus is one of the plurality of monitored apparatuses based on the correlation analysis rule information and the configuration information, second event information which includes the second event type and which has been obtained from the fault cause apparatus may be identified from among the plurality of pieces of the event information; and information identifying the first monitored apparatus, the first event information, the fault cause apparatus, and the second event information may be sent to the screen output apparatus, thereby causing the screen output apparatus to display a message indicating that an event corresponding to the first event information that occurred in the first monitored apparatus was caused by an event corresponding to the second event information that occurred in the fault cause apparatus.
Further, the first information processing apparatus may be a computer, and the second information processing apparatus may be a storage apparatus; and the topology condition information may include a combination of communication identification information corresponding to the computer and communication identification information corresponding to the storage apparatus, the combination indicating a connection relation of a topology in which the computer is coupled to the storage apparatus. Note that at least one of an iSCSI name, an IP address, and a WWN used in FibreChannel is a candidate for the communication identification information.
Further, the first information processing apparatus may be a computer, and the second information processing apparatus may be a file server computer which provides a stored file for the plurality of information processing apparatuses by a file-sharing service; and the topology condition information may include a combination of communication identification information corresponding to the computer, and communication identification information corresponding to the file server computer or an export name used to make the file available, the combination indicating a connection relation of a topology in which the computer is coupled to the file server computer.
Further, the first information processing apparatus may be a computer, and the second information processing apparatus may be a DNS server computer which provides a DNS, as a network-sharing service, for the plurality of information processing apparatuses; and the topology condition information may include a combination of communication identification information corresponding to the computer and communication identification information corresponding to the DNS server computer, the combination indicating a connection relation of a topology in which the computer is coupled to the DNS server computer. Note that an IP address or an FQDN is a candidate for each of the communication identification information corresponding to the computer and the communication identification information corresponding to the DNS server computer.
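The three forms of topology condition information described above can be pictured as pairs of communication identification information, one for the computer and one for the apparatus providing the service. The following Python sketch is illustrative only. The class `TopologyCondition`, the function `find_server`, the service labels, and all identifier values (documentation-range IP addresses and example iSCSI names) are assumptions made for this example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TopologyCondition:
    service: str    # hypothetical label: "storage", "file_sharing", or "dns"
    client_id: str  # communication identification information of the computer
    server_id: str  # communication identification information of the serving apparatus


def find_server(conditions: Iterable[TopologyCondition], service: str,
                client_id: str) -> Optional[str]:
    """Return the identifier of the apparatus coupled to the computer for the given service."""
    for cond in conditions:
        if cond.service == service and cond.client_id == client_id:
            return cond.server_id
    return None


# Example entries: an iSCSI name pair, an IP address pair, and an IP address with an FQDN.
CONDITIONS = [
    TopologyCondition("storage", "iqn.2009-01.com.example:host1",
                      "iqn.2009-01.com.example:array1"),
    TopologyCondition("file_sharing", "192.0.2.10", "192.0.2.50"),
    TopologyCondition("dns", "192.0.2.10", "ns1.example.com"),
]
```

For instance, `find_server(CONDITIONS, "dns", "192.0.2.10")` yields the identifier of the DNS server computer even when no event information is obtained from that server, because the pair is derived from the configuration of the computer.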
Furthermore, the system management server may be configured by one or more computers.

Claims (21)

What is claimed:
1. A system comprising:
a plurality of information processing apparatuses; and
a management computer,
wherein the management computer stores correlation analysis rule information, indicating that an event of a second event type is a root cause of an event of a first event type for a network service,
wherein the management computer stores configuration information including at least information about the network service of a plurality of monitored apparatuses,
wherein the plurality of monitored apparatuses are included in the plurality of information processing apparatuses,
wherein the management computer obtains event information from the plurality of monitored apparatuses,
wherein the management computer identifies, from the event information, a first event of the first event type,
wherein the management computer identifies a first monitored apparatus in which the first event occurs, and
wherein the management computer identifies a root cause apparatus which is a server of the network service, based on the correlation analysis rule information and the configuration information, even if the root cause apparatus is not included in the plurality of monitored apparatuses.
2. The system according to claim 1, wherein the management server selects the plurality of monitored apparatuses, each of the plurality of monitored apparatuses having an IP (Internet Protocol) address in a predetermined IP address range.
3. The system according to claim 1,
wherein the root cause apparatus is a storage apparatus,
wherein the network service provides a logical volume of the storage apparatus, and
wherein the second event type is an occurrence of a fault in the storage apparatus, and the first event type is a failure of accessing the logical volume by a computer.
4. The system according to claim 1,
wherein the root cause apparatus is a DNS (Domain Name Service) server,
wherein the network service is a DNS,
wherein the second event type is a fault in the DNS server, and
wherein the first event type is a disconnection of communication for a DNS.
5. The system according to claim 1,
wherein the root cause apparatus is a file server computer,
wherein the network service is a file sharing service,
wherein the second event type is a fault in the file server computer, and
wherein the first event type is an access failure of a file provided by the file sharing service.
6. The system according to claim 1, wherein the management computer identifies the first monitored apparatus, the first event type, the root cause apparatus, and the second event type, and sends information identifying the first monitored apparatus, the first event type, the root cause apparatus, and the second event type to the screen output apparatus for displaying a root cause of the first event of the first event type that occurred in the first monitored apparatus and is estimated to be caused by a not obtained event of the second event type that occurred in the root cause apparatus.
7. The system according to claim 2, wherein the management computer suggests obtaining event information from the root cause apparatus, after checking whether or not the management server is able to obtain information from the root cause apparatus.
8. A management computer comprising:
a memory storing a management program; and
a CPU (Central Processing Unit) that executes the management program,
wherein when executed, the management program causes the CPU to:
store correlation analysis rule information, indicating that an event of a second event type is a root cause of an event of a first event type for a network service;
store configuration information including at least information about the network service of a plurality of monitored apparatuses;
obtain event information from the plurality of monitored apparatuses;
identify, from the event information, a first event of the first event type;
identify a first monitored apparatus in which the first event occurs; and
identify a root cause apparatus which is a server of the network service, based on the correlation analysis rule information and the configuration information, even if the root cause apparatus is not included in the plurality of monitored apparatuses.
9. The management computer according to claim 8, wherein the management program further causes the CPU to select the plurality of monitored apparatuses, each of the plurality of monitored apparatuses having an IP (Internet Protocol) address in a predetermined IP address range.
10. The management computer according to claim 8,
wherein the root cause apparatus is a storage apparatus,
wherein the network service provides a logical volume of the storage apparatus, and
wherein the second event type is an occurrence of a fault in the storage apparatus, and the first event type is a failure of accessing the logical volume by a computer.
11. The management computer according to claim 8,
wherein the root cause apparatus is a DNS (Domain Name Service) server,
wherein the network service is a DNS,
wherein the second event type is a fault in the DNS server, and
wherein the first event type is a disconnection of communication for a DNS.
12. The management computer according to claim 8,
wherein the root cause apparatus is a file server computer,
wherein the network service is a file sharing service,
wherein the second event type is a fault in the file server computer, and
wherein the first event type is an access failure of a file provided by the file sharing service.
13. The management computer according to claim 8,
wherein the management program further causes the CPU to: identify the first monitored apparatus, the first event type, the root cause apparatus, and the second event type; and
send information identifying the first monitored apparatus, the first event type, the root cause apparatus, and the second event type to the screen output apparatus for displaying a root cause of the first event of the first event type that occurred in the first monitored apparatus and is estimated to be caused by a not obtained event of the second event type that occurred in the root cause apparatus.
14. The management computer according to claim 9, wherein the management computer suggests obtaining event information from the root cause apparatus, after checking whether or not the CPU is able to obtain information from the root cause apparatus.
15. A non-transitory machine-readable storage medium tangibly embodying a program for execution on a management computer, the program comprising code causing the management computer to:
store correlation analysis rule information, indicating that an event of a second event type is a root cause of an event of a first event type for a network service;
store configuration information including at least information about the network service of a plurality of monitored apparatuses;
obtain event information from the plurality of monitored apparatuses;
identify, from the event information, a first event of the first event type;
identify a first monitored apparatus in which the first event occurs; and
identify a root cause apparatus which is a server of the network service, based on the correlation analysis rule information and the configuration information, even if the root cause apparatus is not included in the plurality of monitored apparatuses.
16. The non-transitory machine-readable storage medium according to claim 15, wherein the program further causes the management computer to select the plurality of monitored apparatuses, each of the plurality of monitored apparatuses having an IP (Internet Protocol) address in a predetermined IP address range.
17. The non-transitory machine-readable storage medium according to claim 15,
wherein the root cause apparatus is a storage apparatus,
wherein the network service provides a logical volume of the storage apparatus, and
wherein the second event type is an occurrence of a fault in the storage apparatus, and the first event type is a failure of accessing the logical volume by a computer.
18. The non-transitory machine-readable storage medium according to claim 15,
wherein the root cause apparatus is a DNS (Domain Name Service) server,
wherein the network service is a DNS,
wherein the second event type is a fault in the DNS server, and
wherein the first event type is a disconnection of communication for a DNS.
19. The non-transitory machine-readable storage medium according to claim 15,
wherein the root cause apparatus is a file server computer,
wherein the network service is a file sharing service,
wherein the second event type is a fault in the file server computer, and
wherein the first event type is an access failure of a file provided by the file sharing service.
20. The non-transitory machine-readable storage medium according to claim 15, wherein the program causes the management computer to identify the first monitored apparatus, the first event type, the root cause apparatus, and the second event type, and send information identifying the first monitored apparatus, the first event type, the root cause apparatus, and the second event type to the screen output apparatus for displaying a root cause of the first event of the first event type that occurred in the first monitored apparatus and is estimated to be caused by a not obtained event of the second event type that occurred in the root cause apparatus.
21. The non-transitory machine-readable storage medium according to claim 16, wherein the program causes the management computer to suggest obtaining event information from the root cause apparatus, after checking whether or not the management computer is able to obtain information from the root cause apparatus.
US13/211,694 2008-09-30 2011-08-17 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained Active US8479048B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/211,694 US8479048B2 (en) 2008-09-30 2011-08-17 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
JP2008-252093 2008-09-30
JP2008252093A JP5237034B2 (en) 2008-09-30 2008-09-30 Root cause analysis method, device, and program for IT devices that do not acquire event information.
US12/444,398 US8020045B2 (en) 2008-09-30 2009-01-26 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained
PCT/JP2009/000285 WO2010038327A1 (en) 2008-09-30 2009-01-26 Root cause analysis method targeting information technology (it) device not to acquire event information, device and program
US13/211,694 US8479048B2 (en) 2008-09-30 2011-08-17 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Related Parent Applications (2)

Application Number Relation Publication Number Priority Date Filing Date Title
PCT/JP2009/000285 Continuation WO2010038327A1 (en) 2008-09-30 2009-01-26 Root cause analysis method targeting information technology (it) device not to acquire event information, device and program
US12/444,398 Continuation US8020045B2 (en) 2008-09-30 2009-01-26 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Publications (2)

Publication Number Publication Date
US20110302305A1 US20110302305A1 (en) 2011-12-08
US8479048B2 true US8479048B2 (en) 2013-07-02

Family

ID=42073117

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/444,398 Expired - Fee Related US8020045B2 (en) 2008-09-30 2009-01-26 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained
US13/211,694 Active US8479048B2 (en) 2008-09-30 2011-08-17 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US12/444,398 Expired - Fee Related US8020045B2 (en) 2008-09-30 2009-01-26 Root cause analysis method, apparatus, and program for IT apparatuses from which event information is not obtained

Country Status (5)

Country Link
US (2) US8020045B2 (en)
EP (1) EP2336890A4 (en)
JP (1) JP5237034B2 (en)
CN (1) CN101981546B (en)
WO (1) WO2010038327A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120117573A1 (en) * 2008-06-17 2012-05-10 Hitachi, Ltd. Method and system for performing root cause analysis
US9053000B1 (en) * 2012-09-27 2015-06-09 Emc Corporation Method and apparatus for event correlation based on causality equivalence
US9298582B1 (en) 2012-06-28 2016-03-29 Emc Corporation Method and apparatus for performance data transformation in a cloud computing system
US20160170848A1 (en) * 2014-12-16 2016-06-16 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US9389946B2 (en) 2011-09-19 2016-07-12 Nec Corporation Operation management apparatus, operation management method, and program
US9413685B1 (en) 2012-06-28 2016-08-09 Emc Corporation Method and apparatus for cross domain and cross-layer event correlation

Families Citing this family (102)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8381038B2 (en) * 2009-05-26 2013-02-19 Hitachi, Ltd. Management server and management system
JP5385982B2 (en) 2009-07-16 2014-01-08 株式会社日立製作所 A management system that outputs information indicating the recovery method corresponding to the root cause of the failure
US7996723B2 (en) * 2009-12-22 2011-08-09 Xerox Corporation Continuous, automated discovery of bugs in released software
US8411577B2 (en) * 2010-03-19 2013-04-02 At&T Intellectual Property I, L.P. Methods, apparatus and articles of manufacture to perform root cause analysis for network events
DE102010024966A1 (en) * 2010-06-24 2011-07-07 Siemens Aktiengesellschaft, 80333 Method for determination of quality of several hardware units and software modules of exhibiting information-technology (IT) plant, involves determining total core value of IT plant by weighted addition of score values
WO2012014305A1 (en) * 2010-07-29 2012-02-02 株式会社日立製作所 Method of estimating influence of configuration change event in system failure
US8819220B2 (en) 2010-09-09 2014-08-26 Hitachi, Ltd. Management method of computer system and management system
US8364813B2 (en) 2010-11-02 2013-01-29 International Business Machines Corporation Administering incident pools for event and alert analysis
US8386602B2 (en) 2010-11-02 2013-02-26 International Business Machines Corporation Relevant alert delivery in a distributed processing system
US8621277B2 (en) 2010-12-06 2013-12-31 International Business Machines Corporation Dynamic administration of component event reporting in a distributed processing system
US8805999B2 (en) 2010-12-07 2014-08-12 International Business Machines Corporation Administering event reporting rules in a distributed processing system
US8868984B2 (en) * 2010-12-07 2014-10-21 International Business Machines Corporation Relevant alert delivery in a distributed processing system with event listeners and alert listeners
US8737231B2 (en) 2010-12-07 2014-05-27 International Business Machines Corporation Dynamic administration of event pools for relevant event and alert analysis during event storms
EP2602718A4 (en) 2011-03-08 2015-06-10 Hitachi Ltd Computer system management method and management device
WO2012131868A1 (en) * 2011-03-28 2012-10-04 株式会社日立製作所 Management method and management device for computer system
US8756462B2 (en) 2011-05-24 2014-06-17 International Business Machines Corporation Configurable alert delivery for reducing the amount of alerts transmitted in a distributed processing system
US8645757B2 (en) 2011-05-26 2014-02-04 International Business Machines Corporation Administering incident pools for event and alert analysis
US8676883B2 (en) 2011-05-27 2014-03-18 International Business Machines Corporation Event management in a distributed processing system
US9213621B2 (en) 2011-05-27 2015-12-15 International Business Machines Corporation Administering event pools for relevant event analysis in a distributed processing system
US8880943B2 (en) 2011-06-22 2014-11-04 International Business Machines Corporation Restarting event and alert analysis after a shutdown in a distributed processing system
US8392385B2 (en) 2011-06-22 2013-03-05 International Business Machines Corporation Flexible event data content management for relevant event and alert analysis within a distributed processing system
US9419650B2 (en) 2011-06-22 2016-08-16 International Business Machines Corporation Flexible event data content management for relevant event and alert analysis within a distributed processing system
US8713366B2 (en) 2011-06-22 2014-04-29 International Business Machines Corporation Restarting event and alert analysis after a shutdown in a distributed processing system
US20130097272A1 (en) 2011-10-18 2013-04-18 International Business Machines Corporation Prioritized Alert Delivery In A Distributed Processing System
US8887175B2 (en) 2011-10-18 2014-11-11 International Business Machines Corporation Administering incident pools for event and alert analysis
US20130097215A1 (en) 2011-10-18 2013-04-18 International Business Machines Corporation Selected Alert Delivery In A Distributed Processing System
US9178936B2 (en) 2011-10-18 2015-11-03 International Business Machines Corporation Selected alert delivery in a distributed processing system
US8713581B2 (en) 2011-10-27 2014-04-29 International Business Machines Corporation Selected alert delivery in a distributed processing system
CN103283180B (en) * 2011-12-02 2015-12-02 华为技术有限公司 A kind of fault detection method, gateway, subscriber equipment and communication system
US9092329B2 (en) * 2011-12-21 2015-07-28 Sap Se Process integration alerting for business process management
WO2013121529A1 (en) 2012-02-14 2013-08-22 株式会社日立製作所 Computer program and monitoring device
FR2987533B1 (en) * 2012-02-23 2014-11-28 Aspserveur METHOD AND SYSTEM FOR FAULT CORRELATION ANALYSIS FOR A COMPUTER CENTER
JP5684946B2 (en) * 2012-03-23 2015-03-18 株式会社日立製作所 Method and system for supporting analysis of root cause of event
JP5884901B2 (en) * 2012-04-20 2016-03-15 富士通株式会社 Program, information processing apparatus, and event processing method
US8966206B2 (en) 2012-05-07 2015-02-24 Hitachi, Ltd. Computer system, storage management computer, and storage management method
EP2865133A1 (en) 2012-06-25 2015-04-29 Kni M Szaki Tanácsadó Kft. Methods of implementing a dynamic service-event management system
US8954811B2 (en) 2012-08-06 2015-02-10 International Business Machines Corporation Administering incident pools for incident analysis
US8943366B2 (en) 2012-08-09 2015-01-27 International Business Machines Corporation Administering checkpoints for incident analysis
JP5719974B2 (en) * 2012-09-03 2015-05-20 株式会社日立製作所 Management system for managing a computer system having a plurality of devices to be monitored
JP6039352B2 (en) * 2012-10-12 2016-12-07 キヤノン株式会社 Device management system, device management system control method, and program
WO2014068659A1 (en) * 2012-10-30 2014-05-08 株式会社日立製作所 Management computer and rule generation method
US20140297821A1 (en) * 2013-03-27 2014-10-02 Alcatel-Lucent Usa Inc. System and method providing learning correlation of event data
US9619314B2 (en) 2013-04-05 2017-04-11 Hitachi, Ltd. Management system and management program
US9361184B2 (en) 2013-05-09 2016-06-07 International Business Machines Corporation Selecting during a system shutdown procedure, a restart incident checkpoint of an incident analyzer in a distributed processing system
US9170860B2 (en) 2013-07-26 2015-10-27 International Business Machines Corporation Parallel incident processing
US20160004584A1 (en) * 2013-08-09 2016-01-07 Hitachi, Ltd. Method and computer system to allocate actual memory area from storage pool to virtual volume
US9658902B2 (en) 2013-08-22 2017-05-23 Globalfoundries Inc. Adaptive clock throttling for event processing
US9256482B2 (en) 2013-08-23 2016-02-09 International Business Machines Corporation Determining whether to send an alert in a distributed processing system
US9602337B2 (en) 2013-09-11 2017-03-21 International Business Machines Corporation Event and alert analysis in a distributed processing system
US9086968B2 (en) 2013-09-11 2015-07-21 International Business Machines Corporation Checkpointing for delayed alert creation
JP2015076072A (en) * 2013-10-11 2015-04-20 キヤノン株式会社 Monitoring device, monitoring method, and program
US9747156B2 (en) * 2013-10-30 2017-08-29 Hitachi, Ltd. Management system, plan generation method, plan generation program
CN103747028B (en) * 2013-11-27 2018-05-25 上海斐讯数据通信技术有限公司 A kind of method for authorizing user's temporary root authority
GB2536317A (en) * 2013-11-29 2016-09-14 Hitachi Ltd Management system and method for assisting event root cause analysis
US9389943B2 (en) 2014-01-07 2016-07-12 International Business Machines Corporation Determining a number of unique incidents in a plurality of incidents for incident processing in a distributed processing system
CN106062765B (en) 2014-02-26 2017-09-22 三菱电机株式会社 Attack detecting device and attack detection method
US11086897B2 (en) 2014-04-15 2021-08-10 Splunk Inc. Linking event streams across applications of a data intake and query system
US10360196B2 (en) 2014-04-15 2019-07-23 Splunk Inc. Grouping and managing event streams generated from captured network data
US10462004B2 (en) 2014-04-15 2019-10-29 Splunk Inc. Visualizations of statistics associated with captured network data
US10127273B2 (en) 2014-04-15 2018-11-13 Splunk Inc. Distributed processing of network data using remote capture agents
US10366101B2 (en) 2014-04-15 2019-07-30 Splunk Inc. Bidirectional linking of ephemeral event streams to creators of the ephemeral event streams
US11281643B2 (en) 2014-04-15 2022-03-22 Splunk Inc. Generating event streams including aggregated values from monitored network data
US9923767B2 (en) 2014-04-15 2018-03-20 Splunk Inc. Dynamic configuration of remote capture agents for network data capture
US10693742B2 (en) 2014-04-15 2020-06-23 Splunk Inc. Inline visualizations of metrics related to captured network data
US9762443B2 (en) * 2014-04-15 2017-09-12 Splunk Inc. Transformation of network data at remote capture agents
US9838512B2 (en) 2014-10-30 2017-12-05 Splunk Inc. Protocol-based capture of network data using remote capture agents
US10700950B2 (en) 2014-04-15 2020-06-30 Splunk Inc. Adjusting network data storage based on event stream statistics
US10523521B2 (en) 2014-04-15 2019-12-31 Splunk Inc. Managing ephemeral event streams generated from captured network data
JP6330456B2 (en) * 2014-04-30 2018-05-30 富士通株式会社 Correlation coefficient calculation method, correlation coefficient calculation program, and correlation coefficient calculation apparatus
JP2015215639A (en) * 2014-05-07 2015-12-03 株式会社リコー Failure management system, failure management device, equipment, failure management method, and program
JP6287691B2 (en) * 2014-08-28 2018-03-07 富士通株式会社 Information processing apparatus, information processing method, and information processing program
US9596253B2 (en) 2014-10-30 2017-03-14 Splunk Inc. Capture triggers for capturing network data
US10334085B2 (en) 2015-01-29 2019-06-25 Splunk Inc. Facilitating custom content extraction from network packets
US10387230B2 (en) * 2016-02-24 2019-08-20 Bank Of America Corporation Technical language processor administration
US10275182B2 (en) * 2016-02-24 2019-04-30 Bank Of America Corporation System for categorical data encoding
US10430743B2 (en) 2016-02-24 2019-10-01 Bank Of America Corporation Computerized system for simulating the likelihood of technology change incidents
US10223425B2 (en) * 2016-02-24 2019-03-05 Bank Of America Corporation Operational data processor
US10067984B2 (en) * 2016-02-24 2018-09-04 Bank Of America Corporation Computerized system for evaluating technology stability
US10019486B2 (en) * 2016-02-24 2018-07-10 Bank Of America Corporation Computerized system for analyzing operational event data
US10366367B2 (en) 2016-02-24 2019-07-30 Bank Of America Corporation Computerized system for evaluating and modifying technology change events
US10366337B2 (en) 2016-02-24 2019-07-30 Bank Of America Corporation Computerized system for evaluating the likelihood of technology change incidents
US10275183B2 (en) * 2016-02-24 2019-04-30 Bank Of America Corporation System for categorical data dynamic decoding
US10366338B2 (en) 2016-02-24 2019-07-30 Bank Of America Corporation Computerized system for evaluating the impact of technology change incidents
US10216798B2 (en) * 2016-02-24 2019-02-26 Bank Of America Corporation Technical language processor
CN105786635B (en) * 2016-03-01 2018-10-12 国网江苏省电力公司电力科学研究院 A kind of Complex event processing system and method towards Fault-Sensitive point dynamic detection
US10339032B2 (en) * 2016-03-29 2019-07-02 Microsoft Technology Licensing, LLC System for monitoring and reporting performance and correctness issues across design, compile and runtime
US10637745B2 (en) * 2016-07-29 2020-04-28 Cisco Technology, Inc. Algorithms for root cause analysis
CN106778178A (en) * 2016-12-28 2017-05-31 广东虹勤通讯技术有限公司 The call method and device of fingerprint business card
CN106844173A (en) * 2016-12-29 2017-06-13 四川九洲电器集团有限责任公司 A kind of information processing method and electronic equipment
JP6870347B2 (en) * 2017-01-31 2021-05-12 オムロン株式会社 Information processing equipment, information processing programs and information processing methods
CN107562632B (en) * 2017-09-12 2020-08-28 北京奇艺世纪科技有限公司 A/B testing method and device for recommendation strategy
US11075925B2 (en) 2018-01-31 2021-07-27 EMC IP Holding Company LLC System and method to enable component inventory and compliance in the platform
WO2019176038A1 (en) * 2018-03-15 2019-09-19 株式会社Fuji Mounting-related device and mounting system
US10693722B2 (en) 2018-03-28 2020-06-23 Dell Products L.P. Agentless method to bring solution and cluster awareness into infrastructure and support management portals
US10754708B2 (en) 2018-03-28 2020-08-25 EMC IP Holding Company LLC Orchestrator and console agnostic method to deploy infrastructure through self-describing deployment templates
US11086738B2 (en) * 2018-04-24 2021-08-10 EMC IP Holding Company LLC System and method to automate solution level contextual support
US10795756B2 (en) 2018-04-24 2020-10-06 EMC IP Holding Company LLC System and method to predictively service and support the solution
US11599422B2 (en) 2018-10-16 2023-03-07 EMC IP Holding Company LLC System and method for device independent backup in distributed system
US10862761B2 (en) 2019-04-29 2020-12-08 EMC IP Holding Company LLC System and method for management of distributed systems
US11301557B2 (en) 2019-07-19 2022-04-12 Dell Products L.P. System and method for data processing device management
KR20220083221A (en) * 2020-12-11 2022-06-20 삼성전자주식회사 Hub device of iot environment and method for processing event based on local network thereof
US20230259344A1 (en) * 2022-02-16 2023-08-17 Saudi Arabian Oil Company System and method for tracking and installing missing software applications

Citations (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5848143A (en) * 1995-03-02 1998-12-08 Geotel Communications Corp. Communications system using a central controller to control at least one network and agent system
JPH11259331A (en) 1998-03-13 1999-09-24 Nippon Telegr & Teleph Corp <Ntt> Method and device for detecting fault position on network and storage medium for storing network fault position detecting program
US6023507A (en) * 1997-03-17 2000-02-08 Sun Microsystems, Inc. Automatic remote computer monitoring system
US6249755B1 (en) 1994-05-25 2001-06-19 System Management Arts, Inc. Apparatus and method for event correlation and problem reporting
US20010016789A1 (en) * 1999-01-28 2001-08-23 Dieter E. Staiger Electronic control system
US6393386B1 (en) 1998-03-26 2002-05-21 Visual Networks Technologies, Inc. Dynamic modeling of complex networks and prediction of impacts of faults therein
US6393474B1 (en) 1998-12-31 2002-05-21 3Com Corporation Dynamic policy management apparatus and method using active network devices
US20020100017A1 (en) * 2000-04-24 2002-07-25 Microsoft Corporation Configurations for binding software assemblies to application programs
US20030014644A1 (en) 2001-05-02 2003-01-16 Burns James E. Method and system for security policy management
US20030046615A1 (en) 2000-12-22 2003-03-06 Alan Stone System and method for adaptive reliability balancing in distributed programming networks
US20030105537A1 (en) * 2000-12-28 2003-06-05 Norbert Crispin System and method for controlling and/or monitoring a control-unit group having at least two control units
US20030149919A1 (en) * 2000-05-05 2003-08-07 Joseph Greenwald Systems and methods for diagnosing faults in computer networks
US20030214908A1 (en) 2002-03-19 2003-11-20 Anurag Kumar Methods and apparatus for quality of service control for TCP aggregates at a bottleneck link in the internet
US6654782B1 (en) 1999-10-28 2003-11-25 Networks Associates, Inc. Modular framework for dynamically processing network events using action sets in a distributed computing environment
US6678835B1 (en) 1999-06-10 2004-01-13 Alcatel State transition protocol for high availability units
US20040088140A1 (en) * 2002-10-30 2004-05-06 O'konski Timothy Method for communicating diagnostic data
US20040225381A1 (en) * 2003-05-07 2004-11-11 Andrew Ritz Programmatic computer problem diagnosis and resolution and automated reporting and updating of the same
US6820042B1 (en) 1999-07-23 2004-11-16 Opnet Technologies Mixed mode network simulator
US6823299B1 (en) 1999-07-09 2004-11-23 Autodesk, Inc. Modeling objects, systems, and simulations by establishing relationships in an event-driven graph in a computer implemented graphics system
US6829639B1 (en) 1999-11-15 2004-12-07 Netvision, Inc. Method and system for intelligent global event notification and control within a distributed computing environment
JP2004348640A (en) 2003-05-26 2004-12-09 Hitachi Ltd Method and system for managing network
US6854069B2 (en) 2000-05-02 2005-02-08 Sun Microsystems Inc. Method and system for achieving high availability in a networked computer system
US20050086502A1 (en) 2003-10-16 2005-04-21 Ammar Rayes Policy-based network security management
US20050188268A1 (en) * 2004-02-19 2005-08-25 Microsoft Corporation Method and system for troubleshooting a misconfiguration of a computer system based on configurations of other computer systems
US6941247B2 (en) * 1999-12-07 2005-09-06 Carl Zeiss Jena Gmbh Method for monitoring a control system
US20050234824A1 (en) * 2004-04-19 2005-10-20 Gill Rajpal S System and method for providing support services using administrative rights on a client computer
JP2005316728A (en) 2004-04-28 2005-11-10 Mitsubishi Electric Corp Fault analysis device, method, and program
US6968291B1 (en) 2003-11-04 2005-11-22 Sun Microsystems, Inc. Using and generating finite state machines to monitor system status
US20060048017A1 (en) 2004-08-30 2006-03-02 International Business Machines Corporation Techniques for health monitoring and control of application servers
US7028228B1 (en) 2001-03-28 2006-04-11 The Shoregroup, Inc. Method and apparatus for identifying problems in computer networks
JP2006133983A (en) 2004-11-04 2006-05-25 Hitachi Ltd Information processor, method for controlling information processor and program
US7080143B2 (en) 2000-10-24 2006-07-18 Microsoft Corporation System and method providing automatic policy enforcement in a multi-computer service application
JP2006338305A (en) 2005-06-01 2006-12-14 Toshiba Corp Monitor and monitoring program
US20070150480A1 (en) * 2005-04-11 2007-06-28 Hans Hwang Service delivery platform
US7277783B2 (en) * 2001-12-17 2007-10-02 Iav Gmbh Ingenieurgesellschaft Auto Und Verkehr Motor vehicle control system and method for controlling a motor vehicle
JP2007334716A (en) 2006-06-16 2007-12-27 Nec Corp Operation management system, monitoring device, device to be monitored, operation management method, and program
US20090028053A1 (en) * 2007-07-27 2009-01-29 Eg Innovations Pte. Ltd. Root-cause approach to problem diagnosis in data networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197546B1 (en) * 2000-03-07 2007-03-27 Lucent Technologies Inc. Inter-domain network management system for multi-layer networks
US7464298B2 (en) * 2005-07-01 2008-12-09 International Business Machines Corporation Method, system, and computer program product for multi-domain component management
US7801712B2 (en) * 2006-06-15 2010-09-21 Microsoft Corporation Declaration and consumption of a causality model for probable cause analysis

Patent Citations (39)

Publication number Priority date Publication date Assignee Title
US6249755B1 (en) 1994-05-25 2001-06-19 System Management Arts, Inc. Apparatus and method for event correlation and problem reporting
US5848143A (en) * 1995-03-02 1998-12-08 Geotel Communications Corp. Communications system using a central controller to control at least one network and agent system
US6023507A (en) * 1997-03-17 2000-02-08 Sun Microsystems, Inc. Automatic remote computer monitoring system
JPH11259331A (en) 1998-03-13 1999-09-24 Nippon Telegr & Teleph Corp <Ntt> Method and device for detecting fault position on network and storage medium for storing network fault position detecting program
US6393386B1 (en) 1998-03-26 2002-05-21 Visual Networks Technologies, Inc. Dynamic modeling of complex networks and prediction of impacts of faults therein
US6393474B1 (en) 1998-12-31 2002-05-21 3Com Corporation Dynamic policy management apparatus and method using active network devices
US20010016789A1 (en) * 1999-01-28 2001-08-23 Dieter E. Staiger Electronic control system
US6678835B1 (en) 1999-06-10 2004-01-13 Alcatel State transition protocol for high availability units
US6823299B1 (en) 1999-07-09 2004-11-23 Autodesk, Inc. Modeling objects, systems, and simulations by establishing relationships in an event-driven graph in a computer implemented graphics system
US6820042B1 (en) 1999-07-23 2004-11-16 Opnet Technologies Mixed mode network simulator
US6654782B1 (en) 1999-10-28 2003-11-25 Networks Associates, Inc. Modular framework for dynamically processing network events using action sets in a distributed computing environment
US6829639B1 (en) 1999-11-15 2004-12-07 Netvision, Inc. Method and system for intelligent global event notification and control within a distributed computing environment
US6941247B2 (en) * 1999-12-07 2005-09-06 Carl Zeiss Jena Gmbh Method for monitoring a control system
US20020100017A1 (en) * 2000-04-24 2002-07-25 Microsoft Corporation Configurations for binding software assemblies to application programs
US6854069B2 (en) 2000-05-02 2005-02-08 Sun Microsystems Inc. Method and system for achieving high availability in a networked computer system
US20030149919A1 (en) * 2000-05-05 2003-08-07 Joseph Greenwald Systems and methods for diagnosing faults in computer networks
US7080143B2 (en) 2000-10-24 2006-07-18 Microsoft Corporation System and method providing automatic policy enforcement in a multi-computer service application
US20030046615A1 (en) 2000-12-22 2003-03-06 Alan Stone System and method for adaptive reliability balancing in distributed programming networks
US20030105537A1 (en) * 2000-12-28 2003-06-05 Norbert Crispin System and method for controlling and/or monitoring a control-unit group having at least two control units
US7069480B1 (en) 2001-03-28 2006-06-27 The Shoregroup, Inc. Method and apparatus for identifying problems in computer networks
US7028228B1 (en) 2001-03-28 2006-04-11 The Shoregroup, Inc. Method and apparatus for identifying problems in computer networks
US20030014644A1 (en) 2001-05-02 2003-01-16 Burns James E. Method and system for security policy management
US7277783B2 (en) * 2001-12-17 2007-10-02 Iav Gmbh Ingenieurgesellschaft Auto Und Verkehr Motor vehicle control system and method for controlling a motor vehicle
US20030214908A1 (en) 2002-03-19 2003-11-20 Anurag Kumar Methods and apparatus for quality of service control for TCP aggregates at a bottleneck link in the internet
US20040088140A1 (en) * 2002-10-30 2004-05-06 O'konski Timothy Method for communicating diagnostic data
US20040225381A1 (en) * 2003-05-07 2004-11-11 Andrew Ritz Programmatic computer problem diagnosis and resolution and automated reporting and updating of the same
JP2004348640A (en) 2003-05-26 2004-12-09 Hitachi Ltd Method and system for managing network
US20050086502A1 (en) 2003-10-16 2005-04-21 Ammar Rayes Policy-based network security management
US6968291B1 (en) 2003-11-04 2005-11-22 Sun Microsystems, Inc. Using and generating finite state machines to monitor system status
US20050188268A1 (en) * 2004-02-19 2005-08-25 Microsoft Corporation Method and system for troubleshooting a misconfiguration of a computer system based on configurations of other computer systems
US20050234824A1 (en) * 2004-04-19 2005-10-20 Gill Rajpal S System and method for providing support services using administrative rights on a client computer
JP2005316728A (en) 2004-04-28 2005-11-10 Mitsubishi Electric Corp Fault analysis device, method, and program
US20060048017A1 (en) 2004-08-30 2006-03-02 International Business Machines Corporation Techniques for health monitoring and control of application servers
US20060230122A1 (en) 2004-11-04 2006-10-12 Hitachi, Ltd. Method and system for managing programs in data-processing system
JP2006133983A (en) 2004-11-04 2006-05-25 Hitachi Ltd Information processor, method for controlling information processor and program
US20070150480A1 (en) * 2005-04-11 2007-06-28 Hans Hwang Service delivery platform
JP2006338305A (en) 2005-06-01 2006-12-14 Toshiba Corp Monitor and monitoring program
JP2007334716A (en) 2006-06-16 2007-12-27 Nec Corp Operation management system, monitoring device, device to be monitored, operation management method, and program
US20090028053A1 (en) * 2007-07-27 2009-01-29 Eg Innovations Pte. Ltd. Root-cause approach to problem diagnosis in data networks

Non-Patent Citations (1)

Title
Forgy, "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem," Artificial Intelligence 19 (1982), pp. 17-37.

Cited By (14)

Publication number Priority date Publication date Assignee Title
US20120117573A1 (en) * 2008-06-17 2012-05-10 Hitachi, Ltd. Method and system for performing root cause analysis
US8583581B2 (en) * 2008-06-17 2013-11-12 Hitachi, Ltd. Method and system for performing root cause analysis
US8732111B2 (en) 2008-06-17 2014-05-20 Hitachi, Ltd. Method and system for performing root cause analysis
US20140229419A1 (en) * 2008-06-17 2014-08-14 Hitachi, Ltd. Method and system for performing root cause analysis
US8990141B2 (en) * 2008-06-17 2015-03-24 Hitachi, Ltd. Method and system for performing root cause analysis
US9389946B2 (en) 2011-09-19 2016-07-12 Nec Corporation Operation management apparatus, operation management method, and program
US9298582B1 (en) 2012-06-28 2016-03-29 Emc Corporation Method and apparatus for performance data transformation in a cloud computing system
US9413685B1 (en) 2012-06-28 2016-08-09 Emc Corporation Method and apparatus for cross domain and cross-layer event correlation
US9053000B1 (en) * 2012-09-27 2015-06-09 Emc Corporation Method and apparatus for event correlation based on causality equivalence
US20160170848A1 (en) * 2014-12-16 2016-06-16 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US9946614B2 (en) * 2014-12-16 2018-04-17 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US20180239679A1 (en) * 2014-12-16 2018-08-23 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US10795784B2 (en) * 2014-12-16 2020-10-06 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US11301342B2 (en) 2014-12-16 2022-04-12 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network

Also Published As

Publication number Publication date
US20110302305A1 (en) 2011-12-08
CN101981546B (en) 2015-04-01
US8020045B2 (en) 2011-09-13
WO2010038327A1 (en) 2010-04-08
JP5237034B2 (en) 2013-07-17
EP2336890A4 (en) 2016-04-13
JP2010086115A (en) 2010-04-15
CN101981546A (en) 2011-02-23
US20100325493A1 (en) 2010-12-23
EP2336890A1 (en) 2011-06-22

Legal Events

Date Code Title Description
STCF Information on status: patent grant. Free format text: PATENTED CASE
FEPP Fee payment procedure. Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
FPAY Fee payment. Year of fee payment: 4
MAFP Maintenance fee payment. Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8