WO2001061492A1

WO2001061492A1 - Method and apparatus for integrating response time and enterprise management

Info

Publication number: WO2001061492A1
Application number: PCT/US2001/005066
Authority: WO
Inventors: Eric Stinson; Lundy M. Lewis
Original assignee: Aprisma Management Technologies, Inc.
Priority date: 2000-02-16
Filing date: 2001-02-16
Publication date: 2001-08-23
Also published as: AU2001241523A1

Abstract

A method and apparatus for managing data, voice, and video networks through the integration of response time (210) and general enterprise management (220). A structured collaboration between these applications (210, 220) allows for better management of user applications and services, where the central concept is application/service response time (210). Alarms (250) generated by unacceptable response time measurements are sent from an RTM application (210) to an enterprise manager (220) where they are correlated (255) with other crucial measurements in the enterprise, including parameters from network devices, traffic, computer systems, and applications (230). The enterprise manager (220) utilizes this correlated information to identify and correct the causes of network and application performance degradation. An optional off-line capacity planning tool (280) may also be utilized by the enterprise manager (220) to predict problems based on historical data and to model the result of proposed system changes.

Description

New Title: Method and Apparatus for Integrating Response Time and Enterprise Management

Related Applications

This application claims priority to United States Provisional Application Ser. No. 60/183,081, filed February 16, 2000.

Field of the Invention

The present invention relates to network management and, in particular, to a method and apparatus for integrating response time management and enterprise management in order to more efficiently manage data, voice, and video networks and applications.

Background

As network management matures and networked applications become more critical to business success, the performance of the network and applications is also becoming more critical. In the past, companies were satisfied as long as the network and applications were simply up and running. Now, however, the speed of networks and applications is becoming enormously important to businesses. According to Enterprise Network Management and Downtime Costs 1999 by Infonetics Research, service degradations cost companies an average of $900,000 per year. This cost includes the prorated salaries of employees who experienced productivity loss caused by service degradation, as well as the lost revenue that would have been generated by those affected employees.

This total does not include losses of the type which would be seen if an e-commerce web site were down, which obviously could be much greater. For example, if a user were attempting to purchase something from an e-commerce site and the site was too slow, then the user might go to a different site to make the purchase. That obviously is an immediate loss of revenue for the company but, additionally, that potential customer might also continue to go back to the other site, thus compounding the first company's loss. In another example, if a stockbroker needed to enter a large trade into the computer for an investor and the broker encountered poor application or network performance, then that trade might never occur. This leads to a decrease in customer satisfaction, as well as to the direct loss of the commission revenue. These are just a few of many possible examples of poor system performance leading to lost revenue; looking around in nearly any business environment it is possible to identify areas where poor network and/or application performance would cost the organization money.

Monitoring just the network or just the servers doesn't guarantee that a critical business application will perform at required levels. Since business application performance is becoming key to a business' success, it needs to be independently monitored. Response time management (RTM) applications solve this problem by measuring the end-to-end performance of the application from a client, through the network, to the server, and back again, thus representing the actual experience of the application user. This is therefore a true measure of how well the systems are working together and their ability to satisfactorily perform business processes. Further, RTM applications can be used to measure the performance levels of different business applications and to set performance thresholds per application or per groups of users of a specific application. This provides the flexibility needed to ensure that critical applications and users receive the resources necessary for optimum performance. Hence, response time management is becoming a key tool for a business because it keeps systems performing well, thus avoiding the many costs of service degradation.

Performance is, in fact, critical for more applications than just e-commerce. Businesses have come to rely on applications such as SAP, BAAN and PeopleSoft, as well as various custom applications. Many of these applications are required for the business to operate efficiently. In addition, email and web access are becoming very important, sometimes even critical, to the effective performance of many jobs. RTM applications are already being used to monitor these applications and will be deployed with increasing frequency in the future in order to ensure application performance.

Response time management also plays a large role in the implementation of Service Level Management (SLM) in networks. Response Time Management is used in Service Level Management to measure the end-to-end performance of application transactions across the network. The response time measurements allow examination of the latency of the client, network, and server in order to identify performance degradation in the system before it can cause loss of productivity and/or revenue. This proactive performance management allows problems to be fixed before they either spread throughout the business or worsen, causing lost revenue through e-commerce, lost productivity through employees needing to wait for applications to respond to requests, etc.

Because businesses are discovering the true costs of performance degradation, they are starting to implement Service Level Agreements (SLAs) in order to hold information technology (IT) and service providers accountable for the performance of critical business tools. SLAs are being applied to both internal service providers (IT groups) and external service providers. These agreements typically include monthly requirements for minimum availability, time to repair, etc. They also have begun to include performance metrics, such as response time and throughput. When an external service provider violates an SLA, there are typically penalties that the service provider must pay. For internal IT groups, bonuses and other compensation may be related to meeting SLAs.

SLAs create a need for RTM applications for both the service provider/IT group and the customer. The service provider/IT group needs RTM applications in order to measure its own performance, both to ensure that performance stays at acceptable levels and to report performance to customers. If performance starts to degrade, the IT group needs to be able to make changes to keep the performance within the limits of the SLAs. The IT group may also use RTM applications for trending, reporting, and to assist in the troubleshooting of problems. Since every user's perception of the system is different, a user's call to a helpdesk stating that the connection is slow normally doesn't provide the IT staff with much real information unless there is some type of "real" measurement to back it up. RTM is proving to be a good method for determining whether there is conformance with an SLA; in fact, without RTM it is difficult to determine whether or not an SLA is actually being met.

Customers use RTM applications in order to make their own determination regarding whether or not the SLA requirements are being met. Customers can also use RTM applications to ascertain whether or not they are paying for unnecessary services. RTM reporting to customers typically shows system performance (maximum, minimum, mean), where and when thresholds were crossed, etc.
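By way of illustration only, the kind of customer-facing report described above (maximum, minimum, mean, and where thresholds were crossed) might be computed as in the following minimal sketch. The function name `summarize_response_times`, the tuple layout of the samples, and the millisecond units are assumptions made for this example and are not taken from any particular RTM product.

```python
from statistics import mean


def summarize_response_times(samples, threshold_ms):
    """Summarize response time samples for a report.

    `samples` is a list of (timestamp, response_ms) tuples; this layout
    is a hypothetical one chosen for the sketch.
    """
    values = [ms for _, ms in samples]
    # A "crossing" here is any sample whose response time exceeds the threshold.
    crossings = [(ts, ms) for ts, ms in samples if ms > threshold_ms]
    return {
        "max_ms": max(values),
        "min_ms": min(values),
        "mean_ms": mean(values),
        "threshold_crossings": crossings,
    }
```

Each crossing keeps its timestamp, so the report can state when a threshold was exceeded as well as how often.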

To ensure service level guarantees, more than just alarms and reporting are necessary; proactive management is required. In addition to measuring end-to-end latency through the network and applications, RTM tools frequently provide threshold monitoring, trend analysis, and reporting on that end-to-end performance. Some also monitor such things as availability, CPU utilization, disk space, etc., as well as reporting on information from RMON and RMON II probes. Further, many of them provide immediate reports and traps whenever a threshold is crossed, providing data on where the actual problem may be and suggesting possible solutions.

Currently, RTM applications perform trending by projecting historical data into the future, in order to ascertain when thresholds are in danger of being reached. However, the alarm generated when a threshold is, or is about to be, crossed is still sent to a human administrator who then needs to investigate the problem and make a determination of what can be done to solve it. Frequently, this takes longer than is acceptable under the relevant SLA and the SLA is thereby violated.
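As a hedged illustration of trending by projection, the sketch below fits a least-squares straight line to historical response time samples and estimates when that line would reach a threshold. The choice of a linear fit, the function name `projected_crossing_time`, and the sample layout are assumptions for this example; actual RTM products may use other projection methods.

```python
def projected_crossing_time(samples, threshold_ms):
    """Estimate when a linear trend through the samples reaches the threshold.

    `samples` is a list of (time, response_ms) pairs in any consistent
    time unit. Returns the projected time, or None when the trend is
    flat or improving, or when there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    if slope <= 0:
        return None  # response time is not trending upward
    intercept = mean_y - slope * mean_t
    return (threshold_ms - intercept) / slope
```

If the fitted line is already above the threshold, the returned time lies in the past, which a caller could treat as a threshold that has already been crossed.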

Several vendors today are already building RTM products that measure the response time (latency) of an application or service (e.g. email, database applications, and electronic commerce). At periodic intervals, application response time is measured and the value (i) is logged into a historical database for later analysis and/or (ii) causes an alarm if it is out of range. These applications are useful because they demonstrate to consumers and service providers whether the quality of the service (defined in terms of response time) is healthy or not. However, this response time is not correlated with any other aspects of the enterprise. Since poor response time is usually caused by a malfunction or degradation in the various enterprise components, such as network devices, traffic over different kinds of media, computer system performance, and application performance, this limitation squanders an opportunity to uncover the cause behind the unacceptable response time measurements.
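The measure, log, and alarm cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `measure_once`, the dictionary used as an alarm, and the injectable `clock` parameter are all hypothetical, and the "historical database" is reduced to an in-memory list.

```python
import time


def measure_once(transaction, low_ms, high_ms, history, clock=time.perf_counter):
    """Time one transaction, log the value, and return an alarm if out of range.

    `transaction` is any callable standing in for the application request.
    `history` is a list standing in for the historical database.
    Returns an alarm dictionary, or None when the value is in range.
    """
    start = clock()
    transaction()
    elapsed_ms = (clock() - start) * 1000.0
    history.append(elapsed_ms)  # (i) log for later analysis
    if not (low_ms <= elapsed_ms <= high_ms):
        # (ii) raise an alarm because the value is out of range
        return {
            "type": "response_time_alarm",
            "elapsed_ms": elapsed_ms,
            "range_ms": (low_ms, high_ms),
        }
    return None
```

A scheduler would call `measure_once` at the periodic interval; the scheduling itself is left out of the sketch.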

There are multiple approaches to end-to-end RTM. These methods are broken into two main categories: agent-based and probe-based. Examples of agent-based approaches are:

1. Actual user transactions

2. Artificial transactions with real applications

3. Simulated transactions between two agents

Examples of probe-based approaches are:

1. Port response time

2. Actual user transactions

In agent-based approaches, an agent is placed on end stations in order to measure performance. In probe-based approaches, dedicated equipment, such as RMON II probes, is used to measure performance.

As has been discussed, application performance is already becoming very important for business success and will become critical as businesses become increasingly reliant on networked applications. While RTM is currently in only the initial stages of deployment for some businesses, many of those are now working toward large-scale implementations. Other businesses already have large-scale RTM implementations. Today, businesses primarily use RTM to keep critical applications and users performing well. They use it to help identify network and application problems, through performance measurement, so that the problems can be corrected in order to attain optimal performance of critical applications. Once problems are found, the RTM applications are then used to troubleshoot the problems by providing key information leading to the formulation of solutions. These performance measurements are done for internal applications as well as for e-commerce applications, in order to keep businesses profitable. Further, RTM is used for trending to help predict future problems so that trouble areas can be identified and addressed before problems occur.

In the future, users will continue to use RTM tools to troubleshoot problems and keep their systems running at peak performance. Enhanced troubleshooting capabilities provided by RTM applications will allow businesses to save money on IT resources and keep the time to repair performance problems short. As SLAs become more prevalent between service providers and customers and between internal IT organizations and the users of the systems they manage, there needs to be a way to manage to the SLA requirements. RTM is expected to play a large role in SLA management; for example, RTM applications will allow service providers to provide SLA reports to customers demonstrating that they met the SLA over the month. They will also provide customers with a way to perform their own SLA tests and reports so that they can verify for themselves that the service providers or internal IT departments are living up to their commitments.

However, as previously mentioned, existing technologies do not allow the correlation of response time management with other components of the enterprise, such as network devices, traffic over different kinds of media, computer system performance, and application performance. This is a deficiency, as poor response time is usually caused by some corresponding malfunction or degradation in these other components. This deficiency is therefore a hindrance to quickly and efficiently finding and correcting the cause, or causes, of unacceptable response time measurements.

Objects of the Invention

Accordingly, a primary object of the present invention is to provide a method and apparatus by which to efficiently manage data, voice, and video networks through the integration of response time management and general enterprise management, including network, system, traffic, and application management. A particular object of this invention is to provide a way to use enterprise management to determine the cause or causes of poor response time measurements. Another particular object of the present invention is to provide a mechanism for automatic detection and correction of the causes of network and application service degradation. A further object of this invention is to provide a mechanism for collaboration between three entities: a response time management application, an enterprise management application, and a capacity planning application in order to more efficiently manage networks and applications.

Summary

A method and apparatus are described for more effective management of data, voice, and video networks through the integration of response time management (RTM) and general enterprise management. The integration of RTM with a network management system provides for automatic detection and correction of the causes of network and application service degradation. In one aspect of the invention, signals generated by unacceptable response time measurements are sent from the RTM application to the enterprise manager where they are correlated with other crucial measurements in the enterprise, including parameters from network devices, traffic, computer systems, and applications. The enterprise manager may then utilize this correlated information to identify and correct the causes of network and application performance degradation.

In another aspect of the invention, a collaboration between an RTM application, an enterprise manager, and an off-line capacity planning database allows for even better management of user applications and services. Alarms and/or other parameters are sent from the RTM application to the enterprise manager, where they are correlated with other crucial measurements in the enterprise and utilized to identify and correct the causes of network and application performance degradation. In addition, an off-line database/capacity planner also receives these parameters from the enterprise manager and the RTM application and is later utilized by the enterprise manager to predict problems based on historical data and/or to model the result of proposed system changes.
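The text above does not specify how the enterprise manager correlates a response time alarm with other measurements. Purely as one hypothetical illustration, the sketch below selects enterprise measurements that were over their own limits within a time window around the RTM alarm. The function name `correlate`, the dictionary keys, and the five-minute default window are all assumptions made for this example.

```python
def correlate(rtm_alarm_time, measurements, window_s=300):
    """Return measurements that may explain an RTM alarm.

    `measurements` is a list of dictionaries with the hypothetical keys
    "time", "component", "metric", "value", and "limit". A measurement is
    kept when it falls within `window_s` seconds of the alarm and its
    value exceeds its own limit.
    """
    return [
        m
        for m in measurements
        if abs(m["time"] - rtm_alarm_time) <= window_s and m["value"] > m["limit"]
    ]
```

A real enterprise manager would likely also weigh network topology and dependency information; this sketch uses time proximity alone.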

Brief Description of the Drawings

Fig. 1 is a state diagram of an embodiment of an integrated response time and enterprise management system according to the present invention;

Fig. 2 is a block diagram of an embodiment of an integrated response time and enterprise management system according to the present invention;

Fig. 3 illustrates the operation of the present invention utilizing a real-time alarm database; and

Fig. 4 illustrates the operation of the present invention utilizing an off-line capacity planning database.

Detailed Description

The present invention is a novel and non-obvious method and apparatus for the integration of response time management (RTM) and enterprise management (including network, systems, traffic, and application management) in order to provide for automatic detection and correction of the causes of network and application service degradation. The invention also provides a mechanism for collaboration between three entities: an RTM agent, an enterprise manager, and a capacity planning agent, in order to effectively utilize historical system parameters for future trending and system modeling.

As IT system performance becomes more critical, RTM becomes more critical as well. Businesses continue to become increasingly reliant on e-commerce, both for revenue and to maintain a competitive appearance. As reliance increases, so too does the need to manage the performance of these systems. For example, businesses not only want to measure and control the performance of their own e-commerce systems, but they also want to test the performance of the e-commerce systems of competitors. Better performance may then be used as a competitive advantage or, if the performance of a particular business' systems is poorer than that of a competitor's systems, performance measurement can serve to alert the business to the need for enhancements to its own system. Integrated RTM and enterprise management can therefore play a key role in helping businesses stay competitive.

As stated before, existing technologies do not currently allow the correlation of response time measurement with the performance of specific components of the enterprise such as network devices, traffic over different kinds of media, computer system performance, and application performance. This is a serious deficiency, as poor response time (latency) is usually caused by some corresponding malfunction or degradation in these components. The present invention overcomes this problem by providing a method and apparatus for integrating and correlating data from RTM applications and general enterprise management applications. This allows automatic determination and correction of the exact cause or causes of network problems. Any type of problem in a data, voice, or video network that can be detected and/or corrected by either an RTM application or an enterprise management application can now be resolved through use of the present invention. Types of network problems that can be resolved therefore include, but are not limited to, fault, configuration, accounting, performance, security, and service problems.

To measure the performance between network entities, the invention utilizes RTM. The RTM of the invention can be between any of the entities in the network environment, including, but not limited to, two computers, keystroke-to-computer, a computer and an application, a computer and a network edge device, and two network edge devices. The entity whose latency is measured through RTM can be any data, voice, or video entity, including, but not limited to, an alarm, alert, event, or other signal, a packet, a cell, or the appearance of an item on a screen.

Besides measuring end-to-end latency through a network and applications, RTM tools typically provide thresholds, trending, and reporting on end-to-end performance. Some also monitor things like availability, CPU utilization, disk space, etc., as well as reporting on information from RMON and RMON II probes. Further, many RTM tools provide immediate reports and traps whenever a threshold is crossed, giving data which can be used to determine where the actual problem may be and possible ways to fix it.

To provide proactive management, the invention uses enterprise management applications to control network resources through such things as provisioning and Quality of Service (QoS) policies, in order to head off loss of performance by making automatic changes to the system whenever a threshold is in danger of being crossed. In one implementation, this is accomplished by the RTM application sending a trap to the enterprise manager with enough information on the performance problem to enable the enterprise manager to make a reasonable determination of what can be done to minimize or avoid the problem.

As previously discussed, there are several different approaches to end-to-end RTM. The methods are broken into two main categories: agent-based and probe-based. Any of these approaches may be used in the present invention.

Agent technology can be used in both active and passive environments. Agents offer many advantages for response time and performance management. Not only do they measure performance from the user's perspective, but they can also gather statistics, such as CPU utilization, memory utilization, etc., about the machine where they are installed. The agents used typically have a very small footprint for both memory and CPU, allowing them to be deployed widely without affecting a user's perception of application performance, even when the agent is on the user's end station.

Agents can be installed on end user machines, on servers, and/or on dedicated machines used primarily for performance measurement. Agents can even be installed on networking equipment, such as routers and switches, in order to measure performance from those points in the network. The deployment strategy selected will generally depend on the structure of a company's IT network and applications and on other organizational particulars. Although in some cases agents are installed on every end station, they are typically installed only on a portion of the end stations in order to provide a representative sampling of performance.

Agent-based RTM approaches include using actual user transactions, using artificial transactions with real applications, and using simulated transactions between two agents.

In the agent-based actual user transaction approach to RTM, agents are installed on end stations throughout the network. Each agent passively measures the response time of the actual transactions initiated from the end station on which it is installed. This approach measures the "true" end user experience for each of the end stations on which an agent is installed. By monitoring the true end user experience, information is not only available for assessment of current service levels, but it is also available for troubleshooting problems when a user calls the helpdesk or for proactively identifying when a user is experiencing difficulties so that the cause of the difficulties can be corrected before that user calls the helpdesk. Though typical implementations do not place an agent on every desktop, doing so can be a very powerful solution. Products that offer this approach at the current time include Lucent's VitalSuite, Candle's ETEWatch, FirstSense's Enterprise, and Ganymede's Pegasus Application Monitor.

Typically, many of these agents gather information on the performance of each transaction or application and correlate the gathered data with other statistics from the end station, such as CPU utilization, Disk I/O, etc. Some agents allow the helpdesk to connect directly to a user's machine in order to perform diagnostics on the machine, through the agent, for enhanced troubleshooting. Products currently having this capability include Lucent's VitalSuite and Peregrine's ServiceCenter. Because agents track exactly what the user is doing, the helpdesk staff can sometimes determine that problems are being caused by things the user or a particular application is not doing optimally. By way of example, suppose a user builds some SQL queries to retrieve information from a database and experiences slow response time. By investigating these transactions using the agent, it would be possible for the helpdesk staff to determine whether the queries are not built for optimal performance. If so, once the queries are changed, the response time should improve.

Another approach to RTM using agents has an agent measure the response time when "artificial transactions" are sent to the actual application server. Artificial transactions are actually real application transactions, but they are being used specifically for performance management. Since an artificial transaction is the same as a real transaction that a user would make with the application, it can be used to provide a measure of the "true" end-to-end response time through the client, network, server, and back to the user. This approach allows test transactions to be scheduled at specific times and provides for good management of application performance. Products offering this functionality at present include NextPoint's S3 and Lucent's VitalSuite.

There are some specific advantages of using agent-based artificial transactions with real applications as compared to only measuring the performance of actual user transactions. The artificial transaction approach uses a group of transactions to measure response time. This group remains a constant and is both repeatable and quantifiable. With the agent-based actual user transaction approach, a user may, for example, make two different database queries, one requiring more parsing on the server than the other. The query requiring more parsing will cause a longer response time, as expected. Even though this transaction is supposed to take longer, it may incorrectly trigger an alarm and/or show up as a violation of the applicable SLA. Since the artificial transaction approach uses the same group of transactions over and over, this problem doesn't occur, allowing conformance to the SLA to be measured more consistently.

As previously discussed, the artificial transaction approach uses scheduled transactions with real applications. These can be set up to take place during specific time frames, for example, during business hours, during nightly backups, etc. With the agent-based actual user transaction approach, response time can only be measured while the user is actually making transactions. As an example of why this may be less than optimal, suppose a particular site deploys agents on 10% of its end stations in order to monitor application response time. In a department consisting of 30 people, there are therefore three end stations with agents. On a particular day, suppose that one of the users with an agent on their end station is out sick and another with an agent is writing a report that does not require the normal application load on the network. If an artificial transaction approach with real applications is used, all three end stations are still being used to measure response time, providing the desired 10% coverage. If an agent-based actual user transaction approach is used, only one end station is measuring response time on this day, providing only about 3% coverage (1 of 30 end stations) for response time measurements. In fact, while that one employee is on a long lunch break, there will be no response time management at all for that department.
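The coverage figures in the example above follow from simple arithmetic, reproduced here as a short sketch. The function name `measurement_coverage` is an assumption made for illustration.

```python
def measurement_coverage(total_stations, measuring_agents):
    """Percentage of end stations that are actively measuring response time."""
    return 100.0 * measuring_agents / total_stations


# Artificial transactions: all three agents keep measuring.
scheduled = measurement_coverage(30, 3)
# Actual user transactions: only one agent's user is active that day.
passive = measurement_coverage(30, 1)
# During that user's lunch break, no agent is measuring.
lunch_break = measurement_coverage(30, 0)
```

With 30 end stations, three measuring agents give 10% coverage, one gives roughly 3.3%, and none gives 0%.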

The simulated transactions between two agents approach to RTM takes pairs of agents and measures the response time between them. The first agent sends a simulated application transaction through the network infrastructure to another agent, then the second agent replies to the first. The first agent measures the response time for this transaction. Ganymede's Pegasus Network Monitor uses this approach, simulating transactions based on the protocol, packet size, etc., of true application transactions. This approach provides a measurement that is repeatable and can be scheduled. It also provides a very good representation of how the network handles the application traffic since it simulates the true application and can measure parameters such as latency, throughput, etc. However, although this approach is excellent at gauging the network performance, it is not able to provide information, based on such things as CPU utilization, I/O, etc., about how the "true" application is performing on the server and client.
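The two-agent exchange described above can be sketched as follows. This is an illustration under stated assumptions rather than a description of any vendor's product: an in-process socket pair stands in for the network path, a thread stands in for the second agent, and the names `responder` and `measure_round_trip` are hypothetical.

```python
import socket
import threading
import time


def responder(sock, size):
    """Second agent: read one simulated request of `size` bytes and echo it back."""
    received = b""
    while len(received) < size:
        chunk = sock.recv(size - len(received))
        if not chunk:
            return
        received += chunk
    sock.sendall(received)


def measure_round_trip(payload_size=512):
    """First agent: send a simulated transaction and time the reply.

    Returns (elapsed_ms, reply_length). The payload size stands in for
    the packet size of the application transaction being simulated.
    """
    first, second = socket.socketpair()
    with first, second:
        worker = threading.Thread(target=responder, args=(second, payload_size))
        worker.start()
        payload = b"x" * payload_size
        start = time.perf_counter()
        first.sendall(payload)
        reply = b""
        while len(reply) < payload_size:
            chunk = first.recv(payload_size - len(reply))
            if not chunk:
                break
            reply += chunk
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        worker.join()
    return elapsed_ms, len(reply)
```

Because the payload and schedule are under the agents' control, repeated calls give the repeatable measurement the text describes, although nothing here reflects server-side CPU or I/O load.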

Probes have been used in network management for over ten years. There are many types of probes, including both proprietary probes and standards-based probes such as RMON and RMON II. The RMON II working group is currently creating standards for RTM. Companies such as NetScout and Progress Software, among others, are currently implementing probe-based RTM solutions. Probe-based approaches to RTM include using port response time and using actual user transactions.

Port response time is the simplest approach to probe-based RTM. In this approach, the probe actively "wakes up" an application-specific TCP or UDP port on another device and checks the response time (latency) through the network. This basic measurement does not take into account any CPU congestion on the server, disk I/O problems, etc., and therefore does not provide an accurate representation of the true response time nor any throughput information for a specific application. This approach can, however, be used by an IT manager to interrogate a device on the network in order to observe what ports respond and, from that, to determine what applications are available on that device. This is the approach taken by Progress Software with their IPQoS product. NextPoint S3 also offers this functionality in their agent, which doesn't require a dedicated probe. However, a better approach than using an active probe to measure true response time is to install an agent and use this to create artificial transactions, as has been previously described.

Actual User Transactions

In some implementations of RTM, network probes are used to measure end-to-end application response time. Observing the start and stop packets for various application transactions on the network and measuring the time between the packets provides the response time measurement. This approach requires probes to be strategically placed throughout the network, particularly in today's highly switched environments, in order to allow IT managers to gather adequate information for management of the response times. Using a probe to manage actual user transactions can add value to the network by also gathering data on the total bandwidth used by a particular application. NetScout's AppScout product uses this approach to RTM. Some agent-based RTM applications, such as NextPoint's S3 and Lucent's VitalSuite, also monitor RMON and RMON II probes to produce data such as total bandwidth consumed by a particular application, protocol breakdowns, etc.
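The probe-based measurement of actual user transactions, in which the time between observed start and stop packets gives the response time, can be sketched as below. The record layout (timestamp, transaction identifier, packet kind) and the function name `transaction_response_times` are assumptions for this example; how a real probe identifies the start and stop packets of a transaction is protocol-specific and is not addressed here.

```python
def transaction_response_times(packets):
    """Pair start and stop packets and return the response time per transaction.

    `packets` is an iterable of (timestamp_s, transaction_id, kind) tuples
    in capture order, where kind is "start" or "stop". Stop packets with
    no matching start are ignored.
    """
    started = {}
    results = {}
    for timestamp, transaction_id, kind in packets:
        if kind == "start":
            started[transaction_id] = timestamp
        elif kind == "stop" and transaction_id in started:
            results[transaction_id] = timestamp - started.pop(transaction_id)
    return results
```

Transactions that start but never stop remain unmatched and are simply absent from the result; a fuller implementation might report them as timeouts.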

The present invention provides proactive management by integrating RTM applications with an enteφrise management system. This allows IT groups and service providers to ensure that they meet SLAs without user intervention. This automated control eliminates the delay inherent in systems that operate by notifying administrators of problems and then waiting for the administrators to first find, and then fix, the problems. Administrators are still notified so that they can see if there are any other enhancements they can make, but they are no longer the critical path to keeping performance in check.

The present invention allows businesses to maximize the return on investment of their IT infrastructures. This requires ensuring that Service Level Agreements are set, monitored, and met. Integrating enterprise management with top tier RTM products adds value to a customer's SLM and performance management strategies. This integration will play an increasingly important role in meeting the business requirements of SLM.

The enterprise management system/RTM integration of the invention has three possible aspects: alarm integration, alarm correlation and proactive monitoring and control, and offline capacity planning tools that allow the IT organization to make informed decisions regarding adding applications and users to their infrastructure, upgrading timeframes, etc. Not all aspects of the invention are present in any particular embodiment. In the preferred embodiment of the invention, the enterprise management system is Aprisma Management Technologies' SPECTRUM, but other enterprise management applications having the necessary capabilities described herein would be suitable. Similarly, the RTM application of the invention can be filled by any RTM application having the necessary capabilities described herein, including, but not limited to, those mentioned in the previous discussion.

RTM software allows performance thresholds to be set. When they are crossed, a signal, such as an event signal, an alert, or some other type of an alarm signal (such as an SNMP trap) is generated. In one embodiment of the integrated enterprise management/RTM system, these signals from the RTM software are forwarded to the enterprise manager where they can be addressed. This lets users have a single repository for things such as alarms, allowing them to be addressed quickly and efficiently. This also provides the added benefit of the availability of the additional information provided by the enterprise management application alarm console, such as probable cause of an alarm, recommendations for fixing the problem, etc. Once alarms, alerts, etc. are transferred to the enterprise manager, they can be acknowledged, cleared, etc. in the same manner as alarms that are produced directly by the enterprise manager. Further, users may have the ability to set the enterprise management application to automatically acknowledge or clear an alarm, alert, etc. upon problem resolution. The RTM signals also provide information to the enterprise manager that will be used in other phases of the integration.
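The alarm integration just described can be illustrated with a minimal sketch, assuming an in-process hand-off rather than an SNMP trap or a vendor notification mechanism. The class names `RtmApplication` and `EnterpriseManager`, the `Alarm` fields, and the millisecond threshold are all hypothetical; they do not represent the interface of SPECTRUM or of any RTM product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alarm:
    source: str                 # which application raised the alarm
    target: str                 # the monitored application or service
    message: str
    acknowledged: bool = False
    cleared: bool = False


class EnterpriseManager:
    """Single repository for alarms, whatever their origin (hypothetical)."""

    def __init__(self) -> None:
        self.alarms: List[Alarm] = []

    def receive(self, alarm: Alarm) -> None:
        self.alarms.append(alarm)

    def active(self) -> List[Alarm]:
        return [a for a in self.alarms if not a.cleared]


class RtmApplication:
    """Generates an alarm when a response time crosses a set threshold."""

    def __init__(self, manager: EnterpriseManager, threshold_ms: float) -> None:
        self.manager = manager
        self.threshold_ms = threshold_ms

    def record(self, application: str, response_ms: float) -> Optional[Alarm]:
        if response_ms <= self.threshold_ms:
            return None
        alarm = Alarm(
            source="RTM",
            target=application,
            message=f"response time {response_ms:.0f} ms exceeds {self.threshold_ms:.0f} ms",
        )
        self.manager.receive(alarm)  # forward to the enterprise manager
        return alarm
```

Because forwarded alarms sit in the same list as any others, they can be acknowledged or cleared through the same fields, for example by setting `cleared = True` once the problem is resolved.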

Models are created in the enterprise management system for the different applications or groups of entities that are being monitored. Thresholds are then applied to those models. When alarms or other events occur in the RTM application, they are forwarded to the specific models in the enterprise management system to which they apply. Further, alarms related to existing enterprise management models can be associated with the application models in order to suggest, for example, that an end-to-end response time alarm could be caused by a router problem discovered by the enterprise manager. This tight alarm integration with the enterprise manager allows users to pinpoint the problems that caused the poor response time reported by the RTM applications.

A second embodiment of the invention includes both alarm correlation and proactive monitoring and control, as well as the alarm integration of the previous embodiment. Rather than dumping all alarms, events, alerts, etc. into a single view, the enterprise management system of this embodiment correlates them, allowing users to focus on the most important alarms first. This alarm correlation includes alarm suppression, where the enterprise manager presents a single alarm with the resultant alarms rolled up underneath it. For example, if a router being down causes 500 performance alarms, the performance alarms are rolled up under the router alarm. When the router is repaired, the performance alarms are repaired as well. The user also has the ability to drill into a router alarm and show all applications, connections, groups, etc. that are affected. In the event that many isolated alarms occur in the same time frame, this helps the user to determine which of the alarms is most important to fix first.
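The roll-up behavior in the router example can be sketched as a simple look-up against a dependency table. This is only one of many possible correlation techniques, and the `roll_up` function, its arguments, and the device and application names used with it are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def roll_up(
    device_alarms: Set[str],
    performance_alarms: Iterable[str],
    depends_on: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group performance alarms under the device alarm that explains them.

    device_alarms: devices the enterprise manager already knows are in alarm.
    performance_alarms: applications for which the RTM application raised alarms.
    depends_on: for each application, the devices it relies on (assumed known).
    Returns (rolled_up, isolated): rolled_up maps a device to the applications
    suppressed beneath it; isolated lists alarms with no known device cause.
    """
    rolled_up: Dict[str, List[str]] = defaultdict(list)
    isolated: List[str] = []
    for application in performance_alarms:
        cause = next(
            (d for d in depends_on.get(application, []) if d in device_alarms),
            None,
        )
        if cause is None:
            isolated.append(application)
        else:
            rolled_up[cause].append(application)
    return dict(rolled_up), isolated
```

With this structure, drilling into a device alarm amounts to reading the list stored under that device, and clearing the device alarm gives a natural point at which to clear the alarms beneath it.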

The second aspect of this embodiment is automated proactive monitoring and control, which is particularly useful for meeting SLAs. This functionality allows the enterprise management system to automatically take corrective actions where required to prevent violation of applicable SLAs. In this embodiment of the invention, the RTM applications are used for trending, and alarms are sent whenever performance is getting close to crossing set SLA thresholds. When this happens, the RTM application notifies the enterprise manager and the enterprise manager takes a pre-configured corrective action in order to bring the performance back to an acceptable level. This pre-configured action could include any number of corrective actions, such as changing priority queuing, provisioning more bandwidth, etc. Since corrective actions can be specific to a particular network, a user may also have the option of choosing which of several actions to take in each situation where an SLA threshold is in danger of being crossed.

The enterprise management application also provides information that assists in the determination of which actions to take, both by trending historical data to identify problems and their effects on the system, and by identifying possible solutions. Further, an alert may be sent to notify the user that the threshold was in danger of being crossed and to specify what action was taken to correct the problem. The present invention's ability to make automated corrective actions gives an organization that has guaranteed certain service levels the peace of mind that those service levels will be met. This proactive management saves revenue that would be lost through service degradations and keeps service providers and IT groups from incurring penalties due to violation of the SLA.
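A minimal sketch of the proactive control described above follows. It assumes that "getting close" to an SLA threshold means reaching a fixed fraction of it; the 0.8 default, the `check_sla` function, and the action table are hypothetical choices made for illustration, not values given in this description.

```python
from typing import Callable, Dict, Optional


def check_sla(
    application: str,
    response_ms: float,
    sla_ms: float,
    actions: Dict[str, Callable[[], str]],
    warn_fraction: float = 0.8,   # assumed margin; not specified in the text
) -> Optional[str]:
    """Act before the SLA threshold is crossed and report what was done.

    Returns None while performance is comfortably within the SLA. Otherwise
    returns the text of the alert that would be sent to the user.
    """
    if response_ms < warn_fraction * sla_ms:
        return None
    action = actions.get(application)
    if action is None:
        # No pre-configured action: leave the choice to the user.
        return f"{application}: nearing SLA, awaiting user-selected action"
    outcome = action()  # e.g. change queue priority or provision bandwidth
    return f"{application}: nearing SLA, took action: {outcome}"
```

Each entry in `actions` stands in for a network-specific corrective step, so a site could register a different callable per application or omit one to keep a person in the loop.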

In another embodiment, offline capacity planning applications are used with the enterprise management system in order to further refine the identification and correction of network and application problems. The data gathered by the RTM applications is sent to the modeling engine of the enterprise management application in order to provide historical data which is useful in gauging how changes affect performance. The enterprise manager has the ability to work with the off-line capacity planning database to use this historical data for advanced statistical analysis, both for future trending and to model what will happen to the system as more users or applications are added or if other changes are made.
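As one simple instance of the statistical analysis mentioned above, a least-squares line can be fitted to historical response times against user counts and then used to estimate the load at which an SLA threshold would be reached. Ordinary linear regression is an assumption made for this sketch; the description does not prescribe a particular method, and the function names and sample figures are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def users_at_threshold(slope: float, intercept: float, sla_ms: float) -> Optional[float]:
    """Estimate the user count at which the fitted response time reaches sla_ms."""
    if slope <= 0:
        return None  # response time is not growing with load in this model
    return (sla_ms - intercept) / slope


# Hypothetical history: (concurrent users, mean response time in ms).
users = [100, 200, 300, 400]
response_ms = [500, 700, 900, 1100]
slope, intercept = fit_line(users, response_ms)
limit = users_at_threshold(slope, intercept, sla_ms=2000)
```

A straight line is the crudest possible model; response time often grows faster than linearly as a resource saturates, so a projection of this kind is best treated as an early warning rather than a capacity guarantee.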

A state diagram depicting an embodiment of the invention is shown in Fig. 1. RTM parameters are collected 110 by the RTM application. Meanwhile, various enterprise parameters are collected 120 by the enterprise management system. The RTM application forwards alarms (caused by unacceptable response time measurements) and other signals and/or parameters to the enterprise manager, which puts them into a real-time database 130 along with the enterprise parameters that have been collected. The enterprise manager may also put selected forwarded RTM parameters and collected enterprise parameters into an off-line database 140, to be used for trend analysis and data mining (capacity planning) 160. The enterprise manager correlates 150 the RTM alarms with known alarms in the enterprise, and may then elect to raise an alert 170 or to execute a pre-specified corrective action 180. Alerts may be raised 170 in any of the usual manners, such as presentation of an alarm screen, email, pager, etc. Corrective actions that can be taken include suppression of the RTM alarm, association of causes with the alarm, escalation of the alarm, and/or correction of the underlying problem that has caused the alarm. The enterprise manager may also forward a "heads up" warning to the RTM application, telling it to expect poor response time for some time interval and to ignore it. When capacity planning 160 is used, the RTM and enterprise management applications enter selected performance values into an off-line database that can be analyzed by an offline capacity planning application. The results of this analysis may be reported to a human administrator or forwarded to the enterprise manager for further alarm correlation and/or corrective action.

A preferred embodiment of the invention utilizes SPECTRUM as the enterprise management system and VitalSuite from Lucent Technologies as the response time management application. In the implementation of this embodiment, RTM forwards alarms to SPECTRUM. This functionality is implemented with VitalSuite's notification method. SPECTRUM correlates the RTM alarms with known alarms in the enterprise, and may elect to suppress the RTM alarm, to associate causes with the alarm, to escalate the alarm and/or to correct the underlying cause of the alarm. This functionality is implemented utilizing SPECTRUM's SpectroWATCH. SPECTRUM may also forward a "heads up" warning to VitalSuite, telling it to expect poor response time for some time interval and to ignore it. This functionality is implemented with the SPECTRUM AlarmNotifier. VitalSuite and SPECTRUM may also enter selected performance values into an offline database that can be analyzed by an offline capacity planning application. The results of the analysis may be reported to a human administrator or forwarded to SPECTRUM for further alarm correlation. For VitalSuite, this functionality is implemented with standard SQL techniques. For SPECTRUM, this functionality is implemented with SPECTRUM Data Export.

An embodiment of the apparatus of the invention is shown in Fig. 2. This embodiment of the invention is a structured collaboration among an RTM application 210, an enterprise management application 220, and an offline capacity planning application 280. In this embodiment, the RTM application 210 does performance measurement 240 on the IT system 230. The RTM application may utilize any suitable approach, including, but not limited to, the agent-based and probe-based approaches discussed previously. If agent-based technology is used, the method of communication among the agents may be achieved by any of the many methods known in the art, including, but not limited to, remote procedure calls, remote shell invocation, or CORBA.

If unacceptable response time measurements are detected by the RTM application 210, it forwards alarms 250 to the enterprise management system 220. The enterprise management system 220 continuously performs proactive monitoring and control 260 on the IT system 230. The data obtained from this activity by the enterprise management system 220 is used with the alarm information forwarded 250 from the RTM application 210 to perform alarm correlation 255. The alarm correlation 255 provides additional information that can then be used by the enterprise management system 220 to more efficiently perform further proactive control 260 of the IT system 230, including correction of the problems which led to the RTM alarms 250. The method by which the enterprise manager 220 performs alarm correlation 255 may be embodied by any of the many techniques known in the art, including, but not limited to, rule-based expert systems, look-up tables, case-based reasoning systems, model-based reasoning systems, state transition graphs, fuzzy logic methods, Petri net methods, and Markov chain methods.

In the embodiment of Fig. 2, the enterprise management system 220 sends selected RTM, network, and traffic measurements 270 to an offline capacity planning database 280 in order to create a historical record for use in trend analysis and system modeling. The method by which the off-line capacity planning application 280 performs trend analysis can be embodied by any of the many specific methods known in the art, including, but not limited to, standard statistical methods, neural network methods, and genetic algorithms. The results 290 produced by the trend analysis and system modeling performed via the offline capacity planning database 280 are used by the enterprise management system 220 for proactive control 260 of the IT system 230, as well as by the IT staff for planning changes or enhancements to the IT system 230.

The embodiment of Fig. 2 utilizes an offline capacity planning application 280. It should be noted that this is not a necessary element of the invention, but rather adds additional desirable functionality. Similarly, the alarm correlation function 255 is an optional feature of the invention. All the features described in the embodiment of Fig. 2 therefore do not have to be present in any particular embodiment of the invention. For example, some embodiments of the invention include RTM-to-enterprise manager alarm forwarding only.

The operation of an embodiment of the invention utilizing a real-time alarm database is illustrated in Fig. 3. In this embodiment, RTM and enterprise parameters are collected 305 and stored in a real-time database 310. If an alarm has occurred 315, the RTM and enterprise manager alarms are correlated 320 and the end user is notified 325. If a pre-specified action in response to the results from the alarm correlation 320 exists 330, the pre-specified action is executed 335. Otherwise, the system waits for end user intervention 340 and, when received, executes the user-specified action 345.
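The decision flow of Fig. 3 can be sketched as a single function, with the reference numerals from the figure noted in comments. The function name `run_cycle`, its parameters, and the representation of actions as callables are hypothetical; collection and storage (305, 310) are assumed to have happened before the function is called.

```python
from typing import Callable, Dict, List, Optional


def run_cycle(
    alarms: List[str],
    correlate: Callable[[List[str]], str],
    notify: Callable[[str], None],
    prespecified: Dict[str, Callable[[], str]],
    ask_user: Callable[[str], Callable[[], str]],
) -> Optional[str]:
    """One pass through the Fig. 3 flow; returns the outcome of the action taken."""
    if not alarms:                      # 315: has an alarm occurred?
        return None
    result = correlate(alarms)          # 320: correlate RTM and enterprise alarms
    notify(result)                      # 325: notify the end user
    action = prespecified.get(result)   # 330: does a pre-specified action exist?
    if action is None:
        action = ask_user(result)       # 340: wait for end user intervention
    return action()                     # 335 or 345: execute the chosen action
```

Passing the correlation step in as a callable keeps the sketch independent of the correlation technique, which the description leaves open.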

Fig. 4 illustrates the operation of an embodiment of the invention utilizing an off-line capacity planning database. In this embodiment, RTM and enterprise parameters are collected 405 and stored in an off-line capacity planning database 410. Capacity planning, in the form of such things as trend analysis and data mining, is performed 415 on the stored data. If a problem is predicted 420, the end user is notified 430; otherwise the data is stored 425 for later reporting. If a pre-specified action in response to the predicted problem 420 exists 435, the pre-specified action is executed 440. Otherwise, the system waits for end user intervention 445 and, when received, executes the user-specified action 450.

What has been described is merely illustrative of the application of the principles of the present invention. Other arrangements, methods, modifications and substitutions by one of ordinary skill in the art are also considered to be within the scope of the present invention, which is not to be limited except by the claims which follow.

Claims

What is claimed is:

1. A method for automatic detection of problems in a network comprising, in combination, the steps of: making at least one response time measurement; sending at least one signal to an enterprise management application, said signal reflecting at least one aspect of the response time measurement; and utilizing the signal in the enterprise management application to automatically detect the presence of at least one problem in the network.

2. The method of claim 1, wherein the signal is an alarm generated by an unacceptable value of said response time measurement.

3. The method of claim 1, wherein said step of utilizing said signal includes at least the step of correlating said signal with at least one parameter generated by the enterprise management application.

4. The method of claim 3, wherein the signal is an alarm generated by an unacceptable value of said response time measurement.

5. The method of claim 4, wherein said parameter generated by the enterprise management application is an alarm.

6. The method of claim 2, further comprising the step of utilizing the enterprise management application to determine a cause for the alarm.

7. The method of claim 4, further comprising the step of utilizing the enterprise management application and a result from said step of correlating to determine a cause for the alarm.

8. The method of claim 6, further comprising the step of correcting the cause for the alarm.

9. The method of claim 7, further comprising the step of correcting the cause for the alarm.

10. The method of claim 1, further comprising, in combination, the steps of: sending said at least one response time measurement that generated said signal to an off-line database; and utilizing said off-line database for capacity planning.

11. The method of claim 3, further comprising, in combination, the steps of: sending at least one of the response time measurement that generated said signal and said parameter to an off-line database; and utilizing said off-line database for capacity planning.

12. The method of claim 7, further comprising, in combination, the steps of: sending at least one of: the response time measurement that generated said alarm, said parameter generated by the enterprise management application, said result from said step of correlating, and an indication of said cause for the alarm to an off-line database; and utilizing said off-line database for capacity planning.

13. The method of claim 9, further comprising, in combination, the steps of: sending at least one of: the response time measurement that generated said alarm, said parameter generated by the enterprise management application, said result from said step of correlating, and an indication of said cause for the alarm to an off-line database; and utilizing said off-line database for capacity planning.

14. A method for automatic detection and correction of problems in a network comprising, in combination, the steps of: making at least one response time measurement; sending at least one alarm signal to an enterprise manager as a result of the response time measurement; utilizing the alarm signal in the enterprise manager to automatically detect the presence of at least one problem in the network; utilizing the enterprise manager to determine a cause for the network problem; and correcting the cause for the network problem.

15. The method of claim 14, further comprising the step of correlating said alarm signal with at least one parameter generated by the enterprise manager.

16. The method of claim 15, wherein said step of utilizing the alarm signal in the enterprise manager to determine a cause for the network problem comprises at least the step of utilizing a result from said step of correlating.

17. A method for automatic detection, correction, and prediction of problems in a network comprising, in combination, the steps of: making at least one response time measurement; sending at least one alarm signal to an enterprise manager as a result of the response time measurement; utilizing the alarm signal in the enterprise manager to automatically detect the presence of at least one problem in the network; utilizing the enterprise manager to determine a root cause for the network problem; correcting the root cause for the network problem; sending at least one parameter representing at least one of the response time measurements that generated the alarm signal and the root cause for the network problem to an off-line database; and utilizing said off-line database for capacity planning.

18. The method of claim 17, further comprising the step of correlating said alarm signal with at least one parameter generated by the enterprise manager.

19. The method of claim 18, wherein said step of utilizing the alarm signal in the enterprise manager to determine a root cause for the network problem comprises at least the step of utilizing a result from said step of correlating.

20. An apparatus for automatic detection of problems in an information technology system comprising, in combination: a response time management application, said response time management application being capable of forwarding at least one signal to an enterprise management application as a result of making at least one response time measurement; and an enterprise management application, said enterprise management application being capable of receiving at least one signal from the response time management application and utilizing the signal to automatically detect the presence of at least one problem in the information technology system.

21. The apparatus of claim 20, wherein the signal is an alarm generated by an unacceptable value of said response time measurement.

22. The apparatus of claim 20, wherein said enterprise management application correlates said signal with at least one parameter generated by the enterprise management application.

23. The apparatus of claim 22, wherein the signal is an alarm generated by an unacceptable value of said response time measurement.

24. The apparatus of claim 23, wherein said parameter generated by the enterprise management application is an alarm.

25. The apparatus of claim 21, wherein the enterprise management application determines a root cause for the alarm signal.

26. The apparatus of claim 23, wherein the enterprise management application utilizes a result from the correlation to determine a root cause for the alarm signal.

27. The apparatus of claim 25, wherein the enterprise management application corrects the cause for the alarm signal.

28. The apparatus of claim 26, wherein the enterprise management application corrects the cause for the alarm signal.

29. The apparatus of claim 20, further comprising an off-line capacity planning database wherein said database receives said signal for use in capacity planning.

30. The apparatus of claim 22, further comprising an off-line capacity planning database wherein said database receives at least one of said signal and said parameter for use in capacity planning.

31. The apparatus of claim 26, further comprising an off-line capacity planning database wherein said database receives at least one of: said alarm, said parameter generated by the enterprise management application, said result from said correlation, and an indication of said cause for the alarm for use in capacity planning.

32. The apparatus of claim 28, further comprising an off-line capacity planning database wherein said database receives at least one of: said alarm, said parameter generated by the network management application, said result from said correlation, and an indication of said cause for the alarm for use in capacity planning.

33. An apparatus for automatic detection and correction of problems in a network comprising, in combination: a response time management application, said response time management application being capable of forwarding at least one alarm signal to an enterprise management application as a result of making at least one response time measurement; and an enterprise management application, said enterprise management application being capable of receiving at least one alarm signal from the response time management application and utilizing the alarm signal to automatically detect and correct at least one problem in the network.

34. The apparatus of claim 33, wherein said enterprise management application correlates said alarm signal with at least one parameter generated by the enterprise management application.

35. The apparatus of claim 34, wherein said enterprise management application uses a result of said correlation to determine a cause for the problem in the network.

36. An apparatus for automatic detection, correction, and prediction of problems in an information technology system comprising, in combination: a response time management application, said response time management application being capable of forwarding at least one alarm signal to an enterprise manager as a result of making at least one response time measurement; an enterprise manager, said enterprise manager being capable of receiving at least one alarm signal from the response time management application and utilizing the alarm signal to automatically detect and correct at least one problem in the information technology system; and an off-line capacity planning database wherein said database receives information related to said alarm signal and said information technology system problem for use in capacity planning.

37. The method of claim 36, wherein said enterprise manager correlates said alarm signal with at least one parameter generated by the enterprise manager.

38. The method of claim 37, wherein said enterprise manager utilizes a result from said correlation to determine a root cause for the information technology system problem.