Directions for using template:
Read the Guidance
(Arial blue font in brackets) to understand the
information that should be placed in each section of this template. Then delete
the Guidance and replace the placeholder within <<Begin text here>>
with your response. There may be additional Guidance in the Appendix of some
documents, which should also be deleted once it has been used.
Some templates have four levels of headings. They are not indented, but can be differentiated by font type and size.
You may elect to indent sections for readability.
Author | |
Author Position | |
Date | |
© 2002 Microsoft Corporation. All rights reserved.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.
Microsoft and Visual Basic are either registered trademarks
or trademarks of Microsoft in the
Change Record
Date | Author | Version | Change Reference |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
Reviewers
Name | Version Approved | Position | Date |
| | | |
| | | |
| | | |
| | | |
Distribution
Name | Position |
| |
| |
| |
| |
Document Properties
Item | Details |
Document Title | Monitoring Plan |
Author | |
Creation Date | |
Last Updated | |
Application Health and Performance Monitoring
Detecting Failures (Incidents)
Diagnosing Failures (Problems)
[Introduction to the Template
Description: The Monitoring Plan defines the process by which the
operational environment will monitor the solution. It describes what will be
monitored, what monitoring is looking for, how monitoring will be done, and how
the results of monitoring will be reported and used. Customers use automated
procedures to monitor many aspects of their solutions. Automated monitoring is a
key best practice that enables identification of failure conditions and
potential problems. Monitoring helps to reduce the time needed to recover from
failures.
Justification: The plan will provide the details of the monitoring
process, which will be incorporated into the functional specification. Once
incorporated into the functional specification, the monitoring process (manual
and automated) will be included in the solution design. Monitoring ensures that
operators are made aware that a failure has occurred so they can initiate
procedures to restore service. Additionally, some organizations monitor their
servers’ performance characteristics to spot usage trends. This proactive best
practice allows organizations to identify the conditions that contribute to
system failure and take action to prevent those conditions from occurring.
{Team Role Primary: Program Management is responsible for ensuring that the
plan is completed and has acceptable quality, as well as incorporating it into
the Master Project Plan and Operations Plan. Release
Management will contribute heavily to the content of the plan in its
responsibility for designing an effective solution monitoring process.
Team Role Secondary: Development will review the plan
to ensure that the functional specification and project deliverables are in
synch with the monitoring plan. Product Management
will review the plan to ensure that external customer needs are met by the
monitoring plan. Test and User
Experience will review the plan to ensure that what is monitored supports
their functional areas of interest.}]
[Description: Provide an overall summary of the contents of this
document.
Justification: Some project participants may need to know only the
highlights of the plan, and summarizing creates that user view. It also enables
the full reader to know the essence of the document before they examine the
details.]
<<Begin text here>>
[Description: The Objectives section describes the business and technical
drivers of the monitoring process and what key objectives are targeted for the
monitoring process.
Justification: Identifying the drivers and monitoring objectives signals
to the customer that Microsoft has carefully considered the situation and
solution and created an appropriate monitoring approach.]
<<Begin text here>>
[Description: The Anticipating Failures section should
This information could be documented in a matrix such as:
Component | Single Point of Failure (yes or no) | Mean Time between Failures | Conditions and Circumstances leading to Failure | Probability of Component Failure | Impacts of Failure |
| | | | | |
| | | | | |
Justification: Anticipating failures will enable operations either to avoid
them or be prepared to deal with them when they occur.]
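The relationship between the Mean Time between Failures and Probability of Component Failure columns in the matrix can be illustrated with a short sketch. It assumes a constant failure rate (an exponential model), which is a simplifying assumption and not something this template prescribes. The component names, MTBF figures, and the 30-day period are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    single_point_of_failure: bool
    mtbf_hours: float  # Mean Time between Failures, in hours


def failure_probability(mtbf_hours: float, period_hours: float) -> float:
    """Probability of at least one failure within period_hours.

    Assumes a constant failure rate (exponential model); this is an
    illustrative assumption, not a rule taken from the template.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-period_hours / mtbf_hours)


# Hypothetical components and MTBF figures, for illustration only.
components = [
    Component("database server", True, 8760.0),
    Component("web server (load balanced)", False, 4380.0),
]

for c in components:
    p = failure_probability(c.mtbf_hours, period_hours=720.0)  # about 30 days
    print(f"{c.name}: SPOF={c.single_point_of_failure}, "
          f"P(failure in 30 days)={p:.3f}")
```

A real plan would replace the exponential model with whatever failure data the hardware vendor or operations history provides.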
<<Begin text here>>
[Description: The Resource Threshold Monitoring section identifies the
solution resources that will be monitored, defines the conditions and
circumstances to be monitored for each type of resource, and defines the
thresholds used to judge whether resources are working properly and are
sufficient to support the solution. Resources include hard drives, CPU,
memory, and threads.]
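A threshold definition of the kind described above might be sketched as follows. The two-level (warning and critical) scheme, the percentage values, and the sample readings are all hypothetical; the template itself does not specify how thresholds are expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    resource: str
    warning: float   # percent utilization that triggers a warning
    critical: float  # percent utilization that triggers a critical alert


def evaluate(threshold: Threshold, utilization_percent: float) -> str:
    """Classify one utilization reading as 'ok', 'warning', or 'critical'."""
    if utilization_percent >= threshold.critical:
        return "critical"
    if utilization_percent >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical thresholds and readings, for illustration only.
thresholds = {
    "cpu": Threshold("cpu", 75.0, 90.0),
    "memory": Threshold("memory", 80.0, 95.0),
    "disk": Threshold("disk", 85.0, 95.0),
}
readings = {"cpu": 62.0, "memory": 88.5, "disk": 97.0}

for name, value in readings.items():
    print(f"{name}: {value:.1f}% -> {evaluate(thresholds[name], value)}")
```

In practice the readings would come from a monitoring tool such as Performance Monitor rather than a hard-coded dictionary.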
<<Begin text here>>
[Description: The Performance Monitoring section defines the
monitoring process that gathers and records information about the performance of
the total solution and the individual components in the solution. For each type
of solution event it includes
<<Begin text here>>
[Description: The Trend Analysis section defines the analysis that will
take place on the data collected during performance monitoring. Trend analysis
uses the information gathered and recorded by performance monitoring to predict
solution and component performance and health under different conditions and
circumstances, such as a larger user set and a changing solution
environment.]
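One simple form of trend analysis is a least-squares line fitted to recorded performance samples, used to estimate when a resource will reach a limit. The sketch below is one possible approach and not a method this template prescribes; the weekly disk-usage samples and the 90 percent limit are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_reach(xs: Sequence[float], ys: Sequence[float],
                  limit: float) -> Optional[float]:
    """Estimate the x value at which the fitted line reaches limit.

    Returns None when the fitted trend is not increasing.
    """
    slope, intercept = linear_trend(xs, ys)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical samples: week number and percent of disk space used.
weeks = [0, 1, 2, 3, 4]
disk_used = [50.0, 52.0, 54.0, 56.0, 58.0]

print(time_to_reach(weeks, disk_used, limit=90.0))
```

A straight line is only a first approximation; seasonal or step changes in usage would need a different model.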
<<Begin text here>>
[Description: The Application Health and Performance Monitoring section
should list and describe each software application in the solution and describe
the plan for monitoring each application:
<<Begin text here>>
[Description: The Detecting Failures section should describe how the
development team, operations, and maintenance will utilize the functional
specifications and user acceptance criteria to detect failure incidents. The
functional specifications clearly define the success criteria for a solution and
for each of its components. User Acceptance Criteria, based on the functional
specifications, precisely define user expectations for the correct and effective
operation of the solution.]
<<Begin text here>>
[Description: The Error Detection section describes the processes,
methods, and tools teams will use to detect and diagnose solution errors. The
goal of an error detection strategy should be that each error is detected,
resolved, and recovered from without the user community being aware of it.
Justification: Error detection in a Windows environment will enhance a
solution’s reliability and availability. Early detection and handling of
application and system errors can help avoid a shutdown, or at least allow for
an orderly shutdown. It can also increase availability by allowing the solution
to continue operating in a degraded state.]
<<Begin text here>>
[Description: SNMP (Simple Network Management Protocol) captures or traps
configuration and status information from a Windows NT server.]
<<Begin text here>>
[Description: The Event Logs section describes the logs that will provide
a system for capturing and reviewing significant application and system events.
Describe the logs operations will maintain and the procedures they will use to
record events and time in the logs.]
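Recording events together with the time they occurred can be sketched with Python's standard logging module. This is a stand-in for illustration only: a Windows deployment would typically write to the Windows event log instead, and the logger name and messages here are hypothetical.

```python
import io
import logging


def make_event_logger(stream) -> logging.Logger:
    """Build a logger that writes timestamped event records to stream."""
    logger = logging.getLogger("solution.events")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for a log file in this sketch.
buffer = io.StringIO()
log = make_event_logger(buffer)
log.info("service started")
log.error("order queue unreachable")
print(buffer.getvalue(), end="")
```

Each record carries a timestamp, a severity level, and the event text, which is the minimum operations would need to review events in order.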
<<Begin text here>>
[Description: The Monitoring for Failure section should describe the
processes, methods, and tools teams will use to detect and report solution
failures.]
<<Begin text here>>
[Description: The Monitoring for Success section describes the processes,
methods, and tools teams will use to determine that the solution is working correctly
and is meeting user expectations. Monitoring for success includes the use of
monitoring tools and interaction with solution users to gather information about
solution successes.]
<<Begin text here>>
[Description: The Monitoring for Alarms section describes how solution
alarms will signal that a problem is about to occur or has occurred in a
solution. It should identify all solution alarms, indicate how they will signal
users and operations, and define what each alarm means.]
<<Begin text here>>
[Description: The Exception Trapping section describes a type of
monitoring built into a solution that recognizes incidents, indicating a solution has produced a result that
is an exception to acceptable results (i.e., the result lies outside the range
of acceptability). This section should identify where the development team will
build exception traps into the solution that continually monitor solutions or
that operations will turn on when they suspect problems within a solution.
Exception trapping capabilities allow for reliable programmer and program
control over responses to exceptions that occur during the execution of a
solution.]
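An exception trap of the kind described above, one that checks whether a result lies within the range of acceptability and that operations can switch on or off, might look like the following sketch. The class names, the range, and the sample value are hypothetical.

```python
class ResultOutOfRange(Exception):
    """Raised when a result lies outside the range of acceptability."""


class ExceptionTrap:
    """Checks results against an acceptable range while enabled."""

    def __init__(self, low: float, high: float, enabled: bool = True):
        self.low = low
        self.high = high
        self.enabled = enabled  # operations can switch the trap on or off
        self.incidents = []     # (name, value) pairs that were trapped

    def check(self, name: str, value: float) -> float:
        """Return value unchanged; record and raise if it is out of range."""
        if self.enabled and not (self.low <= value <= self.high):
            self.incidents.append((name, value))
            raise ResultOutOfRange(
                f"{name}={value} outside [{self.low}, {self.high}]"
            )
        return value


# Hypothetical usage: a discount percentage must lie between 0 and 100.
trap = ExceptionTrap(low=0.0, high=100.0)
try:
    trap.check("discount_percent", 130.0)
except ResultOutOfRange as exc:
    print(f"incident trapped: {exc}")
```

Keeping a list of trapped incidents gives the notification and diagnosis steps something concrete to report on.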
<<Begin text here>>
[Description: The Notifications section describes how people will be
notified when monitoring and exception trapping have detected solution failures.
This should include notification for errors and cases in which user performance
expectations have not been met.]
<<Begin text here>>
[Description: The Diagnosing Failures section describes the processes,
methods, and tools teams will employ to diagnose the problems detected in
solutions by monitoring and exception trapping.]
<<Begin text here>>
[Description: The Resolving Failures section describes the procedures
teams will use to correct the errors detected and diagnosed in solutions and to
improve solutions that do not meet user expectations.]
<<Begin text here>>
[Description: The Recovering from Failures section defines how the
solution will be recovered from failure or references the Backup and Recovery Plan.]
<<Begin text here>>
[Description: The Tools section lists and describes the tools teams can
employ to detect, diagnose, and correct errors and to improve a solution’s
performance. The table below is an example of this.]
<<Begin text here>>
Tool | Description |
Microsoft Systems Management Server | Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows-based environments to reduce the cost of change and configuration management of Windows-based desktop and server computers. Details available at http://www.microsoft.com/backoffice |
Microsoft Performance Monitor (Perfmon) | Windows NT administrative tool that enables viewing the behavior of processors, memory, cache, threads, and process objects. Each object has an associated set of counters that provide information about device usage, queue length, delays, and other data that measure throughput and internal congestion. Details available at http://www.microsoft.com/ntserver |
Microsoft Windows NT Resource Kit, version 3.51 | Microsoft Press® kits contain both technical documentation and a CD-ROM with useful utilities and accessory programs to help install, configure, and troubleshoot Microsoft Windows NT. Details available at http://www.mspress.microsoft.com |
| Family of products with a single management framework integrating disparate IBM systems management applications. Details available at http://www.tivoli.com |
Microsoft HTTPMon | Multithreaded Windows NT service that monitors web server performance by measuring how quickly the web server responds to requests from client browsers. Details available at http://www.microsoft.com/ntserver |
HP OpenView | Hewlett Packard family of products designed to manage distributed computer systems and networks from computers running Windows or UNIX operating systems. Details available at http://www.hp.com |
NetManage | Single-source PC-to-host connectivity solutions from NetManage. The company develops integrated applications, servers, and development tools for Microsoft Windows, Windows® 95 and Windows NT operating systems. Details available at http://www.netmanage.com |
PerlEx | Utility for Web servers running under Windows NT that improves the performance of Perl scripts. Details available at http://www.activestate.com |
SeNTry | An SNMP-based monitoring tool. Details available at http://www.missioncritical.com |