Monitoring is the process by which you identify and prevent problems with deployed
client/server or Web-based systems to ensure their continued functionality and performance.
Monitoring involves using a tool that periodically checks a Web site's overall health and
sends you alerts when it detects problems.
The key to effective monitoring is receiving alerts for every real problem while
preventing false positive alerts for insignificant issues. The only way to achieve
this delicate balance is to perform a mixture of application monitoring and hardware
monitoring, and ensure that urgent alerts are issued only for problems that impact
the application's functionality and performance.
Application monitoring involves checking application functionality and performance
by exercising the application as a user would. Application monitoring not only exposes
functionality and performance problems that will impact current users, but also exposes
emerging issues (such as subtle performance degradation caused by a memory leak) that
might not yet be apparent to users, but could eventually grow into serious problems.
If you receive this type of "early warning" alert, you can start fixing a problem before
it has the chance to impact how users perceive or use the application.
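As a rough illustration of such an "early warning" check, the following Python sketch
compares recent response times against an earlier baseline. The function name, window
size, growth threshold, and hard limit are all assumptions chosen for this example; they
are not settings of any particular monitoring tool.

```python
from statistics import mean


def early_warning(response_times, window=5, max_growth=0.25, hard_limit=2.0):
    """Classify a series of response times (in seconds, oldest first).

    Returns "failure" if the most recent window already exceeds hard_limit,
    "warning" if it is still under the limit but has grown by more than
    max_growth relative to the earliest window, and "ok" otherwise.
    All thresholds are illustrative assumptions.
    """
    if len(response_times) < 2 * window:
        return "ok"  # not enough history to judge a trend
    baseline = mean(response_times[:window])
    recent = mean(response_times[-window:])
    if recent > hard_limit:
        return "failure"
    if recent > baseline * (1 + max_growth):
        return "warning"
    return "ok"
```

A slow upward drift, such as the one a memory leak might cause, would be reported as a
"warning" well before the hard limit that users would notice is reached.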
The more traditional type of monitoring--hardware monitoring--should also be performed
because the data it collects is essential for diagnosing the source of application-level
problems. However, relying on hardware monitoring alone typically leads to false positive
alerts--so many false positive alerts that you eventually grow desensitized to all alerts.
These false positives often occur because people try to map hardware failures to application
failures, but the two are not always connected. A piece of hardware might fail without
affecting the user experience, or users might experience functionality problems even when
every piece of hardware is running perfectly.
The best way to obtain a reliable understanding of a system's health is to ensure that
your monitoring efforts cover all the pieces that come into play when a user exercises the
application--including the application logic, the data back-end, the hardware, and so forth--
and that alerts are sent only when the combination of results indicates that a real problem has
occurred or is emerging. For example, fully monitoring a Web-based enterprise system might
involve verifying whether:
- User click paths through critical transactions do not experience path flow changes
or path content changes.
- User click paths through critical transactions execute within an acceptable period of time.
- Database transactions are completed within desired time limits and database operations
function correctly, even as the amount of data in the database increases.
- A Web service or other third-party content provider delivers a valid response in the
expected format.
- Local machine hardware statistics (CPU utilization, memory space, disk space, buffer
cache, etc.) do not reach unacceptable levels.
- Client requests that travel through a Web service proxy match the expected security
patterns, and inappropriate requests are not forwarded to the server.
Ideally, these tests are run from strategic locations inside and outside the system
to collect the data essential to rapid diagnosis.
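One way to picture "alerting on the combination of results" is the Python sketch below.
It assumes a hypothetical CheckResult record and a simple rule: a failed check that
exercises the application as a user would is urgent, while a failure seen only at the
hardware level is kept as diagnostic data. Real alerting rules would likely be richer.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str          # hypothetical label, e.g. "checkout click path"
    passed: bool
    user_facing: bool  # True if the check exercises the application as a user would


def alert_level(results):
    """Return "urgent", "diagnostic", or "none" for a batch of check results."""
    failed = [r for r in results if not r.passed]
    if any(r.user_facing for r in failed):
        return "urgent"      # functionality or performance is affected
    if failed:
        return "diagnostic"  # record for diagnosis, but do not page anyone
    return "none"
```

Under this rule, a full disk on one machine does not trigger an urgent alert unless a
user-level check fails as well.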
Moreover, if you want to prevent emerging problems as well as identify existing ones,
you can run a mixture of passive tests and active tests. Active tests simulate user
actions using test drivers, virtual users, and so on to determine what problems could
affect potential users' experiences. Passive tests unobtrusively monitor system and
transaction details to identify major system problems (such as an offline server) and
to collect data that helps you diagnose the source of application-level problems. If
you frequently run a well-designed test suite that represents realistic user transactions
and loads, your tests will typically expose emerging bottlenecks and functionality issues
before your actual users have the opportunity to notice them. With this advance warning,
you can start diagnosing and repairing the problems before functionality is impaired for
actual users and service level agreements are violated.
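A minimal sketch of an active test, assuming the simulated user transaction is supplied
as an ordinary Python callable, might look like this. The function name
run_active_test and the shape of the returned dictionary are assumptions made for
illustration only.

```python
import time


def run_active_test(transaction, max_seconds):
    """Run one simulated user transaction and report whether it succeeded in time.

    transaction is any zero-argument callable standing in for a test driver
    or virtual user; max_seconds is the acceptable duration for it.
    """
    start = time.perf_counter()
    try:
        transaction()
        error = None
    except Exception as exc:  # a functional failure counts as a failed test
        error = repr(exc)
    elapsed = time.perf_counter() - start
    return {
        "ok": error is None and elapsed <= max_seconds,
        "elapsed": elapsed,
        "error": error,
    }
```

Running a function like this on a schedule, and feeding the elapsed times into trend
checks, is one way to surface bottlenecks before actual users encounter them.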