
Resource Monitoring and Control Guide and Reference


Overview of RSCT Resource Monitoring and Control

The RSCT Resource Monitoring and Control (RMC) application is part of Reliable Scalable Cluster Technology (RSCT). It provides consistent, comprehensive monitoring of system resources. By monitoring conditions of interest and providing automated responses when these conditions occur, RSCT Resource Monitoring and Control helps maintain system availability. It is installed as part of the base operating system.

You can set up and administer monitoring through the Web-based System Manager graphical user interface or from the command line. See Using the Monitoring Application for more details.

The Monitoring application offers a comprehensive set of monitoring and response capabilities that let you detect, and in many cases correct, system resource problems such as a critical file system becoming full. You can monitor virtually all aspects of your system resources and specify a wide range of actions to be taken when a problem occurs, from simple notification by e-mail to recovery that runs a user-written script. There is no limit on the number of actions you can specify in response to an event.

As system administrator, you have a great deal of flexibility in responding to events. You can respond to an event in different ways based on the day of the week and time of day. The following are some examples of how you can use monitoring:


Monitoring Concepts

Monitoring lets you detect conditions of interest in your machine and its associated resources, and automatically take action when those conditions occur. The key elements in monitoring are conditions and responses.

A condition identifies one or more resources you want to monitor, such as the /var file system, and the specific resource state you are interested in, such as /var > 90% full.

A response specifies one or more actions to be taken when the condition is found to be true. Actions can include notification, running commands, and logging.

About Conditions

To understand and use conditions, you need to know about monitored resources, monitored properties, event expressions, and rearm expressions, which are described below.

System resources that you can monitor are organized into general categories called resource classes. Examples of resource classes include Processor, File System, Physical Volume, and Ethernet Device.

Each resource class includes individual system resources that belong to the class. For example, the File System resource class might include these resources:

When a resource is specified for use in a condition, it is called a monitored resource.

Each resource class also has a set of properties that you can monitor. For example, the File System resource class has the following properties available for monitoring:

When a resource property is specified for use in a condition, it is called a monitored property. (In the underlying subsystems, a monitored property is referred to as a dynamic attribute.)

In the condition, you specify the monitored property in a logical expression that defines the threshold or state of the monitored resources. This logical expression is the event expression of the condition. Event expressions are typically used to monitor potential problems and significant changes in the system. For example, the event expression for a /var space used condition might be PercentTotUsed > 90. When the event expression evaluates to true, an event is generated.
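As a rough illustration of how an event expression such as PercentTotUsed > 90 relates a monitored property to a threshold, the following Python sketch evaluates a simple comparison against one sample of property values. It is not the product's expression evaluator: the function name evaluate, the supported operator set, and the dictionary of sample values are all assumptions made for this example.

```python
import operator
import re

# Comparison operators supported by this sketch (an assumption; the real
# expression syntax is described in "Using Expressions").
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

# property name, comparison operator, numeric threshold
_EXPR = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def evaluate(expression, sample):
    """Return True if `expression` holds for the property values in `sample`."""
    match = _EXPR.match(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    name, op, threshold = match.groups()
    return _OPS[op](sample[name], float(threshold))


# Hypothetical sample: /var is 93.5% full, so the event expression holds.
event_is_true = evaluate("PercentTotUsed > 90", {"PercentTotUsed": 93.5})
```

In this sketch a true result corresponds to the point at which an event would be generated for the monitored resource.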

The rearm expression of a condition is optional. A rearm expression typically indicates when the monitored resource has returned to an acceptable state. When the rearm expression becomes true, evaluation of the event expression resumes.

If a rearm expression is not specified, then once the event expression becomes true, an event is generated for certain properties every time the monitored property is evaluated.

If a rearm expression is specified, evaluation of the rearm expression starts once the event expression becomes true. When the rearm expression becomes true, a rearm event is generated, and evaluation of the event expression starts again. For example, if the event expression for a /var space used condition is PercentTotUsed > 90 and the rearm expression is PercentTotUsed < 80, then an event is generated when /var is more than 90% full. From then on, each evaluation of the condition uses the rearm expression. When /var is less than 80% full, a rearm event is generated indicating that the condition has been reset, and the event expression is used again to evaluate the condition.
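The alternation between the event expression and the rearm expression can be pictured as a small two-state machine. The Python sketch below is only a model of the behavior described above, not the RMC implementation: the class ConditionMonitor, its observe method, and the sample values are hypothetical. It also simplifies the case with no rearm expression by generating an event on every evaluation for which the event expression is true.

```python
class ConditionMonitor:
    """Tracks whether the event or the rearm expression is being evaluated."""

    def __init__(self, event_expr, rearm_expr=None):
        self.event_expr = event_expr    # callable: value -> bool
        self.rearm_expr = rearm_expr    # callable: value -> bool, or None
        self.armed = True               # True means "evaluate the event expression"

    def observe(self, value):
        """Evaluate one sample and return "event", "rearm", or None."""
        if self.armed:
            if self.event_expr(value):
                # With a rearm expression, switch to evaluating it from now on.
                if self.rearm_expr is not None:
                    self.armed = False
                return "event"
            return None
        if self.rearm_expr(value):
            # Condition has been reset: go back to the event expression.
            self.armed = True
            return "rearm"
        return None


# Hypothetical PercentTotUsed samples for /var, in evaluation order.
samples = [85, 92, 95, 85, 79, 91]

with_rearm = ConditionMonitor(lambda v: v > 90, lambda v: v < 80)
without_rearm = ConditionMonitor(lambda v: v > 90)

results_with_rearm = [with_rearm.observe(v) for v in samples]
results_without_rearm = [without_rearm.observe(v) for v in samples]
```

With the rearm expression, the samples 92 and 95 produce a single event, and a new event is possible only after the value drops below 80. Without it, both 92 and 95 produce an event.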

See Using Expressions for more information about data types and operators that you can use in an event expression or a rearm expression.

Predefined conditions are provided with the Monitoring application. To create a new condition you will need to set the following condition components:

| Condition Component | Description | Example |
| --- | --- | --- |
| Condition name | The name you want to give the condition. | /var space used |
| Resource class | The resource class to be monitored. | FileSystem |
| Monitored property | The property of the resource class to be monitored. | PercentTotUsed |
| Monitored resources | The specific resources in the resource class that are to be monitored. | /var |
| Event expression | A logical expression defining the value or state of the monitored property that is to generate an event. | PercentTotUsed > 90 |
| Event description | A text description of the event expression. | An event occurs when /var is more than 90% full. |
| Rearm expression | When a rearm expression is specified, the rearm expression is evaluated when the event expression becomes true. When the rearm expression becomes true, the event expression is used for evaluation again. | PercentTotUsed < 80 |
| Rearm description | A text description of the rearm expression. | A rearm event occurs when /var is less than 80% full. |
| Severity | The severity of the condition: Informational, Warning, or Critical. | Critical |
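To show how these components fit together as one definition, the following Python sketch collects them in a single record filled with the example values from the table. The class and field names are illustrative assumptions and do not correspond to any RMC interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    INFORMATIONAL = "Informational"
    WARNING = "Warning"
    CRITICAL = "Critical"


@dataclass(frozen=True)
class Condition:
    """Illustrative record of the condition components (names are assumptions)."""
    name: str
    resource_class: str
    monitored_property: str
    monitored_resources: Tuple[str, ...]
    event_expression: str
    event_description: str
    severity: Severity
    rearm_expression: Optional[str] = None    # the rearm expression is optional
    rearm_description: Optional[str] = None


# The example values from the table.
var_space_used = Condition(
    name="/var space used",
    resource_class="FileSystem",
    monitored_property="PercentTotUsed",
    monitored_resources=("/var",),
    event_expression="PercentTotUsed > 90",
    event_description="An event occurs when /var is more than 90% full.",
    severity=Severity.CRITICAL,
    rearm_expression="PercentTotUsed < 80",
    rearm_description="A rearm event occurs when /var is less than 80% full.",
)
```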

About Responses

A response consists of one or more actions to be performed by the system when an event or rearm event occurs for a condition. In the Monitoring application you can use the predefined responses or create new responses, and associate them with conditions as needed. You can associate multiple responses with one condition, and associate a single response with multiple conditions.

The responses for a condition remain deactivated until you start monitoring for that condition. When you select a condition to start monitoring, you need to activate at least one of its responses. The responses that are not active remain available to be used at another time. This allows you to use different responses for a condition as needed, without having to redefine them.
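The relationship between conditions and responses (many-to-many association, with at least one response activated when monitoring starts) can be sketched as follows. This is a hypothetical bookkeeping model only. MonitoringRegistry and its methods are invented for illustration, and the condition and response names are examples.

```python
class MonitoringRegistry:
    """Hypothetical bookkeeping for condition/response associations."""

    def __init__(self):
        self._links = {}     # condition name -> associated response names
        self._active = {}    # condition name -> currently active response names

    def associate(self, condition, response):
        self._links.setdefault(condition, set()).add(response)

    def start_monitoring(self, condition, responses):
        responses = set(responses)
        if not responses:
            raise ValueError("activate at least one response")
        unknown = responses - self._links.get(condition, set())
        if unknown:
            raise ValueError(f"not associated with {condition!r}: {sorted(unknown)}")
        self._active[condition] = responses

    def active(self, condition):
        return set(self._active.get(condition, set()))

    def inactive(self, condition):
        # Associated but not activated: still defined and available for later use.
        return self._links.get(condition, set()) - self.active(condition)


registry = MonitoringRegistry()
registry.associate("/var space used", "Work day response")
registry.associate("/var space used", "Weekend response")
registry.associate("/tmp space used", "Work day response")   # one response, two conditions

registry.start_monitoring("/var space used", ["Work day response"])
```

After the last call, the weekend response stays associated with the condition but inactive, so it can be activated later without being redefined.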

Predefined responses are provided with the Monitoring application. To create a new response you will need to set the following response and action components:


| Response Component | Description | Example |
| --- | --- | --- |
| Response name | The name you want to give the response. | Response for critical conditions |
| Actions | One or more actions to be taken as part of the response. | Log events to a file |

| Action Component | Description | Example |
| --- | --- | --- |
| Action name | The name of an action to be taken as part of the response. | Send e-mail to the operator |
| When in effect | The days and times when this action is to be used to respond to the condition. | 08:00 - 17:00 Monday-Friday |
| Use for event, rearm event, or both | Whether the action is to be used to respond to an event, a rearm event, or both. | Event |
| Command | The command to be run when an event or rearm event occurs. | A recovery script |

You can associate multiple responses with a condition if you want to define different responses based on when the event occurs. For example, you might have a work day response and a weekend response, each containing one or more actions. Consider how you might respond to a /var space used condition with the following responses. During working hours, you might want to e-mail the operator, run a command, and broadcast a message to users who are logged on. During weekend hours, you might want to e-mail the system administrator and log a message to a file.
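The work day and weekend example can be modeled by giving each action the days and times when it is in effect and the kind of event it responds to. The Python sketch below is an illustration under stated assumptions, not the product's scheduling logic: the Action class, the half-open time window, the weekend actions responding to both events and rearm events, and the sample dates are all choices made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Action:
    name: str
    days: frozenset     # weekday numbers, Monday=0 ... Sunday=6
    start: time
    end: time
    use_for: str        # "event", "rearm", or "both"

    def applies(self, kind, when):
        """True if this action is in effect at `when` for this kind of event."""
        in_window = when.weekday() in self.days and self.start <= when.time() < self.end
        return in_window and self.use_for in (kind, "both")


WEEKDAYS = frozenset(range(0, 5))
WEEKEND = frozenset({5, 6})

actions = [
    Action("Send e-mail to the operator", WEEKDAYS, time(8, 0), time(17, 0), "event"),
    Action("Broadcast a message to logged-on users", WEEKDAYS, time(8, 0), time(17, 0), "event"),
    Action("Send e-mail to the system administrator", WEEKEND, time(0, 0), time(23, 59, 59), "both"),
    Action("Log a message to a file", WEEKEND, time(0, 0), time(23, 59, 59), "both"),
]


def actions_for(kind, when):
    """Names of the actions that would run for an event of `kind` at `when`."""
    return [a.name for a in actions if a.applies(kind, when)]


wednesday_morning = datetime(2024, 1, 3, 10, 30)   # a Wednesday
saturday_night = datetime(2024, 1, 6, 22, 0)       # a Saturday
```

Calling actions_for("event", wednesday_morning) selects the two work day actions, while the same event on saturday_night selects the weekend actions instead.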

How Conditions and Responses Work Together

Once monitoring for the condition begins, the system evaluates the event expression to see if it is true. When the event expression becomes true, an event occurs that automatically notifies all of the associated event responses, which causes each event response to run its defined actions.

The event expression and the rearm expression work together as follows when a condition is monitored. First, the event expression is evaluated. When it becomes true, an event occurs, the actions defined for the event are taken, and the system begins evaluating the rearm expression instead. When the rearm expression becomes true, a rearm event occurs, which automatically starts the actions defined for the rearm event, and the system returns to evaluating the event expression.

The following illustration shows this cycle:



When an event occurs for the monitored condition, the actions for the event are automatically taken. If the condition has a rearm expression, a rearm event causes the rearm actions to be taken.

The following illustration shows these interactions:



Security Considerations

A root user can perform all Monitoring tasks:

A non-root user can perform only the following Monitoring tasks:

For a complete description of the underlying components of the RSCT Resource Monitoring and Control application, see Components Provided for Monitoring.

