
Resource Monitoring and Control Guide and Reference


Components Provided for Monitoring

Resources have identifiable attributes (also called properties in Web-based System Manager) that can be expressed so that certain conditions of interest to the system administrator can be observed. Predefined thresholds can be set for conditions, and responses can be defined and associated with these conditions. When these thresholds are met, an event is generated, and the actions associated with the condition are run. Predefined conditions and responses can be used as is or as templates for defining the conditions most appropriate for your installation.

The major components of RSCT Resource Monitoring and Control are the Resource Monitoring and Control (RMC) subsystem and certain resource managers. These are described in the following sections.

Note:
Monitoring currently takes place on a single system. Commands that cause changes can only be executed by root.

Resource Monitoring and Control Subsystem

The Resource Monitoring and Control subsystem (RMC subsystem) monitors and queries resources. The RMC daemon manages an RMC session and recovers from communications failures.

The RMC subsystem is used by its clients to monitor the state of system resources and to send commands to resource managers. The RMC subsystem acts as a broker between the client processes that use it and the resource manager processes that control resources.


Resource Managers

A resource manager is a process that maps resource and resource-class abstractions into calls and commands for one or more specific types of resources. A resource manager is a stand-alone daemon. The resource manager contains definitions of all resource classes that the resource manager supports. A resource class definition includes a description of all attributes, actions, and other characteristics of a resource class.

These resource classes are accessible and their properties can be manipulated by the user through Web-based System Manager or through the command line.

See the RMC and ERRM commands to access the resource classes and manipulate their attributes through the command line interface.

Note:
Attributes and characteristics of a resource class are referred to as properties in Web-based System Manager.

The following resource managers are provided:


Audit Log Resource Manager

The Audit Log subsystem is implemented as a resource manager within the RMC subsystem. It has two resource classes, IBM.AuditLog for subsystem definitions and IBM.AuditLogTemplate for audit-log-template definitions. Entries in the audit log are called records. Records can be added, retrieved, and removed through actions on a specific subsystem or on the subsystem class. The template definition class contains a description of each record type that a subsystem can add to the audit log. The template definition contains the data type, a descriptive message, and other information for each subsystem-specific field within the record.

There are typically two types of clients for the audit-log subsystem: subsystems that need to add records to the audit log, and users who extract records from the audit log through the command line or the Web-based System Manager interface.

The formatted message for each record provides a concise description of the situation and allows a user to easily see at a high level what has been happening on the system.

Audit Log Resource Class

Each resource of this class represents a subsystem that will be adding records to the audit log. A resource of this class must be added before the subsystem can add records to the audit log. The resource can be added as part of the installation of the subsystem or at runtime.

The following properties can be monitored for this resource class:

RecordsAdded
Reflects the current number of records in the audit log. Whenever records are added to the audit log, this value is updated.

RecordsRemoved
Conveys which records have been removed. The following data elements comprise the value of this attribute:

RecordCount
Reflects the total number of records in the audit log after the records identified by SeqNumRanges have been removed.

SeqNumCount
Reflects the total number of elements in the SeqNumRanges array. The number of ranges in that array is actually SeqNumCount/2.

SeqNumRanges
Each consecutive pair of CT_INT64 integers defines an inclusive range of sequence numbers of records that have been deleted.

AuditLogSize
Reflects the amount of disk space in bytes that the audit log uses.
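The relationship between SeqNumCount and SeqNumRanges described above can be illustrated with a short Python sketch. It is not part of RSCT; the function names are hypothetical, and the input is assumed to be the flat array of sequence numbers already extracted from the RecordsRemoved value.

```python
from typing import List, Tuple


def decode_seq_num_ranges(seq_num_ranges: List[int]) -> List[Tuple[int, int]]:
    """Pair up a flat SeqNumRanges array into inclusive (first, last) ranges.

    SeqNumCount is the length of the flat array, so the number of ranges
    is SeqNumCount / 2.
    """
    if len(seq_num_ranges) % 2 != 0:
        raise ValueError("SeqNumRanges must hold an even number of elements")
    return [
        (seq_num_ranges[i], seq_num_ranges[i + 1])
        for i in range(0, len(seq_num_ranges), 2)
    ]


def count_removed(ranges: List[Tuple[int, int]]) -> int:
    """Count the records covered by the inclusive ranges."""
    return sum(last - first + 1 for first, last in ranges)


ranges = decode_seq_num_ranges([10, 14, 20, 20])
print(ranges)                 # [(10, 14), (20, 20)]
print(count_removed(ranges))  # 6
```

Because each range is inclusive, a pair such as (20, 20) describes a single removed record.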

Audit Log Template Resource Class

This resource class holds all audit log templates. An audit log template describes the information that exists in each audit log record that is based on the template. In addition, an audit log template contains information on how to present records that use the template to an end user. Each template corresponds to a resource within this class. The attributes of this resource class are internal.


Event Response Resource Manager

The system administrator interacts with the Event Response resource manager (ERRM) through the Web-based System Manager or through the ERRM command-line interface.

When an event occurs, ERRM runs user-configured commands, which can include scripts provided by RSCT. A command and its attributes are a type of action, and many actions can be configured for a single Event Response resource. An action consists of a name, a command to be run, and other variables. You specify the range of times when the command is run (day, start time, and end time). If the condition occurs at a time outside the specified time ranges, the command is not run, and if all of the actions within this Event Response resource have the same time ranges, none of the commands are run. If no time ranges are specified, the command is always run. There are also event and rearm event flags that specify the events for which the command is run. Three settings are allowed: only the event flag set, only the rearm event flag set, or both flags set.
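The rules above for deciding whether an action's command runs can be sketched in Python. This is an illustration only, not ERRM code: the Action class, the should_run function, the weekday encoding (0 = Monday), and the command path are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import List, Tuple


@dataclass
class Action:
    name: str
    command: str
    # Hypothetical encoding: (weekday 0=Monday..6=Sunday, start time, end time).
    # An empty list means no time ranges were specified.
    time_ranges: List[Tuple[int, time, time]] = field(default_factory=list)
    on_event: bool = True   # run the command for events
    on_rearm: bool = False  # run the command for rearm events


def should_run(action: Action, weekday: int, now: time, is_rearm: bool) -> bool:
    """Apply the event/rearm flags first, then the time ranges."""
    wanted = action.on_rearm if is_rearm else action.on_event
    if not wanted:
        return False
    if not action.time_ranges:
        return True  # no time ranges: the command is always run
    return any(
        day == weekday and start <= now <= end
        for day, start, end in action.time_ranges
    )


notify = Action(
    name="notify-admin",
    command="/usr/local/bin/notify",  # hypothetical command
    time_ranges=[(0, time(8, 0), time(17, 0))],
)
print(should_run(notify, 0, time(9, 30), is_rearm=False))  # True
print(should_run(notify, 0, time(18, 0), is_rearm=False))  # False
print(should_run(notify, 0, time(9, 30), is_rearm=True))   # False
```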

The Event Response Resource Manager (ERRM) is automatically started when the RMC subsystem is started.

Although performance is important, ensuring that no events are lost and that the user's commands are executed is of greater importance. Other factors outside the control of ERRM may affect performance as well (for example, network load, system load, and the performance of other required subsystems).

The only userid that can define, undefine, and modify ERRM resources is root. All other users have read access to ERRM resources. Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. ERRM communicates only with other local subsystems on the same node.

Information is handled as follows:

There are three Event Response resource classes:

  1. Condition

    The Condition resource class contains the necessary information (event expression and rearm expression) for the ERRM to register with the RMC for event notifications that the administrator deems important. Conditions contain essential information such as: the resource attributes of the resource to be monitored, the event expression, and the optional rearm expression.

    Note:
    Resource attributes are called properties in Web-based System Manager terminology.

    Configuration of ERRM begins with the definition of a set of Condition resources. A Condition resource is registered with the RMC subsystem when the Condition resource is used in the definition of an active Association resource.

    Note:
    Registration with RMC is necessary for monitoring to take place. Registration does not occur when a new Condition resource is defined but rather when the resource is used in the definition of an active Association resource.
  2. Event Response

    An Event Response resource is configured by defining one or more actions. Each action contains the name of the action, a command, and other fields within the action attribute. The Event Response resource runs any number of configured commands when an event with an active association occurs. When an event occurs, all of the actions associated with its Event Response resource are evaluated to determine whether they should be run.

    Predefined responses are available to use and to serve as templates for your own responses. For a description of predefined responses and how to use them, see Predefined Responses. Scripts for notification and logging of events and for broadcasting messages to logged-in user consoles also are provided in the AIX Commands Reference.

    Note:
    Commands are run in parallel.

    See Getting Started with the Monitoring Application for specific task information on how to configure actions for Event Response resources and Event Response resources for Conditions.

  3. Association

    The Association resource class joins the Condition resource class together with the Event Response resource class. It contains a flag that indicates whether the association between the condition and the event response is active. Event Responses and Conditions are separate entities, but for monitoring to take place, they need to be associated. An event cannot occur unless at least one Event Response is associated with a Condition. You can configure one or more actions for an Event Response, and one or more Event Responses for a Condition.
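The relationship among the three resource classes can be modeled with a small Python sketch. It is a simplified illustration, not the ERRM data model: the class fields and the conditions_to_register function are hypothetical, and the response name and command are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Condition:
    name: str
    event_expression: str
    rearm_expression: str = ""  # the rearm expression is optional


@dataclass(frozen=True)
class EventResponse:
    name: str
    commands: Tuple[str, ...] = ()  # one command per configured action


@dataclass
class Association:
    condition: Condition
    response: EventResponse
    active: bool = False  # flag held by the Association resource


def conditions_to_register(associations: Iterable[Association]) -> Set[str]:
    """Only conditions used by an active association are registered with RMC."""
    return {a.condition.name for a in associations if a.active}


var_full = Condition("/var space used", "PercentTotUsed > 90", "PercentTotUsed < 85")
tmp_full = Condition("/tmp space used", "PercentTotUsed > 90", "PercentTotUsed < 85")
log_it = EventResponse("log-event", ("/usr/local/bin/log-event",))  # hypothetical

associations = [
    Association(var_full, log_it, active=True),
    Association(tmp_full, log_it, active=False),
]
print(conditions_to_register(associations))  # {'/var space used'}
```

Defining tmp_full alone does not cause any monitoring; in this model it would be registered only after one of its associations is made active.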

See Getting Started with the Monitoring Application for information on how to get started using the capabilities of the Event Response resource manager to monitor your system.


File System Resource Manager

The File System resource manager (FSRM) manages file systems. It can provide the following information:

There is one File System resource manager (FSRM) on a node. It is started implicitly by the RMC subsystem.

To enforce security, only root can start the FSRM resource manager (although it is strongly recommended that the FSRM resource manager not be started manually). Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. The FSRM communicates only with other local subsystems on the same node and with the RMC subsystem. The FSRM has no direct contact with clients.

Information is handled as follows:

These properties of a file system resource can be monitored:

OpState
Monitors whether the current file system operational state is online (mounted) or offline (unmounted).

PercentTotUsed
Represents the percentage of space that is used in a specific file system so that preventive action can be taken if the amount used is approaching a predefined threshold. For example, /tmp PercentTotUsed, /var PercentTotUsed.

PercentINodeUsed
Represents the percentage of i-nodes that are in use for a specific file system; for example, /tmp PercentINodeUsed.

Predefined Conditions for Monitoring File Systems

The following table shows the predefined conditions and examples of expressions that are used to monitor the file system:


| Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Monitored Resources |
|---|---|---|---|---|---|
| File system state | OpState != 1 | An event is generated when any file system goes offline. | OpState == 1 | The event is rearmed when any file system comes back online. | all |
| File system i-nodes used | PercentINodeUsed > 90 | An event is generated when more than 90% of the total i-nodes in any file system are in use. | PercentINodeUsed < 85 | The event is rearmed when the percentage of i-nodes used in the file system falls below 85%. | all |
| File system space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space of any file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the file system falls below 85%. | all |
| /tmp space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /tmp file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /tmp file system falls below 85%. | /tmp |
| /var space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /var file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /var file system falls below 85%. | /var |
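The interplay between an event expression and its rearm expression can be illustrated with a short Python simulation. This is a sketch of the general event/rearm pattern under the assumption that, once an event has been generated, no further event is generated until the rearm expression has become true; the monitor function and its parameters are hypothetical and are not an RMC interface. The thresholds mirror the PercentTotUsed conditions in the table above.

```python
from typing import List, Tuple


def monitor(samples: List[float],
            event_threshold: float = 90,
            rearm_threshold: float = 85) -> List[Tuple[int, str]]:
    """Return (sample index, 'event' or 'rearm') for a series of observations.

    Event expression:  value > event_threshold
    Rearm expression:  value < rearm_threshold
    """
    armed = True
    notifications = []
    for index, value in enumerate(samples):
        if armed and value > event_threshold:
            notifications.append((index, "event"))
            armed = False  # wait for the rearm expression before firing again
        elif not armed and value < rearm_threshold:
            notifications.append((index, "rearm"))
            armed = True
    return notifications


print(monitor([80, 91, 95, 88, 84, 92]))
# [(1, 'event'), (4, 'rearm'), (5, 'event')]
```

The gap between the two thresholds keeps a value that hovers around 90% from generating a stream of repeated events.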


Host Resource Manager

The Host resource manager allows system resources for an individual machine to be monitored, particularly resources related to operating-system load and status.

The Host resource manager is started implicitly by the RMC subsystem only when a property of a Host resource class is first monitored (thus cutting down on performance overhead).

Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. The Host resource manager runs as root. No security audits are generated, no encryption mechanisms are used, and there is no communication outside the node. The RMC daemon detects any authentication or authorization failures. All interprocess communication is accomplished through pipes and shared memory.

Information is handled as follows:

The Host resource manager consumes minimal system resources during normal operation. This is because the following approaches have been implemented:

  1. Memory, CPU, and other system resources are not consumed for properties that are not monitored. If no properties are monitored, the Host resource manager is not started.
  2. To minimize disk access, information is maintained in memory as much as possible.
  3. The sampling of property values is aligned as much as possible to minimize the sampling overhead, in particular, thread or process context swaps.

The Host resource manager has the following resource classes that you can use to monitor system resources:

Host (IBM.Host)
This resource class externalizes the properties of a machine that is running a single copy of an operating system. Primarily the properties included are those that are advantageous in predicting or indicating when corrective action needs to be taken. See Host for more details.

Paging Device (IBM.PagingDevice)
This resource class externalizes the properties of paging devices. See Paging Device for more details.

Physical Volume (IBM.PhysicalVolume)
This resource class externalizes many properties of disks, such as the number of I/Os, wait time, etc. See Physical Volume for more details.

Processor (IBM.Processor)
This resource class externalizes the properties of individual processors, such as idle time and time spent in the kernel. See Processor for more details.

Program (IBM.Program)
This resource class allows a client to monitor properties of a program that is executing on a host. The program to monitor is identified by properties such as program name, arguments, etc. The resource class does not monitor processes as such because processes are very transient and therefore inefficient to monitor individually. See Program for more details.

Each type of adapter that is supported has its own resource class as follows:

ATM Device (IBM.ATMDevice)
All ATM adapters installed in a node are externalized through this resource manager. See ATM Device for more details.

Ethernet Device (IBM.EthernetDevice)
All Ethernet adapters installed in a node are externalized through this resource manager. See Ethernet Device for more details.

FDDI Device (IBM.FDDIDevice)
All FDDI adapters installed in a node are externalized through this resource manager. See FDDI Device for more details.

Token-Ring Device (IBM.TokenRingDevice)
All Token-Ring adapters installed in a node are externalized through this resource manager. See Token-Ring Device for more details.

Host

The program name of this resource class is IBM.Host. It allows the following resources of a host system to be monitored:

  1. Processes in the run queue of the operating system scheduler (see Monitoring the Operating System Scheduler).
  2. Global state of active paging spaces (see Monitoring the Global State of Active Paging Space).
  3. Total processor utilization across all active processors in the system (see Monitoring Processor Utilization).
  4. Real, virtual, and kernel memory utilization (see Memory Management).

Monitoring the Operating System Scheduler

The operating system scheduler maintains a run queue of all of the processes that are ready to be dispatched. Each second, the process table is scanned to determine which processes are ready to run. If one or more processes are ready, they are placed on the run queue, and a counter is incremented. The counter is used to compute the value of the ProcRunQueue variable as the average number of ready-to-run processes. The scheduler also scans the process table for processes that are inactive because they are waiting to be paged in. A swapped process may (or may not) have some or all of its pages moved to the swap (page) device. As with the ProcRunQueue variable, the system increments a counter for swapped processes, which is used to compute the value of the ProcSwapQueue variable as the average number of processes swapped out. A process must be paged in and marked non-swapped before it can be placed on the run queue for execution. These properties can be monitored:

ProcRunQueue
Average number of processes that are waiting for the processor.

ProcSwapQueue
Average number of processes that are waiting to be paged in.

Predefined Conditions for Monitoring the Operating System Scheduler

The following table shows the predefined conditions that are available for monitoring the operating system scheduler, and example expressions:

| Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
|---|---|---|---|---|
| Processes in run queue | (ProcRunQueue - ProcRunQueue@P) >= (ProcRunQueue@P * 0.5) | An event is generated each time the average number of processes on the run queue has increased by 50% or more between observations. | ProcRunQueue < 50 | The event is rearmed when the run queue length drops below 50. |
| Processes in swap queue | (ProcSwapQueue > 50) && (ProcSwapQueue@P > 50) | An event is generated each time two consecutive observations find more than 50 processes in the swap queue. | (ProcSwapQueue < 40) && (ProcSwapQueue@P < 40) | The event is rearmed when the number of processes in the swap queue drops below 40 for two consecutive observations. |
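In these expressions, the @P suffix refers to the previous observation of the property. The following Python sketch evaluates the "Processes in run queue" event expression over a series of observations. It is illustrative only: the function name is hypothetical, and rearm handling is deliberately left out so that the sketch shows just the comparison against the previous value.

```python
from typing import List


def run_queue_event_indexes(observations: List[float]) -> List[int]:
    """Return the indexes at which the event expression is true:

    (ProcRunQueue - ProcRunQueue@P) >= (ProcRunQueue@P * 0.5)
    """
    hits = []
    for i in range(1, len(observations)):
        previous = observations[i - 1]  # plays the role of ProcRunQueue@P
        current = observations[i]       # plays the role of ProcRunQueue
        if (current - previous) >= (previous * 0.5):
            hits.append(i)
    return hits


print(run_queue_event_indexes([10.0, 12.0, 18.0, 18.0, 30.0]))  # [2, 4]
```

The first observation has no previous value, so the sketch starts comparing at the second one.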

Monitoring the Global State of Active Paging Space

A paging space is fixed disk storage for information that is resident in virtual memory but is not currently being accessed. A paging space, or swap space, is a logical volume with the attribute type equal to paging. When the amount of free real memory in the system is low, programs or data that have not been used recently are moved from real memory to paging space to release real memory for other processes. The amount of paging space required depends upon the types of activities performed on the system. If paging space runs low, processes may be lost, and if paging space runs out, the system may panic. Paging-space shortage may cause memory performance degradation, and thrashing can occur (if VMM memory load control is turned off).

The system monitors the number of free paging-space blocks and detects when a paging-space shortage exists. When the number of free paging-space blocks falls below a threshold known as the paging-space warning level, the system informs all processes except kernel processes (kprocs) of this condition by sending the SIGDANGER signal. If the shortage continues and the number of free blocks falls below a second threshold known as the paging-space terminate level, the system sends the SIGKILL signal to processes that are the major users of paging space and that do not have a signal handler for the SIGDANGER signal.

The warning-level and terminate-level thresholds can be obtained and altered with the vmtune command (npswarn and npskill parameters, respectively). Processes executing in the early allocation environment avoid receiving the SIGKILL signal if a low paging-space condition occurs. If the PSALLOC environment variable is set to early when a program starts, paging space is reserved at the time the process makes a memory request. If there is insufficient paging space, the early allocation algorithm used by the operating system causes the memory request to be unsuccessful. If the PSALLOC environment variable is not set, or is set to any value other than early, the operating system uses a late allocation algorithm for memory and paging-space allocation. Late allocation does not reserve paging space at the time the memory is requested but defers the reservation until the pages are touched.
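The PSALLOC rule described above can be expressed as a small Python helper. This is only a sketch of the decision stated in the text, with a hypothetical function name; it inspects an environment mapping and does not query the operating system's actual allocation state.

```python
import os
from typing import Mapping, Optional


def paging_allocation_policy(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return 'early' only when PSALLOC is exactly 'early'; otherwise 'late'."""
    env = os.environ if environ is None else environ
    return "early" if env.get("PSALLOC") == "early" else "late"


print(paging_allocation_policy({"PSALLOC": "early"}))  # early
print(paging_allocation_policy({}))                    # late
print(paging_allocation_policy({"PSALLOC": "other"}))  # late
```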

Note:
The VMM is a complex system, and paging-space requirements depend on a number of factors, including the paging-space allocation policy used, amount of real memory, and type of activities performed on the system. A thorough understanding of system paging requirements and operating system memory management is recommended before attempting to alter VMM operating parameters.

These properties monitor the global state of all active paging spaces defined in the system (including NFS-mounted paging spaces):

TotalPgSpSize
Holds the total size of all active paging-space devices in the system.

TotalPgSpFree
Represents the size (in 4KB pages) of available paging space for all active paging-space devices in the system.

PctTotalPgSpUsed
Represents the percentage of paging space in use for all active paging-space devices in the system.

PctTotalPgSpFree
Represents the percentage of free paging space available for all active paging-space devices in the system.

Predefined Conditions for Monitoring Global State of Active Paging Space

The following table shows the predefined conditions that are available for monitoring paging space, and example expressions:

| Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
|---|---|---|---|---|
| Paging active space | TotalPgSpSize != TotalPgSpSize@P | An event is generated whenever the total amount of active paging space changes. | None | None |
| Paging free space | TotalPgSpFree <= 2560 | An event is generated when the VMM is within 2MB (512 4KB pages) of reaching the paging space warning level. | TotalPgSpFree > 2560 | The event is rearmed when the free paging space total becomes greater than the same threshold. |
| Paging percent space used | PctTotalPgSpUsed > 90 | An event is generated when more than 90% of the total paging space is in use. | PctTotalPgSpUsed < 85 | The event is rearmed when the percentage falls below 85%. |
| Paging percent space free | PctTotalPgSpFree < 10 | An event is generated when the total amount of free paging space falls below 10%. | PctTotalPgSpFree > 15 | The event is rearmed when the free paging space rises above 15%. |
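The page counts used in these conditions are easier to interpret once converted to megabytes. The following Python sketch does the arithmetic for 4KB pages. The function names are hypothetical, and the percentage helper assumes that the total and free sizes are both expressed in 4KB pages, which the property descriptions state only for TotalPgSpFree.

```python
PAGE_SIZE_KB = 4  # paging-space blocks are 4KB pages


def pages_to_mb(pages: int) -> float:
    """Convert a count of 4KB pages to megabytes."""
    return pages * PAGE_SIZE_KB / 1024


def pct_free(total_pages: int, free_pages: int) -> float:
    """Percentage of paging space that is free (assumes both values are in pages)."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return 100.0 * free_pages / total_pages


print(pages_to_mb(512))        # 2.0
print(pages_to_mb(2560))       # 10.0
print(pct_free(131072, 2560))  # 1.953125
```

So the "Paging free space" threshold of 2560 pages corresponds to 10MB of free paging space, and 512 pages correspond to the 2MB margin mentioned in the event description.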

Monitoring Processor Utilization

The values represented for this attribute reflect total processor utilization across all of the active processors in a system.

The idle and wait states of a processor are monitored, and the time spent running in protection mode is monitored. At each clock tick, an array of counters is incremented to reflect processor activity based on the state of the current running processes. The PctTotalTimeKernel, PctTotalTimeUser, PctTotalTimeWait, and PctTotalTimeIdle properties provide the approximate average percentage of time all active processors are currently spending in each state. Therefore, the sum of these values is 100 at any given observation.

Processes run in one of two protection modes: kernel (or system) level and user level. Processes running in kernel mode run with kernel privileges and have access to kernel data. These processes include kernel processes (kprocs) and services (such as system calls and device drivers).

Processes running in user mode are normal applications with user level privileges and run in their own unique process space. When a user level process invokes a kernel service, for example, by making a system call, a mode switch occurs that causes the process to run in kernel mode while the service is running.

When the current running process makes a request that cannot be immediately satisfied, such as an I/O operation, the process is put into wait state. A processor is considered idle when the current running process is the wait process. The wait process is a kernel process (kproc) that is dispatched when no other processes are ready to run.

These properties can be monitored:

PctTotalTimeIdle
Represents the system-wide percentage of time that the processors are idle.

PctTotalTimeKernel
Represents the system-wide percentage of time that the processors are running in kernel mode.

PctTotalTimeUser
Represents the system-wide percentage of time that the processors are running in user mode.

PctTotalTimeWait
Represents the system-wide percentage of time that the processors are in wait state.
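The way per-state tick counters translate into the four percentages can be sketched in Python. This is an illustration under assumptions, not the Host resource manager's implementation: the counter snapshots are plain dictionaries with hypothetical keys, and the percentages are computed from the difference between two snapshots.

```python
from typing import Dict

STATES = ("idle", "kernel", "user", "wait")


def utilization_percentages(before: Dict[str, int],
                            after: Dict[str, int]) -> Dict[str, float]:
    """Compute the share of clock ticks spent in each state between two snapshots."""
    deltas = {state: after[state] - before[state] for state in STATES}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no clock ticks elapsed between the snapshots")
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


before = {"idle": 1000, "kernel": 200, "user": 700, "wait": 100}
after = {"idle": 1500, "kernel": 300, "user": 1000, "wait": 200}
pct = utilization_percentages(before, after)
print(pct)  # {'idle': 50.0, 'kernel': 10.0, 'user': 30.0, 'wait': 10.0}
print(sum(pct.values()))  # 100.0
```

The four results correspond to PctTotalTimeIdle, PctTotalTimeKernel, PctTotalTimeUser, and PctTotalTimeWait, and they add up to 100 because every tick is attributed to exactly one state.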

Predefined Conditions for Monitoring Processor Utilization

The following table shows the predefined conditions that are available for monitoring system-wide processor idle time, and example expressions:

| Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
|---|---|---|---|---|
| Processor idle time | PctTotalTimeIdle >= 70 | An event is generated when the average time all processors are idle is at least 70% of the time. | PctTotalTimeIdle < 10 | The event is rearmed when the idle time decreases below 10%. |
| Processor kernel time | PctTotalTimeKernel >= 70 | An event is generated when the average time all processors are executing in kernel mode is at least 70% of the time. | PctTotalTimeKernel < 10 | The event is rearmed when the kernel time decreases below 10%. |
| Processor user time | PctTotalTimeUser >= 70 | An event is generated when the average time all processors are executing in user mode is at least 70% of the time. | PctTotalTimeUser < 10 | The event is rearmed when the user time decreases below 10%. |
| Processor wait time | PctTotalTimeWait >= 50 | An event is generated when the average time all processors are waiting on I/O is at least 50% of the time. | PctTotalTimeWait < 10 | The event is rearmed when the wait time decreases below 10%. |

Memory Management

The VMM (Virtual Memory Manager) manages the allocation of real memory page frames, resolves references to virtual memory pages that are not currently in real memory (or do not yet exist), and manages the reading and writing of pages to disk storage.

The VMM maintains a list of free page frames that it uses to accommodate page faults. A page fault occurs when a page that is not in real memory is referenced. In most environments, the VMM must occasionally add to the free list by reassigning some page frames owned by running processes. The virtual-memory pages whose page frames are to be reassigned are selected by the VMM's page-replacement algorithm, which takes into consideration the segment type, statistics regarding rate of reoccurring page faults, and user-tunable thresholds. The number of frames reassigned to the free list is also determined by VMM thresholds.

Memory regions defined in either system or user space may be pinned. Pinning a memory region prohibits the pager from stealing pages from the pages backing the pinned memory region. After a memory region is pinned, accessing that region does not result in a page fault until the region is subsequently unpinned. While a portion of the kernel remains pinned, many regions are pageable and are only pinned while being accessed.

Thresholds used by the VMM include the minimum and maximum number of pages to be maintained on the free list (minfree and maxfree). These thresholds are used to determine when the VMM should start or stop stealing pages to replenish the free list. There is also a maximum percentage of real memory that may be pinned. The values of these thresholds may be queried or altered using the system command vmtune.

Virtual memory is partitioned into fixed-size units called pages. Each page may be in real memory (RAM) or stored on disk until needed. Real memory is partitioned into units that are equal in size to virtual pages and are referred to as page frames. To accommodate a large virtual memory space with a limited real memory space, the system uses real memory for work space and maps inactive data and programs to disk.

Pages of a virtual address space are considered to be persistent or working. Persistent pages have permanent storage locations on disk. Data files or executable programs are mapped to persistent pages. Since persistent pages have a permanent storage location, the VMM can write a changed page back to its permanent location or simply free the page frame if it was not altered and re-read the page on a subsequent request.

Working pages are transitory and exist only during their use by a process. Examples are process stack and data regions, kernel and kernel-extension text regions, and shared-library text and data regions. Working pages also require disk storage locations when they cannot be kept in real memory. Disk paging space is used for this purpose.

The operating system provides routines used by the kernel and by services executing at system level for allocating memory in kernel space. Counters are maintained in the kernel to track requests and use of kernel memory, based on the type of data structure or service. These properties can be used to monitor the number and size and the state of requests for buffers allocated in kernel memory. The types of kernel memory available are:

The following properties are available for monitoring real and virtual memory and kernel memory. The <x> in the names below refers to the type of kernel memory allocation as shown in the preceding bulleted list (28 possible monitors).

PctRealMemFree
Represents the percentage of real page frames that are currently available on the VMM free list.

PctRealMemPinned
Represents the percentage of real page frames that are currently pinned and cannot be paged out.

RealMemFramesFree
Represents the number of real page frames that are currently available on the VMM free list.

VMPgInRate
Represents the rate (in pages per second) that the VMM is reading both persistent and working pages from disk storage.

VMPgOutRate
Represents the rate (in pages per second) that the VMM is writing both persistent and working pages to disk storage.

VMPgFaultRate
Represents the average rate of page faults that occur per second.

VMPgSpInRate
Represents the rate (in pages per second) that the VMM is reading working pages from paging-space disk storage.

VMPgSpOutRate
Represents the rate (in pages per second) that the VMM is writing working pages to paging-space disk storage.

KMemReq<x>Rate
Represents the rate of requests per second for a kernel memory buffer of type <x>.

KMemFail<x>Rate
Represents the rate of requests per second for a kernel memory buffer of type <x> that were unsuccessful.

KMemNum<x>
Represents the number of kernel memory buffers of type <x> that are currently in use.

KMemSize<x>
Represents the amount, in bytes, of kernel memory buffers of type <x> that are currently in use.
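The <x> placeholder in the four KMem property names can be expanded mechanically, as the following Python sketch shows. The helper name is hypothetical, and the three type names used here (Mbuf, Sock, and Protcb) are simply the ones that appear in the predefined condition expressions in this guide; the full set of kernel memory types is not reproduced.

```python
from typing import Iterable, List

# The four name patterns listed above, with {x} standing in for <x>.
KMEM_PATTERNS = ("KMemReq{x}Rate", "KMemFail{x}Rate", "KMemNum{x}", "KMemSize{x}")


def kmem_property_names(types: Iterable[str]) -> List[str]:
    """Expand every pattern for every kernel memory type."""
    return [pattern.format(x=t) for t in types for pattern in KMEM_PATTERNS]


names = kmem_property_names(["Mbuf", "Sock", "Protcb"])
print(len(names))  # 12
print(names[:4])
# ['KMemReqMbufRate', 'KMemFailMbufRate', 'KMemNumMbuf', 'KMemSizeMbuf']
```

With four patterns per type, the 28 possible monitors mentioned above would correspond to seven kernel memory types.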

Predefined Conditions for Memory Management

The following table shows the predefined conditions that are available for monitoring memory management, and example expressions:

Condition Name Event Expression Event Description Rearm Expression Rearm Description

Real memory free

PctRealMemFree < 5

An event is generated when the percentage of real page frames that are free falls below 5%.

PctRealMemFree> 10

The event is rearmed when the percentage of free frames exceeds 10%.

Real memory pinned

PctRealMemPinned > 75

An event is generated when the percentage of real page frames that are pinned exceeds 75%.

PctRealMemPinned < 70

The event is rearmed when the percentage falls below 70%.

Real memory free frames

PctMemFramesFree < 120

An event is generated when the number of free real page frames falls below 120.

PctMemFramesFree> 150

The event is rearmed when the number free exceeds 150.

Page in rate

VMPgInRate > 500

An event is generated when the rate of pages read by the VMM for both persistent and working pages exceeds 500 per second.

VMPgInRate < 400

The event is rearmed when the rate drops below 400.

Page out rate

VMPgOutRate > 500

An event is generated when the rate of pages written by the VMM for both persistent and working pages exceeds 500 per second.

VMPgOutRate < 400

The event is rearmed when the rate drops below 400.

Page fault rate

VMPgFaultRate > 500

An event is generated when there are more than 500 page faults per second.

VMPgFaultRate < 400

The event is rearmed when the rate drops below 400 page faults per second.

Page space in rate

VMPgSpInRate > 500

An event is generated when more than 500 pages per second are read by the VMM from paging space devices (working pages only).

VMPgSpInRate < 400

The event is rearmed when the rate drops to less than 400 pages per second.

Page space out rate

VMPgSpOutRate > 500

An event is generated when more than 500 pages per second are written by the VMM to paging space devices (working pages only).

VMPgSpOutRate < 400

The event is rearmed when the rate drops to less than 400 pages per second.

Kernel Mbuf rate

KMemReqMbufRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Mbuf> (network data buffer) exceeds 5000 per second.

KMemReqMbufRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel socket buffer rate

KMemReqSockRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Socket> (kernel socket structure) exceeds 5000 per second.

KMemReqSockRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel protocol CB rate

KMemReqProtcbRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Protcb> (Protocol Control Block) exceeds 5000 per second.

KMemReqProtcbRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel other IP CB rate

KMemReqOtherIPRate > 5000

An event is generated when the number of requests for a kernel buffer of type <OtherIP> (other buffers used by IP) exceeds 5000 per second.

KMemReqOtherIPRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel Mblk rate

KMemReqMblkRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Mblk> (stream header and data) exceeds 5000 per second.

KMemReqMblkRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel streams buffer rate

KMemReqStreamsRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Streams> (other streams related memory) exceeds 5000 per second.

KMemReqStreamsRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel other memory rate

KMemReqOtherRate > 5000

An event is generated when the number of requests for a kernel buffer of type <Other> (other kernel memory) exceeds 5000 per second.

KMemReqOtherRate < 4000

The event is rearmed when the rate falls below 4000 per second.

Kernel Mbuf failed rate

KMemFailMbufRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Mbuf> (network data buffer) exceeds 10 per second.

KMemFailMbufRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel socket buffer failed rate

KMemFailSockRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Socket> (kernel socket structure) exceeds 10 per second.

KMemFailSockRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel protocol CB failed rate

KMemFailProtcbRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Protcb> (Protocol Control Block) exceeds 10 per second.

KMemFailProtcbRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel other IP CB failed rate

KMemFailOtherIPRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <OtherIP> (other buffers used by IP) exceeds 10 per second.

KMemFailOtherIPRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel Mblk failed rate

KMemFailMblkRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Mblk> (stream header and data) exceeds 10 per second.

KMemFailMblkRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel streams buffer failed rate

KMemFailStreamsRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Streams> (other streams related memory) exceeds 10 per second.

KMemFailStreamsRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel other memory failed rate

KMemFailOtherRate > 10

An event is generated when the number of failures of requests for a kernel buffer of type <Other> (other kernel memory) exceeds 10 per second.

KMemFailOtherRate < 5

The event is rearmed when the rate falls below 5 per second.

Kernel Mbufs

KMemNumMbuf > 10000

An event is generated when the allocated number of kernel buffers of type <Mbuf> (network data buffer) exceeds 10000.

KMemNumMbuf < 9000

The event is rearmed when the number falls below 9000.

Kernel socket buffers

KMemNumSock > 10000

An event is generated when the allocated number of kernel buffers of type <Socket> (kernel socket structure) exceeds 10000.

KMemNumSock < 9000

The event is rearmed when the number falls below 9000.

Kernel protocol CBs

KMemNumProtcb > 10000

An event is generated when the allocated number of kernel buffers of type <Protcb> (Protocol Control Block) exceeds 10000.

KMemNumProtcb < 9000

The event is rearmed when the number falls below 9000.

Kernel other IP CBs

KMemNumOtherIP > 10000

An event is generated when the allocated number of kernel buffers of type <OtherIP> (other buffers used by IP) exceeds 10000.

KMemNumOtherIP < 9000

The event is rearmed when the number falls below 9000.

Kernel Mblk buffers

KMemNumMblk > 10000

An event is generated when the allocated number of kernel buffers of type <Mblk> (stream header and data) exceeds 10000.

KMemNumMblk < 9000

The event is rearmed when the number falls below 9000.

Kernel stream buffers

KMemNumStreams > 10000

An event is generated when the allocated number of kernel buffers of type <Streams> (other streams related memory) exceeds 10000.

KMemNumStreams < 9000

The event is rearmed when the number falls below 9000.

Kernel other memory

KMemNumOther > 10000

An event is generated when the allocated number of kernel buffers of type <Other> (other kernel memory) exceeds 10000.

KMemNumOther < 9000

The event is rearmed when the number falls below 9000.

Kernel Mbufs size

KMemSizeMbuf > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Mbuf> (network data buffer) exceeds 64MB.

KMemSizeMbuf < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel socket buffers size

KMemSizeSock > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Socket> (kernel socket structure) exceeds 64MB.

KMemSizeSock < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel protocol CBs size

KMemSizeProtcb > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Protcb> (Protocol Control Block) exceeds 64MB.

KMemSizeProtcb < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel other IP CBs size

KMemSizeOtherIP > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <OtherIP> (other buffers used by IP) exceeds 64MB.

KMemSizeOtherIP < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel Mblks size

KMemSizeMblk > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Mblk> (stream header and data) exceeds 64MB.

KMemSizeMblk < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel streams buffers size

KMemSizeStreams > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Streams> (other streams related memory) exceeds 64MB.

KMemSizeStreams < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.

Kernel other memory size

KMemSizeOther > 0x4000000

An event is generated when the total space occupied by kernel buffers of type <Other> (other kernel memory) exceeds 64MB.

KMemSizeOther < 0x2000000

The event is rearmed when the allocated amount drops below 32MB.
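
The event and rearm expressions in the table above work as a pair: after an event is generated, the condition stays quiet until its rearm expression becomes true. The following Python sketch simulates this behavior for the "Real memory free" condition. The Condition class and the sample values are hypothetical illustrations, not part of RMC.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Condition:
    """Hypothetical model of an event/rearm expression pair."""
    event: Callable[[float], bool]
    rearm: Callable[[float], bool]
    armed: bool = True  # True while the event expression is being watched

    def observe(self, value: float) -> str:
        """Return 'event', 'rearm', or '' for a single observation."""
        if self.armed and self.event(value):
            self.armed = False          # now watch the rearm expression
            return "event"
        if not self.armed and self.rearm(value):
            self.armed = True           # watch the event expression again
            return "rearm"
        return ""


# PctRealMemFree < 5 generates the event; PctRealMemFree > 10 rearms it.
real_memory_free = Condition(event=lambda v: v < 5, rearm=lambda v: v > 10)

samples = [12, 4, 3, 8, 11, 4]          # made-up PctRealMemFree observations
results = [real_memory_free.observe(v) for v in samples]
print(results)
```

The gap between the two thresholds (5% and 10%) keeps a value that hovers near the event threshold from generating a stream of repeated events.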

Paging Device

The program name of this resource class is IBM.PagingDevice. It can be used to monitor devices that are used by the operating system for paging. Each host may have one or more paging devices. On the operating system, the paging device is a logical volume.

Monitoring Amount of Free Paging Space for Device

These attributes can be monitored:

OpState
Monitors whether the current operational state of the paging device is online or offline.

PctFree
Represents the percentage of free paging space available for a specific paging space device.

Predefined Conditions for Monitoring Paging Space for a Specific Device

The following table shows the predefined conditions and examples of expressions that are available for monitoring paging space for a specific device:

Condition Name Event Expression Event Description Rearm Expression Rearm Description

Paging device state

OpState != 1

An event is generated when the paging space device goes offline.

OpState == 1

The event is rearmed when the device comes back online.

Paging device percent free

PctFree < 20

An event is generated when less than 20% of the paging device is free.

PctFree > 25

The event is rearmed when the amount of free paging space on the device exceeds 25%.

Processor

The program name of this resource class is IBM.Processor.

Because the system tracks the amount of time each processor spends idle, in wait state, and running in kernel and user modes, this resource class can be used to monitor these processor activities. At each clock tick, an array of counters is incremented to reflect the processor activity based on the state of the current running process. The processor user, kernel, wait, and idle resource properties provide the approximate percentage of time that a specific processor is currently spending in each state. Therefore, the sum of these properties is 100 at any given observation.

There are two protection modes that processes run in, kernel (or system) level and user level. Processes executing in kernel mode run with kernel privileges and have access to kernel data. These processes include kernel processes (kprocs), and services (such as system calls and device drivers).

Processes running in user mode are normal applications with user level privileges and run in their own unique process space. When a user level process invokes a kernel service, for example, by making a system call, a mode switch occurs that causes the process to run in kernel mode while the service is executing.

When the current running process makes a request that cannot be immediately satisfied, such as an I/O operation, the process is put into wait state.
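
As a rough illustration of how per-state counters can yield percentages that sum to 100, the following Python sketch compares two snapshots of the counters. The function name, the dictionary layout, and the counter values are assumptions made for this example; the actual kernel data structures are not described here.

```python
def state_percentages(prev, curr):
    """Convert two snapshots of per-state tick counters into percentages."""
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total == 0:
        # No clock ticks elapsed between the snapshots.
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


# Hypothetical counter values for one processor at two observations.
prev = {"user": 1000, "kernel": 400, "wait": 100, "idle": 500}
curr = {"user": 1060, "kernel": 420, "wait": 110, "idle": 510}

pct = state_percentages(prev, curr)
print(pct)
print(sum(pct.values()))
```

Because every tick is charged to exactly one state, the four percentages always account for the whole interval.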

Monitoring Utilization of a Single Processor

The following properties can be monitored:

OpState
Monitors whether the current operational state of the processor is online or offline.

PctTimeIdle
Represents the percentage of time the processor is in the idle state.

PctTimeKernel
Represents the percentage of time the processor is running in kernel mode.

PctTimeUser
Represents the percentage of time the processor is running in user mode.

PctTimeWait
Represents the percentage of time the processor is in wait state.

Predefined Conditions for Monitoring a Processor

This resource class represents the characteristics of the processors within a host. There is one instance of this resource for each processor installed in a host regardless of whether it is active or not. The following table shows the predefined conditions and examples of expressions that are available for monitoring a processor:

Condition Name Event Expression Event Description Rearm Expression Rearm Description

Processor state

OpState != 1

An event is generated when the processor goes offline.

OpState == 1

The event is rearmed when the processor returns online.

Processor idle time

(PctTimeIdle >= 80) &&

(PctTimeIdle @P >= 80)

An event is generated each time the processor is idle at least 80% of the time for two consecutive observations.

(PctTimeIdle < 50) &&

(PctTimeIdle @P < 50)

The event is rearmed when the idle time for the processor is below 50% for two consecutive observations.

Processor wait time

(PctTimeWait >= 50) &&

(PctTimeWait @P >= 50)

An event is generated when the average time the processor is in wait state is at least 50% for two consecutive observations.

(PctTimeWait < 30) &&

(PctTimeWait @P < 30)

The event is rearmed when the processor is in wait state less than 30% of the time for two consecutive observations.

Processor kernel time

(PctTimeKernel >= 70) &&

(PctTimeKernel @P >= 70)

An event is generated when the processor is in kernel mode at least 70% of the time for two consecutive observations.

(PctTimeKernel < 20) &&

(PctTimeKernel @P < 20)

The event is rearmed when the kernel mode time for the processor is below 20% for two consecutive observations.

Processor user time

(PctTimeUser >= 80) &&

(PctTimeUser @P >= 80)

An event is generated when the processor is in user mode at least 80% of the time for two consecutive observations.

(PctTimeUser < 50) &&

(PctTimeUser @P < 50)

The event is rearmed when the user mode time for the processor is below 50% for two consecutive observations.
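
Several expressions in the table above use @P, which refers to the previous observation of the same property, to require two consecutive observations. The following Python sketch evaluates the event expression (PctTimeIdle >= 80) && (PctTimeIdle @P >= 80) over a made-up series of observations. It shows only where the expression is true and deliberately ignores the rearm expression; the function name is hypothetical.

```python
def expression_true_at(observations, threshold=80):
    """Return the indexes where both the current value and the
    previous value (@P) are at or above the threshold."""
    hits = []
    for i in range(1, len(observations)):
        current, previous = observations[i], observations[i - 1]
        if current >= threshold and previous >= threshold:
            hits.append(i)
    return hits


idle = [85, 60, 82, 90, 95, 40]      # made-up PctTimeIdle observations
print(expression_true_at(idle))
```

The first observation can never satisfy the expression because no previous value exists yet, and a single spike (the 85 at the start) is ignored.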

Physical Volume

The program name of this resource class is IBM.PhysicalVolume. After a disk is added to the system, it must first be designated as a physical volume before it can be added to a volume group and used to contain a file system or paging space. A physical volume has certain configuration and identification information written on it. When a disk becomes a physical volume, it is divided into 512-byte physical blocks. Physical volumes have a unique name (typically hdiskx, where x is a unique number on the system), which is permanently associated with the disk until it is undefined.

Monitoring Physical Disks

These properties, which reflect the basic performance of a physical disk, can be monitored:

PctBusy
Average percentage of time the disk is busy from one observation of the value to the next.

RdBlkRate
Average rate at which blocks are read from disk. The rate is calculated as the difference in total blocks read from the disk between two observations, divided by the time between observations.

WrBlkRate
Average rate at which blocks are written to disk. The rate is calculated as the difference in total blocks written to the disk between two consecutive observations, divided by the time between observations.

XferRate
Average rate of transfers per second that were issued to the physical disk. A transfer is an I/O request to the physical disk. Multiple logical requests can be combined into a single I/O request to the disk. A transfer is of indeterminate size. The rate is calculated as the difference in total transfers between two consecutive observations, divided by the time between observations.
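
The rate calculation described for RdBlkRate, WrBlkRate, and XferRate can be sketched as follows. The function name and the sample counter values are hypothetical; only the formula (difference in totals divided by the time between observations) comes from the descriptions above.

```python
def rate_per_second(prev_total, curr_total, prev_time, curr_time):
    """Difference in a running total divided by the elapsed seconds."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("observations must be in increasing time order")
    return (curr_total - prev_total) / elapsed


# Made-up totals of blocks read, observed 60 seconds apart.
rd_blk_rate = rate_per_second(120_000, 126_000, 100.0, 160.0)
bytes_per_second = rd_blk_rate * 512   # physical blocks are 512 bytes

print(rd_blk_rate, bytes_per_second)
```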

Predefined Conditions for Monitoring Physical Disks

Each instance of this resource class represents a physical volume that has been defined to the system. All resources are monitored. The following table shows the predefined conditions and examples of expressions that are available for monitoring physical disks:

Condition Name Event Expression Event Description Rearm Expression Rearm Description

Disk percent busy

(PctBusy >= 90) && (PctBusy@P >= 90)

An event is generated when the disk has been busy at least 90% of the time for two consecutive observations.

PctBusy < 80

The event is rearmed when the value decreases below 80%.

Disk read rate

RdBlkRate < 50

An event is generated when the rate per second of 512-byte blocks read from the disk is less than 50.

RdBlkRate > 100

The event is rearmed when the rate exceeds 100.

Disk write rate

WrBlkRate < 50

An event is generated when the rate per second of 512-byte blocks written to disk is less than 50.

WrBlkRate > 100

The event is rearmed when the rate exceeds 100.

Disk transfer rate

(XferRate > XferRate@P) &&

((XferRate - XferRate@P)

> (XferRate@P * 0.5))

An event is generated each time the rate of transfers to the disk increases by more than 50% from the previous observation.

None

None
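
The "Disk transfer rate" expression compares the current value with the previous one rather than with a fixed threshold. A minimal Python sketch of the same comparison follows; the function name and sample values are hypothetical.

```python
def transfer_rate_jumped(current, previous):
    """Mirror of (XferRate > XferRate@P) &&
    ((XferRate - XferRate@P) > (XferRate@P * 0.5))."""
    return current > previous and (current - previous) > previous * 0.5


print(transfer_rate_jumped(160, 100))   # increase of 60 is more than 50
print(transfer_rate_jumped(150, 100))   # increase of exactly 50 is not enough
print(transfer_rate_jumped(90, 100))    # the rate decreased
```

Because this condition has no rearm expression, the event expression alone determines when an event is generated.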

Adapters

The following adapters are supported, each by its own resource class:

ATM Device (IBM.ATMDevice)
All ATM adapters installed in a node are externalized through this resource manager. See ATM Device for more details.

Ethernet Device (IBM.EthernetDevice)
All Ethernet adapters installed in a node are externalized through this resource manager. See Ethernet Device for more details.

FDDI Device (IBM.FDDIDevice)
All FDDI adapters installed in a node are externalized through this resource manager. See FDDI Device for more details.

Token-Ring Device (IBM.TokenRingDevice)
All Token-Ring adapters installed in a node are externalized through this resource manager. See Token-Ring Device for more details.

See Ethernet Device for details on what can be monitored for an adapter. The other adapters have the same types of attributes. Only the adapter name is different.

ATM Device

The program name of this resource class is IBM.ATMDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "ATM Device." See the description of Ethernet Device for details that also apply to this device.

Ethernet Device

The program name of this resource class is IBM.EthernetDevice. This resource class allows attributes of all Ethernet adapters that are installed in a system to be monitored. The network interfaces that may be defined on the adapters are not represented.

A network adapter card is the hardware that is physically attached to the network cabling. It is responsible for receiving and transmitting data at the physical level. The network adapter card is controlled by the network adapter device driver. A machine must have one network adapter card (or connection) for each network (not network type) to which it connects. For instance, if a host attaches to two Token-Ring networks, it must have two network adapter cards. When a new network adapter is physically installed in the system, the operating system assigns it a logical name. Some examples are tok0 for a Token-Ring adapter, ent0 for an Ethernet adapter, and atm0 for an ATM adapter. The trailing number makes the logical name unique. For example, a second Token-Ring adapter would have the logical name tok1. The lsdev command can be used to display information about network adapters.

Messages received by a LAN adapter, referred to as frames, are encapsulated within destination, header, and trailer information added by the various network protocol layers. A counter, maintained for each adapter, tracks the number of frame-receive errors at the adapter device level that caused unsuccessful reception due to hardware or network errors. This counter is the raw value for RecErrorRate.

When frames are received by an adapter, they are transferred from the adapter into a device-managed receive queue. The number of packets accepted but dropped by the device driver level for any reason (for example, queue buffer shortage) is tracked by a counter, which provides the raw value of the RecDropRate property.

Messages and data sent by an application to a LAN adapter for transmission are broken up into packets and appended with address, header, and trailer information by the various network protocol layers. At the adapter device driver level, packets are placed in buffers on a transmit queue. The packets are appended with a network interface header, then transmitted as frames by the adapter device.

Counters are maintained for each adapter to track the number of transmission errors at the device level (due to hardware or network errors), the number of transmission queue overflows at the device driver level (due to buffer shortage), and the number of packets dropped (packets not passed to the device by the driver for any reason). These counters provide the raw values for XmitErrorRate, XmitOverflowRate, and XmitDropRate, respectively.

Monitoring Device Performance

The following properties can be monitored:

RecErrorRate
Represents the number of receive errors per second that occurred at the adapter level.

RecDropRate
Represents the number of receive packets per second that were dropped by the adapter device driver.

XmitDropRate
Represents the number of outbound packets per second that were dropped by the adapter device driver.

XmitErrorRate
Represents the number of transmit errors per second that were detected at the adapter level.

XmitOverflowRate
Represents the number of transmit queue overflows per second that were detected by the adapter.

Predefined Conditions for Monitoring Device Performance

This resource class externalizes the characteristics of all Ethernet adapters that are installed in a system. Note that this class does not represent the network interfaces that may be defined on the adapters. It represents the actual adapters (for example, ent0).

In the first release, the characteristics are limited to a small set that is compatible with what is available through Event Management's aixos resource monitor.

The following table shows the predefined conditions and examples of expressions that are available for monitoring device performance. All resources are monitored.

Condition Name Event Expression Event Description Rearm Expression Rearm Description

Ethernet receive error rate

RecErrorRate > 1

An event is generated when the number of receive errors exceeds 1 per second.

(RecErrorRate == 0) &&

(RecErrorRate@P == 0)

The event is rearmed when the receive error rate is 0 for two consecutive observations.

Ethernet receive drop rate

RecDropRate > 10

An event is generated when the number of receive packets dropped exceeds 10 per second.

RecDropRate < 5

The event is rearmed when the number of dropped packets goes below 5 per second.

Ethernet transmit drop rate

XmitDropRate > 10

An event is generated when the number of outbound packets dropped exceeds 10 per second.

XmitDropRate < 5

The event is rearmed when the number of dropped packets goes below 5 per second.

Ethernet transmit error rate

XmitErrorRate > 1

An event is generated when the number of transmit errors exceeds 1 per second.

(XmitErrorRate == 0) &&

(XmitErrorRate@P == 0)

The event is rearmed when the transmit error rate is 0 for two consecutive observations.

Ethernet transmit overflow rate

XmitOverflowRate > 10

An event is generated when the number of transmit queue overflows exceeds 10 per second.

XmitOverflowRate < 2

The event is rearmed when the number of overflows goes below 2 per second.

FDDI Device

The program name of this resource class is IBM.FDDIDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "FDDI Device." See the description of Ethernet Device for details that also apply to this device.

Token-Ring Device

The program name of this class is IBM.TokenRingDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "Token-Ring Device." See the description of Ethernet Device for details that also apply to this device.

Program

The program name of this resource class is IBM.Program. This resource class can monitor a set of processes that are running a specific program or command and whose attributes match a filter criterion. The filter criterion can include the real or effective user name of the process, the arguments that the process was started with, and so on. The primary aspect of a program resource that can be monitored is the set of processes that meet the program definition. A client can be informed when processes with the properties that meet the program definition are initiated and when they are terminated. This resource class typically is used to detect when a required subsystem fails so that recovery actions can be performed, or the administrator can be notified, or both.

Program Definition

A program definition requires the program name and the user name of the owner of the program. The program should be identified by user name in addition to program name to avoid confusion when two or more programs have the same name. These persistent attributes are defined as follows:

ProgramName
Identifies the name of the command or program to be monitored. The program name is the base name of the file containing the program. This name is displayed by the ps command when the -l flag or -o comm is specified. Note that the program name displayed by ps when the -f flag or -o args is specified may not be the same as the base name of the file containing the program.

Filter
Specifies a filter that selects a subset of all processes executing the program identified by the persistent attribute ProgramName. For example, the filter may limit the process set to those processes that are running ProgramName under the user name foo.
Note:
Process IDs are not used to specify programs because they are transient and have no prior correlation with the program being run, nor can the restart of a program be detected because there is no way to anticipate the process ID that would be assigned to the restarted application.

For a process to match a program definition and thus be considered to be running the program, its executable name must match the ProgramName persistent attribute value. In addition, the expression defined by the Filter persistent attribute must evaluate to TRUE when applied to the properties of the process. The Filter attribute is a string that consists of the names of various properties of a process, comparison operators, and literal values. For example, a value of user==greg restricts the process set to those processes that run ProgramName under the userid greg. The syntax for the Filter value is the same as for a selection string.

For more information on selection strings, see Using Expressions.

Processes must have a minimum duration (approximately 15 seconds) to be monitored by the IBM.Program resource class. (If a program runs for only a few seconds, all processes that run the program may not be detected.)
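
The two-part matching rule described above (executable name plus filter) can be sketched in Python as follows. The Process class, the matches function, and the sample processes are hypothetical, and a Python predicate stands in for the Filter string; the sketch does not parse the actual filter syntax.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical stand-in for the process properties a filter can test."""
    pid: int
    comm: str   # base name of the executable, as shown by ps -o comm
    user: str


def matches(process, program_name, filter_fn=lambda p: True):
    """A process matches when its executable name equals ProgramName
    and the filter evaluates to true for its properties."""
    return process.comm == program_name and filter_fn(process)


procs = [
    Process(101, "biod", "root"),
    Process(102, "biod", "greg"),
    Process(103, "inetd", "greg"),
]

# Equivalent in spirit to ProgramName "biod" with Filter user==greg.
greg_biod = [p.pid for p in procs if matches(p, "biod", lambda p: p.user == "greg")]
print(greg_biod)
```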

This property can be monitored: Processes

These elements of Processes can be monitored:

CurPidCount
Represents the number of processes that currently match the program definition and thus are considered to be running the program.

PrevPidCount
Represents the number of processes that matched the program definition at the last state change (previous value of CurPidCount).

CurrentList
Contains a list of IDs for the processes that currently match the program definition and thus are considered to be running the program.

ChangeList
Contains a list of IDs for the processes that were added to or removed from the CurrentList since the last state change. Whether the list represents additions or deletions can be determined by comparing CurPidCount and PrevPidCount. If CurPidCount is greater, this list contains additions; otherwise, it contains deletions. Additions and deletions are not combined in the same state change.
For example, assume the six processes shown in the following ps output are running the biod program on node 1:
ps -e -o "ruser,pid,ppid,comm" | grep biod

root     7786  8040 biod
root     8040  5624 biod
root     8300  8040 biod
root     8558  8040 biod
root     8816  8040 biod
root     9074  8040 biod

To be informed when the number of processes running the specified program changes, you can define this event expression:

Processes.CurPidCount!=Processes.PrevPidCount

To be informed when no processes are running the specified program, you can define this event expression:

Processes.CurPidCount==0
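
The relationship between CurPidCount, PrevPidCount, CurrentList, and ChangeList can be illustrated with a small Python sketch that derives the four elements from two sets of process IDs. The state_change function is a hypothetical model, not an RMC interface; the process IDs are taken from the biod example above.

```python
def state_change(previous_pids, current_pids):
    """Model the elements of Processes after one state change."""
    prev, cur = set(previous_pids), set(current_pids)
    added, removed = cur - prev, prev - cur
    if added and removed:
        # Additions and deletions are not combined in one state change.
        raise ValueError("report additions and deletions separately")
    return {
        "CurPidCount": len(cur),
        "PrevPidCount": len(prev),
        "CurrentList": sorted(cur),
        "ChangeList": sorted(added or removed),
    }


before = [7786, 8040, 8300, 8558, 8816, 9074]
after = [7786, 8040, 8300, 8558, 9074]          # process 8816 has ended

processes = state_change(before, after)
is_deletion = processes["CurPidCount"] < processes["PrevPidCount"]
print(processes["ChangeList"], is_deletion)
```

In this example the event expression Processes.CurPidCount!=Processes.PrevPidCount is true (5 versus 6), and the lower current count indicates that ChangeList holds deletions.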

Predefined Conditions for Monitoring Programs

This resource class is typically used to detect when a required subsystem fails so that some recovery action can be performed or an administrator can be notified. The following table shows the predefined conditions and examples of expressions that are available for monitoring programs.

Condition Name Event Expression Event Description Rearm Expression Rearm Description Monitored Resources

sendmail daemon state

Processes.CurPidCount <= 0

An event is generated whenever the sendmail daemon is not running.

Processes.CurPidCount > 0

The event is rearmed when the sendmail daemon is running.

sendmail

inetd daemon state

Processes.CurPidCount <= 0

An event is generated whenever the inetd daemon is not running.

Processes.CurPidCount > 0

The event is rearmed when the inetd daemon is running.

inetd

Predefined Responses

The following predefined responses are shipped as templates or as starting points for monitoring.

Use the Web-based System Manager online help or the ERRM commands (particularly, the chresponse command) to customize these predefined responses.

See Using Expressions for a summary of the data types and operators that you can use in selection strings for a customized response.

Response Name Action Command

Critical notification

Name: log critical event

  • Log an entry to /tmp/criticalEvents whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/logevent /tmp/criticalEvents
  • When in effect: All day, every day.

Name: e-mail root

  • Send an e-mail to root whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/notifyevent root
  • When in effect: All day, every day.

Name: broadcast message

  • Broadcast the event or the rearm event to all logged-in users.

  • /usr/sbin/rsct/bin/wallevent
  • When in effect: All day, every day.

Warning notification

Name: log warning event

  • Log an entry to /tmp/warningEvents whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/logevent /tmp/warningEvents
  • When in effect: All day, every day.

Name: e-mail root

  • Send an e-mail to root whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/notifyevent root
  • When in effect: All day, every day.

Informational notification

Name: log info event

  • Log an entry to /tmp/infoEvents whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/logevent /tmp/infoEvents
  • When in effect: All day, every day.

Name: e-mail root

  • Send e-mail to root when an event or a rearm event occurs during working hours.

  • /usr/sbin/rsct/bin/notifyevent root
  • When in effect: 8AM-5PM Monday to Friday.

Log event anytime

Name: log event

  • Log an entry to /tmp/systemEvents whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/logevent /tmp/systemEvents
  • When in effect: All day, every day.

Send e-mail to root anytime

Name: e-mail root

  • Send an e-mail to root whenever an event or a rearm event occurs.

  • /usr/sbin/rsct/bin/notifyevent root
  • When in effect: All day, every day.

Send e-mail to root off-shift

Name: e-mail root

  • Send an e-mail to root whenever an event or a rearm event occurs during non-working hours.

  • /usr/sbin/rsct/bin/notifyevent root
  • When in effect: 5PM-12AM Monday to Friday; 12AM-8AM Monday to Friday; all day Saturday and Sunday.

Broadcast event anytime

Name: broadcast message

  • Broadcast the event or the rearm event to all users who are logged in to the host.

  • /usr/sbin/rsct/bin/wallevent
  • When in effect: All day, every day.

Display in Events plug-in

Display an event in the Events plug-in. Available from Web-based System Manager only. This is the only response that can be used by a non-root user.
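
The "When in effect" windows listed for the predefined responses correspond to the day and time options of the mkresponse command. The following is a minimal sketch that only prints candidate commands rather than running them. It assumes the conventions described in the mkresponse man page (days numbered 1 for Sunday through 7 for Saturday, times written as hhmm-hhmm, and one time range per day range); verify these against the man page for your RSCT level. The response names and the print_mkresponse helper are hypothetical and chosen so they do not collide with the predefined responses.

```shell
#!/bin/sh
# Action script taken from the predefined "e-mail root" responses above.
ACTION_SCRIPT="/usr/sbin/rsct/bin/notifyevent root"

# Hypothetical helper: print (do not run) a mkresponse command.
#   $1 = response name, $2 = days of week, $3 = time ranges
# -e b requests the action for both events and rearm events.
print_mkresponse() {
    printf 'mkresponse -n "E-mail root" -d %s -t %s -s "%s" -e b "%s"\n' \
        "$2" "$3" "$ACTION_SCRIPT" "$1"
}

# Working hours: 8AM-5PM, Monday (2) to Friday (6).
print_mkresponse "Example e-mail root working hours" "2-6" "0800-1700"

# Off-shift: weekday evenings, weekday early mornings, and all day on
# Sunday (1) and Saturday (7). Each day range pairs with one time range.
print_mkresponse "Example e-mail root off-shift" \
    "2-6,2-6,1+7" "1700-2400,0000-0800,0000-2400"
```

The off-shift definition needs three day and time pairs because a single range cannot wrap past midnight, which is why the predefined response lists three separate windows.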

Predefined Commands, Scripts, Utilities, and Files

As an alternative to the Monitoring GUI, you can use the following scripts, utilities, commands, and files to control monitoring on your system. See the man pages or the AIX Commands Reference for detailed usage information.

ERRM Commands

chcondition
Changes any of the attributes of a defined condition.

lscondition
Lists information about one or more conditions.

mkcondition
Creates a new condition definition that can be monitored.

rmcondition
Removes a condition.

chresponse
Adds or deletes the actions of a response or renames a response.

lsresponse
Lists information about one or more responses.

mkresponse
Creates a new response definition with one action.

rmresponse
Removes a response.

lscondresp
Lists information about a condition and its linked responses, if any.

mkcondresp
Creates a link between a condition and one or more responses.

rmcondresp
Deletes a link between a condition and one or more responses.

startcondresp
Starts monitoring a condition that has one or more linked responses.

stopcondresp
Stops monitoring a condition that has one or more linked responses.
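
The ERRM commands above are typically used in sequence: define a condition, define a response, link the two, and start monitoring. The following is a minimal sketch of that sequence, not a definitive procedure. The condition and response names, the resource class (IBM.FileSystem), the attribute (PercentTotUsed), the selection string, and the thresholds are illustrative assumptions; check the man pages for the flags supported at your RSCT level. The run helper is hypothetical: it prints each command instead of running it when the command is not installed. On a system where RSCT is installed the commands are run for real, and commands that cause changes must be run by root.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of RSCT): run the command if it is
# installed; otherwise print it with each argument in brackets.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        printf 'DRY-RUN:'
        for arg in "$@"; do printf ' [%s]' "$arg"; done
        printf '\n'
    fi
}

COND="Example /tmp space used"   # assumed condition name
RESP="Example e-mail root"       # assumed response name

# 1. Define a condition: event above 90% used, rearm below 85% used.
run mkcondition -r IBM.FileSystem \
    -e "PercentTotUsed > 90" -E "PercentTotUsed < 85" \
    -s 'Name == "/tmp"' "$COND"

# 2. Define a response with one action, run for events and rearm events.
run mkresponse -n "E-mail root" \
    -s "/usr/sbin/rsct/bin/notifyevent root" -e b "$RESP"

# 3. Link the condition to the response (monitoring is not started yet).
run mkcondresp "$COND" "$RESP"

# 4. Start monitoring the condition with its linked response.
run startcondresp "$COND" "$RESP"

# 5. List the condition and its linked responses.
run lscondresp "$COND"
```

To undo the sketch, the same names can be passed to stopcondresp, rmcondresp, rmresponse, and rmcondition, in that order.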

RMC Commands

chrsrc
Changes the persistent attribute values of a resource or resource class.

lsactdef
Lists (displays) action definitions of a resource or resource class.

lsrsrc
Lists (displays) resources or a resource class.

lsrsrcdef
Lists a resource or resource class definition.

mkrsrc
Defines a new resource.

refrsrc
Refreshes the resources within the specified resource class.

rmrsrc
Removes a defined resource.
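
The read-only RMC commands above are a convenient way to explore a resource class before writing a condition against it. The following is a minimal sketch under stated assumptions: IBM.FileSystem is used as an example class name, and the run helper is hypothetical, printing each command instead of running it when the command is not installed.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of RSCT): run the command if it is
# installed; otherwise print it.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "DRY-RUN: $*"
    fi
}

CLASS="IBM.FileSystem"   # assumed example resource class

# Show the definition of the resource class.
run lsrsrcdef "$CLASS"

# List the resources of the class with their dynamic attributes
# (-A d selects dynamic attributes).
run lsrsrc -A d "$CLASS"

# List the action definitions of the class.
run lsactdef "$CLASS"
```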

Scripts and Utilities

ctsnap
Gathers configuration, log, and trace information for the Reliable Scalable Cluster Technology (RSCT) product.

logevent
Logs event information generated by the Event Response resource manager to a specified log file.

lsaudrec
Lists records from the audit log.

notifyevent
E-mails event information generated by the Event Response resource manager to a specified user ID.

rmaudrec
Removes records from the audit log.

rmcctrl
Manages the Resource Monitoring and Control (RMC) subsystem.

wallevent
Broadcasts an event or a rearm event to all users who are logged in.

Files

Resource Data Input File
Defines resources and persistent attribute values of a resource or resource class.

rmccli General Information File
Contains information global to the RMC command line interface.

