System Checks


cd ${OE_AGENT_HOME}/checks_enabled
ln -s ../checks_available/{,,,,} ./


For most of installations, our defaults works well, but you can edit conf/system.ini if you need fine tuning. For each of system check you van enable or disable static alerts and set appropriate thresholds.

[Load Average]
static_enabled: True
high: 95
severe : 100

[Disk Stats]
static_enabled: True
high : 90
severe : 95

[Memory Stats]
static_enabled: True
high : 90
severe : 95

[CPU Stats]
static_enabled: True
percore_stats: False
high: 90
severe: 95

[Network Stats]
localhost = False
rated = True


${OE_AGENT_HOME}/ restart

System CPU statistics are collected via module: It reads /proc/stat file for CPU related information. The Check does not require any additional dependencies, and it should provide following metrics:


Name Description Type Unit
cpu_idle percent of free CPU resources gauge Percent
cpu_iowait CPU percent spent on waiting for I/O operation gauge Percent
cpu_irq CPU percent spent on handling hardware interrupts gauge Percent
cpu_load Total CPU/Core load percent gauge Percent
cpu_nice CPU percent of processing user level processes with positive nice gauge Percent
cpu_softirq CPU percent spent on handling soft interrupts gauge Percent
cpu_system CPU percent of processing system level processes gauge Percent
cpu_user CPU percent of processing user level processes gauge Percent

Collected via module, it takes memory related information from /proc/meminfo file, and provides following metrics:


Name Description Type Unit
mem_active Memory that is being used by a particular process gauge Bytes
mem_available Not active / free memory gauge Bytes
mem_buffers The total amount of memory used for critical system buffers gauge Bytes
mem_cached Amount of cached data. Free’d if there is not enough free memory in the system. gauge Bytes
mem_inactive Memory that was allocated to a process that is no longer running gauge Bytes
mem_swapcached Amount of swapped memory gauge Bytes
mem_total Total amount of memory gauge Bytes
mem_used_percent Aggregated from metrics above, total memory usage percentage gauge Percent

Collected via module, this check uses several resources, to provide statistics about disk IO and Space usage. Read/write statistics are taken from /sys/block/{DISK_NAME}/stat files, also we use Linux df command, to get information about space usage. IO statistics are taken from /proc/diskstats. For each disk we take following metrics:


Name Description Type Unit
drive_bytes_available Amounts of unused bytes gauge Bytes
drive_bytes_used Amounts of used bytes gauge Bytes
drive_io_percent_used Percent of used IO resources per second of disk drive gauge Percent
drive_percent_used Percentage of disk space usage gauge Percent
drive_reads Read operations per second performed on disk drive rate Bytes
drive_writes Write operations per second performed on disk drive rate Bytes

Collected via module, It collects metrics about all installed interfaces by reading /sys/class/net/{NIC}/statistics/rx_bytes file.


Name Description Type Unit
bytes_rx Amounts of received bytes rate Bytes
bytes_tx Amounts of sent bytes rate Bytes
IP Conntrack

Enable this module only if you use connection tracking. This check makes sense, if you have router like system, or if by some reason nf_conntrack kernel module is loaded. It reads /proc/sys/net/ipv4/netfilter/ip_conntrack_max|ip_conntrack_count files and provides :


Name Description Type Unit
conntrack_max Maximum configured Conntack value gauge None
conntrack_cur Current IP Conntack value gauge None
Load Average

This is one of the most important metrics in Linux (Maybe even the most). System load average shows ammount of processes, waiting in queue for CPU. It is calculater by 1,5 and 15 minute averages.In most of live systems it must have a value, which is less than total amount of CPUs detected by system. For example of your Server is equiped with 2x Quad cores CPUs, with enabled hyper trading Linux will see 16 CPUs, so Load Average should be below 16. Our check provides:


Name Description Type Unit
sys_load_1 System load average for last minute gauge None
sys_load_5 System load average for last 5 minutes gauge None
sys_load_15 System load average for last 15 minutes gauge None

As regular checks and also triggers special WARNING, when value of sys_load_1 is more that 90% of amount of CPU cores and ERROR, when its equal or more that 100%. This behavior can be changed by editing and changing :

warn_level = 90
crit_level = 100

to desired values. However these values are quite reasonable so, use it without modifications, if you are not for 100% sure that you need to change it

TCP Connections

This check provides status of TCP connections to systems. It parses /proc/net/tcp and provides following metrics.


Name Description Type Unit
tcp_close TCP connections with CLOSED state gauge None
tcp_close_wait The remote side has been shut down and is now waiting for the socket to close gauge None
tcp_closing TCP connections in closing state gauge None
tcp_established The socket has a connection established gauge None
tcp_fin_wait1 The socket is closed, and the connection is now shutting down. gauge None
tcp_last_ack TCP connections in last ack state gauge None
tcp_listen TCP listening sockets gauge None
tcp_max_states TCP connections in max state gauge None
tcp_new_syn_recv TCP connections in new syn recv state gauge None
tcp_syn_recv TCP connections in syn recvstate gauge None
tcp_syn_sent TCP connections in syn recv state gauge None
tcp_time_wait TCP connections in time wait state gauge None

On some very heavy loaded systems, this check may become expensive, please make sure its suits your system resources before enabling it on systems with 20k+ TCP ESTABLISHED connections.

BTRFS check

BTRFS check can be very useful in combinations with regular Drive IO checks on systems which uses BTRFS file system. Its monitors BTRFS volumes and checks for volume errors. It also contains special check which will send manually defined ERROR and WARNING messages if values of checked parameters are above Zero. Manual Error handling can be enabled or disable by setting up enable_special variable at the top of script. Its accepts True or False parameters, defaults is True.