Monitoring iRODS with Nagios
September 26, 2016
Justin James
jjames@renci.org
iRODS Application Engineer
Introduction
Our Use Cases
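We will use Nagios to monitor iRODS servers in the following ways:
- Ping the resource servers and mark their resources up or down in iRODS
- Monitor resource use against a configured maximum
- Monitor the connection count on each server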
Sample Setup
The configuration files in this slide deck are based on the following iRODS grid setup.
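One iCAT-enabled server and two resource servers:
- ICAT.example.org
- Resource1.example.org
- Resource2.example.org
Two iRODS resources:
- Resource1Resource (on Resource1.example.org)
- Resource2Resource (on Resource2.example.org)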
iping Command
We will build our own custom iping command to determine whether a resource server is up or down.
The iping command makes an rcConnect() call and determines the resource server's status from the return value of that call.
The iping source code can be found at:
https://github.com/irods/contrib/blob/master/iping/src/iping.cpp
Building and Installing the iping Command
Perform the following steps to build and install the iping command:
$ export PATH=/opt/irods-externals/cmake3.5.2-0/bin:$PATH
$ cd ~
$ git clone https://github.com/irods/contrib
$ mkdir ipingbuild
$ cd ipingbuild
$ cmake ~/contrib/iping
$ make package
$ sudo dpkg -i irods-iping_4.2.0~trusty_amd64.deb
Testing the iping Command
The iping command has now been installed in the system.
Test the iping command:
$ iping -h Resource1.example.org
OK : connection to iRODS server successful
$ iping -h Resource2.example.org
OK : connection to iRODS server successful
$ echo $?
0
Bring down Resource1.example.org and test iping again:
$ iping -h Resource1.example.org -p 1247
ERROR: _rcConnect: connectToRhost error, server on Resource1.example.org:1247 is probably down status = -305111 USER_SOCK_CONNECT_ERR, Connection refused
$ echo $?
2
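The two exit codes seen above (0 and 2) line up with the standard Nagios plugin convention, in which 0 means OK, 1 WARNING, 2 CRITICAL, and 3 UNKNOWN. A minimal sketch of that mapping follows; the nagios_state_name helper is hypothetical and is not part of iping or Nagios.

```shell
#!/bin/bash
# Map a plugin exit code to the Nagios state name it represents.
# nagios_state_name is a hypothetical helper used only for illustration.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "INVALID" ;;   # outside the plugin convention
    esac
}

nagios_state_name 0   # exit code when the server was reachable
nagios_state_name 2   # exit code when the connection was refused
```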
Installing Nagios
Nagios and the Nagios plugins packages can be installed with the following command:
$ sudo apt install nagios3 nagios-nrpe-plugin
After installation, verify that Nagios is running by opening http://localhost/nagios3 and logging in as the nagiosadmin user with the password that was set during the Nagios installation.
Ping iRODS Resource
The first Nagios service that we will create is a ping service to determine the status of resource servers.
Ping iRODS Resource
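Steps to build and test the resource ping service:
- Allow the nagios user to log in to iRODS
- Add the Nagios configuration (hosts, host group, commands, service)
- Create the iping.sh and update_irods_resc_state.sh scripts
- Restart Nagios and test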
Ping iRODS Resource - Allowing nagios user login access
The local nagios system account must be able to connect to iRODS using an administrator account, because the event handler script runs iadmin to change resource status.
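One way to give the nagios account an iRODS login is to place a client environment file under its home directory. The sketch below is only an illustration: write_irods_env is a hypothetical helper, the zone name tempZone and the rods user are assumptions about the grid, and /var/lib/nagios is taken from the HOME setting used by the scripts in this deck.

```shell
#!/bin/bash
# Write a minimal iRODS client environment file under the given home directory.
# $1 = home directory, $2 = iRODS host, $3 = zone name, $4 = iRODS user name
# write_irods_env is a hypothetical helper, not an iRODS command.
write_irods_env() {
    local home_dir=$1 host=$2 zone=$3 user=$4
    mkdir -p "${home_dir}/.irods"
    cat > "${home_dir}/.irods/irods_environment.json" <<EOF
{
    "irods_host": "${host}",
    "irods_port": 1247,
    "irods_user_name": "${user}",
    "irods_zone_name": "${zone}"
}
EOF
}

# Example call (run as root; zone and user names are assumptions):
# write_irods_env /var/lib/nagios ICAT.example.org tempZone rods
```

After writing the file, make it owned by the nagios account and run iinit as that account (for example sudo -H -u nagios iinit, assuming its home directory is /var/lib/nagios) so that the password is stored for later checks.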
Ping iRODS Resource - Configuration
Create a directory to hold our configuration file and update nagios.cfg to look for configuration files in this directory:
$ sudo mkdir -p /etc/nagios3/irods
$ echo "cfg_dir=/etc/nagios3/irods" | sudo tee -a /etc/nagios3/nagios.cfg
Create /etc/nagios3/irods/irods.cfg:
define host {
name irods-server-template
use generic-host
check_interval 1
register 0
}
define host {
use irods-server-template
host_name Resource1
alias Resource1Resource
address Resource1.example.org
}
define host {
use irods-server-template
host_name Resource2
alias Resource2Resource
address Resource2.example.org
}
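For a grid with more resource servers, the host blocks above follow a regular pattern and could be generated rather than typed. A minimal sketch, assuming every server follows the ResourceN / ResourceNResource / ResourceN.example.org naming used in this deck; print_irods_host is a hypothetical helper.

```shell
#!/bin/bash
# Print one Nagios host definition that inherits from irods-server-template.
# $1 = host_name, $2 = address; the alias follows the "<name>Resource" pattern.
# print_irods_host is a hypothetical helper used only for illustration.
print_irods_host() {
    local name=$1 address=$2
    cat <<EOF
define host {
    use        irods-server-template
    host_name  ${name}
    alias      ${name}Resource
    address    ${address}
}
EOF
}

# Emit the two host blocks used in this deck.
for n in 1 2; do
    print_irods_host "Resource${n}" "Resource${n}.example.org"
done
```

The output can be appended to irods.cfg; the host group members line still has to list each new host_name.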
Ping iRODS Resource - Configuration
/etc/nagios3/irods/irods.cfg (cont)
define hostgroup{
hostgroup_name irods-resource-servers
alias iRODS Servers
members Resource1,Resource2
}
define command{
command_name iping-irods-server
command_line /usr/lib/nagios/plugins/iping.sh -h $HOSTADDRESS$ -p $ARG1$
}
define command {
command_name update-irods-resource-state
command_line /usr/lib/nagios/plugins/update_irods_resc_state.sh $HOSTADDRESS$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
define service {
use generic-service
hostgroups irods-resource-servers
service_description IPING
check_command iping-irods-server!1247
check_interval 1
event_handler update-irods-resource-state
}
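Ping iRODS Resource - Configuration
Explanation:
- The irods-server-template host template extends generic-host and sets check_interval to 1; register 0 marks it as a template rather than a real host.
- Resource1 and Resource2 are defined from the template. Each alias holds the name of the iRODS resource on that host.
- The irods-resource-servers host group contains both resource servers.
Ping iRODS Resource - Configuration
Explanation (cont):
- The iping-irods-server command runs iping.sh against the host address; $ARG1$ is the port (1247 in the service definition).
- The update-irods-resource-state command runs update_irods_resc_state.sh with the host address, service state, state type, and attempt number.
- The IPING service applies the check to every host in the host group and registers update-irods-resource-state as its event handler, so Nagios runs the script when the service state changes.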
Ping iRODS Resource - Scripts
Next, create the bash scripts iping.sh and update_irods_resc_state.sh.
Ping iRODS Resource - Scripts
/usr/lib/nagios/plugins/iping.sh
#!/bin/bash
export HOME=/var/lib/nagios
return=0
/usr/bin/iping "$@" 2>&1 || return=$?
if [ $return -gt 3 ]; then
exit 2
else
exit $return
fi
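The wrapper above passes exit codes 0 to 3 through unchanged and reports anything larger as CRITICAL. The same idea can be sketched in isolation as follows; clamp_exit_code and run_and_clamp are hypothetical helper names, and the sh -c commands merely stand in for iping.

```shell
#!/bin/bash
# Clamp an arbitrary exit code into the 0-3 range used by Nagios plugins.
# Anything above 3 is reported as CRITICAL (2), as in iping.sh.
clamp_exit_code() {
    local code=$1
    if [ "$code" -gt 3 ]; then
        echo 2
    else
        echo "$code"
    fi
}

# Run a command, capture its exit status without aborting, and clamp it.
run_and_clamp() {
    local status=0
    "$@" >/dev/null 2>&1 || status=$?
    clamp_exit_code "$status"
}

run_and_clamp true              # prints 0
run_and_clamp sh -c 'exit 2'    # prints 2
run_and_clamp sh -c 'exit 42'   # prints 2
```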
Ping iRODS Resource - Scripts
/usr/lib/nagios/plugins/update_irods_resc_state.sh
#!/bin/bash
export HOME=/var/lib/nagios
LOGFILE=/tmp/update_resc.log
echo update_irods_resc_state.sh "$@" >> $LOGFILE
HOST=$1
SERVICE_STATE=$2
SERVICE_STATE_TYPE=$3
SERVICE_ATTEMPT=$4
RESOURCES=$(iquest "%s" "select RESC_NAME where RESC_LOC = '$HOST'")
echo RESOURCES = $RESOURCES >> $LOGFILE
echo SERVICE_STATE = $SERVICE_STATE >> $LOGFILE
case "$SERVICE_STATE" in
OK)
for RESOURCE in $RESOURCES; do
iadmin modresc $RESOURCE status up
done
;;
WARNING)
;;
UNKNOWN)
;;
CRITICAL)
for RESOURCE in $RESOURCES; do
iadmin modresc $RESOURCE status down
done
;;
esac
exit 0
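The script above receives the state type and attempt number but acts on the service state alone, so a resource is marked down on the first failed check. A possible variation, shown here only as a sketch and not as the author's method, waits for a HARD CRITICAL state before marking a resource down. resource_action is a hypothetical helper that returns the decision as text.

```shell
#!/bin/bash
# Decide what to do with a resource from the Nagios service state and state type.
# Prints "up", "down", or "none". resource_action is a hypothetical helper.
resource_action() {
    local state=$1 state_type=$2
    case "$state" in
        OK)
            echo "up" ;;
        CRITICAL)
            # Only act once Nagios has exhausted its retries (HARD state).
            if [ "$state_type" = "HARD" ]; then
                echo "down"
            else
                echo "none"
            fi ;;
        *)
            echo "none" ;;
    esac
}

resource_action CRITICAL SOFT   # prints: none
resource_action CRITICAL HARD   # prints: down
resource_action OK HARD         # prints: up
```

Whether to wait for a HARD state is a trade-off: it avoids flapping on a single failed check but delays marking the resource down.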
Ping iRODS Resource - Testing
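Restart Nagios so that it picks up the new configuration:
$ sudo service nagios3 restart
With the iRODS server on Resource1.example.org stopped, the resource is marked down:
$ iadmin lr Resource1Resource | grep resc_status
resc_status: down
Once the server is running again, the resource is marked up:
$ iadmin lr Resource1Resource | grep resc_status
resc_status: up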
Ping iRODS Resource - Use in Replication Resc Hierarchy
$ iadmin mkresc RootResource replication
Creating resource:
Name: "RootResource"
Type: "replication"
Host: ""
Path: ""
Context: ""
$ iadmin addchildtoresc RootResource Resource1Resource
$ iadmin addchildtoresc RootResource Resource2Resource
$ ilsresc
demoResc
RootResource:replication
├── Resource1Resource
└── Resource2Resource
$ echo test > test.txt
$ iput -R RootResource test.txt
$ ils -l test.txt
rods 0 RootResource;Resource1Resource 5 2016-09-09.10:24 & test.txt
rods 1 RootResource;Resource2Resource 5 2016-09-09.10:24 & test.txt
Ping iRODS Resource - Use in Replication Resc Hierarchy
$ iget test.txt -
test
$ iget test.txt -
ERROR: getUtil: get error for - status = -305111 USER_SOCK_CONNECT_ERR, Connection refused
Monitoring Resource Use
The next Nagios service that we will create is a service to monitor the resource use.
Monitoring Resource Use - Update Resource Context
$ iadmin modresc Resource1Resource context "max_bytes=500"
$ iadmin modresc Resource2Resource context "max_bytes=500"
Monitoring Resource Use - Configuration
define command {
command_name check-resource-use
command_line /usr/lib/nagios/plugins/check_resource_use.sh $HOSTALIAS$ $ARG1$ $ARG2$
}
define service {
use generic-service
hostgroups irods-resource-servers
service_description check-resource-use
check_command check-resource-use!90.0!95.0
check_interval 1
}
Monitoring Resource Use - Configuration
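Configuration Explanation:
- $HOSTALIAS$ passes the host alias, which holds the iRODS resource name (for example Resource1Resource).
- 90.0 and 95.0 are the warning and critical thresholds, expressed as a percentage of max_bytes.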
Monitoring Resource Use - Script
Create the bash script check_resource_use.sh.
Monitoring Resource Use - Script
/usr/lib/nagios/plugins/check_resource_use.sh
#!/bin/bash
export HOME=/var/lib/nagios
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
if [ $# -lt 3 ]; then
echo "Usage: check_resource_use.sh <resource_name> <warning_level> <critical_level>"
exit $STATE_UNKNOWN
fi
max_bytes=0
percent_used=0
warning_level=$2
critical_level=$3
if ! [[ $warning_level =~ ^[0-9]+([.][0-9]+)?$ ]]; then
echo "Warning level provided is not a valid number."
exit $STATE_UNKNOWN
fi
if ! [[ $critical_level =~ ^[0-9]+([.][0-9]+)?$ ]]; then
echo "Critical level provided is not a valid number."
exit $STATE_UNKNOWN
fi
Monitoring Resource Use - Script
/usr/lib/nagios/plugins/check_resource_use.sh (cont)
context=$(iquest "%s" "select RESC_CONTEXT where RESC_NAME = '$1'")
for i in $(echo $context | tr ";" "\n"); do
[[ $i =~ "max_bytes=" ]] && max_bytes=$(echo $i | cut -b11-)
done
used=$(iquest "%s" "select sum(DATA_SIZE) where RESC_NAME = '$1'")
if [ -z "$used" ]; then
used=0
fi
if [[ $max_bytes -gt 0 && $used -gt 0 ]]; then
# Multiply before dividing so bc does not truncate the ratio to two decimals first.
percent_used=$(echo "scale=2; $used * 100 / $max_bytes" | bc)
fi
if [ $(echo "$percent_used > $critical_level" | bc -l) -eq "1" ]; then
echo "CRITICAL - Resource use is above critical level. bytes_used=$used; \
max_bytes=$max_bytes; critical_threshold=${critical_level}%"
exit $STATE_CRITICAL
fi
if [ $(echo "$percent_used > $warning_level" | bc -l) -eq "1" ]; then
echo "WARNING - Resource use is above warning level. bytes_used=$used; \
max_bytes=$max_bytes; warning_threshold=${warning_level}%"
exit $STATE_WARNING
fi
echo "OK - Resource use is below warning and critical levels. bytes_used=$used; \
max_bytes=$max_bytes"
exit $STATE_OK
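The two core steps of the script, pulling max_bytes out of the resource context string and turning bytes used into a percentage, can be tried without an iRODS server. The sketch below is a standalone illustration with hypothetical helper names (extract_max_bytes, percent_used) and a made-up context string; it uses integer arithmetic instead of bc.

```shell
#!/bin/bash
# Extract the max_bytes value from a semicolon-separated context string.
# Prints 0 when no max_bytes entry is present.
extract_max_bytes() {
    local context=$1 entry max_bytes=0
    local IFS=';'
    for entry in $context; do
        case "$entry" in
            max_bytes=*) max_bytes=${entry#max_bytes=} ;;
        esac
    done
    echo "$max_bytes"
}

# Integer-only percentage with two decimal places (no bc required).
percent_used() {
    local used=$1 max_bytes=$2
    if [ "$max_bytes" -le 0 ]; then
        echo "0.00"
        return
    fi
    local hundredths=$(( used * 10000 / max_bytes ))
    printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

# Hypothetical context string and byte count for illustration.
max=$(extract_max_bytes "some_key=abc;max_bytes=500")
percent_used 455 "$max"
```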
Monitoring Resource Use - Testing
$ sudo service nagios3 restart
Monitoring Connection Count
The next Nagios service that we will create is a service to monitor the connection count.
Monitoring Connection Count - Update irods_environment.json
The irods-grid command needs the control plane settings in the nagios user's irods_environment.json. The values must match the control plane configuration of the servers:
"irods_server_control_plane_encryption_algorithm": "AES-256-CBC",
"irods_server_control_plane_encryption_num_hash_rounds": 16,
"irods_server_control_plane_key": "12345678901234567890123456789012",
"irods_server_control_plane_port": 1248
Monitoring Connection Count - Configuration
define host {
use irods-server-template
host_name ICAT
address ICAT.example.org
}
define hostgroup {
hostgroup_name irods-servers
alias iRODS Servers
members ICAT,Resource1,Resource2
}
define command {
command_name check-active-connections
command_line /usr/lib/nagios/plugins/check_agent_count.sh $HOSTADDRESS$
}
define service {
use generic-service
hostgroups irods-servers
service_description check active connections
check_command check-active-connections
check_interval 1
}
Monitoring Connection Count - Configuration
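Configuration Explanation:
- The ICAT host is added, and the new irods-servers host group covers all three servers.
- The check-active-connections command runs check_agent_count.sh with the host address.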
Monitoring Connection Count - Script
Create the bash script check_agent_count.sh.
Monitoring Connection Count - Script
/usr/lib/nagios/plugins/check_agent_count.sh
#!/bin/bash
export HOME=/var/lib/nagios
STATE_OK=0
STATE_UNKNOWN=3
if [ $# -lt 1 ]; then
echo "Usage: check_agent_count.sh <host_name>"
exit $STATE_UNKNOWN
fi
host_name=$1
agent_count=$(irods-grid --all status | jq ' .hosts[] | .hostname + " " + "\(.agents[].agent_pid)"' | grep -c "$host_name")
echo "OK - open connections = $agent_count"
exit $STATE_OK
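Because grep matches substrings, a host named Resource1.example.org would also count lines for a host such as Resource10.example.org. A minimal sketch of an exact-match count is shown below. It assumes the input has already been reduced to unquoted "hostname pid" lines (for example with jq -r); count_agents_for_host is a hypothetical helper and the sample data is invented.

```shell
#!/bin/bash
# Count lines whose first field equals the given host name exactly.
# Input on stdin: one "hostname pid" pair per line.
# count_agents_for_host is a hypothetical helper used only for illustration.
count_agents_for_host() {
    awk -v host="$1" '$1 == host { n++ } END { print n + 0 }'
}

# Invented sample of "hostname pid" lines.
sample=$'Resource1.example.org 4101\nResource10.example.org 5200\nResource1.example.org 4102'
printf '%s\n' "$sample" | count_agents_for_host Resource1.example.org
```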
Monitoring Connection Count - Testing
Restart Nagios and try it out.