Monitoring iRODS with Nagios
September 26, 2016
Justin James
jjames@renci.org
iRODS Application Engineer
Introduction
- Nagios is an open source application that monitors network services, host resources, and hardware.
- Nagios has an inheritance-based configuration scheme that is flexible and easily extensible.
- Nagios users can use plugins created by the Nagios team or create plugins themselves.
Our Use Cases
We will use Nagios to monitor iRODS servers in the following ways:
- Ping iRODS resource servers to determine if they are up. If a status change is detected, mark the resc_status in the database as "up" or "down".
- Monitor resource use and warn when usage exceeds thresholds, which are defined as percentages of the max_bytes setting in the resource context.
- Monitor the connection count on iRODS servers. This is informational only and will not generate warnings or errors.
Sample Setup
The configuration files in this slide deck are based on the following iRODS grid setup.
One iCAT enabled server and two resource servers:
- ICAT.example.org – 192.168.1.150
- Resource1.example.org – 192.168.1.151
- Resource2.example.org – 192.168.1.152
Two iRODS resources:
- Resource1Resource - on Resource1.example.org
- Resource2Resource - on Resource2.example.org
iping Command
We will create our own custom iping command to determine if a resource server is up or down.
The iping command makes an rcConnect() call and determines the resource server's status from the return value of that call.
The iping source code can be found at:
https://github.com/irods/contrib/blob/master/iping/src/iping.cpp
Building and Installing the iping Command
Perform the following steps to build and install the iping command:
$ export PATH=/opt/irods-externals/cmake3.5.2-0/bin:$PATH
$ cd ~
$ git clone https://github.com/irods/contrib
$ mkdir ipingbuild
$ cd ipingbuild
$ cmake ~/contrib/iping
$ make package
$ sudo dpkg -i irods-iping_4.2.0~trusty_amd64.deb
Testing the iping Command
The iping command has now been installed in the system.
Test the iping command:
$ iping -h Resource1.example.org
OK : connection to iRODS server successful
$ iping -h Resource2.example.org
OK : connection to iRODS server successful
$ echo $?
0
Bring down Resource1.example.org and test iping again:
$ iping -h Resource1.example.org -p 1247
ERROR: _rcConnect: connectToRhost error, server on Resource1.example.org:1247 is probably down status = -305111 USER_SOCK_CONNECT_ERR, Connection refused
$ echo $?
2
Installing Nagios
Nagios and the Nagios plugins packages can be installed with the following command:
sudo apt install nagios3 nagios-nrpe-plugin
After installation, test that Nagios is up by going to http://localhost/nagios3 and logging in with the nagiosadmin user account. Use the password that was set during Nagios installation.
Ping iRODS Resource
The first Nagios service that we will create is a ping service to determine the status of resource servers.
- This service will rely on the iping command that was created and installed earlier.
- If a status change is detected, the resc_status in the database will be updated to reflect the current status.
- This resc_status is used when resolving which resource to read from when the resource is in a replication hierarchy.
Ping iRODS Resource
Steps to build and test the resource ping service:
- Allow the Nagios user to log in to iRODS under an administrator account.
- Update the Nagios configuration to define the hosts, commands, and services.
- Create scripts for the commands that will be used to monitor the resources and update the resc_status.
- Test the service using the Nagios web monitoring tool.
Ping iRODS Resource - Allowing nagios user login access
The nagios user on the local filesystem needs to be able to connect to iRODS using an administrator account.
- Switch user to the nagios user.
- You may have to update /etc/passwd to give nagios a home directory and make bash its default shell. Example /etc/passwd entry:
nagios:x:119:128::/var/lib/nagios:/bin/bash
- Perform iinit. When prompted for username and password, provide the information for an iRODS administrator account.
Ping iRODS Resource - Configuration
Create a directory to hold our configuration file and update nagios.cfg to look for configuration files in this directory:
$ sudo mkdir -p /etc/nagios3/irods
$ echo "cfg_dir=/etc/nagios3/irods" | sudo tee -a /etc/nagios3/nagios.cfg
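The cfg_dir line above is appended each time the command runs, so repeating the setup would leave duplicate entries in nagios.cfg. A minimal sketch of an idempotent append follows. The helper name ensure_line is hypothetical, and the demonstration writes to a scratch file rather than the real configuration.

```shell
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper name, not part of Nagios.
ensure_line() {
    local file="$1" line="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration against a scratch file instead of /etc/nagios3/nagios.cfg.
demo_cfg=$(mktemp)
ensure_line "$demo_cfg" "cfg_dir=/etc/nagios3/irods"
ensure_line "$demo_cfg" "cfg_dir=/etc/nagios3/irods"   # second call is a no-op
grep -c "cfg_dir=" "$demo_cfg"                         # prints 1
rm -f "$demo_cfg"
```

Against the real file the function would have to run as root, since nagios.cfg is normally not writable by ordinary users.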
Create /etc/nagios3/irods/irods.cfg
define host {
name irods-server-template
use generic-host
check_interval 1
register 0
}
define host {
use irods-server-template
host_name Resource1
alias Resource1Resource
address Resource1.example.org
}
define host {
use irods-server-template
host_name Resource2
alias Resource2Resource
address Resource2.example.org
}
Ping iRODS Resource - Configuration
define hostgroup{
hostgroup_name irods-resource-servers
alias iRODS Servers
members Resource1,Resource2
}
define command{
command_name iping-irods-server
command_line /usr/lib/nagios/plugins/iping.sh -h $HOSTADDRESS$ -p $ARG1$
}
define command {
command_name update-irods-resource-state
command_line /usr/lib/nagios/plugins/update_irods_resc_state.sh $HOSTADDRESS$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
define service {
use generic-service
hostgroups irods-resource-servers
service_description IPING
check_command iping-irods-server!1247
check_interval 1
event_handler update-irods-resource-state
}
/etc/nagios3/irods/irods.cfg (cont)
Ping iRODS Resource - Configuration
Explanation:
- Creates a template for all iRODS servers (irods-server-template)
- Derived from generic-host
- check_interval of 1 means every minute
- register 0 means it is only a template
- Defines the hosts
- Derived from the irods-server-template
- alias is the name of the resource on the host
- address is the FQDN of the host
Ping iRODS Resource - Configuration
Explanation (cont):
- Defines a hostgroup (irods-resource-servers) that includes the two hosts.
- Defines a command (iping-irods-server) which executes the iping.sh script (yet to be created).
- Defines a command (update-irods-resource-state) which executes the update_irods_resc_state.sh script (yet to be created).
- Defines a service (IPING)
- Executes the iping-irods-server command every minute (check_interval 1)
- Sends 1247 (iRODS listen port) as $ARG1$
- Executes update-irods-resource-state command on status changes.
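Nagios splits the check_command value on "!": the first field selects the command definition and the remaining fields become $ARG1$, $ARG2$, and so on. The sketch below only imitates that substitution for the IPING service so the resulting command line can be seen. Nagios does this internally; expand_check is a hypothetical helper that handles just $HOSTADDRESS$ and $ARG1$.

```shell
# Illustration only: imitate how Nagios combines a check_command value with a
# command_line template. expand_check is a hypothetical helper name.
expand_check() {
    local check_command="$1" template="$2" host_address="$3"
    local arg1
    # The text between the first "!" and the next "!" (if any) is $ARG1$.
    case $check_command in
        *'!'*) arg1=${check_command#*!}; arg1=${arg1%%!*} ;;
        *)     arg1= ;;
    esac
    # [$] matches a literal dollar sign, so the macros are replaced as text.
    printf '%s\n' "$template" |
        sed -e 's|[$]HOSTADDRESS[$]|'"$host_address"'|g' \
            -e 's|[$]ARG1[$]|'"$arg1"'|g'
}

expand_check 'iping-irods-server!1247' \
    '/usr/lib/nagios/plugins/iping.sh -h $HOSTADDRESS$ -p $ARG1$' \
    'Resource1.example.org'
# prints: /usr/lib/nagios/plugins/iping.sh -h Resource1.example.org -p 1247
```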
Ping iRODS Resource - Scripts
Next create the bash scripts for iping.sh and update_irods_resc_state.sh
- /usr/lib/nagios/plugins/iping.sh
- Wrapper for the iping command.
- Returns 2 on error and 0 on success.
- /usr/lib/nagios/plugins/update_irods_resc_state.sh
- Accepts HOST as first argument
- Accepts SERVICE_STATE as second argument
- Performs an iquest query to get the resource name(s) for the host.
- For OK (0) response, uses iadmin modresc to set the status to "up".
- For CRITICAL (2) response, uses iadmin modresc to set the status to "down".
Ping iRODS Resource - Scripts
/usr/lib/nagios/plugins/iping.sh
#!/bin/bash
export HOME=/var/lib/nagios
return=0
/usr/bin/iping "$@" 2>&1 || return=$?
if [ $return -gt 3 ]; then
exit 2
else
exit $return
fi
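Nagios interprets plugin exit codes 0 through 3 as OK, WARNING, CRITICAL, and UNKNOWN. When running the wrapper by hand it can help to see the state name rather than the number. The sketch below is one way to do that; nagios_state_name is a hypothetical helper, and it treats codes above 3 as CRITICAL, mirroring the wrapper above.

```shell
# Translate a plugin exit code into the Nagios state name.
# Codes outside 0-3 are reported as CRITICAL, as in iping.sh.
# nagios_state_name is a hypothetical helper for manual testing.
nagios_state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo CRITICAL ;;
    esac
}

nagios_state_name 0      # prints OK
nagios_state_name 113    # prints CRITICAL
```

With iRODS installed, the helper could follow a manual run of the wrapper, for example: /usr/lib/nagios/plugins/iping.sh -h Resource1.example.org -p 1247; nagios_state_name $?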
Ping iRODS Resource - Scripts
/usr/lib/nagios/plugins/update_irods_resc_state.sh
#!/bin/bash
export HOME=/var/lib/nagios
LOGFILE=/tmp/update_resc.log
echo update_irods_resc_state.sh "$@" >> $LOGFILE
HOST=$1
SERVICE_STATE=$2
SERVICE_STATE_TYPE=$3
SERVICE_ATTEMPT=$4
RESOURCES=$(iquest "%s" "select RESC_NAME where RESC_LOC = '$HOST'")
echo RESOURCES = $RESOURCES >> $LOGFILE
echo SERVICE_STATE = $SERVICE_STATE >> $LOGFILE
case "$SERVICE_STATE" in
OK)
for RESOURCE in $RESOURCES; do
iadmin modresc $RESOURCE status up
done
;;
WARNING)
;;
UNKNOWN)
;;
CRITICAL)
for RESOURCE in $RESOURCES; do
iadmin modresc $RESOURCE status down
done
;;
esac
exit 0
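The script above receives $SERVICESTATETYPE$ and $SERVICEATTEMPT$ but does not use them. Nagios runs event handlers for SOFT state changes as well as HARD ones, so a single failed check is enough to mark a resource down. A possible variation, which is an assumption and not the behavior shown in these slides, is to act only on HARD states. The function name desired_resc_status is hypothetical.

```shell
# Decide which resc_status (if any) to set for a Nagios state change.
# Prints "up", "down", or nothing. desired_resc_status is a hypothetical
# helper; acting only on HARD states is an assumption, not the slides' behavior.
desired_resc_status() {
    local state="$1" state_type="$2"
    # Ignore SOFT states so one failed check does not mark a resource down.
    [ "$state_type" = "HARD" ] || return 0
    case $state in
        OK)       echo up ;;
        CRITICAL) echo down ;;
        *)        ;;   # WARNING and UNKNOWN leave the status unchanged
    esac
}

desired_resc_status CRITICAL SOFT   # prints nothing
desired_resc_status CRITICAL HARD   # prints down
desired_resc_status OK HARD         # prints up
```

In the event handler, the result would be captured from "$SERVICE_STATE" and "$SERVICE_STATE_TYPE", and iadmin modresc would run only when it is non-empty. The trade-off is latency: a resource is marked down only after the service's retry attempts are exhausted, not after the first failed check.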
Ping iRODS Resource - Testing
- Restart Nagios
sudo service nagios3 restart
- Bring Down a Resource.
- Shut the server down; or
- Stop the iRODS service (./irodsctl stop)
- After waiting a minute, check the resource status
$ iadmin lr Resource1Resource | grep resc_status
resc_status: down
- Bring Up the Resource
- After waiting a minute, check the resource status
$ iadmin lr Resource1Resource | grep resc_status
resc_status: up
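Before the restart in the first step above, the new configuration can be checked so that a typo in irods.cfg does not take monitoring down. The sketch below restarts only when a verification command succeeds. restart_if_valid is a hypothetical helper, and both commands are passed in as strings so the logic can be tried with stand-ins.

```shell
# Run a configuration check and restart only if it passes.
# restart_if_valid is a hypothetical helper; both commands are passed in
# so the sketch can be tried without touching a live Nagios.
restart_if_valid() {
    local verify_cmd="$1" restart_cmd="$2"
    if sh -c "$verify_cmd"; then
        sh -c "$restart_cmd"
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Dry run with stand-in commands:
restart_if_valid "true" "echo restarting"                        # prints restarting
restart_if_valid "false" "echo restarting" || echo "skipped restart"
```

On a system set up as in these slides, the call would presumably be restart_if_valid "sudo nagios3 -v /etc/nagios3/nagios.cfg" "sudo service nagios3 restart", assuming the nagios3 binary installed by the Ubuntu package is on the PATH.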
Ping iRODS Resource - Testing
- Check the services on the Nagios monitoring tool. Click "Services"
Ping iRODS Resource - Use in Replication Resc Hierarchy
- Create a replication resource and place our two resources under it.
$ iadmin mkresc RootResource replication
Creating resource:
Name: "RootResource"
Type: "replication"
Host: ""
Path: ""
Context: ""
$ iadmin addchildtoresc RootResource Resource1Resource
$ iadmin addchildtoresc RootResource Resource2Resource
$ ilsresc
demoResc
RootResource:replication
├── Resource1Resource
└── Resource2Resource
- With both resources up put a file into iRODS
$ echo test > test.txt
$ iput -R RootResource test.txt
$ ils -l test.txt
rods 0 RootResource;Resource1Resource 5 2016-09-09.10:24 & test.txt
rods 1 RootResource;Resource2Resource 5 2016-09-09.10:24 & test.txt
Ping iRODS Resource - Use in Replication Resc Hierarchy
- Bring Resource1Resource down and quickly try to get test.txt.
- Since Resource1Resource holds the lowest replica number, iRODS will attempt to read test.txt from this resource.
- If Resource1Resource is still marked as up this will fail.
- Wait a minute and attempt to get test.txt again.
$ iget test.txt -
ERROR: getUtil: get error for - status = -305111 USER_SOCK_CONNECT_ERR, Connection refused
$ iget test.txt -
test
Monitoring Resource Use
The next Nagios service that we will create is a service to monitor the resource use.
- Determines the number of bytes the resource is using
- Produces a WARNING if the bytes are above threshold1.
- Produces a CRITICAL if the bytes are above threshold2.
Monitoring Resource Use - Update Resource Context
- Update the resource context with the max_bytes.
- For testing this is set to an arbitrarily small number (500 bytes)
$ iadmin modresc Resource1Resource context "max_bytes=500"
$ iadmin modresc Resource2Resource context "max_bytes=500"
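The context is stored as a string of key=value pairs separated by semicolons, which is how the check script later in this deck reads it. A small sketch of pulling max_bytes back out of such a string follows. context_max_bytes is a hypothetical helper, and other_key in the sample string is a made-up key used only for illustration.

```shell
# Extract the max_bytes value from a resource context string of the form
# "key1=value1;key2=value2". Prints nothing if max_bytes is absent.
# context_max_bytes is a hypothetical helper name.
context_max_bytes() {
    # Split on ";" and keep only the entry whose key is exactly max_bytes.
    printf '%s\n' "$1" | tr ';' '\n' | sed -n 's/^max_bytes=//p'
}

context_max_bytes "max_bytes=500"               # prints 500
context_max_bytes "other_key=1;max_bytes=500"   # prints 500
```

To confirm the setting on a live grid, the argument could be the output of iquest "%s" "select RESC_CONTEXT where RESC_NAME = 'Resource1Resource'". Anchoring the match at the start of the entry avoids picking up a key that merely ends in max_bytes.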
Monitoring Resource Use - Configuration
- Create a service to monitor the bytes. Append the following to /etc/nagios3/irods/irods.cfg:
define command {
command_name check-resource-use
command_line /usr/lib/nagios/plugins/check_resource_use.sh $HOSTALIAS$ $ARG1$ $ARG2$
}
define service {
use generic-service
hostgroups irods-resource-servers
service_description check-resource-use
check_command check-resource-use!90.0!95.0
check_interval 1
}
Monitoring Resource Use - Configuration
Configuration Explanation:
- Command named check-resource-use created.
- This command executes check_resource_use.sh script.
- New service created.
- Service calls check-resource-use command.
- Service runs against all hosts in the hostgroup irods-resource-servers.
- Sends 90.0 as $ARG1$. This is the WARNING threshold percentage.
- Sends 95.0 as $ARG2$. This is the CRITICAL threshold percentage.
- The $HOSTALIAS$ is the resource name.
Monitoring Resource Use - Script
Create bash script for check_resource_use.sh.
- Performs an "iquest" command to get the context string for the resource and parses the max_bytes value from this context.
- Performs another "iquest" command to get the total number of bytes used by the resource.
- Calculates the percentage of use.
- Returns CRITICAL (2) if the percentage of bytes used is greater than the critical threshold.
- Returns WARNING (1) if the percentage of bytes used is greater than the warning threshold.
- Returns OK (0) if the percentage is less than or equal to the warning threshold.
Monitoring Resource Use - Script
/usr/lib/nagios/plugins/check_resource_use.sh
#!/bin/bash
export HOME=/var/lib/nagios
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
if [ $# -lt 3 ]; then
echo "Use: check_resource_use.sh <resource_name> <warning_percent> <critical_percent>"
exit $STATE_UNKNOWN
fi
max_bytes=0
percent_used=0
warning_level=$2
critical_level=$3
if ! [[ $warning_level =~ ^[0-9]+([.][0-9]+)?$ ]]; then
echo "Warning level provided is not a valid number."
exit $STATE_UNKNOWN
fi
if ! [[ $critical_level =~ ^[0-9]+([.][0-9]+)?$ ]]; then
echo "Critical level provided is not a valid number."
exit $STATE_UNKNOWN
fi
Monitoring Resource Use - Script
/usr/lib/nagios/plugins/check_resource_use.sh (cont)
context=$(iquest "%s" "select RESC_CONTEXT where RESC_NAME = '$1'")
for i in $(echo $context | tr ";" "\n"); do
[[ $i =~ "max_bytes=" ]] && max_bytes=$(echo $i | cut -b11-)
done
used=$(iquest "%s" "select sum(DATA_SIZE) where RESC_NAME = '$1'")
if [ -z "$used" ]; then
used=0
fi
if [[ $max_bytes -gt 0 && $used -gt 0 ]]; then
percent_used=$(echo "scale=2; ($used / $max_bytes) * 100.0" | bc)
fi
if [ $(echo "$percent_used > $critical_level" | bc -l) -eq "1" ]; then
echo "CRITICAL - Resource use is above critical level. bytes_used=$used; \
max_bytes=$max_bytes; critical_threshold=${critical_level}%"
exit $STATE_CRITICAL
fi
if [ $(echo "$percent_used > $warning_level" | bc -l) -eq "1" ]; then
echo "WARNING - Resource usage is above warning level. bytes_used=$used; \
max_bytes=$max_bytes; warning_threshold=${warning_level}%"
exit $STATE_WARNING
fi
echo "OK - Resource use is below warning and critical levels. bytes_used=$used; \
max_bytes=$max_bytes"
exit $STATE_OK
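The threshold logic can be tried without an iRODS grid by feeding it numbers directly. With max_bytes=500 and thresholds of 90.0 and 95.0, 460 bytes used is 92%, which is above the warning level but not the critical one. The sketch below reproduces that arithmetic; classify_usage is a hypothetical helper, and awk stands in for the bc calls used in the script.

```shell
# Classify resource usage against warning and critical percentages.
# Prints the percentage used and returns the Nagios state code (0, 1, or 2).
# classify_usage is a hypothetical helper; awk stands in for bc here.
classify_usage() {
    local used="$1" max_bytes="$2" warn="$3" crit="$4"
    awk -v u="$used" -v m="$max_bytes" -v w="$warn" -v c="$crit" 'BEGIN {
        pct = (m > 0) ? (u * 100) / m : 0
        printf "%.2f\n", pct
        if (pct > c) exit 2     # CRITICAL
        if (pct > w) exit 1     # WARNING
        exit 0                  # OK
    }'
}

rc=0
classify_usage 460 500 90.0 95.0 || rc=$?   # prints 92.00
echo "state=$rc"                            # prints state=1
```

Multiplying before dividing keeps the precision that the script's scale=2 division would otherwise truncate.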
Monitoring Resource Use - Testing
- Restart Nagios to pick up the new configuration.
sudo service nagios3 restart
- Check the monitoring tool ("Services") for Resource1 and Resource2. Note that the bytes used is listed.
- Add files so that the percentage of bytes used reaches the WARNING and CRITICAL levels. Check the monitoring tool.
Monitoring Connection Count
The next Nagios service that we will create is a service to monitor the connection count.
- Parses the output of the irods-grid command to get the number of active connections.
- This is informational only so it will always return an OK (0) status.
Monitoring Connection Count - Update irods_environment.json
- Because this script needs to run the irods-grid command, a few entries need to be added to ~nagios/.irods/irods_environment.json:
"irods_server_control_plane_encryption_algorithm": "AES-256-CBC",
"irods_server_control_plane_encryption_num_hash_rounds": 16,
"irods_server_control_plane_key": "12345678901234567890123456789012",
"irods_server_control_plane_port": 1248
- Use the same values that are stored in ~irods/.irods/irods_environment.json
- Perform an "iinit".
Monitoring Connection Count - Configuration
- Create a service to monitor the number of connections. Append the following to /etc/nagios3/irods/irods.cfg:
define host {
use irods-server-template
host_name ICAT
address ICAT.example.org
}
define hostgroup {
hostgroup_name irods-servers
alias iRODS Servers
members ICAT,Resource1,Resource2
}
define command {
command_name check-active-connections
command_line /usr/lib/nagios/plugins/check_agent_count.sh $HOSTADDRESS$
}
define service {
use generic-service
hostgroups irods-servers
service_description check active connections
check_command check-active-connections
check_interval 1
}
Monitoring Connection Count - Configuration
Configuration Explanation:
- New host created for ICAT.example.org.
- Hostgroup irods-servers created which includes both resource servers and the ICAT server.
- Command named check-active-connections created.
- This command executes the check_agent_count.sh script.
- Service created to execute the check-active-connections command.
Monitoring Connection Count - Script
Create bash script for check_agent_count.sh.
- Performs an "irods-grid" command.
- Pipes the output of the "irods-grid" command to jq, which parses the JSON to determine the number of agents.
- Returns OK (0) with a string indicating the number of agents.
Note: jq can be installed on Ubuntu with "sudo apt-get install jq".
Monitoring Connection Count - Script
/usr/lib/nagios/plugins/check_agent_count.sh
#!/bin/bash
export HOME=/var/lib/nagios
STATE_OK=0
STATE_UNKNOWN=3
if [ $# -lt 1 ]; then
echo "Use: check_agent_count.sh <host_name>"
exit $STATE_UNKNOWN
fi
host_name=$1
agent_count=$(irods-grid --all status | jq ' .hosts[] | .hostname + " " + "\(.agents[].agent_pid)"' | grep $host_name | wc -l)
echo "OK - open connections = $agent_count"
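The grep in the script above matches substrings, so a host named Resource1.example.org would also count agents reported for a host such as Resource10.example.org. A possible alternative is to let jq select the host by exact name and count its agents. This is a sketch: count_agents is a hypothetical helper, and the JSON shape (.hosts[].hostname and .hosts[].agents[]) is inferred from the jq filter in the script, not from irods-grid documentation.

```shell
# Count agents for one host from "irods-grid --all status" style JSON on stdin.
# count_agents is a hypothetical helper; the JSON shape is inferred from the
# jq filter used in check_agent_count.sh.
count_agents() {
    # Collect every agent entry of the matching host into an array, then
    # print the array length. A host with no agents yields 0.
    jq --arg host "$1" \
       '[.hosts[] | select(.hostname == $host) | .agents[]] | length'
}
```

In the script this would replace the jq, grep, and wc pipeline, for example: agent_count=$(irods-grid --all status | count_agents "$host_name"). It assumes the hostname reported by irods-grid is spelled exactly like the address Nagios passes in.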
Monitoring Connection Count - Testing
Restart Nagios and try it out.
- Refresh the Nagios monitoring tool services page.
- By default all three servers should have a count of 1.
- Try running multiple iRODS commands in parallel (possibly large puts that take a long time) and wait for an update.
- As long as the processes are still running when the check is performed you should have a count greater than 1.