Tuesday, June 16, 2009

Nagios and Tomcat Event Handler

First, make sure you understand how Nagios works. Assuming that Tomcat runs on a remote server, there is a "nagios" user on that server, and it needs the rights to restart Tomcat (CATALINA_HOME/bin/catalina.sh stop). If the nagios user tries to stop Tomcat without those rights, it gets the following error:
su nagios
/usr/local/tomcat-18version/bin/catalina.sh stop
Jun 15, 2009 2:54:18 PM org.apache.catalina.startup.Catalina stopServer
SEVERE: Catalina.stop:
java.io.FileNotFoundException: /opt/apache-tomcat-6.0.18/conf/server.xml (Permission denied)

The best thing to do is to create a group "tomcat", give this group privileges on CATALINA_HOME, and add the user "nagios" to it. In this case, Tomcat was downloaded to the following directory: /opt/apache-tomcat-6.0.18/. Perform the following steps as the "root" user:
I created a symbolic link so that I don't have to change anything when Tomcat is upgraded:
ln -s /opt/apache-tomcat-6.0.18/ /opt/tomcat
Listing /opt now shows the link:
ls -l /opt
tomcat -> /opt/apache-tomcat-6.0.18/
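As a rough illustration of why the link helps at upgrade time, the sketch below repoints a stable name from one version directory to another. The directory names are hypothetical and are created in a throwaway temp directory rather than in /opt.

```shell
#!/bin/sh
# Sketch only: hypothetical directory names inside a temp directory.
demo=$(mktemp -d)
mkdir "$demo/apache-tomcat-old" "$demo/apache-tomcat-new"

# Initial install: the stable name points at the current version.
ln -s "$demo/apache-tomcat-old" "$demo/tomcat"

# Upgrade: repoint the link. -f replaces the existing link and -n stops ln
# from following it into the old directory.
ln -sfn "$demo/apache-tomcat-new" "$demo/tomcat"

# Show where the stable name points now.
readlink "$demo/tomcat"
```

Scripts and sudoers entries that refer to the stable name keep working after the switch.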
Create the group using the groupadd command:
groupadd tomcat
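Add the existing nagios user to the tomcat group. Note that -g makes tomcat the user's primary group; use "usermod -a -G tomcat nagios" instead if you want to keep the original primary group.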
usermod -g tomcat nagios
Give the group "tomcat" ownership of /opt/tomcat and of the original directory. First check the id of the user, then change the group from within /opt:
[root@dev opt]# id nagios
uid=501(nagios) gid=503(tomcat) groups=503(tomcat)
chgrp -R tomcat apache-tomcat-6.0.18
chgrp -R tomcat tomcat
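A quick way to confirm the membership from a script, rather than reading the id output by eye, might look like the sketch below. The helper name in_group is made up for this example; it relies only on the standard "id -nG" call.

```shell
#!/bin/sh
# in_group USER GROUP (hypothetical helper): succeed if USER belongs to
# GROUP, either as primary or as supplementary group.
in_group() {
    for g in $(id -nG "$1" 2>/dev/null); do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Report on the nagios/tomcat pair used in this post.
if in_group nagios tomcat; then
    echo "nagios is in the tomcat group"
else
    echo "nagios is NOT in the tomcat group"
fi
```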

To test that the nagios user is now able to stop Tomcat, run the following commands:
su nagios
/usr/local/tomcat-18version/bin/catalina.sh stop
The nagios user also needs privileges to restart the Tomcat server and to kill it in case Tomcat doesn't shut down. Since only root can bind to privileged ports (e.g. port 80), edit the sudoers file (visudo):
##add the following line below "root    ALL=(ALL)       ALL"
nagios ALL=(ALL) NOPASSWD:/opt/tomcat/bin/catalina.sh,/bin/kill
Now, add the event handler. Create the file /usr/local/nagios/libexec/eventhandlers/restart-tomcat.sh:

#!/bin/sh
# restart-tomcat.sh - tomcat restart script
echo "`date`------------ Shutting down tomcat---------------"

# Check whether tomcat is running. If it is, stop it gracefully
# get the tomcat pid (second column of the ps -ef output)
tomcat_pid=`ps -ef | grep java | grep tomcat | awk '{print $2}'`
echo "Tomcat PID is: $tomcat_pid"

if [ -n "$tomcat_pid" ]
then
    echo "Stopping tomcat ..."
    sudo /opt/tomcat/bin/catalina.sh stop
    # give tomcat 60 seconds to shutdown gracefully
    sleep 60

    tomcat_pid=`ps -ef | grep java | grep tomcat | awk '{print $2}'`
    # if tomcat_pid exists, kill the process
    if [ -n "$tomcat_pid" ]
    then
        echo "Noticed that process is still running trying to kill it"
        sudo kill $tomcat_pid
        sleep 60

        tomcat_pid=`ps -ef | grep java | grep tomcat | awk '{print $2}'`
        # if tomcat_pid still exists, really kill the process
        if [ -n "$tomcat_pid" ]
        then
            echo "Forcefully killing the process for tomcat $tomcat_pid..."
            sudo kill -9 $tomcat_pid
            sleep 60
        fi
    fi
fi

# restart tomcat
echo "`date` Starting tomcat..."
sudo /opt/tomcat/bin/catalina.sh start
echo "`date` Finished starting tomcat"
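The PID lookup is the fragile part of the script, so here is a small sketch of it in isolation. It takes the PID from the second whitespace-separated field, which keeps working when PIDs grow past five digits. The function name tomcat_pids and the sample process lines are made up for this example.

```shell
#!/bin/sh
# tomcat_pids (hypothetical helper): read `ps -ef` style lines on stdin and
# print the PID of every line that mentions both java and tomcat.
# The [j] and [t] brackets keep awk from matching its own command line when
# it is fed by a live `ps -ef`.
tomcat_pids() {
    awk '/[j]ava/ && /[t]omcat/ { print $2 }'
}

# Made-up sample lines standing in for real `ps -ef` output.
sample='root         1     0  0 09:00 ?  00:00:01 /sbin/init
nagios    4321     1  2 09:05 ?  00:01:10 /usr/bin/java -Dcatalina.home=/opt/tomcat org.apache.catalina.startup.Bootstrap start
nagios  123456     1  0 09:06 ?  00:00:00 /usr/bin/java -jar /srv/other-app.jar'

# Only the second line mentions both words, so this prints 4321.
printf '%s\n' "$sample" | tomcat_pids

# On a live system the same function would be fed by ps directly:
#   tomcat_pid=$(ps -ef | tomcat_pids)
```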
Create a wrapper script (restart-tomcat-eventhandler.sh) that runs restart-tomcat.sh. This way, every restart is recorded in a log:

echo "Restarting Tomcat `date`" >> /usr/local/nagios/libexec/eventhandlers/restart-tomcat.log
/usr/local/nagios/libexec/eventhandlers/restart-tomcat.sh >> /usr/local/nagios/libexec/eventhandlers/restart-tomcat.log
echo "Finished `date`" >> /usr/local/nagios/libexec/eventhandlers/restart-tomcat.log
echo "-------------------------Finished `date`-----------------------------"
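The wrapper above only captures standard output. As a possible variation, the sketch below wraps any command, appends both stdout and stderr to the log, and records the exit status. The function name run_logged is an invention for this example, and the demonstration uses a temporary log file and a harmless command instead of the real restart script.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...] (hypothetical helper)
# Append a start line, the command's stdout and stderr, and a finish line
# with the exit status to LOGFILE. Returns the command's exit status.
run_logged() {
    logfile=$1
    shift
    echo "Restarting Tomcat `date`" >> "$logfile"
    "$@" >> "$logfile" 2>&1
    status=$?
    echo "Finished `date` (exit status $status)" >> "$logfile"
    return $status
}

# Demonstration with a harmless command and a temporary log file.
log=$(mktemp)
run_logged "$log" sh -c 'echo "to stdout"; echo "to stderr" >&2'
cat "$log"
```

In the real wrapper the call would presumably be run_logged followed by the log path and the path to restart-tomcat.sh.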

On the Nagios server
Create the event handler: /opt/user/local/nagios/event-handler/restart-tomcat-eventhandler.sh

#!/bin/sh
#
# Event handler script for restarting Tomcat on a remote host through NRPE
# Note: This script will only restart the server if the service is
# retried 3 times (in a "soft" state) or if the web service somehow
# manages to fall into a "hard" error state.
#
# Arguments: $1 = service state, $2 = state type, $3 = check attempt, $4 = host

# What state is the HTTP service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Aha! The HTTP service appears to have a problem - perhaps we should restart the server...

    # Is this a "soft" or a "hard" state?
    case "$2" in

    # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified...
    SOFT)
        # What check attempt are we on? We don't want to restart the web server on the first
        # check, because it may just be a fluke!
        case "$3" in

        # Attempt number 3: restart Tomcat
        3)
            echo -n "Soft-> Restarting Tomcat..."
            echo -n "/usr/local/nagios/libexec/check_nrpe -H $4 -c restart-tomcat"
            /usr/local/nagios/libexec/check_nrpe -H "$4" -c restart-tomcat
            ;;
        esac
        ;;

    # The HTTP service somehow managed to turn into a hard error without getting fixed.
    # It should have been restarted by the code above, but for some reason it didn't.
    # Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        echo -n "Hard-> Restarting Tomcat..."
        echo -n "/usr/local/nagios/libexec/check_nrpe -H $4 -c restart-tomcat"
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c restart-tomcat
        ;;
    esac
    ;;
esac
exit 0
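The decision the handler makes can be tried out without touching any server. The sketch below reduces it to a single function, assuming the handler receives the service state, the state type, and the attempt number as its first three arguments. The name should_restart is made up for this example.

```shell
#!/bin/sh
# should_restart STATE STATETYPE ATTEMPT (hypothetical helper)
# Succeeds only for a CRITICAL service that is either on its 3rd soft
# attempt or already in a hard state.
should_restart() {
    [ "$1" = "CRITICAL" ] || return 1
    case "$2" in
        SOFT) [ "$3" = "3" ] ;;
        HARD) return 0 ;;
        *)    return 1 ;;
    esac
}

for args in "OK HARD 1" "CRITICAL SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 4"; do
    # $args is left unquoted on purpose so it splits into three arguments.
    if should_restart $args; then
        echo "$args -> restart"
    else
        echo "$args -> no action"
    fi
done
```

Only the last two combinations in the loop lead to a restart.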


Finally, add the event handler as a command by editing /usr/local/nagios/etc/nrpe.cfg on the Tomcat server.
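The post does not show the nrpe.cfg entry itself. A minimal sketch of what it might look like follows, assuming the wrapper script was saved next to restart-tomcat.sh under the name restart-tomcat-eventhandler.sh; adjust the path to wherever the wrapper actually lives.

```
# nrpe.cfg on the Tomcat server (sketch; the script path is an assumption)
command[restart-tomcat]=/usr/local/nagios/libexec/eventhandlers/restart-tomcat-eventhandler.sh
```

The name in brackets is what check_nrpe passes with -c, and the NRPE daemon has to be restarted to pick up the change.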
Test that the command is working correctly by executing the following command from the Nagios server:

/usr/local/nagios/libexec/check_nrpe -H tomcatserver -c restart-tomcat -t 30

Now, add the service to restart the server:
define service{
    use                  generic-service
    host_name            midc
    service_description  check_midc_login_page
    process_perf_data    1
    check_command        check_http!-H midc.up-mobile.com -u /midc/doLogin.do -w 5 -c 10
    event_handler        restart-tomcat
}
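The event_handler value has to match a command that Nagios knows about, and the post does not show that definition. A rough sketch, assuming the handler script path given earlier and the standard Nagios macros for state, state type, attempt number, and host address (in the order the script reads them as $1 to $4):

```
# Nagios server, e.g. in the commands configuration file (sketch)
define command{
    command_name    restart-tomcat
    command_line    /opt/user/local/nagios/event-handler/restart-tomcat-eventhandler.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}
```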

Now restart nagios (service nagios restart) and you should be ready.
