An It-slave in the digital saltmine

15
Jun

Using Nagios or op5 Monitor eventhandler to start a service that has stopped

Posted by peter

Background

I use MythTV quite frequently and have noticed that it is unstable when sasc-ng is used to decrypt encrypted DVB-T channels. Approximately every third day the MythTV backend server stops and needs to be started again. I have written an earlier article about how to monitor MythTV with Nagios or op5 Monitor, so I get notified when it has stopped, but I still need to start it again manually. This article describes how to make Nagios or op5 Monitor start a stopped MythTV backend. The same approach can be used to start almost any service.

I have used the examples provided by Ethan in the official Nagios documentation on event handlers.

Normally it is not recommended to let a tool like Nagios or op5 Monitor start a service that has stopped, because there is probably a reason why the service stopped, and the correct procedure is to fix the root cause of the problem, not the symptom.

The MythTV backend runs on a machine called lala (after a character in Teletubbies), which is not the same machine as the Nagios or op5 Monitor server. I use nrpe to run the start script, i.e.

 /etc/init.d/mythtv-backend start

There are several options here, but I had already set up the nrpe agent, and it is simple to make Nagios or op5 Monitor use nrpe to run a script.

Implementation

I used the script I found in the Nagios documentation about event handlers as a base and modified it slightly.

On my op5 Monitor machine

/opt/plugins/custom/restart-mythtv-lala.sh

#!/bin/sh
#
# Event handler script for restarting the mythTVbackend server on lala
#
# Note: This script will only restart the mythtvbackend if the service is
#       retried 2 times (in a "soft" state) or if the service somehow
#       manages to fall into a "hard" error state.
#

# What state is the mythbackend service in?
case "$1" in
OK)
	# The service just came back up, so don't do anything...
	;;
WARNING)
	# We don't really care about warning states, since the service is probably still running...
	;;
UNKNOWN)
	# We don't know what might be causing an unknown error, so don't do anything...
	;;
CRITICAL)
	# Aha!  The mythbackend service appears to have a problem - perhaps we should restart it...

	# Is this a "soft" or a "hard" state?
	case "$2" in

	# We're in a "soft" state, meaning that Nagios is in the middle of retrying the
	# check before it turns into a "hard" state and contacts get notified...
	SOFT)

		# What check attempt are we on?  We don't want to restart the backend on the first
		# check, because it may just be a fluke!
		case "$3" in

		# Wait until the check has been tried 2 times before starting the backend.
		# If the checks keep failing after that until the maximum number of check
		# attempts is reached, the state type will turn to "hard" and contacts will
		# be notified of the problem. Hopefully the start succeeds, so that a later
		# check results in a "soft" recovery.  If that happens no one gets notified
		# because we fixed the problem!
		2)
			echo "`date` Restarting mythtv service (2rd soft critical state)..." >> /tmp/mythtvstart
			# Use nrpe to run the start command for the mythbackend server on lala
			/opt/plugins/check_nrpe -H lala -c start_mythtvbackend
			;;
		esac
		;;

	# The mythtvbackend service somehow managed to turn into a hard error without getting fixed.
	# It should have been restarted by the code above, but for some reason it wasn't.
	# Let's give it one last try, shall we?
	# Note: Contacts have already been notified of a problem with the service at this
	# point (unless you disabled notifications for this service)
	HARD)
		echo "`date` Restarting mythtv service (hard state)..." >> /tmp/mythtvstart
		# Use nrpe to run the start command for the mythbackend server on lala
		/opt/plugins/check_nrpe -H lala -c start_mythtvbackend
		;;
	esac
	;;
esac
exit 0
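
The decision logic of the handler can be tried out without touching the real service. The following is only a sketch under assumptions: handler_action is a hypothetical helper name, and it mirrors the case statements above but prints the action instead of calling check_nrpe.

```shell
#!/bin/sh
# Dry-run sketch: prints what the event handler would do for a given
# state, state type and check attempt. Nothing is restarted.
# handler_action is a hypothetical name, not part of Nagios or op5 Monitor.
handler_action() {
    state=$1
    state_type=$2
    attempt=$3
    case "$state" in
    CRITICAL)
        case "$state_type" in
        SOFT)
            # Only the 2nd soft check triggers a restart
            if [ "$attempt" = "2" ]; then
                echo restart
            else
                echo ignore
            fi
            ;;
        HARD)
            # Last try once the hard state has been reached
            echo restart
            ;;
        *)
            echo ignore
            ;;
        esac
        ;;
    *)
        # OK, WARNING and UNKNOWN never trigger a restart
        echo ignore
        ;;
    esac
}
```

For example, handler_action CRITICAL SOFT 2 prints restart, while handler_action CRITICAL SOFT 1 and handler_action WARNING HARD 3 both print ignore.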

/opt/monitor/etc/misccommands.cfg

# command 'restart-mythtv-lala'
define command{
    command_name                   restart-mythtv-lala
    command_line                   /opt/plugins/custom/restart-mythtv-lala.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    }

/opt/monitor/etc/services.cfg

# service 'Mythbackend'
define service{
    use                            default-service
    host_name                      lala
    service_description            Mythbackend
    check_command                  check_tcp!6543
    servicegroups                  MythTV,it-slav
    event_handler                  restart-mythtv-lala
    contact_groups                 it-slav_sms,it-slav_jabber,it_slav_mail
    }

On my mythbackend machine lala

/etc/nrpe.d/mycommands.cfg
command[start_mythtvbackend]=/usr/bin/sudo /etc/init.d/mythtv-backend start

/etc/sudoers
nobody ALL=(root) NOPASSWD: /etc/init.d/mythtv-backend start

Note that my nrpe agent runs as user nobody.
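
Because the handler fires both on the second soft check and on the hard state, the start command may be issued while the backend is already coming up. One possible guard is sketched below. It is not the setup described above: it assumes the backend process is named mythbackend and that pgrep is available on lala, and start_if_stopped is a hypothetical function name.

```shell
#!/bin/sh
# Sketch of a guard around the start command (assumptions: process name
# and availability of pgrep; start_if_stopped is a hypothetical helper).
# Usage: start_if_stopped PROCESS_NAME START_COMMAND [ARGS...]
start_if_stopped() {
    proc_name=$1
    shift
    # pgrep -x matches the exact process name
    if pgrep -x "$proc_name" >/dev/null 2>&1; then
        echo "OK: $proc_name already running"
        return 0
    fi
    # Run the start command that was passed in
    if "$@"; then
        echo "OK: started $proc_name"
    else
        echo "CRITICAL: could not start $proc_name"
        return 2
    fi
}
```

It could be called as start_if_stopped mythbackend /usr/bin/sudo /etc/init.d/mythtv-backend start from a small wrapper script that the nrpe command points at. The printed line and the exit code are what nrpe would hand back to the monitor server.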

Test

I stopped the mythtvbackend by running:

peter@lala:/etc/nrpe.d$ date
Mon Jun 15 20:40:55 CEST 2009
peter@lala:/etc/nrpe.d$ sudo /etc/init.d/mythtv-backend stop
 * Stopping MythTV server: mythbackend

And then ran on the op5 Monitor machine:

[root@op5 ~]# tail -f /tmp/mythtvstart
Mon Jun 15 20:47:09 CEST 2009 Restarting mythtv service (2rd soft critical state)...

YES it works!
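
Since every restart appends a line to /tmp/mythtvstart, the same file can be used to see how often the handler has fired. A minimal sketch, assuming the log format written by the handler script; count_restarts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count how many restarts the event handler has logged.
# count_restarts is a hypothetical helper; the default path is the
# log file written by the handler script.
count_restarts() {
    log_file=${1:-/tmp/mythtvstart}
    # No log file yet means no restarts so far
    if [ ! -f "$log_file" ]; then
        echo 0
        return 0
    fi
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so the exit status is masked here
    grep -c "Restarting mythtv service" "$log_file" || true
}
```

Running count_restarts without arguments reads /tmp/mythtvstart. A count that keeps growing is a reminder that the root cause is still there.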
