Background
This article describes how easy it is to integrate other tools with Nagios or op5 Monitor. As an example I use a webshop, where a business view of how the webshop is doing is implemented with NodeBrain, a GPL’d rule engine. An earlier article described the ruleset for this implementation; this one shows how the integration can be done.
Scenario
The scenario is a webshop with:
- 5 frontend webservers
- 2 application servers
- 3 database servers
Management wants to monitor how the webshop is doing. They do not want to know whether a single redundant part is down; instead they want an overview of the webshop status.
A management consultant is hired to investigate, and after a ridiculous amount of money has been spent the following rules are defined:
- Webserver rules
  - If 3 or more webservers work, the webserver layer is OK
  - If 2 webservers work, the webserver layer is WARNING
  - If 1 or fewer webservers work, the webserver layer is CRITICAL
- Application server rules
  - If 1 or 2 application servers work, the application layer is OK
  - If no application servers work, the application layer is CRITICAL
- Database server rules
  - If 2 or more database servers work, the database layer is OK
  - If 1 database server works, the database layer is WARNING
  - If no database servers work, the database layer is CRITICAL
- The webserver layer, application layer and database layer should be viewed separately
- The total webshop status is the highest (most severe) status value of the webserver layer, application layer and database layer
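The rules above can be sketched in a few lines of shell. This is only an illustration of the logic, not the NodeBrain implementation used in this article; the function names layer_status and worst_status are made up for the example, and the status codes follow the usual Nagios convention (0=OK, 1=WARNING, 2=CRITICAL).

```sh
#!/bin/sh
# Hypothetical helper (not part of the article's setup): map the number of
# working servers in a layer to a status code (0=OK, 1=WARNING, 2=CRITICAL).
# $1 = working servers, $2 = minimum needed for OK, $3 = minimum needed for WARNING
layer_status() {
    if [ "$1" -ge "$2" ]; then
        echo 0
    elif [ "$1" -ge "$3" ]; then
        echo 1
    else
        echo 2
    fi
}

# Hypothetical helper: the total status is the highest of the layer statuses.
worst_status() {
    worst=0
    for s in "$@"; do
        if [ "$s" -gt "$worst" ]; then
            worst=$s
        fi
    done
    echo "$worst"
}

web=$(layer_status 3 3 2)   # 3 of 5 webservers work
app=$(layer_status 1 1 1)   # 1 of 2 application servers works (no WARNING level)
db=$(layer_status 1 2 1)    # 1 of 3 database servers works
echo "web=$web app=$app db=$db total=$(worst_status "$web" "$app" "$db")"
```

With these sample numbers the script prints web=0 app=0 db=1 total=1, so the webshop as a whole is WARNING because the database layer is.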
I use NagVis to illustrate the relationship between the layers.
Case 1
The picture shows the state when everything is fine:
Case 2
With the rules in place:
- 2 webservers CRITICAL and the webserver layer is OK
- 1 appserver CRITICAL and the application layer is OK
- 2 databases are down and the database layer is WARNING
- The total webshop status is WARNING because it has the highest status of the different layers
Case 3
Now things have become even worse:
- 4 webservers CRITICAL and the webserver layer is CRITICAL
- 1 appserver CRITICAL and the application layer is OK
- 2 databases are down and the database layer is WARNING
- The total webshop status is CRITICAL because it has the highest status of the different layers
Conclusions
This article shows the power of Open Source and what is possible when integrating different projects with each other. A solution like this from one of the Big Four (IBM, BMC, CA, HP) would have cost a lot in licenses, and highly specialised consultants would have had to be hired.
Links
- op5, a company that packages and supports enterprise-class systems and network management products
- NodeBrain, a powerful GPL’d rule engine
- Nagios, enterprise-class monitoring software
- NagVis, a Nagios visualization addon
Implementation
Hosts and services
The hosts and services are created:
- Webserver layer: 5 hosts each with 1 service
- Application layer: 2 hosts each with 1 service
- Database layer: 3 hosts each with 1 service
- Webshop layer: 1 host called webshopcontainer and 4 services: webserversstatus, appserversstatus, dbserversstatus and webshopstatus. The first three services represent the layers in the model, and webshopstatus is the total status of the webshop.
To make it easy to control the status of all these devices I use passive checks: to change the status of a service I just use the GUI to submit a passive check result. In real life, active checks would be used to monitor the different services.
The result, shown in the Service Detail view of Ninja, the Nagios GUI developed by op5:
State changes are sent to NodeBrain by an event handler that writes NodeBrain commands to a named pipe:
#!/bin/sh
#
# Event handler script for sending Nagios state changes to NodeBrain
#
# Arguments: $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ $HOSTNAME$
# The case statement below matches state names (OK, WARNING, UNKNOWN, CRITICAL),
# so the first argument has to be $SERVICESTATE$ rather than the numeric $SERVICESTATEID$.
NODEBRAINPIPE=/opt/plugins/custom/nodebrainpipe
DATE=`date`
echo "$DATE SERVICESTATE=$1 SERVICESTATETYPE=$2 SERVICEATTEMPT=$3 SERVICEDESC=$4 HOSTNAME=$5" >> /tmp/eventhandler_out
# What state is the service in?
case "$1" in
OK)
    # The service just came back up: assert 0 (OK) for it in NodeBrain
    echo "assert $4=0;" > "$NODEBRAINPIPE"
    echo "assert $4=0;" >> /tmp/eventhandler_out
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified...
        # Don't do anything
        ;;
    HARD)
        # Hard state: contacts have already been notified at this point (unless
        # notifications are disabled for this service). Assert 2 (CRITICAL) in NodeBrain.
        echo "assert $4=2;" > "$NODEBRAINPIPE"
        echo "assert $4=2;" >> /tmp/eventhandler_out
        ;;
    esac
    ;;
esac
exit 0
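To try out the mapping from Nagios states to NodeBrain commands without a running Nagios or a named pipe, the decision logic of the event handler can be isolated in a small function. This is a sketch for experimenting only: the function name nodebrain_command is hypothetical, and it prints the command instead of writing it to the pipe.

```sh
#!/bin/sh
# Hypothetical helper: print the NodeBrain command the event handler would send.
# $1 = service state name, $2 = state type (SOFT or HARD), $3 = service description
nodebrain_command() {
    case "$1" in
    OK)
        # Recovery: the service variable goes back to 0
        echo "assert $3=0;"
        ;;
    CRITICAL)
        # Only hard CRITICAL states are forwarded
        if [ "$2" = "HARD" ]; then
            echo "assert $3=2;"
        fi
        ;;
    esac
    # WARNING, UNKNOWN and soft CRITICAL states print nothing
}

nodebrain_command CRITICAL HARD weba
nodebrain_command CRITICAL SOFT weba
nodebrain_command OK HARD weba
```

The first call prints assert weba=2; the second prints nothing, and the third prints assert weba=0;. Note that WARNING is never forwarded, so in this setup a single server only ever counts as working or failed.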
The NodeBrain rules, which hold all the logic and open the named pipe for commands:
#!/usr/local/bin/nb -d
#
-rm webshop.log
set log="webshop.log",out=".";
declare indata identity owner;
define webshop node pipe.server("indata@nodebrainpipe");
#Start with webservers
#Set OK
assert weba=0;
assert webb=0;
assert webc=0;
assert webd=0;
assert webe=0;
assert webserversstatus=0;
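#define webserver rules
#Each server variable is 0 (OK) or 2 (CRITICAL), so the sum is 2 x the number of failed servers
#5 frontend webservers: if 3 or more are ok (sum <= 4), status is ok
#if 2 are ok (sum = 6), status is warning
#if 1 or 0 are ok (sum >= 8), status is critical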
define webservers cell weba+webb+webc+webd+webe;
define webserversok on(webservers<=4) webserversstatus=0;
define webseroksend on(webserversstatus=0):-./send_to_monitor.sh webserversstatus 0 "OK: Webservers are fine";
define webserverswarning on(webservers>4 and webservers<8) webserversstatus=1;
define webserverswarningsend on(webserversstatus=1):-./send_to_monitor.sh webserversstatus 1 "WARNING: Webservers have problems";
define webserverscritical on(webservers>=8) webserversstatus=2;
define webserverscriticalsend on(webserversstatus=2):-./send_to_monitor.sh webserversstatus 2 "CRITICAL: Webservers have serious problems";
#appservers
assert appa=0;
assert appb=0;
assert appserversstatus=0;
#2 appservers, 1 down is ok, 2 down critical
define appservers cell appa+appb;
define appserversok on(appservers<=2) appserversstatus=0;
define appserversoksend on(appserversstatus=0):-./send_to_monitor.sh appserversstatus 0 "OK: Appservers are fine";
define appserverscritical on(appservers>2) appserversstatus=2;
define appserverscriticalsend on(appserversstatus=2):-./send_to_monitor.sh appserversstatus 2 "CRITICAL: Appservers have serious problems";
#Databaseservers
assert dba=0;
assert dbb=0;
assert dbc=0;
assert dbserversstatus=0;
#3 db servers
#if 2 or more ok (sum <= 2), status ok
#if 1 ok (sum = 4), status warning
#if 0 ok (sum = 6), status critical
define dbservers cell dba+dbb+dbc;
define dbserversok on(dbservers<=2) dbserversstatus=0;
define dbserversoksend on(dbserversstatus=0):-./send_to_monitor.sh dbserversstatus 0 "OK: DataBaseservers are fine";
define dbserverswarning on(dbservers>=4 and dbservers <6)dbserversstatus=1;
define dbserverswarningsend on(dbserversstatus=1):-./send_to_monitor.sh dbserversstatus 1 "WARNING: DataBaseservers have problems";
define dbservercritical on(dbservers>=6)dbserversstatus=2;
define dbservercriticalsend on(dbserversstatus=2):-./send_to_monitor.sh dbserversstatus 2 "CRITICAL: DataBaseservers have serious problems";
#Total rules
assert webshopstatus=0;
#If all serverstatus ok, the whole webshop is ok
define webshopok on(webserversstatus=0 and appserversstatus=0 and dbserversstatus=0) webshopstatus=0;
define webshopoksend on(webshopstatus=0):-./send_to_monitor.sh webshopstatus 0 "OK: Webshop is fine";
#If any serverstatus is critical the whole webshop is critical
define webshopscritical on(webserversstatus=2 or appserversstatus=2 or dbserversstatus=2) webshopstatus=2;
define webshopscriticalsend on(webshopstatus=2):-./send_to_monitor.sh webshopstatus 2 "CRITICAL: Webshop has serious problems";
#If not any serverstatuscritical and in warning, the whole shop is warning.
define webshopwarning on((!webserversstatus=2 and !appserversstatus=2 and !dbserversstatus=2) and (webserversstatus=1 or dbserversstatus=1)) webshopstatus=1;
define webshopwarningsend on(webshopstatus=1):-./send_to_monitor.sh webshopstatus 1 "WARNING: Webshop has some problems";
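The thresholds in the rules are easier to read once you see that every server variable is asserted as either 0 (OK) or 2 (CRITICAL) by the event handler, so each layer's sum is twice the number of failed servers. The following shell sketch only replays that arithmetic for the webserver layer; the function name web_status_from_sum is made up for the illustration and nothing here talks to NodeBrain.

```sh
#!/bin/sh
# Hypothetical helper: apply the same thresholds as the webservers rules
# (sum <= 4 is OK, sum above 4 and below 8 is WARNING, sum >= 8 is CRITICAL).
web_status_from_sum() {
    if [ "$1" -le 4 ]; then
        echo 0
    elif [ "$1" -lt 8 ]; then
        echo 1
    else
        echo 2
    fi
}

# Each failed webserver contributes 2 to the sum.
for down in 0 1 2 3 4 5; do
    sum=$((down * 2))
    echo "$down webservers down -> sum=$sum -> status $(web_status_from_sum "$sum")"
done
```

Running it shows status 0 for 0 to 2 failed webservers, status 1 for 3 failed (2 still working) and status 2 for 4 or 5 failed, which is the webserver rule from the scenario.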
The NodeBrain rules run this script (send_to_monitor.sh) when they fire:
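#!/bin/sh
# Submit a passive check result for a service on the webshopcontainer host.
# Arguments: service description, status code (0, 1 or 2), plugin output
HOSTNAME=webshopcontainer
SERVICEDESC=$1
STATUS=$2
MESSAGE=$3
now=`date +%s`
commandfile='/opt/monitor/var/rw/nagios.cmd'
# Pass the values as printf arguments so that a "%" in the message cannot be misread as a format
/usr/bin/printf "[%lu] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n" "$now" "$HOSTNAME" "$SERVICEDESC" "$STATUS" "$MESSAGE" > "$commandfile"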
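To see exactly what ends up in the Nagios command file, it can help to build the external command line separately and print it. A minimal sketch, assuming the standard PROCESS_SERVICE_CHECK_RESULT format (timestamp, host, service, return code, plugin output); the function name format_check_result and the fixed timestamp are only there for the example.

```sh
#!/bin/sh
# Hypothetical helper: build a passive service check result line without
# writing it anywhere, so it can be inspected or tested.
# $1 = Unix timestamp, $2 = host, $3 = service, $4 = status code, $5 = plugin output
format_check_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with a fixed timestamp; send_to_monitor.sh uses `date +%s` instead.
format_check_result 1256232780 webshopcontainer webshopstatus 1 "WARNING: Webshop has some problems"
```

This prints [1256232780] PROCESS_SERVICE_CHECK_RESULT;webshopcontainer;webshopstatus;1;WARNING: Webshop has some problems, which is the kind of line the script above redirects into nagios.cmd.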
The Nagios or op5 Monitor hosts.cfg:
###############################################################################
# Generated by op5 Monitor webconfiguration exporter
#
# Exported 2009-10-22 19:33 by monitor
#
# host template 'Dummy-template'
define host{
name Dummy-template
initial_state o
hostgroups NodeBrainDemo
check_command check-host-alive
max_check_attempts 5
check_interval 5
retry_interval 1
obsess_over_host 0
check_freshness 0
active_checks_enabled 1
passive_checks_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
flap_detection_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 0
notification_period 24x7
notification_options d,u,r,f
notifications_enabled 1
stalking_options n
register 0
}
# host template 'default-host-template'
define host{
name default-host-template
check_command check-host-alive
max_check_attempts 3
check_interval 5
retry_interval 0
check_period 24x7
active_checks_enabled 1
passive_checks_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 0
notification_period 24x7
notification_options d,u,r,f,s
notifications_enabled 1
register 0
}
# host 'app-host-a'
define host{
use Dummy-template
host_name app-host-a
alias App Host A
address 127.0.0.1
hostgroups NodeBrainDemo
contact_groups support-group
}
# host 'app-host-b'
define host{
use Dummy-template
host_name app-host-b
alias App Host B
address 127.0.0.1
contact_groups support-group
}
# host 'db-host-a'
define host{
use Dummy-template
host_name db-host-a
alias DB Host A
address 127.0.0.1
contact_groups support-group
}
# host 'db-host-b'
define host{
use Dummy-template
host_name db-host-b
alias DB Host B
address 127.0.0.1
contact_groups support-group
}
# host 'db-host-c'
define host{
use Dummy-template
host_name db-host-c
alias DB Host C
address 127.0.0.1
contact_groups support-group
}
# host 'web-host-a'
define host{
use Dummy-template
host_name web-host-a
alias Web Host A
address 127.0.0.1
contact_groups support-group
}
# host 'web-host-b'
define host{
use Dummy-template
host_name web-host-b
alias Web Host B
address 127.0.0.1
contact_groups support-group
}
# host 'web-host-c'
define host{
use Dummy-template
host_name web-host-c
alias Web Host C
address 127.0.0.1
contact_groups support-group
}
# host 'web-host-d'
define host{
use Dummy-template
host_name web-host-d
alias Web Host D
address 127.0.0.1
contact_groups support-group
}
# host 'web-host-e'
define host{
use Dummy-template
host_name web-host-e
alias Web Host E
address 127.0.0.1
contact_groups support-group
}
# host 'webshopcontainer'
define host{
use Dummy-template
host_name webshopcontainer
alias webshopcontainer
address 127.0.0.1
contact_groups support-group
}
The Nagios or op5 Monitor services.cfg:
###############################################################################
# Generated by op5 Monitor webconfiguration exporter
#
# Exported 2009-10-22 19:33 by monitor
#
# service template 'Dummy-service-template'
define service{
name Dummy-service-template
display_name Dummy-service-template
is_volatile 0
check_command check_dummy!0
initial_state o
max_check_attempts 1
check_interval 1
retry_interval 1
active_checks_enabled 0
passive_checks_enabled 1
check_period 24x7
parallelize_check 1
obsess_over_service 1
check_freshness 0
event_handler_enabled 1
flap_detection_enabled 1
flap_detection_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 0
notification_period 24x7
notification_options c,w,u,r,f
notifications_enabled 1
stalking_options n
register 0
}
# service template 'default-service'
define service{
name default-service
is_volatile 0
max_check_attempts 3
check_interval 5
retry_interval 1
active_checks_enabled 1
passive_checks_enabled 1
check_period 24x7
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 0
notification_period 24x7
notification_options c,w,u,r,f,s
notifications_enabled 1
contact_groups support-group
register 0
}
####################################################
#
# Services for host app-host-a
#
# service 'appa'
define service{
use default-service
host_name app-host-a
service_description appa
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
parallelize_check 0
obsess_over_service 0
check_freshness 0
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
flap_detection_options n
contact_groups support-group
stalking_options n
}
####################################################
#
# Services for host app-host-b
#
# service 'appb'
define service{
use default-service
host_name app-host-b
service_description appb
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host db-host-a
#
# service 'dba'
define service{
use default-service
host_name db-host-a
service_description dba
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host db-host-b
#
# service 'dbb'
define service{
use default-service
host_name db-host-b
service_description dbb
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host db-host-c
#
# service 'dbc'
define service{
use default-service
host_name db-host-c
service_description dbc
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host web-host-a
#
# service 'weba'
define service{
use default-service
host_name web-host-a
service_description weba
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host web-host-b
#
# service 'webb'
define service{
use default-service
host_name web-host-b
service_description webb
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host web-host-c
#
# service 'webc'
define service{
use default-service
host_name web-host-c
service_description webc
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host web-host-d
#
# service 'webd'
define service{
use default-service
host_name web-host-d
service_description webd
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host web-host-e
#
# service 'webe'
define service{
use default-service
host_name web-host-e
service_description webe
check_command check_dummy!0
servicegroups webshop
max_check_attempts 1
event_handler eventhandler_send_to_nodebrain
flap_detection_enabled 0
}
####################################################
#
# Services for host webshopcontainer
#
# service 'appserversstatus'
define service{
use Dummy-service-template
host_name webshopcontainer
service_description appserversstatus
servicegroups webshop
flap_detection_enabled 0
}
# service 'dbserversstatus'
define service{
use Dummy-service-template
host_name webshopcontainer
service_description dbserversstatus
servicegroups webshop
flap_detection_enabled 0
}
# service 'webserversstatus'
define service{
use Dummy-service-template
host_name webshopcontainer
service_description webserversstatus
servicegroups webshop
flap_detection_enabled 0
}
# service 'webshopstatus'
define service{
use Dummy-service-template
host_name webshopcontainer
service_description webshopstatus
servicegroups webshop
flap_detection_enabled 0
}
October 22nd, 2009 at 11:29 pm
Hi,
using the Nagios plugin check_multi you could do the whole stuff
pretty easy 😉
check_multi uses perl expressions to do the state evaluation and
is therefore flexible and powerful.
It took me about 5 minutes to write down the sketch of these four services below according to your rules, where there are three services for the server types and one top level service for the webshop itself which ties everything together.
You can find check_multi here:
http://www.my-plugin.de/wiki/projects/check_multi/start
Cheers,
-Matthias
> * Webserver rules
> o If 3 or more webserver works the webservice is OK
> o If 2 webservers works the webservice is WARNING
> o If 1 webserver or less is working the webservice is CRITICAL
> * Applicationserver rules
> o If 1 or 2 application servers works the application layer is OK
> o If zero application servers works the application layer is CRITICAL
> * Database server rules
> o If 2 or more database server works the database layer is OK
> o if 1 database server works the database layer is WARNING
> o If no database servers works the database layer is CRITICAL
> * The webserver layer, application layer and database layer should be viewed separately
> * The total webshop status has the highest status value of webserver layer, application layer and database layer
web.cmd:
# call: check_multi -f web.cmd
statusdat [ web1 ] = webserver1:webservice1
statusdat [ web2 ] = webserver2:webservice2
statusdat [ web3 ] = webserver3:webservice3
statusdat [ web4 ] = webserver4:webservice4
statusdat [ web5 ] = webserver5:webservice5
state [ WARNING ] = count(OK)<=2
state [ CRITICAL ] = count(OK)<=1
app.cmd:
# call: check_multi -f app.cmd
statusdat [ app1 ] = appserver1:appservice1
statusdat [ app2 ] = appserver2:appservice2
state [ CRITICAL ] = count(OK)<1
db.cmd:
# call: check_multi -f db.cmd
statusdat [ db1 ] = dbserver1:dbservice1
statusdat [ db2 ] = dbserver2:dbservice2
statusdat [ db3 ] = dbserver3:dbservice3
state [ WARNING ] = count(OK)<=1
state [ CRITICAL ] = count(OK)<1
webshop.cmd:
# call: check_multi -f webshop.cmd
statusdat [ web ] = nagiosserver:web
statusdat [ app ] = nagiosserver:app
statusdat [ db ] = nagiosserver:db
October 23rd, 2009 at 9:41 am
Hi Matthias,
Yes, you are right, check_multi would be easier to use in this case.
The purpose of the article was to show how to integrate Nagios with NodeBrain, not to be the perfect implementation of the webshop scenario.
IMHO the biggest gap in Nagios is that it does not have a rule engine. In most cases one is not necessary, but in some cases it is needed. Solutions like check_multi and check_cluster can help a bit. But if you need more advanced rules, for example with correlation over time, logs, SNMP traps and so on, you need a rule engine.
My experience is that the managers who want a business view of the environment and the people implementing Nagios do not speak with each other. An advanced rule engine could bridge that gap by attracting business consultants who normally work with the Big Four. Solutions like this are seldom a technical problem.
November 9th, 2009 at 9:01 am
NodeBrain sounds an awful lot like Prolog. Why wouldn’t one just use Prolog? It is much more mature and has lots of documentation. It seems to have been created for just this very thing decades ago.
November 9th, 2009 at 9:28 am
Feel free to use prolog if you want 🙂
As a former Tivoli consultant I have used Prolog to program Tivoli Enterprise Console, and yes, Prolog will probably do the job. Personally I prefer NodeBrain: after a few hours with NodeBrain I could do more than I could do with T/EC Prolog after a week of training.
May 5th, 2010 at 12:30 pm
Hi,
NodeBrain looks nice, and I knew check_multi before, but I use the Nagios addon “Business Process View”.
It has the same abilities and, from my point of view, is much easier to deploy.
See: http://nagiosbp.projects.nagiosforge.org/
It also has an impact analysis tool where you can set the state of a service to see the impact on your defined processes.
Integrating these processes in Nagios or NagVis is also possible via the bp_cfg2service_cfg.pl script that comes with Business Process View.
Cheers,
Khark
March 1st, 2011 at 12:52 am
Peter, great tutorial. We are taking a look at NodeBrain for our environment but are experiencing install trouble.
OS CentOS release 5.5
uname -a Linux 2.6.18-194.26.1.el5 #1 SMP Tue Nov 9 12:54:20 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
from ./configure
….
checking for pcre_compile in -lpcre… no
configure: error: Required library -lpcre not found. You may want to download it from http://www.pcre.org or locate it and include directory in LD_LIBRARY_PATH to support this build.
configure: error: ./configure failed for lib
Setting ld_library_path does not seem to help.
Did you experience anything like this?
Do you have any suggestions or recommended resources that might help us resolve it?
March 4th, 2011 at 9:49 am
I have had the same experience on CentOS 5.4 and 5.5. I never figured out how to get it to compile.
I solved it by running NodeBrain on an Ubuntu box instead, not a good solution…
July 2nd, 2012 at 3:27 pm
$ sudo yum whatprovides \*/libpcre\*
pcre-devel-6.6-6.el5_6.1.x86_64 : Development files for pcre
Repo : base
Matched from:
Filename : /usr/lib64/libpcre.so
Filename : /usr/lib64/libpcrecpp.so
Filename : /usr/lib64/libpcreposix.so
Filename : /usr/lib64/libpcre.a
Filename : /usr/lib64/pkgconfig/libpcre.pc
Filename : /usr/lib64/libpcreposix.a
Filename : /usr/lib64/libpcrecpp.a
$ sudo yum -y install pcre pcre-devel