22 Oct

Rule engine integration with Nagios using NodeBrain

Posted by peter

Background

This article describes how easy it is to integrate other tools with Nagios or op5 Monitor. As an example I use a webshop, where a business view of how the webshop is doing is implemented with a GPL’d rule engine, NodeBrain. I described the ruleset for this implementation in an earlier article; here I show how the integration can be done.

Scenario

The scenario is a webshop with:

5 frontend webservers
2 application servers
3 database servers

Management wants to monitor how the webshop is doing. They do not want to know whether a single redundant part is down; instead, management wants an overview of the webshop status.

A management consultant is hired to do an investigation, and after a ridiculous amount of money the following rules are defined:

Webserver rules
- If 3 or more webservers work, the webservice is OK
- If 2 webservers work, the webservice is WARNING
- If 1 webserver or fewer works, the webservice is CRITICAL
Application server rules
- If 1 or 2 application servers work, the application layer is OK
- If zero application servers work, the application layer is CRITICAL
Database server rules
- If 2 or more database servers work, the database layer is OK
- If 1 database server works, the database layer is WARNING
- If no database servers work, the database layer is CRITICAL
The webserver layer, application layer and database layer should be viewed separately
The total webshop status has the highest status value of the webserver layer, application layer and database layer
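
The rules above can be sketched in a few lines of Python. This is only an illustration of the rule set, not part of the actual implementation, which uses NodeBrain. The names layer_status, webshop_status and THRESHOLDS are made up for this sketch, and the status values follow the Nagios convention 0 = OK, 1 = WARNING, 2 = CRITICAL.

```python
# Illustrative sketch of the business rules; all names are hypothetical.
OK, WARNING, CRITICAL = 0, 1, 2

# Minimum number of working servers needed for OK and for WARNING, per layer.
# A layer without a WARNING level uses the same value for both thresholds.
THRESHOLDS = {
    "web": {"ok": 3, "warning": 2},
    "app": {"ok": 1, "warning": 1},
    "db": {"ok": 2, "warning": 1},
}


def layer_status(layer, working):
    """Return the status of one layer given how many of its servers work."""
    limits = THRESHOLDS[layer]
    if working >= limits["ok"]:
        return OK
    if working >= limits["warning"]:
        return WARNING
    return CRITICAL


def webshop_status(web_working, app_working, db_working):
    """The total status is the highest (worst) of the three layer statuses."""
    return max(
        layer_status("web", web_working),
        layer_status("app", app_working),
        layer_status("db", db_working),
    )
```

With these definitions, webshop_status(3, 1, 1) corresponds to Case 2 and webshop_status(1, 1, 1) to Case 3.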

I use NagVis to illustrate the relationship between the layers.

Case 1

The picture shows when everything is fine:

Screenshot3

Case 2

The rules in place:

2 webservers are CRITICAL and the webserver layer is OK
1 appserver is CRITICAL and the application layer is OK
2 databases are down and the database layer is WARNING
The total webshop status is WARNING because that is the highest status among the layers

Screenshot4

Case 3

Now it has become even worse:

4 webservers are CRITICAL and the webserver layer is CRITICAL
1 appserver is CRITICAL and the application layer is OK
2 databases are down and the database layer is WARNING
The total webshop status is CRITICAL because that is the highest status among the layers

Screenshot5

Conclusions

This article shows the power of Open Source and what is possible when integrating different projects with each other. A solution like this from one of the Big Four (IBM, BMC, CA, HP) would have cost a lot in licenses, and highly specialised consultants would have had to be hired.

Implementation

Hosts and services

The hosts and services are created:

Webserver layer: 5 hosts each with 1 service
Application layer: 2 hosts each with 1 service
Database layer: 3 hosts each with 1 service
Webshop layer: 1 host called webshopcontainer with 4 services: webserversstatus, appserversstatus, dbserversstatus and webshopstatus. The first three services represent the individual layers in the model, and webshopstatus is the total status of the webshop.

To make it easy to control the status of all these devices I use passive checks. If I want to change the status of a service, I just use the GUI to submit a passive check result. In real life, active checks would be used to monitor the different services.

The result, shown in the Service Detail view of Ninja, the Nagios GUI developed by op5:

screenshoot2

The state changes are sent to NodeBrain via an event handler that writes NodeBrain commands to a named pipe:

#!/bin/sh
#
# Event handler script for sending nagios data to nodebrain
#
# This script has these arguments $SERVICESTATEID$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ $HOSTNAME$

NODEBRAINPIPE=/opt/plugins/custom/nodebrainpipe
DATE=`date`
echo "$DATE SERVICESTATEID=$1 SERVICESTATETYPE=$2 SERVICEATTEMPT=$3 SERVICEDESC=$4 HOSTNAME=$5" >> /tmp/eventhandler_out

# What state is the service in?
# The labels accept both the textual state ($SERVICESTATE$) and the numeric
# state ID ($SERVICESTATEID$), since the argument list above names the ID.
case "$1" in
OK|0)
    # The service just came back up
    # Send ok to nodebrain
    echo "assert $4=0;" > "$NODEBRAINPIPE"
    echo "assert $4=0;" >> /tmp/eventhandler_out
    ;;
WARNING|1)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN|3)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL|2)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified...
        # Don't do anything
        ;;
    HARD)
        # Hard state, send data to nodebrain
        # Note: Contacts have already been notified of a problem with the service at this
        # point (unless you disabled notifications for this service)
        echo "assert $4=2;" > "$NODEBRAINPIPE"
        echo "assert $4=2;" >> /tmp/eventhandler_out
        ;;
    esac
    ;;
esac

exit 0
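
The decision logic of the event handler is small enough to restate as a function. The following Python sketch is not part of the setup; the function name nodebrain_command is hypothetical, and it assumes the textual state names used by the case labels in the script (OK, WARNING, UNKNOWN, CRITICAL and SOFT, HARD).

```python
# Illustrative Python rendering of the event handler's decision logic.
# The function name nodebrain_command is hypothetical.


def nodebrain_command(state, state_type, service_desc):
    """Return the NodeBrain command for a state change, or None to stay silent."""
    if state == "OK":
        # The service is working: assert 0 for its term in NodeBrain.
        return "assert %s=0;" % service_desc
    if state == "CRITICAL" and state_type == "HARD":
        # Only hard critical states are forwarded, asserted as 2.
        return "assert %s=2;" % service_desc
    # WARNING, UNKNOWN and soft CRITICAL states are ignored.
    return None
```

Because the service description is used directly as the NodeBrain term name, the Nagios services must be named exactly like the terms in the rule file (weba, appa, dba and so on).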

The NodeBrain rules, which contain all the logic and open the named pipe for commands:

#!/usr/local/bin/nb -d
# -rm webshop.log
set log="webshop.log",out=".";
declare indata identity owner;
define webshop node pipe.server("indata@nodebrainpipe");

#Start with webservers
#Set OK
assert weba=0;
assert webb=0;
assert webc=0;
assert webd=0;
assert webe=0;
assert webserversstatus=0;

#define webserver rules
#5 frontwebservers, if 3 or more ok status is ok
#if 2 is ok, status warning
#if 1 or 0 ok, status critical
define webservers cell weba+webb+webc+webd+webe;
define webserversok on(webservers<=4) webserversstatus=0;
define webseroksend on(webserversstatus=0):-./send_to_monitor.sh webserversstatus 0 "OK: Webservers are fine";
define webserverswarning on(webservers>4 and webservers<8) webserversstatus=1;
define webserverswarningsend on(webserversstatus=1):-./send_to_monitor.sh webserversstatus 1 "WARNING: Webservers have problems";
define webserverscritical on(webservers>=8) webserversstatus=2;
define webserverscriticalsend on(webserversstatus=2):-./send_to_monitor.sh webserversstatus 2 "CRITICAL: Webservers have serious problems";

#appservers
assert appa=0;
assert appb=0;
assert appserversstatus=0;

#2 appservers, 1 down is ok, 2 down critical
define appservers cell appa+appb;
define appserversok on(appservers<=2) appserversstatus=0;
define appserversoksend on(appserversstatus=0):-./send_to_monitor.sh appserversstatus 0 "OK: Appservers are fine";
define appserverscritical on(appservers>2) appserversstatus=2;
define appserverscriticalsend on(appserversstatus=2):-./send_to_monitor.sh appserversstatus 2 "CRITICAL: Appservers have serious problems";

#Databaseservers
assert dba=0;
assert dbb=0;
assert dbc=0;
assert dbserversstatus=0;

#3 db servers
#if 2 or more ok, status ok
#if 1 ok, status warning
#if 0 ok, status critical
define dbservers cell dba+dbb+dbc;
define dbserversok on(dbservers<=2) dbserversstatus=0;
define dbserversoksend on(dbserversstatus=0):-./send_to_monitor.sh dbserversstatus 0 "OK: DataBaseservers are fine";
define dbserverswarning on(dbservers>=4 and dbservers<6) dbserversstatus=1;
define dbserverswarningsend on(dbserversstatus=1):-./send_to_monitor.sh dbserversstatus 1 "WARNING: DataBaseservers have problems";
define dbservercritical on(dbservers>=6) dbserversstatus=2;
define dbservercriticalsend on(dbserversstatus=2):-./send_to_monitor.sh dbserversstatus 2 "CRITICAL: DataBaseservers have serious problems";

#Total rules
assert webshopstatus=0;

#If all serverstatus ok, the whole webshop is ok
define webshopok on(webserversstatus=0 and appserversstatus=0 and dbserversstatus=0) webshopstatus=0;
define webshopoksend on(webshopstatus=0):-./send_to_monitor.sh webshopstatus 0 "OK: Webshop is fine";

#If any serverstatus is critical the whole webshop is critical
define webshopscritical on(webserversstatus=2 or appserversstatus=2 or dbserversstatus=2) webshopstatus=2;
define webshopscriticalsend on(webshopstatus=2):-./send_to_monitor.sh webshopstatus 2 "CRITICAL: Webshop has serious problems";

#If not any serverstatus critical and in warning, the whole shop is warning.
define webshopwarning on((!webserversstatus=2 and !appserversstatus=2 and !dbserversstatus=2) and (webserversstatus=1 or dbserversstatus=1)) webshopstatus=1;
define webshopwarningsend on(webshopstatus=1):-./send_to_monitor.sh webshopstatus 1 "WARNING: Webshop has some problems";
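
The thresholds in the rules may look odd at first. The event handler asserts 0 for a working service and 2 for a critical one, so each cell (webservers, appservers, dbservers) is twice the number of critical servers in that layer. The following Python sketch is not NodeBrain code and uses made-up function names; it only mirrors the comparisons in the rules so the mapping can be checked.

```python
# Python mirror of the NodeBrain threshold comparisons (illustration only).
OK, WARNING, CRITICAL = 0, 1, 2


def cell_value(critical_count):
    """Each critical service is asserted as 2, each working one as 0."""
    return 2 * critical_count


def web_status(cell):
    """cell stands for weba+webb+webc+webd+webe."""
    if cell <= 4:
        return OK        # 0, 1 or 2 webservers critical
    if cell < 8:
        return WARNING   # 3 webservers critical (cell is 6)
    return CRITICAL      # 4 or 5 webservers critical


def app_status(cell):
    """cell stands for appa+appb."""
    return OK if cell <= 2 else CRITICAL


def db_status(cell):
    """cell stands for dba+dbb+dbc."""
    if cell <= 2:
        return OK        # 0 or 1 database server critical
    if 4 <= cell < 6:
        return WARNING   # 2 database servers critical (cell is 4)
    if cell >= 6:
        return CRITICAL  # all 3 database servers critical
    return None          # no rule matches; cannot occur, the cell is always even
```

Walking through the web layer: with 3 webservers critical the cell is 6, which falls between 4 and 8, so the layer is WARNING, matching the rule that 2 working webservers give a WARNING.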

The NodeBrain rules run this script, send_to_monitor.sh, when they fire:

#!/bin/sh

HOSTNAME=webshopcontainer
SERVICEDESC=$1
STATUS=$2
MESSAGE=$3

now=`date +%s`
commandfile='/opt/monitor/var/rw/nagios.cmd'

/usr/bin/printf "[%lu] PROCESS_SERVICE_CHECK_RESULT;$HOSTNAME;$SERVICEDESC;$STATUS;$MESSAGE\n" $now > $commandfile
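
For readers who prefer to see the external command format spelled out, here is a rough Python equivalent of the script. It is a sketch under assumptions, not the code used in this setup: the function names are hypothetical, and the command file path is simply the one from the script above.

```python
import time

# Hypothetical helper names; the path is the one used in the shell script.
COMMAND_FILE = "/opt/monitor/var/rw/nagios.cmd"


def format_check_result(host, service, status, message, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file."""
    if now is None:
        now = int(time.time())
    # Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return code;output
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        now, host, service, status, message)


def send_check_result(line, command_file=COMMAND_FILE):
    """Write the line to the external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

One small difference from the shell version: the message is passed as data instead of being part of the printf format string, so a percent sign in the message would not be interpreted as a format directive.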

The Nagios or op5 Monitor hosts.cfg

###############################################################################
# Generated by op5 Monitor webconfiguration exporter
#
# Exported 2009-10-22 19:33 by monitor
#



# host template 'Dummy-template'

define host{

    name                           Dummy-template

    initial_state                  o

    hostgroups                     NodeBrainDemo

    check_command                  check-host-alive

    max_check_attempts             5

    check_interval                 5

    retry_interval                 1

    obsess_over_host               0

    check_freshness                0

    active_checks_enabled          1

    passive_checks_enabled         1

    event_handler_enabled          1

    flap_detection_enabled         1

    flap_detection_options         n

    process_perf_data              1

    retain_status_information      1

    retain_nonstatus_information   1

    notification_interval          0

    notification_period            24x7

    notification_options           d,u,r,f

    notifications_enabled          1

    stalking_options               n

    register                       0

    }
# host template 'default-host-template'

define host{

    name                           default-host-template

    check_command                  check-host-alive

    max_check_attempts             3

    check_interval                 5

    retry_interval                 0

    check_period                   24x7

    active_checks_enabled          1

    passive_checks_enabled         1

    event_handler_enabled          1

    flap_detection_enabled         1

    process_perf_data              1

    retain_status_information      1

    retain_nonstatus_information   1

    notification_interval          0

    notification_period            24x7

    notification_options           d,u,r,f,s

    notifications_enabled          1

    register                       0

    }
# host 'app-host-a'

define host{

    use                            Dummy-template

    host_name                      app-host-a

    alias                          App Host A

    address                        127.0.0.1

    hostgroups                     NodeBrainDemo

    contact_groups                 support-group

    }
# host 'app-host-b'

define host{

    use                            Dummy-template

    host_name                      app-host-b

    alias                          App Host B

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'db-host-a'

define host{

    use                            Dummy-template

    host_name                      db-host-a

    alias                          DB Host A

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'db-host-b'

define host{

    use                            Dummy-template

    host_name                      db-host-b

    alias                          DB Host B

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'db-host-c'

define host{

    use                            Dummy-template

    host_name                      db-host-c

    alias                          DB Host C

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'web-host-a'

define host{

    use                            Dummy-template

    host_name                      web-host-a

    alias                          Web Host A

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'web-host-b'

define host{

    use                            Dummy-template

    host_name                      web-host-b

    alias                          Web Host B

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'web-host-c'

define host{

    use                            Dummy-template

    host_name                      web-host-c

    alias                          Web Host C

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'web-host-d'

define host{

    use                            Dummy-template

    host_name                      web-host-d

    alias                          Web Host D

    address                        127.0.0.1

    contact_groups                 support-group

    }
# host 'web-host-e'

define host{

    use                            Dummy-template

    host_name                      web-host-e

    alias                          Web Host E

    address                        127.0.0.1

    contact_groups                 support-group

    }

# host 'webshopcontainer'

define host{
    use                            Dummy-template
    host_name                      webshopcontainer
    alias                          webshopcontainer
    address                        127.0.0.1
    contact_groups                 support-group
    }

The Nagios or op5 Monitor services.cfg

###############################################################################
# Generated by op5 Monitor webconfiguration exporter
#
# Exported 2009-10-22 19:33 by monitor
#


# service template 'Dummy-service-template'

define service{

    name                           Dummy-service-template

    display_name                   Dummy-service-template

    is_volatile                    0

    check_command                  check_dummy!0

    initial_state                  o

    max_check_attempts             1

    check_interval                 1

    retry_interval                 1

    active_checks_enabled          0

    passive_checks_enabled         1

    check_period                   24x7

    parallelize_check              1

    obsess_over_service            1

    check_freshness                0

    event_handler_enabled          1

    flap_detection_enabled         1

    flap_detection_options         n

    process_perf_data              1

    retain_status_information      1

    retain_nonstatus_information   1

    notification_interval          0

    notification_period            24x7

    notification_options           c,w,u,r,f

    notifications_enabled          1

    stalking_options               n

    register                       0

    }
# service template 'default-service'

define service{

    name                           default-service

    is_volatile                    0

    max_check_attempts             3

    check_interval                 5

    retry_interval                 1

    active_checks_enabled          1

    passive_checks_enabled         1

    check_period                   24x7

    event_handler_enabled          1

    flap_detection_enabled         1

    process_perf_data              1

    retain_status_information      1

    retain_nonstatus_information   1

    notification_interval          0

    notification_period            24x7

    notification_options           c,w,u,r,f,s

    notifications_enabled          1

    contact_groups                 support-group

    register                       0

    }

####################################################

#

# Services for host app-host-a

#
# service 'appa'

define service{

    use                            default-service

    host_name                      app-host-a

    service_description            appa

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    parallelize_check              0

    obsess_over_service            0

    check_freshness                0

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    flap_detection_options         n

    contact_groups                 support-group

    stalking_options               n

    }
####################################################

#

# Services for host app-host-b

#
# service 'appb'

define service{

    use                            default-service

    host_name                      app-host-b

    service_description            appb

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host db-host-a

#
# service 'dba'

define service{

    use                            default-service

    host_name                      db-host-a

    service_description            dba

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host db-host-b

#
# service 'dbb'

define service{

    use                            default-service

    host_name                      db-host-b

    service_description            dbb

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host db-host-c

#
# service 'dbc'

define service{

    use                            default-service

    host_name                      db-host-c

    service_description            dbc

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host web-host-a

#
# service 'weba'

define service{

    use                            default-service

    host_name                      web-host-a

    service_description            weba

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host web-host-b

#
# service 'webb'

define service{

    use                            default-service

    host_name                      web-host-b

    service_description            webb

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host web-host-c

#
# service 'webc'

define service{

    use                            default-service

    host_name                      web-host-c

    service_description            webc

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host web-host-d

#
# service 'webd'

define service{

    use                            default-service

    host_name                      web-host-d

    service_description            webd

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host web-host-e

#
# service 'webe'

define service{

    use                            default-service

    host_name                      web-host-e

    service_description            webe

    check_command                  check_dummy!0

    servicegroups                  webshop

    max_check_attempts             1

    event_handler                  eventhandler_send_to_nodebrain

    flap_detection_enabled         0

    }
####################################################

#

# Services for host webshopcontainer

#
# service 'appserversstatus'

define service{

    use                            Dummy-service-template

    host_name                      webshopcontainer

    service_description            appserversstatus

    servicegroups                  webshop

    flap_detection_enabled         0

    }
# service 'dbserversstatus'

define service{

    use                            Dummy-service-template

    host_name                      webshopcontainer

    service_description            dbserversstatus

    servicegroups                  webshop

    flap_detection_enabled         0

    }
# service 'webserversstatus'

define service{

    use                            Dummy-service-template

    host_name                      webshopcontainer

    service_description            webserversstatus

    servicegroups                  webshop

    flap_detection_enabled         0

    }

# service 'webshopstatus'

define service{
    use                            Dummy-service-template
    host_name                      webshopcontainer
    service_description            webshopstatus
    servicegroups                  webshop
    flap_detection_enabled         0
    }

8 Responses to “Rule engine integration with Nagios using NodeBrain”

Matthias Flacke Says:
October 22nd, 2009 at 11:29 pm
Hi,

using the Nagios plugin check_multi you could do the whole stuff
pretty easy 😉
check_multi uses perl expressions to do the state evaluation and
is therefore flexible and powerful.

It took me about 5 minutes to write down the sketch of these four services below according to your rules, where there are three services for the server types and one top level service for the webshop itself which ties everything together.

You can find check_multi here:
http://www.my-plugin.de/wiki/projects/check_multi/start

Cheers,
-Matthias

> * Webserver rules
> o If 3 or more webserver works the webservice is OK
> o If 2 webservers works the webservice is WARNING
> o If 1 webserver or less is working the webservice is CRITICAL
> * Applicationserver rules
> o If 1 or 2 application servers works the application layer is OK
> o If zero application servers works the application layer is CRITICAL
> * Database server rules
> o If 2 or more database server works the database layer is OK
> o if 1 database server works the database layer is WARNING
> o If no database servers works the database layer is CRITICAL
> * The webserver layer, application layer and database layer should be viewed seperatly
> * The total webshop status has the highest status value of webserver layer, application layer a

web.cmd:
# call: check_multi -f web.cmd
statusdat [ web1 ] = webserver1:webservice1
statusdat [ web2 ] = webserver2:webservice2
statusdat [ web3 ] = webserver3:webservice3
statusdat [ web4 ] = webserver4:webservice4
statusdat [ web5 ] = webserver5:webservice5
state [ WARNING ] = count(OK)<=2
state [ CRITICAL ] = count(OK)<=1

app.cmd:
# call: check_multi -f app.cmd
statusdat [ app1 ] = appserver1:appservice1
statusdat [ app2 ] = appserver2:appservice2
state [ CRITICAL ] = count(OK)<=1

db.cmd:
# call: check_multi -f db.cmd
statusdat [ db1 ] = dbserver1:dbservice1
statusdat [ db2 ] = dbserver2:dbservice2
state [ WARNING ] = count(OK)<=2
state [ CRITICAL ] = count(OK)<=1

webshop.cmd:
# call: check_multi -f webshop.cmd
statusdat [ web ] = nagiosserver:web
statusdat [ app ] = nagiosserver:app
statusdat [ db ] = nagiosserver:db
peter Says:
October 23rd, 2009 at 9:41 am
Hi Matthias,

Yes, you are right, check_multi would be easier to use in this case.

The purpose of the article was to show how to integrate Nagios with NodeBrain, not to be the perfect implementation of the webshop scenario.

Imho the biggest lack in Nagios is that it does not have a rule engine. In most cases one is not necessary, but in some cases it is needed. Solutions like check_multi and check_cluster can help a bit. But if you need more advanced rules with, for example, correlations over time, logs, SNMP traps and so on, you need a rule engine.

My experience is that the managers who want a business view of the environment and the people implementing Nagios do not speak with each other. An advanced rule engine could bridge that gap by attracting business consultants who normally work with the Big Four. Solutions like this are seldom a technical problem.
Tracy R Reed Says:
November 9th, 2009 at 9:01 am
NodeBrain sounds an awful lot like Prolog. Why wouldn’t one just use Prolog? It is much more mature and has lots of documentation. It seems to have been created for just this very thing decades ago.
peter Says:
November 9th, 2009 at 9:28 am
Feel free to use Prolog if you want 🙂
As a former Tivoli consultant I have used Prolog to program Tivoli Enterprise Console, and yes, Prolog is probably going to do the job. Personally I prefer NodeBrain: after a few hours with NodeBrain I could do more than I could do with T/EC Prolog after a week of training.
Khark Says:
May 5th, 2010 at 12:30 pm
Hi,

NodeBrain looks nice and I know check_multi before but I use the Nagios Addon “Business Process View”.
It has the same abilities and, from my point of view, is much easier to deploy.
See: http://nagiosbp.projects.nagiosforge.org/

It also has a Impact Analys Tool where you can set the state of a service to see the Impacts on your defined processes.

Integrating this processes in Nagios or NagVis is also possible via the bp_cfg2service_cfg.pl that comes with Business Process View.

Cheers,
Khark
Chet Says:
March 1st, 2011 at 12:52 am
Peter great tutorial. We are taking a look at nodebrian for our environment but experiencing install trouble/

OS CentOS release 5.5
uname -a Linux 2.6.18-194.26.1.el5 #1 SMP Tue Nov 9 12:54:20 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
from ./configure
….
checking for pcre_compile in -lpcre… no
configure: error: Required library -lpcre not found. You may want to download it from http://www.pcre.org or locate it and include directory in LD_LIBRARY_PATH to support this build.
configure: error: ./configure failed for lib

Setting ld_library_path does not seem to help.

Did you experience anything like this?
Do you have any suggestions or recommended resources that might help us resolve it?
peter Says:
March 4th, 2011 at 9:49 am
I have the same experience on CentOS 5.4 and 5.5. I never figured out how to get it to compile.
I solved it by running NodeBrain on an Ubuntu box instead, not a good solution…
Edgar Says:
July 2nd, 2012 at 3:27 pm
$ sudo yum whatprovides \*/libpcre\*
pcre-devel-6.6-6.el5_6.1.x86_64 : Development files for pcre
Repo        : base
Matched from:
Filename    : /usr/lib64/libpcre.so
Filename    : /usr/lib64/libpcrecpp.so
Filename    : /usr/lib64/libpcreposix.so
Filename    : /usr/lib64/libpcre.a
Filename    : /usr/lib64/pkgconfig/libpcre.pc
Filename    : /usr/lib64/libpcreposix.a
Filename    : /usr/lib64/libpcrecpp.a

$ sudo yum -y install pcre pcre-devel


Filled Under: Cool things, english, Hints, Nagios, op5 Monitor, sysadmin
