Welcome to It-Slav.Net blog
Peter Andersson
peter@it-slav.net


Background

This article describes how easily other tools can be integrated with Nagios or op5 Monitor. As an example I use a webshop, where a business view of how the webshop is doing is implemented with NodeBrain, a GPL’d rule engine. I described the ruleset for this implementation in an earlier article; here I show how the integration is done.

 

Scenario

The scenario is a webshop with:

  • 5 frontend webservers
  • 2 application servers
  • 3 database servers

Management wants to monitor how the webshop is doing. They do not want to know whether a single redundant part is down; they want an overview of the webshop status.

A management consultant is hired, does an investigation and, after a ridiculous amount of money, defines the following rules:

  • Webserver rules
    • If 3 or more webservers work, the webserver layer is OK
    • If 2 webservers work, the webserver layer is WARNING
    • If 1 webserver or none works, the webserver layer is CRITICAL
  • Application server rules
    • If 1 or 2 application servers work, the application layer is OK
    • If no application server works, the application layer is CRITICAL
  • Database server rules
    • If 2 or more database servers work, the database layer is OK
    • If 1 database server works, the database layer is WARNING
    • If no database server works, the database layer is CRITICAL
  • The webserver layer, application layer and database layer should be viewed separately
  • The total webshop status is the most severe status (CRITICAL over WARNING over OK) of the webserver layer, application layer and database layer
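
The rules above can be sketched as a few shell functions. This is only an illustration of the rule logic, not part of the NodeBrain implementation shown later; the function names are made up for this example, and the status codes assume the usual Nagios convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```shell
#!/bin/sh
# Sketch of the consultant's rules (hypothetical helper functions).
# Status codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.

# web_status WORKING: status of the webserver layer (5 servers).
web_status() {
    if [ "$1" -ge 3 ]; then echo 0
    elif [ "$1" -eq 2 ]; then echo 1
    else echo 2
    fi
}

# app_status WORKING: status of the application layer (2 servers).
app_status() {
    if [ "$1" -ge 1 ]; then echo 0; else echo 2; fi
}

# db_status WORKING: status of the database layer (3 servers).
db_status() {
    if [ "$1" -ge 2 ]; then echo 0
    elif [ "$1" -eq 1 ]; then echo 1
    else echo 2
    fi
}

# webshop_status WEB APP DB: the most severe of the layer statuses.
webshop_status() {
    worst=0
    for s in "$@"; do
        if [ "$s" -gt "$worst" ]; then worst=$s; fi
    done
    echo "$worst"
}

# 3 webservers, 1 application server and 1 database server are working:
webshop_status "$(web_status 3)" "$(app_status 1)" "$(db_status 1)"   # prints 1 (WARNING)
```

With only one working webserver the webserver layer becomes CRITICAL (2), and so does the total status, whatever the other layers report.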

I use NagVis to illustrate the relationship between the layers.

 

Case 1

The picture shows when everything is fine:

Screenshot3

Case 2

The rules in place:

  • 2 webservers are CRITICAL and the webserver layer is OK
  • 1 application server is CRITICAL and the application layer is OK
  • 2 databases are down and the database layer is WARNING
  • The total webshop status is WARNING because that is the highest status among the layers

Screenshot4

 

Case 3

Now things have gotten even worse:

  • 4 webservers are CRITICAL and the webserver layer is CRITICAL
  • 1 application server is CRITICAL and the application layer is OK
  • 2 databases are down and the database layer is WARNING
  • The total webshop status is CRITICAL because that is the highest status among the layers

 Screenshot5

 

Conclusions

This article shows the power of Open Source and what is possible when different projects are integrated with each other. A solution like this from one of the Big Four (IBM, BMC, CA, HP) would have cost a lot in licenses, and highly specialised consultants would have had to be hired.

 

Links

  • op5, a company that packages and supports enterprise-class systems and network management products
  • NodeBrain, a powerful GPL’d rule engine
  • Nagios, enterprise-class monitoring software
  • NagVis, a Nagios visualization addon

 

 

Implementation

Hosts and services

The hosts and services are created:

  • Webserver layer: 5 hosts each with 1 service
  • Application layer: 2 hosts each with 1 service
  • Database layer: 3 hosts each with 1 service
  • Webshop layer: 1 host called webshopcontainer with 4 services: webserversstatus, appserversstatus, dbserversstatus and webshopstatus. The first three represent the layers in the model and webshopstatus is the total status of the webshop.

To make it easy to control the status of all these devices I will use passive checks. So if I want to change the status of a service I just use the GUI and send in a passive check result. In real life active checks would have been used to monitor the different services.
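
The same passive check result can also be submitted from a shell instead of the GUI. The following is a minimal sketch, assuming the external command file path used by the script later in this article; the function name passive_result_line and the message text are made up for the example.

```shell
#!/bin/sh
# passive_result_line HOST SERVICE STATUS MESSAGE [TIMESTAMP]
# Prints one external command line in the format Nagios expects:
#   [time] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
# TIMESTAMP defaults to the current time in seconds since the epoch.
passive_result_line() {
    ts=${5:-$(date +%s)}
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$ts" "$1" "$2" "$3" "$4"
}

# Example: report service weba on web-host-a as CRITICAL (return code 2).
# On a real monitor host, remove the "#" before the redirection to write the
# line to the command file instead of printing it.
passive_result_line web-host-a weba 2 "CRITICAL: simulated outage" # > /opt/monitor/var/rw/nagios.cmd
```

Printing the line first makes it easy to check host and service names before anything is written to the command file.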

 

The result, shown in the Service Detail view of Ninja, the Nagios GUI developed by op5:

screenshoot2

 

The state changes are sent to NodeBrain by an event handler that writes NodeBrain commands to a named pipe:

#!/bin/sh
#
# Event handler script for sending Nagios data to NodeBrain
#
# Arguments: $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ $HOSTNAME$
# Note: the case statement below matches state names (OK, CRITICAL, ...), so the
# first argument must be $SERVICESTATE$, not the numeric $SERVICESTATEID$.
NODEBRAINPIPE=/opt/plugins/custom/nodebrainpipe
LOGFILE=/tmp/eventhandler_out
DATE=$(date)
echo "$DATE SERVICESTATE=$1 SERVICESTATETYPE=$2 SERVICEATTEMPT=$3 SERVICEDESC=$4 HOSTNAME=$5" >> "$LOGFILE"

# What state is the service in?
case "$1" in
OK)
    # The service just came back up: tell NodeBrain it is OK (0)
    echo "assert $4=0;" > "$NODEBRAINPIPE"
    echo "assert $4=0;" >> "$LOGFILE"
    ;;
WARNING)
    # WARNING states are ignored, since the service is probably still running
    ;;
UNKNOWN)
    # We don't know what might be causing an UNKNOWN state, so don't do anything
    ;;
CRITICAL)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        # Soft state: Nagios is in the middle of retrying the check before it
        # turns into a hard state and contacts get notified. Don't do anything.
        ;;
    HARD)
        # Hard state: tell NodeBrain the service is CRITICAL (2)
        echo "assert $4=2;" > "$NODEBRAINPIPE"
        echo "assert $4=2;" >> "$LOGFILE"
        ;;
    esac
    ;;
esac

exit 0
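
The decision logic of the event handler can be tried out without Nagios or a named pipe. The sketch below repeats only the mapping from state and state type to a NodeBrain command; the function name nodebrain_command is made up for this example, and it assumes the first argument is the state name, since that is what the case statement in the handler matches. The command definition in the comment is likewise an assumption, because the article does not show it.

```shell
#!/bin/sh
# nodebrain_command STATE STATETYPE SERVICEDESC
# Prints the NodeBrain command the handler would send, or nothing when the
# state change is ignored (WARNING, UNKNOWN and soft CRITICAL states).
nodebrain_command() {
    case "$1" in
    OK)
        echo "assert $3=0;"
        ;;
    CRITICAL)
        # Only hard CRITICAL states are forwarded to NodeBrain.
        if [ "$2" = "HARD" ]; then
            echo "assert $3=2;"
        fi
        ;;
    esac
}

# Hypothetical Nagios command definition that would pass matching arguments
# to the handler (the script file name is an assumption):
#   define command{
#       command_name  eventhandler_send_to_nodebrain
#       command_line  /opt/plugins/custom/eventhandler_send_to_nodebrain.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$ $HOSTNAME$
#   }

# Dry run: weba fails softly, then hard, then recovers.
nodebrain_command CRITICAL SOFT weba   # prints nothing
nodebrain_command CRITICAL HARD weba   # prints: assert weba=2;
nodebrain_command OK HARD weba         # prints: assert weba=0;
```

Because the service description is used directly as the NodeBrain variable name, the service names in Nagios have to match the variables asserted in the rule file.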

 

 

The NodeBrain rules, which contain all the logic and open the named pipe for commands:

#!/usr/local/bin/nb -d
#
-rm webshop.log
set log="webshop.log",out=".";
declare indata identity owner;
define webshop node pipe.server("indata@nodebrainpipe");

#Start with webservers
#Set OK
assert weba=0;
assert webb=0;
assert webc=0;
assert webd=0;
assert webe=0;
assert  webserversstatus=0;

#define webserver rules
#5 frontwebservers, if 3 or more ok status is ok
#if 2 is ok, status warning
#if 1 or 0 ok, status critical
define webservers cell weba+webb+webc+webd+webe;

define webserversok on(webservers<=4) webserversstatus=0;
define webseroksend on(webserversstatus=0):-./send_to_monitor.sh webserversstatus 0 "OK: Webservers are fine";

define webserverswarning on(webservers>4 and webservers<8) webserversstatus=1;
define webserverswarningsend on(webserversstatus=1):-./send_to_monitor.sh webserversstatus 1 "WARNING: Webservers have problems";

define webserverscritical on(webservers>=8) webserversstatus=2;
define webserverscriticalsend on(webserversstatus=2):-./send_to_monitor.sh webserversstatus 2 "CRITICAL: Webservers have serious problems ";

#appservers
assert appa=0;
assert appb=0;
assert appserversstatus=0;
#2 appservers, 1 down is ok, 2 down critical
define appservers cell appa+appb;

define appserversok on(appservers<=2) appserversstatus=0;
define appserversoksend on(appserversstatus=0):-./send_to_monitor.sh appserversstatus 0 "OK: Appservers are fine";

define appserverscritical on(appservers>2) appserversstatus=2;
define appserverscriticalsend on(appserversstatus=2):-./send_to_monitor.sh appserversstatus 2 "CRITICAL: Appservers have serious problems";

#Databaseservers
assert dba=0;
assert dbb=0;
assert dbc=0;
assert dbserversstatus=0;
#3 db servers
#if 2 or more ok, status ok
#if 1 ok, status warning
#if 0 ok, status critical
define dbservers cell dba+dbb+dbc;

define dbserversok on(dbservers<=2) dbserversstatus=0;
define dbserversoksend on(dbserversstatus=0):-./send_to_monitor.sh dbserversstatus 0 "OK: DataBaseservers are fine";

define dbserverswarning on(dbservers>=4 and dbservers <6)dbserversstatus=1;
define dbserverswarningsend on(dbserversstatus=1):-./send_to_monitor.sh dbserversstatus 1 "WARNING: DataBaseservers have problems";

define dbservercritical on(dbservers>=6)dbserversstatus=2;
define dbservercriticalsend on(dbserversstatus=2):-./send_to_monitor.sh dbserversstatus 2 "CRITICAL: DataBaseservers have serious problems";

#Total rules
assert webshopstatus=0;
#If all serverstatus ok, the whole webshop is ok
define webshopok on(webserversstatus=0 and appserversstatus=0 and dbserversstatus=0) webshopstatus=0;
define webshopoksend on(webshopstatus=0):-./send_to_monitor.sh webshopstatus 0 "OK: Webshop is fine";
#If any serverstatus is critical the whole webshop is critical
define webshopscritical on(webserversstatus=2 or appserversstatus=2 or dbserversstatus=2) webshopstatus=2;
define webshopscriticalsend on(webshopstatus=2):-./send_to_monitor.sh webshopstatus 2 "CRITICAL: Webshop has serious problems";
#If not any serverstatuscritical and in warning, the whole shop is warning.
define webshopwarning on((!webserversstatus=2 and !appserversstatus=2 and !dbserversstatus=2) and (webserversstatus=1 or dbserversstatus=1)) webshopstatus=1;
define webshopwarningsend on(webshopstatus=1):-./send_to_monitor.sh webshopstatus 1 "WARNING: Webshop has some problems";
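
The thresholds 4 and 8 in the webserver rules may look odd at first. The event handler asserts 0 for an OK service and 2 for a hard CRITICAL one, so the cell webservers is twice the number of failed webservers. The following sketch, written in shell rather than NodeBrain and using a made-up function name, prints the resulting status for every possible number of failed webservers.

```shell
#!/bin/sh
# Each service variable is 0 (OK) or 2 (hard CRITICAL), so the sum of the
# five webserver variables is 2 * (number of webservers down).

# web_sum_status SUM: apply the thresholds of the webserver rules to SUM.
web_sum_status() {
    if [ "$1" -le 4 ]; then echo 0      # on(webservers<=4)
    elif [ "$1" -lt 8 ]; then echo 1    # on(webservers>4 and webservers<8)
    else echo 2                         # on(webservers>=8)
    fi
}

# Print one row per possible number of failed webservers.
down=0
while [ "$down" -le 5 ]; do
    sum=$((down * 2))
    echo "down=$down working=$((5 - down)) webservers=$sum status=$(web_sum_status "$sum")"
    down=$((down + 1))
done
```

The rows show status 0 for up to two failed webservers, status 1 for three and status 2 for four or five, which matches the rules from the scenario. The application and database rules use the same encoding with their own thresholds.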

 

 

The NodeBrain rules run this script when they fire:

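#!/bin/sh
# Submit a passive check result for a service on webshopcontainer
# Arguments: <service description> <status 0|1|2> <message>

HOSTNAME=webshopcontainer
SERVICEDESC=$1
STATUS=$2
MESSAGE=$3

now=$(date +%s)
commandfile='/opt/monitor/var/rw/nagios.cmd'
# Pass the values as printf arguments so a "%" in the message cannot break the format
/usr/bin/printf "[%lu] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n" "$now" "$HOSTNAME" "$SERVICEDESC" "$STATUS" "$MESSAGE" > "$commandfile"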

 

The Nagios or op5 Monitor hosts.cfg:

###############################################################################
#  Generated by op5 Monitor webconfiguration exporter
#
#  Exported 2009-10-22 19:33 by monitor
#

# host template 'Dummy-template'
define host{
    name                           Dummy-template
    initial_state                  o
    hostgroups                     NodeBrainDemo
    check_command                  check-host-alive
    max_check_attempts             5
    check_interval                 5
    retry_interval                 1
    obsess_over_host               0
    check_freshness                0
    active_checks_enabled          1
    passive_checks_enabled         1
    event_handler_enabled          1
    flap_detection_enabled         1
    flap_detection_options         n
    process_perf_data              1
    retain_status_information      1
    retain_nonstatus_information   1
    notification_interval          0
    notification_period            24x7
    notification_options           d,u,r,f
    notifications_enabled          1
    stalking_options               n
    register                       0
    }

# host template 'default-host-template'
define host{
    name                           default-host-template
    check_command                  check-host-alive
    max_check_attempts             3
    check_interval                 5
    retry_interval                 0
    check_period                   24x7
    active_checks_enabled          1
    passive_checks_enabled         1
    event_handler_enabled          1
    flap_detection_enabled         1
    process_perf_data              1
    retain_status_information      1
    retain_nonstatus_information   1
    notification_interval          0
    notification_period            24x7
    notification_options           d,u,r,f,s
    notifications_enabled          1
    register                       0
    }

# host 'app-host-a'
define host{
    use                            Dummy-template
    host_name                      app-host-a
    alias                          App Host A
    address                        127.0.0.1
    hostgroups                     NodeBrainDemo
    contact_groups                 support-group
    }

# host 'app-host-b'
define host{
    use                            Dummy-template
    host_name                      app-host-b
    alias                          App Host B
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'db-host-a'
define host{
    use                            Dummy-template
    host_name                      db-host-a
    alias                          DB Host A
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'db-host-b'
define host{
    use                            Dummy-template
    host_name                      db-host-b
    alias                          DB Host B
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'db-host-c'
define host{
    use                            Dummy-template
    host_name                      db-host-c
    alias                          DB Host C
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'web-host-a'
define host{
    use                            Dummy-template
    host_name                      web-host-a
    alias                          Web Host A
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'web-host-b'
define host{
    use                            Dummy-template
    host_name                      web-host-b
    alias                          Web Host B
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'web-host-c'
define host{
    use                            Dummy-template
    host_name                      web-host-c
    alias                          Web Host C
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'web-host-d'
define host{
    use                            Dummy-template
    host_name                      web-host-d
    alias                          Web Host D
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'web-host-e'
define host{
    use                            Dummy-template
    host_name                      web-host-e
    alias                          Web Host E
    address                        127.0.0.1
    contact_groups                 support-group
    }

# host 'webshopcontainer'
define host{
    use                            Dummy-template
    host_name                      webshopcontainer
    alias                          webshopcontainer
    address                        127.0.0.1
    contact_groups                 support-group
    }

 

 

The Nagios or op5 Monitor services.cfg:

###############################################################################
#  Generated by op5 Monitor webconfiguration exporter
#
#  Exported 2009-10-22 19:33 by monitor
#

# service template 'Dummy-service-template'
define service{
    name                           Dummy-service-template
    display_name                   Dummy-service-template
    is_volatile                    0
    check_command                  check_dummy!0
    initial_state                  o
    max_check_attempts             1
    check_interval                 1
    retry_interval                 1
    active_checks_enabled          0
    passive_checks_enabled         1
    check_period                   24x7
    parallelize_check              1
    obsess_over_service            1
    check_freshness                0
    event_handler_enabled          1
    flap_detection_enabled         1
    flap_detection_options         n
    process_perf_data              1
    retain_status_information      1
    retain_nonstatus_information   1
    notification_interval          0
    notification_period            24x7
    notification_options           c,w,u,r,f
    notifications_enabled          1
    stalking_options               n
    register                       0
    }

# service template 'default-service'
define service{
    name                           default-service
    is_volatile                    0
    max_check_attempts             3
    check_interval                 5
    retry_interval                 1
    active_checks_enabled          1
    passive_checks_enabled         1
    check_period                   24x7
    event_handler_enabled          1
    flap_detection_enabled         1
    process_perf_data              1
    retain_status_information      1
    retain_nonstatus_information   1
    notification_interval          0
    notification_period            24x7
    notification_options           c,w,u,r,f,s
    notifications_enabled          1
    contact_groups                 support-group
    register                       0
    }

####################################################
#
# Services for host app-host-a
#

# service 'appa'
define service{
    use                            default-service
    host_name                      app-host-a
    service_description            appa
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    parallelize_check              0
    obsess_over_service            0
    check_freshness                0
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    flap_detection_options         n
    contact_groups                 support-group
    stalking_options               n
    }

####################################################
#
# Services for host app-host-b
#

# service 'appb'
define service{
    use                            default-service
    host_name                      app-host-b
    service_description            appb
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host db-host-a
#

# service 'dba'
define service{
    use                            default-service
    host_name                      db-host-a
    service_description            dba
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host db-host-b
#

# service 'dbb'
define service{
    use                            default-service
    host_name                      db-host-b
    service_description            dbb
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host db-host-c
#

# service 'dbc'
define service{
    use                            default-service
    host_name                      db-host-c
    service_description            dbc
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host web-host-a
#

# service 'weba'
define service{
    use                            default-service
    host_name                      web-host-a
    service_description            weba
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host web-host-b
#

# service 'webb'
define service{
    use                            default-service
    host_name                      web-host-b
    service_description            webb
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host web-host-c
#

# service 'webc'
define service{
    use                            default-service
    host_name                      web-host-c
    service_description            webc
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host web-host-d
#

# service 'webd'
define service{
    use                            default-service
    host_name                      web-host-d
    service_description            webd
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host web-host-e
#

# service 'webe'
define service{
    use                            default-service
    host_name                      web-host-e
    service_description            webe
    check_command                  check_dummy!0
    servicegroups                  webshop
    max_check_attempts             1
    event_handler                  eventhandler_send_to_nodebrain
    flap_detection_enabled         0
    }

####################################################
#
# Services for host webshopcontainer
#

# service 'appserversstatus'
define service{
    use                            Dummy-service-template
    host_name                      webshopcontainer
    service_description            appserversstatus
    servicegroups                  webshop
    flap_detection_enabled         0
    }

# service 'dbserversstatus'
define service{
    use                            Dummy-service-template
    host_name                      webshopcontainer
    service_description            dbserversstatus
    servicegroups                  webshop
    flap_detection_enabled         0
    }

# service 'webserversstatus'
define service{
    use                            Dummy-service-template
    host_name                      webshopcontainer
    service_description            webserversstatus
    servicegroups                  webshop
    flap_detection_enabled         0
    }

# service 'webshopstatus'
define service{
    use                            Dummy-service-template
    host_name                      webshopcontainer
    service_description            webshopstatus
    servicegroups                  webshop
    flap_detection_enabled         0
    }

 

 


8 Responses to “Rule engine integration with Nagios using NodeBrain”

  1. Matthias Flacke Says:

    Hi,

    using the Nagios plugin check_multi you could do the whole stuff
    pretty easy 😉
    check_multi uses perl expressions to do the state evaluation and
    is therefore flexible and powerful.

    It took me about 5 minutes to write down the sketch of these four services below according to your rules, where there are three services for the server types and one top level service for the webshop itself which ties everything together.

    You can find check_multi here:
    http://www.my-plugin.de/wiki/projects/check_multi/start

    Cheers,
    -Matthias

    > * Webserver rules
    > o If 3 or more webserver works the webservice is OK
    > o If 2 webservers works the webservice is WARNING
    > o If 1 webserver or less is working the webservice is CRITICAL
    > * Applicationserver rules
    > o If 1 or 2 application servers works the application layer is OK
    > o If zero application servers works the application layer is CRITICAL
    > * Database server rules
    > o If 2 or more database server works the database layer is OK
    > o if 1 database server works the database layer is WARNING
    > o If no database servers works the database layer is CRITICAL
    > * The webserver layer, application layer and database layer should be viewed seperatly
    > * The total webshop status has the highest status value of webserver layer, application layer a

    web.cmd:
    # call: check_multi -f web.cmd
    statusdat [ web1 ] = webserver1:webservice1
    statusdat [ web2 ] = webserver2:webservice2
    statusdat [ web3 ] = webserver3:webservice3
    statusdat [ web4 ] = webserver4:webservice4
    statusdat [ web5 ] = webserver5:webservice5
    state [ WARNING ] = count(OK)<=2
    state [ CRITICAL ] = count(OK)<=1

    app.cmd:
    # call: check_multi -f app.cmd
    statusdat [ app1 ] = appserver1:appservice1
    statusdat [ app2 ] = appserver2:appservice2
    state [ CRITICAL ] = count(OK)<1

    db.cmd:
    # call: check_multi -f db.cmd
    statusdat [ db1 ] = dbserver1:dbservice1
    statusdat [ db2 ] = dbserver2:dbservice2
    statusdat [ db3 ] = dbserver3:dbservice3
    state [ WARNING ] = count(OK)<=1
    state [ CRITICAL ] = count(OK)<1

    webshop.cmd:
    # call: check_multi -f webshop.cmd
    statusdat [ web ] = nagiosserver:web
    statusdat [ app ] = nagiosserver:app
    statusdat [ db ] = nagiosserver:db

  2. peter Says:

    Hi Matthias,

    Yes you are right, using check_multi would be easier to use in this case.

    The purpose of the article was to show how to integrate Nagios with NodeBrain, not be the perfect implementation of the webshop scenario.

    Imho the biggest gap in Nagios is that it does not have a rule engine. In most cases one is not necessary, but in some cases it is needed. Solutions like check_multi and check_cluster can help a bit, but if you need more advanced rules, for example correlation over time, logs, SNMP traps and so on, you need a rule engine.

    My experience is that the managers who want a business view of the environment and the people implementing Nagios do not speak with each other. An advanced rule engine could bridge that gap by attracting business consultants who normally work with the Big Four. Solutions like this are seldom a technical problem.

  3. Tracy R Reed Says:

    NodeBrain sounds an awful lot like Prolog. Why wouldn’t one just use Prolog? It is much more mature and has lots of documentation. It seems to have been created for just this very thing decades ago.

  4. peter Says:

    Feel free to use prolog if you want 🙂
    As a former Tivoli consultant I have used Prolog to program Tivoli Enterprise Console, and yes, Prolog will probably do the job. Personally I prefer NodeBrain: after a few hours with NodeBrain I could do more than I could do with T/EC Prolog after a week of training.

  5. Khark Says:

    Hi,

    NodeBrain looks nice and I knew check_multi before, but I use the Nagios addon “Business Process View”.
    It has the same abilities and, from my point of view, is much easier to deploy.
    See: http://nagiosbp.projects.nagiosforge.org/

    It also has an Impact Analysis Tool where you can set the state of a service to see the impact on your defined processes.

    Integrating these processes in Nagios or NagVis is also possible via the bp_cfg2service_cfg.pl script that comes with Business Process View.

    Cheers,
    Khark

  6. Chet Says:

    Peter, great tutorial. We are taking a look at NodeBrain for our environment but are experiencing install trouble.

    OS CentOS release 5.5
    uname -a Linux 2.6.18-194.26.1.el5 #1 SMP Tue Nov 9 12:54:20 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
    from ./configure
    ….
    checking for pcre_compile in -lpcre… no
    configure: error: Required library -lpcre not found. You may want to download it from http://www.pcre.org or locate it and include directory in LD_LIBRARY_PATH to support this build.
    configure: error: ./configure failed for lib

    Setting ld_library_path does not seem to help.

    Did you experience anything like this?
    Do you have any suggestions or recommended resources that might help us resolve it?

  7. peter Says:

    I had the same experience on CentOS 5.4 and 5.5. I never figured out how to get it to compile.
    I solved it by running NodeBrain on an Ubuntu box instead, not a good solution…

  8. Edgar Says:

    $ sudo yum whatprovides \*/libpcre\*
    pcre-devel-6.6-6.el5_6.1.x86_64 : Development files for pcre
    Repo        : base
    Matched from:
    Filename    : /usr/lib64/libpcre.so
    Filename    : /usr/lib64/libpcrecpp.so
    Filename    : /usr/lib64/libpcreposix.so
    Filename    : /usr/lib64/libpcre.a
    Filename    : /usr/lib64/pkgconfig/libpcre.pc
    Filename    : /usr/lib64/libpcreposix.a
    Filename    : /usr/lib64/libpcrecpp.a

    $ sudo yum -y install pcre pcre-devel
