Background
When I worked as a Tivoli consultant I spent a lot of time investigating the customer's processes and workflows and trying to make the monitoring solution support them. These were often long-term projects that involved a lot of people: project managers, support staff, maintenance staff, application owners, application engineers, operating system managers, DBAs and so on.
Very often the investigation followed these steps:
- Investigate the problem management workflow
- Document the different parts that make up the system
- Find spots where a probe could be inserted to monitor a particular part in the system
- Define the thresholds for the probe
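- Repeat the previous two steps (finding probe spots and defining thresholds) until all possible problems in the system can be detected
- Define who should receive which alarm, and when, if the probes show abnormal behavior
- Write the ruleset that implements the alarm definitions from the previous step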
- Test
- Document and hand over to the customer
When it comes to a product like Nagios or op5 Monitor, the product has a built-in rule engine that fulfills most of the requirements in the steps above. In most cases it is just a matter of configuration, and in some cases some extra scripting.
Solution
I have run into rare cases where the built-in rule engine is not good enough, and I have looked for a rule engine that fulfills the following requirements:
- GNU General Public License (GPL) or another acceptable license
- Standalone
- Advanced enough
- Possible to integrate with other solutions like Nagios or op5 Monitor
Now I think I have found a good candidate, NodeBrain, and right now I’m testing it.
I will write a follow-up article describing how to integrate NodeBrain with Nagios or op5 Monitor.
An example
This is an example of what a rule engine can do:
Webshop example
- 5 web frontends: 1 or 2 down is OK, 3 down is WARNING, 4 or 5 down is CRITICAL
- 2 app servers: 1 down is OK, 2 down is CRITICAL
- 3 database backends: 1 down is OK, 2 down is WARNING, 3 down is CRITICAL
- Overall, the layer with the highest severity determines the total severity.
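Before looking at the ruleset, the requirements above can be written down as a small Python sketch. This is only an illustration of the intended logic, not NodeBrain code; the names `Severity`, `layer_severity` and `webshop_severity` are made up for this example, and the numeric values 0, 1 and 2 follow the status values used later in the ruleset.

```python
from enum import IntEnum


class Severity(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def layer_severity(down, warning_at, critical_at):
    """Map the number of servers that are down in one layer to a severity.

    warning_at is None for a layer that has no WARNING level.
    """
    if down >= critical_at:
        return Severity.CRITICAL
    if warning_at is not None and down >= warning_at:
        return Severity.WARNING
    return Severity.OK


def webshop_severity(web_down, app_down, db_down):
    """The overall severity is the worst severity of the three layers."""
    return max(
        layer_severity(web_down, warning_at=3, critical_at=4),     # 5 web frontends
        layer_severity(app_down, warning_at=None, critical_at=2),  # 2 app servers
        layer_severity(db_down, warning_at=2, critical_at=3),      # 3 database backends
    )
```

For instance, `webshop_severity(3, 0, 0)` evaluates to `Severity.WARNING`, and `webshop_severity(0, 2, 0)` to `Severity.CRITICAL`.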
Implementation:
The ruleset
#Start with the web servers
#Set all web servers to OK (0); the test session asserts 2 for a server that is down
assert weba=0;
assert webb=0;
assert webc=0;
assert webd=0;
assert webe=0;
assert webserversstatus=0;
#Define the web server rules
#5 web frontends: if 3 or more are OK, the status is OK
#if 2 are OK, the status is WARNING
#if 1 or 0 are OK, the status is CRITICAL
define webservers cell weba+webb+webc+webd+webe;
define webserversok on(webservers<=4) webserversstatus=0;
define webserverswarning on(webservers>4 and webservers<8) webserversstatus=1;
define webserverscritical on(webservers>=8) webserversstatus=2;
#appservers
assert appa=0;
assert appb=0;
assert appserversstatus=0;
#2 appservers, 1 down is ok, 2 down critical
define appservers cell appa+appb;
define appserversok on(appservers<=2) appserversstatus=0;
define appserverscritical on(appservers>2) appserversstatus=2;
#Databaseservers
assert dba=0;
assert dbb=0;
assert dbc=0;
assert dbserversstatus=0;
#3 db servers
#if 2 or more are OK, the status is OK
#if 1 is OK, the status is WARNING
#if 0 are OK, the status is CRITICAL
define dbservers cell dba+dbb+dbc;
define dbserversok on(dbservers<=2) dbserversstatus=0;
define dbserverswarning on(dbservers>=4 and dbservers<6) dbserversstatus=1;
define dbservercritical on(dbservers>=6) dbserversstatus=2;
#Total rules
assert webshopstatus=0;
#If all serverstatus ok, the whole webshop is ok
define webshopok on(webserversstatus=0 and appserversstatus=0 and dbserversstatus=0) webshopstatus=0;
#If any serverstatus is critical the whole webshop is critical
define webshopscritical on(webserversstatus=2 or appserversstatus=2 or dbserversstatus=2) webshopstatus=2;
#If no server status is critical and at least one is warning, the whole webshop is warning
define webshopwarning on((!webserversstatus=2 and !appserversstatus=2 and !dbserversstatus=2) and (webserversstatus=1 or dbserversstatus=1)) webshopstatus=1;
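The thresholds in the ruleset work on sums rather than on counts of servers, so it is worth checking them against the requirements. The Python sketch below is not NodeBrain code. It assumes, as the test session does, that a server that is down is asserted as 2 and a healthy one as 0, and all function names are made up for the illustration.

```python
from itertools import product

DOWN = 2  # value asserted for a server that is down; 0 means up


def web_status(total):
    # Mirrors webserversok / webserverswarning / webserverscritical
    if total <= 4:
        return 0
    if 4 < total < 8:
        return 1
    return 2  # total >= 8


def app_status(total):
    # Mirrors appserversok / appserverscritical
    return 0 if total <= 2 else 2


def db_status(total):
    # Mirrors dbserversok / dbserverswarning / dbservercritical
    if total <= 2:
        return 0
    if 4 <= total < 6:
        return 1
    if total >= 6:
        return 2
    return None  # no rule matches; cannot happen while cells are only 0 or 2


def expected(down, warning_at, critical_at):
    # The requirement, expressed as a count of down servers
    if down >= critical_at:
        return 2
    if warning_at is not None and down >= warning_at:
        return 1
    return 0


def mismatches(n_servers, status_from_total, warning_at, critical_at):
    """Return every up/down combination where the sum thresholds disagree."""
    bad = []
    for states in product((0, DOWN), repeat=n_servers):
        down = states.count(DOWN)
        if status_from_total(sum(states)) != expected(down, warning_at, critical_at):
            bad.append(states)
    return bad


print(mismatches(5, web_status, warning_at=3, critical_at=4))
print(mismatches(2, app_status, warning_at=None, critical_at=2))
print(mismatches(3, db_status, warning_at=2, critical_at=3))
```

An empty list for a layer means the sum-based thresholds agree with the requirement for every combination of up and down servers in that layer.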
Testing:
./nb webshop.nb -
> assert weba=2;
> assert webb=2;
> assert webc=2;
2009/10/02 09:44:42 NB000I Rule webserverswarning fired (webserversstatus=1)
2009/10/02 09:44:42 NB000I Rule webshopwarning fired (webshopstatus=1)
> assert webd=2;
2009/10/02 09:45:06 NB000I Rule webserverscritical fired (webserversstatus=2)
2009/10/02 09:45:06 NB000I Rule webshopscritical fired (webshopstatus=2)
> assert webd=0;
2009/10/02 09:46:27 NB000I Rule webserverswarning fired (webserversstatus=1)
2009/10/02 09:46:27 NB000I Rule webshopwarning fired (webshopstatus=1)
> assert weba=0;
2009/10/02 09:46:32 NB000I Rule webserversok fired (webserversstatus=0)
2009/10/02 09:46:32 NB000I Rule webshopok fired (webshopstatus=0)
> assert appa=2;
> assert appb=2;
2009/10/02 09:47:12 NB000I Rule appserverscritical fired (appserversstatus=2)
2009/10/02 09:47:12 NB000I Rule webshopscritical fired (webshopstatus=2)
> assert weba=2;
2009/10/02 09:47:40 NB000I Rule webserverswarning fired (webserversstatus=1)
> assert webd=2;
2009/10/02 09:48:07 NB000I Rule webserverscritical fired (webserversstatus=2)
> assert appb=0;
2009/10/02 09:49:08 NB000I Rule appserversok fired (appserversstatus=0)
> assert weba=0;
2009/10/02 09:49:33 NB000I Rule webserverswarning fired (webserversstatus=1)
2009/10/02 09:49:33 NB000I Rule webshopwarning fired (webshopstatus=1)
> assert dba=2;
> assert dbb=2;
2009/10/02 09:51:05 NB000I Rule dbserverswarning fired (dbserversstatus=1)
> assert dbc=2;
2009/10/02 09:51:09 NB000I Rule dbservercritical fired (dbserversstatus=2)
2009/10/02 09:51:09 NB000I Rule webshopscritical fired (webshopstatus=2)
> show -t
@ = ! == node
webshopwarning = ! == on(((!(webserversstatus=2))&((!(appserversstatus=2))&(!(dbserversstatus=2))))&((webserversstatus=1)|(dbserversstatus=1))) webshopstatus=1;
webshopscritical = ! == on((webserversstatus=2)|((appserversstatus=2)|(dbserversstatus=2))) webshopstatus=2;
webshopok = ! == on((webserversstatus=0)&((appserversstatus=0)&(dbserversstatus=0))) webshopstatus=0;
webshopstatus = 2
dbservercritical = ! == on(dbservers>=6) dbserversstatus=2;
dbserverswarning = ! == on((dbservers>=4)&(dbservers<6)) dbserversstatus=1;
dbserversok = ! == on(dbservers<=2) dbserversstatus=0;
dbservers = 6 == ((dba+dbb)+dbc)
dbserversstatus = 2
dbc = 2
dbb = 2
dba = 2
appserverscritical = ! == on(appservers>2) appserversstatus=2;
appserversok = ! == on(appservers<=2) appserversstatus=0;
appservers = 2 == (appa+appb)
appserversstatus = 0
appb = 0
appa = 2
webserverscritical = ! == on(webservers>=8) webserversstatus=2;
webserverswarning = ! == on((webservers>4)&(webservers<8)) webserversstatus=1;
webserversok = ! == on(webservers<=4) webserversstatus=0;
webservers = 6 == ((((weba+webb)+webc)+webd)+webe)
webserversstatus = 1
webe = 0
webd = 2
webc = 2
webb = 2
weba = 0
> assert dbc=0;
2009/10/02 09:52:12 NB000I Rule dbserverswarning fired (dbserversstatus=1)
2009/10/02 09:52:12 NB000I Rule webshopwarning fired (webshopstatus=1)
> assert webb=0;
2009/10/02 09:52:31 NB000I Rule webserversok fired (webserversstatus=0)
> assert dba=0;
2009/10/02 09:52:45 NB000I Rule dbserversok fired (dbserversstatus=0)
2009/10/02 09:52:45 NB000I Rule webshopok fired (webshopstatus=0)
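The session suggests that a rule fires only when its condition changes from false to true, which is why asserting weba=2 and webb=2 produces no output and why webshopscritical does not fire a second time while it is already true. The toy Python engine below imitates that behavior under this assumption. It is not how NodeBrain is implemented, `RuleEngine` and `build_webshop_engine` are hypothetical names, and only the web layer and the webshop rules are modeled for brevity.

```python
class RuleEngine:
    """Toy edge-triggered rule engine (an illustration, not NodeBrain)."""

    def __init__(self, **cells):
        self.cells = dict(cells)
        self.rules = []     # (name, condition, target cell, value)
        self.was_true = {}  # rule name -> condition value at last evaluation

    def define(self, name, condition, target, value):
        self.rules.append((name, condition, target, value))
        self.was_true[name] = condition(self.cells)

    def assert_(self, cell, value):
        """Set a cell and return the names of the rules that fire, in order."""
        self.cells[cell] = value
        fired = []
        while True:  # repeat until no rule fires (assumes the ruleset settles)
            newly_true = []
            for name, condition, target, val in self.rules:
                now = condition(self.cells)
                if now and not self.was_true[name]:
                    newly_true.append((name, target, val))
                self.was_true[name] = now
            if not newly_true:
                return fired
            for name, target, val in newly_true:
                self.cells[target] = val
                fired.append(name)


def web_total(c):
    return c["weba"] + c["webb"] + c["webc"] + c["webd"] + c["webe"]


def build_webshop_engine():
    e = RuleEngine(weba=0, webb=0, webc=0, webd=0, webe=0,
                   webserversstatus=0, appserversstatus=0,
                   dbserversstatus=0, webshopstatus=0)
    e.define("webserversok", lambda c: web_total(c) <= 4, "webserversstatus", 0)
    e.define("webserverswarning", lambda c: 4 < web_total(c) < 8, "webserversstatus", 1)
    e.define("webserverscritical", lambda c: web_total(c) >= 8, "webserversstatus", 2)
    layers = ("webserversstatus", "appserversstatus", "dbserversstatus")
    e.define("webshopok", lambda c: all(c[s] == 0 for s in layers), "webshopstatus", 0)
    e.define("webshopscritical", lambda c: any(c[s] == 2 for s in layers), "webshopstatus", 2)
    e.define("webshopwarning",
             lambda c: all(c[s] != 2 for s in layers)
             and (c["webserversstatus"] == 1 or c["dbserversstatus"] == 1),
             "webshopstatus", 1)
    return e


engine = build_webshop_engine()
for cell, value in [("weba", 2), ("webb", 2), ("webc", 2), ("webd", 2)]:
    print(cell, value, engine.assert_(cell, value))
```

Replaying the first four assertions of the session this way reports no rules for weba and webb, then webserverswarning and webshopwarning for webc, then webserverscritical and webshopscritical for webd, in the same order as the log lines above.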
2 Responses to “An advanced GPL’d rule engine, NodeBrain”
March 9th, 2010 at 6:30 pm
Looks interesting; any thoughts as follow-up?
– rob
March 9th, 2010 at 9:10 pm
Well, we at op5 are investigating whether to include a rule engine in our Nagios-based solution.