An It-slave in the digital saltmine

30
Aug

General Systems- and Network management guideline, part 1

Posted by peter

Background

This article describes best practices I have collected over ten years of working with systems and network management. My role has varied between consultant, trainer, systems engineer, project manager and architect.

Many programs create log files. They are great for troubleshooting and in some cases are the only place to find out whether something is wrong.

Monitor-friendly software has a way of reporting its status at this very moment by providing relevant information via, for example, SNMP, a web page, status stored in a database, a well-documented API available in several different languages, or similar.

A NON-monitor-friendly application has, for example, a JavaScript web page as the only place where the status can be read. Unfortunately, this is becoming more and more common.

When you write a piece of software, keep in mind that someone else might want to know its status.
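As an illustration of status that a monitoring solution can read directly, here is a minimal sketch of a plain status page served over HTTP. It is only one possible shape, not a recommendation of a specific design: the `/status` path, the field names in `current_status()` and the port are assumptions made up for this example.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def current_status() -> dict:
    """Collect the status right now. The fields are illustrative only."""
    return {
        "status": "ok",
        "uptime_seconds": int(time.time() - START_TIME),
    }


class StatusHandler(BaseHTTPRequestHandler):
    """Serve the status as JSON on the hypothetical /status path."""

    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        body = json.dumps(current_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering status requests on localhost."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Because the response is plain JSON rather than a page assembled by JavaScript in the browser, a check can fetch it and read the fields without any extra tooling.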

As a consultant working with monitoring and surveillance, I have met several developers who have the attitude that their software never breaks as long as everything else works:

-Just make sure that I get the data and that all the surrounding pieces like databases, network, hardware and so on work, then my code will work flawlessly.

IMHO, unfortunately the best way to meet that type of argument is to just sit back and let them put their application into test and production. Focus on getting the other stuff monitored. Within a rather short while the development team will come back and ask what the best way of sending information into the monitoring solution is.

The best way of building a monitored solution is to design the application to be monitored from the beginning instead of trying to fix it afterwards.

In most cases a good enough solution can be created relatively simply afterwards; in some cases it is a little bit harder.

Guideline

My experience as a surveillance expert working with Tivoli, BMC, HP OpenView, IBM NetView, BigBrother, Nagios and op5 Monitor can be summed up in these simple guidelines:

Keep it simple, stupid.
- Start with simple monitoring, like PING to make sure that the host is up, and standard checks like HTTP, SMTP and so on to make sure that standard services are up. Going from no monitoring to basic monitoring is a huge step, and many organizations do not have the processes to handle more complex monitoring.
- The next step is disk, CPU and memory on hosts. On network devices, it is port load, CPU load and network link status.
- The third step is to dig into business-critical applications and services.
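The first step above can be sketched in a few lines. This is a minimal example under stated assumptions, not how any particular monitoring product does it: it only tests whether a TCP connection to the service port can be opened (for example 80 for HTTP or 25 for SMTP), and the function name `tcp_service_up` and the default timeout are made up for illustration.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host name and connects; the
        # context manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

A call such as `tcp_service_up("mail.example.com", 25)` (a hypothetical host) tells you only that something is listening on the port. That is often enough for the first iteration; protocol-level checks can come later.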
Small iterations. Do not try to build a top-of-the-line monitoring solution from day one. You will never leave the startup phase.
Let the monitoring solution pull the status instead of sending the status to the monitoring solution. This avoids complicated rules for handling the different types of information sent to the monitoring solution. So avoid sending SNMP traps.
The monitoring solution is NOT a trashcan for tons of uninteresting garbage. It is far too common that hardware vendors recommend sending thousands of unnecessary SNMP traps to the monitoring solution when just a few are interesting. It is a nightmare to create the ruleset to figure out what is interesting, especially if there are dependencies where one message is interesting only if another message has been sent before. The documentation is typically a badly written MIB of a couple of hundred pages. In almost every case I’ve run into with this approach, the implementation never ends and test cases are hard to create. When the systems are in production, I can bet that a critical event will occur that has not been taken care of, and production will stop. Managers will be upset, vendors will blame each other and customers will be angry.
Let the status be easily available:
- via standard APIs; Perl and Bash are the most common.
- SNMP via SNMP GET requests instead of SNMP traps
- Status stored in a database; the monitoring solution can run SQL queries to get the status.
- Commands that can be run by the monitoring solution and the output parsed, or even better, exit codes that are used and documented.
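The last two options above could look roughly like the following check command. This is a minimal sketch, not a definitive implementation: the `job_queue` table, its `state` column, the thresholds and the use of SQLite are all assumptions made for illustration. The exit codes follow the convention used by Nagios-style plugins (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sqlite3

# Exit codes in the style used by Nagios plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(conn, warn: int, crit: int):
    """Return (exit_code, message) for the hypothetical job_queue table."""
    try:
        (depth,) = conn.execute(
            "SELECT COUNT(*) FROM job_queue WHERE state = 'pending'"
        ).fetchone()
    except sqlite3.Error as exc:
        # If the status cannot be read, say so instead of guessing.
        return UNKNOWN, f"UNKNOWN - query failed: {exc}"
    if depth >= crit:
        return CRITICAL, f"CRITICAL - {depth} pending jobs"
    if depth >= warn:
        return WARNING, f"WARNING - {depth} pending jobs"
    return OK, f"OK - {depth} pending jobs"


def main(argv) -> int:
    """Run the check against the SQLite file named in argv[1]."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_queue.py <database file>")
        return UNKNOWN
    conn = sqlite3.connect(argv[1])
    try:
        code, message = check_queue_depth(conn, warn=100, crit=500)
    finally:
        conn.close()
    print(message)
    return code

# In a real script, finish with: sys.exit(main(sys.argv))
```

The monitoring solution only needs the exit code to decide on the state; the one-line message is there for the human who reads the alert.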
Normally it is not a good idea to read a log file to understand the status of the software.

Afterword

This guide describes guidelines for how to collect the data for your monitoring solution. In the next article I will describe how to implement and work with the information collected.

Stay tuned…
