Nagios: Performance tuning and system optimization tips

Optimization is a buzzword that every manager likes to say and hear. We maximize the performance of servers and software all the time. Recently, I took the opportunity to optimize the Nagios installation at my organization, because a series of unfortunate events had led to slow Nagios response times.

I concentrated on one area, latency. Monitoring and notifications need to be as close to real time as possible. You do not want to receive an event notification after your end users start calling about a problem. Reducing latency will improve your monitoring tremendously. So, here are a few things I did to reduce latency.

Numbers do not lie, so here they are:

Monitoring Performance (min / max / avg)
Service Check Execution Time:     0.01 / 30.03 / 0.210 sec
Service Check Latency:            213 / 1655.57 / 1510.79 sec
Host Check Execution Time:        0.01 / 10.03 / 0.322 sec
Host Check Latency:               0.00 / 1708 / 589.98 sec
# Active Host / Service Checks:   1718 / 3279
# Passive Host / Service Checks:  0 / 0
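
If you want to track these figures over time rather than read them off the web page, a small parser helps. This is only a sketch: it assumes each timing line has the form "Label: min / max / avg sec" as pasted above, and the function name parse_stat is my own invention, not part of Nagios.

```python
import re


def parse_stat(line):
    """Split a 'Label: min / max / avg sec' line into (label, min, max, avg).

    Assumes the three-value timing format shown in the Monitoring
    Performance block; count lines such as '# Active Host / Service
    Checks' have only two values and are rejected.
    """
    label, _, values = line.rpartition(":")
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", values)]
    if not label or len(numbers) != 3:
        raise ValueError("not a min / max / avg line: %r" % line)
    return (label.strip(), numbers[0], numbers[1], numbers[2])


label, low, high, avg = parse_stat("Host Check Latency:  0.00 / 1708 / 589.98 sec")
print(label, "average:", avg, "maximum:", high)
```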

As you can see, we are suffering from a bad case of service and host check latency. We run more than 1,700 host checks and nearly 3,300 service checks, so we are a busy enterprise. We run Nagios, NagVis and NDOUtils. I needed to reduce the latency, so here are the steps I took.

  1. Placed $PREFIX, /usr/lib/mysql and /var/$PREFIX on fibre disk. $PREFIX is your default Nagios installation directory.
  2. Truncated the Nagios MySQL database. This is not a problem since trending is set up using the Nagios flat files under /var/$PREFIX/var/archives. Plus the database will rebuild in no time.
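  3. nagios.cfg file: lowered check_result_reaper_frequency from 10 to 5 seconds.
    #check_result_reaper_frequency=10
    check_result_reaper_frequency=5
  4. nagios.cfg file: lowered max_check_result_reaper_time from 30 to 15 seconds.
    #max_check_result_reaper_time=30
    max_check_result_reaper_time=15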
  5. Some of our hosts are checked with nothing more than a PING command. If you read my article about PING, then you know I am not a big proponent of using it to verify the status of a host. However, if you need to use it, then use check_fping. I also set max_check_attempts to 2.
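
The two nagios.cfg changes in steps 3 and 4 can also be scripted. The following is a minimal sketch, not a finished tool: set_directive is a hypothetical helper of my own, and it works on the file contents as a string so you can review the result before writing anything back. It keeps the old value as a comment, the same way I did by hand.

```python
import re


def set_directive(cfg_text, name, value):
    """Return cfg_text with `name` set to `value`.

    Any active 'name=...' line is kept as a '#' comment and the new
    setting is written right after the first one. If the directive is
    missing, it is appended at the end. (Hypothetical helper, not part
    of Nagios.)
    """
    active = re.compile(r"^\s*" + re.escape(name) + r"\s*=")
    out = []
    written = False
    for line in cfg_text.splitlines():
        if active.match(line):
            out.append("#" + line.strip())  # preserve the old value
            if not written:
                out.append("%s=%s" % (name, value))
                written = True
        else:
            out.append(line)
    if not written:
        out.append("%s=%s" % (name, value))
    return "\n".join(out) + "\n"


sample = "check_result_reaper_frequency=10\nmax_check_result_reaper_time=30\n"
tuned = set_directive(sample, "check_result_reaper_frequency", 5)
tuned = set_directive(tuned, "max_check_result_reaper_time", 15)
print(tuned)
```

Whichever way you edit the file, verify it with nagios -v followed by the path to your nagios.cfg before restarting Nagios.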

Once you have made these changes, give them time to take effect. I waited 12 hours and here are the results:

Monitoring Performance (min / max / avg)
Service Check Execution Time:     0.01 / 30.03 / 0.219 sec
Service Check Latency:            0.00 / 1626.57 / 1.566 sec
Host Check Execution Time:        0.01 / 10.03 / 0.324 sec
Host Check Latency:               0.04 / 4.42 / 1.101 sec
# Active Host / Service Checks:   1718 / 3279
# Passive Host / Service Checks:  0 / 0
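
To put the change in perspective, the before and after latency figures can be compared as simple ratios. The numbers below are copied from the two Monitoring Performance blocks in this article; the dictionary keys and the function name are just labels I chose for this sketch.

```python
# Latency values in seconds, taken from the two result blocks in this article.
before = {
    "service latency (avg)": 1510.79,
    "service latency (max)": 1655.57,
    "host latency (avg)": 589.98,
    "host latency (max)": 1708.0,
}
after = {
    "service latency (avg)": 1.566,
    "service latency (max)": 1626.57,
    "host latency (avg)": 1.101,
    "host latency (max)": 4.42,
}


def improvement_factors(before, after):
    """Return before/after ratios; a larger ratio means a bigger drop."""
    return {name: before[name] / after[name] for name in before}


for name, factor in improvement_factors(before, after).items():
    print("%-22s reduced by a factor of %.1f" % (name, factor))
```

The averages and the host maximum improve by factors in the hundreds, while the service maximum is almost unchanged.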

Notice the drop in latency? I am still looking to improve performance further. My next step is to migrate heavy-hitting services, like check_disk, to passive checks. Look at the second value in the Service Check Latency row: that is the maximum latency, and it barely moved (1655.57 to 1626.57 seconds). I feel I can reduce it by working with the heavy-hitting service objects. My feeling is based upon the Host Check Latency's maximum value going from 1708 seconds to 4.42 seconds after implementing check_fping.
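
For anyone who has not used passive checks: the remote side runs the check and hands Nagios the result through the external command file, using the PROCESS_SERVICE_CHECK_RESULT command. Here is a rough sketch of building such a line. The function name, the host web01 and the service description are made-up examples, and the command file location depends on the command_file setting in your nagios.cfg, so I have left the write step as a comment.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


# Hypothetical host and service names, for illustration only.
line = passive_service_result("web01", "Disk /", 0, "DISK OK - 40% free")
print(line, end="")

# To submit it, append the line to the file named by command_file
# in nagios.cfg (path varies by installation):
# with open(command_file_path, "a") as cmd:
#     cmd.write(line)
```

The service definition also needs passive checks enabled before Nagios will accept the result.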

These are just a couple of tips that I wanted to share with everyone, so you too can improve your performance. Other tips can be found in the performance tuning section of the Nagios documentation.

Enjoy and please remember to patronize the sponsors on this site,

Mike Kniaziewicz, MIS
