Web site health monitoring: a simple example with Nagios + WebInject

I noticed that over the past few months this site has gone down a few times for unexpected reasons and, more worryingly, that these outages were not caught by my naive Nagios setup, which was set up only to ping the server.

Because my web server, Nginx, was technically responsive, these ping tests passed even though clearly incorrect content was being served to the end user. All a visitor to the site would have seen was a scary and fairly nondescript error message indicative of a deeper problem in the backend (a broken PHP process, a broken database, etc.).

This clearly illustrated that, to catch downtime quickly and automatically, one needs to go beyond a simple Nagios setup that only does a PING on (some port of) the domain and move to a setup that tests the actual content being served (i.e., if I access page X, does the HTML response pass tests W, Y, Z?).

To accomplish this, I decided I needed to start testing my site as I would any more sophisticated web service. A fairly well-accepted way of doing this is via Nagios + WebInject, my current go-to combination.

The simplest non-trivial way of testing the content of a site would be to have webinject scrape the home page and check that the resulting html contains some expected string. Ideally, this test string:

  • Should not be part of the user-visible content, as this would introduce the possibility of it being inadvertently changed one day by a well-meaning content manager, thus breaking the test.
  • Should be distinctive enough that it couldn’t conceivably be produced by some random error message that indicates the site is broken.

A good way of accomplishing the above is to add a hidden HTML element to the home page, with a message specifically for Nagios + WebInject to find. For example, one could add the following hidden span to, say, the footer:

<span style='display:none;'>Hello Nagios! This is the footer speaking.</span>

and then have Nagios periodically test for this string using WebInject. Depending on your backend setup, this may suffice to test your entire stack. For example, on a WordPress install where this is inserted as customized content into the footer, it would constitute a test of pretty much the entire backend stack – Nginx/Apache, PHP, and the database – as all are involved in getting this text to display correctly.

(And if different parts of your page are actually generated by multiple, distinct backend pathways, you could insert multiple hidden tags / messages throughout your page in a corresponding manner, one for each execution pathway you want to test.)

To actually set up the test, we will need three config files (one for Nagios and two for WebInject). Here is my Nagios config file, etc/servers/davidSimic.cfg (all such paths are relative to your Nagios install dir):

define host {
        use                             linux-server
        host_name                       davidsimic.com
        alias                           davidsimic.com
        address                         davidsimic.com
        max_check_attempts              5
        check_command                   check_tcp!80
        check_interval                  5
        check_period                    24x7
        notification_interval           30
        notification_period             24x7
}

define service {
        use                             generic-service
        host_name                       davidsimic.com
        service_description             WebInject test
        check_command                   webinject!/usr/local/nagios/etc/webinject/testConfig.xml!/usr/local/nagios/etc/webinject/davidSimicTests.xml
}

Notice the “check_command” field in the service definition which tells nagios to run webinject using two input files as configs (arguments to a command are delimited by “!”).
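
The webinject command used above is not one of the commands shipped in the Nagios sample configuration, so it has to be defined somewhere, typically in etc/objects/commands.cfg. A minimal sketch of such a definition, assuming WebInject lives in /usr/local/webinject (a hypothetical path; adjust it to wherever webinject.pl is installed on your machine):

define command {
        command_name                    webinject
        ; Hypothetical install path; $ARG1$ is the config file, $ARG2$ the test case file
        command_line                    /usr/local/webinject/webinject.pl -c $ARG1$ $ARG2$
}

Here $ARG1$ and $ARG2$ are filled in with the two “!”-delimited arguments from the service definition: the -c option tells WebInject which config file to load, and the remaining argument is the test case file to run.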

Also notice the ‘use’ fields in both the host and service definitions. These reference default templates defined in:

 ./etc/objects/templates.cfg 

The linux-server template specifies a default value for check_command as check-host-alive (which performs a ping). Because I typically have a firewall blocking all traffic other than HTTP and SSH, I have instead set this field to check_tcp!80 in the host definition above.

Here is my webinject test config file, etc/webinject/testConfig.xml:

<useragent>WebInject Application Tester</useragent>
<timeout>10</timeout>
<globaltimeout>20</globaltimeout>
<reporttype>nagios</reporttype>

And finally the actual webinject test definition file, etc/webinject/davidSimicTests.xml:

<testcases repeat="1">
<case
    id="1"
    description1="Connecting to davidsimic.com"
    method="get"
    url="http://davidsimic.com"
    verifypositive="Hello Nagios! This is the footer speaking."
    errormessage="Unable to connect to home page of davidsimic.com"
/>
</testcases>

Notice how the ‘verifypositive’ field is set to the content of our hidden element. The above config specifies that the WebInject test will fail if the response to a GET request to davidsimic.com does not contain this string. Nagios, via the first config, takes care of how often the tests are run and what to do when they start failing.
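
If your page is generated by several distinct backend pathways, as mentioned earlier, one test case can check for more than one hidden marker. WebInject accepts additional positive verifications named verifypositive1, verifypositive2 and verifypositive3. A sketch, assuming a second hidden span has been added to a sidebar (the sidebar marker text is hypothetical and not part of my actual setup):

<testcases repeat="1">
<!-- The sidebar marker below is a hypothetical second hidden span -->
<case
    id="1"
    description1="Checking footer and sidebar markers on davidsimic.com"
    method="get"
    url="http://davidsimic.com"
    verifypositive="Hello Nagios! This is the footer speaking."
    verifypositive1="Hello Nagios! This is the sidebar speaking."
    errormessage="A hidden marker is missing from the home page of davidsimic.com"
/>
</testcases>

The case passes only if every listed string is found in the response, so a failure in either pathway is enough to trip the alert.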

In the above, a failed test triggers two more tests at a higher frequency, and an alert email is sent to me when three tests fail in a row (i.e., to ride out occasional server / internet hiccups). The trigger and resulting action are actually defined elsewhere; specifically, see the files:

  • etc/objects/contacts.cfg
  • etc/objects/commands.cfg

in your nagios install. These specify the contact info and actions to take upon test failure, respectively.
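
The retry behaviour itself (how often to check, how quickly to re-check after a failure, and how many failures in a row count as a real problem) is inherited from the generic-service template, but it can also be spelled out directly in the service definition. A sketch with illustrative values, which are assumptions rather than a copy of my config:

define service {
        use                             generic-service
        host_name                       davidsimic.com
        service_description             WebInject test
        check_command                   webinject!/usr/local/nagios/etc/webinject/testConfig.xml!/usr/local/nagios/etc/webinject/davidSimicTests.xml
        check_interval                  10      ; illustrative: minutes between checks while all is well
        retry_interval                  2       ; illustrative: minutes between re-checks after a failure
        max_check_attempts              3       ; illustrative: failures in a row before notifying
}

With these values, the first failure is followed by two re-checks at the shorter interval, and a notification goes out only if all three attempts fail.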

