While most cloud providers have a monitoring system in place, we prefer to run our own. The main reason is that, once you drill down, you want to monitor the services that are relevant to your business. This often means monitoring very specific services or building your own plugins to check whatever you find important.

What to use

There are many monitoring solutions out there that offer a multitude of plugins and add-ons. Two aspects need consideration: you need a collector, and you need a way to get the information to that collector. In some systems the server pulls the information from the clients, in others the clients push it to the server, and some mix both approaches.

For our purpose, we were looking for something simple and as “standard” as possible. All our cloud servers run Linux and our cloud applications run on Linux-based platforms. Running open-source solutions is the obvious choice, and Nagios provides a tried and true way forward. Nagios is mostly a “pull” system in which a central monitoring instance polls the various machines and extracts the necessary information. Writing additional plugins is fairly simple as there is no “programming language” barrier: simple shell scripts can be added to deal with specific needs.
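To illustrate how small such a plugin can be, here is a minimal sketch of a shell-based check. It relies only on the standard Nagios plugin convention: one line of status text on standard output and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function name check_value and the example thresholds are hypothetical and not taken from our production setup.

```shell
#!/bin/sh
# Sketch of a Nagios-style check (hypothetical helper, not a stock plugin).
# Exit codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

# check_value LABEL VALUE WARN CRIT
# Prints one status line and returns the matching plugin exit code.
check_value() {
    label=$1
    value=$2
    warn=$3
    crit=$4

    # Only whole numbers are accepted; anything else is reported as UNKNOWN.
    for n in "$value" "$warn" "$crit"; do
        case $n in
            '' | *[!0-9]*)
                echo "UNKNOWN - $label: expected whole numbers"
                return 3
                ;;
        esac
    done

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value (threshold $warn)"
        return 1
    fi
    echo "OK - $label is $value"
    return 0
}

# A real plugin would measure something and end with, for example:
#   check_value "logged-in users" "$(who | wc -l | tr -d ' ')" 5 10
#   exit $?
```

Nagios only looks at the exit code and the first line of output, so any script that follows this convention can serve as a check.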

Nagios comes with an extensive set of plugins for various checks. It also provides its own client, NRPE (Nagios Remote Plugin Executor), which lets the monitoring server trigger small routines that run locally on the client to check the machine itself, with due attention to the security risks involved. There is no need to use this particular client, however.

As most of our cloud machines have little memory, minimizing resource usage is key. Therefore, we decided to build an SNMP-only client environment. Our internal VLAN, which runs as a backbone to all our virtual servers, allows the SNMP port to be accessed from our monitoring instance. Net-SNMP is a free and open-source SNMP implementation that is available on many Linux distributions. Out of the box, it provides a decent set of default checks. Most importantly, it is very stable, easy to configure and extend, and uses few resources.

Setting up the monitoring system

Our Nagios instance runs on a minimal cloud virtual machine (1 CPU, 1GB of memory), which makes it very affordable. Although you can find a prepackaged Nagios for many Linux distributions, we decided to build it from scratch. Building from source took a little effort and brought no specific advantage, so our suggestion is to go with a stock Nagios package.

On the client side, installing Net-SNMP is also easy and straightforward. Looking at the configuration file of the Net-SNMP service, you will notice it has some interesting capabilities that allow you to monitor load and disk usage. They are typically commented out; removing the ‘#’ comment sign at the beginning of the line enables them. The other configuration parameter to look for is the one that limits which Object Identifiers (OIDs) can be accessed. An OID, often loosely referred to as a MIB (strictly speaking, the MIB is the database that defines the OIDs), is a unique identifier that gives access to a specific piece of information under the SNMP protocol. This system allows you to set up and publish your own set of identifiers. As a reference, http://oid-info.com/cgi-bin/display?oid=1.3.6.1.4.1.45490 shows you our entry in the database. Although setting up your own set of identifiers is possible, we will stay within the boundaries of existing OIDs.
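As an illustration of the settings described above, the following sketch writes a small configuration fragment using directives from Net-SNMP's snmpd.conf (view, rocommunity, load and disk). The function name, the view name monitorview, the community string and the source subnet are placeholders chosen for this example, and the exact lines in your distribution's default file may look different.

```shell
#!/bin/sh
# Sketch: write a small snmpd.conf fragment (names and values are assumptions).

# write_snmpd_conf FILE COMMUNITY MONITOR_SOURCE
#   FILE            where to write the fragment
#   COMMUNITY       read-only community string
#   MONITOR_SOURCE  address or subnet of the monitoring instance
write_snmpd_conf() {
    cat > "$1" <<EOF
# Restrict what the monitoring host may read to the UCD-SNMP-MIB subtree,
# which holds the load and disk tables.
view       monitorview included .1.3.6.1.4.1.2021
rocommunity $2 $3 -V monitorview

# Flag the load averages (1, 5, 15 minutes) when they exceed these values.
load 12 14 14

# Flag the root file system when less than 10% is free.
disk / 10%
EOF
}
```

For example, write_snmpd_conf ./snmpd.conf.sample mycommunity 10.1.2.0/24 produces a fragment that can be reviewed and merged into the real configuration by hand; writing to a scratch file first avoids overwriting a working setup.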

As mentioned before, Nagios comes with a range of default plugins. check_host_alive is a simple check that pings the server, while check_load and check_disk tap directly into the SNMP OIDs we made available by removing the comment signs in the configuration. Nagios itself requires some additional configuration, which is well explained in its documentation.

Note that since we run our Nagios monitoring (SNMP access) over our internal VLAN, we have not implemented any SNMP security measures beyond restricting SNMP traffic to the internal IP addresses. If you intend to use the SNMP service over the internet, please make sure that you put the appropriate security measures in place.

Organizing things

To keep an overview of our configuration settings, we decided to group tests into Nagios groups. Our default group, called Linux-server, does the basic testing such as ping, load and disk as described above. We have a specific group for mail servers that checks whether the mail server is running, the DKIM server is up, the virus scanner is operational and the S/MIME encryptor is available. Our web service group checks that the web server is running, the database is up, the SSL certificate has not expired, and so on. To complete the configuration, we assign the servers to the appropriate groups. The default group is assigned to all servers; a server that runs email services is also assigned to the mail group, a server that runs web services is assigned to the web group, and so forth.

We have a good reason to use this system. Imagine you figure out something is wrong or missing for a particular set of services. By adding it to the group, you add the check to all relevant servers. Another more administrative reason is that we like to build our configuration from the export of all our running cloud servers.
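One way Nagios supports this is by attaching a service definition to a host group instead of to individual hosts. The sketch below prints such a definition; the helper name emit_group_service is hypothetical, the generic-service template is the one from the Nagios sample configuration, and the group and command names are only examples.

```shell
#!/bin/sh
# Sketch: print a Nagios service definition bound to a host group.
# The generic-service template ships with the Nagios sample configuration;
# group, description and command names passed in are illustrative.

# emit_group_service HOSTGROUP DESCRIPTION CHECK_COMMAND
emit_group_service() {
    cat <<EOF
define service {
   use                  generic-service
   hostgroup_name       $1
   service_description  $2
   check_command        $3
}
EOF
}
```

Calling emit_group_service webserver HTTP check_http yields a single definition that applies to every host in the webserver group, so a newly added check reaches all relevant servers at once.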

For each of our IBM Cloud targets, we execute the following commands:

$ bx target -c <TargetID>
$ bx sl vs list
$ bx sl hardware list

This provides us with a short list of all virtual and physical machines in our portfolio:

Id     hostname   domain             public_ip      private_ip   datacenter   status
123    web        nexperteam.cloud   159.8.14.238   10.1.2.6     ams01        ACTIVE
...

With a little script the appropriate machine description files are generated:

define host {
   use          linux-server
   hostgroups   genericserver
   host_name    web.nexperteam.cloud
   display_name web.nexperteam.cloud@ams01
   alias        web.nexperteam.cloud
   address      159.8.14.238
}

define host {
   use          linux-server
   hostgroups   genericlocalserver,webserver
   host_name    web.nexperteam.local
   display_name web.nexperteam.local@ams01
   alias        web.nexperteam.local
   address      10.1.2.6
   parents      web.nexperteam.cloud
}

Note that for every server we have an internal and an external configuration, each linked to the appropriate server interface.
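The generation step can be sketched as follows. This is not our actual script but a minimal illustration that assumes the seven-column listing shown above. The function name generate_hosts is hypothetical, and the rule that the internal name replaces the last domain label with "local" is inferred from the example definitions. Role-specific host groups such as webserver would still have to be added from a separate mapping.

```shell
#!/bin/sh
# Sketch: turn a machine listing into Nagios host definitions.
# Assumed input columns: Id hostname domain public_ip private_ip datacenter status

generate_hosts() {
    awk '
    # Skip the header row and anything that is not a complete data row.
    $1 !~ /^[0-9]+$/ || NF < 7 { next }
    {
        public_name = $2 "." $3
        # Assumption: the internal name swaps the last domain label for "local".
        local_name = public_name
        sub(/\.[^.]+$/, ".local", local_name)

        # External definition, reached over the public address.
        print "define host {"
        print "   use          linux-server"
        print "   hostgroups   genericserver"
        print "   host_name    " public_name
        print "   display_name " public_name "@" $6
        print "   alias        " public_name
        print "   address      " $4
        print "}"
        print ""

        # Internal definition, a child of the external one.
        print "define host {"
        print "   use          linux-server"
        print "   hostgroups   genericlocalserver"
        print "   host_name    " local_name
        print "   display_name " local_name "@" $6
        print "   alias        " local_name
        print "   address      " $5
        print "   parents      " public_name
        print "}"
        print ""
    }'
}
```

It reads the listing on standard input, for instance bx sl vs list | generate_hosts > hosts.cfg. The column layout of the real command output may differ from the excerpt above, so the field numbers should be checked before relying on the result.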

Keeping an eye on the infrastructure

Once you have Nagios installed and configured, there are many ways to set up notifications. Nagios can send out email or text messages and can easily be integrated with other APIs to warn you when something goes wrong. We prefer the smartphone app called Anag, which regularly polls the Nagios web API for monitoring information. It alerts the admin when a service is down and gives quick insight into what is actually going wrong.

Nagios also offers a web interface that allows you to look for a specific service, host or group of the above.

Although Nagios is intended as a monitoring tool, it can also generate availability reports per service, host or group out of the box.

If your infrastructure grows or services depend on one another, it is important to specify the parent-child relations. A quick overview map comes in handy when things go wrong.

Conclusion

It is easy to set up monitoring and tune it to your specific needs. Open-source tools and standard, readily available services provide a system that is quick to set up. Modern (mobile) apps offer a cheap way to keep an eye on what is going on.

Nexperteam can help you set up your own internal or external monitoring system and provide 24×7 services to keep an eye open and make sure that your cloud infrastructure delivers what it should at scale and with high performance.