When it comes to our on-premises cloud infrastructure, we are old school. We run a Red Hat RHEV virtualization stack on top of a GlusterFS storage stack. It is a nifty setup and one can hardly call it outdated: it has been up and running for several years now with few problems.
Monitoring the infrastructure and taking care of small errors when they occur will ensure that our infrastructure is resilient and stable.
In this post we’ll share a few insights into how we troubleshot an NTP issue that was masquerading as a network stack issue.
NTP time service
To make sure that all our machines run with comparable clocks, we rely on ntpd. NTP is a protocol designed to keep servers in sync with a reference clock. If you hook it up to the internet, you should be able to synchronize the time on all your physical and virtual machines to within milliseconds of each other.
“The Network Time Protocol (NTP) is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks.
NTP is intended to synchronize all participating computers to within a few milliseconds of Coordinated Universal Time (UTC). NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions.”
Source – Wikipedia
The need to synchronize various machines to a single clock becomes obvious when events on specific machines cause interactions with other machines. Both from a debug as well as from a security perspective it is imperative that machines run with synchronized clocks. Before we describe our internal NTP setup, it’s important to understand that the NTP protocol has had a few security issues in the past.
In particular, if a machine synchronizes directly with an internet source, a malicious or faulty reference could influence the local clock. Internal machines could then be tricked into shifting their clocks relative to other machines.
NTP setup on local infrastructure
Synchronizing your internal infrastructure against an internal reference prevents this from going awry. If the internal reference were skewed by some external source, all machines would be off, but they would remain synchronized with each other. A simplified view of our NTP setup shows two machines connected to the internet, providing independent NTP services to all machines on the inside.
This NTP setup balances the security, resilience and stability needs required to run the in-house service.
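As an illustration, a minimal configuration for this kind of two-tier setup might look like the following. The hostnames, pool servers and subnet are hypothetical placeholders, not our actual configuration.

```shell
# /etc/ntp.conf on one of the two internet-facing NTP servers (hypothetical)
# Sync against several independent internet sources.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst

# Lock the service down by default...
restrict default kod nomodify notrap nopeer noquery
# ...but allow internal clients (example subnet) to ask for time.
restrict 10.0.0.0 mask 255.0.0.0 nomodify notrap

# Every internal client would instead point at both internal servers:
# server ntp1.internal.example iburst
# server ntp2.internal.example iburst
```

With two independent internal servers, a client can keep syncing even when one server loses its upstream sources.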
To keep track of synchronization on all our servers, we use a Nagios plugin that tracks the offset against our two internal NTP servers. At a certain point, we noticed all machines triggering a “could not sync” alert against one of our NTP servers. Due to NTP’s polling interval and the lag of the Nagios plugin, by the time we were notified the NTP server would typically be syncing again. The server itself would report that it could not reach the NTP sources it was supposed to synchronize with.
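The check itself boils down to comparing each peer’s reported offset against a threshold. A rough sketch of that logic, run here against canned `ntpq -p`-style output rather than a live server (the hostnames and offsets are made up):

```shell
#!/bin/sh
# Sample output in the style of `ntpq -p`; the offset column is in milliseconds.
sample='     remote           refid      st t when poll reach   delay   offset  jitter
*ntp1.internal   192.0.2.10       3 u   38   64  377    0.412   12.345   1.021
+ntp2.internal   192.0.2.11       3 u   41   64  377    0.398  140.102   2.310'

threshold_ms=128

# Flag every peer whose absolute offset exceeds the threshold.
echo "$sample" | awk -v limit="$threshold_ms" '
NR > 1 {
    offset = $9; if (offset < 0) offset = -offset
    if (offset > limit) print $1, "offset", $9, "ms exceeds", limit, "ms"
}'
```

In real monitoring you would query the server live (for example with the stock check_ntp_time plugin) instead of parsing canned text, but the comparison is the same.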
After having a thorough look at the NTP configuration, we presumed the problem was linked to a network issue which would block the internal server from connecting with the source. During the internal network investigation, we detected a number of issues. One of them was a malformed bonding interface. Another caused some spoofed traffic to be sent out over the wrong interface. Every time another issue was detected, we hoped to see the NTP problem disappear.
After countless hours of monitoring the local network and the WAN connection, we concluded that the problem was not the network stack, but a transient phenomenon. It could happen at any time, making Nagios go crazy for a short period before everything returned to normal, with little to no impact on our services.
And the winner is …
Despite the in-depth investigations, the annoying Nagios bumps kept the team challenged.
We wanted to get to the bottom of this, and switching to a higher level of NTP debug logging gave us the final clue. Every time Nagios turned deep red, the NTP server would go out of sync with a message saying that the 128 ms threshold had been exceeded.
Diving deep into NTP, it turns out that:
Under ordinary conditions, ntpd slews the clock so that the time is effectively continuous and never runs backwards. If due to extreme network congestion an error spike exceeds the step threshold, by default 128 ms, the spike is discarded. However, if the error persists for more than the stepout threshold, by default 900 s, the system clock is stepped to the correct value. In practice the need for a step is extremely rare and almost always the result of a hardware failure. With the -x option the step threshold is increased to 600 s.
Due to the virtual nature of our NTP servers, the internal clock would receive bumps at irregular intervals. Occasionally these bumps were just above 128 ms: the logs showed values around 132 ms, slightly over the step threshold.
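In other words, the decision ntpd was making can be caricatured as a simple threshold comparison. The thresholds come from the documentation quoted above; the 132 ms offset is the kind of value our logs showed.

```shell
#!/bin/sh
# Step threshold: 128 ms by default, 600 s (600000 ms) with -x.
offset_ms=132
default_step_ms=128
x_option_step_ms=600000

if [ "$offset_ms" -gt "$default_step_ms" ]; then
    echo "default: spike exceeds step threshold, peer flagged out of sync"
fi
if [ "$offset_ms" -le "$x_option_step_ms" ]; then
    echo "with -x: offset is slewed away gradually, peer stays in sync"
fi
```

A 132 ms spike sails just over the default limit but is nowhere near the -x limit, which is exactly why the alerts were both intermittent and harmless.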
The solution followed directly from the explanation of the limit: start ntpd with the ‘-x’ option, which raises the step threshold from 128 ms to 600 s, so that these small spikes are slewed away instead of knocking the server out of sync.
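On RHEL-family systems, ntpd’s start-up flags live in /etc/sysconfig/ntpd, so the fix is a one-line change there. The existing OPTIONS value may differ on your system; shown here is a typical default plus the new flag.

```shell
# /etc/sysconfig/ntpd
# -g allows a large initial correction at boot;
# -x raises the step threshold from 128 ms to 600 s,
#    so transient spikes are slewed rather than stepped.
OPTIONS="-g -x"
```

After editing the file, restart the service (systemctl restart ntpd, or service ntpd restart on older releases) for the option to take effect.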
Since the option was added, we have not had any more issues with our NTP setup.
Tracking down a transient phenomenon that seemingly happens at random is one of the most difficult things to do. It takes a lot of hard work, insight and determination to find the cause. Sometimes the root cause is quite different from what you expect.
Nexperteam is determined to find root causes and fix problems permanently. If you have network related issues that you feel require extra attention, feel free to reach out and meet our experts.