Performance and where it all can go wrong

Modern web applications come with a complex set of underlying services. When the applications are not performing and the response time to deliver a page is high, the search to track the problem down starts.

In this article, we’ll go through performance on different levels of delivering a web page to the browser and we’ll provide an example of troubleshooting performance issues.

Performance on different levels

Understanding why a web page under-performs is seldom easy. Let’s have a look at what actually happens when things get slow. Understanding the full process of delivering a web page to the web browser is imperative.

1. Connecting

Before the client-server communication even starts, a slew of processes run. First, your computer converts the domain name to an Internet Protocol (IP) address. Then a number of packets are exchanged between the client and the server to establish that communication is possible. Next, the client and the server agree on how to communicate securely by setting up Transport Layer Security (TLS). Only then can the client and the server exchange request and response.
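To make these steps tangible, the sketch below times them separately for a single connection. It is only an illustration using Python's standard library: the function names timed and connection_phases are made up for this example, and the default port 443 and the 10 second timeout are assumptions.

```python
import socket
import ssl
import time


def timed(step):
    """Run step() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = step()
    return result, time.perf_counter() - start


def connection_phases(host, port=443, timeout=10.0):
    """Time the DNS lookup, TCP connect and TLS handshake for one connection.

    Hypothetical helper for illustration; it needs network access to run.
    """
    # 1. Convert the domain name to an IP address.
    infos, dns_s = timed(
        lambda: socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    )
    family, socktype, proto, _, address = infos[0]

    # 2. Exchange packets to establish the TCP connection.
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        _, tcp_s = timed(lambda: sock.connect(address))

        # 3. Agree on how to communicate securely (TLS handshake).
        context = ssl.create_default_context()
        tls_sock, tls_s = timed(
            lambda: context.wrap_socket(sock, server_hostname=host)
        )
        tls_sock.close()
    finally:
        sock.close()

    return {"dns": dns_s, "tcp": tcp_s, "tls": tls_s}
```

Calling connection_phases with the host name of your own site returns the three durations in seconds, which shows how much time passes before the first request is even sent.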

Note that we are only skimming the surface here. All of these steps require highly responsive, secure and sometimes even distributed services. All of them can introduce unwanted delays or strange behavior at the most unexpected moments.

2. Information exchange

Now that a connection has been established, information exchange can take place. But this is where the next problem arises. Usually, the content management systems (CMSs) are set up using plugins or extensions. Most of them come with their own set of files. These files have to be transferred to the other side. Some of them even refer to other sites for more files or content to be picked up.

Without help, these tools often require dozens or even hundreds of files to be loaded. Imagine having to set up a connection for each of these. Using “keep-alive” connections improves performance substantially. Implementing caching options or using tools and plugins that combine various files is beneficial too. Fortunately, most CMSs come armed with a wide range of tools for this.
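The effect of keep-alive on the number of connections can be shown with a small local experiment. The sketch below is an assumption-laden toy, not a benchmark: it starts a throwaway HTTP server on the loopback interface (TinyHandler, CountingServer and connections_needed are names invented for this example) and counts how many TCP connections the server accepts while a client fetches a number of files.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class TinyHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 keeps the connection open between requests by default.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


class CountingServer(ThreadingHTTPServer):
    accepted = 0  # number of TCP connections accepted so far

    def get_request(self):
        request = super().get_request()
        self.accepted += 1
        return request


def connections_needed(n_files, keep_alive):
    """Fetch n_files from a local test server and return the connection count."""
    server = CountingServer(("127.0.0.1", 0), TinyHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = None
        for i in range(n_files):
            if conn is None:
                conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request("GET", f"/file{i}.css")
            conn.getresponse().read()
            if not keep_alive:
                conn.close()  # force a fresh connection for the next file
                conn = None
        if conn is not None:
            conn.close()
        return server.accepted
    finally:
        server.shutdown()
        server.server_close()
```

With keep-alive, all files travel over one connection. Without it, every file pays for its own connection setup, which over TLS also means its own handshake.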

One of the critical things when exchanging information is the speed at which all of this is possible. Nobody likes to wait for a web page to load because the connection is slow or the data set is big. Using compression can obviously help when it comes to HTML, CSS and JavaScript files. For images, audio and video the use of a content delivery network can be beneficial.
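How much compression helps on text-based files is easy to check. The following sketch uses Python's gzip module on a made-up, repetitive piece of HTML. The sample markup and the helper name compression_ratio are assumptions for illustration, and real pages will compress less dramatically than this artificial sample.

```python
import gzip


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    compressed = gzip.compress(payload, compresslevel=level)
    return len(compressed) / len(payload)


# Artificial sample: markup such as menus and lists tends to be repetitive.
sample_html = (
    '<li class="menu-item"><a href="/page">Page</a></li>\n' * 500
).encode("utf-8")
```

Passing sample_html to compression_ratio gives the fraction of the original size that would actually go over the wire.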

All of this will have to be carefully orchestrated to provide your user with an optimal web experience.

3. Data collection

The most relevant entry points to the service are the objects that are not static but require some interaction with other services, or a collection of data, to be generated. This is again a possible source of concern: the data collection and reformatting need to be fast.

The database needs to be tuned and indexed to perform well. File access needs to be swift. Calculations and parsing need to be on par. CPU usage, I/O bottlenecks, machine load and network latency can all slow the process down to a grinding halt, making the user wait for a page to be delivered.

Monitoring is essential, as is load testing new versions of the software. Techniques such as caching data and code on various levels, as well as using reverse proxies, can help improve the overall responsiveness.

4. Visualization

Eventually, when the client has gathered all the information, the page gets visualized to the end user. From the beginning, browsers have tried to proactively show what is already available. Making sure there are no “blocking” elements that prevent the browser from showing what is already there can improve the user experience substantially.

The effects can be subtle. In one case, complaints came in that the main web page would load really slowly, sometimes taking more than 30 seconds to display. It took some time to figure out what was wrong, partly because the problem was specific to the audience that complained: one of the sections on the web page was linked to an external service that their network was blocking (or at least delaying substantially). The solution turned out to be really simple: don’t let that section block the visualization of the rest of the page. That way the page shows all information apart from the small additional block, improving the user experience vastly.
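The fix described here was made in the page itself. The same idea, not letting an optional block hold up everything else, can be sketched in a few lines of Python as a server-side analogy. This is not the change that was actually made: render_page, fetch_extra and the timeout value are assumptions chosen purely for illustration.

```python
import concurrent.futures

# Shared worker pool for optional, possibly slow page sections.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def render_page(main_content, fetch_extra, timeout_s=0.5):
    """Return the page, leaving out the optional block if it is too slow."""
    future = _pool.submit(fetch_extra)
    try:
        extra = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        extra = ""  # the external service is too slow: show the page without it
    except Exception:
        extra = ""  # the external service failed: same fallback
    return main_content + extra
```

The main content is always returned within roughly timeout_s seconds, whether or not the external service answers in time.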

If you understand the problem…

So, after this substantial introduction, let’s discuss an issue we recently encountered. It goes something like this.

A few months ago we migrated a web service to a new environment. Better connectivity, more IOPS, improved CPU and more memory. What can possibly go wrong? Obviously nothing. We migrated the data and installed the latest CMS version on the most modern versions of the libraries and tools needed. After extensive testing showed that the service performed excellently, the migration was completed and all was well.

Then our usual annual security audit took place, and all hell broke loose. The whole service slowed to a crawl. Pages were still being delivered, but it took multiple seconds rather than fractions of a second. So we started digging in. Looking at the above list, we checked:

the network latency – no issue
the CPU usage – no issue
the load – slightly up, but should not be an issue
database – nothing wrong there
since the content of the pages had not substantially changed, we presumed visualization would not be the issue.

All in all, the service should have performed well: there was barely any CPU usage, I/O levels were very low and the network was performing excellently, as we were able to manage the server remotely with no issues at all. As soon as the security scan was stopped, everything returned to normal. When it was resumed, things would start to slow down again.

This was obviously a more complex problem with some underlying causes, so we started gathering more data. One of the things we noticed was that the security tool would set up a new TLS connection per request. This could potentially slow down the security scanner, but it should not have an impact on the standard traffic. Or so we thought. So we started investigating a full one-page request. Much to our surprise, the TLS handshake was taking some time. In fact, the more requests came in per time frame, the longer the handshake took, gradually returning to normal as the number of requests dropped.

… you can fix it.

What could impact the TLS handshake in such a way that it would slow down the service? We migrated this service as we had done a few times before. One of the main concerns of the customer we are running this service for is security. They have their own preferred order of security algorithms, OCSP stapling and some additional bits and bobs. To make sure we did not forget anything, we decided to copy over all security settings.

The configuration included the following:

        SSLRandomSeed startup file:/dev/random  512

At first sight you would not see this as something bad. It tells our web server to use the standard Linux random device to seed the TLS library. Obviously a good random number generator is essential for good security, and /dev/random provides exactly that. Note that there have been many discussions when it comes to random and Linux. If you feel up to the challenge, Wikipedia is an excellent source to start your journey on finding out what random is all about. But the /dev/random device also has a major drawback: when read too often, it will block. The blocking is linked to what is referred to as entropy, or rather the lack thereof, and it is a known issue that arises in particular on virtual machines.
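On Linux, the kernel's own estimate of the available entropy can be read from /proc/sys/kernel/random/entropy_avail. The sketch below reads that value. The helper name read_entropy_avail is an assumption for this example, and how meaningful the number is depends on the kernel version, since some kernels report a constant value here.

```python
from pathlib import Path

# Linux procfs entry holding the kernel's entropy estimate, in bits.
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path=ENTROPY_PATH):
    """Return the entropy estimate as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
```

Polling this value while the scan is running would be one way to see whether entropy is being drained faster than the machine can replenish it.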

Once the problem was clear, the solution was obvious: do not use the blocking device, but the non-blocking device, whose name differs by only one character.

        SSLRandomSeed startup file:/dev/urandom  512

Can we do better…?

Reading this blog post only takes a couple of minutes, and once you understand the problem and its solution you simply nod. Figuring out what was going on took substantially longer. No surprises there.

On our way to figuring out what was going on, we assumed we might have bumped into a strange TLS bug. When investigating the OpenSSL library, we noticed that our virtual machine supported the RDRAND function.

        # openssl engine -v
        (rdrand) Intel RDRAND engine
        (dynamic) Dynamic engine loading support
        SO_PATH, NO_VCHECK, ID, LIST_ADD, DIR_LOAD, DIR_ADD, LOAD

This is a hardware-generated entropy source that can be used to seed the TLS communication, allowing us to use the fast builtin seeding source of Apache’s SSLRandomSeed directive:

        SSLRandomSeed startup builtin
        SSLRandomSeed connect builtin
        SSLCryptoDevice rdrand

Note that this implies we are trusting Intel’s rdrand not to be influenced by any external sources as some conspiracy groups are claiming. We feel confident that the information we share is not of the nature that this would have any impact at all.
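Besides asking OpenSSL, on Linux you can also check whether the (virtual) CPU advertises RDRAND by looking for the rdrand flag in /proc/cpuinfo. The sketch below does just that. The helper names cpu_flags and has_rdrand are assumptions for this example, and it only tells you that the flag is exposed, not how OpenSSL or Apache will use it.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_rdrand(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the CPU advertises the rdrand flag (Linux only)."""
    try:
        with open(cpuinfo_path, encoding="utf-8") as handle:
            return "rdrand" in cpu_flags(handle.read())
    except OSError:
        return False  # not Linux, or /proc is not available
```

On a virtual machine the result depends on whether the hypervisor passes the instruction through to the guest.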

Conclusion

Providing an excellent user experience to your customers, wherever they may be, is not a trivial task. Some issues are hidden deep in the structure of your setup, while others surface at the complete opposite end, where the user’s browser is displaying the information.

Is your web service secure and performing the way it should? Nexperteam has the tools and the skills to check, report and find a good way forward for your infrastructure to deliver high performance.