“We moved our database into the cloud and it was terrible”. Or wasn’t it?
When we started moving a large database into the cloud we drafted a migration scenario to minimize the downtime. Migrating from one type of database to another one is seldom an easy game. Combine this with the fact the there was a lot of data to be shipped and you can understand the raw complexity.
We managed to set up an incremental process that on the first run would take several days, the second run about a day and then make it shorter and shorter to eventually end up in the 1-hour range which fit the requirements. After some rigorous testing, we felt comfortable that this would work. Next thing was to run our application on top of it and see if performance was still within specification.
We actually managed to do better on the clustered cloud database than with the classic virtual setup the original database ran on. This was probably due to the better table analysis that the cloud database provided by default.
So we were ready to migrate: convert the current infrastructure to read-only, do the final migration, start all services in the new environment and redirect to the new services. Done. Indeed, all went well and well within the boundaries of the migration plan, we finished the migration project.
Everything we had tested worked well, as per the specifications and without any problem.
Don’t try this at home
That is until the “daily routine” scripts started kicking in. Since there was no strict time constraint on any of those and the database performed much better, we had assumed that none of these would cause any problem. Wrong assumption! One of our scripts had a 4-hour time constraint. On the existing database, it would take about an hour. When testing it on the new infrastructure we had seen it performed better on the test machines but not against the cloud database. Here it would take much longer than 4 hours, up to 8 hours actually.
Having to increase the performance of a running job on the fly is never easy. We had to take baby steps making sure that every time we would make a change nothing breaks. If anything would break, we would abort the process, rollback, fix and try again on next run.
We quickly found a culprit. Something as simple as “SELECT id FROM table” to then loop over all IDs and do a “SELECT * FROM other_table WHERE id=<>”. When we tested the database we found out that it would always have been network-close. This meant that any query and result cycle would be well within 0.1 milliseconds if the data that had to be transferred was not substantial – and it wasn’t. Doing that with a cloud database would still be fast, as in the order of magnitude 10 msec. But that would be about 100 times slower than before.
Once we isolated the problem, the solution was easily found. Rather than reading a small block of data repeating the process a million times, we would simply do a full “ SELECT * FROM other_table” and store it into memory.
Quick on the back of the envelope calculation learns that a million times 0.1 msec up and down plus is about 15 minutes, doing one query to get all of the data will suffer a penalty for transferring a huge block of data. On a 1Gbps line, it would take up to 3 seconds, making the total query time 2 times 10 msec. for the connection and 3 seconds for the data transfer adding up to about 3 seconds. Obviously, a big gain on time compared to the 15 minutes. The trade-off was that we needed about 256MB more memory to run the job. We considered that was a pretty cheap way to vastly improve the run time.
We found a number of additional places that would benefit from the same strategy and we were even capable of minimizing the overlap of data in memory. The memory footprint of the process remained well within scope. Obviously, we could have been less lucky and then we would probably have been forced to split the process in blocks and handle the data block by block.
It took a number of days and very careful testing and planning to reduce the runtime by a multiple of 15 minutes or so day after day. Fortunately, in a week we were back to within the required 4-hour range.
Keep on going
Now that we knew what caused these issues, we started analyzing the script further. When a database is running on the same machine or on a machine next to you connected via a 10 Gbps network interface you tend to think that the data access cost is almost free. That dramatically changes when the network latency starts kicking in.
In the script, certain information was picked up on multiple occasions. Creating a simple key-value structure in memory for as long as needed that would immediately return the value rather than trying to pick it up again from the database is not only easy to set up, it will shave off millions of milliseconds over and over again.
After we were done implementing this “caching” feature the run time further decreased to the hour mark. This was well within the required time frame and would not cause any problems even if the number of entries that we had to process would grow dramatically.
Nevertheless, we still were left with one “loop over all” rows which would take half the time of the full run time of the script. This problem was a little subtler. It would involve the calculation of a CRC check-sum based on a random string and the content of a specific column of the table. The program looks something like this
- get all IDs and match text fields from the table
- loop over all rows
- calculate some random text
- calculate CRC on the concatenation of random text and text from the text field
- update row with id and store CRC and random text
As we explained above, having to do this a million times would take us about half an hour to execute this.
Note that calculating some random text is not the issue here. All modern databases have some form of a random function. Setting the “random text” field is simply a matter of executing an “UPDATE table SET random_text=whatever_generates_random_text()”. Admittedly this uses some intrinsic database functionality but considering this was a well-documented process and the text was just a hexadecimal representation of a random number it would certainly not be a show-stopper if we would have to change database platform again.
So what is left is calculating the CRC. The only way we found to further improve this was to use specific database features. PostgreSQL comes with a build-in language extension called PLPGSQL. One simple Google search away, we found a CRC32 implementation on Stack Overflow.
This cornerstone reduced to loop to two simple SQL statements. One to fill the random field and a second to calculate the check-sum on the concatenation of the two strings. Obviously, the whole calculation thing does take a few minutes executing it on the server-side. But compared to the 30 minutes of updating the full table row by row, that is about 10 times faster.
Eventually, we managed to get the full run well under 30 minutes and now most of the work is CPU or network-bandwidth bound. For all practical purposes that is what we wanted to achieve.
Only when migrating our database to the cloud we have realized how much we related on the database being network-close. When you start thinking about it we feel we have improved the workload substantially and gained some valuable insights into the network-latency trap.
If you have performance problems with your database or need help in evaluating network-based database usage, Nexperteam will bring valuable insights that are based on years of experience.