Redundant calls to the database and slow upload of CSV file

In this article I will describe step by step how we have identified the solutions to a few issues we have encountered during one of our projects.

The first challenge:

A few database queries were used a lot more than required, thus increasing the processing time. It was not clear and obvious from the start that these queries were integrated over 20k times.

The solution

Research & Analysis:

We enabled logging and used a stopwatch to identify which steps from the long running process take the longest.
After identifying these steps, we checked any existing IO operations that existed there. Since there weren’t any except database queries, we focused on them.
We examined if our database queries were repeating with the same parameters for N number of times (e.g. N being the number of times in a loop), and extracted all of them, if possible, to a higher scope, and by doing so, we reduced some of the database operations, which improved the time of long running process.
After identifying all of the possible database queries that potentially made our process longer, we saw that the improvement was not even close to what we expected.
A deeper analysis showed that a particular step repeats 20k times, as of the business requirement. That step was searching for available users for the business using certain criteria. Those criteria were repeating a lot less, +/- 8000 times, and they were not changing throughout the process.
This step was behind the long running process. There were over 20000 iterations, there were two database queries which were complex, and they took around 2-3 seconds per iteration.

Implementation:

Since it was only required to execute those queries for +/- 8000 iterations (different due on business), we’ve decided to cache in-memory the results of those queries.
The decision for the in-memory cache was due to: 1) Network overhead when fetching from an application-based cache, like Redis; 2) The process ran once a week, and our expected time for the process was 1-2 hours. It ran during non-working hours, and since the resources were occupied only for such a short time, it was plausible to use an in-memory cache.
The parameters for those queries were used as keys for the cache. They were stored in a process context object, which contained a cache map for these data. The memory was released as soon as the process was finished.

Results

By doing so, the first +/- 8000 iterations were fetching the required data directly from the database, thus running for 2-3 seconds. The rest of the iterations ran under 200ms (the time was taken by using a stopwatch and logging the step time).

The first production run of the process after adding the cache showed a time of 1.5 hours, which was a great improvement from +16 hours.

The second challenge:

Another long running process was the upload of the file used for the previous process. The criteria (which is called the design) are uploaded from a CSV file to the database. This process took over +20 hours.

Optimizing file upload solution

By analyzing the upload time and business process of introducing the contents to the database separately, we noticed that our business process was quick.
The upload time was taking most of the time due to the processing of the file and the validation of each line.
We were using OpenCSV, and after investigating the library closer, we found out that it is not very fast, especially for heavy files (our file contained over 8000 rows).
We decided to use a different csv file processor. We searched for any csv processing dependency that was specifically specialized in speed, and we stumbled upon UnivocityParsers (Apache License).
By analyzing the library closer, we saw that it was perfect for our scenario, and since it was open-source and unit test friendly, we’ve integrated and tested it quickly.
Along the integration, we’ve observed that the class responsible for validation was thread-safe, which opened the possibility to validate each line in parallel.
We’ve opted for a parallel stream, since the number of lines was over 8000, and this was one of the cases where parallel streams are efficient.

Result

After fully integrating the library and testing the new code, the process of upload was tested by the business, and it took around 2 hours, which was a great improvement from +20 hours.