Increasing transaction and user levels can require you to tune your load tests, and that may take some detective work. Here’s a real example.
The Problem
A RedLine13 customer had an API test that ran for 12 minutes and generated around 500 transactions per second when run locally on their PC. Running the same test on RedLine13 with 4 separate machines, they expected to generate 2,000 transactions per second but were only seeing around 1,040.
They were using four of the following instances when running the test.
They were unsure why they were not seeing the expected load, since they could see all the agents up and running when the test started.
Detective Work
So first we confirmed that the correct number of JMeter threads were running.
That looked good, so we looked at total requests.
That looked good too, meaning all the agents were firing off requests and the servers were responding.
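As a rough sketch (not how RedLine13 does these checks internally), both of those sanity checks can be reproduced from a single agent’s raw JMeter JTL results file, assuming the default CSV output with threadName and label columns; the file name below is a placeholder.

```python
import pandas as pd

# Sketch only: assumes JMeter's default CSV JTL columns (threadName, label).
# "results.jtl" is a placeholder file name for one agent's results.
results = pd.read_csv("results.jtl")

print("Distinct JMeter threads seen:", results["threadName"].nunique())
print("Total requests fired:", len(results))
print("Requests per API label:")
print(results.groupby("label").size())
```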
Detective Work – Part 2
We looked at some of their most frequently called APIs. It looked like response times were degrading and less traffic was getting through, though we didn’t want to jump to that conclusion.
We dug deeper and did some analysis, focusing on their largest API call since it is the easiest place to see what is going on.
To hit their Constant Throughput Timer target for this API, every thread must complete 5 transactions per second, which means each response must come back in less than about 200 ms. If response times are higher than that, there is no way to reach the target, since the Constant Throughput Timer only delays threads; it never adds them.
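As a minimal sketch (plain arithmetic, not JMeter or RedLine13 code), the budget works out like this; the 5 transactions-per-second-per-thread figure is the one from this example.

```python
# Minimal sketch of the Constant Throughput Timer budget, not JMeter code.
def max_response_time_ms(tps_per_thread: float) -> float:
    """Slowest response time (ms) at which one thread can still
    complete tps_per_thread requests every second."""
    return 1000.0 / tps_per_thread

print(max_response_time_ms(5))  # 200.0 -- anything slower and the target is unreachable
```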
We looked at the percentiles just for the Advertising API, retrieved by downloading the per-URL percentiles.
We could see performance issues starting to build up as they added more nodes, even though the CPU and other graphs for the agents looked fine.
- With 2 nodes, 74% of requests are under 200 ms, so it is safe to assume the average stays within the 200 ms budget.
- With 3 nodes, only 48% are under 200 ms.
- With 4 nodes, only 12% are under 200 ms.
Also, if we exclude the 100th-percentile numbers, since these are outliers or a single bad request, we can see that the 2-node average is below 200 ms but the 3-node and 4-node averages are not.
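As another rough sketch (not how RedLine13 generates its per-URL percentiles), the same numbers can be pulled from an agent’s raw JTL file; the file and sampler names below are placeholders, and it assumes JMeter’s default CSV output with label and elapsed columns.

```python
import pandas as pd

# Sketch only: `label` is the sampler/URL name, `elapsed` is response time in ms.
# File and label names are placeholders.
results = pd.read_csv("results.jtl")
adv = results[results["label"] == "Advertising API"]

print(adv["elapsed"].quantile([0.50, 0.90, 0.95, 0.99]))           # per-URL percentiles
print((adv["elapsed"] < 200).mean() * 100, "% under the 200 ms budget")
```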
What does the data tell us?
So a few thoughts:
- It is possible they are starting to see where their servers hit response-time issues. What are they using to monitor their environments?
- Increase the size of the thread groups. They will need more threads if they want more API calls per second.
- They have the Advertising API calls configured at 321 transactions per second. We backed into their logic for splitting traffic percentages (see spreadsheet).
- They have 65 threads per machine. 321 TPS / 65 threads ≈ 5 transactions per second per thread, which requires responses of 200 ms or less.
- If they configure this one API with 100 threads, 321 TPS / 100 threads = 3.21 transactions per second per thread, which allows responses of up to about 311 ms, and so on (see the sketch after this list). Of course, they will need to watch agent CPU as they dramatically increase threads.
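Here is a small sketch of that sizing math (plain arithmetic, not part of the customer’s test plan); the TPS and response-time figures are the ones from the bullets above.

```python
import math

def threads_needed(target_tps: float, observed_rt_ms: float) -> int:
    """Threads required for a Constant Throughput Timer target when each
    response takes observed_rt_ms on average."""
    tps_per_thread = 1000.0 / observed_rt_ms   # the best one thread can do at that latency
    return math.ceil(target_tps / tps_per_thread)

print(threads_needed(321, 200))  # 65  -- the current configuration, right at its limit
print(threads_needed(321, 311))  # 100 -- what it takes once responses slow to ~311 ms
```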
This gets validated when they look just at their Images API call, which actually shows the desired throughput.
The 2 Node, 3 Node, and 4 Node throughput charts all show it hitting the target.
And this is because the Images API requests are set to 46 TPS with 15 threads: 46 / 15 ≈ 3.07 transactions per second per thread, so each response must take less than about 326 ms.
Here are their 95th-and-higher percentiles for the Images API: 95% of requests are less than 326 ms, so their load test can keep up the load they desire.
The Bottom Line
You may need to tune your load tests, as in this case, by increasing thread group sizes based on the performance of specific API calls to keep them in line with the math of a Constant Throughput Timer. The customer said this helped them identify bottlenecks in their network, and once they resolved those, their transactions per second increased.