Ten best practices for successful performance load testing
A project manager asks:
We are planning to perform load testing for our application with Hewlett Packard's LoadRunner® or perhaps OpenSTA, before placing the application into production.
Does Pegasystems have any experience with such tests, or advice on structuring the tests or interpreting the results?
Pegasystems has extensive performance load testing experience, based on hundreds of implementations. This article presents ten best practices to help you plan for success in the testing and implementation of Process Commander solutions. These guardrails are valid regardless of which software you use for load testing.
- Design the load test to validate the business use
- Validate performance for each component first
- Script user login only once
- Set realistic think times
- Switch off virus-checking
- Validate your environment first, then tune
- Prime the application first
- Ensure adequate data loads
- Measure results appropriately
- Focus on the right tests
Guardrail 1 — Design the load test to validate the business use. Do the load math.
Design the load test to meet the business use of the solution. This means executing a test that is as close as feasible to the real anticipated use of the application.
It is essential that your application performance tests are designed to mimic real-world production use. To ensure this happens, identify the right volume and the right mix of work across a business day. Always do the math, so that you understand the throughput of the tests and can say that in any n minutes the test had a throughput of y items, which would represent a full daily rate of x items, which is A% of current volumes of V/day.
Calculate the work rate per workday hour. Estimate the volume of the work based on the type of use expected. For example, in a Customer Process Manager solution, there is one work-object (interaction) per call. Using the following example, you can calculate the actual load on the application for a given business hour of use:
Assume that the expected number of users is 250, and the expected number of concurrent users (using the application at the same time) is 175. If peak-hour throughput is not known precisely, a good rule of thumb is that the peak-hour volume is usually 20% of the volume in a given business day.
- Assume each interaction creates on average 2 service (case) requests, so each call produces 3 work operations (1 interaction plus 2 service requests). If each operator can take 10 calls per hour, calculate the load rate on the application as follows:
- 8 hours x 10 calls x 175 users x 3 work operations = 42,000 units of work per day.
- Therefore, 42,000 x 20% = 8,400 during the peak hour, and 42,000 / 8 = 5,250 per average work hour.
Using the per-hour or peak-hour rate, ensure that the pacing and work load arrival rate to the application is correct. If the duration of the test is a half hour, ensure that only half the work rate is accomplished, not the per-hour rate.
Always consider the use of the application in respect of work; work creation, work retrieval, and work assignment. Calculate this use profile per the example above.
Common omissions and mistakes can also cause a performance test to diverge from real-world patterns of use. For example, two common mistakes are to use a single test operator ID or a single customer account number. On their own or together, these two mistakes can create database access or deadlocking issues.
Other common oversights include testing the application in "developer mode", where the operator IDs used for the virtual users have the Developer portal or rules check-out enabled. Both of these conditions force significant server processing beyond the load that would normally be experienced if non-developer Operator IDs are used.
Often, there is not just one but a mix of transactions that dominate the workload; they all need to be modeled. Getting the proper proportions of transaction rates into the mix is key. If, for example, before entering a new case users perform an average of 2.5 searches, the load test should reflect this ratio.
Finally, do not load the application incorrectly by assuming that fewer test users doing more work per hour place the same demand on the system as the calculated workload. Using the above example, a user work rate of 30 operations per hour (5,250 / 175 users) is not the same as 87.5 users doing 60 operations per hour. (This common mistake is usually made to minimize test software license costs that are based on v-user counts.)
Guardrail 2 — Validate performance for each component first
Don't consider any load tests until you have run PAL on each flow. Fix any issues found in string tests (a single user exercising each happy path through the application) first.
Before exercising a performance test, the best practice is to exercise the main paths through the application (including all those to be exercised by the test script) and then take a PAL (Performance Analyzer) reading for each path. Investigate and fix any issues that are exposed. See Using Performance Tools.pdf for an explanation of the PAL analysis data points.
Before going to the next step of running the load test, repeat this exercise with additional users. Running the same test with 10 users should indicate immediately whether the application's performance is disproportionately worse at scale. If this is the case, investigate and fix the area of the application that PAL data shows has the performance problem.
Guardrail 3 — Script user login only once
Make sure that v-users are logged in once, not once for each test iteration or interaction. Login and logoff operations are expensive, and will skew results dramatically.
Many load tests are compromised because the test team scripts each individual virtual user to log in, do a little work, and then log off. This script is then repeatedly executed in the load test. This is not how real users behave, and this approach produces invalid test results because of the overhead associated with login and logoff (for example, memory collections and recycles).
Ensure that your test includes several operator IDs. Using one Operator ID profile for every v-user will cause choke points in the test, because access to the same resource (for example operator ID records) creates contention.
Guardrail 4 — Set realistic think times
Use realistic think times and include some randomization. Include think time in the flow interactions as well as after the end of flows. Review the results data and graphs excluding think times.
Your test scripts need to include think time to represent real human behavior and the corresponding load of the application. Think time or "pause time" is very important in duration-based tests; otherwise more work will arrive than appropriate in the period of the test (see Guardrail #1 above).
You can insert think time within the script steps and at the end. Scripts are typically enumerated as:
A1 + t + A2 + t + A3 + t + &
Where A is an action, t is think time, and & indicates a cycle back to the beginning of the script for a given set of interactions.
Randomizing think time with a percentage deviation allows the work pacing of each v-user to be slightly offset. As a general rule of thumb, use -3% to +3% of the value of t as the deviation. This value spreads the arrival rate of work to the application and avoids unrealistic oscillations in load (waves of work arriving all at the same time). Use of think time also allows you to simulate other arrival patterns (such as Poisson or Erlang distributions) where appropriate.
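As a rough illustration of randomized think time, the following Python sketch draws a pause within the -3% to +3% band described above. The function names are hypothetical, and the second function is only one possible way to approximate a Poisson-style arrival pattern (exponentially distributed pauses); most load test tools provide their own think-time settings, which should be preferred where available.

```python
import random


def think_time(base_seconds, deviation=0.03, rng=random):
    """Return base_seconds randomized uniformly within +/- deviation."""
    low = base_seconds * (1 - deviation)
    high = base_seconds * (1 + deviation)
    return rng.uniform(low, high)


def exponential_think_time(mean_seconds, rng=random):
    """Return an exponentially distributed pause with the given mean.

    Exponential gaps between requests are the pattern associated with
    a Poisson arrival process.
    """
    return rng.expovariate(1.0 / mean_seconds)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so a demo run is repeatable
    for _ in range(3):
        print(round(think_time(10.0, rng=rng), 3))
```

Passing a seeded `random.Random` instance keeps a demo run repeatable; in a real script each v-user would normally use its own unseeded generator so that pauses do not line up.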
Consider think times for the different operations a user will need to perform as they interact with the application. From the Model Human Processor (MHP) studies (Card, Moran and Newell), these broadly fall into perception, cognitive and motor actions. For example, they include:
- reading time
- determining what action to take
- mouse movement and clicking
- typing time
- local actions that do not create a server interaction
These all need to be considered and factored into your think time scripting.
For the case where think time is to be calculated for a given use or load, over a duration, you can use Little's Law to calculate the required values.
Consider work by virtual users (U) arriving at a rate R to the server and spending time T utilizing the server. Little's Law states that U = R*T. (For further explanation of Little's Law, see the Wikipedia article.) You can compute the service throughput of a system by dividing the number of users by the time spent in the server (R = U/T).
Now, assume that users wait for a time To between requests: a think time. This value is an interval typical for users interacting with the application. From the rule U = R*T, you can infer that the number of users in think time is Uo = R*To.
But the total number of users in such a case is the combination of those who are in the server and those who are in think time, that is, U = R*T + R*To. So service throughput can be expressed as R = U/(T+To), where
- T is time spent in the server (response time),
- To is the average think time,
- U is the number of users,
- R is the throughput.
If your system has 200 users making 16,000 requests in 15 minutes (900 seconds), with an average response time of 2 seconds, then R = 16,000 / 900 = 17.78 requests per second, and you can calculate the think time as:
To = U/R - T = 200/17.78 - 2 = 9.25 seconds average
Using this value, you can script the interactions with the server to accommodate and create the appropriate load in a given period.
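The think time calculation from Little's Law can be sketched as a small Python function. The function and parameter names are hypothetical; the formula is the To = U/R - T rearrangement of R = U/(T+To) given above.

```python
def required_think_time(users, requests, duration_seconds, response_seconds):
    """Average think time (To) needed to produce the target request rate.

    Rearranges R = U / (T + To) into To = U / R - T.
    """
    throughput = requests / duration_seconds      # R, requests per second
    return users / throughput - response_seconds  # To, seconds


# Worked example: 200 users, 16,000 requests in 15 minutes, 2 s response time.
to_seconds = required_think_time(200, 16_000, 15 * 60, 2)
print(round(to_seconds, 2))
```

A negative result would mean the requested rate cannot be reached with that many users at the given response time, even with no think time at all.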
A common mistake is assuming that you have set the scripted think times correctly without reviewing real-world evidence. Always validate the script execution to ensure the work rate is correct. Counting new records in specific database tables to check that over a given duration the actual counts match expected values is a good method to validate expected work throughput.
Finally, when reviewing test response time results and test graphs, ensure think time is not included. (This is a report setting in most load-based test tools.) A common mistake is to include think time in average response times or end-to-end script times, thus skewing the results.
Guardrail 5 — Switch off virus-checking
Make sure that virus checking is not enabled on the v-user client. When virus checking runs, it impacts any buffered I/O, and this changes the collected response times.
A best practice is to ensure that any virus checking software is not enabled on virtual user client test injectors. During testing, the maximum available CPU should be dedicated to the client injector processes or threads. Bottlenecks or saturation can elongate response times collected by these processes, leading to false results and perception of server-side slowness.
Guardrail 6 — Validate your environment first, then tune
Do not try to tune the application immediately after the first test. Validate the environment first. Treat the first run as only a test of a test.
In too many cases, test teams have analyzed initial results from the test environment and reached premature and misleading conclusions about the design and performance of an application. Always plan to validate and performance-tune the environment first, before attempting to spend valuable time looking for problems in the application.
As you test, be aware of other constraints or factors that may influence the overall test results, for example, working in a shared environment at the application server, database, network, or integration services level. Ensure you have visibility into the impact on the overall test by collecting sufficient metrics on this other use.
Load testing is not the time to shake out application quality issues. Make sure that the PegaRULES log is relatively clean before attempting any load tests. If exceptions and other errors occur often during routine processing, the load test results will not be valid.
Guardrail 7 — Prime the application first
Run a first use assembly (FUA) cycle before the actual test. Tune the environment as needed based on pre-test data. Don't start cold.
Every test should have an explicit objective. Stress and load tests are often compromised because the objective is unstated, mixed, or vague.
If the objective of a load test is to observe the behavior of an application with a known load, then conduct this test only when the system has achieved a steady state, and review the results in that light. Otherwise, other unknown conditions affecting performance can obfuscate the real results.
The best practice to observe the behavior and response time throughput of an application is to ensure that sufficient caching has occurred after startup, as would occur over a period of time in the production environment. Interpreting response time data for a short period of testing (including a period where First Use Assembly occurs) does not provide correct insights into the performance of the application in a production setting.
For best results, execute each flow or function at least 4 times after startup. Priming of some caches requires 4 cycles.
Guardrail 8 — Ensure adequate data loads
Make sure loads are realistic and sufficient data is available to complete tests in the time period! Transferring work is a good example.
Many performance issues first become evident in applications that have been in production for a certain period of time. Often this is because load testing was performed with insufficient data loads. As a result, response-time performance of the data paths was satisfactory during testing.
For example, on a table with a certain number of records, a database table scan can be as effective as a selection through an index. However, if the table grows significantly in production and a needed index is not in place, performance will seriously degrade. It is important to size the data load correctly for key work items and for attachment access, assignment, and reporting. Calculate the amount of work that will be open and the amount resolved over a known time period, and load the test application with adequate data to reflect your calculations.
For duration-based testing, ensure that there is sufficient work available to meet the demands of the test. A common mistake is to run out of work and not have adequate detection mechanisms in the test script to account for this situation. For example, running a script that emulates a user getting the next highest priority of work, updating the item and then transferring it from their worklist to a workbasket will require that there are enough rows in the assignment tables to support the duration of the test. In this case, calculate total work assignment rates for the total number of users to validate that the test will not silently fail in background.
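A quick way to check that a duration-based test will not run out of work is to estimate how many assignments the v-users will consume. The Python sketch below is illustrative only: the function names and the 10% safety margin are assumptions, not Pegasystems recommendations, and the per-user rate of 30 operations per hour is taken from the Guardrail 1 example.

```python
def assignments_needed(users, ops_per_user_per_hour, test_minutes,
                       margin_percent=10):
    """Assignments required to keep every v-user busy for the whole test.

    Uses integer arithmetic and rounds up, so a fractional result never
    leaves the test one assignment short.
    """
    numerator = (users * ops_per_user_per_hour * test_minutes
                 * (100 + margin_percent))
    denominator = 60 * 100
    return -(-numerator // denominator)  # integer ceiling division


def has_enough_work(available_rows, users, ops_per_user_per_hour,
                    test_minutes):
    """True if the preloaded assignment rows cover the planned test."""
    return available_rows >= assignments_needed(
        users, ops_per_user_per_hour, test_minutes)


# 175 users at 30 operations per hour for a one-hour test.
print(assignments_needed(175, 30, 60))
print(has_enough_work(5000, 175, 30, 60))
```

Running a check like this before the test, against an actual row count from the assignment tables, makes a shortfall visible up front instead of as a silent failure partway through the run.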
Also, if the application involves external database tables or custom tables within the PegaRULES database, make sure these contain realistically large numbers of rows, with a reasonable mix. For example, if the Process Commander application uses a lookup into a POLICY table, make sure this table has rows for all the policies that the application will need to search for, not just a few. Data in such tables must vary -- if all the policies are for one or a few customers, then doing a customer lookup will produce an unrealistically large number of hits.
Make sure that tables that grow during testing don't grow to be unrealistically large. Too often, test scenarios omit the "Resolve" step at the end of each test, so that the Process Commander assignment tables grow and grow, rather than reach a steady state.
Guardrail 9 — Measure results appropriately
Do not use average response times for transactions as the absolute unit of measure for test results. Always consider Service Level Agreements (SLAs) in percentile terms. Load testing is not a precise science; consider the top percentile user or requestor experience. Review results in this light.
The best practice is to determine a realistic service level agreement for end-user response time experience. In most test tools, data points are collected from all virtual users and then averaged to show an average response time, as measured for pre-determined scripted transactions. However, as in the real world, some data points will be anomalous and unexplained; this is a normal aspect of systems and especially of load testing, where an attempt to mimic or mirror a real system is not a precise science.
An attempt is often made to normalize these data discrepancies through calculations of standard deviations. However, few practitioners are able to articulate how to apply such measures to the test results.
An easier method is to simply average the transaction data showing response times in percentile terms. Use the test tool's reporting capabilities to provide the percentile average response time of the virtual users' experience. Know what the overall averages are and the 90th percentile.
- For transaction intensive applications ("heads-down" use) a recommended value is 80 percentile.
- For mixed-type use applications, use 90 percentile.
- For ad-hoc, infrequent type use, a 95 percentile average will provide a more statistically relevant result set than 100 percentile of the average.
Typically, measuring the expected experience of 8 or 9 out of every 10 users for an application provides a more insightful profile of how the application will work in production than a 100 percentile average that includes some significant outlying response-time data points.
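The effect of outliers on an average can be seen with a short Python sketch. The text does not define "percentile average" precisely, so the sketch shows two readings: the response time at a given percentile (nearest-rank method), and the average of all samples up to that percentile. The function names and the sample data are hypothetical and purely illustrative.

```python
import math
from statistics import mean


def _rank(sample_count, pct):
    """Nearest-rank position (1-based) for the given percentile."""
    return max(1, math.ceil(pct * sample_count / 100))


def percentile(samples, pct):
    """Smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    return ordered[_rank(len(ordered), pct) - 1]


def average_up_to_percentile(samples, pct):
    """Mean of the fastest pct% of samples, excluding the slowest tail."""
    ordered = sorted(samples)
    return mean(ordered[:_rank(len(ordered), pct)])


# Made-up response times in seconds, with one large outlier.
response_times = [1.1, 1.2, 1.3, 1.2, 1.4, 1.3, 1.5, 1.2, 1.6, 12.0]

print("plain average:", round(mean(response_times), 2))
print("90th percentile:", percentile(response_times, 90))
print("average up to 90th:",
      round(average_up_to_percentile(response_times, 90), 2))
```

With this sample, the single 12-second outlier pulls the plain average above the response time that 9 out of 10 requests actually experienced, which is the distortion the percentile view avoids.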
Guardrail 10 — Focus on the right tests
Don't try to achieve the impossible and load test for thousands of users. Judicious use of PAL data, load test results and basic extrapolation are first indicators of scale.
Trying to orchestrate large, multi-use, complex load tests can be daunting, logistically challenging and time consuming.
Pegasystems recommends that you test the application with a step approach, first testing with 50 users, then 100, 150, and 200 for example. You can then easily put the results into an Excel spreadsheet and chart them. Use Excel's built-in capability to generate an equation from the trend-line and plot a model. In addition, Excel can compute the R² value.
In brief, R² is the relative predictive power of a model. R² is a descriptive measure between 0 and 1. The closer it is to one, the better the model is. Here, "better" means a greater ability to predict. A value of R² equal to 1.0 would imply that the regression model provides perfect predictions.
Using this set of data points and the regression formula, predictive values can be extrapolated for a higher number of virtual users early in the testing cycle. Using the above example chart and a simple predictive model, it can be seen that the expected response time for 500 users would be:
Y = 0.02 * 500 ^ 1.0507 = 13.7 seconds
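The same extrapolation can be done outside a spreadsheet. The Python sketch below evaluates the trend-line equation quoted above and shows one way to fit a power-law model of the form y = a * x^b by least squares on log-transformed data. The function names are hypothetical, the R² value is computed in log space, and the demo points are generated from the quoted equation rather than taken from real measurements.

```python
import math


def predict_response_time(users, a=0.02, b=1.0507):
    """Evaluate the power-law trend line y = a * users ** b."""
    return a * users ** b


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear least squares on (ln x, ln y).

    Returns (a, b, r_squared), with r_squared measured in log space.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


print(round(predict_response_time(500), 1))

# Synthetic demo data at the step sizes suggested above (not measured).
steps = [50, 100, 150, 200]
times = [predict_response_time(u) for u in steps]
print(fit_power_law(steps, times))
```

With real step-test results in place of the synthetic points, an R² well below 1 would suggest that the power-law model is a poor basis for extrapolation.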
- Collect data on CPU utilization, I/O volume, memory utilization and network utilization to help understand the influences on performance.
- Review the Pega Log and the Alert log after each load test. Use the Pega Log Analyzer or PegaAES to summarize the logs.
- Begin by testing only HTTP transactions (disable agents and listeners). Then test the agents and listeners, and finally test with both foreground and background processing.
- Relate the capacity of the test machines to production hardware. If the test machines have 20% of the performance of the production machines, then the test workload should be 20% of the expected production workload. If you expect to use two or more JVMs per server in production, use the same number when testing.
- Performance testing requires skilled and trained practitioners who are able to design, construct, execute, and review performance tests early in the project cycle, taking into account the best practices enumerated above.