Support Play: Analyzing OutOfMemory Exceptions
This article outlines a method for identifying root causes of OutOfMemory exceptions in Process Commander applications. The goal of this article is to provide you with a process to gather and analyze critical data so you can resolve memory-related errors.
Information in this document applies to the Java 1.4.2, Java 5, and Java 6 versions of the Sun and IBM Java Virtual Machine (JVM).
The analysis approach uses a three-step process. Step 1 is to determine the operating context. Step 2 is to gather supporting detail. Step 3 is to identify and then troubleshoot OutOfMemory exceptions based on the exception type.
Step 1 - Determine the Operating Context
The first step in any problem analysis is to establish the context in which the error occurs. This will prevent critical missteps later in the information gathering and analysis process. For example, if a previously stable system suddenly experiences multiple OutOfMemory exceptions per day, a recent change to the application or environment (for example, adding 100 new users) might explain the variation.
Do not skip this step even if you believe you already know the root cause of the problem, as this information enables you to provide guidance on future administration actions, and avoid problems of this type in the future.
To determine the operating context, obtain answers to the following questions. Some may have been answered previously when completing a Support Request entry.
- What type of system is experiencing this problem, e.g. development, test, or production?
- What are the full platform details of this system? What versions of Process Commander, the application server, operating system, database, and JVM are in use? You can gather this information from the System Management Application (SMA) and the Pega log files.
- What is the current usage of this system? For example, are there 10 concurrent developers building flows, or 200 concurrent business users creating/updating/resolving work objects?
- When did this problem begin and/or when was it first noticed?
- What if anything changed in the environment on or around the time the problem first started?
- Has anything been done to remedy the problem thus far?
- Can you provide any other relevant information, for example about configuration changes or software upgrades?
Step 2 - Gather Supporting Details
The second step of the analysis is to gather all of the log files, configuration files, and other evidence for analysis in order to determine the root cause of the OutOfMemory exception. Here are the files and data to collect for all platforms.
- Verbose GC output. You may not have verbose GC logging turned on. If this is the case, enable verbose GC logging now, regardless of whether the system in question is development, test, or production. The overhead of verbose GC is very small, and the output provides invaluable data on the performance and behavior of the JVM garbage collector, which is critical to evaluating any OutOfMemory condition.
- JVM settings. If you have trouble locating the JVM startup arguments, examine the application server configuration files (for example, server.xml) and startup programs (for example, startWebLogic.sh). The IBM JVM produces javacore files when an OutOfMemory exception is thrown, and these contain full JVM and operating system argument information.
- PegaRULES log files (both standard and ALERT). These can be gathered from the SMA.
- PegaRULES configuration file (prconfig.xml or pegarules.xml) or file contents. This can be gathered from the SMA.
- Application server log files. This includes both the stdout and stderr logs. WebSphere has four logs, two for each type of output: SystemOut and native_stdout, and SystemErr and native_stderr.
- Java system snapshot and error files. If the OutOfMemory exceptions were accompanied by javacore, heapdump, or other Java error files (for example, hs_err.pid for a Sun JVM crash), request these as well. Note, however, that UNIX/Linux operating system core files are in binary format and are not usable.
Step 3 - Identify and Troubleshoot OutOfMemory Exceptions by Type
OutOfMemory exceptions break down into four basic categories:
- Permanent generation exhaustion (Sun JVMs only).
- Maximum classloaders exceeded (IBM JVMs only).
- Native memory exception (primarily IBM JVMs).
- Standard main heap space OutOfMemory exceptions.
After locating the OutOfMemory error message in the log files, use the information in the following sections to identify and troubleshoot the cause of the problem.
Permanent Generation Exhaustion Exceptions (Sun JVMs)
The Sun JVM utilizes a generational model for object allocation. The permanent generation is used by the Sun JVM to store objects that are long lived, and to house certain internal JVM cache structures, such as the interned string cache. One of the object types that is stored in the permanent generation is java.lang.Class.
Because Process Commander generates Java classes for most rule types, it is important to size the permanent generation appropriately for the number of class objects that will be loaded by the application. If the amount of space in the permanent generation is inadequate for the number of class objects that are needed, an OutOfMemory exception is thrown, even if there is still plenty of space available in the new and tenured generation areas.
A permanent generation exhaustion has a special OutOfMemory exception message:
java.lang.OutOfMemoryError: PermGen space
Troubleshooting Permanent Generation Exceptions
The first step in troubleshooting a permanent generation exception is to ensure the customer’s version of the Sun JVM is bug-free. A permanent generation bug exists in Sun 1.5 JVM versions earlier than 1.5.0_07. If you are running an earlier version of the JVM, an upgrade is recommended. See OutOfMemory Error - PermGen Space when using Sun JVM.
If the JVM version is good, the next step is to determine if the permanent generation size is set to the recommended value for Process Commander applications. The recommended default value (listed in the Process Commander installation manual) is 64 MB, and the recommended maximum permanent generation size is 256 MB.
If the permanent generation size is less than the recommendation, have the customer make this change, and then check whether it resolves the issue. Here are the JVM arguments used to configure the permanent generation size for a V5.4 system running on a Sun JVM:
-XX:PermSize=64m -XX:MaxPermSize=256m
If the JVM version is good and the permanent generation size is set to the recommended value, next determine if the maximum size of the permanent generation should be increased beyond the recommended 256 MB. To evaluate if this is the case, check the total number of classes loaded on the SMA’s Classloader Status screen. If the total exceeds 10,000, it may be necessary to increase the size of the permanent generation to resolve the issue. Increase the permanent generation in moderately sized increments, typically 128 MB at a time. If 256 MB is too small, try 384 MB.
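The class count check above can also be approximated from inside a JVM. The following is a minimal sketch, not part of Process Commander or the SMA: it uses the standard java.lang.management API (available from Java 5) to report the number of loaded classes and the usage of each memory pool. The class name PermGenCheck is hypothetical, and the 10,000 threshold is simply the figure quoted above. Memory pool names vary by JVM vendor, version, and garbage collector, so the sketch prints every pool rather than assuming a particular name for the permanent generation.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class PermGenCheck {
    // Threshold quoted in the text: above 10,000 loaded classes,
    // a larger permanent generation may be needed.
    static final int CLASS_COUNT_THRESHOLD = 10000;

    // Number of classes currently loaded in this JVM.
    static int loadedClassCount() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return classes.getLoadedClassCount();
    }

    static boolean exceedsThreshold(int classCount) {
        return classCount > CLASS_COUNT_THRESHOLD;
    }

    public static void main(String[] args) {
        int count = loadedClassCount();
        System.out.println("Loaded classes: " + count
                + (exceedsThreshold(count) ? " (above threshold)" : " (at or below threshold)"));

        // Print every pool; a max of -1 means the JVM reports no defined maximum.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            System.out.println(pool.getName() + ": used=" + usage.getUsed()
                    + " bytes, max=" + usage.getMax() + " bytes");
        }
    }
}
```

The count reported here is for the JVM in which the code runs, so it is only comparable to the SMA figure if it executes inside the application server process.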
If you do not want to increase the permanent generation size without a more detailed analysis, or the number of classes on the SMA’s Classloader Status screen is less than 10,000, a detailed analysis of the contents of the permanent generation may be necessary. This can be done using heap dump and/or class histogram analysis over a period of time, but an explanation of such analyses is beyond the scope of this document.
Maximum Classloaders Exceeded Exceptions (IBM JVMs)
IBM’s Java Virtual Machine 5.0 has a built-in limit on the number of classloaders that it permits in a JVM. This causes problems when executing applications that dynamically generate Java, including Process Commander.
Process Commander’s Rules Assembly subsystem assigns one classloader for each rule that is dynamically converted to Java, compiled into a Java class, and loaded into the JVM. The number of classloader objects that Process Commander uses is proportional to the number of Java classes generated for the rules in a given configuration.
The limitation on the number of classloaders in the IBM JVM is version-dependent. It can pose a problem if the version is IBM JVM 5.0 SR-4 or earlier, including all versions of JVM 1.4. The classloader limit does not apply to JVM 5.0 SR-5 and later.
On the earlier JVMs, the maximum number of classloaders is defined as 8,192 by default, a setting that is too low for Process Commander systems. If Process Commander loads more than 8,192 classes, it throws an OutOfMemory exception, usually with a message that this particular JVM memory space has been exceeded in the associated javacore file:
ERROR: The failure was caused because the class loader limit(-Xmxcl) was exceeded. Please set -Xmxcl to value or greater.
Where value is the maximum number of classloaders allowed in the JVM.
Troubleshooting Maximum Classloaders Exceeded Exceptions
What to do for IBM JVM 5.0 SR-5 and later
As the classloader limit no longer applies to these JVM versions, look elsewhere in this Support Play for other causes of out-of-memory exceptions.
What to do for IBM JVM 5.0 SR-4 and earlier
Updating a JVM to SR-5 or later should correct an out-of-memory error, and you should recommend that the customer make this upgrade to the JVM.
If you choose not to upgrade, use the JVM parameter -Xmxcl to set the number of Java classloaders that can be loaded in the JVM at one time to a value greater than the default setting of 8,192.
The recommended value for -Xmxcl is 22,000, as suggested in How to set up IBM JVM v1.4.2 and v5.0. To specify the recommended value on the command line, enter:
If errors persist, start with the value of the Process Commander property fua/global/instancecountlimit, which is set in prconfig.xml, and increase it. The default of instancecountlimit is 20,000, and the recommended increment is 10 percent. Thus the first increment equals the recommended value for -Xmxcl, 22,000. Edit prconfig.xml to allow a higher value, and then set -Xmxcl on the command line.
The value assigned to -Xmxcl is JVM-wide, while instancecountlimit applies only to Process Commander. To avoid a possible conflict, ensure that the JVM’s -Xmxcl value is set to be greater than Process Commander’s instancecountlimit value.
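The relationship between the two limits can be worked through with a short Java sketch. It assumes that each 10 percent increment is applied to the current instancecountlimit value (the text does not say whether increments compound), and the class and method names are hypothetical. The starting values of 20,000 and 22,000 are the ones quoted above.

```java
class ClassloaderLimits {
    // Raise instancecountlimit by 10 percent of its current value (assumed to compound).
    static int nextInstanceCountLimit(int current) {
        return current + current / 10;
    }

    // -Xmxcl is JVM-wide, so it must be strictly greater than instancecountlimit.
    static boolean isConsistent(int xmxcl, int instanceCountLimit) {
        return xmxcl > instanceCountLimit;
    }

    public static void main(String[] args) {
        int limit = 20000; // default instancecountlimit quoted in the text
        int xmxcl = 22000; // recommended -Xmxcl quoted in the text

        System.out.println("default: instancecountlimit=" + limit + ", -Xmxcl=" + xmxcl
                + (isConsistent(xmxcl, limit) ? " is sufficient" : " must be raised"));

        for (int step = 1; step <= 3; step++) {
            limit = nextInstanceCountLimit(limit);
            System.out.println("increment " + step + ": instancecountlimit=" + limit
                    + ", -Xmxcl=" + xmxcl
                    + (isConsistent(xmxcl, limit) ? " is sufficient" : " must be raised"));
        }
    }
}
```

Because the first increment already brings instancecountlimit up to 22,000, the sketch flags that -Xmxcl needs to be raised at the same time as the property.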
Native OutOfMemory Exceptions (Primarily IBM JVMs)
Native OutOfMemory exceptions have a special signature, and are sometimes accompanied by JVM process termination and the creation of an operating system core file (on UNIX/Linux platforms). The OutOfMemory stack shows Native Method as the highest class in the OutOfMemory exception stack, and this confirms that the exception originated from code running in native memory. Here is an example of a native OutOfMemory exception:
Caused by: java.lang.OutOfMemoryError: ZIP004:OutOfMemoryError, MEM_ERROR in inflateInit2
at java.util.zip.Inflater.init(Native Method)
Troubleshooting Native OutOfMemory Exceptions
Native OutOfMemory exceptions occur when the JVM process has exhausted the memory areas dedicated to native operations. Native operations include all JNI (Java Native Interface) calls, zip inflation and deflation (for BLOBs), JIT (Just-In-Time) method compilation, and any third-party shared library operations. All 32-bit JVM processes are restricted to a 4 GB maximum memory allocation because 32 bits can address at most 4 GB of memory. Both the JVM heap and native memory space must fit into 4 GB. Generally, the more memory allocated to the main JVM heap for Java object allocation, the less that is available for native operations. For example, if you allocate 3 GB of space to the JVM heap, there is only 1 GB of space left for native operations.
To some degree this is an oversimplification of the architecture; however, it is still worthwhile because most often the “cure” for native memory issues is to give the JVM more memory space in which to conduct native operations. There is no true “fix” for native OutOfMemory exceptions in a 32-bit JVM. The process must work within the constraints of a 4 GB maximum addressable memory space, and doing so may mean a tradeoff between main Java heap and native memory spaces. Keep in mind that adding more JVM processes might help alleviate concerns over reductions in the main JVM heap space.
In a Process Commander system, the IBM JVM uses native memory for zip inflation and deflation operations, classloader allocation (for generated rules), and JIT compilation. If the number of generated rules in the Process Commander system is greater than 10,000, it may be necessary to reduce the JVM heap size and/or cap the First Usage Assembly cache to avoid native memory problems.
The IBM JVM typically works with memory in 256 MB chunks, so all JVM heap memory settings should be made in increments of 256 MB. If you are having native memory problems and you decide to reduce the main JVM heap to avoid the issue, keep this rule in mind. For example, if you are having problems while running with a main heap size of 1536 MB, and you only reduce the heap to 1408 MB (a 128 MB reduction), you have not made any more native memory available to the JVM. Instead, you should reduce the main heap to either 1280 MB (freeing up one 256 MB chunk) or 1024 MB (freeing up two 256 MB chunks).
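The chunk arithmetic above can be sketched in Java. This is a simplified model under two stated assumptions taken from the text: a 32-bit process has 4 GB (4096 MB) of address space, and the heap occupies whole 256 MB chunks, rounded up. The class and method names are hypothetical, and the model ignores everything else that competes for address space in a real process.

```java
class NativeHeadroom {
    static final int ADDRESS_SPACE_MB = 4096; // 32-bit process limit quoted in the text
    static final int CHUNK_MB = 256;          // IBM JVM chunk size quoted in the text

    // Number of 256 MB chunks occupied by a heap of the given size, rounded up.
    static int chunksUsed(int heapMb) {
        return (heapMb + CHUNK_MB - 1) / CHUNK_MB;
    }

    // Address space left for native operations under this simplified model.
    static int nativeHeadroomMb(int heapMb) {
        return ADDRESS_SPACE_MB - chunksUsed(heapMb) * CHUNK_MB;
    }

    public static void main(String[] args) {
        int[] heapSizesMb = {3072, 1536, 1408, 1280, 1024};
        for (int heapMb : heapSizesMb) {
            System.out.println("heap " + heapMb + " MB -> " + chunksUsed(heapMb)
                    + " chunks, native headroom " + nativeHeadroomMb(heapMb) + " MB");
        }
    }
}
```

In this model, heaps of 1536 MB and 1408 MB both occupy six chunks, which is why the 128 MB reduction frees nothing for native memory.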
Operating context factors might come into play here. Check whether the system is running with any special JIT compilation switches or custom shared libraries, or whether the application conducts any JNI operations directly. These things also put pressure on the native memory space of the JVM process and may need to be altered to avoid the issue.
“Standard” OutOfMemory Exceptions
Essentially all other OutOfMemory exceptions are “standard.” There is no special message inside or with the OutOfMemory exception stack, and the offending code is not a native method.
Standard OutOfMemory exceptions themselves can be broken down into sub-categories through detailed inspection of the verbose GC log file, and analysis of the user load on the system either through conversation with the customer or measurement with SMA and other tools. The standard OutOfMemory exception sub-categories are:
- Memory usage spike
- Memory leak
- Load exceeds capacity of system, e.g. system was sized for 150 users and 300 are connected.
Memory Usage Spike
Memory spikes are the most frequent cause of OutOfMemory exceptions. A memory spike involves a requestor (user or agent) requesting a large amount of memory space for objects. This could be based on a large database list, or on the creation of a very large HTML stream.
To determine if there has been a memory usage spike on the system, use the GC viewer of your choice for your JVM to graph the verbose GC log file in order to view total memory and free memory after each GC cycle over time. If the memory usage has an obvious spike preceding the OutOfMemory exception, you have identified the type of problem the customer is experiencing.
Sometimes an OutOfMemory exception can occur with a usage spike when there is memory (even significant memory) still available in the heap. This can occur in IBM JVMs when there is significant memory fragmentation, which occurs when there are many small to medium-sized objects allocated over the whole JVM heap, and there isn’t enough contiguous space available for a single large memory request.
For example, if a user runs a report that returns all of the resolved work objects in the last week, a request that requires a single object allocation of 35 MB, and the heap has 250 MB total memory free but only 15 MB in contiguous memory blocks, an OutOfMemory exception is thrown. This is a memory spike problem, but it will also require an evaluation of your JVM arguments to make sure they meet the recommendations in How to set up IBM JVM v1.4.2 and v5.0.
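The fragmentation example can be illustrated with a small Java sketch. It is a toy model, not a description of how the IBM JVM manages its heap: free space is represented as a list of contiguous block sizes, and a request succeeds only if one block is large enough. The block layout is hypothetical, chosen so that it totals 250 MB with no block larger than 15 MB, and the model ignores any compaction the garbage collector might perform.

```java
import java.util.Arrays;

class FragmentationCheck {
    // A single allocation needs one contiguous free block at least as large as the request.
    static boolean canAllocate(long requestMb, long[] freeBlocksMb) {
        for (long block : freeBlocksMb) {
            if (block >= requestMb) {
                return true;
            }
        }
        return false;
    }

    // Total free memory across all blocks.
    static long totalFreeMb(long[] freeBlocksMb) {
        long total = 0;
        for (long block : freeBlocksMb) {
            total += block;
        }
        return total;
    }

    // Hypothetical layout: sixteen 15 MB blocks plus one 10 MB block (250 MB in total).
    static long[] exampleLayout() {
        long[] blocks = new long[17];
        Arrays.fill(blocks, 15L);
        blocks[16] = 10L;
        return blocks;
    }

    public static void main(String[] args) {
        long[] blocks = exampleLayout();
        System.out.println("Total free: " + totalFreeMb(blocks) + " MB");
        System.out.println("35 MB request fits: " + canAllocate(35, blocks));
        System.out.println("15 MB request fits: " + canAllocate(15, blocks));
    }
}
```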
Once the issue is confirmed as a memory usage spike, the next step is to try to find the source of the spike. This is most often done using data in the ALERT log file. The alerts to check are:
- The PEGA0004 alert indicates that a large number of bytes (more than 50 MB, by default) were retrieved from the database in a single interaction. This indicates that a requestor (user or agent) has run a database transaction that requires a very large memory allocation and should be evaluated. This alert is available in V5.1+. See Understanding the PEGA0004 alert - Quantity of data received by database query exceeds limit.
- The PEGA0027 alert tracks the number of database rows returned from a single query and produces a warning if the number exceeds 25,000. This alert can indicate whether very large result sets are being returned by the application and can indicate a memory usage problem. This alert is available in V5.4+. See Understanding the PEGA0027 alert - Number of rows exceeds database list limit.
- The PEGA0029 alert indicates that the amount of stream data sent to a client browser exceeded a default threshold of 2 MB. If the HTML text sent to the client is excessive, building the stream can also cause a memory spike. Java uses UTF-16 encoding, so every character is represented by 2 bytes in the JVM heap. An HTML page with 25,000 characters has a heap allocation of at least 50,000 bytes (50 KB). With complex tables and repeating layouts, the number of characters in an HTML page can grow quite large, so this alert should be watched closely. This alert is available in V5.4 SP. See Understanding the PEGA0029 alert - HTML stream size exceeds limit.
- The PEGA0035 alert indicates that during the creation of a Page List property, the number of elements in the Page List value has exceeded the WARN-level threshold, which is 10K elements by default. A high number of embedded pages can consume substantial memory, and can indicate a report design issue or a looping logic issue. This alert is available in V5.5+. See Understanding the PEGA0035 Alert - A Page List property has a number of elements that exceed a threshold.
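The UTF-16 arithmetic in the PEGA0029 item above can be checked with a short Java sketch. It computes the size of the character data alone, at 2 bytes per UTF-16 code unit, and cross-checks the result by encoding the text with the standard UTF_16LE charset. The class name is hypothetical and the 25,000-character page is synthetic. Object headers and any intermediate copies are not counted, and JVMs newer than those covered in this article may store some strings more compactly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StreamSize {
    // Bytes needed for the characters alone, at 2 bytes per UTF-16 code unit.
    static long utf16Bytes(String text) {
        return 2L * text.length();
    }

    // Builds a synthetic page of the requested length for the demonstration.
    static String syntheticPage(int characters) {
        char[] chars = new char[characters];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) {
        String page = syntheticPage(25000);
        System.out.println("Characters: " + page.length());
        System.out.println("Estimated bytes: " + utf16Bytes(page));
        // Cross-check by encoding the text as UTF-16 without a byte order mark.
        System.out.println("Encoded bytes: " + page.getBytes(StandardCharsets.UTF_16LE).length);
    }
}
```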
Although the alert log is the primary tool to use in tracking memory spikes, keep in mind that information in thread dumps, heap dumps, and other log files from the time of the usage spike may be helpful in finding its source.
Memory Leak
Memory leaks are the second most common cause of standard OutOfMemory exceptions. A memory leak can be identified by graphing the verbose GC output. If it shows linear or accelerating memory growth leading up to the exception, then the system is leaking memory.
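When a graph is not conclusive, the growth trend can be estimated numerically. The following Java sketch fits a least-squares line to a series of "heap used after GC" samples and reports the slope in MB per GC cycle. The class name and the sample values are hypothetical. Extracting the samples from a verbose GC log is omitted because the log format differs between JVM vendors and versions.

```java
class HeapTrend {
    // Least-squares slope of the samples, in MB per GC cycle.
    // Sample i is the heap in use (MB) after the i-th GC cycle.
    static double slope(double[] usedAfterGcMb) {
        int n = usedAfterGcMb.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (double y : usedAfterGcMb) {
            meanY += y;
        }
        meanY /= n;

        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (usedAfterGcMb[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical samples: one steadily growing series, one stable series.
        double[] growing = {400, 412, 425, 431, 447, 460, 468, 483};
        double[] stable = {400, 395, 405, 398, 402, 399, 404, 397};
        System.out.println("Growing series slope: " + slope(growing) + " MB per cycle");
        System.out.println("Stable series slope: " + slope(stable) + " MB per cycle");
    }
}
```

A slope that stays clearly positive over a long run of samples is consistent with a leak, whereas a spike shows up as a sudden jump in an otherwise flat series. What counts as "clearly positive" depends on the system and is a judgment call.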
Fortunately the JVM vendors have already found and resolved many memory leaks, and the first step is to check whether you have applied fixes to your system for known problems. If not, apply any unimplemented fixes and re-evaluate. If your JVM has all the current fixes applied, then a heap dump analysis is required to determine the source of the leak. Heap dumps can be created on both Sun and IBM JVMs, although the methods differ somewhat.
Here is a brief description of how to produce heap dumps for both the Sun and IBM JVMs:
Producing a Sun JVM heap dump
With the Sun JVM, you can enable heap dump creation on a trigger or when an OutOfMemory exception is thrown. You can also trigger heap dumps using the JConsole JMX monitoring program that is packaged with some Sun JDKs.
The JConsole method involves locating the Sun JVM diagnostic MBean that has a method to dump the JVM heap. The JConsole method is recommended; however, it might not be available in all configurations. If it is unavailable, you can specify the following options on the command line to configure heap dump creation.
-XX:+HeapDumpOnOutOfMemoryError
-XX:+HeapDumpOnCtrlBreak
The first argument creates a heap dump file when an OutOfMemory exception is thrown, and the second creates one when a kill -3 pid command is issued for the JVM process. The second option may require additional configuration in some versions of the Sun JVM. Additional information can be found at these sites:
Finally, the Sun JDK’s JMAP utility program can produce heap dumps on the Solaris operating system, but the tool’s limitations make it the least preferred method to produce a Sun JVM heap dump.
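The diagnostic MBean mentioned above can also be invoked from code running inside the JVM. The following Java sketch is an illustration, not a Process Commander facility: it looks up the HotSpotDiagnosticMXBean, which Sun documents as available from Java 6, and asks it to write a heap dump. The class name and the output file name are hypothetical. The sketch applies only to Sun (HotSpot) JVMs, and the target file must not already exist.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

class HeapDumpSketch {
    // Object name under which HotSpot-based JVMs register the diagnostic MBean.
    static final String DIAGNOSTIC_BEAN = "com.sun.management:type=HotSpotDiagnostic";

    // Writes a heap dump of reachable objects to the target file
    // and returns the size of the resulting file in bytes.
    static long dumpHeapTo(File target) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    DIAGNOSTIC_BEAN,
                    HotSpotDiagnosticMXBean.class);
            // The second argument limits the dump to live (reachable) objects.
            bean.dumpHeap(target.getAbsolutePath(), true);
            return target.length();
        } catch (IOException e) {
            throw new IllegalStateException("heap dump failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical location; some JVM versions require the .hprof suffix.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "heap-" + System.currentTimeMillis() + ".hprof");
        System.out.println("Wrote " + dumpHeapTo(target) + " bytes to " + target);
    }
}
```

A heap dump can be as large as the heap itself and pauses the JVM while it is written, so choose the target directory and the timing with care on a production system.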
Producing an IBM JVM heap dump
The IBM JVM typically produces heap dumps by default whenever an OutOfMemory exception is thrown. If the system is not doing this, some WebSphere/IBM JVM variables may need to be set. These can vary slightly based on the version of IBM JVM being used (1.4.2 versus 5 versus 6), so you should consult the diagnostic guide for the version in question (search the WWW for “IBM JDK diagnostic guide”).
The standard option for enabling IBM heap dumps is through the WebSphere environment variable IBM_HEAPDUMP. Set this option to true to enable heap dump creation based on user signals (kill -3 on UNIX/Linux or CTRL-BREAK on Windows).
Once you have heap dump files available, you will need a tool to evaluate and compare them. There is no single way to do this. Most memory leaks are found by comparing multiple heap dumps and looking for object types or classes that grow and grow, often to very large memory allocations. Many tools are available on the WWW for this, including these three:
- The WebAge heapdiff tool compares the objects in two heap dumps. A conversion from PHD (Portable HeapDump) format to text format may be necessary before using the tool. This is used only with IBM heap dumps.
- The IBM Support Assistant (ISA) Workbench includes tools for heap dump, thread dump, and GC analysis for both the Sun and IBM JVMs.
- The Eclipse Memory Analysis Tool (MAT) is used primarily to analyze Sun JVM heap dump files. However, some of the latest versions of the IBM JVM have information in their heap dumps that can be used with MAT.
The help documentation for each tool describes how to use the tool to inspect heap dumps and search for memory leak sources within them.
Load Exceeds Capacity
The excess load sub-category is typically the least likely cause of OutOfMemory exceptions. To determine if system load exceeded the configured capacity, investigate system usage at and around the time of the issue.
Find out if user load has increased recently, and how many users were connected at the time of the exception. If the answers to these questions are not illuminating, you can also evaluate this option by reviewing the Master Agent status messages (if they are available) in the PegaRULES log file. The Master Agent status messages will reveal how much total memory is allocated, how much is currently free, and the number of current active requestors. Here is an example from a V5.x system:
2009-02-09 14:39:57,792 [ Default : 4] ( engine.context.Agent) INFO - System date: Mon Feb 09 14:39:57 EST 2009
Total memory: 373,458,432 Free memory: 64,932,000 Requestor Count: 7
Shared Pages memory usage: 0%
This example shows 7 requestors attached to the system. If the average memory usage is 1 MB per requestor, this number is not a problem. If, however, the average requestor size is much larger (complex Smart Investigate systems can approach 4 MB per requestor), or the number of requestors is much higher, there could be cause for concern.
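When many status messages need to be reviewed, the figures can be extracted programmatically. The following Java sketch parses the memory line from the example above with a regular expression and derives the memory in use. The class name and the pattern are assumptions based only on the sample shown here, so the pattern may need adjusting for other Process Commander versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MasterAgentStatus {
    // Pattern derived from the sample message above; figures may contain thousands separators.
    static final Pattern STATUS = Pattern.compile(
            "Total memory:\\s*([\\d,]+)\\s+Free memory:\\s*([\\d,]+)\\s+Requestor Count:\\s*([\\d,]+)");

    // Returns {totalBytes, freeBytes, requestorCount}, or null if the line does not match.
    static long[] parse(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new long[] {
            Long.parseLong(m.group(1).replace(",", "")),
            Long.parseLong(m.group(2).replace(",", "")),
            Long.parseLong(m.group(3).replace(",", ""))
        };
    }

    // Memory in use is total memory minus free memory.
    static long usedBytes(long[] status) {
        return status[0] - status[1];
    }

    public static void main(String[] args) {
        String line = "Total memory: 373,458,432 Free memory: 64,932,000 Requestor Count: 7";
        long[] status = parse(line);
        if (status == null) {
            System.out.println("Not a Master Agent memory line");
            return;
        }
        System.out.println("Used bytes: " + usedBytes(status));
        System.out.println("Requestors: " + status[2]);
    }
}
```

The used figure covers the whole heap, including caches and other shared structures, so dividing it by the requestor count does not give the per-requestor memory discussed above. It is more useful for tracking how used memory and requestor count move together over time.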
Occasionally there are problems deleting and cleaning up requestors from a system, and this can lead to memory exhaustion. If this is the case, you should find a very large number of requestors recorded in the Master Agent status messages. If you are unsure whether this is a problem or not, evaluate the other options in greater detail before investigating excessive loads.