java.lang.OutOfMemoryError on indexing node
In our production environment we encountered, as described in the previous SR (SR-115587), the following exception: java.lang.OutOfMemoryError: GC overhead limit exceeded.
As per your suggestion we enabled heap space logging on our JVM, so we are now able to provide you with a heap dump.
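For reference, GC logging and automatic heap dumps on OutOfMemoryError can be enabled with standard HotSpot options along these lines. This is only an illustrative sketch (the paths are placeholders, not our actual configuration), using the Java 6/7 flag syntax that matches this environment:

```shell
# Illustrative JVM options only; paths are placeholders.
# Write a heap dump automatically when an OutOfMemoryError is thrown:
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps"
# Log GC activity with timestamps for later analysis (Java 6/7 syntax):
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/gc.log"
```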
In the provided logs you can see that the other exception described in SR-115587 (LuceneServiceWorkPackage: com.pega.pegarules.pub.PRException: Timed out borrowing service requestor from requestor pool) is still present.
That is because we have not yet changed any of the parameters for the LuceneServiceWorkPackage service package; we are still testing new values in our preproduction environment.
Please note that our production environment is the same as described in SR-115587. No new settings have been introduced, and the work-indexer agent is still not running on a dedicated JVM.
With this SR we kindly ask for your support in analyzing the heap dump to precisely identify the reason behind the java.lang.OutOfMemoryError exception.
This analysis will greatly speed up the implementation of the already proposed solutions (a dedicated JVM for indexing and a timeout setting for the Lucene service package), should the exception indeed be raised due to those factors.
Please note that the heap dump is about 1 GB compressed. We uploaded it to a shared space at http://we.tl/RYJgvs1IBK; if you prefer, we can try to upload it to your web application.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Steps to Reproduce
Specific to our environment.
The root cause of this problem is non-optimal setup of the PRPC operating environment. The customer runs three PRPC nodes in their environment; all of them are user nodes, but one is also the search/indexing node.
Now they are facing "OutOfMemoryError: GC overhead limit exceeded" exceptions.
After analyzing some heap dumps provided by the customer, we found that plenty of free heap space remains when this error occurs. Our assumption is therefore that the exception is caused by heavy heap fragmentation, as can occur due to the work of the Lucene indexer.
Together with sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval being set to one hour, this can cause the full GC to exceed the configured execution time limit.
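As a sketch, the RMI distributed GC intervals mentioned above are configured via JVM system properties. The one-hour value shown matches the configuration described; the exact placement in the startup script is illustrative:

```shell
# Illustrative only: RMI distributed GC intervals of one hour (3600000 ms).
# The RMI subsystem triggers a full GC at most once per interval.
JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.dgc.client.gcInterval=3600000"
JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.dgc.server.gcInterval=3600000"
```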
Given that the fragmentation is really just causing the full GC to take too much time, it is possible either to switch off the monitoring or to increase the threshold, as described in "Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning" (http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html), chapter "Excessive GC Time and OutOfMemoryError". Reducing the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval settings may also help, although the latter may have a negative impact elsewhere.
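A hedged sketch of the two options just mentioned follows. These are standard HotSpot flags for Java 6; the percentage values shown are the JVM defaults, included here only to illustrate which knobs would be raised or lowered:

```shell
# Option 1 (illustrative): switch off the GC overhead limit check entirely,
# so "GC overhead limit exceeded" is never thrown:
JAVA_OPTS="$JAVA_OPTS -XX:-UseGCOverheadLimit"

# Option 2 (illustrative): adjust the thresholds instead. By default the error
# is thrown when more than 98% of time is spent in GC while less than 2% of
# the heap is recovered; these values are the defaults shown for reference:
JAVA_OPTS="$JAVA_OPTS -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2"
```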
According to the customer, they want to resolve this issue by making the following change to the PRPC operating environment: setting up a dedicated node for search and indexing, with settings optimised for that purpose.