Support Article
OutOfMemory exceptions in batch node
SA-9416
Summary
One batch node is experiencing OutOfMemory (OOM) exceptions and has crashed multiple times today.
Error Messages
There are database connection errors as well as CPU starvation issues.
Recent info from Pega logs:
2015-04-03 14:09:29,731 [j2ee14_ws,maxpri=10]] [ STANDARD] [ ] (.access.DatabaseConnectionImpl) ERROR - Couldn't obtain a connection. Refresh the DataSource, and try again
2015-04-03 14:09:32,999 [j2ee14_ws,maxpri=10]] [ STANDARD] [ ] (riv.factory.ObjectArrayFactory) INFO - Factory-Internal pool expansion for ObjectArray[1] from 20 up to 40.
2015-04-03 14:09:33,002 [j2ee14_ws,maxpri=10]] [ STANDARD] [ -:03.01] (.access.DatabaseConnectionImpl) ERROR - Couldn't obtain a connection. Refresh the DataSource, and try again
[3/31/15 13:11:12:878 CDT] 00000216 ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member [----]: Did not receive adequate CPU time slice. Last known CPU usage time at 13:10:10:470 CDT. Inactivity duration was 32 seconds.
[3/31/15 13:11:12:882 CDT] 000000c0 CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 32 seconds.
[3/31/15 13:11:55:255 CDT] 000000c0 CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 12 seconds.
[3/31/15 13:12:30:550 CDT] 000000c0 CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 5 seconds.
[3/31/15 13:13:10:583 CDT] 000000c0 CoordinatorCo W HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 10 seconds.
Steps to Reproduce
Unknown.
Root Cause
Deferred operations were filling the heap. This occurred because the case structure includes a cover case with more than 10,000 covered objects. A batch process attempted to update this case and all of its covered objects. The update is performed as a deferred operation, which requires all covered objects to be on the clipboard at the same time. As a result, a single deferred operation filled the heap and caused the OOM condition.
This was diagnosed by analyzing a heap dump: a single DeferredOperationsImpl object occupied more than 3 GB of the 4 GB heap at the time of the OOM condition.
Resolution
This issue is resolved through the following local change: Redesign the batch process so that it does not attempt to update all covered objects in a case at once, or redesign the case structure so that a single cover object does not have thousands of covered objects.
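The first option, updating covered objects in smaller groups, can be illustrated with a minimal plain-Java sketch. It is not Pega API code: the class ChunkedCoverUpdate, the method processInChunks, the "COVERED-" keys, and the chunk size of 500 are all hypothetical names and values chosen for illustration. The sketch also assumes that the real batch process can save and commit each group independently, so that only one group of covered objects is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ChunkedCoverUpdate {

    // Hypothetical helper, not a Pega API: hands the keys to the processor
    // in consecutive chunks of at most chunkSize and returns the chunk count.
    static <T> int processInChunks(List<T> keys, int chunkSize, Consumer<List<T>> processor) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        for (int start = 0; start < keys.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, keys.size());
            // Copy the slice so each chunk is independent of the full key list.
            processor.accept(new ArrayList<>(keys.subList(start, end)));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Illustrative keys standing in for the covered objects of one cover case.
        List<String> coveredKeys = new ArrayList<>();
        for (int i = 0; i < 10_250; i++) {
            coveredKeys.add("COVERED-" + i);
        }
        int chunks = processInChunks(coveredKeys, 500, chunk -> {
            // Assumption: open, update, save, and commit only this chunk here,
            // then release its pages before the next chunk is loaded.
        });
        System.out.println("Processed " + chunks + " chunks");
    }
}
```

A suitable chunk size depends on the size of each covered object and on the available heap, so it would need to be tuned against the actual system.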
As a temporary workaround, a JVM with a very large heap could be provisioned to hold the problematic object and allow processing to complete.
Published June 12, 2015 - Updated October 8, 2020