Pega Ping service FAQs
Pega Ping is a service that verifies the health of a web node, that is, a node started with -DNodeType=WebUser.
Load balancers can be configured to use this ping service to check the node's health. F5 Big IP can use the URL in creating a health monitor, and AWS Elastic Load Balancer can reference the URL in its Health Check.
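As a sketch of what such a load balancer health monitor does, the following Python snippet probes the ping endpoint and treats HTTP 200 as healthy and anything else (including a connection failure) as unhealthy. The host and endpoint path shown are assumptions; substitute the ping URL for your deployment.

```python
import urllib.request
import urllib.error

# Hypothetical ping URL; substitute the actual ping endpoint of your deployment.
PING_URL = "http://pega-node.example.com:8080/prweb/PRRestService/monitor/pingservice/ping"

def interpret_status(code: int) -> bool:
    """A 200 response means the node is healthy; 500 means unhealthy."""
    return code == 200

def probe(url: str = PING_URL, timeout: float = 5.0) -> bool:
    """Perform one health probe, as a load balancer monitor would."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; a 500 still tells us the node answered "unhealthy".
        return interpret_status(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: treat the node as unhealthy.
        return False
```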
In releases prior to Pega 8.2, Pega Ping is a REST service that performs health checks synchronously (on demand) and returns the status only when a request arrives.
In Pega releases prior to Pega 8.2, checking the health of a Pega node rapidly and reliably is hindered by the following limitations:
- The Health Ping service times out even though the node is healthy, causing nodes to restart multiple times and destabilizing the entire cluster.
- The Health Ping service does not report unhealthy behaviors such as Out of Memory (OOM) errors. OOM might still be raised in third-party code or in code that is not handled by Pega Ping node health monitoring.
- The Health Ping service checks the health of web node processing only, that is, nodes started with -DNodeType=WebUser. It does not consider the BackgroundProcessing, Stream (including DSM), BIX, Search, and Universal node types.
- Remote tracing of REST services interferes with ping service execution times.
In Pega 8.2 and later releases, the Pega Ping service is improved to run health checks asynchronously and periodically.
To benefit from improvements to the Pega Ping service, upgrade to the latest Pega Platform release, at minimum Pega 8.2.
Address your browser to this URL:
If the node is healthy, the URL returns a response code of 200.
If the node is unhealthy, the URL returns a response code of 500. See the related questions "I see status code 500: What additional artifacts do I need to collect?" and "I see status codes indicating a problem, but I do not see any error in the PegaRules log file. Why?"
Pega 8.2 and later releases
In Pega 8.2 and later releases, the Pega Ping response looks like this example:
"node_type":[ "WebUser", "Stream" ] ,
Releases earlier than Pega 8.2
In releases earlier than Pega 8.2, the Pega Ping response looks like this example:
Pega 7.3.1 and earlier releases
In Pega 7.3.1 and earlier releases, the ping service returned the number of active requestors on a particular node. Because ping is a synchronous API, retrieving the requestor count caused performance issues.
Therefore, returning the requestor count was disabled in these earlier releases by setting the DASS disableActiveUserCount to true:
Rules set: Pega-RULES
Setting name: disableActiveUserCount
Setting value: true
Pega 7.4 and later releases and a Pega Cloud Services environment
In Pega 7.4 and later releases, you can count the number of active browser requestors by using this REST service:
This REST service returns results only if you enable the maximum limit of concurrent browser sessions and the environment is a Pega Cloud Services environment.
Set a positive value for cluster/requestors/browser/maxactive in the prconfig.xml file:
<env name="cluster/requestors/browser/maxactive" value="200"/>
For releases prior to Pega 8.2, the Pega Ping service is a REST service, pingService, in the monitor package that runs the activity pzGetHealthStatus.
In Pega 8.2 and later releases, the Pega Ping service does not use the REST infrastructure and no activity is processed. The engine handles ping requests without a requestor context.
Pega 8.2 and later releases
In Pega 8.2 and later releases, several health checks run to determine the health of a node.
If any of the health checks fail, then the node's health is marked as unhealthy and the URL returns a response code of 500.
The Pega Ping response includes the details on the health checks being run and which check failed.
Look at the ping response body (JSON) to see these details.
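As an illustration of scanning that JSON for failing checks, here is a minimal sketch. The "checks" map and its per-check "state" field are assumed names, not taken from a real node response; adapt the key names to the JSON your release actually returns.

```python
import json

def failed_checks(ping_body: str) -> list[str]:
    """Return the names of health checks that did not pass.

    Assumes the response carries a per-check map under "checks";
    both "checks" and the per-check "state" field are assumed names.
    """
    doc = json.loads(ping_body)
    return [name for name, result in doc.get("checks", {}).items()
            if result.get("state") != "healthy"]

# Illustrative response body (field names assumed, not from a real node):
sample = json.dumps({
    "state": "unhealthy",
    "node_type": ["WebUser"],
    "checks": {
        "Streamservice-Check": {"state": "unhealthy"},
        "ServiceRegistry-Check": {"state": "healthy"},
    },
})
```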
Releases prior to Pega 8.2
In releases prior to Pega 8.2, you see an exception in the logs such as "Timed out borrowing service requestor from requestor pool for service package: monitor", or an exception in executing the activity pzGetHealthStatus.
In either case, review the PegaRules logs, which provide more information.
My ping service is returning a 500 status (unhealthy), but reviewing the ping JSON or the PegaRules logs does not help me. Whom do I contact?
Go to My Support Portal to submit a support case (INC) for GCS assistance. See My Support Portal: New Design, Streamlined Features.
If your environment is a Pega Cloud environment, in My Support Portal, select My Pega Cloud to self-manage your Pega Cloud environments.
The GCS engineer will work with the Product or Service team that owns the service failing the node health check:
- HTML-Stream-Check owned by the Engine-as-a-Service team
- Streamservice-Check owned by the Streaming and Large-scale Processing team
- StaleThreadHealthCheck owned by the Decisioning & Analytics team
- ServiceRegistry-Check owned by the Data Sync and Caching team
My ping service health check displays N_CriticalErrorNotification. What does this mean? What do I need to do?
N_CriticalErrorNotification is reported by a health check notification when a critical error occurs in the node, usually an Out of Memory (OOM) error. You need to determine the root cause of the OOM error. See the answer to the next question.
The Pega Ping service also returns an unhealthy status for a node when an Out of Memory (OOM) error occurs in the node; usually, an OOM error marks the node as unhealthy.
However, OOM errors occurring in third-party JAR files are not caught by the node health checks. Because of this limitation, the node health checks catch only about 70 to 80 percent of OOM errors.
When OOM occurs, the Pega Ping response looks like this example:
"node_type":" "[ "WebUser"],
"state":" " "unhealthy",
"node_id":" " "10.150.69.32_envblr85-web-3"
Prior to Pega 8.2, you might encounter the following issues:
- The requestor pool times out borrowing a service requestor for the service package.
- The ping service does not report an unhealthy node when OOM occurs.
In Pega 8.2, the Pega Ping service timeout is fixed and, most of the time, OOM errors mark the node as unhealthy. However, OOM errors occurring in third-party JAR files are not caught by the node health checks. Because of this limitation, the node health checks catch only about 70 to 80 percent of OOM errors.
In Pega 8.2 and later releases, the following improvements provide reliable monitoring and reporting of node health:
- All node health checks are run asynchronously and periodically. You can keep or adjust the default settings.
- Every health check must be completed within the configured time. The default value is 5 seconds. When a health check exceeds the specified time, the health check fails.
- Results of all health checks are aggregated at one place after they run.
- Each check result has an expiry time. For a particular health check, if the result is not updated within the specified time, for example 60 seconds, then the health check fails and the node is reported as unhealthy. This detects whether there is a problem in the background job itself.
- Each component specifies its own health checks and registers them with the Health monitor component.
- Components can specify the NodeType for which the check needs to run during health check registration.
- Only health checks that are registered for the current node type are picked for processing.
- Engine components can publish the health events by specifying the event and the event handler. These results will not expire during result aggregation.
- When a ping request comes from the client, the status of all health checks is aggregated, and the final health status is sent with the JSON response.
- If you encounter an issue with any of the health checks, you can disable those checks using the Data-Admin-System-Setting (DASS) identified in Settings. The disabled health checks are not run in the next cycle of monitoring.
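Taken together, the periodic-check, expiry, and aggregation rules above can be sketched as follows. This is an illustrative model under stated assumptions, not Pega's implementation; the 60-second expiry value comes from the example above.

```python
import time

class HealthAggregator:
    """Illustrative model: background checks record results; a ping
    request aggregates them, failing the node on any failed or stale result."""

    def __init__(self, expiry_seconds: float = 60.0, clock=time.monotonic):
        self.expiry = expiry_seconds
        self.clock = clock          # injectable clock, eases testing
        self.results = {}           # check name -> (passed, timestamp)

    def record(self, check: str, passed: bool) -> None:
        """Called by the background job each time a check completes."""
        self.results[check] = (passed, self.clock())

    def is_healthy(self) -> bool:
        """Aggregate on demand, as when a ping request arrives.

        A check fails the node if it reported a failure, or if its
        result has expired (the background job itself may be stuck).
        """
        now = self.clock()
        for passed, stamp in self.results.values():
            if not passed or now - stamp > self.expiry:
                return False
        return True
```

Injecting the clock makes the expiry behavior easy to verify without real waiting: record a passing result, advance the fake clock past the expiry window, and the node flips to unhealthy.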
Keep or adjust the default settings for monitoring the health of Pega system nodes:
| Setting name | Type | Default value | Description |
| --- | --- | --- | --- |
| monitor/health/monitorInterval | prconfig.xml | 15 (seconds) | Health monitor daemon interval in seconds |
| | | 5000 (milliseconds) | Health monitor check execution timeout in milliseconds |
| | | 120 (seconds) | Health monitor status expiration in seconds |
| | | None | To disable checks dynamically |
You can create Dynamic System Settings (DSSes) from the prconfig.xml settings by following the procedure in Creating a dynamic system setting. For complete information, see Configuring dynamic system settings.