Troubleshooting Elasticsearch performance with TCP network analysis
The FTSIncrementalIndexer agent is running on multiple nodes, and the volume of cases (work items) being updated in the application is very high. The agent is able to process the work items in its queue. However, when a new case is created after the queue is empty, there is a delay before the work item can be found from the Search gadget in the user portal.
The delay is intermittent: some work items can be found after 2 minutes, and others only after 10 minutes. For example, a search for case W-xxxx ran for more than 15 minutes before it timed out.
In successful scenarios, the TCP transmission takes 14 minutes without any retransmission attempts. In a failure scenario, such as the one for example case W-xxxx, there are many retransmission attempts.
The example screen below shows the Wireshark network analysis tool with a filter on the specific port used to send the index request for the example case W-xxxx. Node1, the primary indexing node, receives no corresponding packets. Therefore, the root cause of the problem appears to be at the network layer, where some of the packets are not being transmitted successfully.
Wireshark analysis for FTS index request on example case
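To make the retransmission reasoning concrete, here is a minimal Python sketch that counts segments that resend data already sent. It is a simplified heuristic and not Wireshark's actual analysis logic. The function name `count_retransmissions` and the `(sequence_number, payload_length)` input format are assumptions made for illustration; the input is assumed to hold one direction of one TCP connection in capture order.

```python
from typing import Iterable, Tuple


def count_retransmissions(segments: Iterable[Tuple[int, int]]) -> int:
    """Count data segments that carry no bytes beyond what was already sent.

    Each item is (sequence_number, payload_length) for ONE direction of ONE
    TCP connection, in capture order. Sequence-number wraparound and timing
    are ignored, so this is only a rough approximation.
    """
    highest_end = None  # one past the highest byte sent so far
    retransmissions = 0
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data and are skipped
        end = seq + length
        if highest_end is not None and end <= highest_end:
            # Every byte in this segment was already sent earlier.
            retransmissions += 1
        else:
            highest_end = end
    return retransmissions


if __name__ == "__main__":
    # Hypothetical flow: the segment starting at 1100 is sent three times.
    flow = [(1000, 100), (1100, 100), (1100, 100), (1200, 50), (1100, 100)]
    print(count_retransmissions(flow))
```

A count of zero corresponds to the successful scenario described above. A high count for the connection that carries the index request corresponds to the failure scenario.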
- Apply HFix-25085 for Pega 7.1.8 or upgrade to the latest Pega Platform release.
If you are not using Pega 7.1.8 and cannot upgrade to the latest Pega Platform release, submit a Support Request for a hotfix for the Pega 7 Platform release that you are using.
- If you apply HFix-25085 for Pega 7.1.8, follow the instructions in the Resolution of SA-19403, https://pdn.pega.com/support-articles/slow-performance-after-upgrade-pega-718.
- If, after installing HFix-25085, full-text search performance is still slow, review the TCP dumps for your environment and case volume.
- For example, to investigate searches for work objects after their creation, run the TCP trace while searching for a case and compare the trace with the example screen in the Explanation.
- If you are using Linux, run the following commands for TCP keepalive:
Linux commands for TCP keepalive
By default, the TCP keepalive time is 7200 seconds.
- Reduce the TCP keepalive time or adjust it according to your needs so that the retransmission requests for particular work items do not fail.
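The current keepalive values can also be inspected programmatically. The following Python sketch reads them from the standard Linux location `/proc/sys/net/ipv4`. It is not the set of commands from the original screenshot. The function name `read_keepalive_settings` and its `base_dir` parameter are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# Kernel settings that control TCP keepalive behavior on Linux.
KEEPALIVE_SETTINGS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_keepalive_settings(
    base_dir: str = "/proc/sys/net/ipv4",
) -> Dict[str, Optional[int]]:
    """Return each keepalive setting, or None where it cannot be read."""
    settings: Dict[str, Optional[int]] = {}
    for name in KEEPALIVE_SETTINGS:
        try:
            settings[name] = int((Path(base_dir) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None  # missing file or non-numeric content
    return settings


if __name__ == "__main__":
    for name, value in read_keepalive_settings().items():
        print(f"{name} = {value}")
```

On Linux, these values are typically changed as root with `sysctl -w net.ipv4.tcp_keepalive_time=<seconds>`. The right value depends on your environment, so no specific number is suggested here.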
- Make sure that there is an adequate number of open files allowed for the system:
Check the operating system’s limit for open files for both the application server user account and the root user account.
The following example shows the ulimit command run on the Linux operating system, reporting the value 16000.
Linux ulimit command
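The same check can be scripted. The sketch below reads the open-file limit of the current process with Python's standard `resource` module and compares it with a threshold. The threshold of 16000 is the article's example value, used here as an assumption rather than a documented minimum. The function names are hypothetical.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_adequate(soft_limit: int, required: int = 16000) -> bool:
    """Check a soft limit against an assumed required number of open files."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited setting always satisfies the requirement
    return soft_limit >= required


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard} adequate={is_limit_adequate(soft)}")
```

Because limits apply per user account, run the script once as the application server user and once as root to cover both accounts.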