Data Flow Run performance degrades
Pega Platform 7.3.1 is hosted on WebSphere with a SQL Server 2016 SP2 database. The user has configured a data flow that takes a long time to complete.
The data flow's primary database table dataset holds 100K records and uses a Compose operation to join against four secondary database table datasets, each of which contains around 10-13 million records. The data flow run takes more than 20 hours to complete.
Steps to Reproduce
- Create a data flow with a primary database table dataset containing 100K records.
- Use the Compose operation in the data flow to perform a join against the four secondary database table datasets, each of which holds around 10-13 million records. The data flow run takes a long time to complete on the Pega instance hosted on WebSphere with SQL Server 2016.
Root Cause
A defect or configuration issue in the operating environment.
The Partition key was not configured on the primary dataset, so the Pega application processed one record at a time. In addition, the database query that the Compose operation generates to fetch 250 records (the default batch size) from the secondary datasets contains a lengthy IN clause with one bind variable per record. Because of the long list of bind variables, the database query optimizer ignored the index.
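To illustrate why the default batch size produces such a long filter, here is a minimal Python sketch of the shape of query a batched lookup generates, with one bind variable per primary-record key. This is not Pega's actual implementation, and the table and column names are hypothetical:

```python
def compose_lookup_sql(table, key_col, batch_keys):
    """Build a batched lookup query: one '?' bind variable per key.

    With the default batch size of 250, the IN list contains 250 bind
    variables, which some query optimizers decline to satisfy via an index.
    """
    placeholders = ", ".join("?" for _ in batch_keys)
    return f"SELECT * FROM {table} WHERE {key_col} IN ({placeholders})"

# Hypothetical secondary dataset table and key column, default batch size.
sql = compose_lookup_sql("SECONDARY_DS", "CUSTOMER_ID", range(250))
```

Lowering the batch size shrinks the IN list proportionally, which is why the resolution below suggests experimenting with smaller values.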
Resolution
Perform the following local change:
- Configure a Partition key in the primary dataset.
For best performance, choose a partitioning key property with neither too many nor too few distinct values.
A partition key allows the data flow's batch processing to be distributed across threads on a single node and across nodes in a multi-node system.
- Verify the data flow run while lowering the batch size to values such as 200, 100, 50, and 25, and use the batch size that provides optimal performance.
Some databases ignore indexes when a long list of bind variables is passed in a query. Reducing the batch size helps in such cases. The batch size in Pega Decisioning is set using the following Dynamic System Settings (DSS) record:
Owning Ruleset: Pega-DecisionEngine
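The partition-key behavior described in the resolution above can be sketched in Python. This is a hedged illustration, not Pega's implementation; the record fields and worker count are hypothetical. Hashing each record's partition-key value assigns it to a bucket, and each bucket can then be processed by its own thread or node:

```python
from collections import defaultdict

def assign_partitions(records, key_name, num_workers):
    """Distribute records into num_workers buckets by hashing the
    partition-key value, so each bucket can be processed in parallel."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[hash(rec[key_name]) % num_workers].append(rec)
    return buckets

# Hypothetical primary-dataset records with a "region" partition key.
records = [{"id": i, "region": f"R{i % 8}"} for i in range(1000)]
buckets = assign_partitions(records, "region", 4)
```

Without a partition key, all records fall into a single bucket and are processed serially, which matches the single-record-at-a-time behavior described in the root cause.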