Clustered job can get stuck between phases
Description
Activity

Pavel Salamon July 19, 2023 at 1:18 PM
Cannot be tested, closing.

Pavel Salamon July 13, 2023 at 3:07 PM
Fixed: when the parent receives events about its children, it now skips the STARTED event if a newer event such as PHASE_FINISHED has already been processed. The dictionary carried by newer events is therefore no longer lost when events are processed out of order.
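The idea behind the fix can be illustrated with a minimal sketch. It is not the actual server code: the class names, the `EventType` ordering and the `onChildEvent` method are all hypothetical, chosen only to show a late STARTED event being ignored once a newer event for the same child run has been processed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; a sketch of the guard, not the real implementation.
class StaleEventGuardSketch {

    // Assumption: event types are declared in lifecycle order, so ordinals can be compared.
    enum EventType { STARTED, PHASE_FINISHED, FINISHED }

    // Simplified child run record; the dictionary stays null until a phase reports it.
    static final class ChildRun {
        final long runId;
        final Map<String, String> dictionary;

        ChildRun(long runId, Map<String, String> dictionary) {
            this.runId = runId;
            this.dictionary = dictionary;
        }
    }

    static final class ParentState {
        private final Map<Long, ChildRun> childRuns = new HashMap<>();
        private final Map<Long, EventType> newestProcessed = new HashMap<>();

        // Returns true if the event was applied, false if it was skipped as stale.
        synchronized boolean onChildEvent(EventType type, ChildRun run) {
            EventType newest = newestProcessed.get(run.runId);
            // Skip a STARTED event that arrives after a newer event was already processed.
            if (type == EventType.STARTED && newest != null && newest.compareTo(type) > 0) {
                return false;
            }
            childRuns.put(run.runId, run);
            if (newest == null || type.compareTo(newest) > 0) {
                newestProcessed.put(run.runId, type);
            }
            return true;
        }

        synchronized ChildRun getChildRun(long runId) {
            return childRuns.get(runId);
        }
    }

    public static void main(String[] args) {
        ParentState parent = new ParentState();
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("rows", "42");

        // Events arrive in mixed order: PHASE_FINISHED first, then the old STARTED.
        parent.onChildEvent(EventType.PHASE_FINISHED, new ChildRun(1L, dictionary));
        boolean applied = parent.onChildEvent(EventType.STARTED, new ChildRun(1L, null));

        System.out.println("late STARTED applied: " + applied);
        System.out.println("dictionary kept: " + (parent.getChildRun(1L).dictionary != null));
    }
}
```

The sketch tracks the newest processed event type per child run. The real fix may detect staleness differently, for example by timestamp or sequence number.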
Any partitioned/clustered job can get stuck indefinitely while moving between phases due to a race condition.
all.log looks like this:
The NPE happens when dictionaries of all the partitions are being merged after a phase finishes. This fails and the job never moves to the next phase.
How to manually reproduce:
Put a breakpoint at com.cloveretl.server.graph.workflow.ChildrenEventsCollector.notifyListener(IServerEvent)
Put a breakpoint at com.cloveretl.server.graph.MasterWatchdogManager.mergeDictionary(RuntimeEnvironment, int)
Run any clustered job with more than 1 phase, e.g. this:
While both breakpoints are hit, let the thread in ChildrenEventsCollector get past the line rEnv.updateChildRun(...)
This step is what corrupts the internal state. The event being processed here is an old JobServerEvent#GRAPH_STARTED event from one of the children, and it carries a run record with a null dictionary. This stale record overwrites the newer run record from the PHASE_FINISHED JMX event, which has the dictionary filled in.
Now step the other thread in the MasterWatchdogManager until you get the NPE.
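The failure mechanism described in these steps can be sketched in isolation. The code below is a simplified stand-in, not the server code: `RunEnv`, `ChildRun` and the method signatures are hypothetical, and the merge is assumed to copy each child's dictionary with `Map.putAll`, which throws a NullPointerException when handed a null map.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for the real classes; only the overwrite-then-merge pattern matters.
class DictionaryOverwriteSketch {

    static final class ChildRun {
        final long runId;
        final Map<String, String> dictionary; // null in the record carried by GRAPH_STARTED

        ChildRun(long runId, Map<String, String> dictionary) {
            this.runId = runId;
            this.dictionary = dictionary;
        }
    }

    static final class RunEnv {
        private final Map<Long, ChildRun> childRuns = new LinkedHashMap<>();

        // Unguarded update: whatever record arrives last wins, even if it is older.
        void updateChildRun(ChildRun run) {
            childRuns.put(run.runId, run);
        }

        // Merges the dictionaries of all child runs; assumes every dictionary is non-null.
        Map<String, String> mergeDictionary() {
            Map<String, String> merged = new HashMap<>();
            for (ChildRun run : childRuns.values()) {
                merged.putAll(run.dictionary); // throws NullPointerException for a null dictionary
            }
            return merged;
        }
    }

    // Replays the mixed event order and reports whether the merge fails with an NPE.
    static boolean mergeFailsAfterStaleUpdate() {
        RunEnv rEnv = new RunEnv();
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("rows", "42");

        rEnv.updateChildRun(new ChildRun(1L, dictionary)); // record from PHASE_FINISHED
        rEnv.updateChildRun(new ChildRun(1L, null));       // stale record from GRAPH_STARTED

        try {
            rEnv.mergeDictionary();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("merge failed with NPE: " + mergeFailsAfterStaleUpdate());
    }
}
```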