Clustered job can get stuck between phases

Description

Any partitioned/clustered job can get stuck indefinitely while moving between phases due to a race condition.

all.log looks like this:

The NPE happens while the dictionaries of all partitions are being merged after a phase finishes. The merge fails, so the job never moves to the next phase.

How to manually reproduce:

  1. Put a breakpoint in com.cloveretl.server.graph.workflow.ChildrenEventsCollector.notifyListener(IServerEvent)

  2. Put a breakpoint in com.cloveretl.server.graph.MasterWatchdogManager.mergeDictionary(RuntimeEnvironment, int)

  3. Run any clustered job with more than 1 phase, e.g. this:

  4. Once both breakpoints are hit, let the thread in ChildrenEventsCollector run past the line rEnv.updateChildRun(...)

    • This is what corrupts the internal state: the event being processed here is an old JobServerEvent#GRAPH_STARTED event from one of the children. It carries a run record with a null dictionary, and this stale record overwrites the newer run record from the PHASE_FINISHED JMX event, which has the dictionary filled in.

  5. Now step the other thread in MasterWatchdogManager until you get the NPE.
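
The effect of the steps above can be illustrated without a debugger. The following is a minimal, single-threaded sketch and not the server code: RunRecord, the childRuns map, and both methods are hypothetical stand-ins (the updateChildRun here only borrows the name from the ticket), and it assumes the merge dereferences each child's dictionary without a null check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the parent-side state; none of these types are the real server classes.
final class StaleEventOverwriteSketch {

    /** Simplified run record: the dictionary stays null until the phase has finished. */
    static final class RunRecord {
        final long runId;
        final Map<String, String> dictionary; // may be null

        RunRecord(long runId, Map<String, String> dictionary) {
            this.runId = runId;
            this.dictionary = dictionary;
        }
    }

    // Latest known record per child run.
    private final Map<Long, RunRecord> childRuns = new HashMap<>();

    /** Stand-in for the unguarded update: whichever record arrives last wins. */
    void updateChildRun(RunRecord record) {
        childRuns.put(record.runId, record);
    }

    /** Stand-in for the merge step; assumed to dereference every dictionary. */
    Map<String, String> mergeDictionaries() {
        Map<String, String> merged = new HashMap<>();
        for (RunRecord record : childRuns.values()) {
            merged.putAll(record.dictionary); // throws NullPointerException for a null dictionary
        }
        return merged;
    }

    /** Replays the problematic ordering and reports whether the merge failed with an NPE. */
    static boolean reproduces() {
        StaleEventOverwriteSketch parent = new StaleEventOverwriteSketch();
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("key", "value");

        parent.updateChildRun(new RunRecord(1L, dictionary)); // record from PHASE_FINISHED
        parent.updateChildRun(new RunRecord(1L, null));       // delayed record from GRAPH_STARTED

        try {
            parent.mergeDictionaries();
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }
}
```

In this model the order of the two updateChildRun calls is the only thing that matters: swapping them lets the merge succeed, which is consistent with the failure depending on thread timing.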

 

Steps to reproduce

None

Attachments

1

Activity

Pavel Salamon July 19, 2023 at 1:18 PM

Cannot be tested, closing.

Pavel Salamon July 13, 2023 at 3:07 PM

Fixed - when the parent receives events about its children, it now skips a STARTED event if a newer event, such as PHASE_FINISHED, has already been processed. The dictionary from newer events is therefore no longer lost when events are processed out of order.
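
The guard described in this comment might look roughly like the sketch below. The actual change is not shown in this ticket, so everything here is an assumption made for illustration: the EventType enum, the ChildEvent class, and the onChildEvent method are hypothetical names, and only the two event types mentioned above are modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "skip stale STARTED events" guard; not the actual fix.
final class StaleStartedEventGuardSketch {

    enum EventType { GRAPH_STARTED, PHASE_FINISHED }

    /** Simplified child event carrying the run's dictionary (null for GRAPH_STARTED). */
    static final class ChildEvent {
        final long runId;
        final EventType type;
        final Map<String, String> dictionary;

        ChildEvent(long runId, EventType type, Map<String, String> dictionary) {
            this.runId = runId;
            this.type = type;
            this.dictionary = dictionary;
        }
    }

    // Last event applied per child run.
    private final Map<Long, ChildEvent> latest = new HashMap<>();

    /** Applies the event unless it is a late GRAPH_STARTED; returns true if it was applied. */
    boolean onChildEvent(ChildEvent event) {
        ChildEvent known = latest.get(event.runId);
        boolean stale = event.type == EventType.GRAPH_STARTED
                && known != null
                && known.type != EventType.GRAPH_STARTED;
        if (stale) {
            return false; // a newer event was already processed, keep its data
        }
        latest.put(event.runId, event);
        return true;
    }

    /** Dictionary currently recorded for the run, or null if none is known. */
    Map<String, String> dictionaryOf(long runId) {
        ChildEvent event = latest.get(runId);
        return event == null ? null : event.dictionary;
    }
}
```

With this guard, delivering PHASE_FINISHED before a delayed GRAPH_STARTED leaves the dictionary from PHASE_FINISHED in place, while the normal order (GRAPH_STARTED first) behaves as before.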

Fixed

Details

QA Testing

UNDECIDED

Created July 12, 2023 at 3:46 PM
Updated September 22, 2023 at 10:54 AM
Resolved July 19, 2023 at 1:17 PM