Clustered job can get stuck between phases

Assignee

Pavel Salamon

Reporter

Pavel Salamon

Labels

Blue, cluster, release-notes, reliability

Sprint

None

Description

Any partitioned/clustered job can get stuck indefinitely while moving between phases due to a race condition.

all.log looks like this:

The NPE (NullPointerException) happens when the dictionaries of all partitions are merged after a phase finishes. The merge fails and the job never moves to the next phase.

How to manually reproduce:

Put a breakpoint on com.cloveretl.server.graph.workflow.ChildrenEventsCollector.notifyListener(IServerEvent)
Put a breakpoint on com.cloveretl.server.graph.MasterWatchdogManager.mergeDictionary(RuntimeEnvironment, int)
Run any clustered job with more than 1 phase, e.g. this:

Once both breakpoints are hit, let the thread in ChildrenEventsCollector run past the line rEnv.updateChildRun(...)
- This is what corrupts the internal state. The event being processed here is an old JobServerEvent#GRAPH_STARTED event from one of the children, and it carries a run record with a null dictionary. This stale data overwrites the newer run record from the PHASE_FINISHED JMX event, which has the dictionary filled in.
Now step the other thread in MasterWatchdogManager until you get the NPE.
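
The overwrite described in the steps above can be sketched deterministically, without a debugger. This is only an illustration under stated assumptions: RunRecord, ParentState, mergeDictionaries and StaleEventDemo are hypothetical stand-ins invented for this sketch, not the real server classes, and the two events are applied sequentially in the "wrong" order instead of racing on two threads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a child's run record (not the real server class).
final class RunRecord {
    final String lastEvent;
    final Map<String, String> dictionary; // null in the stale GRAPH_STARTED record

    RunRecord(String lastEvent, Map<String, String> dictionary) {
        this.lastEvent = lastEvent;
        this.dictionary = dictionary;
    }
}

// Hypothetical stand-in for the parent's view of its children.
final class ParentState {
    private final Map<Integer, RunRecord> children = new HashMap<>();

    // Unconditional overwrite: whichever event is processed last wins.
    void updateChildRun(int childId, RunRecord record) {
        children.put(childId, record);
    }

    // Merges all child dictionaries; throws NullPointerException if one is null.
    Map<String, String> mergeDictionaries() {
        Map<String, String> merged = new HashMap<>();
        for (RunRecord record : children.values()) {
            merged.putAll(record.dictionary);
        }
        return merged;
    }
}

final class StaleEventDemo {
    // Returns true if the merge fails with an NPE after the stale overwrite.
    static boolean reproduce() {
        ParentState parent = new ParentState();
        Map<String, String> dict = new HashMap<>();
        dict.put("rowCount", "42");

        // Newer event arrives first and carries a filled-in dictionary.
        parent.updateChildRun(1, new RunRecord("PHASE_FINISHED", dict));
        // Old event is processed late and overwrites it with a null dictionary.
        parent.updateChildRun(1, new RunRecord("GRAPH_STARTED", null));

        try {
            parent.mergeDictionaries();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + reproduce());
    }
}
```

In the real server the two updates come from different threads, so the ordering depends on timing; the breakpoints in the steps above only force the unlucky interleaving.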

Steps to reproduce

None

Activity

Pavel Salamon July 19, 2023 at 1:18 PM

Cannot be tested, closing.

Pavel Salamon July 13, 2023 at 3:07 PM

Fixed: when the parent receives events about its children, it skips the STARTED event if a newer event such as PHASE_FINISHED has already been processed. The dictionary from newer events is therefore not lost when events are processed out of order.
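
A minimal sketch of the kind of guard described in this comment, assuming event types can be ordered by how far the child has progressed. ChildEventType, ChildEvent and GuardedChildState are hypothetical names made up for illustration; the actual fix in the server code may be structured differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical event types; declaration order is assumed to match logical order.
enum ChildEventType { GRAPH_STARTED, PHASE_FINISHED }

// Hypothetical event carrying the child's dictionary (null for GRAPH_STARTED).
final class ChildEvent {
    final ChildEventType type;
    final Map<String, String> dictionary;

    ChildEvent(ChildEventType type, Map<String, String> dictionary) {
        this.type = type;
        this.dictionary = dictionary;
    }
}

// Hypothetical parent-side state that ignores a late GRAPH_STARTED event.
final class GuardedChildState {
    private final Map<Integer, ChildEvent> latest = new HashMap<>();

    // Returns true if the event was applied, false if it was skipped as stale.
    synchronized boolean onChildEvent(int childId, ChildEvent event) {
        ChildEvent current = latest.get(childId);
        boolean staleStart = current != null
                && event.type == ChildEventType.GRAPH_STARTED
                && current.type.ordinal() > event.type.ordinal();
        if (staleStart) {
            return false; // a newer event was already processed; keep its data
        }
        latest.put(childId, event);
        return true;
    }

    synchronized Map<String, String> dictionaryOf(int childId) {
        ChildEvent event = latest.get(childId);
        return event == null ? null : event.dictionary;
    }
}

final class GuardDemo {
    public static void main(String[] args) {
        GuardedChildState state = new GuardedChildState();
        Map<String, String> dict = Collections.singletonMap("rowCount", "42");

        state.onChildEvent(1, new ChildEvent(ChildEventType.PHASE_FINISHED, dict));
        boolean applied = state.onChildEvent(1, new ChildEvent(ChildEventType.GRAPH_STARTED, null));

        System.out.println("stale event applied: " + applied);
        System.out.println("dictionary kept: " + (state.dictionaryOf(1) != null));
    }
}
```

The methods are synchronized so that the check and the update happen as one step; without that, two threads could still interleave between the comparison and the put.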

Fixed

Details

Priority

Major

Fix versions

rel-6-2-0

Zendesk ticket

215422

QA Testing

UNDECIDED

Components

Created July 12, 2023 at 3:46 PM

Updated September 22, 2023 at 10:54 AM

Resolved July 19, 2023 at 1:17 PM
