Task execute graph - stays RUNNING forever

Assignee

Kamil Kočí

Reporter

Pavel Salamon

Labels

Green, ready, release-notes, rest-api, vte-detected-by

Sprint

None

Description

An 'execute_graph' record in task_log can stay RUNNING forever, probably due to a race condition.
The underlying graph is triggered and finishes, but the task_log record is not updated and stays RUNNING.

When this happens, the following problems arise:

The health of the item that triggered the task is incorrect, e.g. a schedule stays OK despite the task failing.
An entry in the map lastRecordsIIncludingIds in class HealthMonitor is leaked (it is reloaded from the DB and re-inserted into the map during every iteration).

Detected by the test MonitorApiTest.stateChange, which failed with the message: java.lang.AssertionError: Entity 'http://172.23.3.234:9084/clover/api/rest/v1/schedules/11065' did not get into desired state 'FAIL' and failure count 1 in time.

The test triggers a schedule and expects it to get into FAIL state, but the schedule stayed OK forever because HealthMonitor doesn't process RUNNING task log entries.

Found a possible cause: the full task_log entry is updated and persisted by 2 threads in parallel:

The method TaskProcessorExecuteGraph.doProcess() executes the job and then returns a TaskLog instance, which is persisted later.
But the method also registers a job listener for the executed GraphExecutionCommand. The listener's method processJobEvent is called by a different thread and persists the entire TaskLog instance, which can be obsolete by the time it is called. This can overwrite the correct final result with the obsolete (RUNNING) data. The listener should not persist the entire object, only the 1 field it wants to set.
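
The suspected overwrite can be sketched in isolation. The following is a minimal, self-contained illustration and not the actual server code: TaskLogStore, saveAll, updateRunId and the runId field are hypothetical names invented for the example (the ticket does not name the field the listener sets), and a CountDownLatch forces the unlucky ordering so that the result is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class TaskLogRaceSketch {

    enum Status { RUNNING, FAIL }

    // Simplified, hypothetical shape of one task_log row.
    static final class TaskLog {
        final long id;
        final Status status;
        final Long runId; // assumed to be the one field the listener wants to set

        TaskLog(long id, Status status, Long runId) {
            this.id = id;
            this.status = status;
            this.runId = runId;
        }
    }

    // In-memory stand-in for the task_log table.
    static final class TaskLogStore {
        private final Map<Long, TaskLog> rows = new ConcurrentHashMap<>();

        // Full-entity save: overwrites every column with the caller's copy.
        void saveAll(TaskLog log) {
            rows.put(log.id, log);
        }

        // Targeted update: changes only runId and keeps the stored status.
        void updateRunId(long id, long runId) {
            rows.computeIfPresent(id, (k, cur) -> new TaskLog(cur.id, cur.status, runId));
        }

        TaskLog load(long id) {
            return rows.get(id);
        }
    }

    // Returns the status stored after both threads have written.
    static Status runScenario(boolean listenerSavesWholeEntity) {
        TaskLogStore store = new TaskLogStore();
        TaskLog initial = new TaskLog(1L, Status.RUNNING, null);
        store.saveAll(initial);

        CountDownLatch finalResultPersisted = new CountDownLatch(1);

        // Plays the role of the job listener running on a different thread.
        Thread listener = new Thread(() -> {
            try {
                // Force the bad interleaving: the listener writes last.
                finalResultPersisted.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (listenerSavesWholeEntity) {
                // Persists the stale copy, which still says RUNNING.
                store.saveAll(new TaskLog(initial.id, initial.status, 42L));
            } else {
                store.updateRunId(initial.id, 42L);
            }
        });
        listener.start();

        // Plays the role of the processing thread persisting the final result.
        store.saveAll(new TaskLog(1L, Status.FAIL, null));
        finalResultPersisted.countDown();

        try {
            listener.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.load(1L).status;
    }

    public static void main(String[] args) {
        System.out.println("whole-entity save:   " + runScenario(true));  // RUNNING
        System.out.println("single-field update: " + runScenario(false)); // FAIL
    }
}
```

With the whole-entity save, the listener's stale copy replaces the final FAIL result and the row reads RUNNING again. With the single-field update, the status written by the processing thread survives.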

Steps to reproduce

None

Activity

Lukas Adamek October 19, 2020 at 7:53 AM
Edited

Seems OK on #4

Kamil Kočí October 14, 2020 at 8:48 AM

A new method for persistence after the graph is started was added.

Fixed

Details

Story Points

Priority

Major

Fix versions

rel-5-8-1

QA Testing

UNDECIDED

Created October 9, 2020 at 12:14 PM

Updated September 12, 2023 at 8:45 AM

Resolved October 14, 2020 at 8:49 AM
