Task execute graph - stays RUNNING forever
Sprint
None
Description
Steps to reproduce
None
Activity
Lukas Adamek October 19, 2020 at 7:53 AM (edited)
Seems OK on #4

Kamil Kočí October 14, 2020 at 8:48 AM
A new method for persistence after the graph is started was added.
An 'execute_graph' record in task_log can stay RUNNING forever, probably due to a race condition.
The underlying graph is triggered and finishes, but the task_log record is not updated and stays RUNNING.
When this happens, the following problems arise:
The health of the item that triggered the task is incorrect, e.g. a schedule stays OK despite the task failing.
An entry in the map lastRecordsIIncludingIds in class HealthMonitor is leaked (it is re-loaded from the DB and re-inserted into the map on every iteration).
Detected by the test MonitorApiTest.stateChange, which failed with the message: java.lang.AssertionError: Entity 'http://172.23.3.234:9084/clover/api/rest/v1/schedules/11065' did not get into desired state 'FAIL' and failure count 1 in time.
The test triggers a schedule and expects it to reach the FAIL state, but the schedule stayed OK forever because HealthMonitor doesn't process RUNNING task log entries.
Found a possible cause: the full task_log entry is updated and persisted by two threads in parallel.
The method TaskProcessorExecuteGraph.doProcess() executes the job and then returns a TaskLog instance, which is persisted later.
But the method also registers a job listener for the executed GraphExecutionCommand. The listener's method processJobEvent is called by a different thread and persists the entire TaskLog instance, which can be obsolete by the time it is called. The obsolete (RUNNING) data can then overwrite the correct final result. The listener should not persist the entire object, only the one field it wants to set.
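The suspected lost update can be sketched deterministically, without real threads. Everything in the sketch is a hypothetical stand-in rather than the product's code: the classes TaskLog and TaskLogStore, the methods saveWhole and updateRunId, and the runId field. The ticket does not say which field the listener sets, so runId is only an assumption. The sketch replays the problematic ordering (the listener fires after the final result was persisted) and compares a whole-object save with a single-field update.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of the race; not the real server classes.
class TaskLogRaceSketch {

    enum Status { RUNNING, FAILED }

    // Immutable snapshot of one task_log row (fields are assumptions).
    static final class TaskLog {
        final long id;
        final Status status;
        final Long runId; // assumed to be the one field the listener sets

        TaskLog(long id, Status status, Long runId) {
            this.id = id;
            this.status = status;
            this.runId = runId;
        }
    }

    // In-memory stand-in for the task_log table.
    static final class TaskLogStore {
        private final Map<Long, TaskLog> rows = new ConcurrentHashMap<>();

        // Persists every field, including a possibly stale status.
        void saveWhole(TaskLog log) {
            rows.put(log.id, log);
        }

        // Atomically changes only runId and keeps the stored status.
        void updateRunId(long id, long runId) {
            rows.computeIfPresent(id, (key, current) ->
                    new TaskLog(current.id, current.status, runId));
        }

        TaskLog load(long id) {
            return rows.get(id);
        }
    }

    // Replays the ordering from the ticket and returns the stored status.
    static Status finalStatus(boolean listenerSavesWholeObject) {
        TaskLogStore store = new TaskLogStore();
        TaskLog initial = new TaskLog(1L, Status.RUNNING, null);
        store.saveWhole(initial);

        // The listener keeps a reference to the RUNNING snapshot.
        TaskLog stale = initial;

        // The processing thread persists the final result first.
        store.saveWhole(new TaskLog(1L, Status.FAILED, null));

        // The listener fires late, on another thread in the real system.
        if (listenerSavesWholeObject) {
            store.saveWhole(new TaskLog(stale.id, stale.status, 42L));
        } else {
            store.updateRunId(1L, 42L);
        }
        return store.load(1L).status;
    }

    public static void main(String[] args) {
        System.out.println("whole-object save:   " + finalStatus(true));
        System.out.println("single-field update: " + finalStatus(false));
    }
}
```

With the whole-object save, the stored status reverts to RUNNING. With the single-field update, the FAILED result survives. In a database-backed store, the equivalent would presumably be an UPDATE statement that touches only the one column.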