Event listeners with failure ratio recover after first successful run

Description

Some types of event listeners (File, JMS, Universal) that are configured to be marked as failing when their failure ratio exceeds a threshold are incorrectly marked as working again immediately after their first successful execution.

Example:

  • A File event listener periodically checks a remote directory for new files. It is configured with a 10% failure ratio threshold.

  • Due to a network outage, the failure ratio exceeds 10% and the listener is marked as failing.

  • When the connection is restored:

    • Expected behavior: The listener should stay in the failing state until enough subsequent successful executions bring the failure ratio back below 10%.

    • Actual behavior: The listener is incorrectly marked as working after the first successful run.
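The expected and actual behavior above can be contrasted in a minimal Java sketch. Everything in it is hypothetical: `FailureRatioDemo`, `FailureRatioTracker`, `ListenerState`, `record`, and `recordBuggy` are illustrative names, not CloverETL Server APIs. The sketch assumes the failure ratio is failures divided by total runs since the last state reset, takes the minimum of 3 runs from the testing notes on this issue, and, as an additional assumption, keeps the current state when the ratio exactly equals the threshold.

```java
/**
 * Minimal sketch contrasting the reported bug with the expected behavior.
 * All names are hypothetical; this is not CloverETL Server code.
 */
class FailureRatioDemo {

    enum ListenerState { OK, FAILING }

    /** Assumes ratio = failures / total runs since the last state reset. */
    static final class FailureRatioTracker {
        private final double threshold; // e.g. 0.10 for a 10% failure ratio
        private final int minRuns;      // runs required before the ratio is evaluated
        private int runs;
        private int failures;
        private ListenerState state = ListenerState.OK;

        FailureRatioTracker(double threshold, int minRuns) {
            this.threshold = threshold;
            this.minRuns = minRuns;
        }

        double failureRatio() {
            return runs == 0 ? 0.0 : (double) failures / runs;
        }

        ListenerState state() {
            return state;
        }

        /** Reported (buggy) behavior: any successful run resets the state to OK. */
        ListenerState recordBuggy(boolean success) {
            count(success);
            if (success) {
                state = ListenerState.OK;
            } else if (runs >= minRuns && failureRatio() > threshold) {
                state = ListenerState.FAILING;
            }
            return state;
        }

        /** Expected behavior: the state follows the failure ratio, not the last result. */
        ListenerState record(boolean success) {
            count(success);
            if (runs < minRuns) {
                return state; // too few runs to judge the ratio yet
            }
            double ratio = failureRatio();
            if (ratio > threshold) {
                state = ListenerState.FAILING;
            } else if (ratio < threshold) {
                state = ListenerState.OK;
            }
            // ratio == threshold: keep the current state (assumed boundary handling)
            return state;
        }

        private void count(boolean success) {
            runs++;
            if (!success) {
                failures++;
            }
        }
    }

    public static void main(String[] args) {
        FailureRatioTracker buggy = new FailureRatioTracker(0.10, 3);
        FailureRatioTracker fixed = new FailureRatioTracker(0.10, 3);
        // Illustrative history: 20 successful runs, then an outage causes 5 failures (5/25 = 20%)
        for (int i = 0; i < 20; i++) { buggy.recordBuggy(true); fixed.record(true); }
        for (int i = 0; i < 5; i++) { buggy.recordBuggy(false); fixed.record(false); }
        // Connection restored: a single successful run (5/26, about 19.2%)
        System.out.println("buggy: " + buggy.recordBuggy(true)); // buggy: OK
        System.out.println("fixed: " + fixed.record(true));      // fixed: FAILING
    }
}
```

With these illustrative numbers, the buggy variant reports OK after the first success, while the expected variant stays FAILING at 5/26 and only returns to OK after 25 further successes bring the ratio to 5/51, just under 10%.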

Affected types of event listeners:

  • File listener

  • JMS message listener

  • Universal listener

Fix the tests matching com.cloveretl.server.test.events.file.*ListenerChangeStatusTest.

Steps to reproduce

None

Activity

Filip Vaško October 19, 2020 at 8:49 AM

Testing.

  • Once the failure ratio is met and the listener has run at least 3 times (), the listener is marked as failing.

  • Created a failing File event listener (failing file check), let it fail a few times, and manually reset the listener state: after the reset, the failures are still aggregated under the same old row in the task_log table, but the failure ratio is still calculated correctly.

  • Successful checks (e.g. file checks of a File event listener) do not by themselves change the state to OK.

  • The listener state changes to OK once the failure ratio dips below the configured threshold value.

Tested on both standalone and cluster. Closing.
(server-CLO-19879 #12)
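Because the state returns to OK only once the failure ratio dips below the threshold, the number of successful runs a listener needs to recover after an outage can be computed directly. The helper below is a hypothetical sketch (`RecoveryEstimate` and `successesToRecover` are illustrative names, not server APIs). It assumes the ratio is cumulative failures over total runs since the last reset, with no sliding window or decay, and that the threshold is a whole percentage. It finds the smallest k with failures / (runs + k) < threshold / 100, using integer arithmetic so the boundary case is exact.

```java
/** Hypothetical helper; assumes a cumulative failure ratio and a whole-percent threshold. */
class RecoveryEstimate {

    /**
     * Smallest number k of consecutive successful runs such that
     * failures / (runs + k) drops strictly below thresholdPercent / 100.
     */
    static int successesToRecover(int failures, int runs, int thresholdPercent) {
        if (thresholdPercent <= 0 || thresholdPercent > 100) {
            throw new IllegalArgumentException("threshold must be in 1..100 percent");
        }
        if (failures < 0 || runs < failures) {
            throw new IllegalArgumentException("need 0 <= failures <= runs");
        }
        // failures * 100 < thresholdPercent * (runs + k)
        //   <=>  runs + k > failures * 100 / thresholdPercent
        // The smallest integer above x >= 0 is floor(x) + 1; integer division is floor here.
        int minTotalRuns = failures * 100 / thresholdPercent + 1;
        return Math.max(0, minTotalRuns - runs);
    }

    public static void main(String[] args) {
        // 5 failures in 26 runs (about 19.2%) against a 10% threshold
        System.out.println(successesToRecover(5, 26, 10)); // 25
        // 1 failure in 20 runs (5%) is already below 10%
        System.out.println(successesToRecover(1, 20, 10)); // 0
    }
}
```

Under these assumptions, a ratio exactly equal to the threshold (for example 5 failures in 50 runs at 10%) still needs one more success, matching the "dips below" wording in the testing notes.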

Fixed

Details

QA Testing

UNDECIDED

Created October 8, 2020 at 1:18 PM
Updated September 12, 2023 at 8:44 AM
Resolved October 16, 2020 at 1:28 PM