
IGNITE-21792 fix ItNodeTest#testFollowerStartStopFollowing flakiness #4990

Status: Draft · wants to merge 4 commits into main
Conversation

@12rcu commented Jan 2, 2025

https://issues.apache.org/jira/browse/IGNITE-21792

Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality,
we ask both the author and a reviewer to verify the following:

The Review Checklist

  • Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
    - There is a single JIRA ticket related to the pull request.
    - The web-link to the pull request is attached to the JIRA ticket.
    - The JIRA ticket has the Patch Available state.
    - The description of the JIRA ticket explains WHAT was made, WHY and HOW.
    - The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary, where XXXX is the number of the JIRA issue.
  • Design: new code conforms with the design principles of the components it is added to.
  • Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
  • Code quality: code is clean and readable, necessary developer documentation is added if needed.
  • Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

Issue Notes:

Since the Jira ticket was not created by me, here is the summary of what was done:

  • Flakiness of the original test could only be reproduced by lowering the timeout of the assertions, even on bad hardware.
  • Fix: remove the timeout in the assert statements and use a JUnit @Timeout annotation with the summed time of the individual assert statements.
  • Flakiness when using this annotation could only be triggered with a massive reduction of the timeout value; see this action log: https://github.com/12rcu/ignite-3/actions/runs/12575502390 (this run uses a @Timeout of 10 seconds; the final timeout is 25 seconds.)
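The change described above can be sketched roughly as follows (schematic only; the test body is elided, and the 25-second value is taken from the note above):

```
// Before: each assertion polled with its own 5-second budget:
//     assertTrue(waitForCondition(
//             () -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));

// After (this PR): plain assertions under a single method-level JUnit timeout,
// sized to the sum of the removed per-assert budgets.
@Test
@Timeout(value = 25, unit = TimeUnit.SECONDS)
public void testFollowerStartStopFollowing() throws Exception {
    // ... cluster setup elided ...
    assertEquals(1, ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes());
}
```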

@12rcu (Author) commented Jan 2, 2025

Test Reports

I created a test environment where I tested the timeouts a bit to get a feel for whether everything was working as expected.

Manual: tests are executed manually from an IDE (IntelliJ)
Automated: tests are executed from another program for parallelization and stress testing
Stressor: CPU/RAM almost fully utilized during test execution

Test Setup 1

Specs:
CPU: 13th Gen Intel® Core™ i7-13700KF × 16
RAM: 32 GB

| Type | Timeout setting | Result |
| --- | --- | --- |
| Manual, 5 runs | 2 s | timeout |
| Manual, 5 runs | 3 s | success |
| Manual, 5 runs | 2.5 s | flaky |
| Automated, 50 runs (stressor) | 5 s | success |
| Automated, 50 runs (stressor) | 10 s | success |

Note: Other tests in the test suite began to flake when the system was fully utilized, but not the test in question.

Test Setup 2

Specs:
CPU: Intel® Core™ i7-5000U × 4
RAM: 16 GB

| Type | Timeout setting | Result |
| --- | --- | --- |
| Automated, 10 runs (stressor) | 5 s | success |
| Automated, 10 runs (stressor) | 10 s | success |

Test Setup 3

GitHub Actions

| Type | Timeout setting | Result |
| --- | --- | --- |
| 20 runs | 5 s | success |
| 10 runs | 10 s | success |

20 runs: https://github.com/12rcu/ignite-3/actions/runs/12584096504
10 runs: https://github.com/12rcu/ignite-3/actions/runs/12575502390

@12rcu 12rcu marked this pull request as ready for review January 2, 2025 18:17
@sashapolo (Contributor) commented Jan 5, 2025

Hi, @12rcu! First of all, thank you for the great work and for your contribution. However, could you please explain how this PR fixes the flakiness of the test? I can see that you replaced the waitForCondition calls with direct calls to the original methods. If a direct call were sufficient, the original approach would also have worked. Did I understand your fix incorrectly?

@12rcu (Author) commented Jan 5, 2025

Hi @sashapolo,

Yes, you're right, it would have worked just by increasing the timeout of the original calls.

So, in conclusion, the test is still flaky, but with my "fix", when the timeout is reached the test reports will say so, making it easier to increase the timeout in the future if needed.

@12rcu (Author) commented Jan 5, 2025

The point of the test reports is just to show that, in my view, the timeout I put in is long enough not to trigger the flakiness regularly.

@sashapolo (Contributor):

@12rcu but waitForCondition and JUnit's @Timeout annotation have different semantics. waitForCondition executes a predicate multiple times until either the condition is satisfied or the timeout is reached. @Timeout fails the test if it gets "stuck" for more than the configured period of time. So, my question still stands: what does @Timeout achieve here, and how does it even affect the flakiness of the test?
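The polling semantics described here can be illustrated with a minimal stand-in for the helper (an illustration only; Ignite's actual waitForCondition differs in details such as the poll interval):

```java
import java.util.function.BooleanSupplier;

class WaitForConditionDemo {
    // Minimal stand-in for the test helper: retries the predicate until it
    // returns true or the time budget is exhausted.
    static boolean waitForCondition(BooleanSupplier cond, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        do {
            if (cond.getAsBoolean()) {
                return true;   // the condition eventually held
            }
            Thread.sleep(10);  // poll interval between attempts
        } while (System.currentTimeMillis() < deadline);
        return false;          // budget exhausted; the condition never held
    }

    public static void main(String[] args) throws InterruptedException {
        int[] attempts = {0};
        // The predicate only holds on the third attempt; polling still succeeds,
        // while a single direct assertion would have failed on the first call.
        boolean ok = waitForCondition(() -> ++attempts[0] >= 3, 1_000);
        System.out.println(ok + " after " + attempts[0] + " attempts");
        // prints "true after 3 attempts"
    }
}
```

This is what makes the retry loop forgiving of slow followers: the assertion is allowed to be false for a while, as long as it becomes true before the deadline.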

So in conclusion, the test is still flaky, but with my "fix" when the timeout is reached it will say so in the test reports

But the name of your PR clearly states that your code is intended to fix the flakiness of the test, is the description incorrect?

@12rcu (Author) commented Jan 5, 2025

@sashapolo In retrospect, it may not be the best name for the pull request; in my opinion, the "fix" is not that the test will always succeed, but rather that it gives a better reason why it failed.

Yes, the timeout changes the semantics of the test: the first loops technically have more time than the later loops, and this could change the behaviour of the test. In my opinion this is negligible in this test case, as we want the test to always succeed and never time out (so if we wait longer for some loops than for others, it doesn't really change much).

So why would I want to use the timeout annotation, even though it's semantically not the same:

  • The exception thrown from waitForCondition() leads to an assertion error, which I think is confusing; a clearly stated timeout would make the failure reason obvious.
  • The timeout of the test can be more easily changed if needed in the future.
  • I would like to see more use of the @Timeout annotation in the tests, and this would be my way of "fixing" other flaky tests in future submissions.

But I am also happy to test the timeouts with the waitForCondition() and find better timeout values.

Comment on lines -2729 to -2730:

    assertTrue(
            waitForCondition(() -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));
@12rcu (Author) Jan 5, 2025

Suggested change:

    - assertTrue(
    -         waitForCondition(() -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));
    + assertTimeoutPreemptively(Duration.ofMillis(5_000), () -> {
    +     assertEquals(1, ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes());
    + });

@12rcu (Author) commented Jan 5, 2025

I was just digging around in the JUnit documentation and noticed that there is an assert function specifically for this: see #4990 (comment)

@sashapolo (Contributor) commented Jan 6, 2025

I would like to stress the difference between waitForCondition and assertTimeoutPreemptively (or @Timeout in this case):

  1. waitForCondition executes the given predicate multiple times until either the condition is satisfied or a time limit is reached.
  2. assertTimeoutPreemptively executes the given statement once and waits for it to complete in a given time period.

If we are getting exceptions when executing waitForCondition, it does not mean that the predicate inside it got stuck for some reason; it means that the predicate did not return true during any of the attempts. Therefore, replacing waitForCondition with assertTimeoutPreemptively is equivalent to calling the predicate inside waitForCondition just once. So, to me, your proposed changes do not solve any problems related to the test's correctness.
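The contrast can be made concrete with a hand-rolled version of the preemptive timeout (a simplified sketch; JUnit's real assertTimeoutPreemptively also reports the executing thread and handles assertion failures more carefully):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PreemptiveTimeoutDemo {
    // Runs the body ONCE on another thread and fails if it does not finish in
    // time. There is no retry: a condition that is merely "not true yet" is
    // not given a second chance.
    static void assertTimeoutPreemptively(long timeoutMillis, Runnable body) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(body).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new AssertionError("execution exceeded " + timeoutMillis + " ms");
        } catch (Exception e) {
            throw new AssertionError(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast body passes; a body that outlives the budget fails.
        assertTimeoutPreemptively(500, () -> { });
        boolean timedOut = false;
        try {
            assertTimeoutPreemptively(100, () -> {
                try { Thread.sleep(1_000); } catch (InterruptedException ignored) { }
            });
        } catch (AssertionError expected) {
            timedOut = true;
        }
        System.out.println("slow body timed out: " + timedOut);
    }
}
```

Note that a flaky assertion under this helper fails immediately on its first (and only) evaluation, which is exactly the difference sashapolo points out.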

If you are instead trying to improve the error message, then you can just add a better error message to the assertTrue statement.

@12rcu (Author) commented Jan 6, 2025

@sashapolo Now I understand you, yes you are absolutely right, it seems I was just lucky not to trigger a failure. I will convert this PR back to a draft and look for a better solution.
Thanks for your patience, I am so sorry 😐

@12rcu 12rcu marked this pull request as draft January 6, 2025 10:44
@sashapolo (Contributor):

Thank you, I will be looking forward to your future contributions.
