
IGNITE-21792 fix ItNodeTest#testFollowerStartStopFollowing flakiness #4990

Status: Draft · wants to merge 4 commits into main
Conversation

@12rcu commented Jan 2, 2025

https://issues.apache.org/jira/browse/IGNITE-21792

Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality,
we ask both the author and a reviewer to verify the following:

The Review Checklist

  • Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
    - There is a single JIRA ticket related to the pull request.
    - The web-link to the pull request is attached to the JIRA ticket.
    - The JIRA ticket has the Patch Available state.
    - The description of the JIRA ticket explains WHAT was made, WHY and HOW.
    - The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary, where XXXX is the number of the JIRA issue.
  • Design: new code conforms with the design principles of the components it is added to.
  • Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
  • Code quality: code is clean and readable, necessary developer documentation is added if needed.
  • Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

Issue Notes:

Since the Jira ticket was not created by me, here is the summary of what was done:

  • Flakiness of the original test could only be reproduced by lowering the timeout of the assertions, even on bad hardware.
  • Fix: remove the timeout in the assert statements and use a JUnit @Timeout annotation with the summed time of the individual assert statements.
  • Flakiness when using this annotation could only be triggered with a massive reduction of the timeout value; see this action log: https://github.com/12rcu/ignite-3/actions/runs/12575502390 (this run uses a @Timeout of 10 seconds; the final timeout is 25 seconds.)
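The change described above can be sketched roughly as follows (schematic only; the test body is elided, and the 25-second value is taken from the note above):

```
// Before: each assertion polled with its own 5-second budget:
//     assertTrue(waitForCondition(
//             () -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));

// After (this PR): plain assertions under a single method-level JUnit timeout,
// sized to the sum of the removed per-assert budgets.
@Test
@Timeout(value = 25, unit = TimeUnit.SECONDS)
public void testFollowerStartStopFollowing() throws Exception {
    // ... cluster setup elided ...
    assertEquals(1, ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes());
}
```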

@12rcu (Author) commented Jan 2, 2025

Test Reports

I created a test environment where I tested the timeouts a bit to get a feel for whether everything was working as expected.

Manual: tests are executed manually from an IDE (IntelliJ)
Automated: tests are executed from another program for parallelization and stress testing
Stressor: CPU/RAM almost fully utilized during test execution

Test Setup 1

Specs:
CPU: 13th Gen Intel® Core™ i7-13700KF × 16
RAM: 32 GB

| Type | Timeout setting | Result |
| --- | --- | --- |
| Manual, 5 runs | 2 s | timeout |
| Manual, 5 runs | 3 s | success |
| Manual, 5 runs | 2.5 s | flaky |
| Automated, 50 runs (stressor) | 5 s | success |
| Automated, 50 runs (stressor) | 10 s | success |

Note: Other tests in the test suite began to flake when the system was fully utilized, but not the test in question.

Test Setup 2

Specs:
CPU: Intel® Core™ i7-5000U × 4
RAM: 16 GB

| Type | Timeout setting | Result |
| --- | --- | --- |
| Automated, 10 runs (stressor) | 5 s | success |
| Automated, 10 runs (stressor) | 10 s | success |

Test Setup 3

GitHub Actions

| Type | Timeout setting | Result |
| --- | --- | --- |
| 20 runs | 5 s | success |
| 10 runs | 10 s | success |

20 runs: https://github.com/12rcu/ignite-3/actions/runs/12584096504
10 runs: https://github.com/12rcu/ignite-3/actions/runs/12575502390

@12rcu 12rcu marked this pull request as ready for review January 2, 2025 18:17
@sashapolo (Contributor) commented Jan 5, 2025

Hi, @12rcu! First of all, thank you for the great work and for your contribution. However, could you please explain how this PR fixes the flakiness of the test? I can see that you replaced the waitForCondition calls with direct calls to the original methods. If a direct call were sufficient, the original approach would also have worked. Did I understand your fix incorrectly?

@12rcu (Author) commented Jan 5, 2025

Hi @sashapolo,

Yes, you're right, it would have worked just by increasing the timeout of the original calls.

So, in conclusion, the test is still flaky, but with my "fix", when the timeout is reached the test reports will say so, making it easier to increase the timeout in the future if needed.

@12rcu (Author) commented Jan 5, 2025

The point of the test reports is just to show that, in my view, the timeout I put in is long enough not to trigger the flakiness regularly.

@sashapolo (Contributor):

@12rcu but waitForCondition and JUnit's @Timeout annotation have different semantics. waitForCondition executes a predicate multiple times until either the condition is satisfied or the timeout is reached. @Timeout fails the test if it gets "stuck" for more than the configured period of time. So, my question still stands: what does @Timeout achieve here, and how does it even affect the flakiness of the test?
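The polling semantics described here can be illustrated with a minimal stand-in for the helper (an illustration only; Ignite's actual waitForCondition differs in details such as the poll interval):

```java
import java.util.function.BooleanSupplier;

class WaitForConditionDemo {
    // Minimal stand-in for the test helper: retries the predicate until it
    // returns true or the time budget is exhausted.
    static boolean waitForCondition(BooleanSupplier cond, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        do {
            if (cond.getAsBoolean()) {
                return true;   // the condition eventually held
            }
            Thread.sleep(10);  // poll interval between attempts
        } while (System.currentTimeMillis() < deadline);
        return false;          // budget exhausted; the condition never held
    }

    public static void main(String[] args) throws InterruptedException {
        int[] attempts = {0};
        // The predicate only holds on the third attempt; polling still succeeds,
        // while a single direct assertion would have failed on the first call.
        boolean ok = waitForCondition(() -> ++attempts[0] >= 3, 1_000);
        System.out.println(ok + " after " + attempts[0] + " attempts");
        // prints "true after 3 attempts"
    }
}
```

This is what makes the retry loop forgiving of slow followers: the assertion is allowed to be false for a while, as long as it becomes true before the deadline.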

So in conclusion, the test is still flaky, but with my "fix" when the timeout is reached it will say so in the test reports

But the name of your PR clearly states that your code is intended to fix the flakiness of the test, is the description incorrect?

@12rcu (Author) commented Jan 5, 2025

@sashapolo In retrospect, it may not be the best name for the pull request; in my opinion, the "fix" is not that the test will always succeed, but rather that it gives a better reason why it failed.

Yes, the timeout changes the semantics of the test: the first loops technically have more time than the later loops, and this could change the behaviour of the test. In my opinion this is negligible in this test case, as we want the test to always succeed and never time out (so if we wait longer for some loops than for others, it doesn't really change much).

So why would I want to use the timeout annotation, even though it's semantically not the same:

  • The exception thrown from waitForCondition() leads to an assertion error, which I think is confusing; a clearly stated timeout would make the failure reason obvious.
  • The timeout of the test can be more easily changed if needed in the future.
  • I would like to see more use of the @Timeout annotation in the tests, and this would be my way of "fixing" other flaky tests in future submissions.

But I am also happy to test the timeouts with the waitForCondition() and find better timeout values.

Comment on lines -2729 to -2730:

    assertTrue(
            waitForCondition(() -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));
@12rcu (Author) Jan 5, 2025

Suggested change:

    - assertTrue(
    -         waitForCondition(() -> ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes() == 1, 5_000));
    + assertTimeoutPreemptively(Duration.ofMillis(5_000), () -> {
    +     assertEquals(1, ((MockStateMachine) node.getOptions().getFsm()).getOnStartFollowingTimes());
    + });

@12rcu (Author) commented Jan 5, 2025

I was just digging around in the JUnit documentation and noticed that there is an assert function specifically for this: see #4990 (comment)

@sashapolo (Contributor) commented Jan 6, 2025

I would like to stress the difference between waitForCondition and assertTimeoutPreemptively (or @Timeout in this case):

  1. waitForCondition executes the given predicate multiple times until either the condition is satisfied or a time limit is reached.
  2. assertTimeoutPreemptively executes the given statement once and waits for it to complete in a given time period.

If we are getting exceptions when executing waitForCondition, it does not mean that the predicate inside it got stuck for some reason; it means that the predicate did not return true during any of the attempts. Therefore, replacing waitForCondition with assertTimeoutPreemptively is equivalent to calling the predicate inside waitForCondition just once. So, to me, your proposed changes do not solve any problems related to the test's correctness.
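The contrast can be made concrete with a hand-rolled version of the preemptive timeout (a simplified sketch; JUnit's real assertTimeoutPreemptively also reports the executing thread and handles assertion failures more carefully):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PreemptiveTimeoutDemo {
    // Runs the body ONCE on another thread and fails if it does not finish in
    // time. There is no retry: a condition that is merely "not true yet" is
    // not given a second chance.
    static void assertTimeoutPreemptively(long timeoutMillis, Runnable body) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(body).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new AssertionError("execution exceeded " + timeoutMillis + " ms");
        } catch (Exception e) {
            throw new AssertionError(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast body passes; a body that outlives the budget fails.
        assertTimeoutPreemptively(500, () -> { });
        boolean timedOut = false;
        try {
            assertTimeoutPreemptively(100, () -> {
                try { Thread.sleep(1_000); } catch (InterruptedException ignored) { }
            });
        } catch (AssertionError expected) {
            timedOut = true;
        }
        System.out.println("slow body timed out: " + timedOut);
    }
}
```

Note that a flaky assertion under this helper fails immediately on its first (and only) evaluation, which is exactly the difference sashapolo points out.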

If you are instead trying to improve the error message, then you can just add a better error message to the assertTrue statement.

@12rcu (Author) commented Jan 6, 2025

@sashapolo Now I understand you, yes you are absolutely right, it seems I was just lucky not to trigger a failure. I will convert this PR back to a draft and look for a better solution.
Thanks for your patience, I am so sorry 😐

@12rcu 12rcu marked this pull request as draft January 6, 2025 10:44
@sashapolo (Contributor):

Thank you, I will be looking forward to your future contributions.
