
fix(manual_compact): fix replica lose manual compact finished status after replica migrate bug #1961

Open · wants to merge 6 commits into base: master
Conversation

ninsmiracle (Contributor)

What problem does this PR solve?

#1665

Due to a mistaken operation on PR #1666, I closed it and force-pushed over it, so I have to resubmit the change in another pull request.

What is changed and how does it work?

I add another key to meta_store (a column family that persists pegasus_last_manual_compact_finish_time and similar values), so that we can read the value back from meta_store when we recover a replica after replica migration.
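The idea can be sketched as follows. This is a simplified, hypothetical stand-in for the real RocksDB-backed meta_store (the struct, key names other than `pegasus_last_manual_compact_finish_time`, and function names are illustrative, not the actual Pegasus API):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for Pegasus's meta_store column family,
// which in reality persists these keys in RocksDB.
struct MetaStore {
    std::map<std::string, uint64_t> kv;

    void set_uint64(const std::string &key, uint64_t value) { kv[key] = value; }

    // Returns true and fills `value` if the key exists (a successful read);
    // returns false otherwise (analogous to ERR_OBJECT_NOT_FOUND).
    bool get_uint64(const std::string &key, uint64_t &value) const {
        auto it = kv.find(key);
        if (it == kv.end()) return false;
        value = it->second;
        return true;
    }
};

// After a manual compaction completes, record both the finish time and the
// time used, so a migrated or recovered replica can restore its status.
void record_manual_compact(MetaStore &ms, uint64_t start_ms, uint64_t end_ms) {
    ms.set_uint64("pegasus_last_manual_compact_finish_time", end_ms);
    ms.set_uint64("last_manual_compact_used_time", end_ms - start_ms);
}
```

Because both values are persisted together, a replica that reopens the DB after migration can read them back instead of losing the finished status.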

Checklist

Tests
  • Manual test
    1. Create a table and put some data in it.
    2. Start a manual compaction on the table.
    3. Let the manual compaction finish, then view the replica distribution of the table.
    4. Stop a node and check the progress of the compaction.
    5. Wait a few minutes and check whether some replicas still fail to show a Finish status.

Test Result

Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Code changes
  • Try to read LAST_MANUAL_COMPACT_USED_TIME from meta_store when the DB exists.
  • Set LAST_MANUAL_COMPACT_USED_TIME to zero when the DB does not exist.
  • Set LAST_MANUAL_COMPACT_USED_TIME to the time the replica last used.
  • Generate a new checkpoint after changing LAST_MANUAL_COMPACT_USED_TIME in do_manual_compact.
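The recovery rule in the bullets above can be sketched as a single function (a simplified sketch: a plain map stands in for the RocksDB-backed meta store, and the surrounding API is hypothetical; the key name mirrors LAST_MANUAL_COMPACT_USED_TIME):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Plain map standing in for the real meta store column family.
using MetaKv = std::map<std::string, uint64_t>;

uint64_t load_last_manual_compact_used_time(const MetaKv *db) {
    // DB does not exist (fresh replica): start from zero.
    if (db == nullptr) return 0;
    auto it = db->find("last_manual_compact_used_time");
    // Key absent, e.g. data written by an older version that never stored
    // it: also fall back to zero instead of failing, so upgrades keep working.
    if (it == db->end()) return 0;
    // DB exists and the key was persisted: restore the value the replica
    // last used before migration.
    return it->second;
}
```

Treating a missing key the same as a missing DB is what makes the upgrade path safe when ERR_OBJECT_NOT_FOUND is returned, as raised in the review comment below.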

@github-actions github-actions bot added the cpp label Mar 22, 2024
@ninsmiracle ninsmiracle changed the title Fix(manual_compact): fix replica lose manual compact finished status after replica migrate bug fix(manual_compact): fix replica lose manual compact finished status after replica migrate bug Mar 26, 2024
src/server/pegasus_server_impl.cpp (outdated review thread)
@@ -1758,14 +1758,22 @@ dsn::error_code pegasus_server_impl::start(int argc, char **argv)
dsn::ERR_LOCAL_APP_FAILURE,
"open app failed, unsupported data version {}",
_pegasus_data_version);
// update last manual compact finish timestamp
uint64_t last_manual_compact_used_time = 0;
LOG_AND_RETURN_NOT_OK(
Member

When upgrading from old versions, ERR_OBJECT_NOT_FOUND will be returned, right?

return last_manual_compact_finish_time;
}

void pegasus_server_impl::after_manual_compact(std::uint64_t starttime, uint64_t endtime)
{
// store last manual compact used time to meta store for learn situation
Member

If the replica server shuts down before the replicas complete the manual compaction, can this patch resolve this issue?

Contributor (author)

I think this patch will still work, because if the replica server shuts down before the replicas complete the manual compaction, last_manual_compact_finish_time will not be updated, so the compaction will start again.
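The reasoning in this reply, that a crash mid-compaction leaves the persisted finish timestamp stale so the replica correctly restarts the compaction, can be illustrated with a small predicate (the function and parameter names are hypothetical, not the actual Pegasus API):

```cpp
#include <cstdint>

// A manual compaction round counts as finished only if the persisted finish
// timestamp is non-zero and at or after the time the round was triggered.
// A shutdown mid-compaction means the finish time was never updated, so the
// check fails and the compaction runs again on restart.
bool manual_compact_finished(uint64_t trigger_time_ms,
                             uint64_t last_finish_time_ms) {
    return last_finish_time_ms != 0 && last_finish_time_ms >= trigger_time_ms;
}
```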

@acelyc111 (Member)

acelyc111 commented Mar 27, 2024

It's OK with me to remove the periodic manual compaction function; it can be replaced by third-party tools. Feel free to remove it.

@ninsmiracle (Contributor, author)

> It's OK with me to remove the periodic manual compaction function; it can be replaced by third-party tools. Feel free to remove it.

I think it's reasonable to keep the periodic manual compaction function, because many users use Spark to set the manual compaction time now. If we removed this function in a new version, we would have to change user habits.
Actually, I think we lack a unified management platform to control everything.

@acelyc111 (Member)

> I think it's reasonable to keep the periodic manual compaction function, because many users use Spark to set the manual compaction time now. If we removed this function in a new version, we would have to change user habits. Actually, I think we lack a unified management platform to control everything.

What is "using Spark to set the manual compaction time"? Is it a Pegasus binding method? I don't think it blocks removing the periodic manual compaction; this is not one of Pegasus's core tasks, and it makes the code a bit smelly.

At least, you need to make this patch resolve the issues mentioned above.

@ninsmiracle (Contributor, author)

> What is "using Spark to set the manual compaction time"? Is it a Pegasus binding method? I don't think it blocks removing the periodic manual compaction; this is not one of Pegasus's core tasks, and it makes the code a bit smelly.
>
> At least, you need to make this patch resolve the issues mentioned above.

In recent versions of PEGASUS-SPARK, a compaction request is sent to the cluster after the user calls startBulkLoad, and I will update PEGASUS-SPARK this week.
I do not think we should remove the periodic manual compaction code now, because all its manual compaction bugs have been fixed, and they were all minor issues.

@acelyc111 (Member)

> In recent versions of PEGASUS-SPARK, a compaction request is sent to the cluster after the user calls startBulkLoad, and I will update PEGASUS-SPARK this week. I do not think we should remove the periodic manual compaction code now, because all its manual compaction bugs have been fixed, and they were all minor issues.

Thanks for updating the pegasus-spark project! It hasn't been well maintained for about 4 years.

Do you think it's confusing that the manual compact status depends on last_manual_compact_used_time? Is it possible that manual_compact_last_time_used_ms == 0 but manual_compact_last_finish_time_ms > 0?

If the parameters were set in a previous round, and the current round's compaction hasn't completed and set the variables before the server restarts, will it be considered finished?

Could you add some tests to verify the bug has been fixed?

@ninsmiracle (Contributor, author)

ninsmiracle commented Apr 17, 2024

> Thanks for updating the pegasus-spark project! It hasn't been well maintained for about 4 years.
>
> Do you think it's confusing that the manual compact status depends on last_manual_compact_used_time? Is it possible that manual_compact_last_time_used_ms == 0 but manual_compact_last_finish_time_ms > 0?
>
> If the parameters were set in a previous round, and the current round's compaction hasn't completed and set the variables before the server restarts, will it be considered finished?
>
> Could you add some tests to verify the bug has been fixed?

First, I updated PEGASUS-SPARK to the newest version and submitted a merge request.
I think manual_compact_last_time_used_ms == 0 with manual_compact_last_finish_time_ms > 0 could happen before this PR. However, now the replica reloads the parameters from disk in pegasus_server_impl::start: if last_finish_time_ms is set, last_time_used_ms will be set as well. The staging environment test verified this point.
This week we will roll this manual compaction fix out to the online environment, and we will verify the effectiveness of the fix on large datasets.
