[Bug]: wrong cp lag metrics #35588

Open
1 task done
pingliu opened this issue Aug 20, 2024 · 6 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@pingliu
Contributor

pingliu commented Aug 20, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.9
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[Image: Screenshot 2024-08-20 16 19 40]

Expected Behavior

No response

Steps To Reproduce

Upgrade from 2.3.x to 2.4.9.

Milvus Log

No response

Anything else?

No response

@pingliu pingliu added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 20, 2024
@pingliu
Contributor Author

pingliu commented Aug 20, 2024

/assign @XuanYang-cn

@yanliang567
Contributor

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 20, 2024
@XuanYang-cn
Contributor

The channel checkpoint meta lifecycle is buggy. Checkpoints are often left in the meta even after their collections are dropped, and the creation and deletion of the metrics are also chaotic.

Here are the rules I need to check:

  1. When a collection is created, the channel watch info and channel cp should be created.
  2. When a collection is dropped, the channel watch info and channel cp should be dropped.
  3. When DC recovers a channel, the channel cp and channel watch info should be recovered for VALID collections.
  4. When DC drops a channel, the channel cp and channel watch info should be removed.
  5. Only DN's updateChannelCheckpoint should be able to update the channel checkpoint.
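
A minimal sketch of how rules 1 through 4 could be checked as an invariant over the stored meta. It is not Milvus code: `ChannelMeta`, `Violations`, and `CheckLifecycle` are hypothetical names made up for illustration, and the real DataCoord meta layout is assumed to be more involved. Rule 5 concerns who may write a checkpoint and is not covered here.

```go
package main

import (
	"fmt"
	"sort"
)

// ChannelMeta is a hypothetical, simplified view of per-channel metadata.
// It does not mirror the actual Milvus meta structures.
type ChannelMeta struct {
	WatchInfo  map[string]struct{} // channels that have watch info
	Checkpoint map[string]uint64   // channel name -> checkpoint timestamp
}

// Violations groups channels that break the lifecycle rules.
type Violations struct {
	Orphaned []string // meta exists, but the channel belongs to no valid collection (rules 2 and 4)
	Missing  []string // valid channel that lacks watch info or a checkpoint (rules 1 and 3)
}

// CheckLifecycle compares the channels of valid collections against the
// stored watch info and checkpoints and reports every mismatch.
func CheckLifecycle(valid map[string]struct{}, meta ChannelMeta) Violations {
	orphaned := map[string]struct{}{}
	for ch := range meta.WatchInfo {
		if _, ok := valid[ch]; !ok {
			orphaned[ch] = struct{}{}
		}
	}
	for ch := range meta.Checkpoint {
		if _, ok := valid[ch]; !ok {
			orphaned[ch] = struct{}{}
		}
	}

	var v Violations
	for ch := range orphaned {
		v.Orphaned = append(v.Orphaned, ch)
	}
	for ch := range valid {
		_, hasWatch := meta.WatchInfo[ch]
		_, hasCP := meta.Checkpoint[ch]
		if !hasWatch || !hasCP {
			v.Missing = append(v.Missing, ch)
		}
	}
	// Sort for deterministic output, since map iteration order is random.
	sort.Strings(v.Orphaned)
	sort.Strings(v.Missing)
	return v
}

func main() {
	valid := map[string]struct{}{"ch-1": {}, "ch-2": {}}
	meta := ChannelMeta{
		WatchInfo:  map[string]struct{}{"ch-1": {}, "ch-3": {}},
		Checkpoint: map[string]uint64{"ch-1": 100, "ch-3": 50, "ch-4": 7},
	}
	v := CheckLifecycle(valid, meta)
	fmt.Println("orphaned:", v.Orphaned)
	fmt.Println("missing:", v.Missing)
}
```

In this example `ch-3` and `ch-4` stand in for checkpoints left behind by dropped collections, and `ch-2` for a valid channel whose meta was never created or recovered.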

@xiaofan-luan
Collaborator

Do we need to hack the channel cp meta to fix the problem for now?

XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Aug 23, 2024
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Aug 23, 2024
@XuanYang-cn XuanYang-cn added this to the 2.4.10 milestone Aug 23, 2024
@XuanYang-cn
Contributor

@xiaofan-luan I think so. I believe this is causing numerous false alarms, which is very annoying. See milvus-io/birdwatcher#303

sre-ci-robot pushed a commit that referenced this issue Aug 24, 2024
sre-ci-robot pushed a commit that referenced this issue Aug 26, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.10, 2.4.11 Sep 5, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.11, 2.4.12 Sep 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.12, 2.4.13 Sep 27, 2024
@XuanYang-cn
Contributor

/assign @pingliu
Please help verify
/unassign

@sre-ci-robot sre-ci-robot assigned pingliu and unassigned XuanYang-cn Oct 9, 2024
@yanliang567 yanliang567 removed this from the 2.4.13 milestone Oct 15, 2024
@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 15, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024