Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure safety updates #2324

Merged
merged 32 commits into from
Sep 24, 2024

Conversation

ChrisThibodeaux
Copy link
Contributor

@ChrisThibodeaux ChrisThibodeaux commented Sep 21, 2024

Updates to handle critical errors when:

  • Non-persistent machine deletion from the DB leads to a machine scheduled for deletion being assigned a task
  • A machine's agent will sometimes be unreachable at startup
  • A machine is sometimes added more than once to delete_vm_list
  • On startup, an existing VM is on the VMSS
  • On startup, a machine that was previously in the DB is still present on the VMSS
  • On startup, when all initialized machines fail, add a configurable retry capability
  • On startup, an existing VM is locked

Many of the changes are to make the code more thread-safe.
Delay monitor thread start to allow all VMSSs to be up and instances running, especially in the case of initial startup failures.

@doomedraven
Copy link
Collaborator

Hello, thank you, i will need first move this to staging before it will be merged to master to ensure that we didn't break nothing

@ChrisThibodeaux
Copy link
Contributor Author

@doomedraven Sounds good. I want to test it significantly on my end before moving forward with it as well. Is it possible to change this to a draft instead of a regular PR?

@ChrisThibodeaux ChrisThibodeaux changed the base branch from master to staging September 21, 2024 07:14
@doomedraven
Copy link
Collaborator

yes then draft would be good

@doomedraven doomedraven marked this pull request as draft September 21, 2024 07:20
@ChrisThibodeaux
Copy link
Contributor Author

@cccs-mog If you have a chance to test this on your end, it should fix all the errors we discussed in #2323

@ChrisThibodeaux
Copy link
Contributor Author

@doomedraven I removed the changes in scheduler.py and abstracts.py. Only machinery/az.py has changes to it now.

Ready to move out of draft now.

@doomedraven doomedraven marked this pull request as ready for review September 22, 2024 11:12
@doomedraven
Copy link
Collaborator

cool thank you, lets see if @cccs-mog can confirm it and then we fine to merge

@doomedraven doomedraven changed the base branch from staging to master September 22, 2024 11:31
@doomedraven
Copy link
Collaborator

i changed from staging to master as changes are only to az.py, so they are not so dangerous for the rest of users and easier to review

@cccs-mog
Copy link
Contributor

Hi @doomedraven and @ChrisThibodeaux, I will take a look today and get back. Awesome work !

Copy link
Contributor

@cccs-mog cccs-mog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome ! Seem to work wonders ! Only missing subnet limits and quota from #2291 :) Thanks a lot !

@doomedraven
Copy link
Collaborator

ok lets merge this then, and then someone use uses azure can add subnet limit and quota . thank you @ChrisThibodeaux for your work

@doomedraven doomedraven merged commit 09a28ff into kevoreilly:master Sep 24, 2024
5 checks passed
@ChrisThibodeaux ChrisThibodeaux deleted the azure-safety-updates branch September 24, 2024 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants