"bad file descriptor" when using git mirror #2906

Open
btajuddin opened this issue Jul 29, 2024 · 6 comments

@btajuddin

Environment:
Buildkite agent docker image with minor customizations (node, some extra scripts, and docker configuration)
Mirror directory is a mounted AWS EFS volume (effectively NFS)
AWS Graviton processors
Agent version 3.75.1

What's happening?
After upgrading our agent from 3.74.0 to 3.75.1, we started seeing this message (somewhat redacted):

2024-07-29 12:48:01 PDT | # Could not acquire lock on "/buildkite/mirrors/https---github-com-******.clonelockf" (bad file descriptor)
2024-07-29 12:48:01 PDT | ⚠️ Warning: Checkout failed! getting/updating git mirror: bad file descriptor (Attempt 1/3 Retrying in 2s)

This caused all checkouts to fail on the new agent version. Upon reverting to 3.74.0, the errors went away. This appears to be specific to the agent version, not our environment.

@DrJosh9000
Contributor

Thanks for the report, @btajuddin! I suspect the jump to gofrs/flock v0.12.0 in #2864 introduced the bug, and I also suspect gofrs/flock fixed it in v0.12.1.

@btajuddin
Author

@DrJosh9000 Thanks for the update. This seems like a gap in the agent testing, though. Since git mirrors are no longer considered experimental, the agent's test suite should cover that functionality before release.

@DrJosh9000
Contributor

Good point, I think I'll leave this open until we have a test that would catch at least this bug.

@btajuddin
Author

Thanks, I appreciate that.

I do have a question about the versioning strategy here. Considering 3.74.1, 3.75.0, and 3.75.1 all have the bad version of the lock library, why is the potential fix (we haven't validated it yet on our side) only available in 3.76.0 and above? This seems like a good time to do a backport to 3.74 and 3.75 to fix the broken functionality.

If there isn't a plan to backport fixes like this, then what is the suggested upgrade cadence and target?

@btajuddin
Author

Update: We've tested 3.76.1, and the lock issue has been resolved. Please feel free to resolve this issue once mirror testing is in place.

@DrJosh9000
Contributor

Re the versioning strategy question: broadly speaking, we support "the v3 agent", which implies every single minor release. Each bugfix could potentially mean a lot of backporting (77? minor releases so far). So as yet, we haven't proactively backported fixes at all. Instead, we've favoured rolling into the next minor release and recommend upgrading to that, since on average we've managed to release once every two weeks over the last couple of years.

Personally I'm not thrilled about this situation, and I wrote up some plans to fix it and provide a more stable set of expectations, but these are yet to be implemented.

If a customer did need a specific older minor release for some reason, I think we'd still be happy to do backports as needed, but we would definitely be keen to understand blockers to upgrading!
