Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File Cache Refactor #374

Open
wants to merge 58 commits into
base: main
Choose a base branch
from
Open

File Cache Refactor #374

wants to merge 58 commits into from

Conversation

foodprocessor
Copy link
Contributor

@foodprocessor foodprocessor commented Nov 25, 2024

This PR changes the file cache by:

  1. Creating a clear concept of file state, and using file locks to keep file state consistent.
  2. Deleting local files immediately when a user command tells us to do so.
  3. Refactoring isDownloadRequired to be simpler and easier to read.
  4. Refactoring OpenFile to open files immediately, when a download is not required.

1. File state:

I made this simplified diagram to show how I think file states should work. Every transition between states is now protected by a lock, without any risk of deadlock.
file cache state diagram

  • I added a LazyOpen flag to the flock struct to keep track of whether a file was in that state. Previously we only tracked that in the handle, which hid the true state of the file from other contexts.
  • The state of directories is still not protected. That may need to be a future PR.
  • The worst of this change is the file locking I added to take care of RenameDir and DeleteDir. I decided the right way to protect the files in those cases was to lock every file involved simultaneously (but without deadlock risk), before proceeding to change an entire prefix.
  • There are many new "internal" functions which are wrapped by component API functions. The only difference is that the internal functions usually run with a lock already acquired.
  • One change that may look strange is that I chose to add a lock parameter to functions which need to run with locks already acquired. This choice makes more sense in some places than others. Please give me feedback... maybe just a naming convention would be simpler.
  • While working on this, I did attempt to add more state flags to flock to signal when a file is in cloud storage and when it is in cache. This turned out to add a lot of code complexity, with the same information already being supplied by calls to os.Stat and GetAttr (which is usually cached!).
  • I did add a lock to GetAttr, and I do feel a bit weird about it. It should really be an RLock(), but that is not part of our LockMap, and I don't know if I have the heart to add it for one use.
  • TruncateFile puts the file through multiple state transitions, but when I examined the possibly bad outcomes of Truncate being interrupted by other unexpected requests or state changes, I really didn't find anything alarming. So, the code changes to make Truncate atomic don't really seem worth it to me.

2. Synchronous file deletion:

This change only affects the deletion of cached files which we were explicitly asked by the user to delete. Evictions are still asynchronous. These two very different situations used to use all the same code, which was causing lots of issues, because the two are very different cases. This caused a number of race conditions, bugs, which up until now I was "fixing" using bad code. Now I think async deletion is simply incorrect for user requests. So I removed it.

3. isDownloadRequired:

  • I removed the logic that checked if the file is expired, since the policy is in charge of that. If the file exists, it is not expired.
  • I cleaned up the refresh code, and changed the logic so if a file is open (not just lazy open, but actually open), then we will not download an update from the cloud, even if the refresh timer has expired.

4. OpenFile:

OpenFile had a slight issue, which is that if the user called Open with the create flag instead of calling Create explicitly, then we would end up creating a file handle without creating any file or object behind that. Then the user could call GetAttr and we would report that the file they have open doesn't exist.
I changed OpenFile to call isDownloadRequired right away, and open the local file immediately if a download is not required.

Changed file cache tests:

  • Drop 0s default timeouts. The minimum is 1s now, and unless we're explicitly testing eviction, there's no point having a low value there.
  • When a test needs a file to stay in cache for longer than a second, I changed the timeout instead of holding the file open to prevent eviction.
  • Changed tests to write to fake storage directly instead of wasting our time by using the cache timeout for no reason.
  • Close handles to satisfy the fearful handle leak gods.

What type of Pull Request is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Checklist

  • Tested locally
  • Added new dependencies
  • Updated documentation
  • Added tests

Related Issues

  • Related Issue #
  • Closes #

Update tests to work with sync deletion.
Create files directly in loopback whenever possible.
@foodprocessor foodprocessor marked this pull request as draft November 25, 2024 16:58
Copy link
Contributor

@jfantinhardesty jfantinhardesty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

component/file_cache/file_cache_linux_test.go Show resolved Hide resolved
component/file_cache/file_cache_linux_test.go Show resolved Hide resolved
component/file_cache/file_cache_test.go Show resolved Hide resolved
@foodprocessor
Copy link
Contributor Author

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

I'll try to address your questions one bit at a time:

  1. This does not change eviction. Eviction is still done asynchronously. This only makes explicit deletion and rename (where the user has deleted or renamed a file) synchronous. So, the performance impact of this change should be very very small.
  2. The race conditions that this change fixes were rare, and I had already done a fair amount of work to minimize or eliminate the risk from them in previous improvements to the file cache. I do not have any proof that there are exposed race conditions that this resolves.
  3. However, the real reason for this change is to tighten up the file cache to prepare for adding network fault tolerance. The consistency challenges that come with adding that feature leave little or no room for the type of awkward timing issues that the file cache has before this change.

@foodprocessor foodprocessor marked this pull request as ready for review November 26, 2024 21:31
@foodprocessor
Copy link
Contributor Author

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

I'll try to address your questions one bit at a time:

1. This does not change eviction. Eviction is still done asynchronously. This only makes explicit deletion and rename (where the user has deleted or renamed a file) synchronous. So, the performance impact of this change should be very very small.

2. The race conditions that this change fixes were rare, and I had already done a fair amount of work to minimize or eliminate the risk from them in previous improvements to the file cache. I do not have any proof that there are exposed race conditions that this resolves.

3. However, the real reason for this change is to tighten up the file cache to prepare for adding network fault tolerance. The consistency challenges that come with adding that feature leave little or no room for the type of awkward timing issues that the file cache has before this change.

In the third point, I said this was for network fault tolerance... it's actually in prep for a much smaller feature: maintaining file handle validity after rename and deletion (to better fit with POSIX).

Copy link
Contributor

@jfantinhardesty jfantinhardesty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that some of the locking and unlocking could be a bit aggressive. It seems that there is a possibility for deadlock. In particular, the unit tests are timing out due to deadlock. Also, I might recommend using the race detector when running the unit tests to check for any race conditions as you add / remove some of the locks.

@foodprocessor
Copy link
Contributor Author

I think that some of the locking and unlocking could be a bit aggressive. It seems that there is a possibility for deadlock. In particular, the unit tests are timing out due to deadlock. Also, I might recommend using the race detector when running the unit tests to check for any race conditions as you add / remove some of the locks.

If the tests can't finish then they can't fail! 😁

@foodprocessor foodprocessor marked this pull request as draft November 27, 2024 08:47
… (not just one handle).

Add residency state variables to flock as well.
… downloading when the file is lazy opened (but not completely opened).
Keep file state updated
Use isDownloadRequired in OpenFile to completely open any file that doesn't need downloading.
Improve logic to prevent erronious ENOSPC error.
@foodprocessor foodprocessor marked this pull request as ready for review January 2, 2025 06:35
@foodprocessor foodprocessor changed the title Delete files synchronously File Cache Refactor Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants