-
-
Notifications
You must be signed in to change notification settings - Fork 643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add rudimentary garbage collection to graph #19513
Conversation
Pushing this now to know if this is directionally or generally correct. This is an attempt at solving #7675. I'm aware there needs to be tests on the graphs, but I'm still trying to load the data structure semantics into my head to know how to construct a test. In the meanwhile, I'd like some feedback to see if this is going in the right direction or if I'm totally off base here. Also feel free to correct me or push me to better practices on my Rust.
@stuhood Modified this a bit, but I'm worried I may have some logic totally wrong here. To get faster results, locally, I've added a much more aggressive threshold, but this results in tons of rule retries (which I think makes sense), since nodes are being removed from the graph quite quickly. And it only seems to result in a very paltry amount of memory being relinquished (on the order of 1-3 MB per cycle, if any at all), which is disheartening.
I guess it might be too much to chew off here to try to do GC within a short time frame, even though I think that's going to be necessary to get Pants to be performant on moderate-to-large size repos. But I'm okay with starting with a smaller goal if that gets us there. I think I need to find a way to test this at a smaller level (and load a better idea of the graph into my head). So this is my prospective TODO here:
Any guidance on where I could start with these? I'm a bit lost |
(Just acknowledging the review request: I don't yet have the domain knowledge to provide guidance on your questions. Sorry!) |
Looking at the code that you have, what we're actually doing is So I suspect that what is happening is that you are clearing nodes while they are running. You could add a check that avoids invalidating nodes while they are running, but I expect that as mentioned in #19070 (comment) , that would go hand in hand with adjusting the GC calculation to be based on the size of the node value (since there is no point clearing something that we don't actually have a value/size for).
I think that the guidance from step 6 here is probably still accurate, in that returning a count of cleared nodes from the method which implements the meat of garbage collection, and then testing a few different situations like "expecting nothing to be cleared" and "expecting 1 thing to be cleared", etc ... would be sufficient. To accomplish that, you will likely also need to go ahead and do the parametrization (then you can sleep between creating one set of nodes and another so that they end up with different access times). |
Thanks for the feedback. This is more in line what I've thought after spending more time diving into how the graph works (mostly by just going through its existing tests). Based on this do you think this time-based variant is even worth pursuing? Or can we just add some additional check to see if the node has a value without worrying about its size? |
Hm, actually: thinking about this a bit more, it's unusual that anything that is Running would actually not be reachable from a root. During a run, everything should be reachable from the current roots. So it could be that you actually want to be storing last access time on all nodes, and then rather than worrying about roots at all, you would garbage collect nodes which hadn't been accessed in a while (...and which weren't running). That would allow you to safely garbage collect data that was idle, even during a run. |
That makes sense. I need to figure out why I was clearing anything at all
in this case. Like I said, I think I need to write more granular test cases
to validate my assumptions if I want to tackle this.
A question, as I’m still trying to fit this crate in my head: I think that
roots are the nodes that are requested externally from outside the graph,
and their dependents are what they add to be the graph while running. Is
this correct? In a typical pants run, what roots might get added?
…On Wed, Aug 16, 2023 at 19:56 Stu Hood ***@***.***> wrote:
Hm, actually: thinking about this a bit more, it's unusual that anything
that is Running would actually *not* be reachable from a root. During a
run, everything should be reachable from the current roots.
So it could be that you actually want to be storing last access time on
*all* nodes, and then rather than worrying about roots at all, you would
garbage collect nodes which hadn't been accessed in a while (...and which
weren't running). That would allow you to safely garbage collect data that
was idle, even during a run.
—
Reply to this email directly, view it on GitHub
<#19513 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAH5UH6V7SBTX2J6EUOWHRTXVVM2RANCNFSM6AAAAAA2UTZO7E>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Attempt to address #7675 by adding some rudimentary garbage collection to the graph.
Re-opening of #19070 on a new branch to take advantage of caching in CI.