Use database of partial paths to speed up bindings resolution #1198

Open
wants to merge 37 commits into
base: main
from

ggiraldez
Contributor

@ggiraldez ggiraldez commented Dec 19, 2024

Builds on top of #1195

  • Resolve all references at once using a database of minimal partial paths. This speeds up resolution considerably (since it avoids a lot of duplicated work) at the expense of higher memory consumption.
  • Change Definition and Reference to hold an Rc<> to the BindingGraph as opposed to a normal reference. This should allow easier integration with WASM since there are already wrappers for ref counted objects.
  • Split the implementation of BindingGraph and split off a BindingGraphBuilder, in which to add user files and built-ins. Calling resolve() then consumes the builder and returns a leaner BindingGraph with all bindings resolved.
  • The changes here require using our fork of stack-graphs which adds the ability to rewind the arena allocator used for partial paths after resolving each reference, but still allows using the default database to hold the set of minimal partial paths.
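
As an illustration of the builder/resolve split and the `Rc` handles described in the list above, here is a minimal sketch. All type and method names in it (`GraphBuilder`, `ResolvedGraph`, `DefinitionHandle`, `resolve_reference`) are hypothetical stand-ins, not the actual metaslang_bindings API, and the name-matching "resolution" is a placeholder for the real partial-path stitching.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical builder: collects definitions and references from ingested files.
#[derive(Default)]
pub struct GraphBuilder {
    definitions: Vec<String>,
    references: Vec<String>,
}

/// Hypothetical resolved graph: keeps only the final results.
pub struct ResolvedGraph {
    definitions: Vec<String>,
    /// Reference name -> indices into `definitions`.
    resolutions: HashMap<String, Vec<usize>>,
}

/// Handle that shares ownership of the graph, so it carries no lifetime parameter.
#[derive(Clone)]
pub struct DefinitionHandle {
    owner: Rc<ResolvedGraph>,
    index: usize,
}

impl DefinitionHandle {
    pub fn name(&self) -> &str {
        &self.owner.definitions[self.index]
    }
}

impl GraphBuilder {
    pub fn add_definition(&mut self, name: &str) {
        self.definitions.push(name.to_string());
    }

    pub fn add_reference(&mut self, name: &str) {
        self.references.push(name.to_string());
    }

    /// Takes `self` by value: the builder cannot be used after resolution,
    /// and any intermediate state it held is dropped here.
    pub fn resolve(self) -> Rc<ResolvedGraph> {
        let mut resolutions: HashMap<String, Vec<usize>> = HashMap::new();
        for reference in &self.references {
            // Placeholder resolution: match references to definitions by name.
            let matches: Vec<usize> = self
                .definitions
                .iter()
                .enumerate()
                .filter(|(_, definition)| *definition == reference)
                .map(|(index, _)| index)
                .collect();
            resolutions.insert(reference.clone(), matches);
        }
        Rc::new(ResolvedGraph {
            definitions: self.definitions,
            resolutions,
        })
    }
}

impl ResolvedGraph {
    /// Looks up the precomputed resolution; no graph traversal happens here.
    pub fn resolve_reference(this: &Rc<Self>, reference: &str) -> Vec<DefinitionHandle> {
        this.resolutions
            .get(reference)
            .map(|indices| {
                indices
                    .iter()
                    .map(|&index| DefinitionHandle { owner: Rc::clone(this), index })
                    .collect()
            })
            .unwrap_or_default()
    }
}

fn main() {
    let mut builder = GraphBuilder::default();
    builder.add_definition("Foo");
    builder.add_reference("Foo");
    let graph = builder.resolve(); // `builder` is moved and unusable from here on
    let definitions = ResolvedGraph::resolve_reference(&graph, "Foo");
    drop(graph); // the handles keep the graph alive through their own Rc
    println!("resolved to {}", definitions[0].name());
}
```

Because `resolve` takes the builder by value, the compiler rejects any attempt to query bindings before resolution, and handles without a lifetime parameter are easier to pass across an FFI or WASM boundary.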

This results in some references being resolved to many ambiguous definitions,
some of which we were able to resolve via ranking.
Most remaining assertion tests were redundant as there were already snapshots
that cover those cases.
This also removes the ranking algorithm for resolution results, since it's no
longer needed.
…attributes

This *should* make it easier to construct a partial paths database in which
these nodes are endpoints.
@ggiraldez ggiraldez requested review from a team as code owners December 19, 2024 21:36

changeset-bot bot commented Dec 19, 2024

⚠️ No Changeset found

Latest commit: 1b36243

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@ggiraldez ggiraldez force-pushed the hooks-database-stitching branch 2 times, most recently from 5d3a2e6 to 156722b Compare December 19, 2024 23:10
@@ -1,11 +1,11 @@
use semver::Version;
Contributor

@OmarTawfik OmarTawfik Dec 20, 2024

which will consume the builder and return a leaner BindingGraph with all bindings resolved.

I assume that means speeding up resolving (all) defs/refs in the file, at the expense of a slower initialization time. Is that correct? Do we have rough figures on how much this is changing? or a benchmark run for before/after?

at the expense of higher memory consumption

Do we have a rough figure of the increased memory as well?

Contributor Author

I assume that means speeding up resolving (all) defs/refs in the file, at the expense of a slower initialization time. Is that correct? Do we have rough figures on how much this is changing? or a benchmark run for before/after?

Yes, that's the expectation. I ran a couple of sanctuary shards locally with

infra run --bin solidity_testing_sanctuary --release -- test --shards-count 256 --shard-index INDEX --check-bindings ethereum mainnet

and while this is not exhaustive by any means, the results are quite significant:

  • For INDEX = 1, total execution time went down from 3'32" to 1'40"
  • For INDEX = 120, total execution time went down from 4'24" to 1'26"
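
For reference, the two timings above work out to speedups of roughly 2.1x and 3.1x. The small sketch below just redoes that arithmetic; the helper names are made up for illustration and the figures are the two shard timings quoted in this comment.

```rust
/// Converts a minutes/seconds pair such as 3'32" into seconds.
fn to_seconds(minutes: u32, seconds: u32) -> u32 {
    minutes * 60 + seconds
}

/// Speedup factor: time before divided by time after.
fn speedup(before_secs: u32, after_secs: u32) -> f64 {
    f64::from(before_secs) / f64::from(after_secs)
}

fn main() {
    // (shard index, time before, time after), taken from the list above.
    let shards = [
        (1, to_seconds(3, 32), to_seconds(1, 40)),
        (120, to_seconds(4, 24), to_seconds(1, 26)),
    ];
    for (index, before, after) in shards {
        println!(
            "shard {index}: {before}s -> {after}s, {:.2}x faster",
            speedup(before, after)
        );
    }
}
```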

I expect similar results for other shards. Overall, for very small contracts we may see a slight increase in time due to the overhead of creating the database, initial population and the increased number of memory allocations. But I expect the overhead to be quickly amortized for larger contracts.

at the expense of higher memory consumption

Do we have a rough figure of the increased memory as well?

This is tough to estimate because it should vary with contract complexity. Empirically I've seen peak memory to be twice as large when using the database. YMMV.

Contributor Author

I queued a run for it now: https://github.com/NomicFoundation/slang/actions/runs/12438290472/

I've seen the results, and we may need to modify the structure of the test slightly. binding_graph_builder.resolve() is called during the definitions test, because in order to access the definitions the bindings need to be resolved already. That means all the cost of resolution is added to the definitions test, while previously it was tallied in the references test (which now has negligible cost).

Contributor

and while this is not exhaustive by any means, the results are quite significant:

Looks great!

I've seen the results, and we may need to modify the structure of the test slightly.

Should we modify it in the same PR, to make sure that the benchmark results are reported correctly for this commit?

Contributor Author

I updated the tests and moved the call to .resolve() into the references test, so all resolution now happens in that last test. The name of the other test, definitions, is now a bit misleading, though, as only ingestion of user source files happens at that stage. But execution costs should be comparable.

@@ -46,3 +46,9 @@ if ! output=$(

exit 1
fi

if [[ ! -f submodules/stack-graphs/stack-graphs/Cargo.toml ]]; then
Contributor

This should already be taken care of by the infra setup git command.

Contributor Author

This doesn't work, because to build infra we need to have all the dependencies available, and with stack-graphs living in a submodule, it is not.

Contributor

I see. Thanks for explaining!

Cargo.toml Outdated
@@ -130,7 +130,7 @@ serde = { version = "1.0.216", features = ["derive", "rc"] }
serde_json = { version = "1.0.133", features = ["preserve_order"] }
similar-asserts = { version = "1.6.0" }
smallvec = { version = "1.7.0", features = ["union"] }
-stack-graphs = { version = "0.13.0" }
+stack-graphs = { path = "submodules/stack-graphs/stack-graphs", version = "0.14.0" }
Contributor

@OmarTawfik OmarTawfik Dec 20, 2024

If we only need the Cargo reference to build this, without any infra/pre-build steps, I wonder why we are adding the submodule at all.

We can just add the crate as a direct git reference, and it will be cloned/built automatically by Cargo:

Suggested change
-stack-graphs = { path = "submodules/stack-graphs/stack-graphs", version = "0.14.0" }
+stack-graphs = { git = "https://github.com/NomicFoundation/stack-graphs", rev = "SPECIFIC_REF_TO_UPDATE_TO" }

Contributor Author

Oh, that's a good idea. And it would handle the previous comment as well.

@@ -130,7 +130,7 @@ serde = { version = "1.0.216", features = ["derive", "rc"] }
serde_json = { version = "1.0.133", features = ["preserve_order"] }
similar-asserts = { version = "1.6.0" }
smallvec = { version = "1.7.0", features = ["union"] }
-stack-graphs = { version = "0.13.0" }
+stack-graphs = { path = "submodules/stack-graphs/stack-graphs", version = "0.14.0" }
string-interner = { version = "0.17.0", features = [
"std",
Contributor

@OmarTawfik OmarTawfik Dec 20, 2024

Since we are forking/editing NomicFoundation/stack-graphs, I suggest making a few changes there first:

  • keeping the main branch pure for upstream changes.
  • adding a nomic branch that contains both upstream and our changes. We can regularly merge changes from main into it.
  • sending PR(s) to the nomic branch with the intended changes.

This will make sure at least one person reviews the changes there, and that the fork is kept up to date with, and separate from, upstream.
To help with this, I'm creating the nomic branch now, and will add CI checks/validation to it, so you can just send the PR.

Contributor Author

Sounds good. I'll set this up.

Contributor Author

Here's the PR. Since we're no longer using the crate as a submodule, we can also remove the linter/formatting configuration options that we had added to ignore warnings.

@@ -130,7 +130,7 @@ serde = { version = "1.0.216", features = ["derive", "rc"] }
serde_json = { version = "1.0.133", features = ["preserve_order"] }
similar-asserts = { version = "1.6.0" }
smallvec = { version = "1.7.0", features = ["union"] }
-stack-graphs = { version = "0.13.0" }
+stack-graphs = { path = "submodules/stack-graphs/stack-graphs", version = "0.14.0" }
string-interner = { version = "0.17.0", features = [
Contributor

the ability to rewind the arena allocator used for partial paths after resolving each reference

Do you think this would be useful for the upstream project? Maybe we could suggest it as a PR, in case they accept it; then we wouldn't have to maintain the fork at all.

Contributor Author

@ggiraldez ggiraldez Dec 20, 2024

We can try, but I doubt it's useful for their normal use cases. The main problem is it's not exactly safe, since you have no direct control over the mutability of the database and a mutable reference is required to do anything meaningful with it. That means you can accidentally allocate new objects in the partial paths arena (which are invalidated when you reset) that you'll be referencing in the database.

I think it may be possible to change the design to take an immutable database reference (inside stack-graphs), but it's probably a much bigger change.
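
To make the hazard concrete, here is a toy sketch of a checkpointed arena. It borrows the `save_checkpoint`/`restore_checkpoint` names mentioned in this PR, but it is an assumption-laden illustration built on a plain `Vec`, not the code in the stack-graphs fork; `Arena`, `Handle` and `Checkpoint` are hypothetical types.

```rust
/// Toy bump arena: values are addressed by index handles.
pub struct Arena<T> {
    items: Vec<T>,
}

/// Index-based handle into the arena (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Handle(usize);

/// Records how many items existed when the checkpoint was taken.
#[derive(Clone, Copy)]
pub struct Checkpoint(usize);

impl<T> Arena<T> {
    pub fn new() -> Self {
        Arena { items: Vec::new() }
    }

    pub fn alloc(&mut self, value: T) -> Handle {
        self.items.push(value);
        Handle(self.items.len() - 1)
    }

    pub fn get(&self, handle: Handle) -> Option<&T> {
        self.items.get(handle.0)
    }

    pub fn save_checkpoint(&self) -> Checkpoint {
        Checkpoint(self.items.len())
    }

    /// Rewinds the arena: everything allocated after the checkpoint is discarded.
    pub fn restore_checkpoint(&mut self, checkpoint: Checkpoint) {
        self.items.truncate(checkpoint.0);
    }
}

fn main() {
    let mut arena = Arena::new();
    let durable = arena.alloc("path stored in the database");
    let checkpoint = arena.save_checkpoint();

    // Scratch data allocated while resolving a single reference.
    let scratch = arena.alloc("temporary path");
    arena.restore_checkpoint(checkpoint);

    // Allocations made before the checkpoint survive the rewind.
    assert_eq!(arena.get(durable), Some(&"path stored in the database"));
    // The scratch handle is stale now.
    assert_eq!(arena.get(scratch), None);

    // The hazard: the next allocation reuses the slot, so a stale handle that
    // leaked into long-lived state would silently point at unrelated data.
    let unrelated = arena.alloc("unrelated data");
    assert_eq!(unrelated, scratch);
}
```

In this toy version a stale handle is at least detectable until the slot is reused; the concern described above is that a database entry could end up holding such a handle without anything flagging it.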

parents: Vec<GraphHandle>,
}

-pub struct BindingGraph<KT: KindTypes + 'static> {
+pub struct BindingGraphBuilder<KT: KindTypes + 'static> {
Contributor

I'm a bit confused by the hierarchy here:

  • BindingGraphBuilder is exposed from lib.rs, and is different from Builder, which is exposed from builder/mod.rs.
  • BindingGraph is exposed from resolved/mod.rs, and is different from Graph, which is exposed from metaslang_graph_builder::graph.

WDYT of restructuring it a bit to clarify the relationships between them? If I can suggest, ordering it by the public API/use cases:

  • builder/mod.rs exposes the public BindingGraphBuilder:
    • Has the internal Builder and Resolver under it.
  • graph/mod.rs exposes the public BindingGraph, and the related public APIs, like:
    • graph/definition.rs
    • graph/reference.rs
    • graph/location.rs

Not blocking for this PR of course.

Contributor Author

Yeah, I'm not happy about the module structure either. What we currently have in the builder module should probably be called loader, since it builds a graph and loads it into our stack graph. Then the resolver and BindingGraphBuilder could live in a builder module.

Contributor Author

I reorganized the code and put all the builder, resolver and loader code under a builder module.

@@ -25,8 +25,8 @@ mod rust {
pub definiens_location: BindingLocation,
}

-impl From<crate::rust_crate::bindings::Definition<'_>> for Definition {
-fn from(definition: crate::rust_crate::bindings::Definition<'_>) -> Self {
+impl From<crate::rust_crate::bindings::Definition> for Definition {
Contributor

This should allow easier integration with WASM since there are already wrappers for ref counted objects.

Given this, I don't think we need these Definition/Reference types here any longer, and can just reuse the types you added in resolved/mod.rs.

Contributor Author

I removed both wrapper classes and changed the code to use the metaslang_bindings Definition and Reference.

This uses the added `save_checkpoint`/`restore_checkpoint`, which rewind the
allocation pointer in the `PartialPaths` arenas. For this to work properly, we
also first call `ensure_both_directions` on the database so that it needs no
further mutation afterwards.
The database resolver will resolve all references at once by using a database of
minimal partial paths.
This makes it impossible to try to access definitions/references before
resolving, and allows dropping the entire stack graph and database of partial
paths used for resolution after they are no longer necessary.
@ggiraldez ggiraldez force-pushed the hooks-database-stitching branch from 15f35da to 8c55602 Compare December 23, 2024 21:16
@ggiraldez
Contributor Author

After #1195 is merged, I'll rebase this PR and the conflicts should be resolved.

…records

The `definitions` test name is now a bit misleading since no definitions are
retrieved there, but it's still where user source files are ingested.

2 participants