Thoughts on persistent caching #157
Here's how I would do it:

```js
import { createReadStream } from 'node:fs';
import { createHash } from 'node:crypto';

export async function resolve(specifier, context, next) {
  const result = await next(specifier, context);
  const url = new URL(result.url);
  if (url.protocol !== 'file:') return result; // for e.g. data: URLs
  const hashChunks = await createReadStream(url).pipe(createHash('sha256')).toArray();
  url.searchParams.set(
    import.meta.url, // An almost certainly unique key
    Buffer.concat(hashChunks).toString('base64url')
  );
  return { ...result, url: url.href };
}
```

By adding the hash to the resolved URL, you are guaranteed per spec that it won't be loaded more than once, so you don't need to implement your own cache. What is the purpose of your custom
According to the current ES spec, there can be only one module per URL. IIRC it's also a limitation of V8; trying to load more than one module on the same URL would lead to undefined behavior. For this reason, Node.js has an internal module cache which loaders cannot access but can rely upon. So loaders are of course free to add an additional cache layer if they see fit, but I'd expect that wouldn't be necessary for most use cases.
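Node's observable behavior here can be checked empirically. A small self-contained sketch (it uses data: URLs so no files are needed; file: URLs behave the same way):

```js
// Demonstration that Node's ESM cache is keyed by the full resolved URL.
void async function() {
  const url = 'data:text/javascript,' + encodeURIComponent('export const value = 1;');
  const a = await import(url);
  const b = await import(url);
  console.log(a === b); // true: the same URL maps to the same module record

  // Any difference in the URL (here a trailing JS comment) yields a new record.
  const c = await import(url + '//cache-bust');
  console.log(a === c); // false: a distinct URL is a distinct module
}();
```

This is the property the hash-in-the-query-string trick above relies on: changing any part of the URL is enough to get a fresh module instance.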
I think you misunderstood. I am talking about a persistent file cache for caching the results of transformations between different runs of nodejs, not an in-memory cache for caching instances of modules within a single process. The motivation is explained clearly in the first few lines of the comment.
I'm sorry but none of this is true.
The resolution process is not specified by es262 at all. HostLoadImportedModule is host-defined and can be anything. They punted this to other specifications, and rightfully so.
v8 doesn't care at all about the module URL; it's just metadata on a module record. When you invoke

We can verify with my other project isolated-vm, which is as close to raw v8 bindings as you can get in nodejs:

```js
const ivm = require('isolated-vm');
void async function() {
  const isolate = new ivm.Isolate();
  for (let ii = 0; ii < 10; ++ii) {
    console.log(await isolate.compileModule('import foo from "foo"; export {};', { filename: 'file:///wow' }));
  }
}();
```
You can also verify with
ecma262 defines a
Sure but it must be stable: "If this operation is called multiple times with the same (referrer, specifier) pair […] then it must perform FinishLoadingImportedModule(referrer, specifier, payload, result) with the same result each time." But we're getting off topic, a loader doesn't have to comply with the ES spec anyway.
I completely missed that, sorry for the confusion.
ecma262 doesn't have any requirements on the specifier except that it's a string; HTML is what requires they be URLs. node is free to make whatever choice it wants here, since it's not a web browser.
Could we create a cache based on the resolved URL and a hash (like a shasum) of the source returned by
Ideally you wouldn't need to call

Imagine a generalized Babel loader that transforms your source based on the contents of babelrc. You invoke nodejs with something like:

Invocation 1 (fresh):

Invocation 2 (afterward):

What I'm suggesting is a cache scheme which allows us to elide the invocation to
That's fine. I guess what this illuminates, though, is that there can be varying goals for creating a cache: avoiding file reads (the goal you cited) or avoiding processing. For example, if your loader is the one that does the transpilation, you could use the approach I suggested to load transpiled output from cache rather than doing the transpilation work again.
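A sketch of that second goal (the names `CACHE_DIR` and `transform` are hypothetical, and a real Babel loader would also need to fold the babelrc contents and loader version into the key): a load hook that caches its transformed output keyed by a hash of the untransformed source.

```js
import { createHash } from 'node:crypto';
import { mkdir, readFile, writeFile } from 'node:fs/promises';
import path from 'node:path';
import os from 'node:os';

// Hypothetical cache location; a real loader would make this configurable.
const CACHE_DIR = path.join(os.tmpdir(), 'transform-loader-cache');

export async function load(url, context, nextLoad) {
  const result = await nextLoad(url, context);
  if (result.format !== 'module') return result;

  // Key the cache entry on a hash of the untransformed source, so any change
  // to the input invalidates the entry automatically.
  const hash = createHash('sha256').update(result.source).digest('hex');
  const cachePath = path.join(CACHE_DIR, `${hash}.mjs`);
  try {
    // Cache hit: skip the expensive transformation entirely.
    return { ...result, source: await readFile(cachePath, 'utf8') };
  } catch {
    // Cache miss: transform, then persist the output for the next invocation.
    const transformed = transform(String(result.source));
    await mkdir(CACHE_DIR, { recursive: true });
    await writeFile(cachePath, transformed);
    return { ...result, source: transformed };
  }
}

// Trivial stand-in for an expensive transformation (e.g. a Babel pass).
function transform(source) {
  return `// transformed\n${source}`;
}
```

Because the key is a content hash, editing a source file invalidates only that file's entry; no explicit invalidation step is needed.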
Yeah, I didn't mean to say that either case is more valid than the other. Studying both is great. Thinking about good "best practices" for caching would really benefit the ecosystem. Right now my intuition is that caching should live in a dedicated loader. If each loader implements its own caching mechanism then you might actually run into very poor performance on first load, because cache misses aren't free.
I made a loader called dynohot which implements hot module reloading in nodejs as an experimental loader.
One of the requirements of the loader is a code transformation written using Babel. Like all Babel transformations there is a good bit of overhead involved. I wanted to add a file cache to avoid the transformation in the common case where most source files are unchanged since the last invocation. This raised a bunch of questions and ad-hoc solutions that I'd like to share here.
How do we know what previous loaders are doing?
The result of `nextLoad` may change depending on what loaders are defined before us in the chain. If we want to be able to cache a result then it is necessary to ask each loader for a cache key.

The ad-hoc solution for this is a resolve hook. Cache-aware loaders will define a resolver for "loader:cache-key" and add their cache key to the payload blob. A cache provider will resolve this cache key and use it as a "namespace" while caching.
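The loader-side half of that scheme might be sketched as follows; the payload format (a JSON array smuggled through a data: URL) is purely hypothetical:

```js
// In a cache-aware loader: respond to the "loader:cache-key" probe with this
// loader's contribution to the chain-wide cache key, then let loaders later
// in the chain append their own. (The data: URL payload is hypothetical.)
const LOADER_VERSION = 'my-babel-loader@1.2.3'; // anything that invalidates on change

export async function resolve(specifier, context, nextResolve) {
  if (specifier === 'loader:cache-key') {
    let keys = [];
    try {
      // Collect cache keys contributed by loaders later in the chain.
      const next = await nextResolve(specifier, context);
      keys = JSON.parse(decodeURIComponent(new URL(next.url).pathname.slice(1)));
    } catch {
      // No later loader handles the probe; start a fresh list.
    }
    keys.push(LOADER_VERSION);
    return {
      shortCircuit: true,
      url: `data:,${encodeURIComponent(JSON.stringify(keys))}`,
    };
  }
  return nextResolve(specifier, context);
}
```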
A cache provider then only needs to do `const cacheKey = import.meta.resolve("loader:cache-key")` (with try/catch) to build a cache key for the active loader chain.

How do we know all the source dependencies for the previous loaders and resolvers?
A load hook can use information from multiple sources to generate a single underlying module source text blob. For example, a TypeScript loader would use the settings in `tsconfig.json` to determine whether or not it should omit type-only imports. This has a material impact on the resulting `source` payload.

Resolve hooks run into the same issue. `package.json`, and the absence of `package.json` in traversed directories, affects the way a specifier is resolved to a module URL.

The ad-hoc solution is to pass forward arbitrary information in the `result` object, but this isn't something that is officially documented and seems subject to the whims of the implementation. This would inform a cache provider about which files it needs to `stat` before returning a cached response. It also would have benefits for other loaders; for example, it would tell dynohot which files it needs to watch for updates.

What is a cache provider?
This led to my final question: whose job is it to cache? No loader should make any presumptions about its position in the loader chain. Multiple loaders implementing different forms of caching would lead to inconsistent caching, suboptimal performance, and duplicated code. Therefore I think terminating your loader chain with a caching loader makes the most sense.
The simplest caching loader would take the result of `nextLoad` and save a persistent cache entry for the given `moduleURL`, `sourceURLs`, and active chain cache key. It would be up to the user, and not the earlier loaders, whether and how they want to cache.

Another example of a caching loader would be one which JIT-bundles arbitrary packages under node_modules. I think it's absolutely deranged that packages in the npm ecosystem distribute minified single-file build artifacts from rollup. But of course there are very real benefits to loading and parsing performance, so I understand why they do this. If this was implemented as a loader it would encourage package authors to distribute plain source files and let the user decide how to manage caching for their project.
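A minimal terminating caching loader along these lines might look like the following sketch. The `sourceURLs` result field is the proposal from this issue (not implemented by Node today), and the on-disk layout and mtime-based freshness check are hypothetical choices:

```js
import { createHash } from 'node:crypto';
import { mkdir, readFile, stat, writeFile } from 'node:fs/promises';
import path from 'node:path';
import os from 'node:os';

// Hypothetical on-disk layout; a real implementation would make this configurable.
const CACHE_DIR = path.join(os.tmpdir(), 'terminal-loader-cache');

export async function load(url, context, nextLoad) {
  // Namespace entries by the chain-wide cache key, if any loader provides one.
  // (import.meta.resolve is synchronous in Node >= 20.6; the try/catch also
  // covers environments where the probe is unresolvable.)
  let chainKey = '';
  try {
    chainKey = import.meta.resolve('loader:cache-key');
  } catch {}
  const entryPath = path.join(
    CACHE_DIR,
    createHash('sha256').update(`${chainKey}\n${url}`).digest('hex') + '.json');
  try {
    const entry = JSON.parse(await readFile(entryPath, 'utf8'));
    // Validate the entry against the mtimes of its recorded source dependencies.
    const checks = await Promise.all(entry.sources.map(async (source) =>
      (await stat(new URL(source.url))).mtimeMs === source.mtimeMs));
    if (checks.every(Boolean)) {
      return { format: entry.format, source: entry.source, shortCircuit: true };
    }
  } catch {
    // Missing or stale entry: fall through to a full load.
  }
  const result = await nextLoad(url, context);
  // `sourceURLs` is the proposed (not yet implemented) field; when absent,
  // assume the module's own file is its only source dependency.
  const sourceURLs = result.sourceURLs ?? [url];
  const sources = await Promise.all(sourceURLs.map(async (sourceURL) => ({
    url: sourceURL,
    mtimeMs: (await stat(new URL(sourceURL))).mtimeMs,
  })));
  await mkdir(CACHE_DIR, { recursive: true });
  await writeFile(entryPath, JSON.stringify({
    format: result.format,
    source: String(result.source),
    sources,
  }));
  return result;
}
```

An mtime check is the cheap option sketched here; a content-hash check of each source URL would be more robust against clock and copy artifacts at the cost of reading every dependency on each run.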
Do we standardize this?
I proposed 2 ad-hoc solutions here: the `loader:cache-key` resolution specifier, and a `sourceURLs` array on the result of `resolve` and `load`. By asking loaders to provide this information we unlock the possibility of caching / watching loaders. Is this something we want to encourage? I think the cache-key solution would be better represented as an export on the loader, but this isn't possible without support in the host environment. `sourceURLs` I think is pretty close to ideal and is only missing an implementation from the default loaders.