ENH: dataset cache for tests #4608
Conversation
Nice! Left some comments, largely about the interface for such a fixture/helper.
datalad/tests/fixtures.py (outdated):

```python
Note
----
`version` not yet implemented. Intention: verify that `version` can be
```
Most likely `version` should then be included in the cache path, since different tests might want different versions... The alternative is to check out a version for the test, but that would prohibit ever running in parallel.
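Keying the cache on the version as well could look roughly like this (a hypothetical sketch; `cache_path`, the md5 keying, and the `--` separator are illustrative, not datalad's actual scheme):

```python
import hashlib
from pathlib import Path

def cache_path(cache_root, url, version=None):
    # One cache directory per (url, version) pair, so tests wanting
    # different versions of the same dataset never share a checkout.
    # Hashing the URL maps arbitrary URLs to valid directory names.
    key = hashlib.md5(url.encode()).hexdigest()
    if version:
        key += "--" + version
    return Path(cache_root) / key
```

Two tests requesting different versions of the same URL then get disjoint cache entries and can run in parallel.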
See the comment below about the necessity of @datalad_dataset always delivering a clone. I'd prefer to check out a particular version only in that clone that is delivered to the test, while the cached dataset itself only replaces the original source.
datalad/tests/fixtures.py (outdated):

```python
ds = Dataset(DATALAD_TESTS_CACHE / dataset_name)
if not ds.is_installed():
    ds = clone(url, ds.pathobj)
```
ds.clone(url)?
Shouldn't that clone into ds instead? That's not what I want.
datalad/tests/fixtures.py (outdated):

```python
@with_tempfile
def newfunc(*arg, **kw):
    ds = get_cached_dataset(url, name, version=version)
    clone_ds = clone(ds.pathobj, arg[-1])
```
In many cases the clone wouldn't be needed - make it an option? That is why I thought to make get optional in the original cached one as well.
Maybe it could also be reckless? Pass clone kwargs?
I think it should always be a clone. We shouldn't actually operate on datasets that are used across tests. And it wouldn't be a cache if we delivered it directly. Getting a cached dataset shouldn't have any side effects compared to getting it directly from that URL (which would give you a clone of the original one). To me the entire point is to be equivalent to cloning from that URL to a temp. location from the point of view of a test, except that it becomes cheaper across tests and test runs in terms of network traffic and time needed. The cache serves just as a drop-in replacement for the original URL. And with the suggested env var it could then be disabled to clone directly from that URL (or use a temp. cache instead of the user's). Therefore the env var makes no difference to the test at all, but only to its performance.
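The drop-in-replacement property can be sketched with a tiny resolver (the helper name and the env var semantics are assumptions for illustration, not datalad's implementation): the test clones from whatever this returns, so toggling the cache changes performance, never test semantics.

```python
import os

def clone_source(url, cache_lookup):
    """Return the location a test should clone from.

    `cache_lookup` maps a URL to a local cached path, or None if the
    URL is not cached. With caching disabled the original URL is
    returned unchanged, so the test behaves identically either way.
    """
    if os.environ.get("DATALAD_TESTS_CACHE_DISABLED"):
        # caching disabled: behave exactly as if no cache existed
        return url
    cached = cache_lookup(url)
    return cached if cached is not None else url
```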
Re clone kwargs:
Guess that's useful, although I'm starting to worry about having to write tests for all those features and their combinations.
Thinking about it a bit more:
Maybe it should be two decorators (don't want too many features in one thing):

- As described above: replaces the remote one specified by `url` with a cached one as the origin and delivers a clone thereof. If caching is disabled, delivers a direct clone of the original remote.
- Doesn't deliver a clone, but a URL to be used in the test: the original URL if caching is disabled, and the path to a cached dataset (or a `file://` scheme URL) if enabled.
What do you think about this, @yarikoptic ?
Done. `@cached_dataset` + `@cached_url` instead of `@datalad_dataset`.
Force-pushed from da7e56c to b517642.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #4608      +/-  ##
===========================================
+ Coverage   49.82%   89.57%   +39.74%
===========================================
  Files         286      288        +2
  Lines       38822    39000      +178
===========================================
+ Hits        19344    34934    +15590
+ Misses      19478     4066    -15412
```

Continue to review the full report at Codecov.
Looks nice! Yet to have a better look - it grew ;-) The file needs to be renamed though, away from the generic "fixtures".
Re

> See comment below about necessity of @datalad_dataset always delivering a clone

It seems I can't comment on that outdated one, so doing it here - I actually thought it is your use case - to use a cached dataset as the input dataset for data conversion in hirni. That is how we use/need them in heudiconv. If such a dataset is used as input, I don't see the need for a clone of a clone with get - it would just double disk consumption and keep re-getting data locally over and over again. Yes, it is safer, but wasteful for those use cases.
datalad/tests/fixtures.py (outdated):

```python
if version:
    # check whether version is available
    assert ds.repo.commit_exists(version)
# TODO: What about an outdated dataset in cache? How do we properly update
```
I guess whenever the need comes, we could add an option for that. But that would require those cycles etc., maybe now? ;-)
I think an option isn't it. That's again because it is supposed to work as a cache and thereby needs to be transparent to the test. If the cache was skipped, no update would ever be required. If we now need to specify an option for how to update, the usage becomes messy, I think.
I'd rather do something like:
If the dataset exists in cache, fetch and just proceed if there was nothing to fetch. If there was something, however, we would need to decide what to do with it (merge isn't necessarily it) or - simplest solution - throw the cache away in that case and get a fresh one.
Good point about transparency! Then update only if the version isn't found locally, or if explicitly requested overall?
We will also benefit from an explicit trigger to reset or ignore the cached version (relates to the "fresh one") - I had to do that in dandi as well (https://github.com/dandi/dandi-cli/blob/master/dandi/support/cache.py#L38), although for a different type of cache (see con/fscacher#1 for a higher level description/discussion, forgot to ping you there). I can see how it would be useful here as well.
Will have a look. But quick remark:

> Then update only if version isn't found locally or explicitly requested overall?

Insufficient. "version" could be a branch. A test would "mean" the branch as available from `url`. But that one could have been updated. We would "find" `version` in an outdated dataset in cache.
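The distinction can be made explicit: only a full commit hash found locally proves the cache has the right state; a branch or tag name must still be re-checked against the remote (hypothetical helper for illustration):

```python
import re

def local_hit_is_sufficient(version):
    # A full 40-character hexsha is immutable: if the cached clone has
    # it, the cache is good enough. Anything else (branch, tag, or an
    # abbreviated hash) could have moved on the remote, so finding it
    # locally proves nothing and a fetch is still needed.
    return bool(re.fullmatch(r"[0-9a-f]{40}", version))
```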
Well, it depends on what the test is going to do with the dataset. Generally I'd go for the safer option. Blowing up all tests simultaneously makes it harder to figure out what's going on. We've had those issues before. But I agree - as a test knows how dangerous it would be, it can now decide to use either of those two decorators. With the second one, no per-test clone is enforced. However, if we want to be able to decide whether or not caching is enabled, we cannot possibly provide a cached dataset directly, since then the test itself needs to be aware of the cache and its status. That kinda contradicts the definition of a cache.
I don't think it really contradicts. Just that a test should announce if it needs a clone, e.g. for the purpose of guarantees of absent content or for performing changes to it. We have that also in "good" old with_test_repos as well already. (I think this new helper might provide the core functionality for an RF of that elderly beast as well later on ;-))
Don't think so. If we deliver the cached dataset itself, it's not transparent anymore and the test relies on something being present in the cache. It shouldn't. Requesting a persistent-across-tests dataset to be locally available is a very different thing from a cache. Of course we could build that almost the same way, but I would at least separate that kind of request from what is implemented here. So - a very similar implementation (different config) and a different decorator. I can do that, but:
BTW! We could make the original clone read-only (after get completes, and rw for getting before that)! So tests which do not request a clone and try to modify it anyhow would fail - a nice test in itself IMHO.
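A sketch of that read-only safeguard (an assumed approach using plain permission bits, not anything datalad implements): strip write permission from every file in the cached clone, so a test that bypasses its own clone fails loudly.

```python
import os
import stat

def make_read_only(root):
    # Remove all write bits from files under `root`; a test that tries
    # to modify the shared cached clone then gets a PermissionError
    # instead of silently corrupting state for other tests.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```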
Anyways, that is just my 1c, and even without original clone reuse it would be useful, so it is ok to nevermind me ;-)
If combined with the separation I mentioned above (have a dedicated decorator requesting a persistent clone rather than an actual transparent cache), I kinda like that idea.
Left a comment/recommendation about clone location
Assuming that tests still pass, I think this is ready to go into 0.13.0 as is. Agreed, @datalad/developers? P.S.: If so, I should prob. squash before merge.
Force-pushed from 5f55c42 to 824d694.
```python
# get_cached_dataset yet, since translation to keys depends on a
# worktree. We'll have the worktree of `version` only after cloning.
ds = get_cached_dataset(url, version=version)
clone_ds = Clone()(ds.pathobj, arg[-1])
```
BTW -- to be a more "transparent cache" instance, shouldn't it then:

- clone so that the local cached "origin" is named differently than the "origin" remote (e.g. `local-cached`)
- add an "origin" remote which would be the original URL?

The only possible side-effect I see could be -- it would need to fetch that "origin" remote, so it might prevent testing "offline", but so far I think the entire approach is for "online" testing.
Also, that would probably undermine the benefits for some test use cases which might want to request data specifically from "origin".
> add "origin" remote which would be the original URL?
> the only possible side-effect I see could be -- it would need to fetch that "origin" remote so might prevent testing "offline" but so far I think the entire approach is for "online" testing

Quite the opposite! With that approach a test would need to be aware of the cache in order to make any use of it - that's the exact opposite of transparent. With the implementation here, the test always talks to "origin" - not knowing whether or not it's cached. That is the point! Also: I can't disable caching without side effects on the tests if they explicitly talk to a cache remote.
Thanks, this looks very useful.
Force-pushed from d03c6f0 to 01dd84c.
Okay, per today's call: will squash, open a proper issue for further development, and merge ...
This introduces two test decorators for caching remote datasets for tests. The caching location defaults to the user-level cache dir for datalad under tests/ (GNU/Linux: $HOME/.cache/datalad/tests) and can be configured via 'datalad.tests.cache'. Not specifying this disables caching altogether. The intent is a transparent cache, so decorated tests do not know whether or not caching is enabled and must not depend on it. The dataset in the cache is a drop-in replacement for the one at the specified URL.
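The cache-location logic the commit message describes could be resolved along these lines (a sketch only: the dict stands in for datalad's config machinery, and the exact disable semantics are an assumption here):

```python
from pathlib import Path

# Default per the commit message: <user cache dir>/datalad/tests
DEFAULT_CACHE = str(Path.home() / ".cache" / "datalad" / "tests")

def resolve_cache_dir(config):
    # `config` stands in for datalad's configuration; an empty value
    # for 'datalad.tests.cache' is taken to mean "caching disabled".
    value = config.get("datalad.tests.cache", DEFAULT_CACHE)
    return Path(value) if value else None
```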
Force-pushed from 01dd84c to 0399eda.
Per discussion, squash, and greens of CI - merging! Thanks @bpoldrack!
Inspired by discussion in datalad/datalad-extensions#12 and internal chat.
This allows using remote datasets in tests via a cache, in order to reduce time and network traffic.
The goal is to replace the original remote repository (specified by a URL) with a local clone in a caching location. From the point of view of a test this is transparent - the cached dataset simply replaces the actual remote one. It introduces a config variable `datalad.tests.cache` to specify the caching location or disable caching entirely. The default location is `<user's cache dir>/datalad/tests`, which results in datasets used by tests that way being persistent across test runs, not only across tests! This should be convenient for ourselves and irrelevant in CI builds, where the only interesting thing is persistence across tests (the entire environment is fresh every time after all).

This PR brings two test decorators:
- `cached_dataset`: This delivers a `Dataset` to the test that is a temporary clone of the dataset in the cache (or of the actual URL if caching was disabled). That clone intentionally is not persistent - each test gets its own clone. The difference is that not every test using it needs to clone (and get content) via the network again, making it much cheaper (plus reducing the chance of being rejected/blacklisted by flooding some services during our CI runs). In addition, this decorator allows specifying paths to get and a version to check out in that clone (of a clone).
- `cached_url`: As opposed to the above, this decorator merely delivers a URL, not a clone, to a test. If caching is disabled it merely passes on the specified URL, otherwise it delivers a file-scheme URL to the instance in the cache. If the test merely needs something to clone from rather than to actually operate on, this is the way to go.
Note that there are TODOs in the code wrt enhancing this concept. If everything apart from that is fine with the rest of @datalad/developers, I'd rather postpone that and get the current version into 0.13.0 in order to solve the immediate use case with datalad-hirni, which therefore needs to depend on a datalad version that actually has those decorators.
Also: Those TODOs raise some questions to explore.