
ENH: dataset cache for tests #4608

Merged 1 commit on Jun 10, 2020

Conversation

@bpoldrack (Member) commented Jun 5, 2020

Inspired by discussion in datalad/datalad-extensions#12 and internal chat.

This allows using remote datasets in tests via a cache, in order to reduce time and network traffic.
The goal is to replace the original remote repository (specified by a URL) with a local clone in a caching location. From the point of view of a test this is essentially transparent: the cached dataset simply replaces the actual remote one. The PR introduces a config variable datalad.tests.cache to specify the caching location or to disable caching entirely. The default location is <user's cache dir>/datalad/tests, which means datasets used by tests this way are persistent across test runs, not only across tests! This should be convenient for ourselves and is irrelevant in CI builds, where the only interesting property is persistence across tests (the entire environment is fresh every time, after all).
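The default-location logic described above can be sketched roughly as follows. This is a stdlib-only sketch assuming XDG conventions; the actual implementation goes through DataLad's config machinery and a platform-aware user cache dir, so the helper name `tests_cache_dir` and the fallback path here are purely illustrative:

```python
import os
from pathlib import Path

def tests_cache_dir(cfg_value=None):
    """Illustrative resolution of the 'datalad.tests.cache' setting.

    cfg_value: the configured value, if any. An empty value disables
    caching (returns None); an explicit path is used as-is; otherwise
    fall back to <user's cache dir>/datalad/tests.
    """
    if cfg_value == "":
        return None  # caching disabled
    if cfg_value:
        return Path(cfg_value)
    # default: user-level cache dir (XDG sketch; real code is platform-aware)
    base = Path(os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache")))
    return base / "datalad" / "tests"
```

Because the default lives in the user's cache dir, the cache survives across test runs on a developer machine, while a CI run simply gets a fresh one.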

This PR brings two test decorators:

  1. cached_dataset:
    This delivers a Dataset to the test that is a temporary clone of the dataset in the cache (or of the actual URL if caching is disabled). That clone intentionally is not persistent - each test gets its own clone. The difference is that not every test using it needs to clone (and get content) via the network again, making it much cheaper (and reducing the chance of being rejected/blacklisted for flooding some services during our CI runs). In addition, this decorator allows specifying paths to get and a version to check out in that clone (of a clone).

  2. cached_url:
    In contrast to the above, this decorator merely delivers a URL, not a clone, to a test. If caching is disabled it passes on the specified URL; otherwise it delivers a file-scheme URL to the instance in the cache. If the test merely needs something to clone from, rather than something to actually operate on, this is the way to go.

Note that there are TODOs in the code regarding enhancements of this concept. If everything apart from that is fine with the rest of @datalad/developers, I'd rather postpone those and get the current version into 0.13.0 in order to solve the immediate use case with datalad-hirni, which therefore needs to depend on a datalad version that actually has those decorators.
Also: those TODOs raise some questions to explore.

@bpoldrack bpoldrack marked this pull request as draft June 5, 2020 12:26
@yarikoptic (Member) left a comment

Nice! Left some comments, largely about the interface for such a fixture/helper

Two resolved review threads on datalad/tests/fixtures.py (outdated)

Note
----
`version` not yet implemented. Intention: verify that `version` can be
Member:

Most likely version should then be included in the cache path, since different tests might want different versions... The alternative is to check out a version for the test, but that would prohibit ever running in parallel

Member Author:

See comment below about the necessity of @datalad_dataset always delivering a clone. I'd prefer to check out a particular version only in the clone that is delivered to the test, while the cached dataset itself only replaces the original source.


ds = Dataset(DATALAD_TESTS_CACHE / dataset_name)
if not ds.is_installed():
ds = clone(url, ds.pathobj)
Member:

ds.clone(url)?

Member Author:

Shouldn't that clone into ds instead? That's not what I want.

@with_tempfile
def newfunc(*arg, **kw):
ds = get_cached_dataset(url, name, version=version)
clone_ds = clone(ds.pathobj, arg[-1])
Member:

In many cases a clone wouldn't be needed - make it an option? That is why I thought to make get optional in the original cached one as well

Member:

Maybe it could also be reckless? Pass clone kwargs?

Member Author:

I think it should always be a clone. We shouldn't operate directly on datasets that are used across tests, and it wouldn't be a cache if we delivered it directly. Getting a cached dataset shouldn't have any side effects compared to getting it directly from that URL (which would give you a clone of the original one). To me the entire point is to be equivalent, from the point of view of a test, to cloning from that URL to a temporary location - except that it becomes cheaper across tests and test runs in terms of network traffic and time. The cache serves just as a drop-in replacement for the original URL. With the suggested env var it could then be disabled, cloning directly from that URL (or using a temporary cache instead of the user's). The env var therefore makes no difference to the test at all, only to its performance.

Member Author:

Re clone kwargs:
I guess that's useful, although I'm starting to worry about having to write tests for all those features and their combinations.

Member Author:

Thinking about it a bit more:
Maybe it should be two decorators (I don't want too many features in one thing):

  1. As described above: replaces the remote dataset specified by URL with a cached one as the origin and delivers a clone thereof. If caching is disabled, delivers a direct clone of the original remote.

  2. Doesn't deliver a clone, but a URL to be used in the test: the original URL if caching is disabled, and a path to the cached dataset (or a file: scheme URL) if enabled.

What do you think about this, @yarikoptic ?

Member Author:

Done. @cached_dataset + @cached_url instead of @datalad_dataset.

@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch 2 times, most recently from da7e56c to b517642 Compare June 5, 2020 15:02
codecov bot commented Jun 5, 2020

Codecov Report

Merging #4608 into master will increase coverage by 39.74%.
The diff coverage is 97.75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4608       +/-   ##
===========================================
+ Coverage   49.82%   89.57%   +39.74%     
===========================================
  Files         286      288        +2     
  Lines       38822    39000      +178     
===========================================
+ Hits        19344    34934    +15590     
+ Misses      19478     4066    -15412     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/tests/utils_cached_dataset.py 92.30% <92.30%> (ø)
datalad/tests/test_utils_cached_dataset.py 100.00% <100.00%> (ø)
datalad/core/local/tests/test_create.py 100.00% <0.00%> (+0.44%) ⬆️
datalad/local/subdatasets.py 95.65% <0.00%> (+0.86%) ⬆️
datalad/core/local/diff.py 95.29% <0.00%> (+1.17%) ⬆️
datalad/support/external_versions.py 96.07% <0.00%> (+1.30%) ⬆️
datalad/dochelpers.py 87.40% <0.00%> (+1.48%) ⬆️
... and 180 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yarikoptic (Member) left a comment
Looks nice! Yet to have a better look - it grew ;-) The file needs to be renamed, though, from the generic fixtures.

Re

See comment below about necessity of @datalad_dataset always delivering a clone

It seems I can't comment on that outdated one, so doing it here: I actually thought this is your use case - using a cached dataset as the input dataset for data conversion in hirni. That is how we use/need them in heudiconv. If such a dataset is used as input, I don't see the need for a clone of a clone with get - it would just double disk consumption and keep re-getting data locally over and over again. Yes, it is safer, but wasteful for those use cases.

Resolved review thread on datalad/tests/fixtures.py (outdated)
if version:
# check whether version is available
assert ds.repo.commit_exists(version)
# TODO: What about an outdated dataset in cache? How do we properly update
Member:

I guess whenever the need comes, we could add an option for that. But that would require a release cycle etc. - maybe now? ;-)

Member Author:

I don't think an option is it. That's again because it is supposed to work as a cache and thereby needs to be transparent to the test. If the cache were skipped, no update would ever be required; if we now need to specify an option on how to update, the usage becomes messy, I think.
I'd rather do something like this:
If the dataset exists in cache, fetch, and just proceed if there was nothing to fetch. If there was something, however, we would need to decide what to do with it (merge isn't necessarily it) or - as the simplest solution - throw the cache away in that case and get a fresh one.
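The "fetch, and refresh on any change" policy could be sketched with plain git like this. `ensure_fresh` is a hypothetical helper (the PR does not implement this), and detecting updates via fetch's stderr output is a heuristic, not a guaranteed git contract:

```python
import shutil
import subprocess
from pathlib import Path

def ensure_fresh(cached_repo: Path, url: str) -> Path:
    """Hypothetical helper: keep a cached clone usable as a stand-in for url.

    Simplest policy from the discussion: fetch, and if anything new
    arrived, throw the cache away and re-clone instead of trying to merge.
    """
    if not cached_repo.exists():
        subprocess.run(["git", "clone", url, str(cached_repo)],
                       check=True, capture_output=True)
        return cached_repo
    fetched = subprocess.run(
        ["git", "-C", str(cached_repo), "fetch", "origin"],
        check=True, capture_output=True, text=True)
    # heuristic: `git fetch` is silent when everything is up to date,
    # and reports ref updates on stderr otherwise
    if fetched.stderr.strip():
        shutil.rmtree(cached_repo)
        subprocess.run(["git", "clone", url, str(cached_repo)],
                       check=True, capture_output=True)
    return cached_repo
```

Discarding and re-cloning is wasteful on updates, but it keeps the cache a pure drop-in for the URL, with no merge semantics for tests to depend on.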

Member:

Good point about transparency! Then update only if the version isn't found locally, or when explicitly requested overall?
We would also benefit from an explicit trigger to reset or ignore the cached version (relates to the "fresh one") - I had to do that in dandi as well (https://github.com/dandi/dandi-cli/blob/master/dandi/support/cache.py#L38), although for a different type of cache (see con/fscacher#1 for a higher-level description/discussion; I forgot to ping you there). I can see how it would be useful here as well.

Member Author:

Will have a look. But a quick remark:

Then update only if the version isn't found locally, or when explicitly requested overall?

Insufficient. "version" could be a branch. A test would "mean" the branch as available from url, but that branch could have been updated - we would then "find" the version in an outdated dataset in cache.

@bpoldrack (Member Author):

If such a dataset is used as input, I don't see the need for a clone of a clone with get - it just would double disk consumption and keep re-getting data locally over and over again

Well, it depends on what the test is going to do with the dataset. Generally I'd go for the safer option. Blowing up all tests simultaneously makes it harder to figure out what's going on - we've had those issues before. But I agree: as a test knows how dangerous it would be, it can now decide to use either of those two decorators. With the second one, no per-test clone is enforced. However, if we want to be able to decide whether or not caching is enabled, we cannot possibly provide a cached dataset directly, since then the test itself would need to be aware of the cache and its status. That kind of contradicts the definition of a cache.

@yarikoptic (Member) commented Jun 6, 2020

I don't think it really contradicts. Just that a test should announce if it needs a clone, e.g. for the purpose of guarantees of absent content or for performing changes to it. We have that in the "good" old with_test_repos as well already. (I think this new helper might provide the core functionality for a refactoring of that elderly beast later on ;-))

@bpoldrack (Member Author):

Just that test should announce if it needs a clone, eg for the purpose of guarantees of absent content or for performing changes to it.

I don't think so. If we delivered the cached dataset itself, it wouldn't be transparent anymore, and the test would rely on something being present in the cache. It shouldn't. Requesting a persistent-across-tests dataset to be locally available is a very different thing than a cache. Of course we could build that almost the same way, but I would at least separate that kind of request from what is implemented here. So: a very similar implementation (different config) and a different decorator. I can do that, but:

  1. I don't have an immediate use case.

  2. I generally don't like to encourage and suggest that kind of test writing. Sure, it's "wasteful" if all the tests work just right, but the point of testing is to ensure it's working right, not to assume it ;)

@yarikoptic (Member):

BTW! We could make the original clone read-only (after get completes, and rw for getting before that)! So the tests which do not request a clone but try to modify it anyhow would fail - a nice test in itself IMHO
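The read-only idea could be sketched with plain permission bits. `set_writable` is a hypothetical helper, not part of the PR, and it only covers worktree files (git-annex separately locks annexed content):

```python
import stat
from pathlib import Path

def set_writable(root: Path, writable: bool) -> None:
    """Hypothetical helper: strip (or restore) write bits under a cached
    clone, so a test that did not request its own clone fails loudly on
    any attempt to modify it."""
    for p in [root, *root.rglob("*")]:
        mode = p.stat().st_mode
        if writable:
            p.chmod(mode | stat.S_IWUSR)
        else:
            p.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

The cache would be made writable around populate/get operations and read-only otherwise, turning accidental modification into an immediate permission error.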

@yarikoptic (Member):

Anyways, that is just my 1c, and even without original clone reuse it would be useful, so it is ok to nevermind me ;-)

@bpoldrack (Member Author):

BTW! We could make the original clone read-only (after get completes, and rw for getting before that)! So the tests which do not request a clone but try to modify it anyhow would fail - a nice test in itself IMHO
Anyways, that is just my 1c, and even without original clone reuse it would be useful, so it is ok to nevermind me ;-)

If combined with the separation I mentioned above (a dedicated decorator requesting a persistent clone rather than an actual transparent cache), I kind of like that idea.

@yarikoptic (Member) left a comment
Left a comment/recommendation about clone location

Resolved review thread on datalad/tests/fixtures.py (outdated)
@bpoldrack (Member Author):

Assuming that tests still pass, I think this is ready to go into 0.13.0 as is.
If we agree on this, I'll make an issue to collect the next pieces as discussed here with @yarikoptic and outlined in the code comments. This would include adding very similar decorators that provide persistent read-only clones (in contrast to the cache, this would not be transparent to the tests). Furthermore, the cache could be more efficient with --bare or --mirror clones, etc., which in turn requires such option(s) for datalad-clone. Recursion is yet another topic to address. Generalization for other uses, such as in heudiconv, or even some kind of datalad-cache feature, needs some evolution and experience in using it, I guess.
However, this is already something very useful, I think; it solves the immediate use case, and most of the possible changes and additions are either independent of or transparent for the decorated tests.

Agreed, @datalad/developers ?

P.S.: If so, I should prob. squash before merge.

@bpoldrack bpoldrack marked this pull request as ready for review June 8, 2020 00:06
@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch 3 times, most recently from 5f55c42 to 824d694 Compare June 8, 2020 04:40
# get_cached_dataset yet, since translation to keys depends on a
# worktree. We'll have the worktree of `version` only after cloning.
ds = get_cached_dataset(url, version=version)
clone_ds = Clone()(ds.pathobj, arg[-1])
Member:

BTW -- to be a more "transparent cache" instance, shouldn't it then

  • clone so that the local cached "origin" is named differently from the "origin" remote (e.g. local-cached)
  • add an "origin" remote which would be the original URL?

The only possible side effect I see: it would need to fetch that "origin" remote, so it might prevent testing "offline" - but so far I think the entire approach is for "online" testing anyway
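That remote layout could be set up as follows. This is a sketch of the suggestion, not of what the PR implements; `clone_with_cache_remote` is a hypothetical helper:

```python
import subprocess
from pathlib import Path

def clone_with_cache_remote(cache_path: Path, original_url: str, dest: Path) -> None:
    """Hypothetical helper: clone from the cache under a remote named
    'local-cached', then add the original URL as 'origin' so the clone
    looks as if it came from the real source."""
    # `git clone --origin NAME` names the remote NAME instead of 'origin'
    subprocess.run(
        ["git", "clone", "--origin", "local-cached", str(cache_path), str(dest)],
        check=True, capture_output=True)
    subprocess.run(
        ["git", "-C", str(dest), "remote", "add", "origin", original_url],
        check=True)
```

The resulting clone has both remotes, which is exactly the property the following comments debate: tests could then tell the cache apart from the real source, at the cost of transparency.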

Member:

Also, that would probably undermine the benefits for some test use cases which might want to request data specifically from "origin".

Member Author:

add "origin" remote which would be the original URL?
the only possible side-effect I see could be -- it would need to fetch that "origin" remote so might prevent testing "offline" but so far I think the entire approach is for "online" testing

Quite the opposite! With that approach a test would need to be aware of the cache in order to make any use of it - that's the exact opposite of transparent. With the implementation here, the test always talks to "origin", not knowing whether or not it's cached. That is the point! Also: I couldn't disable caching without side effects on the tests if they explicitly talked to a cache remote.

@kyleam (Contributor) left a comment

Thanks, this looks very useful.

Several resolved review threads on datalad/tests/test_utils_cached_dataset.py and datalad/tests/utils_cached_dataset.py (mostly outdated)
@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch from d03c6f0 to 01dd84c Compare June 9, 2020 09:00
@bpoldrack (Member Author):

Okay, per today's call: will squash, have a proper issue for further development and merge ...

This introduces two test decorators for caching remote datasets for
tests. The caching location defaults to the user-level cache dir for
datalad under tests/ (GNU/Linux: $HOME/.cache/datalad/tests) and can be
configured via 'datalad.tests.cache'. Not specifying this disables
caching altogether.
The intent is a transparent cache, so decorated tests do not know
whether or not caching is enabled and must not depend on it. The dataset
in the cache is a drop-in replacement for the one at the specified URL.
@yarikoptic (Member):

Per discussion, squash, and greens of CI - merging! Thanks @bpoldrack!
