
ENH: dataset cache for tests #4608

Merged 1 commit on Jun 10, 2020

Conversation

@bpoldrack (Member) commented Jun 5, 2020

Inspired by discussion in datalad/datalad-extensions#12 and internal chat.

This allows using remote datasets in tests via a cache, in order to reduce time and network traffic.
The goal is to replace the original remote repository (specified by a URL) with a local clone in a caching location. From the point of view of a test this is essentially transparent: the cached dataset simply replaces the actual remote one. The PR introduces a config variable datalad.tests.cache to specify the caching location or to disable caching entirely. The default location is <user's cache dir>/datalad/tests, which means datasets used by tests this way are persistent across test runs, not only across tests! This should be convenient for ourselves and is irrelevant in CI builds, where the only interesting property is persistence across tests (the entire environment is fresh every time, after all).
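The default-location logic described above can be sketched roughly as follows. This is a stdlib-only sketch assuming XDG conventions; the actual implementation goes through DataLad's config machinery and a platform-aware user cache dir, so the helper name `tests_cache_dir` and the fallback path here are purely illustrative:

```python
import os
from pathlib import Path

def tests_cache_dir(cfg_value=None):
    """Illustrative resolution of the 'datalad.tests.cache' setting.

    cfg_value: the configured value, if any. An empty value disables
    caching (returns None); an explicit path is used as-is; otherwise
    fall back to <user's cache dir>/datalad/tests.
    """
    if cfg_value == "":
        return None  # caching disabled
    if cfg_value:
        return Path(cfg_value)
    # default: user-level cache dir (XDG sketch; real code is platform-aware)
    base = Path(os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache")))
    return base / "datalad" / "tests"
```

Because the default lives in the user's cache dir, the cache survives across test runs on a developer machine, while a CI run simply gets a fresh one.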

This PR brings two test decorators:

  1. cached_dataset:
    This delivers a Dataset to the test that is a temporary clone of the dataset in the cache (or of the actual URL if caching is disabled). That clone intentionally is not persistent - each test gets its own clone. The difference is that not every test using it needs to clone (and get content) via the network again, making it much cheaper (and reducing the chance of being rejected/blacklisted for flooding some services during our CI runs). In addition, this decorator allows specifying paths to get and a version to check out in that clone (of a clone).

  2. cached_url:
    In contrast to the above, this decorator merely delivers a URL, not a clone, to a test. If caching is disabled it passes on the specified URL; otherwise it delivers a file-scheme URL to the instance in the cache. If the test merely needs something to clone from, rather than something to actually operate on, this is the way to go.

Note that there are TODOs in the code regarding enhancements of this concept. If everything apart from that is fine with the rest of @datalad/developers, I'd rather postpone those and get the current version into 0.13.0 in order to solve the immediate use case with datalad-hirni, which therefore needs to depend on a datalad version that actually has those decorators.
Also: those TODOs raise some questions to explore.

@bpoldrack bpoldrack marked this pull request as draft June 5, 2020 12:26
@yarikoptic (Member) left a comment

Nice! Left some comments, largely about the interface for such a fixture/helper

Two resolved review threads on datalad/tests/fixtures.py (outdated)

Note
----
`version` not yet implemented. Intention: verify that `version` can be
Member:

Most likely version should then be included in the cache path, since different tests might want different versions... The alternative is to check out a version for the test, but that would prohibit ever running in parallel

Member Author:

See comment below about the necessity of @datalad_dataset always delivering a clone. I'd prefer to check out a particular version only in the clone that is delivered to the test, while the cached dataset itself only replaces the original source.


ds = Dataset(DATALAD_TESTS_CACHE / dataset_name)
if not ds.is_installed():
ds = clone(url, ds.pathobj)
Member:

ds.clone(url)?

Member Author:

Shouldn't that clone into ds instead? That's not what I want.

@with_tempfile
def newfunc(*arg, **kw):
ds = get_cached_dataset(url, name, version=version)
clone_ds = clone(ds.pathobj, arg[-1])
Member:

In many cases a clone wouldn't be needed - make it an option? That is why I thought to make get optional in the original cached one as well

Member:

Maybe it could also be reckless? Pass clone kwargs?

Member Author:

I think it should always be a clone. We shouldn't operate directly on datasets that are used across tests, and it wouldn't be a cache if we delivered it directly. Getting a cached dataset shouldn't have any side effects compared to getting it directly from that URL (which would give you a clone of the original one). To me the entire point is to be equivalent, from the point of view of a test, to cloning from that URL to a temporary location - except that it becomes cheaper across tests and test runs in terms of network traffic and time. The cache serves just as a drop-in replacement for the original URL. With the suggested env var it could then be disabled, cloning directly from that URL (or using a temporary cache instead of the user's). The env var therefore makes no difference to the test at all, only to its performance.

Member Author:

Re clone kwargs:
I guess that's useful, although I'm starting to worry about having to write tests for all those features and their combinations.

Member Author:

Thinking about it a bit more:
Maybe it should be two decorators (I don't want too many features in one thing):

  1. As described above: replaces the remote dataset specified by URL with a cached one as the origin and delivers a clone thereof. If caching is disabled, delivers a direct clone of the original remote.

  2. Doesn't deliver a clone, but a URL to be used in the test: the original URL if caching is disabled, and a path to the cached dataset (or a file: scheme URL) if enabled.

What do you think about this, @yarikoptic ?

Member Author:

Done. @cached_dataset + @cached_url instead of @datalad_dataset.

@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch 2 times, most recently from da7e56c to b517642 Compare June 5, 2020 15:02
codecov bot commented Jun 5, 2020

Codecov Report

Merging #4608 into master will increase coverage by 39.74%.
The diff coverage is 97.75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4608       +/-   ##
===========================================
+ Coverage   49.82%   89.57%   +39.74%     
===========================================
  Files         286      288        +2     
  Lines       38822    39000      +178     
===========================================
+ Hits        19344    34934    +15590     
+ Misses      19478     4066    -15412     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/tests/utils_cached_dataset.py 92.30% <92.30%> (ø)
datalad/tests/test_utils_cached_dataset.py 100.00% <100.00%> (ø)
datalad/core/local/tests/test_create.py 100.00% <0.00%> (+0.44%) ⬆️
datalad/local/subdatasets.py 95.65% <0.00%> (+0.86%) ⬆️
datalad/core/local/diff.py 95.29% <0.00%> (+1.17%) ⬆️
datalad/support/external_versions.py 96.07% <0.00%> (+1.30%) ⬆️
datalad/dochelpers.py 87.40% <0.00%> (+1.48%) ⬆️
... and 180 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yarikoptic (Member) left a comment
Looks nice! Yet to have a better look - it grew ;-) The file needs to be renamed, though, from the generic fixtures.

Re

See comment below about necessity of @datalad_dataset always delivering a clone

It seems I can't comment on that outdated one, so doing it here: I actually thought this is your use case - using a cached dataset as the input dataset for data conversion in hirni. That is how we use/need them in heudiconv. If such a dataset is used as input, I don't see the need for a clone of a clone with get - it would just double disk consumption and keep re-getting data locally over and over again. Yes, it is safer, but wasteful for those use cases.

Resolved review thread on datalad/tests/fixtures.py (outdated)
if version:
# check whether version is available
assert ds.repo.commit_exists(version)
# TODO: What about an outdated dataset in cache? How do we properly update
Member:

I guess whenever the need comes, we could add an option for that. But that would require a release cycle etc. - maybe now? ;-)

Member Author:

I don't think an option is it. That's again because it is supposed to work as a cache and thereby needs to be transparent to the test. If the cache were skipped, no update would ever be required; if we now need to specify an option on how to update, the usage becomes messy, I think.
I'd rather do something like this:
If the dataset exists in cache, fetch, and just proceed if there was nothing to fetch. If there was something, however, we would need to decide what to do with it (merge isn't necessarily it) or - as the simplest solution - throw the cache away in that case and get a fresh one.
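The "fetch, and refresh on any change" policy could be sketched with plain git like this. `ensure_fresh` is a hypothetical helper (the PR does not implement this), and detecting updates via fetch's stderr output is a heuristic, not a guaranteed git contract:

```python
import shutil
import subprocess
from pathlib import Path

def ensure_fresh(cached_repo: Path, url: str) -> Path:
    """Hypothetical helper: keep a cached clone usable as a stand-in for url.

    Simplest policy from the discussion: fetch, and if anything new
    arrived, throw the cache away and re-clone instead of trying to merge.
    """
    if not cached_repo.exists():
        subprocess.run(["git", "clone", url, str(cached_repo)],
                       check=True, capture_output=True)
        return cached_repo
    fetched = subprocess.run(
        ["git", "-C", str(cached_repo), "fetch", "origin"],
        check=True, capture_output=True, text=True)
    # heuristic: `git fetch` is silent when everything is up to date,
    # and reports ref updates on stderr otherwise
    if fetched.stderr.strip():
        shutil.rmtree(cached_repo)
        subprocess.run(["git", "clone", url, str(cached_repo)],
                       check=True, capture_output=True)
    return cached_repo
```

Discarding and re-cloning is wasteful on updates, but it keeps the cache a pure drop-in for the URL, with no merge semantics for tests to depend on.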

Member:

Good point about transparency! Then update only if the version isn't found locally, or when explicitly requested overall?
We would also benefit from an explicit trigger to reset or ignore the cached version (relates to the "fresh one") - I had to do that in dandi as well (https://github.com/dandi/dandi-cli/blob/master/dandi/support/cache.py#L38), although for a different type of cache (see con/fscacher#1 for a higher-level description/discussion; I forgot to ping you there). I can see how it would be useful here as well.

Member Author:

Will have a look. But a quick remark:

Then update only if the version isn't found locally, or when explicitly requested overall?

Insufficient. "version" could be a branch. A test would "mean" the branch as available from url, but that branch could have been updated - we would then "find" the version in an outdated dataset in cache.

@bpoldrack (Member Author):

If such a dataset is used as input, I don't see the need for a clone of a clone with get - it just would double disk consumption and keep re-getting data locally over and over again

Well, it depends on what the test is going to do with the dataset. Generally I'd go for the safer option. Blowing up all tests simultaneously makes it harder to figure out what's going on - we've had those issues before. But I agree: as a test knows how dangerous it would be, it can now decide to use either of those two decorators. With the second one, no per-test clone is enforced. However, if we want to be able to decide whether or not caching is enabled, we cannot possibly provide a cached dataset directly, since then the test itself would need to be aware of the cache and its status. That kind of contradicts the definition of a cache.

@yarikoptic (Member) commented Jun 6, 2020

I don't think it really contradicts. Just that a test should announce if it needs a clone, e.g. for the purpose of guarantees of absent content or for performing changes to it. We have that in the "good" old with_test_repos as well already. (I think this new helper might provide the core functionality for a refactoring of that elderly beast later on ;-))

@bpoldrack (Member Author):

Just that test should announce if it needs a clone, eg for the purpose of guarantees of absent content or for performing changes to it.

I don't think so. If we delivered the cached dataset itself, it wouldn't be transparent anymore, and the test would rely on something being present in the cache. It shouldn't. Requesting a persistent-across-tests dataset to be locally available is a very different thing than a cache. Of course we could build that almost the same way, but I would at least separate that kind of request from what is implemented here. So: a very similar implementation (different config) and a different decorator. I can do that, but:

  1. I don't have an immediate use case.

  2. I generally don't like to encourage and suggest that kind of test writing. Sure, it's "wasteful" if all the tests work just right, but the point of testing is to ensure it's working right, not to assume it ;)

@yarikoptic (Member):

BTW! We could make the original clone read-only (after get completes, and rw for getting before that)! So the tests which do not request a clone but try to modify it anyhow would fail - a nice test in itself IMHO
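The read-only idea could be sketched with plain permission bits. `set_writable` is a hypothetical helper, not part of the PR, and it only covers worktree files (git-annex separately locks annexed content):

```python
import stat
from pathlib import Path

def set_writable(root: Path, writable: bool) -> None:
    """Hypothetical helper: strip (or restore) write bits under a cached
    clone, so a test that did not request its own clone fails loudly on
    any attempt to modify it."""
    for p in [root, *root.rglob("*")]:
        mode = p.stat().st_mode
        if writable:
            p.chmod(mode | stat.S_IWUSR)
        else:
            p.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

The cache would be made writable around populate/get operations and read-only otherwise, turning accidental modification into an immediate permission error.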

@yarikoptic (Member):

Anyways, that is just my 1c, and even without original clone reuse it would be useful, so it is ok to nevermind me ;-)

@bpoldrack (Member Author):

BTW! We could make the original clone read-only (after get completes, and rw for getting before that)! So the tests which do not request a clone but try to modify it anyhow would fail - a nice test in itself IMHO
Anyways, that is just my 1c, and even without original clone reuse it would be useful, so it is ok to nevermind me ;-)

If combined with the separation I mentioned above (a dedicated decorator requesting a persistent clone rather than an actual transparent cache), I kind of like that idea.

@yarikoptic (Member) left a comment
Left a comment/recommendation about clone location

Resolved review thread on datalad/tests/fixtures.py (outdated)
@bpoldrack (Member Author):

Assuming that tests still pass, I think this is ready to go into 0.13.0 as is.
If we agree on this, I'll make an issue to collect the next pieces as discussed here with @yarikoptic and outlined in the code comments. This would include adding very similar decorators that provide persistent read-only clones (in contrast to the cache, this would not be transparent to the tests). Furthermore, the cache could be more efficient with --bare or --mirror clones, etc., which in turn requires such option(s) for datalad-clone. Recursion is yet another topic to address. Generalization for other uses, such as in heudiconv, or even some kind of datalad-cache feature, needs some evolution and experience in using it, I guess.
However, this is already something very useful, I think; it solves the immediate use case, and most of the possible changes and additions are either independent of or transparent for the decorated tests.

Agreed, @datalad/developers ?

P.S.: If so, I should prob. squash before merge.

@bpoldrack bpoldrack marked this pull request as ready for review June 8, 2020 00:06
@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch 3 times, most recently from 5f55c42 to 824d694 Compare June 8, 2020 04:40
# get_cached_dataset yet, since translation to keys depends on a
# worktree. We'll have the worktree of `version` only after cloning.
ds = get_cached_dataset(url, version=version)
clone_ds = Clone()(ds.pathobj, arg[-1])
Member:

BTW -- to be a more "transparent cache" instance, shouldn't it then

  • clone so that the local cached "origin" is named differently from the "origin" remote (e.g. local-cached)
  • add an "origin" remote which would be the original URL?

The only possible side effect I see: it would need to fetch that "origin" remote, so it might prevent testing "offline" - but so far I think the entire approach is for "online" testing anyway
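That remote layout could be set up as follows. This is a sketch of the suggestion, not of what the PR implements; `clone_with_cache_remote` is a hypothetical helper:

```python
import subprocess
from pathlib import Path

def clone_with_cache_remote(cache_path: Path, original_url: str, dest: Path) -> None:
    """Hypothetical helper: clone from the cache under a remote named
    'local-cached', then add the original URL as 'origin' so the clone
    looks as if it came from the real source."""
    # `git clone --origin NAME` names the remote NAME instead of 'origin'
    subprocess.run(
        ["git", "clone", "--origin", "local-cached", str(cache_path), str(dest)],
        check=True, capture_output=True)
    subprocess.run(
        ["git", "-C", str(dest), "remote", "add", "origin", original_url],
        check=True)
```

The resulting clone has both remotes, which is exactly the property the following comments debate: tests could then tell the cache apart from the real source, at the cost of transparency.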

Member:

Also, that would probably undermine the benefits for some test use cases which might want to request data specifically from "origin".

Member Author:

add "origin" remote which would be the original URL?
the only possible side-effect I see could be -- it would need to fetch that "origin" remote so might prevent testing "offline" but so far I think the entire approach is for "online" testing

Quite the opposite! With that approach a test would need to be aware of the cache in order to make any use of it - that's the exact opposite of transparent. With the implementation here, the test always talks to "origin", not knowing whether or not it's cached. That is the point! Also: I couldn't disable caching without side effects on the tests if they explicitly talked to a cache remote.

@kyleam (Contributor) left a comment

Thanks, this looks very useful.

Several resolved review threads on datalad/tests/test_utils_cached_dataset.py and datalad/tests/utils_cached_dataset.py (mostly outdated)
@bpoldrack bpoldrack force-pushed the enh-cached-test-ds branch from d03c6f0 to 01dd84c Compare June 9, 2020 09:00
@bpoldrack (Member Author):

Okay, per today's call: will squash, have a proper issue for further development and merge ...

This introduces two test decorators for caching remote datasets for
tests. The caching location defaults to the user-level cache dir for
datalad under tests/ (GNU/Linux: $HOME/.cache/datalad/tests) and can be
configured via 'datalad.tests.cache'. Not specifying this disables
caching altogether.
The intent is a transparent cache, so decorated tests do not know
whether or not caching is enabled and must not depend on it. The dataset
in the cache is a drop-in replacement for the one at the specified URL.
@yarikoptic (Member):

Per discussion, squash, and greens of CI - merging! Thanks @bpoldrack!
