Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid running apt-cache policy on every import #4690

Closed
yarikoptic opened this issue Jul 11, 2020 · 4 comments
Closed

Avoid running apt-cache policy on every import #4690

yarikoptic opened this issue Jul 11, 2020 · 4 comments
Labels
performance Improve performance of an existing feature

Comments

@yarikoptic
Copy link
Member

$> strace -f python -c 'import os' 2>&1 | grep -v ENOENT | grep execv.*apt-cache     

$> strace -f python -c 'import datalad' 2>&1 | grep -v ENOENT | grep execv.*apt-cache
[pid 1637620] execve("/usr/bin/apt-cache", ["apt-cache", "policy"], 0x227f5c0 /* 107 vars */ <unfinished ...>

I guess through some module/checks. That ends up quite a number of invocations when we test our base external remotes. E.g. while testing datalad-crawler:

$> grep 'exec.*\(apt-cache.*\)' strace.log | grep -v ENOENT | nl | tail        
...
   196	1626634 22:16:14.871171 execve("/usr/bin/apt-cache", ["apt-cache", "policy"], 0x19dc270 /* 128 vars */ <unfinished ...>
   197	1626648 22:16:16.521609 execve("/usr/bin/apt-cache", ["apt-cache", "policy"], 0x2079270 /* 128 vars */ <unfinished ...>

which is more less in agreement with number of special remote invocations:

$> grep 'exec.*\(git-annex-remote-data.*\)' strace.log | grep -v ENOENT | nl | tail -n 2
   177	1626631 22:16:14.714512 execve("/home/yoh/proj/datalad/datalad-crawler/venvs/dev3/bin/git-annex-remote-datalad-archives", ["/home/yoh/proj/datalad/datalad-c"...], 0x42000e6af8 /* 127 vars */ <unfinished ...>
   178	1626645 22:16:16.377021 execve("/home/yoh/proj/datalad/datalad-crawler/venvs/dev3/bin/git-annex-remote-datalad-archives", ["/home/yoh/proj/datalad/datalad-c"...], 0x42000e1af8 /* 127 vars */ <unfinished ...>

given that each invocation is about 200-300ms on my laptop:

$> time apt-cache policy 2>&1 | tail -n 2
     origin deb.debian.org
Pinned packages:
noglob apt-cache policy 2>&1  0.20s user 0.04s system 99% cpu 0.238 total
tail -n 2  0.00s user 0.00s system 1% cpu 0.237 total

that is at least 20 seconds of tests run time (unlikely it is some async) contributing (and this is just datalad-crawler which tests are relatively speedy .

altogether running the datalad-crawler tests results in over 100k invocations of external tools

$> grep 'execve' strace.log | grep -v ENOENT | nl | tail -n 2  
112096	1628529 22:16:25.497272 execve("/usr/lib/git-annex.linux/bin/git", ["git", "--git-dir=", "config", "-z", "-l", "--show-origin"], 0x55c3a37a1038 /* 123 vars */) = 0
112097	1628529 22:16:25.498675 execve("/usr/lib/git-annex.linux/exe/git", ["/usr/lib/git-annex.linux/exe/git", "--library-path", "/usr/lib/git-annex.linux//usr/li"..., "/usr/lib/git-annex.linux/shimmed"..., "--git-dir=", "config", "-z", "-l", "--show-origin"], 0x55c0a5f22ee8 /* 124 vars */) = 0

I hope that identifying the source of some of such invocations and adding some caching, where possible (our direct invocation), could help to at least partially mitigate them. May be https://github.com/con/pyfscacher (see con/fscacher#1) whenever it is born could be of use.

@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Jul 11, 2020
@yarikoptic
Copy link
Member Author

FWIW, that apt-cache invocation is via

yoh      1639869  0.0  0.0  23392 18244 pts/19   S+   11:17   0:00         python -c import datalad
yoh      1639871  0.0  0.0  14616 11036 pts/19   S+   11:17   0:00           /usr/bin/python3 -Es /usr/bin/lsb_release -a
yoh      1639872  0.0  0.0   5564  3392 pts/19   S+   11:17   0:00             /bin/bash /tmp/bbbb/apt-cache policy

and it is coming from our datalad.utils

def get_linux_distribution():
    """Compatibility wrapper for {platform,distro}.linux_distribution().
    """
    if hasattr(platform, "linux_distribution"):
        # Use deprecated (but faster) method if it's available.
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=DeprecationWarning)
            result = platform.linux_distribution()
    else:
        import distro  # We require this for Python 3.8 and above.
        result = distro.linux_distribution(full_distribution_name=False)
    return result


try:
    linux_distribution_name, linux_distribution_release \
        = get_linux_distribution()[:2]
    on_debian_wheezy = on_linux \
                       and linux_distribution_name == 'debian' \
                       and linux_distribution_release.startswith('7.')
except:  # pragma: no cover
    # MIH: IndexError?
    on_debian_wheezy = False
    linux_distribution_name = linux_distribution_release = None

since unlikely the HOME is shared across radically different OSes (and wheezy is in the past, and I do not even see on_debian_wheezy used any longer) - we could simplify and make OS checks cached.
Also external_versions structure could probably be worked out to persist at least a bit (e.g. for seconds) across datalad invocations -- should probably provide an additional boost

@yarikoptic yarikoptic changed the title FOI: "we" run even apt-cache policy Avoid running apt-cache policy on every import Jul 11, 2020
@yarikoptic
Copy link
Member Author

just confirming that it is indeed about 200ms for the entire distro import/invocation

$> python -c 'from time import time; t0=time(); import distro; distro.linux_distribution(full_distribution_name=False); print(time()-t0)'
0.20607256889343262

@yarikoptic
Copy link
Member Author

and we have missed when our overall import time has grown twice

$> time python -c 'import datalad'
python -c 'import datalad'  0.30s user 0.08s system 100% cpu 0.384 total

# commented out use of distro
$> time python -c 'import datalad'
python -c 'import datalad'  0.15s user 0.04s system 100% cpu 0.190 total

@yarikoptic
Copy link
Member Author

yarikoptic commented Jul 12, 2020

Quick note: A backward compatible fix for distro necessity could be to rely on what discovered from stack overflow:

Note to future readers: Since Python 3.7 (8 years after this was asked) this is now possible with module level __getattr__. – jedwards Feb 10 '19 at 12:56

since distro module is needed iirc only since 3.8 . So not used until asked for, but yet to check if check if unavoidable for typical cases...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improve performance of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant