Scipy lambda layer for 3.9 and 3.10 #360
If you're asking about the official AWS layer, I don't really know. We can try to add SciPy for 3.10 here, but we may run into the size limit (in MB), which is a hard limit that can't be worked around. |
SciPy wheels are roughly 30-40 MB in size lately: https://github.com/scipy/scipy/releases/tag/v1.11.4 . Does that seem like too much? I would like to see if I can help out with this issue. As a regular SciPy contributor, I am familiar with the scipy tooling, and I use Lambda at my day job, but I am pretty new to Lambda layer creation. Do you have old scripts for SciPy still lying around? |
If someone would make a pull request to add these packages, then I'll merge them and automatically build :) |
I tried building SciPy for Lambda, but currently its size exceeds what Lambda accepts. Lambda has a limit of 50 MB, and SciPy's size is above that (~57 MB). Note this is the result of a pip install scipy ... which includes not just SciPy but numpy as well. I will see if we can remove the cache files to reduce the size, but at the moment this is the size :( |
I will experiment with removing the pycache directories, versus keeping them, to see what happens. |
IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload. You can significantly cut down the size by deleting all the tests directories. It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now. |
Thanks, I'll check and see if it's possible. But there's a lot of bespoke effort involved that may be unsustainable. The Lambda limit is 50 MB zipped, and currently the total zipped size is bigger than that :(. |
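Both limits are easy to check locally before attempting an upload; a small sketch, assuming the packages were installed into a python/ directory as in the build script later in this thread:
# size of the unpacked layer contents (compare against the 250 MB unzipped limit)
du -sh python
# size of the zipped artifact (compare against the ~50 MB direct-upload limit)
zip -r9 -q layer.zip python
du -h layer.zip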
I am also interested in a scipy layer for 3.10+, and can't find a workaround for the size limit. I am not sure if you already do this, but running something like … |
Friendly ping: was there any progress here? For the custom removal of code, is it possible to automatically inject such package-specific code into the whole Terraform build script? |
If someone could modify the build function, that'd be much appreciated :). I think for now we can remove all pycache files to save space; that may help. |
Not sure if this would help at all, but this saved a lot of space when building the layer: … |
Tested and it works. I also added … This is some of the most hilarious black magic I've ever seen. |
Wow. I need to find some way to automate this. What does --no-compile do? |
@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: numpy/numpy#26289 (comment). I'm planning to do the same for SciPy. It would come down to adding …
That together should make all this a one-liner. It should work for NumPy now. |
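For context, a rough sketch of what that one-liner could look like for a from-source build with meson-python; the config-settings syntax (install-args) and the tag names below are assumptions based on the linked NumPy discussion, not verified here:
# build numpy from source and skip test files via Meson install tags
# (assumed flags; needs pip >= 23.1 for -C/--config-settings and a working compiler toolchain)
pip install numpy --no-binary numpy --no-compile \
    -Cinstall-args="--tags=runtime,python-runtime" \
    -t python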
It will not precompile Python code into bytecode during the install process. But the test suites are what consume most of the megabytes; the bytecode takes just a few megabytes at most. My approach is summed up in this script:
# install CPython (cp) wheels only, without bytecode
pip install numpy pandas scipy --no-compile --implementation cp -t python
# remove all dist-info (pip installs it under python/)
rm -r python/*.dist-info
# delete all tests directories
find . | grep -E "/tests$" | xargs rm -rf
# clean up python byte code if any
find . | grep -E "(/__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf
# also delete pyproject.toml files, since they are not needed at runtime
find . | grep -E "pyproject\.toml$" | xargs rm -rf
# delete unused .dat files, which are deprecated since scipy 1.10
find . | grep -E "scipy/misc/.*\.dat$" | xargs rm -rf
Btw, I think modifying the bundled source code is not a good practice though. |
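To go from the cleaned-up python/ directory to an actual layer, the remaining steps are just zipping and publishing; a minimal sketch using the AWS CLI (the layer name, runtime, and region below are placeholders, not the ones Klayers uses):
# zip the python/ directory; Lambda expects this top-level folder name for Python layers
zip -r9 -q scipy-layer.zip python
# publish a new layer version from the archive (adjust name/runtime/region to your account)
aws lambda publish-layer-version \
    --layer-name my-scipy-layer \
    --zip-file fileb://scipy-layer.zip \
    --compatible-runtimes python3.10 \
    --region ap-southeast-1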
I don't get why numpy and scipy have their test suites in the wheel, when they don't contribute anything at runtime. I thought it was the sanity check in every … |
@aperture147 that's historical. Once upon a time, many more users built from source, and back then it was critical to be able to run tests with … For new projects started today, the test suite usually goes outside of the importable tree. Moving the test suite out of the importable tree in numpy now, though, would be very disruptive, as it would (among other things) make all open PRs unmerge-able. |
Thank you so much, all. I'll look into this in the next week or so, and hopefully we can get a scipy layer out!!! I'm not sure how much of this is generic (can be applied to all packages) and how much is specific to scipy, though. Will have to think a bit more. |
AFAIK, SciPy and NumPy are safe to have … |
Test layer is here: arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1 We pass the --no-compile flag to avoid the .pyc and pycache files, and also delete all directories marked 'tests', as recommended by experts on this thread :) Feel free to run some tests on the layer. If all goes well, I'll push this into production before the end of this week, and we'll have 'optimized' builds going forward. |
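For anyone who wants to try it, one way to smoke-test the layer without writing any infrastructure is to attach it to an existing throwaway function and invoke it; FUNCTION_NAME below is a placeholder, and the function body only needs to import scipy and return scipy.__version__:
# attach the test layer to an existing Python 3.12 function
# (note: --layers replaces the function's current layer list)
aws lambda update-function-configuration \
    --function-name FUNCTION_NAME \
    --layers arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1 \
    --region ap-southeast-1
# invoke it and check that the import succeeded
aws lambda invoke --function-name FUNCTION_NAME --region ap-southeast-1 out.json
cat out.json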
I forgot to remove .dat and dist-info as well. That's up next. |
I'll need to write docs for it, but this command will already remove test data as well as some large-ish …
It's available in SciPy's …
You probably want to keep the dist-info, though. |
Thanks. Unfortunately, I do not build the package from source; I merely pip install. I'll take your comment on keeping the dist-info, but I'll see if I can identify any _test_xxx.so files to be removed as well. |
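A quick way to see what would be affected before deleting anything; the _test*.so pattern is just the naming guess from the comment above, so review the output rather than piping it straight into rm:
# list candidate test-only extension modules inside the installed packages
find python -name "_test*.so" -print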
I think this is the full list: … |
Thanks -- the challenge for Klayers, at least, is that we need to make the script generic. I'm very hesitant to include package-specific build steps for something like scipy, because maintaining that going forward would be difficult. Although it sounds OK, deleting every file that matches _test*.so might cause issues with other packages, but I would say the probability that someone has a runtime-required .so file that begins with _test is very low. Still pondering. Wonder what others are thinking. |
Yep, this would be a nightmare to maintain in the long run. I would be interested to test it out on a fork of this repo though without making a PR to your main repo. Any chance we can make that work? |
You could try adding a specific script for a specific library, like adding a file called … I noticed that numpy and scipy each seem to bundle their own copy of OpenBLAS; could one of them be removed or shared? |
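A sketch of how such a per-package hook could be wired into a generic build script; the hooks/ directory and scipy.sh file names are hypothetical, purely to illustrate keeping package-specific cleanup separate from the generic steps:
# in the generic build script, after installing PACKAGE into python/
PACKAGE="scipy"
if [ -f "hooks/${PACKAGE}.sh" ]; then
    # optional package-specific cleanup lives in its own file, e.g. hooks/scipy.sh
    bash "hooks/${PACKAGE}.sh" python
fi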
I've noticed that stripping the Python bytecode increases cold start time. Should we keep the .pyc files to reduce cold start time, or is it just me fiddling too much with the layer? |
Not really when building the layer from wheels published to PyPI. NumPy uses 64-bit (ILP64) OpenBLAS, while SciPy uses 32-bit (LP64). We have a long-term plan to unify these two builds, but PyPI/wheels make this very complex. I would not recommend doing manual surgery here. |
Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure. |
Normally it only takes about 500 ms to 1 s to warm up the Lambda, but now it takes 2 s+ (sometimes up to 5 s+ if I import all of numpy, scipy, and pandas) to spin it up (tested on a 1024 MiB RAM Python 3.10 Lambda function). Is it a bytecode-compilation problem, or is it just me doing too much surgery on the layer? |
No, it's probably bytecode compilation. Let me think about this a bit more. Bytecode is specific to the Python version, so it can be shared across functions on the same runtime, but it won't carry over if the runtime is upgraded. Bytecode also takes space, so we have to trade off space considerations against speed considerations. Nothing will work for everyone -- so my thoughts are to remove bytecode only if the package is large. |
I love this conversation. I did a test today using just … The findings: with …, which suggests a ~50 ms time penalty for compiling from .py into .pyc. I think unless the package is huge (numpy is quite big already) you won't see any discernible performance gain, and if you tweak the Lambda settings, like memory size, that difference would shrink even further. Given this, if you're importing something like boto3 or requests, the difference is so small nobody will notice whether the cache is included or not. For the larger packages like numpy and scipy, most (not all) users will want to optimize for space, so that their own code or additional layers can be larger. Defaulting to removing pycache seems like a logical decision. So right now, we will remove .pyc files from all layers moving forward. Again, this won't meet 100% of everyone's requirements, but it will meet the majority of users' needs the majority of the time. Let me know your thoughts below. Does that mean I can remove the need for separate packages for different versions of Python??? Interesting....!! |
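For anyone who wants to reproduce the compile-cost side of this locally rather than via Lambda's reported init duration, CPython's -X importtime flag gives a per-import breakdown; a rough sketch, assuming the layer contents sit in a local python/ directory (local numbers won't match Lambda cold starts exactly):
# start from a tree without bytecode, as the optimized layer would ship
find python -type d -name "__pycache__" -prune -exec rm -rf {} +
export PYTHONPATH=python
# first import has to compile .py files to bytecode on the fly
python -X importtime -c "import numpy" 2> no_cache.log
# second import reuses the freshly written __pycache__, so the delta approximates compile cost
python -X importtime -c "import numpy" 2> cached.log
# the last line of each log shows the cumulative time for the top-level import
tail -n 1 no_cache.log cached.log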
Since the current AWS Lambda layers don't support scipy on 3.9 and above, it would be great if we could create an ARN for scipy as well. Does anyone know when there will be an AWS layer for scipy for Python 3.9 and 3.10?
I have tried creating a custom layer for scipy that supports 3.9 or 3.10; however, it always gives a C-extension error or says that the scipy module is broken when I try to create it from the Cloud9 IDE without numpy and then upload it back to Lambda. Moreover, it is not possible to add scipy from Cloud9 either, because it is above the MB limit that Lambda can handle (the only way is to delete the numpy directories, after which scipy can be successfully installed to Lambda without any errors).
I would really appreciate it if anyone knows when AWS will provide a layer, just like it did for 3.7 and 3.8.