-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed parsing from beautifulsoup to lxml #232
base: master
Are you sure you want to change the base?
Conversation
Updated with tests to make sure dataframes return same result |
Below the bench results incl API call for a full year query (in seconds). I would guess about 40% reduction on average. query_day_ahead_prices 32.01 7.8 |
@fboerman are you planning on merging this? Sounds like a great improvement to me. |
@fboerman is there any updates on this PR? I'm also looking forward to seeing it being merged. |
Is there any help needed for this one? It looks like a huge improvement. We are caching years of data, and this would save as a lot of trouble |
hi @carl-lange2 @nhlong2701 @ivandjuraki @jvanelteren I looked at this before and talked to jesse about it (we know each other in real life) and my problem with this is that it requires a fully fledged test suite to be sure that nothing breaks. This change would touch every single thing of the library which is a too large change to do without guaranteeing that nothing break. Also to do myself I dont have enough time right now. If any of you want to work on this, starting with the test suite, we can break it down in parts and slowly merge in changes. I can assist with that. As it stands now I am not confident enough to merge this whole thing in one go (the PR is also outdated atm). @ivandjuraki I dont fully understand your issue. Are you running this library for the same data over and over again? I would advice you to fetch the data once and then save into a database for further processing. |
Hey @fboerman sorry for ghosting, i have a lot of github accounts and i am not even sure which one i use. If this PR gets attention, i can make some time for writing test suites if needed. |
@fboerman I would also be interested in seeing this merged and can provide assistance if needed |
Once #131 lands we could follow up on #232 (comment). |
Beautifulsoup can be quite slow when parsing large xml files. When a user want to get a large amount of data off the API, this can be annoying. For example, getting the generation of 1 year takes ~75 seconds to parse (~35000 rows, 10 production types and actual & aggregated = 700k values).
Therefore we can change to the lxml package, which is quite a bit faster (3 up to 20x, tested on 3 types of queries with queries for a full year, although the generation query above runs in 4 seconds which is 20x). I didn't test on all functions because it's a PITA.
It almost seems to good to be true, and I'm not very familiar with lxml. Some parse functions do some additional checks, and I kept them in, which resulted in a smaller speedup.
Timing the complete function call (including API call) is easier, I did that for all functions, see this notebook, with date ranges of a month. The speedups are much lower, also because certain calls lead to a large amount of API calls due to the @documents decorator and the API is sometimes shaky.
Main changes
Tests
I've tested all functions on the EntsoePandasClient, and all except 4 returned an idential df (tested with df_1.equals(df2). For the other 4 functions, the current package fails:
I've tested with this notebook, scroll down to the bottom for the equality comparison. This file also included the timings. Including the API calls. The tests without API are not document because it's a PITA where the functions in the package itself need to be modified in order to detect when the API call is finished.