-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix NaNs (Closes #86) #103
Conversation
added the nan test file Co-authored-by: Henri Froese <[email protected]>
fixed NaN issues in file preprocessing.py Co-authored-by: Henri Froese <[email protected]>
Every function in the library now handles NaNs correctly. Implemented through decorator @handle_nans in new file _helper.py. Tests added in test_nan.py As we went through the whole library anyways, argument "input" was renamed to "s" in some functions to be in line with the others. Co-authored-by: Maximilian Krahn <[email protected]>
Wow! That's cool stuff! 👍 The part I'm confused with is: why do we need to decorate any function? Given the example in
gives the exact same result as
But the first solution is faster (as there is no copy), cleaner and shorter. In which functions do we really need to use such a decorator? I would say we should use it only in extreme cases (as it makes a copy ...) or even better adjust such functions so that the decorator is not even necessary. You might find this interesting. By doing some tests, I found that |
Many preprocessing functions now manually handle nans. Tests also for pd.NA Also removed a comma in representation.py ;)
Thanks, we really did not know that Concerning the performance: we'll have to do a copy at some point anyway if we want to pass the non-nan values to a function (like apply, or a Vectorizer from sklearn, and not s.str.... which handles all nans for us) as this is the behaviour:
We looked at many methods to do this with one less copy but could not find any (without manipulating the input). Additionally, with e.g. PhrasesTransformer (and PCA etc.) we get an array of new values anyway that uses up the same amount of memory. So we will need a second Series with only the non-NaNs anyway for some functions. For example in So we believe our decorator is as performant as we can be if we want to keep the NaNs. It also frees developers later on from thinking about NaNs.
It also catches this so we're good in this regard 😅 |
Thanks! I will look into this in more detail on Monday (leaving now). Sounds cool that you are about to finish with part 1! 🎉 |
(Edit) Very short feedback: For functions that call
should do the job (or very similar code, without a new copy, and without the need of the decorator), I'm missing something? regards, |
I'll try to formulate the problem a bit clearer (we probably spent a little too much time on this the last days, at one point we dug into the Pandas source code to figure out how We have NaNs in the input, e.g.
We cannot do We cannot do
|
Hey @henrifroese! Thank you for your detailed answer. You are doing such a great job, really! I know you are right, I'm just asking since I would like to really understand the code and the reasons behind some choices. In about 1 month you will be probably less available, and I want to make sure I can have a full understanding of the codebase as well as argue all choices we made! ;) Now the reason why Some minor question/feedback:
Thank you again for such a great job!! 🎉 🎉 |
I have no idea! 🤦 I uncommented it.
Yes, that's a little annoying, we can solve this by doing
We did not change this I believe (or maybe I'm missing something?), so if there's something wrong maybe a new Issue would be good.
That's weird, we'll think about what we meant there again, sorry about that!
I'm not sure what you mean by this? Sorry
Yes, will do!
Will check if that's possible! |
As discussed, closing this as we'll do this differently in separate PRs. |
Yes, see #123 |
Implement dealing with np.nan, closes #86
Every function in the library now handles NaNs correctly.
Implemented through decorator @handle_nans in new file _helper.py.
Tests added in test_nan.py
As we went through the whole library anyways, argument "input" was renamed to "s" in some functions to be in line with the others.