Fix NaNs (Closes #86) #103

Closed
wants to merge 8 commits

Conversation

henrifroese
Collaborator

Implement dealing with np.nan, closes #86

Every function in the library now handles NaNs correctly.

Implemented through decorator @handle_nans in new file _helper.py.

Tests added in test_nan.py

As we went through the whole library anyway, the argument "input" was renamed to "s" in some functions to bring them in line with the others.
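For illustration, this is the behaviour every function should now have (sketched here with lowercase from preprocessing as an example; output abbreviated):

>>> import numpy as np
>>> import pandas as pd
>>> from texthero import preprocessing
>>> s = pd.Series(["Test One", np.nan, "Test Two"])
>>> preprocessing.lowercase(s)
0    test one
1         NaN
2    test two
dtype: object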

Maximilian Krahn and others added 6 commits July 16, 2020 23:12
added the nan test file


Co-authored-by: Henri Froese <[email protected]>
fixed NaN issues in file preprocessing.py


Co-authored-by: Henri Froese <[email protected]>
Every function in the library now handles NaNs correctly.


Co-authored-by: Maximilian Krahn <[email protected]>
@jbesomi
Owner

jbesomi commented Jul 17, 2020

Wow! That's cool stuff! 👍

The part I'm confused about is: why do we need to decorate any function?

Given the example in handle_nans:

import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan])
s.str.replace("a", "b")

gives the exact same result as

s = pd.Series(["a", np.nan])

@handle_nans
def replace_a_with_b(s):
    return s.str.replace("a", "b")

replace_a_with_b(s)

But the first solution is faster (as there is no copy), cleaner, and shorter.

In which functions do we really need such a decorator? I would say we should use it only in extreme cases (as it makes a copy ...) or, even better, adjust those functions so that the decorator is not necessary at all.

You might find this interesting: by doing some tests, I found that np.nan and pd.NA are not exactly the same thing. If you want to know more, see: API: distinguish NA vs NaN in floating dtypes
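A quick illustration of the difference (and of why pd.isna is the safe check for both):

>>> import numpy as np
>>> import pandas as pd
>>> np.nan == np.nan
False
>>> pd.NA == pd.NA
<NA>
>>> pd.isna(np.nan), pd.isna(pd.NA)
(True, True)
>>> pd.Series([1, np.nan]).dtype     # np.nan forces a float dtype
dtype('float64')
>>> pd.Series([1, pd.NA], dtype="Int64").dtype     # pd.NA works with nullable dtypes
Int64Dtype()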

Many preprocessing functions now manually handle NaNs.

Tests also for pd.NA

Also removed a comma in representation.py ;)
@henrifroese
Collaborator Author

Thanks, we really did not know that the pandas s.str. ... methods already take care of NaNs 🤦 -> we have now removed it from most of the functions in preprocessing 😰. It's now only there where we really need it.

Concerning the performance: we'll have to make a copy at some point anyway if we want to pass the non-NaN values to a function (like apply, or a Vectorizer from sklearn, rather than s.str. ..., which handles all NaNs for us), as this is the behaviour:

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(["test", np.nan])
>>> def f(x):
...        # If we want to keep the NaNs, we'll have to do this.
...        x[~x.isna()] = x[~x.isna()].str.replace("t", "i")
...        return x
>>> f(s)
0    iesi
1     NaN
dtype: object
>>> # Function call without copy changes the argument!
>>> s
0    iesi
1     NaN
dtype: object

We looked at many methods to do this with one less copy but could not find any (without manipulating the input). Additionally, with e.g. PhrasesTransformer (and PCA etc.) we get an array of new values anyway that uses up the same amount of memory.

So we will need a second Series with only the non-NaNs anyway for some functions. For example, in tokenize_with_phrases, when we use the PhrasesTransformer we need to pass it the input Series without NaNs, but in the result we want to re-insert the NaNs, and as shown above we can't use s[~s.isna()] = ... as it would change the input itself.

So we believe our decorator is as performant as possible if we want to keep the NaNs. It also frees developers later on from having to think about NaNs.
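To make the idea concrete, here is a stripped-down sketch of what the decorator does (not the exact code in _helper.py, which also covers more edge cases):

import numpy as np
import pandas as pd

def handle_nans_sketch(func):
    # Simplified idea: call the wrapped function on the non-NaN values only,
    # then re-insert missing values at their original positions.
    def wrapper(s, *args, **kwargs):
        result = func(s.dropna(), *args, **kwargs)
        return result.reindex(s.index)
    return wrapper

@handle_nans_sketch
def replace_t_with_i(s):
    return s.str.replace("t", "i")

replace_t_with_i(pd.Series(["test", np.nan]))
# 0    iesi
# 1     NaN
# dtype: object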

By doing some tests, I found that np.nan and pd.NA are not exactly the same thing

The decorator also catches this, so we're good in this regard 😅

@henrifroese
Collaborator Author

After this, we'll continue working on updating the documentation and should then have finished part 1 of #85 (and "All function to deal with np.nan" #86) 👍

@jbesomi
Owner

jbesomi commented Jul 18, 2020

Thanks! I will look into this in more detail on Monday (leaving now). Sounds cool that you are about to finish with part 1! 🎉

@jbesomi
Owner

jbesomi commented Jul 18, 2020

(Edit)

Very short feedback:

For functions that call apply:

s[~s.isna()] = s[~s.isna()].apply(lambda x: ...)

should do the job (or very similar code, without a new copy and without the need for the decorator). Am I missing something?

regards,

@henrifroese
Collaborator Author

henrifroese commented Jul 18, 2020

I'll try to formulate the problem a bit more clearly (we probably spent a little too much time on this over the last few days; at one point we dug into the pandas source code to figure out how s.str. ... handles the NaNs 🥴).

We have NaNs in the input, e.g. s = pd.Series(["test", np.nan, pd.NA, "test2"]). We now want to apply some function to the input and keep the original NaNs where they were. There are basically 3 cases:

  1. The function is implemented with functions that already don't touch the NaNs, e.g. s.str.replace(...). We don't have to do anything in this case ✔️
  2. The function at some point loops through the Series (e.g. stem loops through with apply). We can then just handle the NaNs manually in the loop with e.g. if text is np.nan or text is pd.NA (we implemented this e.g. in stem), along the lines of
s = s.apply(lambda x: x.upper() if x is not pd.NA and x is not np.nan else x)

We cannot do s = s[~s.isna()].apply(lambda x: ...) as we would lose the original NaNs (cells that were NaN in s originally would now be lost completely).

We cannot do s[~s.isna()] = s[~s.isna()].apply(lambda x: ...) (which seems great as it only changes the non-NaNs and keeps the original values) as doing this inside a function changes the original Series.

  3. The function at some point needs all the (non-NaN) values at once to do something (e.g. use a TfidfVectorizer). We have the same problems described in 2., but we can't just check in a loop whether something is a NaN and handle it there. These are the options:
from sklearn.feature_extraction.text import CountVectorizer

tf = CountVectorizer(
    max_features=max_features, tokenizer=lambda x: x, preprocessor=lambda x: x
)

# Option 1 -> changes the original Series
s[~s.isna()] = pd.Series(
    tf.fit_transform(s[~s.isna()]).toarray().tolist(), index=s[~s.isna()].index
)

# Option 2 -> loses the original NaNs from the input
s = pd.Series(
    tf.fit_transform(s[~s.isna()]).toarray().tolist(), index=s[~s.isna()].index
)

# Option 3: the best we found; it's basically what the decorator does
# (if one wrote it into the function directly).
# As you can see, it's just Option 1, but it no longer changes the input.
s_result = s.copy()
s_result[~s_result.isna()] = pd.Series(
    tf.fit_transform(s_result[~s_result.isna()]).toarray().tolist(),
    index=s_result[~s_result.isna()].index,
)

@jbesomi
Owner

jbesomi commented Jul 19, 2020

Hey @henrifroese!

Thank you for your detailed answer. You are doing such a great job, really! I know you are right; I'm just asking because I would like to really understand the code and the reasons behind some choices. In about a month you will probably be less available, and I want to make sure I have a full understanding of the codebase and can argue for all the choices we made! ;)

Now the reason why s[~s.isna()] = s[~s.isna()].apply(lambda x: ...) (1) does not work is much clearer. Thank you!

Some minor questions/feedback:

  1. Why is tfidf commented out in the tests?
  2. Why does count_sentence now return an object Series instead of an int Series? As the output values are either NaN or int, dtype=int64 would be more natural.
  3. Is it correct that remove_diacritics has the .astype("unicode")? This question does not have much to do with this PR. Depending on your answer, I might need to open another issue to solve/check that.
  4. Renaming all input to s is clearly a good idea. Next time, this kind of change should preferably be done in a separate PR. The simple reason is that the moment we merge this PR, the other pull requests will probably have conflicts.
  5. handle_nans
    1. The docstring says "and in pca there is no return", is that right?
    2. As discussed, Examples should probably show an example of a function that makes use of apply.
    3. As I had the question of why not simply use (1), other contributors will have it too. I think we should add more information to the docstring to make the purpose of this wrapper really clear.
  6. @wrapt.decorator: do we really need to use this external library? Can we not simply use the wrapper from the stdlib functools? (Adding dependencies is great only when it is really necessary.)

Thank you again for such a great job!! 🎉 🎉

@henrifroese
Collaborator Author

Why is tfidf commented out in the tests?

I have no idea! 🤦 I uncommented it.

Why does count_sentence now return an object Series instead of an int Series? As the output values are either NaN or int, dtype=int64 would be more natural.

Yes, that's a little annoying; we can solve this by doing s_return = s_return.astype(output[0].dtype) to get the result's type correct. We'll think about this some more.
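For context, a plain int64 Series can't hold NaN at all, so one option we might look at is the nullable Int64 dtype:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series([3, np.nan]).dtype     # NaN forces float64 (or object)
dtype('float64')
>>> pd.Series([3, pd.NA], dtype="Int64")
0       3
1    <NA>
dtype: Int64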

Is it correct that remove_diacritics has the .astype("unicode")? This question does not have much to do with this PR. Depending on your answer, I might need to open another issue to solve/check that.

We did not change this, I believe (or maybe I'm missing something?), so if there's something wrong, maybe a new Issue would be good.

handle_nans: the docstring says "and in pca there is no return", is that right?

That's weird, we'll think about what we meant there again, sorry about that!

As discussed, Examples should probably show an example of a function that makes use of apply

I'm not sure what you mean by this, sorry!

As I had the question of why not simply use (1), other contributors will have it too. I think we should add more information to the docstring to make the purpose of this wrapper really clear.

Yes, will do!

@wrapt.decorator: do we really need to use this external library? Can we not simply use the wrapper from the stdlib functools? (Adding dependencies is great only when it is really necessary.)

Will check if that's possible!
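If all we need from wrapt is that the decorated functions keep their name and docstring, functools should be enough, roughly (NaN handling omitted in this sketch):

>>> import functools
>>> def handle_nans(func):
...     @functools.wraps(func)   # preserves __name__ and __doc__ of the wrapped function
...     def wrapper(s, *args, **kwargs):
...         return func(s, *args, **kwargs)
...     return wrapper
>>> @handle_nans
... def stem(s):
...     """Stem each document in the Series s."""
...     return s
>>> stem.__name__, stem.__doc__
('stem', 'Stem each document in the Series s.')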

@henrifroese
Collaborator Author

As discussed, closing this as we'll do this differently in separate PRs.

@jbesomi
Owner

jbesomi commented Jul 27, 2020

Yes, see #123
