I have a huge number of duplicates in my data and I'm wondering whether it is fine to remove them. I'm aware that removing duplicates changes the distribution of the data, and that most machine learning models would not work well with that. My data is huge, so anything that makes CorEx faster (such as duplicate removal) would be of much benefit.
So, is it fine to remove duplicates? If yes, what are the adverse effects?
Yes, I think you should remove duplicates. CorEx looks for clusters of variables with high mutual information. Duplicate columns have the highest mutual information possible, so they will dominate the signal and possibly wash out more interesting relationships.
I would look at it this way: duplicates reflect something artificial about the data processing, and by taking them out we can discover the intrinsic relationships in the data.
Another way to look at this is hierarchically. CorEx will lump together duplicates in the first layer and associate these duplicate columns with a single factor. Then you might be able to find weaker relationships between the factor representing the duplicates and other factors in the second layer.
By taking out the duplicates, you are essentially just adding a layer 0 that manually extracts this known and uninteresting source of dependence.
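For reference, here is a minimal sketch of that manual "layer 0" step. It assumes that "duplicates" means exactly identical columns (variables), as discussed above, and that the data fits in memory as a 2-D NumPy array with samples in rows. The function name `drop_duplicate_columns` is made up for this illustration and is not part of CorEx.

```python
import numpy as np

def drop_duplicate_columns(X):
    """Remove exactly duplicated columns from a 2-D array.

    Hypothetical helper, not part of CorEx.
    Returns (X_unique, kept, groups):
      X_unique: X restricted to the first occurrence of each distinct column
      kept:     original indices of the retained columns, in original order
      groups:   dict mapping each kept index to all original column indices
                that are identical to it (including itself)
    """
    X = np.asarray(X)
    # np.unique along axis=1 treats each column as one item.
    _, first_idx, inverse = np.unique(
        X, axis=1, return_index=True, return_inverse=True
    )
    # Flatten defensively: the shape of `inverse` differs across NumPy versions.
    inverse = np.asarray(inverse).ravel()
    # np.unique sorts the columns; sort the indices to restore original order.
    kept = np.sort(first_idx)
    groups = {
        int(first_idx[u]): np.flatnonzero(inverse == u).tolist()
        for u in range(len(first_idx))
    }
    return X[:, kept], kept, groups
```

The reduced array `X_unique` is what you would then pass to CorEx. Keeping `groups` around lets you map any factor found on a retained column back to all of the original columns it stands for.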