Should we remove duplicates? #9

Open
ianchute opened this issue Mar 1, 2019 · 1 comment

Comments


ianchute commented Mar 1, 2019

I have a large number of duplicates in my data, and I'm wondering whether it is fine to remove them. I'm aware that duplicate removal changes the distribution of the data and that most machine learning models would not work well with that. My problem is that my data is huge, so anything that makes CorEx faster (like duplicate removal) would be of much benefit.

So, is it fine to remove duplicates? If yes, what are the adverse effects?

Owner

gregversteeg commented Mar 2, 2019

Yes, I think you should remove duplicates. CorEx looks for clusters of variables with high mutual information. Duplicate columns have the highest mutual information possible, so they will dominate the signal and possibly wash out more interesting relationships.
I would look at it this way: duplicates reflect something artificial about the data processing, and by taking them out we can discover the intrinsic relationships in the data.
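
To make the "highest mutual information possible" point concrete, here is a small sketch that estimates mutual information from empirical counts. It is not CorEx code. The helper names `entropy` and `mutual_information` and the synthetic binary data are assumptions made for illustration. For a discrete column, the mutual information with an exact copy of itself equals the column's entropy, which is the upper bound for any other column.

```python
import numpy as np

def entropy(x):
    """Empirical Shannon entropy (in bits) of a discrete 1-D array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    joint = np.stack([x, y], axis=1)  # one row per sample: (x_i, y_i)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = float(-(p * np.log2(p)).sum())
    return entropy(x) + entropy(y) - h_xy

# Hypothetical data: a binary column, an exact duplicate, and a noisy copy.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.2          # flip roughly 20% of the entries
noisy = np.where(flip, 1 - x, x)

mi_duplicate = mutual_information(x, x.copy())  # equals entropy(x)
mi_noisy = mutual_information(x, noisy)         # strictly smaller
```

Any genuinely informative but imperfect relationship, like the noisy copy here, scores below the duplicate, which is why duplicates can crowd out the weaker signal.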

Another way to look at this is hierarchically. CorEx will lump together duplicates in the first layer and associate these duplicate columns with a single factor. Then you might be able to find weaker relationships between the factor representing the duplicates and other factors in the second layer.
By taking out the duplicates, you are essentially just adding a layer 0 that manually extracts this known and uninteresting source of dependence.
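
A minimal sketch of that manual "layer 0" step, assuming the data is a NumPy array with samples in rows and variables in columns. The function name `drop_duplicate_columns` and the toy matrix are made up for illustration and are not part of CorEx. It only removes exact duplicate columns and keeps track of which columns survive.

```python
import numpy as np

def drop_duplicate_columns(X):
    """Return X without repeated columns, plus the indices of the kept columns."""
    X = np.asarray(X)
    # With axis=1, np.unique treats each column as one item; return_index
    # gives the position of the first occurrence of each distinct column.
    _, first_idx = np.unique(X, axis=1, return_index=True)
    keep = np.sort(first_idx)  # restore the original column order
    return X[:, keep], keep

# Toy example: column 2 is an exact copy of column 0.
X = np.array([[0, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 0, 0, 1]])
X_dedup, kept = drop_duplicate_columns(X)
```

The deduplicated matrix `X_dedup` is what would then be passed on for fitting, and `kept` lets you map the remaining columns back to their original variable names.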
