
Bias term using NCD? #44

Open · Maimonator opened this issue Aug 3, 2023 · 4 comments

@Maimonator

Hey!
I've been playing around with this method and I've noticed something interesting.

When calculating the NCD, you use:

```
ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
```

And Cx1x2, Cx1, and Cx2 are computed using gzip.compress.
The thing is, the output of gzip.compress contains not only the compressed data but also metadata about it: a signature, the compression method, flags, a modification time, a CRC32, and more. Depending on the configuration, this is at least 20 bytes that I think should not be part of the NCD calculation, since they don't represent any actual information about the string that was compressed.
On my WSL (Ubuntu 22.04), the default behavior is 20 bytes of metadata.
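
A quick way to check the overhead on your machine (a sketch; the exact byte count can vary with the Python/zlib build):

```python
import gzip

# A gzip stream is a 10-byte header (magic, method, flags, mtime, XFL, OS),
# the deflate data, and an 8-byte trailer (CRC32 + input size mod 2**32).
# Compressing empty input exposes the fixed overhead (plus an empty deflate block).
print(len(gzip.compress(b"")))  # 20 on my machine (WSL, Ubuntu 22.04)
```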

To be clear, what essentially happens is this:

```
ncd = (Cx1x2 + metadata_size - min(Cx1, Cx2) - metadata_size) / (max(Cx1, Cx2) + metadata_size)
```

Basically, all the distances are calculated with a +20 bias term in the denominator, since the metadata size cancels out in the numerator.
This means every pair of samples is biased towards appearing closer together, and not by the same amount for every pair the NCD is calculated for: the relative effect depends on the samples' lengths.
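
For concreteness, here's a minimal sketch of what I mean, with the constant exposed as a parameter (`denom_bias` is my name for it; I'm assuming the space-joined concatenation from the paper's snippet):

```python
import gzip

def ncd(x1: str, x2: str, denom_bias: int = 0) -> float:
    # Compressed lengths; each one includes the fixed gzip container overhead.
    Cx1 = len(gzip.compress(x1.encode()))
    Cx2 = len(gzip.compress(x2.encode()))
    Cx1x2 = len(gzip.compress((x1 + " " + x2).encode()))
    # The overhead cancels in the numerator but survives in the denominator;
    # denom_bias=-20 is the correction I'm proposing.
    return (Cx1x2 - min(Cx1, Cx2)) / (max(Cx1, Cx2) + denom_bias)
```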

I've tested various values of metadata_size by adding a constant to the denominator, on the AG News dataset.
Note that I specifically used majority vote for accuracy, with different values of k, so the results are not the same as the paper's.

When the denominator constant is 0, the value is the same as in the original implementation of NCD; -20 is what I suspect it should be.

[image: accuracy results for denominator constants 0 and -20]

I also tested this with increments of 20, up to 200:

[image: accuracy results for denominator constants from 0 to 200, in steps of 20]

It seems that for AG News, the higher the denominator constant, the lower the accuracy.

Does this make sense or am I missing something?

@bazingagin (Owner)

Hi @Maimonator, this is an interesting finding! The metadata's share of the denominator is larger than I thought.
It seems removing the bias increases the accuracy a bit? (Maybe this needs to be confirmed by using, e.g., -40 as the bias term and seeing whether the accuracy decreases?)

@Maimonator (Author)

> Hi @Maimonator, this is an interesting finding! The metadata's share of the denominator is larger than I thought. It seems removing the bias increases the accuracy a bit? (Maybe this needs to be confirmed by using, e.g., -40 as the bias term and seeing whether the accuracy decreases?)

Yes, I think removing the bias increases the accuracy, but I wouldn't subtract more than 20, as you increase the risk of zero division or of getting negative values; as I understand the paper, the idea is just to normalize the distance to be between 0 and 1.
Also, it's important to note that this has a larger impact on shorter strings.
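
To illustrate (a rough sketch, reusing the same definitions as in my sketch above; the example strings are made up):

```python
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))

def ncd(x1: str, x2: str, denom_bias: int = 0) -> float:
    c1, c2 = clen(x1), clen(x2)
    c12 = clen(x1 + " " + x2)
    return (c12 - min(c1, c2)) / (max(c1, c2) + denom_bias)

short_pair = ("good movie", "great movie")
long_pair = (
    "the stock market rallied today as investors weighed new inflation data",
    "stocks rose sharply as traders digested the latest inflation figures",
)

# With denom_bias=-20 the denominator shrinks by the same 20 bytes in both
# cases, but 20 bytes is a much larger fraction of a short string's
# compressed length, so the short pair's NCD moves much more.
for a, b in (short_pair, long_pair):
    print(ncd(a, b), ncd(a, b, denom_bias=-20))
```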

@bazingagin (Owner)

Yeah, you are right - subtracting more than 20 may increase the chance of zero division.
This bias term will have more impact on shorter strings, but I'm curious whether this "relatively higher" impact on shorter strings affects the preference for the length of the training sample.

@Maimonator (Author)

> Yeah, you are right - subtracting more than 20 may increase the chance of zero division. This bias term will have more impact on shorter strings, but I'm curious whether this "relatively higher" impact on shorter strings affects the preference for the length of the training sample.

I'll try to find a dataset with class imbalance, where I think the impact would show up prominently in false negatives/false positives.
Maybe I'll do it later today 🤗
