Explanation of what exactly the various dedup strategies do #299

Open
aditya-hari opened this issue Oct 22, 2024 · 0 comments

@aditya-hari

Is there any documentation of what the various deduplication strategies actually do? My use case is a corpus of texts in which the documents contain template passages that recur with slight variations. I can't tell whether any of the strategies here fit that case, and the code examples and the existing documentation don't make it clear. For instance, what does "only_dedup_in_index" mean for SentenceDedup?
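
To make the use case concrete, here is a rough sketch of the kind of matching I have in mind. It uses only Python's standard `difflib` and is not this library's API: `near_duplicate_pairs`, the 0.8 threshold, and the sample texts are all made up for illustration.

```python
from difflib import SequenceMatcher


def near_duplicate_pairs(texts, threshold=0.8):
    """Return (i, j, ratio) for each pair of texts whose similarity ratio
    is at least `threshold`. Hypothetical helper, quadratic in len(texts)."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            ratio = SequenceMatcher(None, texts[i], texts[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs


# Invented sample corpus: the first two texts share a template, the third does not.
docs = [
    "Dear Alice, your order #1001 has shipped and will arrive on Monday.",
    "Dear Bob, your order #2042 has shipped and will arrive on Friday.",
    "Quarterly revenue grew on the back of strong cloud demand.",
]

pairs = near_duplicate_pairs(docs)
for i, j, ratio in pairs:
    print(f"docs[{i}] and docs[{j}] look like the same template (ratio={ratio:.2f})")
```

Exact-match deduplication would keep both of the first two texts because they differ in the name, order number, and day. What I'm looking for is a strategy that treats them as duplicates of one template, and that scales better than this pairwise comparison.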

Maybe I'm struggling with this only because I'm not very familiar with deduplication in general, so I would appreciate any help.
