Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cleanup fuzzyreplace #123

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

jaderabbit
Copy link
Member

  • Replaced fuzzywuzzy with the C++ based rapidfuzz
  • Added a progress bar for fuzzywuzzy
  • Moved all the installations to the top cell

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

Review Jupyter notebook visual diffs & provide feedback on notebooks.


Powered by ReviewNB

Copy link
Contributor

@espoirMur espoirMur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome @JADE,

did we get the time. to benchmark it?

How long does it take on a medium-size dataset?

pandas
p_tqdm
rapidfuzz
joeynmt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaderabbit and @juliakreutzer is this the right way to install joeynmt?

like pip install joeynmt

I remember last time I tried it it gave me headaches and I was forced to install it from Github...

using pip install git+https://github.com/joeynmt/joeynmt.git

also it is missing the version...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@espoirMur I added it to pypi a few weeks ago so this should be fine now - I still need to update the folder locations later down so we don't need to install joeynmt via cloning it.

Yes, I should do versions for all the requirements - thanks for the push!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is on PyPI that is cool....

@jaderabbit
Copy link
Member Author

This looks awesome @JADE,

did we get the time. to benchmark it?

How long does it take on a medium-size dataset?

@espoirMur I'm actually going to change it.

  • The benchmark I did do for fuzzywuzzy vs rapidfuzz though using timeit: Where fuzzywuzzy takes 1s for 100 patterns, rapidfuzz takes 32ms 🎉
  • On the other hand: The new progress bar I added slows EXTENSIVELY - in fact, so far, your approach (which technically has no logging at all) is the fastest still. I might rewrite using a slightly different multiprocessing library to allow for some sense of output - or at a progress bar which updates based on an estimation or something.

Copy link
Contributor

@espoirMur espoirMur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@espoirMur
Copy link
Contributor

This looks awesome @JADE,
did we get the time. to benchmark it?
How long does it take on a medium-size dataset?

@espoirMur I'm actually going to change it.

  • The benchmark I did do for fuzzywuzzy vs rapidfuzz though using timeit: Where fuzzywuzzy takes 1s for 100 patterns, rapidfuzz takes 32ms 🎉
  • On the other hand: The new progress bar I added slows EXTENSIVELY - in fact, so far, your approach (which technically has no logging at all) is the fastest still. I might rewrite using a slightly different multiprocessing library to allow for some sense of output - or at a progress bar which updates based on an estimation or something.

If that is the speed we got .... it's really really nice....

@juliakreutzer
Copy link
Collaborator

That's an awesome improvement, thank you @jaderabbit! 💯
Regarding the progress bar: do you want to keep it for now and merge or wait for the merge until there's a better solution?

Copy link
Collaborator

@juliakreutzer juliakreutzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaderabbit shall we merge this? Or not because of the progress bar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants