Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python bindings #5

Open
xamgore opened this issue Jan 11, 2025 · 8 comments
Open

Python bindings #5

xamgore opened this issue Jan 11, 2025 · 8 comments

Comments

@xamgore
Copy link
Owner

xamgore commented Jan 11, 2025

@bunny-therapist described how it works.

You publish a python package that has the rust crate as a dependency. The compiled binary gets built into any wheels built, but if there is no wheel matching the user's system, installing the python package will download the crate and compile it.

The python package contains python code, as well as some additional rust code that calls the crate, just with wrapping functions that can be decorated with macros to generate bindings. (I do not recommend adding these binding macros to the original rust code - you will likely end up with issues related to python garbage collection, threading, memory management etc. I intend to wrap them with additional rust code.)

@bunny-therapist
Copy link
Collaborator

My intention is to put this together as soon as we are happy with yake-rust.

Though I could imagine merging this repo into the original project first, then doing the python bindings directly there. Maybe even after a first 1.0.0 release, as a 1.1.0 feature. I have a bunch of code for this already, we just need to work out exactly how we want this to work, and even if it should be in the same project or not.

@bunny-therapist
Copy link
Collaborator

I was trying to make it earlier, but the rust interface kept changing and I could not keep up, so I decided to wait until it stabilized.

@bunny-therapist
Copy link
Collaborator

Do you think the interface (yake constructor, configs, extract method) will change further?

@xamgore
Copy link
Owner Author

xamgore commented Jan 12, 2025

Sure it will. Especially if we decide to further optimize the code.

  • segtok will probably return a list of structs referencing the text &str, not just Vec.
  • Yake will start accepting Sentences, and not just a text &str. The Sentence structure will be different. There will be a helper to build Sentences from segtok as well as huggingface tokenizers.
  • The output is only n best, though it should be an iterator or the full list.

But we can postpone all of those changes or speed them up. I am busy the next 1-2 weeks, so let's do smth that will allow you moving further. We should think about benchmarks, as the crate is a bit useless without them, comparing to py impl.

@bunny-therapist
Copy link
Collaborator

In my local code where I wrote python bindings, I had tests that tested

  • agreement with LIAAD/yake (this is where all my bug reports came from, initially)
  • speed compared to LIAAD/yake, e.g., "finish processing this text in half the time"
  • that the code worked multithreaded (i.e. that the GIL is released so that rust can take advantage of multiple CPUs - unlike python).

It is not quite benchmarks, but the speed comparison part could of course be an embryo of something like that.

The GIL-release is actually a huge deal. And I intend to put make that point very clear wherever we write about the python bindings.

@bunny-therapist
Copy link
Collaborator

Personally, I would prefer it if we, in order:

  1. fixed the remaining problems and cleaned up the code
  2. settled on a semi-stable interface for version 1.0.0 without introducing a bunch of extra features like accepting sentences, custom tokenization, etc. We can always make a version 2.0.0. For extra public functions which take sentences, you don may not even need that, just new feature releases.
  3. added python bindings (I can adapt what I already have if we have an interface) - including tests that compares it to LIAAD/yake
  4. publish version 1.0.0 to crates.io and its python bindings to pypi
  5. work on benchmarking and presentation and publish that
  6. consider what improvements should take priority

I don't want us to get bogged down just adding features forever so we never get to 1.0.0.

Btw, it seems like we do not have any CI? Like running tests and formatting (and clippy?) in github actions? I have little experience with rust projects, but I would expect something like that.

@xamgore
Copy link
Owner Author

xamgore commented Jan 12, 2025

I used to think the rust code could do miltithreading inside i.e. through rayon, to utilize CPU. But looks like both approaches can coexist.

Also, we could ask Kyle either to share ownership or transfer it. If that won't work, then create a separate crate. Asking him to publish new versions is not really practical.

@bunny-therapist
Copy link
Collaborator

bunny-therapist commented Jan 12, 2025

Yeah maybe. I am not familiar with rayon. I also don't want to go further discussing that now (i.e. before 1.0.0) because I am starting to worry about scope.

I agree that maintenance need to be discussed. I am not sure how involved Kyle wants to be, but it seems weird at this point that we would not be able to publish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants