[Discussion] PyPDF2 ❤️ pdfrw #232
Comments
Hello, Martin!
I have to admit to neglecting pdfrw lately, but it's possible that at some
point I could be reenergized, and there certainly might be some room for
collaboration. pdfrw used to be included in several Linux distros, but
then I think maybe it wasn't working with later versions of Python 3 (like
maybe 3.8?) so they dropped it. Kind of snowballed from there.
pdfrw is my own original work for the most part, although:
a) I think I might have taken some compression/decomp code from pypdf?
Don't remember, the code should say. I certainly looked at other
implementations; and
b) I definitely added a couple of methods to make pdfrw compatible with
pypdf for really simple things.
Having said that, and with the caveat that I haven't looked at either
package in a while, I will say that from what I remember (back when I was
working on this):
1) pdfrw operates at a much lower level than pypdf, e.g.
2) pypdf is much more full featured, with, e.g. lots and lots of
convenience methods, and
3) pdfrw is much faster than pypdf
If these things are still true (you could certainly try routing a couple of
pdfs through each, and let me know if you have issues or what the heck the
later python problem is, I dunno), and if the pypdf architecture would
allow this, then it *might* (bearing in mind that I'm pulling all this out
of my ass right now) make sense to use some of pdfrw as the underlying base
of pypdf.
I probably won't have much time over the next few months, but I am always
open to answering questions, and if you wanted to, for example merge the
projects somehow, I don't have a real problem with that, other than that I
would like the core functionality to remain clean and fast, and (from the
outside looking in) it looks like a lot of code was merged into pypdf in a
topsy-turvy fashion that emphasized getting something working over keeping
it clean, maintainable, and fast.
Good luck and best regards,
Pat
On Sat, Apr 9, 2022 at 9:15 AM Martin Thoma wrote:
I recently became the maintainer of PyPDF2. While it will take me some
time to increase test coverage, merge/close PRs, and deal with GitHub tickets
(issues), I'm looking forward to a new major release in which I can
deprecate some parts.
One thing I'd like to do is to change the interface. It starts with simple
things like reader.getNumPages() becoming len(reader), changing the
camelCase method names to snake_case, and adding type annotations.
I was wondering how big the difference between pdfrw and PyPDF2 is and how
the two are related. Some things look super similar. Does pdfrw have its
origins as a fork of PyPDF2?
Maybe it's possible to join efforts (I could imagine merging pdfrw into
PyPDF2, creating a shared "base library" for both projects, or sharing
parts of the test suite between both projects).
I can confirm this. When I found this issue I wanted to check the speed difference, and followed this pdfrw example in order to compare them: I adapted it to just pass a list of page numbers to subset from a given pdf with 340 pages (EDIT: 58 MB scanned book, 175 KB/page on average).
@MartinThoma the difference is so amazing that I'd say there is something wrong with pypdf memory usage.
Regards
EDIT: a recent version of the test is posted here now: py-pdf/benchmarks#7
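For anyone who wants to repeat this kind of comparison, here is a rough sketch of how the page-subset test could be set up. It is not the script from py-pdf/benchmarks#7. The helper names (subset_with_pdfrw, subset_with_pypdf2, measure) are made up for illustration, page numbers are assumed to be 0-based, and the PyPDF2 function assumes a release that provides PdfReader, PdfWriter and add_page. tracemalloc only sees allocations made through Python's allocator, so the peak figure is a hint rather than a full memory profile.

```python
import time
import tracemalloc
from typing import Callable, List, Tuple


def subset_with_pdfrw(src: str, dst: str, page_numbers: List[int]) -> None:
    # Imported lazily so the timing harness works without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(src).pages
    writer = PdfWriter()
    for n in page_numbers:  # assumption: 0-based page numbers
        writer.addpage(pages[n])
    writer.write(dst)


def subset_with_pypdf2(src: str, dst: str, page_numbers: List[int]) -> None:
    # Assumption: a PyPDF2 release exposing PdfReader / PdfWriter / add_page.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for n in page_numbers:
        writer.add_page(reader.pages[n])
    with open(dst, "wb") as fh:
        writer.write(fh)


def measure(fn: Callable[..., None], *args) -> Tuple[float, int]:
    """Run fn once and return (elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak
```

With both libraries installed, something like measure(subset_with_pdfrw, "book.pdf", "out.pdf", [0, 5, 10]) and the same call with subset_with_pypdf2 would give numbers to compare; the file names here are placeholders.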
I recently became the maintainer of PyPDF2. While it will take me some time to increase test coverage, merge/close PRs, and deal with GitHub tickets (issues), I'm looking forward to a new major release in which I can deprecate some parts.
One thing I'd like to do is to change the interface. It starts with simple things like reader.getNumPages() becoming len(reader), changing the camelCase method names to snake_case, and adding type annotations.
I was wondering how big the difference between pdfrw and PyPDF2 is and how the two are related. Some things look super similar. Does pdfrw have its origins as a fork of PyPDF2?
Maybe it's possible to join efforts (I could imagine merging pdfrw into PyPDF2, creating a shared "base library" for both projects, or sharing parts of the test suite between both projects).
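To make the interface change concrete, here is a minimal sketch of what moving from reader.getNumPages() to len(reader) could look like while keeping the old name alive during a deprecation period. The Reader class below is a hypothetical stand-in, not PyPDF2's actual reader, and the list of strings only stands in for real page objects.

```python
import warnings
from typing import List


class Reader:
    """Hypothetical stand-in for a PDF reader; not PyPDF2's real class."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def __len__(self) -> int:
        # New interface: len(reader) returns the page count.
        return len(self._pages)

    def getNumPages(self) -> int:
        # Old camelCase name kept as a thin wrapper that warns callers.
        warnings.warn(
            "getNumPages() is deprecated; use len(reader) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return len(self)


reader = Reader(["page 1", "page 2", "page 3"])
print(len(reader))
```

Keeping the old method as a wrapper means existing callers keep working and get a warning, and the wrapper can be dropped in a later major release.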