[Discussion] PyPDF2 ❤️ pdfrw #232

MartinThoma · 2022-04-09T14:15:43Z

I recently became the maintainer of PyPDF2. While it will take me some time to increase test coverage, merge/close PRs, and deal with GitHub tickets (issues), I'm looking forward to a new major release in which I can deprecate some parts.

One thing I'd like to do is change the interface. It starts with simple things like replacing reader.getNumPages() with len(reader), changing the camelCase method names to snake_case, and adding type annotations.
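A minimal sketch of how such a transition could work, using a hypothetical `Reader` class rather than PyPDF2's actual code: the old camelCase method stays as a thin wrapper that emits a `DeprecationWarning` and forwards to the new interface, so existing callers keep working until the major release removes it.

```python
import warnings
from typing import List


class Reader:
    """Hypothetical stand-in for a PDF reader; not PyPDF2's real class."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def __len__(self) -> int:
        # New interface: len(reader) returns the number of pages.
        return len(self._pages)

    def getNumPages(self) -> int:
        # Old camelCase name kept as a deprecated alias for the new interface.
        warnings.warn(
            "getNumPages() is deprecated; use len(reader) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return len(self)


reader = Reader(["page 1", "page 2", "page 3"])
print(len(reader))
```

Python hides `DeprecationWarning` by default outside of code run directly in `__main__`, so callers may need `-W default` (or a test runner that enables warnings) to see the message.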

I was wondering how big the difference between pdfrw and PyPDF2 is, and how the two are related. Some things look super similar. Does pdfrw have its origins as a fork of PyPDF2?

Maybe it's possible to join efforts. I could imagine merging pdfrw into PyPDF2, creating a shared "base library" for both projects, or sharing parts of the test suite between both projects.


pmaupin · 2022-04-09T15:42:59Z

Hello, Martin!

I have to admit to neglecting pdfrw lately, but it's possible that at some point I could be reenergized, and there certainly might be some room for collaboration. pdfrw used to be included in several Linux distros, but then I think maybe it wasn't working with later versions of Python 3 (like maybe 3.8?) so they dropped it. Kind of snowballed from there.

pdfrw is my own original work for the most part, although:

a) I think I might have taken some compression/decomp code from pypdf? Don't remember, the code should say. I certainly looked at other implementations; and
b) I definitely added a couple of methods to make pdfrw compatible with pypdf for really simple things.

Having said that, and with the caveat that I haven't looked at either package in a while, I will say that from what I remember (back when I was working on this):

1) pdfrw operates at a much lower level than pypdf,
2) pypdf is much more full featured, with, e.g., lots and lots of convenience methods, and
3) pdfrw is much faster than pypdf.

If these things are still true (you could certainly try routing a couple of pdfs through each, and let me know if you have issues or what the heck the later Python problem is, I dunno), and if the pypdf architecture would allow this, then it *might* (bearing in mind that I'm pulling all this out of my ass right now) make sense to use some of pdfrw as the underlying base of pypdf.

I probably won't have much time over the next few months, but I am always open to answering questions, and if you wanted to, for example, merge the projects somehow, I don't have a real problem with that, other than that I would like the core functionality to remain clean and fast. From the outside looking in, it looks like a lot of code was merged into pypdf in a topsy-turvy fashion that emphasized getting something working over keeping it clean, maintainable, and fast.

Good luck and best regards,
Pat


abubelinha · 2023-02-20T00:45:08Z

> pdfrw is much faster than pypdf

I can confirm this.
I was using pypdf to extract some pages of a big pdf to create smaller files.

When I found this issue I wanted to check the speed difference, so I followed this pdfrw example to compare the two libraries:
https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py
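A comparison along these lines could be sketched as follows. This is not the script used for the numbers in this comment; the function names are made up for illustration, the pypdf half assumes the current `pypdf` package (`PdfReader`, `PdfWriter.add_page`), and page numbers are assumed to be 1-based as in `subset.py`.

```python
import time
from typing import Callable, List


def subset_with_pdfrw(src: str, dst: str, page_numbers: List[int]) -> None:
    # Imported lazily so the sketch loads even if pdfrw is not installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(src).pages
    writer = PdfWriter(dst)
    for number in page_numbers:
        writer.addpage(pages[number - 1])  # assumption: 1-based page numbers
    writer.write()


def subset_with_pypdf(src: str, dst: str, page_numbers: List[int]) -> None:
    # Imported lazily so the sketch loads even if pypdf is not installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for number in page_numbers:
        writer.add_page(reader.pages[number - 1])  # assumption: 1-based page numbers
    with open(dst, "wb") as handle:
        writer.write(handle)


def timed(func: Callable[..., None], *args) -> float:
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

With a hypothetical input file `book.pdf`, `timed(subset_with_pdfrw, "book.pdf", "out_pdfrw.pdf", [4, 6, 8, 9])` and the same call with `subset_with_pypdf` give the two timings to compare.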

I adapted it to just pass a list of page numbers to subset from a given pdf with 340 pages (EDIT: 58 MB scanned book, 175 KB/page on average).
I noticed the following:

The speed difference was huge: pdfrw was many times faster than pypdf in my first tests. Then I realized that the difference grows with the number of extracted pages: pdfrw's run time is barely affected by the page count, whereas pypdf's run time (in seconds) increases a lot.

--------------------------------------------------
4 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
N only affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9]
pdfrw: 1478.83 KB output size, took 1.61 seconds
pypdf: 1486.26 KB output size, took 6.684 seconds
pypdf_time / pdfrw_time = 4.15 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1479.90 KB output size, took 1.146 seconds
pypdf: 2972.23 KB output size, took 13.163 seconds
pypdf_time / pdfrw_time = 11.49 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1480.97 KB output size, took 1.127 seconds
pypdf: 4458.29 KB output size, took 19.644 seconds
pypdf_time / pdfrw_time = 17.43 ratio
--------------------------------------------------
NOW 8 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
Different pages number only affects pdfrw output size, but not its speed
Pages number (no matter they are repeated or different) affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2774.33 KB output size, took 1.691 seconds
pypdf: 2790.06 KB output size, took 13.073 seconds
pypdf_time / pdfrw_time = 7.73 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2776.51 KB output size, took 1.181 seconds
pypdf: 5580.01 KB output size, took 26.387 seconds
pypdf_time / pdfrw_time = 22.34 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2778.69 KB output size, took 1.171 seconds
pypdf: 8369.96 KB output size, took 39.936 seconds
pypdf_time / pdfrw_time = 34.1 ratio
--------------------------------------------------

Memory consumption is much higher in pypdf: for example, running with N=5 eats all my RAM, whereas pdfrw is not affected at all.
Also, the PDF files generated by pdfrw are much smaller: in particular, repeated pages barely change pdfrw's output size, whereas in pypdf both the output file size and the script time grow with every repetition.
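As a quick check on the scaling described above, the reported figures for the 4-page runs can be fed back through a few lines of Python. The numbers are copied from the results above; the variable names are only for illustration.

```python
# (N, pdfrw KB, pdfrw seconds, pypdf KB, pypdf seconds) for the 4-page runs,
# copied from the reported results.
runs = [
    (1, 1478.83, 1.61, 1486.26, 6.684),
    (2, 1479.90, 1.146, 2972.23, 13.163),
    (3, 1480.97, 1.127, 4458.29, 19.644),
]

base_pdfrw_kb = runs[0][1]
base_pypdf_kb = runs[0][3]

summary = []
for n, pdfrw_kb, pdfrw_s, pypdf_kb, pypdf_s in runs:
    time_ratio = pypdf_s / pdfrw_s            # how many times slower pypdf was
    pdfrw_growth = pdfrw_kb / base_pdfrw_kb   # output size relative to N=1
    pypdf_growth = pypdf_kb / base_pypdf_kb   # output size relative to N=1
    summary.append((n, time_ratio, pdfrw_growth, pypdf_growth))
    print(
        f"N={n}: time ratio {time_ratio:.2f}, "
        f"pdfrw size x{pdfrw_growth:.3f}, pypdf size x{pypdf_growth:.3f}"
    )
```

The pypdf output size comes out at roughly N times the N=1 size, while the pdfrw output stays within a fraction of a percent of it, which matches the observation that repeated pages only cost pypdf.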

@MartinThoma the difference is so amazing that I'd say there is something wrong with pypdf memory usage
I tested this on Windows 7, Python 3.8

Regards
@abubelinha

EDIT: a recent version of the test is posted here now py-pdf/benchmarks#7

MartinThoma mentioned this issue Mar 27, 2023: Future maintenance sarnold/pdfrw#15 (open)

abubelinha mentioned this issue Jul 2, 2023: pdfrw vs pypdf page extraction & merge py-pdf/benchmarks#7 (open)
