Reworking EasyOCR #839

JulianOrteil · 2022-09-01T01:32:20Z


This discussion relates to reworking EasyOCR and bringing it up to the standards set by various PEPs, along with any others suggested here.

The issue:
EasyOCR is a very popular library within the machine vision community. It competes with top-of-the-line software like Tesseract while maintaining a robust community. Despite its many achievements, which shouldn't be understated, the library falls short in multiple areas, including documentation, testing, and readability.

While perhaps not an issue for large swathes of users, these hindrances do present issues to others who want to contribute, fine-tune, or just learn OCR in general. Because of this, @ystoll and I have decided to donate our time to overhauling EasyOCR to address all of these issues while maintaining its excellent performance and ease of use for the typical user.

Breakdown:
Information about specific pain points we've identified is spread between two issues (#823 and #829); they'll be consolidated below. This is by no means an exhaustive list, so if you have a suggestion, please feel free to share it in a reply.

EasyOCR falls well short of PEP compliance. Specifics include variable names that are not snake_case, code comments written as string literals (rather than # comments), unused variables, and so on. All of these severely reduce the readability of EasyOCR and are a hurdle for potential contributors.
Documentation is severely lacking, especially regarding how parameters affect detection in methods like Reader.detect. The library presents itself as "ready-to-go, out-of-the-box", and the current documentation may be sufficient for less-adept users. However, "power users" spend considerable amounts of time tweaking parameters to find the optimal outcome for their problems and should be able to discern what each parameter does from reading the docstrings. In addition, the documentation hosted on JaidedAI's website is separate from the code. With ReadTheDocs, documentation can be updated automatically with code commits, reducing overhead.
Type hints are virtually non-existent. They are extremely useful for static type checkers and for users trying to determine what types of values a parameter expects.
Tests don't exist. The library appears to be largely tested through usage, which isn't proper. Tests are starting to be added through 9bd8be0, but in a way that reinforces the previous point about readability.
Everything is manual. The library is not automated in any real way for tasks such as running tests, producing releases, etc. These tasks can easily be handled by tools like tox, which makes contributors' lives much easier.
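
To make the first three points more concrete, here is a minimal sketch of the style we have in mind: snake_case names, a real docstring, `#` comments, and type hints. The helper `scale_boxes` and the `Box` alias are invented purely for illustration and are not taken from the EasyOCR code base.

```python
from typing import List, Tuple

# Hypothetical type alias, for illustration only (not an EasyOCR type).
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def scale_boxes(boxes: List[Box], ratio: float) -> List[Box]:
    """Scale bounding boxes by a constant ratio.

    Args:
        boxes: Boxes as (x_min, y_min, x_max, y_max) pixel coordinates.
        ratio: Multiplier applied to every coordinate; must be positive.

    Returns:
        A new list of boxes with each coordinate scaled and rounded.

    Raises:
        ValueError: If ``ratio`` is not positive.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    scaled: List[Box] = []
    # Comments use '#', not bare string literals in the function body.
    for x_min, y_min, x_max, y_max in boxes:
        scaled.append(
            (
                round(x_min * ratio),
                round(y_min * ratio),
                round(x_max * ratio),
                round(y_max * ratio),
            )
        )
    return scaled
```

With a signature like this, a static type checker can flag a caller passing a string for `ratio`, and a documentation generator can pick up the parameter descriptions directly from the docstring.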

Approach:
@rkcosmos stated in #823 that he is concerned about one massive PR being used for a project like this. This is a valid point. To address this concern, the best approach will likely be to make changes, with backward compatibility, over multiple versions. This allows for a more seamless change in the code base (keeping the old API for a couple of versions with deprecation warnings alongside the new API) but does introduce a lot of compatibility work on top of the proposed changes.

Plan:
For this rework to take place, a game plan should be determined before any work is performed. As such, I propose the following:

1. The first thing to tackle should be automating tests, releases, linting, etc. (anything related to CI/CD as well). This typically takes the form of tox and/or GitHub Actions and is the least invasive change. Because it doesn't affect the main source code, this change will be invisible to end users. It also sets up workflows for further steps.
2. Tests, tests, tests. Before any changes are made to the library, we need to add tests to establish the "check and balance" on our changes. If our changes break the tests, then those changes shouldn't be accepted.
3. Next should be type hints. Doing this as the first code change kills three birds with one stone: it provides type hints, gives us a better handle on how data flows through the code, and helps automate documentation.
4. Start documenting. Like type hints, this takes care of multiple issues: the documentation itself and a better understanding of what each function in the API is doing.
5. Identify and eliminate the problems discussed above relating to PEP compliance, to an extent. We only want to put in as much work as necessary to make reworking easier. Optimizations should not happen, and code cleanup should be kept to a minimum so as to reduce unnecessary overhead.
6. Start implementing the new architecture. However, no code should be moved from old to new; the new code should just call the old like a proxy. This allows us to start adding DeprecationWarnings everywhere in the old code telling users to start using the new API while maintaining the old API as-is. Our changes should be invisible to the end user unless they absolutely cannot be avoided.
7. Systematically start porting code. This is where things will get muddy because we need to keep old code for at least 2 minor revisions. New tests, documentation, and other necessary work should also be done at this time. This is where the hard work will be, as this step should be broken down into multiple PRs addressing specific modules or classes.

Step 7 should be the extent of this rework. Obviously, as we work through these changes, there may be areas where we can improve or optimize code, but that should be left for another day. The entire purpose of this rework should be limited to just addressing the above points and any points discussed below.
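
As a rough illustration of the proxy idea above (a new API that only forwards to the old code, with DeprecationWarnings on the old entry points), something along these lines could work. All names here (`readtext_old`, `NewReader`, `read_text`) are hypothetical placeholders rather than actual or planned EasyOCR names, and suppressing the warning inside the proxy is just one possible design choice.

```python
import warnings
from typing import List


def readtext_old(image_path: str) -> List[str]:
    """Stand-in for an existing public function (hypothetical name)."""
    warnings.warn(
        "readtext_old() is deprecated; use NewReader.read_text() instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this line
    )
    # Placeholder result; the real legacy logic would stay untouched here.
    return [f"text from {image_path}"]


class NewReader:
    """Hypothetical new API that only forwards to the old implementation."""

    def read_text(self, image_path: str) -> List[str]:
        # Silence the deprecation warning for this internal call so that
        # users who already migrated are not told to migrate again.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            return readtext_old(image_path)
```

Callers of the old function keep getting the same results plus a warning, while callers of the new class see no warning at all. Once the code is ported in step 7, the body of `read_text` can be replaced without touching its callers.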

If you have any ideas, concerns, questions, or other comments related to this rework you feel would be beneficial to add, then please post them. We are also open to adding additional contributors to this project if you feel the desire to.

ystoll · 2022-09-01T12:40:46Z


@JulianOrteil Thank you for summarizing the TODO list!
One concern though: #839 does not seem to appear in the issue list. Is this normal?
I will start working on a pytest test suite myself and will propose a PR once I have completed a few tests.
Please note that I am not an expert in unit testing (I have a physics background). I will do my best, and remarks on my contributions will be more than appreciated.

@rkcosmos You said that you were concerned about the stability of the package with respect to the PRs we might make. Maybe you could create a new development branch into which we merge our PRs? You could then test our changes intensively before merging them into the master branch.

23 replies

JulianOrteil Oct 19, 2022
Author

@ystoll No worries at all. I know how it gets!

My apologies for being vague and unclear. I was thinking of trying to kill two birds with one stone: an application that demonstrates EasyOCR that users can run while simultaneously being compatible with pytest to test the library. Whatever we can't test with the application is then explicitly tested, whether it be an obscure, buried method or an operation that only runs under specific circumstances.

ystoll Nov 7, 2022

@JulianOrteil,

Hi Julian, I hope you are doing well. I would like to get back to you concerning your previous message:

I think it would be preferable to do things right the first time: writing a specific application to test some parts of the code, and then testing the leftover parts with a classical approach, does not seem very robust to me. I would rather start by writing a full test suite, a bit like what was done for mmocr (mmocr test suite). FYI, I will start writing tests myself this afternoon, and I hope to publish my first commit on that tomorrow.
I think the easiest way to keep us motivated, and here I speak mostly for myself, would be to start iterating on the test suite ASAP.
Indeed, I am perfectly aware that the two of us have many things to do on the side. But I think that once we get started, it will be easier to keep going.
To this end, I would like to open a private communication channel with you for messages that do not fit here. I communicate with my team via Discord. If you are up for it, we can use the same platform for the two of us to chat. You can ping me via my professional email, available on my public GitHub account.

Have a nice day!

Yannick

ystoll Nov 7, 2022

Hi @JulianOrteil, by the way, I just found that a testing method already exists for building a test suite over an existing code base, which is exactly what we have to do here: the so-called **golden master**. Maybe you already knew about this method; I didn't. Here are a few useful links about it:

JulianOrteil Nov 8, 2022
Author

@ystoll No worries. I don't have any issue with writing tests as they should be. I just thought it would be a good idea to reduce how much code ends up being tested multiple times.

Also, I sent you an email just now for setting up a private channel.

JulianOrteil Nov 13, 2022
Author

@ystoll I sent you a message on Discord a few days ago. I just want to make sure the account I messaged is yours.

I've updated the "tests_suit" branch with my first attempt at using the pytest-golden framework. Please look through it and verify that I've done it right, as I haven't used this method of testing before. I don't want to write more tests until we've agreed that my implementation is correct.
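
For anyone following along who hasn't seen the golden master approach before, the core idea can be sketched with only the standard library. This is not the pytest-golden API and not what is on the branch; `check_against_golden` and `legacy_normalize` are hypothetical names invented for this example.

```python
import json
from pathlib import Path
from typing import Any, Callable


def check_against_golden(
    func: Callable[[str], Any], sample: str, golden_path: Path
) -> bool:
    """Compare func(sample) with a stored 'golden' result.

    On the first run the golden file does not exist, so the current output
    is recorded and accepted. Later runs must reproduce it exactly.
    """
    # Round-trip through JSON so tuples and lists compare consistently
    # with what is read back from disk.
    actual = json.loads(json.dumps(func(sample)))
    if not golden_path.exists():
        golden_path.write_text(json.dumps(actual, indent=2), encoding="utf-8")
        return True
    expected = json.loads(golden_path.read_text(encoding="utf-8"))
    return actual == expected


def legacy_normalize(text: str) -> dict:
    """Hypothetical stand-in for an untested legacy function."""
    stripped = text.strip()
    return {"text": stripped.lower(), "length": len(stripped)}
```

The first call records whatever the legacy code currently returns; after that, any refactoring that changes the output makes the check fail, which is the safety net we want before touching the old code.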

Additionally, VSCode allows us to debug tests like we would normal Python code through its dedicated testing framework. However, there is an issue between pdb and coverage which prevents the former from working properly. If you happen to use VSCode and this debug mode, please add the following to your .vscode/launch.json file:

{
    "name": "Debug Unit Test",
    "type": "python",
    "request": "launch",
    "justMyCode": false,
    "program": "${file}",
    "purpose": ["debug-test"],
    "console": "integratedTerminal",
    "env": {
        "PYTEST_ADDOPTS": "--no-cov"
    }
}

This disables coverage only while debugging tests, so running tox will still report coverage as expected.

yuyang3478 · 2022-10-01T23:39:05Z


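This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email personally. I will reply as soon as possible once my vacation ends.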

0 replies
