Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass User-Agent to all web requests, set async job timeout, and retry header retrieval #92

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Nov 25, 2024

Fix #79

Changes:

  • pass a User-Agent compliant with upload.wikimedia.org Policy in all web requests
  • set a job timeout for all async jobs, serving as a "fallback" should something go terribly wrong (avoid scraper being stuck forever and never failing)
  • move code fetching online header in the try...catch for proper operation of logic avoiding to retry some known bad assets
  • add a new Context class to hold contextual / configuration information about current scraper run
    • remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
    • remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
    • if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant

@benoit74 benoit74 self-assigned this Nov 25, 2024
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 42.28188% with 86 lines in your changes missing coverage. Please review.

Project coverage is 44.29%. Comparing base (08e9733) to head (e51c9cd).

Files with missing lines Patch % Lines
scraper/src/mindtouch2zim/context.py 59.70% 26 Missing and 1 partial ⚠️
scraper/src/mindtouch2zim/processor.py 9.09% 20 Missing ⚠️
scraper/src/mindtouch2zim/entrypoint.py 6.25% 15 Missing ⚠️
scraper/src/mindtouch2zim/asset.py 18.75% 13 Missing ⚠️
scraper/src/mindtouch2zim/client.py 30.00% 7 Missing ⚠️
scraper/src/mindtouch2zim/download.py 63.63% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
+ Coverage   43.14%   44.29%   +1.14%     
==========================================
  Files          15       17       +2     
  Lines         978     1043      +65     
  Branches      133      148      +15     
==========================================
+ Hits          422      462      +40     
- Misses        545      569      +24     
- Partials       11       12       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review November 25, 2024 08:29
@benoit74 benoit74 changed the title Pass User-Agent to all web requests Pass User-Agent to all web requests, set async job timeout, and retry header retrieval Nov 25, 2024
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rethink the contact thing. It's a new scraper, it will surely serve as example for future ones. Current behavior is sketchy.

scraper/src/mindtouch2zim/constants.py Outdated Show resolved Hide resolved
scraper/src/mindtouch2zim/download.py Outdated Show resolved Hide resolved
scraper/src/mindtouch2zim/download.py Outdated Show resolved Hide resolved
scraper/src/mindtouch2zim/processor.py Outdated Show resolved Hide resolved
@benoit74
Copy link
Contributor Author

I made significant code changes to address your concern as discussed:

  • add a new Context class to hold contextual / configuration information about current scraper run
    • simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"
    • remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
    • remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
    • if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself (and I think it is important and correct to have two distinct context to really know where configuration comes from)

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this will be a great addition!

simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"

This justification is erroneous. We can (and do in other scrapers) initialize and modify on the fly.

remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate

👍

remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations

It's not entirely correct. You can rely on argparse's auto naming or be formal to control it but you always need the name to retrieve data. It is thus written twice but differently.

if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself
(and I think it is important and correct to have two distinct context to really know where configuration comes from)

I agree


Here's a general comment as this is a new approach and it makes more sense to me this way. I know this is a mindtouch PR but I think we're aligned in having this replicated ; hence the care in designing it.

I think the current API is not conducting enough meaning/structure.
I like the concept of initializing the context. It helps understanding what's default, what needs initialization and what changes frequently.
In current form, we have a class with class variables that is immediately initialized into a pseudo-singleton and then it's dynamic. tmp_folder for instance is not assigned.
Also, the fact that we manipulate argparse's object in the context module is not appealing.

I'd suggest the following:

  • class def as it is but as a kw-only dataclass
  • argparse options dest only specified when not matching context variables or we want a distinction.
  • A singleton impl to ease importing/usage
  • Inited with cli values so once inited all the base context is present. Can(and will) change though
  • logger should be part of the context IMO

This provides the intent/meaning a lot better I find. It also allows clear Path and other initialization in post_init.

@dataclass(kw_only=True)
class Context:
    _instance: Self | None = None
    ...
    name: str

    # not necessary but it conveys intent
    @classmethod
    def setup(cls, **kwargs):
        cls(**kwargs)

    def __post_init__(self):
        self.__class__._instance = self

    @classmethod
    def get(cls) -> "Context":
        if not getattr(cls, "_instance"):
            raise OSError("Uninitialized context")
        return cls._instance

# entrypoint
from xxx.context import Context
Context.setup(**dict(args._get_kwargs()))

# anywhere
from xxx.context import Context
context = Context.get()  # as we do for logger

scraper/src/mindtouch2zim/download.py Show resolved Hide resolved
)


# Singleton instance holding current scraper context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a singleton. It's the only instance you create. I think the comment should be clearer (or maybe it's not useful)

bad_assets_threshold: int = 10

# current processing info to use for debug message / exception
_processing_step: threading.local = threading.local()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of this (maybe not this holder but the API) should make it very clear that this is local-only of thread-local. Otherwise that would surely turn a future maintainer crazy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass proper user-agent
2 participants