-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass User-Agent to all web requests, set async job timeout, and retry header retrieval #92
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #92 +/- ##
==========================================
+ Coverage 43.14% 44.29% +1.14%
==========================================
Files 15 17 +2
Lines 978 1043 +65
Branches 133 148 +15
==========================================
+ Hits 422 462 +40
- Misses 545 569 +24
- Partials 11 12 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rethink the contact thing. It's a new scraper, it will surely serve as example for future ones. Current behavior is sketchy.
I made significant code changes to address your concern as discussed:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this will be a great addition!
simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"
This justification is erroneous. We can (and do in other scrapers) initialize and modify on the fly.
remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
👍
remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
It's not entirely correct. You can rely on argparse's auto naming or be formal to control it but you always need the name to retrieve data. It is thus written twice but differently.
if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself
(and I think it is important and correct to have two distinct context to really know where configuration comes from)
I agree
Here's a general comment as this is a new approach and it makes more sense to me this way. I know this is a mindtouch PR but I think we're aligned in having this replicated ; hence the care in designing it.
I think the current API is not conducting enough meaning/structure.
I like the concept of initializing the context. It helps understanding what's default, what needs initialization and what changes frequently.
In current form, we have a class with class variables that is immediately initialized into a pseudo-singleton and then it's dynamic. tmp_folder
for instance is not assigned.
Also, the fact that we manipulate argparse's object in the context module is not appealing.
I'd suggest the following:
- class def as it is but as a kw-only dataclass
- argparse options dest only specified when not matching context variables or we want a distinction.
- A singleton impl to ease importing/usage
- Inited with cli values so once inited all the base context is present. Can(and will) change though
- logger should be part of the context IMO
This provides the intent/meaning a lot better I find. It also allows clear Path and other initialization in post_init.
@dataclass(kw_only=True)
class Context:
_instance: Self | None = None
...
name: str
# not necessary but it conveys intent
@classmethod
def setup(cls, **kwargs):
cls(**kwargs)
def __post_init__(self):
self.__class__._instance = self
@classmethod
def get(cls) -> "Context":
if not getattr(cls, "_instance"):
raise OSError("Uninitialized context")
return cls._instance
# entrypoint
from xxx.context import Context
Context.setup(**dict(args._get_kwargs()))
# anywhere
from xxx.context import Context
context = Context.get() # as we do for logger
scraper/src/mindtouch2zim/context.py
Outdated
) | ||
|
||
|
||
# Singleton instance holding current scraper context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a singleton. It's the only instance you create. I think the comment should be clearer (or maybe it's not useful)
scraper/src/mindtouch2zim/context.py
Outdated
bad_assets_threshold: int = 10 | ||
|
||
# current processing info to use for debug message / exception | ||
_processing_step: threading.local = threading.local() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name of this (maybe not this holder but the API) should make it very clear that this is local-only of thread-local. Otherwise that would surely turn a future maintainer crazy.
Fix #79
Changes:
Context
class to hold contextual / configuration information about current scraper runContext
class, including sourcing from environment variables when appropriatedest
properties from argparse since we now have a custom parsing function incontext
module, no need to do the same job at two different locations