-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Processor changes #99
Conversation
tests/test_pages_price.py
Outdated
assert await page.currencyRaw is None | ||
assert page.call_count == 2 # we want this to be 1 | ||
assert await page.currencyRaw == "$" | ||
assert page.call_count == 1 # we want this to be 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why did the count change and what should be the new expectation in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've noticed that comment info doesn't match test values in places in master. Not sure if this is a proper behavior.
On master such difference can be found in 5 places:
- https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L41
- https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L69
- https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L71
- https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L80
- https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L188
This PR fixes two of these:
assert page.call_count == 1 # we want this to be 1 assert page.call_count == 1 # we want this to be 1
While three are same as before:assert page.call_count == 2 # we want this to be 1 assert page.call_count == 2 # we want this to be 1 zyte-common-items/tests/test_pages_price.py
Line 188 in 6620a15
assert page.call_count == 2 # we want this to be 1
It seems like an improvement, albeit incomplete (but fixing this for all cases seems kind of out of scope so perhaps we can live with it for now...?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted parsing strings by price_processor
as noted in #99 (comment) so this is no longer relevant as this PR no longer modify test_pages_price.py
.
This reverts commit 67cf862.
value = _handle_selectorlist(value) | ||
|
||
if isinstance(value, Real): | ||
return f"{value:.2f}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about moving it to https://github.com/zytedata/zyte-parsers/blob/main/zyte_parsers/price.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this would allow to remove the duplication in price_processor vs simple_price_processor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd like to migrate our stuff quickly before certain deadlines, which is why I want to have this thing working for now and avoid coordinating between PR in yet another dependency repository (which is also why I placed proposed parsing of images into TODO). I think we can improve it later and migrate into more fitting place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
tests/test_pages_price.py
Outdated
@@ -62,21 +62,21 @@ def price(self): | |||
url = "https://example.com" | |||
page = CustomProductPage(response=HttpResponse(url=url, body=html)) | |||
assert page.call_count == 0 | |||
assert page.price == "$13.2" | |||
assert page.price == "13.20" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change, because it goes against the documentation - see https://zyte-common-items.readthedocs.io/en/latest/usage/field-processors.html#overview:
By design, the processors enabled by default are “transparent”: they don’t change the output of the field if the result is of the expected final type. For example, if there is a str attribute in the item, and the field returns str value, the default processor returns the value as-is.
This design decision was made for two reasons:
- It makes it possible to safely add new default processors, without a risk of breaking user code. If the code was working before a processor is introduced, it means the field was returning the final data type. Processor won't touch it, so no breaking is possible.
- it provides an escape hatch for the user; if you want to return a final data, you can do it just by returning the value.
It's not an issue for Real support in price parsing, for Brand processing, or for Images (it all looks good in this PR), but it's an issue for price fields.
Obviously, this "design principle" has a downside: in most cases (?), we do want to process strings by default. I think most large Scrapy projects develop this kind of utilities over time, where all values are processed by default. We do want to make it more standard.
I can see 2 ways to address it.
- Start procesing final data types by default; drop this design principle.
- Introduce more aggressive / complete versions of procesors. Keep "cautious" processors enabled by default, which don't handle the final data types. Make it easy to enable the complete versions - either by defining Processors class which users can use, or by providing another base ProductPage class, with processing of final values enabled by default.
I'm +0.5 for option (2), but willing to hear the opinions and arguments. I like keeping the "safe" approach, but it introduces more complexity (e.g. an additional base class, where users need to pick the right one).
cc @wRAR @Gallaecio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of against this transparency principle, and I've even proposed another PR #101 to ensure all strings are trimmed.
It's difficult to imagine for me why we would want to emit data with extra whitespaces and it's bothersome to always manually add some sort of default text cleaning processor into all used fields. I think this kind of processing should be applied by default at some point and would help to ensure all our output data respect a common standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to revert this change for now, to not block the rest.
Overall, 👍, much needed changes. Having e.g. an images processor is a must. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #99 +/- ##
=====================================
Coverage 0.00% 0.00%
=====================================
Files 54 55 +1
Lines 2107 2160 +53
=====================================
- Misses 2107 2160 +53
|
In response to #99 (comment) and after some discussion in slack I've decided to just return strings without any change in The discussion about this approach can continue but I do not want to block image processors and brand processor because of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @Nykakin! |
We'd like to remove processors from our utils and replace them with
zyte-common-items
counterparts. There are some functionality missing, however.Summary of changes:
images_processor
, that attempts to convert input to list ofzyte_common_items.Image
instancesbrand_processor
returnzyte_common_items.Brand
instances instead of stringsbrand_processor
to take string as an input argumentprice_processor
to takestring ornumber as input argument (resolves Standard price processors #94)