Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processor changes #99

Merged
merged 14 commits into from
Aug 26, 2024
Merged

Processor changes #99

merged 14 commits into from
Aug 26, 2024

Conversation

Nykakin
Copy link
Contributor

@Nykakin Nykakin commented Aug 12, 2024

We'd like to remove processors from our utils and replace them with zyte-common-items counterparts. There are some functionality missing, however.

Summary of changes:

  • Add new processor, images_processor, that attempts to convert input to list of zyte_common_items.Image instances
  • Make brand_processor return zyte_common_items.Brand instances instead of strings
  • Allow brand_processor to take string as an input argument
  • Allow price_processor to take string or number as input argument (resolves Standard price processors #94)

@BurnzZ BurnzZ requested review from wRAR, kmike and Gallaecio August 12, 2024 23:35
zyte_common_items/processors.py Outdated Show resolved Hide resolved
zyte_common_items/processors.py Outdated Show resolved Hide resolved
zyte_common_items/processors.py Outdated Show resolved Hide resolved
zyte_common_items/processors.py Outdated Show resolved Hide resolved
assert await page.currencyRaw is None
assert page.call_count == 2 # we want this to be 1
assert await page.currencyRaw == "$"
assert page.call_count == 1 # we want this to be 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why did the count change and what should be the new expectation in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've noticed that comment info doesn't match test values in places in master. Not sure if this is a proper behavior.

On master such difference can be found in 5 places:

This PR fixes two of these:

It seems like an improvement, albeit incomplete (but fixing this for all cases seems kind of out of scope so perhaps we can live with it for now...?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted parsing strings by price_processor as noted in #99 (comment) so this is no longer relevant as this PR no longer modify test_pages_price.py.

@Nykakin Nykakin requested a review from wRAR August 20, 2024 15:36
value = _handle_selectorlist(value)

if isinstance(value, Real):
return f"{value:.2f}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this would allow to remove the duplication in price_processor vs simple_price_processor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to migrate our stuff quickly before certain deadlines, which is why I want to have this thing working for now and avoid coordinating between PR in yet another dependency repository (which is also why I placed proposed parsing of images into TODO). I think we can improve it later and migrate into more fitting place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

tests/test_processors.py Outdated Show resolved Hide resolved
@@ -62,21 +62,21 @@ def price(self):
url = "https://example.com"
page = CustomProductPage(response=HttpResponse(url=url, body=html))
assert page.call_count == 0
assert page.price == "$13.2"
assert page.price == "13.20"
Copy link
Contributor

@kmike kmike Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big change, because it goes against the documentation - see https://zyte-common-items.readthedocs.io/en/latest/usage/field-processors.html#overview:

By design, the processors enabled by default are “transparent”: they don’t change the output of the field if the result is of the expected final type. For example, if there is a str attribute in the item, and the field returns str value, the default processor returns the value as-is.

This design decision was made for two reasons:

  1. It makes it possible to safely add new default processors, without a risk of breaking user code. If the code was working before a processor is introduced, it means the field was returning the final data type. Processor won't touch it, so no breaking is possible.
  2. it provides an escape hatch for the user; if you want to return a final data, you can do it just by returning the value.

It's not an issue for Real support in price parsing, for Brand processing, or for Images (it all looks good in this PR), but it's an issue for price fields.

Obviously, this "design principle" has a downside: in most cases (?), we do want to process strings by default. I think most large Scrapy projects develop this kind of utilities over time, where all values are processed by default. We do want to make it more standard.

I can see 2 ways to address it.

  1. Start procesing final data types by default; drop this design principle.
  2. Introduce more aggressive / complete versions of procesors. Keep "cautious" processors enabled by default, which don't handle the final data types. Make it easy to enable the complete versions - either by defining Processors class which users can use, or by providing another base ProductPage class, with processing of final values enabled by default.

I'm +0.5 for option (2), but willing to hear the opinions and arguments. I like keeping the "safe" approach, but it introduces more complexity (e.g. an additional base class, where users need to pick the right one).

cc @wRAR @Gallaecio.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of against this transparency principle, and I've even proposed another PR #101 to ensure all strings are trimmed.

It's difficult to imagine for me why we would want to emit data with extra whitespaces and it's bothersome to always manually add some sort of default text cleaning processor into all used fields. I think this kind of processing should be applied by default at some point and would help to ensure all our output data respect a common standard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to revert this change for now, to not block the rest.

@kmike
Copy link
Contributor

kmike commented Aug 21, 2024

Overall, 👍, much needed changes. Having e.g. an images processor is a must.

Copy link

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (cce3b94) to head (30baf8a).
Report is 6 commits behind head on main.

Files Patch % Lines
zyte_common_items/processors.py 0.00% 43 Missing ⚠️
zyte_common_items/pages/product.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main     #99   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         54      55    +1     
  Lines       2107    2160   +53     
=====================================
- Misses      2107    2160   +53     
Files Coverage Δ
zyte_common_items/pages/product.py 0.00% <0.00%> (ø)
zyte_common_items/processors.py 0.00% <0.00%> (ø)

@Nykakin
Copy link
Contributor Author

Nykakin commented Aug 22, 2024

In response to #99 (comment) and after some discussion in slack I've decided to just return strings without any change in price_processor instead of parsing it, thus retaining original philosophy of transparent processors.

The discussion about this approach can continue but I do not want to block image processors and brand processor because of it.

@Nykakin Nykakin requested a review from kmike August 22, 2024 14:10
Copy link
Contributor

@kmike kmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@kmike kmike merged commit a926028 into zytedata:main Aug 26, 2024
9 checks passed
@kmike
Copy link
Contributor

kmike commented Aug 26, 2024

Thanks @Nykakin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standard price processors
4 participants