Processor changes #99

Nykakin · 2024-08-12T05:11:53Z

We'd like to remove processors from our utils and replace them with zyte-common-items counterparts. There are some functionality missing, however.

Summary of changes:

Add new processor, images_processor, that attempts to convert input to list of zyte_common_items.Image instances
Make brand_processor return zyte_common_items.Brand instances instead of strings
Allow brand_processor to take string as an input argument
Allow price_processor to take ~~string or~~ number as input argument (resolves Standard price processors #94)

zyte_common_items/processors.py

wRAR · 2024-08-16T10:47:59Z

tests/test_pages_price.py

-    assert await page.currencyRaw is None
-    assert page.call_count == 2  # we want this to be 1
+    assert await page.currencyRaw == "$"
+    assert page.call_count == 1  # we want this to be 1


I wonder why did the count change and what should be the new expectation in the comment.

Yes, I've noticed that comment info doesn't match test values in places in master. Not sure if this is a proper behavior.

On master such difference can be found in 5 places:

https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L41

https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L69

https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L71

https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L80

https://github.com/zytedata/zyte-common-items/blob/main/tests/test_pages_price.py#L188

This PR fixes two of these:

zyte-common-items/tests/test_pages_price.py

Line 69 in 6620a15

assert page.call_count == 1 # we want this to be 1

zyte-common-items/tests/test_pages_price.py

Line 71 in 6620a15

assert page.call_count == 1 # we want this to be 1

While three are same as before:

zyte-common-items/tests/test_pages_price.py

Line 41 in 6620a15

assert page.call_count == 2 # we want this to be 1

zyte-common-items/tests/test_pages_price.py

Line 80 in 6620a15

assert page.call_count == 2 # we want this to be 1

zyte-common-items/tests/test_pages_price.py

Line 188 in 6620a15

assert page.call_count == 2 # we want this to be 1

It seems like an improvement, albeit incomplete (but fixing this for all cases seems kind of out of scope so perhaps we can live with it for now...?)

Reverted parsing strings by price_processor as noted in #99 (comment) so this is no longer relevant as this PR no longer modify test_pages_price.py.

This reverts commit 67cf862.

zyte_common_items/processors.py

kmike · 2024-08-21T13:08:13Z

zyte_common_items/processors.py

+    value = _handle_selectorlist(value)
+
+    if isinstance(value, Real):
+        return f"{value:.2f}"


What do you think about moving it to https://github.com/zytedata/zyte-parsers/blob/main/zyte_parsers/price.py?

It seems this would allow to remove the duplication in price_processor vs simple_price_processor.

We'd like to migrate our stuff quickly before certain deadlines, which is why I want to have this thing working for now and avoid coordinating between PR in yet another dependency repository (which is also why I placed proposed parsing of images into TODO). I think we can improve it later and migrate into more fitting place.

Sounds good!

tests/test_processors.py

kmike · 2024-08-21T13:49:36Z

tests/test_pages_price.py

@@ -62,21 +62,21 @@ def price(self):
    url = "https://example.com"
    page = CustomProductPage(response=HttpResponse(url=url, body=html))
    assert page.call_count == 0
-    assert page.price == "$13.2"
+    assert page.price == "13.20"


This is a big change, because it goes against the documentation - see https://zyte-common-items.readthedocs.io/en/latest/usage/field-processors.html#overview:

By design, the processors enabled by default are “transparent”: they don’t change the output of the field if the result is of the expected final type. For example, if there is a str attribute in the item, and the field returns str value, the default processor returns the value as-is.

This design decision was made for two reasons:

It makes it possible to safely add new default processors, without a risk of breaking user code. If the code was working before a processor is introduced, it means the field was returning the final data type. Processor won't touch it, so no breaking is possible.

it provides an escape hatch for the user; if you want to return a final data, you can do it just by returning the value.

It's not an issue for Real support in price parsing, for Brand processing, or for Images (it all looks good in this PR), but it's an issue for price fields.

Obviously, this "design principle" has a downside: in most cases (?), we do want to process strings by default. I think most large Scrapy projects develop this kind of utilities over time, where all values are processed by default. We do want to make it more standard.

I can see 2 ways to address it.

Start procesing final data types by default; drop this design principle.

Introduce more aggressive / complete versions of procesors. Keep "cautious" processors enabled by default, which don't handle the final data types. Make it easy to enable the complete versions - either by defining Processors class which users can use, or by providing another base ProductPage class, with processing of final values enabled by default.

I'm +0.5 for option (2), but willing to hear the opinions and arguments. I like keeping the "safe" approach, but it introduces more complexity (e.g. an additional base class, where users need to pick the right one).

cc @wRAR @Gallaecio.

I'm kind of against this transparency principle, and I've even proposed another PR #101 to ensure all strings are trimmed.

It's difficult to imagine for me why we would want to emit data with extra whitespaces and it's bothersome to always manually add some sort of default text cleaning processor into all used fields. I think this kind of processing should be applied by default at some point and would help to ensure all our output data respect a common standard.

Decided to revert this change for now, to not block the rest.

kmike · 2024-08-21T13:51:18Z

Overall, 👍, much needed changes. Having e.g. an images processor is a must.

codecov · 2024-08-21T23:11:34Z

Codecov Report

Attention: Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (cce3b94) to head (30baf8a).
Report is 6 commits behind head on main.

Files	Patch %	Lines
zyte_common_items/processors.py	0.00%	43 Missing ⚠️
zyte_common_items/pages/product.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff          @@
##            main     #99   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         54      55    +1     
  Lines       2107    2160   +53     
=====================================
- Misses      2107    2160   +53

Files	Coverage Δ
zyte_common_items/pages/product.py	`0.00% <0.00%> (ø)`
zyte_common_items/processors.py	`0.00% <0.00%> (ø)`

Nykakin · 2024-08-22T13:49:22Z

In response to #99 (comment) and after some discussion in slack I've decided to just return strings without any change in price_processor instead of parsing it, thus retaining original philosophy of transparent processors.

The discussion about this approach can continue but I do not want to block image processors and brand processor because of it.

kmike

👍

tests/test_processors.py

zyte_common_items/processors.py

kmike · 2024-08-26T08:55:52Z

Thanks @Nykakin!

Nykakin added 5 commits August 12, 2024 06:53

Update processors

78f0655

Clean-up

0e0e733

Fix typos

308db2e

Fix twinecheck exception in CI

3fc6bb2

Add trailing ","

6620a15

BurnzZ requested review from wRAR, kmike and Gallaecio August 12, 2024 23:35

Variant processor

67cf862

wRAR reviewed Aug 16, 2024

View reviewed changes

Nykakin added 2 commits August 20, 2024 17:21

Apply changes from code review

ba1a5e9

Revert "Variant processor"

71eb060

This reverts commit 67cf862.

Nykakin requested a review from wRAR August 20, 2024 15:36