Make the errback handling method configurable #156

radostyle · 2024-01-11T23:02:25Z

This patch makes the errback method that is called on the spider configurable.

Setting it to None in the configuration lets the user restore Scrapy's default exception handling.
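As a rough illustration, a settings override along these lines would turn the default errback off. `DEFAULT_ERRBACK_NAME` is the setting this patch adds; the module name in the comment is only a placeholder, and how the module gets loaded depends on your deployment.

```python
# my_scrapyrt_settings.py (hypothetical module name)

# Name of the spider method that ScrapyRT attaches as the request errback.
# None leaves the request without an errback, so exceptions follow
# Scrapy's normal handling instead of being sent to `parse`.
DEFAULT_ERRBACK_NAME = None
```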

Currently, scrapyrt has an undocumented bug: exceptions are sent to the spider's 'parse' method.

Here is the commit where this was added:
75d2b3e

The errback should never have defaulted to the spider's 'parse' method. That default contradicts what the Scrapy docs say, and nothing on the Scrapy site documents exceptions being sent to the parse method. The problem was found because the error handling in the process_spider_exception middleware was never called, although the Scrapy docs say it should be.

The author of that commit was adding a feature to let one pass the errback as a GET parameter, as is already possible for the "callback".
It seems they copy-pasted the "callback" line to get it to work, not realizing that 'parse' was a bad default for errback.

The correct default is to allow the exception to propagate to the existing scrapy functionality.

For backwards compatibility with anyone who now relies on exceptions being sent to their 'parse' method, this patch keeps the bug but adds some documentation, and it lets users who want the unmodified Scrapy exception handling get it back.

To keep exceptions going to the `parse` method as they did before, add `&errback=parse` to the request.
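A small sketch of what such a request could look like when built from Python. The `errback` parameter comes from this pull request; the host, port, `crawl.json` path, spider name, and target URL are assumptions chosen for illustration, and `build_crawl_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed values for illustration only: the host, port, and endpoint path
# are placeholders and may differ in your ScrapyRT deployment.
SCRAPYRT_ENDPOINT = "http://localhost:9080/crawl.json"


def build_crawl_url(spider_name, url, errback=None):
    """Build a GET URL for the crawl endpoint, optionally naming an errback."""
    params = {"spider_name": spider_name, "url": url}
    if errback is not None:
        # Explicitly route exceptions to the named spider method.
        params["errback"] = errback
    return f"{SCRAPYRT_ENDPOINT}?{urlencode(params)}"


print(build_crawl_url("example", "https://example.com", errback="parse"))
```

Leaving `errback` out of the call produces a URL without the parameter, so the configured default applies.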

This makes it possible to change the non-standard behavior of sending exceptions to the spider's `parse` method without introducing a breaking change to scrapyrt. It also documents the existing behavior.

Gallaecio

Looks good to me, and I agree that parse should never have been the default, but I don’t see what this has to do with process_spider_exception. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.

If your goal is to catch download exceptions, you might want to look into process_exception from downloader middlewares instead of process_spider_exception from spider middlewares.
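The distinction above can be sketched with two middleware skeletons. This is a minimal sketch, not code from this pull request: the class names and the `seen` lists are hypothetical, and the method names and signatures follow the Scrapy middleware documentation as I understand it. Scrapy middlewares are plain classes, so nothing is imported here.

```python
class DownloadErrorLogger:
    """Downloader middleware sketch: sees exceptions raised while downloading."""

    def __init__(self):
        self.seen = []

    def process_exception(self, request, exception, spider):
        # Called when the download handler or another downloader middleware
        # raises. Returning None lets other handlers keep processing it.
        self.seen.append(("download", type(exception).__name__))
        return None


class CallbackErrorLogger:
    """Spider middleware sketch: sees exceptions raised by spider callbacks."""

    def __init__(self):
        self.seen = []

    def process_spider_exception(self, response, exception, spider):
        # Called when a callback or an earlier spider middleware raises.
        # Returning None lets the exception continue through the chain.
        self.seen.append(("callback", type(exception).__name__))
        return None
```

Each class would still need to be enabled in the project's middleware settings before Scrapy calls it.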

docs/source/api.rst

pawelmhm · 2024-01-18T09:01:41Z

scrapyrt/core.py

+
+            if self.errback_name:
+                errback = getattr(self.crawler.spider, self.errback_name)
+                assert callable(errback), 'Invalid errback'


The problem here is that a bad errback parameter fails silently on the API side and only writes errors to the logs, whether the errback is not callable or the spider has no attribute matching errback_name. We could handle this better by adding and validating the errback on CrawlManager init, and then wrapping line 254 in run_crawl to return an HTTP 400 for certain exceptions, for example when the provided errback is not callable.

The line

getattr(self.crawler.spider, self.errback_name)

can also fail if the errback is not an attribute of the spider, so in this case I would also return a 400 status code, as this would be a user error.
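One way the validation described above could look, as a sketch only. `UserError`, `resolve_spider_method`, `status_for`, and `DemoSpider` are hypothetical names, not part of scrapyrt; the point is to replace the bare `getattr` and the `assert` with checks that can be mapped to a 400 response.

```python
class UserError(Exception):
    """Hypothetical exception for mistakes in the API request parameters."""


def resolve_spider_method(spider, name):
    """Return the spider method called `name`, or raise UserError."""
    # A default of None avoids the AttributeError that a bare getattr raises.
    method = getattr(spider, name, None)
    if not callable(method):
        # An explicit check also survives `python -O`, which strips asserts.
        raise UserError(f"Invalid errback: {name!r}")
    return method


def status_for(spider, errback_name):
    """Map the validation result to an HTTP status code."""
    try:
        resolve_spider_method(spider, errback_name)
    except UserError:
        return 400  # the caller named a missing or non-callable errback
    return 200


class DemoSpider:
    not_a_method = "just a string"

    def parse(self, response):
        return []
```

Here `status_for(DemoSpider(), "parse")` passes validation, while a missing attribute or a non-callable one such as `not_a_method` is reported as a user error.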

I'm not really sure we should keep backward compatibility for behavior that seems to be an error. If parse should not be the default errback, then having it as the default is not helping people. We can add a warning about this to the new release.

It also needs a unit test, so please add one.

Thank you for providing this fix. I'm happy to merge it and release it this week.

radostyle · 2024-01-18T17:22:45Z

> Looks good to me, and I agree that parse should never have been the default, but I don’t see what this has to do with process_spider_exception. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.
>
> If your goal is to catch download exceptions, you might want to look into process_exception from downloader middlewares instead of process_spider_exception from spider middlewares.

Here is a cleanroom project where you can test it out. An exception is thrown in the middleware's process_spider_input and ends up in the parse method instead of in process_spider_exception. When not running under scrapyrt, it works as expected.
https://gitlab.com/jasource/scrapyrtexception

Currently the application is not reporting to the user when the user provides an invalid errback or callback method. The scheduling of the request and validation of the spider callback and errback happens in a different thread than the one which is handling the api request. So, we need a different mechanism to communicate with the api request thread than simply raising the exception. We already do this for other errors and responses by adding properties to the CrawlManager object. So it seems best to also communicate this exception to the api request by using a user_error property on the CrawlManager. Then the exception can be raised in the context of the api request.
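The hand-off described above might be sketched as follows. This is an illustration under assumptions, not scrapyrt's code: `CrawlManagerSketch`, `UserError`, `handle_api_request`, and `DemoSpider` are hypothetical, and a plain `threading.Thread` stands in for whatever actually runs the crawl. Only the `user_error` property and the `spider_idle` method name come from the discussion.

```python
import threading


class UserError(Exception):
    """Hypothetical exception for mistakes in the API request parameters."""


class CrawlManagerSketch:
    def __init__(self, errback_name):
        self.errback_name = errback_name
        self.user_error = None  # set by the crawl side, read by the API side

    def spider_idle(self, spider):
        if self.errback_name is None:
            return  # no errback requested, nothing to validate
        errback = getattr(spider, self.errback_name, None)
        if not callable(errback):
            # Raising here would never reach the API request,
            # so the error is recorded on the manager instead.
            self.user_error = UserError(f"Invalid errback: {self.errback_name!r}")


def handle_api_request(manager, spider):
    worker = threading.Thread(target=manager.spider_idle, args=(spider,))
    worker.start()
    worker.join()
    if manager.user_error is not None:
        # Re-raise in the context of the API request.
        raise manager.user_error
    return {"status": "ok"}


class DemoSpider:
    def parse(self, response):
        return []
```

With this shape, the code serving the API request can catch `UserError` and turn it into a 400 response, while a valid errback name or None goes through unchanged.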

Co-authored-by: Adrián Chaves <[email protected]>

radostyle · 2024-01-22T19:41:47Z

@pawelmhm I added some code to report user errors that occur in the spider_idle method to the api request, added a unit test, and modified existing unit tests. I am also not sure we should keep backward compatibility for behavior that seems to be an error, but I'll let you make that call (in that case we can just change the default to None instead of 'parse'). Let me know what you think.

pawelmhm · 2024-02-14T06:02:19Z

thanks @radostyle let's go with default errback None, I'll check and release this today.

pawelmhm · 2024-02-14T08:46:48Z

I made changes to default to None here, added some more docs, and also cleaned up the error message a bit so that it no longer returns the full traceback and instead logs the info to a file. #158

radostyle added 2 commits January 11, 2024 16:54

Add a DEFAULT_ERRBACK_NAME to settings

40aa643

radostyle force-pushed the master branch from c298b4b to 40aa643 Compare January 15, 2024 21:39

radostyle changed the title ~~errback should default to None rather than the 'parse' method~~ Make the errback handling method configurable Jan 15, 2024

akshayphilar requested a review from pawelmhm January 18, 2024 01:12

Gallaecio reviewed Jan 18, 2024

Gallaecio reviewed Jan 18, 2024

pawelmhm reviewed Jan 18, 2024

radostyle and others added 3 commits January 22, 2024 13:23

Update docs/source/api.rst

13394dd

Co-authored-by: Adrián Chaves <[email protected]>

Update docs/source/api.rst

4fe6615

Co-authored-by: Adrián Chaves <[email protected]>

pawelmhm merged commit 4fe6615 into scrapinghub:master Feb 14, 2024
