
Make the errback handling method configurable #156

Merged — 5 commits into scrapinghub:master on Feb 14, 2024

Conversation

@radostyle (Contributor) commented on Jan 11, 2024:

This patch makes the errback method that is called on the spider configurable.

By setting it to None in the configuration, the user can return Scrapy to its default exception handling.

Currently, scrapyrt has an undocumented bug: exceptions are sent to the 'parse' method.

Here is the commit where this was added:
75d2b3e

The errback should never have defaulted to the spider's 'parse' method. Doing so contradicts what the Scrapy docs say, and there is no documentation on the Scrapy site saying that exceptions get sent to the parse method. This was discovered because the error handling in the process_spider_exception middleware was never being called, even though the Scrapy docs say it should be.

The author was adding a feature so that one could pass the errback as a GET parameter, as is the case for the "callback".
It seems they copy-pasted the "callback" line to get it to work, not realizing that 'parse' was a bad default for an errback.

The correct default is to allow the exception to propagate to the existing scrapy functionality.

For backwards compatibility with anyone who relies on exceptions now being sent to their 'parse' method, this patch keeps the bug but adds some documentation, and it allows users who want the unmodified Scrapy exception handling to get it back.

From the commit message:

The workaround to keep the previous behavior, where exceptions go to the 'parse' method, is to add `&errback=parse` to the request. This allows changing the non-standard behavior of sending exceptions to the spider's `parse` method without introducing a breaking change to scrapyrt.

The commit also introduces some documentation of the existing behavior.
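For illustration, the workaround request could look like this (a sketch assuming a scrapyrt instance on the default port 9080; the spider name and target URL are hypothetical):

```python
import requests

# Hypothetical request to a local scrapyrt instance, opting back into
# the old behavior of routing exceptions to the spider's parse().
response = requests.get(
    "http://localhost:9080/crawl.json",
    params={
        "spider_name": "example",     # hypothetical spider name
        "url": "http://example.com",  # page to crawl
        "errback": "parse",           # restore the pre-patch behavior
    },
)
print(response.json())
```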
@radostyle changed the title from "errback should default to None rather than the 'parse' method" to "Make the errback handling method configurable" on Jan 15, 2024.
@akshayphilar requested a review from @pawelmhm on January 18, 2024.
@Gallaecio (Member) left a comment:

Looks good to me, and I agree that parse should never have been the default, but I don’t see what this has to do with process_spider_exception. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.

If your goal is to catch download exceptions, you might want to look into process_exception from downloader middlewares instead of process_spider_exception from spider middlewares.
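To make the distinction concrete, here is a minimal downloader-middleware sketch (the class name and logging are illustrative, not part of this PR):

```python
import logging

logger = logging.getLogger(__name__)


class DownloadErrorMiddleware:
    """Sketch of a downloader middleware: process_exception is called
    when the download handler or another downloader middleware raises,
    the same class of errors a request's errback receives."""

    def process_exception(self, request, exception, spider):
        logger.warning("Download of %s failed: %r", request.url, exception)
        # Returning None lets Scrapy keep processing the exception
        # (remaining middlewares, then the request's errback).
        return None
```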

docs/source/api.rst (review comments, outdated and resolved)

scrapyrt/core.py (outdated):

```python
if self.errback_name:
    errback = getattr(self.crawler.spider, self.errback_name)
    assert callable(errback), 'Invalid errback'
```
@pawelmhm (Member) commented on Jan 18, 2024:

The problem with this here is that, on bad parameters for errback, it will fail silently on the API side and write errors to the logs if the errback is not callable or the spider has no method matching errback_name. We could handle this better by adding and validating the errback on CrawlManager init, and then wrapping line 254 in run_crawl to throw an HTTP 400 for some types of exceptions, for example if the error is that the provided errback is not callable.

The line

`getattr(self.crawler.spider, self.errback_name)`

can also fail if the errback is not an attribute of the spider, so in that case I would also throw a 400 status code, as this would be a user error.
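A sketch of what that up-front validation could look like (the helper and exception names are hypothetical, not the code that was eventually merged):

```python
class InvalidErrback(Exception):
    """Hypothetical user-error type that the API layer could translate
    into an HTTP 400 response instead of a silent log entry."""


def get_errback(spider, errback_name):
    # Validate early, at CrawlManager init, rather than failing later
    # inside the crawl.
    if errback_name is None:
        return None  # fall back to Scrapy's default exception handling
    errback = getattr(spider, errback_name, None)
    if not callable(errback):
        raise InvalidErrback(
            f"Spider has no callable method named {errback_name!r}")
    return errback
```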

I'm not really sure we should keep backward compatibility for behavior that seems to be an error. If parse should not be the default errback, then having it as the default is not helping people. We can add a warning about this to the new release.

It also needs a unit test, so please add one.

Thank you for providing this fix. I'm happy to merge it and release it this week.

@radostyle (Contributor, Author) replied:

> Looks good to me, and I agree that parse should never have been the default, but I don’t see what this has to do with process_spider_exception. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.
>
> If your goal is to catch download exceptions, you might want to look into process_exception from downloader middlewares instead of process_spider_exception from spider middlewares.

Here is a cleanroom project where you can test it out: https://gitlab.com/jasource/scrapyrtexception. An exception thrown in the middleware's process_spider_input ends up in the parse method instead of in process_spider_exception. When not running under scrapyrt, it works as expected.
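The repro boils down to a spider middleware along these lines (a sketch of the linked project's idea, not its exact code):

```python
class ExplodingMiddleware:
    """Sketch: raise from process_spider_input and observe where the
    exception lands."""

    def process_spider_input(self, response, spider):
        # Per the Scrapy docs, this exception should be handled by
        # process_spider_exception below.
        raise RuntimeError("boom")

    def process_spider_exception(self, response, exception, spider):
        # Fires under plain Scrapy, but under scrapyrt (with errback
        # defaulted to 'parse') the exception reaches the spider's
        # parse() method instead and this method is never called.
        spider.logger.error("Caught in middleware: %r", exception)
        return []
```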

radostyle and others added 3 commits on January 22, 2024:
Currently the application does not report to the user when the user provides an invalid errback or callback method. The scheduling of the request and the validation of the spider callback and errback happen in a different thread than the one handling the API request, so we need a mechanism other than simply raising the exception to communicate with the API request thread. We already do this for other errors and responses by adding properties to the CrawlManager object, so it seems best to communicate this exception to the API request via a user_error property on CrawlManager. The exception can then be raised in the context of the API request.
Co-authored-by: Adrián Chaves <[email protected]>
Co-authored-by: Adrián Chaves <[email protected]>
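A sketch of the cross-thread reporting idea described in the commit message (the helper names are illustrative; the real CrawlManager carries much more state):

```python
class CrawlManager:
    """Simplified sketch of communicating a user error from the crawler
    thread to the API request thread via a property."""

    def __init__(self):
        self.items = []
        self.user_error = None  # written in the crawler thread, read by the API thread

    def spider_idle(self, spider):
        try:
            # Hypothetical stand-in for scheduling the request and
            # resolving the callback/errback, which may raise on bad input.
            self._schedule_request(spider)
        except Exception as exc:
            # Raising here would only surface in the crawler thread's
            # logs, so stash the error for the API request thread.
            self.user_error = exc

    def _schedule_request(self, spider):
        raise NotImplementedError("stand-in for the real scheduling logic")

    def return_items(self):
        # Runs in the context of the API request; re-raising the stashed
        # error here lets the resource respond with HTTP 400.
        if self.user_error is not None:
            raise self.user_error
        return self.items
```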
@radostyle (Contributor, Author) commented:

@pawelmhm I added code to report user errors that occur in the spider_idle method back to the API request, added a unit test, and modified existing unit tests. I'm also not sure whether we should keep backward compatibility for behavior that seems to be an error, but I'll let you make that call (in that case we can just change the default from 'parse' to None). Let me know what you think.

@pawelmhm (Member) commented:

Thanks @radostyle, let's go with a default errback of None. I'll check and release this today.

@pawelmhm (Member) commented:

I made changes here to default to None, added some more docs, and cleaned up the error message a bit so that it doesn't return the full traceback and instead logs the details to a file. #158

@pawelmhm merged commit 4fe6615 into scrapinghub:master on Feb 14, 2024.