Make the errback handling method configurable #156

Merged
merged 5 commits into from
Feb 14, 2024
Changes from 2 commits
7 changes: 7 additions & 0 deletions docs/source/api.rst
@@ -517,6 +517,13 @@ Encoding that's used to encode log messages.

Default: ``utf-8``.

DEFAULT_ERRBACK_NAME
~~~~~~~~

+The name of the default errback method to call on the spider when a request raises an exception. The default is ``parse``, kept for backwards compatibility. This is not standard Scrapy behaviour and may interfere with middlewares that implement the ``process_spider_exception`` method. Set it to ``None`` if you do not want a default errback attached to the request.

+Default: ``parse``. The spider's ``parse`` method is used to handle exceptions. Be aware that typical Scrapy spiders do not use ``parse`` as an errback.


Spider settings
---------------
4 changes: 3 additions & 1 deletion scrapyrt/conf/default_settings.py
@@ -31,4 +31,6 @@
 # disable in production
 DEBUG = True

-TWISTED_REACTOR = None
+TWISTED_REACTOR = None
+
+DEFAULT_ERRBACK_NAME = 'parse'
10 changes: 6 additions & 4 deletions scrapyrt/core.py
@@ -120,7 +120,7 @@ def __init__(self, spider_name, request_kwargs,
         # because we need to know if spider has method available
         self.callback_name = request_kwargs.pop('callback', None) or 'parse'
         # do the same for errback
-        self.errback_name = request_kwargs.pop('errback', None) or 'parse'
+        self.errback_name = request_kwargs.pop('errback', None) or app_settings.DEFAULT_ERRBACK_NAME

         if request_kwargs.get("url"):
             self.request = self.create_spider_request(deepcopy(request_kwargs))
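
To make the effect of the changed line concrete, here is a minimal sketch of the `pop(...) or default` fallback in isolation. `resolve_errback_name` is a hypothetical helper, not part of ScrapyRT; it only mirrors the expression in the diff, with the setting passed in as a plain argument instead of being read from `app_settings`.

```python
def resolve_errback_name(request_kwargs, default_name):
    """Return the errback name from the request, else the configured default.

    Hypothetical helper mirroring the expression in the diff; ScrapyRT
    itself reads the default from app_settings.DEFAULT_ERRBACK_NAME.
    """
    # `or` falls back for any falsy value, so a missing key, None and ''
    # all resolve to the default.
    return request_kwargs.pop('errback', None) or default_name


# Request names an errback explicitly: the setting is ignored.
explicit = resolve_errback_name({'errback': 'on_error'}, 'parse')
# No errback in the request: the shipped default 'parse' applies.
legacy = resolve_errback_name({}, 'parse')
# Default set to None: no errback name is resolved at all.
disabled = resolve_errback_name({}, None)
```

With the default set to `None`, the resolved name is falsy, which is what the `if self.errback_name:` guard in `spider_idle` relies on to skip attaching an errback.
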
@@ -175,9 +175,11 @@ def spider_idle(self, spider):
             assert callable(callback), 'Invalid callback'
             self.request = self.request.replace(callback=callback)

-            errback = getattr(self.crawler.spider, self.errback_name)
-            assert callable(errback), 'Invalid errback'
-            self.request = self.request.replace(errback=errback)
+
+            if self.errback_name:
+                errback = getattr(self.crawler.spider, self.errback_name)
+                assert callable(errback), 'Invalid errback'
@pawelmhm (Member), Jan 18, 2024

The problem here is that with bad errback parameters this fails silently on the API side and only writes errors to the logs, whether the errback is not callable or the spider has no attribute named by "errback_name". We could handle this better by adding and validating the errback in the CrawlManager init, and then wrapping line 254 in run_crawl so that it returns an HTTP 400 for certain exceptions, for example when an errback was provided but is not callable.

The line `getattr(self.crawler.spider, self.errback_name)` can also fail if the errback is not an attribute of the spider. In that case I would also return a 400 status code, since this is a user error.

I'm not really sure we should keep backward compatibility for behavior that seems to be an error. If parse should not be the default errback, then having it as the default is not helping people. We can add a warning about this to the new release.

It also needs a unit test, so please add one.

Thank you for providing this fix. I'm happy to merge it and release it this week.
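
One way to read the suggestion above is a small validation step that turns both failure modes into a single, catchable error. This is only a sketch: `InvalidErrback` and `get_errback` are hypothetical names, not part of ScrapyRT, and translating the exception into an HTTP 400 response is left to the calling code.

```python
class InvalidErrback(Exception):
    """Hypothetical error type for a bad errback name (a client mistake)."""


def get_errback(spider, errback_name):
    """Look up and validate an errback on the spider.

    Returns None when no errback name is configured, the bound method
    when it exists and is callable, and raises InvalidErrback otherwise.
    """
    if not errback_name:
        # Nothing configured: leave the request without an errback.
        return None
    try:
        errback = getattr(spider, errback_name)
    except AttributeError:
        raise InvalidErrback(
            f"spider has no attribute {errback_name!r}") from None
    if not callable(errback):
        raise InvalidErrback(
            f"spider attribute {errback_name!r} is not callable")
    return errback
```

Raising a dedicated exception instead of relying on `assert` keeps the check active when Python runs with `-O`, and gives the caller one exception type to map to a client error.
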

+                self.request = self.request.replace(errback=errback)
             modify_request = getattr(
                 self.crawler.spider, "modify_realtime_request", None)
             if callable(modify_request):