Description
Thanks in advance for looking into this. I ran into this issue and couldn't find an existing report for it. Also, apologies if I'm not using the standard terminology (I'm new to Python and Scrapy). I'm a software engineer though.
While scraping the text out of a webpage, I call node.xpath('name()').get() (which returns None for text nodes) to check whether a node is just text or another HTML node. This call is part of a recursive function that walks a section of a webpage and extracts just the text.
However, for certain HTML nodes, such as nodes that only contain numeric characters, the internal parser interprets them as JSON and the call to xpath() raises an error: ValueError: Cannot use xpath on a Selector of type 'json'.
The node that is interpreted as JSON is not JSON (unless you consider it a JSON fragment; like an integer value without a key).
Also, this node came while parsing a parent node that is HTML, so I'm not sure how the parser arrived at the conclusion that it is JSON. Shouldn't it get a hint about the type from the parent node? E.g. my parent is HTML so I'm likely HTML.
Steps to Reproduce
from scrapy import Selector
text_marked_as_json = Selector(text='20')
print(f'text_marked_as_json.type = {text_marked_as_json.type}')
# Prints: text_marked_as_json.type = json. # Why is this type json?
text_marked_as_html = Selector(text='20 hello')
print(f'text_marked_as_html.type = {text_marked_as_html.type}')
# Prints: text_marked_as_html.type = html
Expected behavior:
If an HTML element contains only numeric characters, its type should be HTML.
Actual behavior:
If an HTML element contains only numeric characters, its type is JSON, even though it is not a valid JSON string.
Reproduces how often:
All the time.
Versions
Please paste here the output of executing scrapy version --verbose in the command line.
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.13.1 (v3.13.1:06714517797, Dec 3 2024, 14:00:22) [Clang 15.0.0 (clang-1500.3.9.4)]
pyOpenSSL : 24.3.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : macOS-15.1.1-arm64-arm-64bit-Mach-O
> # Prints: text_marked_as_json.type = json. # Why is this type json?
Because the text you passed is valid JSON.
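To illustrate the point, here is a minimal sketch using only Python's standard json module. The helper name parses_as_json is hypothetical, and whether the selector performs exactly this check internally is an assumption; the sketch only shows that a bare number is a complete JSON value while '20 hello' is not.

```python
import json


def parses_as_json(text: str) -> bool:
    """Return True if the standard-library JSON parser accepts `text`."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(parses_as_json("20"))        # a bare number is a complete JSON value
print(parses_as_json("20 hello"))  # trailing data after the number is rejected
```

If the input is known to be HTML, passing type='html' to Selector should skip the type detection altogether.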
> While scraping the text out of a webpage, I use the function: node.xpath('name()').get(): # Returns None for text nodes
> to check if it's just text or another HTML node.
I feel like you should change the logic for this, e.g. checking if the underlying element (.root) is an Element or a string.
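A minimal sketch of that check, assuming node is a parsel/Scrapy selector whose .root is either an lxml element or a string (lxml text results are str subclasses). The helper name is_text_node is hypothetical, not part of either library.

```python
def is_text_node(node) -> bool:
    """Return True when the selector wraps a plain string rather than an element.

    Assumes `node` exposes a `.root` attribute, as parsel/Scrapy selectors do.
    """
    return isinstance(node.root, str)
```

In the recursive function, calling is_text_node(node) before node.xpath(...) would avoid running XPath on a text node at all, so the detected selector type no longer matters for that branch.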
Additional context
Screenshot from Google Colab:
!scrapy version --verbose in Google Colab (from the screenshot):
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.2 3 Sep 2024)
cryptography : 43.0.3
Platform : Linux-6.1.85+-x86_64-with-glibc2.35