Text nodes with only numeric characters are treated as JSON nodes (despite not being valid JSON and being a child node of valid HTML) #310

neirar · 2024-12-30T02:18:47Z

Description

Thanks in advance for looking into this. I ran into this issue and couldn't find an existing report for it. Also, apologies if I'm not using the standard terminology, as I'm new to Python and Scrapy. I'm a software engineer though.

While scraping the text out of a webpage, I use the function:
node.xpath('name()').get(): # Returns None for text nodes
to check if it's just text or another HTML node. This function call is part of a recursive function to iterate through a section of a webpage and extract just the text.

However, for certain HTML nodes, such as nodes that only contain numeric characters, the internal parser interprets them as JSON and the call to xpath() raises an error: ValueError: Cannot use xpath on a Selector of type 'json'.

The node that is interpreted as JSON is not JSON (unless you consider it a JSON fragment; like an integer value without a key).

Also, this node came while parsing a parent node that is HTML, so I'm not sure how the parser arrived at the conclusion that it is JSON. Shouldn't it get a hint about the type from the parent node? E.g. my parent is HTML so I'm likely HTML.

Steps to Reproduce

from scrapy import Selector
text_marked_as_json = Selector(text='20')
print(f'text_marked_as_json.type = {text_marked_as_json.type}')
# Prints: text_marked_as_json.type = json. # Why is this type json?

text_marked_as_html = Selector(text='20 hello')
print(f'text_marked_as_html.type = {text_marked_as_html.type}')
# Prints: text_marked_as_html.type = html

Expected behavior:
If an HTML element contains only numeric characters, its type should be HTML.

Actual behavior:
If an HTML element contains only numeric characters, its type is JSON, even though it is not a valid JSON string.

Reproduces how often:
All the time.

Versions

Please paste here the output of executing scrapy version --verbose in the command line.
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.13.1 (v3.13.1:06714517797, Dec 3 2024, 14:00:22) [Clang 15.0.0 (clang-1500.3.9.4)]
pyOpenSSL : 24.3.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : macOS-15.1.1-arm64-arm-64bit-Mach-O

Additional context

Screenshot from Google Colab:

!scrapy version --verbose in Google Colab (from the screenshot)

Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.2 3 Sep 2024)
cryptography : 43.0.3
Platform : Linux-6.1.85+-x86_64-with-glibc2.35


wRAR · 2024-12-30T06:57:14Z

# Prints: text_marked_as_json.type = json. # Why is this type json?

Because the text you passed is valid JSON.
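To illustrate the point about valid JSON, here is a minimal sketch using only the standard library `json` module. The helper name `parses_as_json` is made up for this example, and the sketch does not claim to reproduce parsel's actual type detection; it only shows that a bare number is a complete JSON document while `20 hello` is not.

```python
import json


def parses_as_json(text):
    """Return True if `text` is a complete, valid JSON document."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# A bare number is valid JSON on its own (it decodes to the integer 20).
print(parses_as_json("20"))        # True
# Trailing non-JSON text makes the whole string invalid.
print(parses_as_json("20 hello"))  # False
```

This matches the two results in the report: `'20'` is detected as `json`, `'20 hello'` falls back to `html`.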

While scraping the text out of a webpage, I use the function:
node.xpath('name()').get(): # Returns None for text nodes
to check if it's just text or another HTML node.

I feel like you should change the logic for this, e.g. checking if the underlying element (.root) is an Element or a string.
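A rough sketch of what that suggestion could look like, assuming that a selector for a text node keeps the matched string in `.root` while element selectors keep an lxml element there. The helper names `is_text_node` and `collect_text` are hypothetical, and the `node()` XPath used to walk children is a choice made for this example, not something stated in the thread.

```python
def is_text_node(node):
    """Return True if the selector wraps a plain string rather than an element.

    Assumption: text nodes expose a `str` (or a `str` subclass) in `.root`.
    """
    return isinstance(node.root, str)


def collect_text(node):
    """Recursively gather the text of `node` and its descendants, in document order."""
    if is_text_node(node):
        # Read the string directly; no xpath() call is made on the text node,
        # so its detected type (html or json) no longer matters here.
        return [str(node.root)]
    parts = []
    for child in node.xpath("node()"):  # direct children: elements, text, comments
        parts.extend(collect_text(child))
    return parts
```

With a selector such as `Selector(text='<div><p>20</p> hello</div>')`, calling `collect_text(sel)` would be expected to return the text pieces without ever calling `xpath()` on a text node, which is where the `ValueError` in the report came from.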

wRAR transferred this issue from scrapy/scrapy Dec 30, 2024
