Text nodes with only numeric characters are treated as JSON nodes (despite not being valid JSON and being a child node of valid HTML) #310

Open
neirar opened this issue Dec 30, 2024 · 1 comment

neirar commented Dec 30, 2024

Description

Thanks in advance for looking into this. I ran into this issue and couldn't find an existing report for it. Apologies if I'm not using the standard terminology: I'm a software engineer, but new to Python and Scrapy.

While scraping the text out of a webpage, I use the function:
node.xpath('name()').get(): # Returns None for text nodes
to check if it's just text or another HTML node. This function call is part of a recursive function to iterate through a section of a webpage and extract just the text.

However, for certain HTML nodes, such as nodes that only contain numeric characters, the internal parser interprets them as JSON and the call to xpath() raises an error: ValueError: Cannot use xpath on a Selector of type 'json'.

The node that is interpreted as JSON is not JSON (unless you consider it a JSON fragment; like an integer value without a key).

Also, this node came while parsing a parent node that is HTML, so I'm not sure how the parser arrived at the conclusion that it is JSON. Shouldn't it get a hint about the type from the parent node? E.g. my parent is HTML so I'm likely HTML.

Steps to Reproduce

from scrapy import Selector
text_marked_as_json = Selector(text='20')
print(f'text_marked_as_json.type = {text_marked_as_json.type}')
# Prints: text_marked_as_json.type = json. # Why is this type json?

text_marked_as_html = Selector(text='20 hello')
print(f'text_marked_as_html.type = {text_marked_as_html.type}')
# Prints: text_marked_as_html.type = html

Expected behavior:
If an HTML element contains only numeric characters, its type should be HTML.

Actual behavior:
If an HTML element contains only numeric characters, its type is JSON, even though it is not a valid JSON string.

Reproduces how often:
All the time.

Versions

Please paste here the output of executing scrapy version --verbose in the command line.
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.13.1 (v3.13.1:06714517797, Dec 3 2024, 14:00:22) [Clang 15.0.0 (clang-1500.3.9.4)]
pyOpenSSL : 24.3.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : macOS-15.1.1-arm64-arm-64bit-Mach-O

Additional context

Screenshot from Google Colab:
Screenshot 2024-12-29 at 6 11 16 PM

!scrapy version --verbose in Google Colab (from the screenshot)

Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.2 3 Sep 2024)
cryptography : 43.0.3
Platform : Linux-6.1.85+-x86_64-with-glibc2.35

wRAR transferred this issue from scrapy/scrapy Dec 30, 2024
Member

wRAR commented Dec 30, 2024

> # Prints: text_marked_as_json.type = json. # Why is this type json?

Because the text you passed is valid JSON.
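
To illustrate that point with the standard library alone: a bare number is a complete JSON document, while a number followed by a word is not. The helper name `is_valid_json` below is hypothetical, and parsel's own detection logic may differ in detail; this is only a sketch of why '20' and '20 hello' end up classified differently.

```python
import json


def is_valid_json(text: str) -> bool:
    """Hypothetical helper: report whether `text` parses as a JSON document."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True


# A bare integer is valid JSON on its own; trailing non-JSON text is not.
print(is_valid_json("20"))
print(is_valid_json("20 hello"))
```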

> While scraping the text out of a webpage, I use the function:
> node.xpath('name()').get(): # Returns None for text nodes
> to check if it's just text or another HTML node.

I feel like you should change the logic for this, e.g. checking if the underlying element (.root) is an Element or a string.
