
Search Page returns empty through scrapyrt only #116

Open
keyiyek opened this issue Dec 9, 2020 · 3 comments
Labels: more info needed (original poster should provide more details to allow us to identify the problem)

Comments


keyiyek commented Dec 9, 2020

(Sorry, I can't find how to label this.)
I hope this is the right place to ask this.

I created a spider that scrapes a page on an e-commerce site and gathers data on the different items.
The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).

But when I run it through scrapyrt, the specific page works fine while the search page returns 0 items. No errors, just 0 items. This occurs on 2 different sites with 2 different spiders.

Is there something specific to search pages that has to be taken into account when using scrapyrt?

@pawelmhm
Member

Can you post your spider code? I don't see a way to reproduce this without it. Try to pinpoint the problem so that there is a small spider sample running in raw ScrapyRT (without any middlewares, pipelines or other stuff from your project interfering). This way we can see whether the problem is on the ScrapyRT side.

pawelmhm added the "more info needed" label on Jan 29, 2021

keyiyek commented Jan 29, 2021

Yes, sure.

So, my spider, stripped of all other stuff, looks like this:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print("Found", len(response.css("article")), "items")
        for article in response.css("article"):
            print("Item:", article.css("img::attr(title)").get())
```

and I set `ROBOTSTXT_OBEY = False`.

when I do

`scrapy crawl minimal`

I get 20 items in the response, but if I go

curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

I get 0 items, no error, just 0 items.
I wonder if, in some way, it returns the results before the page gets completely loaded?

(sorry couldn't get the markup to work correctly)
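
One detail worth checking here (a general observation about query-string parsing, not something stated in the comment above) is how that curl URL is split on the server side: the un-encoded `&` characters inside the target URL also act as separators for crawl.json's own parameters, so the `url` argument ScrapyRT receives may stop at the first `&`. A minimal sketch of that parsing, using only the standard library:

```python
from urllib.parse import parse_qs

# The query part of the curl command above, as a server-side parser would see it.
raw = ("spider_name=minimal"
       "&url=https://www.dungeondice.it/ricerca?controller=search"
       "&s=ticket+to+ride")

print(parse_qs(raw)["url"])
# ['https://www.dungeondice.it/ricerca?controller=search']
# 's=ticket+to+ride' ends up as a separate parameter, so the spider would crawl
# the search page with no search terms -- which could plausibly return 0 items.
```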


Yansuko commented Feb 3, 2022

It seems to happen when there is an '&' in the URL:
scrapyrt splits it right before the '&'.
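
If that is indeed the cause, percent-encoding the target URL before embedding it in the crawl.json query string should avoid the split. A rough sketch, assuming the `requests` library, the same localhost:9081 endpoint used above, and ScrapyRT's usual JSON response containing an `items` list:

```python
import requests                      # assumed available; any HTTP client works
from urllib.parse import urlencode

# Target URL and endpoint taken from the comments above.
target = "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

# requests percent-encodes the parameter value, so '&' becomes '%26' and the
# full search URL reaches ScrapyRT as a single 'url' argument.
resp = requests.get(
    "http://localhost:9081/crawl.json",
    params={"spider_name": "minimal", "url": target},
)
print(len(resp.json().get("items", [])))

# The same encoding applied to a hand-built URL (usable with curl):
print("http://localhost:9081/crawl.json?" +
      urlencode({"spider_name": "minimal", "url": target}))
```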
