Add new crawler? (selenium-driverless) #882

mido-99 · 2025-01-08T01:24:08Z

Hi,
First, thank you so much for your effort on this package; it's really great.

My question is: is there any plan to add new crawlers, such as one based on selenium-driverless, at this time? Recently I've been trying PlaywrightCrawler on many websites, and some of them detect it and trigger Cloudflare, whereas the same sites tested with selenium-driverless let it through easily.

For example, take this site with the following code:

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee import Request
import asyncio

async def main():
    crawler = PlaywrightCrawler(
        browser_type='chromium',
        headless=False,
        max_requests_per_crawl=5,
        max_request_retries=1,
    )

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext):
        context.log.info(f"Processing {context.page.url}")

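        # Collect the product links from the listing page.
        products = await context.page.locator('a.button.product_type_variable').all()
        hrefs = [await product.get_attribute('href') for product in products]
        # get_attribute() returns None when the attribute is missing, so skip those.
        products_links = [href for href in hrefs if href]
        await context.add_requests(
            [
                Request.from_url(
                    url=url,
                    label='product',
                    user_data={'parent': context.request.loaded_url},
                )
                for url in products_links
            ]
        )
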
    @crawler.router.handler('product')
    async def product_handler(context: PlaywrightCrawlingContext):
        context.log.info(f"Product {await context.page.title()}")

        name = await context.page.wait_for_selector('h1.product_title')
        price = await context.page.wait_for_selector('p.price bdi')
        # image = await context.page.wait_for_selector('img.wp-post-image')

        await context.push_data(
            {
                'url': context.request.loaded_url,
                'name': await name.inner_text() if name else None,
                'price': await price.inner_text() if price else None,
                # 'image': await image.get_attribute('src') if image else None,
                'parent': context.request.user_data['parent'],
            }
        )
    
    await crawler.run(['https://phones.mintmobile.com/'])

if __name__ == '__main__':
    asyncio.run(main())

Expected behavior

The crawler should visit the homepage, extract the product links, then visit each product and extract its data. Very basic crawling.

Actual behavior

The homepage is visited successfully, but all further requests are blocked by Cloudflare.

What I tried

Mimicking the crawler's logic with selenium-driverless: it worked fine, with no detection.
Mimicking the logic in native Playwright, but opening a new context (not just a new page) for each request: it worked fine (of course, this is slightly more resource-intensive).
I haven't tested a proxy, but I'm sure it can help.
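
The second workaround above (a new context per request in native Playwright) could look roughly like the sketch below. It assumes Playwright's async Python API; the names `scrape_with_fresh_contexts` and `get_title` are hypothetical helpers made up for illustration, and the original test script may well have been structured differently.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def scrape_with_fresh_contexts(
    browser: Any,
    urls: list[str],
    extract: Callable[[Any], Awaitable[Any]],
) -> list[Any]:
    """Visit each URL in its own browser context (hypothetical helper)."""
    results = []
    for url in urls:
        # A new context starts with its own empty cookies and storage.
        context = await browser.new_context()
        try:
            page = await context.new_page()
            await page.goto(url)
            results.append(await extract(page))
        finally:
            # Closing the context also closes the pages opened in it.
            await context.close()
    return results


async def get_title(page: Any) -> str:
    """Example extractor (hypothetical): return the page title."""
    return await page.title()


async def main() -> None:
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        try:
            titles = await scrape_with_fresh_contexts(
                browser, ['https://phones.mintmobile.com/'], get_title
            )
            print(titles)
        finally:
            await browser.close()
```

Running it is a matter of calling `asyncio.run(main())`. The cost is one context per request instead of one page, which is where the extra resource usage comes from.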

Of course, adding a whole new crawler is a headache and not that simple, but I'd like to hear your opinion.
If you can suggest any flags I can pass to PlaywrightCrawler to reduce detection, I'd be very thankful too.

Thanks in advance.

janbuchar · 2025-01-08T08:49:02Z

Maintainer

Hello, and thanks for your interest in Crawlee! We haven't looked into selenium-driverless yet, but at a glance, it looks interesting. Could you explain how it's different from "regular" selenium, as far as scraping/crawling is concerned?

By the way, a feature request for a Selenium-based crawler already exists - #284, but it doesn't seem to get much traction.

6 replies

Mantisus Jan 9, 2025
Collaborator

Hi @mido-99

That's a really interesting option.

But if your main goal is to avoid detection by Cloudflare and similar systems, then consider the guide on integrating Camoufox into PlaywrightCrawler, and also keep an eye on the progress of PR #829.

I tested PlaywrightCrawler with Camoufox on some pretty complex cases, and it performed well.

janbuchar Jan 9, 2025
Maintainer

So if I understand it correctly, selenium-driverless is used differently from plain old selenium, right? And implementing a SeleniumDriverlessCrawler would not satisfy those who want SeleniumCrawler?

mido-99 Jan 12, 2025
Author

Hi @Mantisus

Thank you for pointing me to the Camoufox integration. It really works well for bypassing detection in many cases, but its one issue is that we can't make good use of concurrency, as it quickly takes a lot of memory if we increase this:

max_open_pages_per_browser=1,  # Increase, if camoufox can handle it in your use case.

That didn't happen when I tried concurrency with selenium-driverless.
But maybe this only affects me, as I have only 8 GB of RAM.

Anyway, thank you for the hint.

mido-99 Jan 12, 2025
Author

@janbuchar Well, I don't know exactly what those who want SeleniumCrawler need.

If they just want it for the Selenium syntax, then yes, it will of course be enough, with better anti-bot bypassing on top. But if they need it for specific Selenium features (e.g. the remote driver, selenium.webdriver.Remote), then I don't think selenium-driverless supports this, and indeed there are some other small differences.

We need to clarify what a SeleniumCrawler should do in order to know the answer.

janbuchar Jan 13, 2025
Maintainer

Okay, so we'd make a SeleniumDriverlessCrawler and keep the door open for a future SeleniumCrawler if it's requested by a significant number of potential users.

By the way, we'd also have to look into potential licensing issues with selenium-driverless, as it's free for non-commercial uses only (not a lawyer, but it looks like that's the case).
