How to share and export consistent statistics across multiple crawlers? #966
Replies: 2 comments
-
Hello, you can make a custom instance of Statistics and pass it to both crawlers via the statistics parameter. I'm not sure I understand what you want to do with FinalStatistics, though.
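A minimal sketch of that approach (hedged: the exact way to construct Statistics varies between Crawlee versions, so on newer releases you may need a factory such as Statistics.with_default_state() instead of calling the constructor directly; the URLs are placeholders):

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def main() -> None:
    # A single Statistics instance shared by both crawlers, so counters
    # such as requests_finished accumulate across both runs.
    stats = Statistics()

    crawler_1 = PlaywrightCrawler(statistics=stats)
    crawler_2 = BeautifulSoupCrawler(statistics=stats)

    @crawler_1.router.default_handler
    async def handle_playwright(context) -> None:
        context.log.info(f'Playwright visited {context.request.url}')

    @crawler_2.router.default_handler
    async def handle_bs(context) -> None:
        context.log.info(f'BeautifulSoup visited {context.request.url}')

    # Note: both crawlers share the default request queue, which dedupes
    # by unique key, so use distinct URLs (or distinct unique keys) here.
    await crawler_1.run(['https://example.com/page-1'])
    await crawler_2.run(['https://example.com/page-2'])

    # calculate() produces a FinalStatistics snapshot of the shared state.
    print(stats.calculate())


asyncio.run(main())
```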
-
Thanks @janbuchar. Your answer works fine for me, but the runtime of the first scraper's execution is not added to the second scraper's statistics. Here are the logs:

[crawlee.statistics._statistics] INFO Statistics
┌───────────────────────────────┬─────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
│ retry_histogram │ [0] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ None │
│ requests_finished_per_minute │ 0 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 0.0 │
│ requests_total │ 0 │
│ crawler_runtime │ 0.02422 │
└───────────────────────────────┴─────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO Navigating to ...
[crawlee.crawlers._playwright._playwright_crawler] INFO --- Fetch cookies ---
[crawlee.crawlers._playwright._playwright_crawler] INFO --- End of cookies ---
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
│ retry_histogram │ [1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute │ 4 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 11.137499 │
│ requests_total │ 1 │
│ crawler_runtime │ 14.854891 │
└───────────────────────────────┴───────────┘
[rich] INFO Found 12 cookies
>>>>> Here the execution of the first scraper (Playwright) finished <<<<<<
[rich] INFO Fetching items...
[crawlee.statistics._statistics] INFO Statistics
┌───────────────────────────────┬───────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
│ retry_histogram │ [1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute │ 2543 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 11.137499 │
│ requests_total │ 1 │
│ crawler_runtime │ 0.023598 │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 0. Items remaining: 31 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 1. Items remaining: 11 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 2. Items remaining: 0 of 40
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - All items already processed
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished │ 4 │
│ requests_failed │ 0 │
│ retry_histogram │ [4] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 3.891656 │
│ requests_finished_per_minute │ 13 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 15.566624 │
│ requests_total │ 4 │
│ crawler_runtime │ 18.088441 │
└───────────────────────────────┴───────────┘

Could it be the way I'm running them? Here is a summary of the code:

http_client = ...
my_stats = ...

crawler_1 = PlaywrightCrawler(statistics=my_stats)
# ... here I store the cookies in storage/dataset/cookies
await crawler_1.run([my_url])

crawler_2 = BeautifulSoupCrawler(statistics=my_stats)
# Here I add the cookies stored before.
await crawler_2.run([my_url])
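As a side note for debugging, crawler.run() itself returns a FinalStatistics snapshot, so each run's figures can be captured and compared directly (a sketch reusing the names from the summary above):

```python
# run() returns a FinalStatistics snapshot, so each run's view can be kept.
first_run_stats = await crawler_1.run([my_url])
second_run_stats = await crawler_2.run([my_url])

# Matches the observation in the logs above: the second run's
# crawler_runtime restarts near zero instead of carrying over.
print(first_run_stats.crawler_runtime)
print(second_run_stats.crawler_runtime)
```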
-
Hi,
I want to use the same statistics for different crawlers. I have one HTTP client that I pass to two crawlers (PlaywrightCrawler and BeautifulSoupCrawler). However, when I execute these crawlers, I receive different statistics.
Additionally, I want to export these statistics as a FinalStatistics object and save them in a storage format (JSON or CSV). My goal is to manage multiple scrapers and save their statistics for later analysis.
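For the export part, one possible approach is sketched below. It assumes FinalStatistics is a dataclass whose duration fields may be datetime.timedelta values; adjust the conversion if your Crawlee version already reports plain floats:

```python
import dataclasses
import json
from datetime import timedelta


def stats_to_json(final_stats) -> str:
    """Serialize a FinalStatistics snapshot to JSON, timedeltas as seconds."""
    def convert(value):
        if isinstance(value, timedelta):
            return value.total_seconds()
        return value

    data = {key: convert(value) for key, value in dataclasses.asdict(final_stats).items()}
    return json.dumps(data, indent=2)


# Usage: write the shared statistics to a file for later analysis.
# with open('statistics.json', 'w') as f:
#     f.write(stats_to_json(my_stats.calculate()))
```

A CSV export could be built the same way, feeding the converted dict to csv.DictWriter.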