How to share and export consistent statistics across multiple crawlers? #966
Replies: 2 comments
-
Hello, you can make a custom instance of Statistics and pass it to both crawlers via the statistics parameter. I'm not sure I understand what you want to do with FinalStatistics, though.
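A minimal sketch of that approach (hedged: the exact way to construct Statistics varies between Crawlee versions, so on newer releases you may need a factory such as Statistics.with_default_state() instead of calling the constructor directly; the URLs are placeholders):

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def main() -> None:
    # A single Statistics instance shared by both crawlers, so counters
    # such as requests_finished accumulate across both runs.
    stats = Statistics()

    crawler_1 = PlaywrightCrawler(statistics=stats)
    crawler_2 = BeautifulSoupCrawler(statistics=stats)

    @crawler_1.router.default_handler
    async def handle_playwright(context) -> None:
        context.log.info(f'Playwright visited {context.request.url}')

    @crawler_2.router.default_handler
    async def handle_bs(context) -> None:
        context.log.info(f'BeautifulSoup visited {context.request.url}')

    # Note: both crawlers share the default request queue, which dedupes
    # by unique key, so use distinct URLs (or distinct unique keys) here.
    await crawler_1.run(['https://example.com/page-1'])
    await crawler_2.run(['https://example.com/page-2'])

    # calculate() produces a FinalStatistics snapshot of the shared state.
    print(stats.calculate())


asyncio.run(main())
```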
-
Thanks @janbuchar. Your answer works fine for me, but the runtime of the first scraper's execution is not added to the second scraper's statistics. Here are the logs:

[crawlee.statistics._statistics] INFO Statistics
┌───────────────────────────────┬─────────┐
│ requests_finished │ 0 │
│ requests_failed │ 0 │
│ retry_histogram │ [0] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ None │
│ requests_finished_per_minute │ 0 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 0.0 │
│ requests_total │ 0 │
│ crawler_runtime │ 0.02422 │
└───────────────────────────────┴─────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO Navigating to ...
[crawlee.crawlers._playwright._playwright_crawler] INFO --- Fetch cookies ---
[crawlee.crawlers._playwright._playwright_crawler] INFO --- End of cookies ---
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
│ retry_histogram │ [1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute │ 4 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 11.137499 │
│ requests_total │ 1 │
│ crawler_runtime │ 14.854891 │
└───────────────────────────────┴───────────┘
[rich] INFO Found 12 cookies
>>>>> Here the execution of the first scraper (Playwright) finished <<<<<<
[rich] INFO Fetching items...
[crawlee.statistics._statistics] INFO Statistics
┌───────────────────────────────┬───────────┐
│ requests_finished │ 1 │
│ requests_failed │ 0 │
│ retry_histogram │ [1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute │ 2543 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 11.137499 │
│ requests_total │ 1 │
│ crawler_runtime │ 0.023598 │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 0. Items remaining: 31 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 1. Items remaining: 11 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 2. Items remaining: 0 of 40
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - All items already processed
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished │ 4 │
│ requests_failed │ 0 │
│ retry_histogram │ [4] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 3.891656 │
│ requests_finished_per_minute │ 13 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 15.566624 │
│ requests_total │ 4 │
│ crawler_runtime │ 18.088441 │
└───────────────────────────────┴───────────┘

Could it be the way I'm running them? Here is a summary of the code:

http_client = ...
my_stats = ...

crawler_1 = PlaywrightCrawler(statistics=my_stats)
# ... here I store the cookies in storage/dataset/cookies
await crawler_1.run([my_url])

crawler_2 = BeautifulSoupCrawler(statistics=my_stats)
# Here I add the cookies stored before.
await crawler_2.run([my_url])
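As a side note for debugging, crawler.run() itself returns a FinalStatistics snapshot, so each run's figures can be captured and compared directly (a sketch reusing the names from the summary above):

```python
# run() returns a FinalStatistics snapshot, so each run's view can be kept.
first_run_stats = await crawler_1.run([my_url])
second_run_stats = await crawler_2.run([my_url])

# Matches the observation in the logs above: the second run's
# crawler_runtime restarts near zero instead of carrying over.
print(first_run_stats.crawler_runtime)
print(second_run_stats.crawler_runtime)
```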
-
Hi,
I want to use the same statistics for different crawlers. I have one HTTP client that I pass to two crawlers (PlaywrightCrawler and BeautifulSoupCrawler). However, when I execute these crawlers, I receive different statistics.
Additionally, I want to export these statistics as a FinalStatistics object and save them in a storage format (JSON or CSV). My goal is to manage multiple scrapers and save their statistics for later analysis.
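For the export part, one possible approach is sketched below. It assumes FinalStatistics is a dataclass whose duration fields may be datetime.timedelta values; adjust the conversion if your Crawlee version already reports plain floats:

```python
import dataclasses
import json
from datetime import timedelta


def stats_to_json(final_stats) -> str:
    """Serialize a FinalStatistics snapshot to JSON, timedeltas as seconds."""
    def convert(value):
        if isinstance(value, timedelta):
            return value.total_seconds()
        return value

    data = {key: convert(value) for key, value in dataclasses.asdict(final_stats).items()}
    return json.dumps(data, indent=2)


# Usage: write the shared statistics to a file for later analysis.
# with open('statistics.json', 'w') as f:
#     f.write(stats_to_json(my_stats.calculate()))
```

A CSV export could be built the same way, feeding the converted dict to csv.DictWriter.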