refactor how crawling is logged #10

Merged
merged 2 commits into main from crawling-logs on Nov 2, 2023

Conversation

@BurnzZ (Contributor) commented Nov 2, 2023

See how the logs change before and after this PR.

Before:

[page: category] https://example.com/ 
NO next page link.
3 subcategory links:
- (probability=98.72%) Tech, https://example.com/categ/tech
- (probability=96.37%) Books, https://example.com/categ/books
- (probability=10.00%) [heuristics] some potential link, https://example.com/some-potential-link
5 item links:
- (probability=99.96%) product 1, https://example.com/products?id=1
- (probability=99.58%) product 2, https://example.com/products?id=2
- (probability=99.40%) product 3, https://example.com/products?id=3
- (probability=98.39%) product 4, https://example.com/products?id=4
- (probability=98.26%) product 5, https://example.com/products?id=5

After:

Crawling Logs for https://example.com/ (parsed as: productNavigation): 
Number of Requests per page type:
- product: 5
- nextPage: 0
- subCategories: 2
- productNavigation: 0
- productNavigation-heuristics: 1
- unknown: 0

Structured Logs:
{
  "time": "2023-10-10 13:13:06",
  "current": {
    "url": "https://example.com/",
    "request_url": "https://example.com/",
    "request_fingerprint": "dd66f5920d5b6ec722af27c4f75a4396c8be5e81",
    "page_type": "productNavigation",
    "probability": null
  },
  "to_crawl": {
    "product": [
      {
        "name": "product 1",
        "probability": 0.9996,
        "page_type": "product",
        "request_url": "https://example.com/products?id=1",
        "request_priority": 199,
        "request_fingerprint": "0846055c4244ebd6c72fd857ad1fcf34f0b6927e"
      },
      {
        "name": "product 2",
        "probability": 0.9958,
        "page_type": "product",
        "request_url": "https://example.com/products?id=2",
        "request_priority": 199,
        "request_fingerprint": "2dde78458cf8fc3122936f8517b37abaea9f505f"
      },
      {
        "name": "product 3",
        "probability": 0.9940,
        "page_type": "product",
        "request_url": "https://example.com/products?id=3",
        "request_priority": 199,
        "request_fingerprint": "489a2bf10667de3f01bb7ee520cd8676658120ba"
      },
      {
        "name": "product 4",
        "probability": 0.9839,
        "page_type": "product",
        "request_url": "https://example.com/products?id=4",
        "request_priority": 198,
        "request_fingerprint": "e5c7dbda491cc46fae8837acbc4a94a6fb70d664"
      },
      {
        "name": "product 5",
        "probability": 0.9826,
        "page_type": "product",
        "request_url": "https://example.com/products?id=5",
        "request_priority": 198,
        "request_fingerprint": "f2df0fb4ef3fc497577a3b0a3f27086773d01546"
      }
    ],    
    "nextPage": [],
    "subCategories": [
      {
        "name": "Tech",
        "probability": 0.9872,
        "page_type": "subCategories",
        "request_url": "https://example.com/categ/tech",
        "request_priority": 98,
        "request_fingerprint": "04bc46843f801abebe69482a6a8e64b4b71b641d"
      },
      {
        "name": "Books",
        "probability": 0.9637,
        "page_type": "subCategories",
        "request_url": "https://example.com/categ/books",
        "request_priority": 96,
        "request_fingerprint": "be7d0066ca3068a6090343ef92675fffe7ec33ce"
      }
    ],
    "productNavigation": [],
    "productNavigation-heuristics": [
      {
        "name": "some potential link",
        "probability": 0.1,
        "page_type": "productNavigation-heuristics",
        "request_url": "https://example.com/some-potential-link",
        "request_priority": 10,
        "request_fingerprint": "b0cdf0ba24cf2f15cff462358fa87f242c1b9dfc"
      }
    ],
    "unknown": []
  }
}

Problem: When running experiments on certain enhancements/changes, analyzing the results manually takes a lot of time. Manual analysis is also error-prone, and we can miss some key details.

Motivation: We can automate the analysis of these crawling experiments by making the crawling logs machine-readable and ensuring they contain enough information to understand how the crawl was performed. This enables us to create ready-made scripts or notebooks that generate crawling reports from the crawling logs, as sketched below.
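
For example, here's a minimal sketch of such a script. It assumes the structured JSON entries appear verbatim after each "Structured Logs:" marker in a saved log file; summarize_crawl_log is a hypothetical helper, not part of this PR.

import json
import re
from collections import Counter

# Hypothetical helper (not part of this PR): find every JSON blob that
# follows a "Structured Logs:" marker in a saved crawl log and tally how
# many requests were scheduled per page type across the whole crawl.
def summarize_crawl_log(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        text = f.read()
    decoder = json.JSONDecoder()
    for match in re.finditer(r"Structured Logs:\s*", text):
        # raw_decode() parses the JSON object starting right after the
        # marker and ignores whatever log text follows it.
        entry, _ = decoder.raw_decode(text, match.end())
        for page_type, requests in entry["to_crawl"].items():
            counts[page_type] += len(requests)
    return counts

for page_type, total in summarize_crawl_log("crawl.log").most_common():
    print(f"{page_type}: {total}")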

Some notes about the implementation:

  • We log separate data for url and request_url to account for redirections handled by Zyte API.
  • We use the same request_fingerprint as https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy so that the hash matches the ones inside Scrapy Cloud's Request-Tab.
  • The chosen implementation is a Spider Middleware, since it allows us to examine which Requests the spider has actually yielded after receiving a given response (see the sketch after this list).
    • We can't trust the logs from the previous implementation, since it only outputs the contents of productNavigation. Users might override the spider or introduce other middlewares that filter out some requests based on some criteria, so the crawling logs wouldn't match the actual spider requests.
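
To illustrate the last two points, here is a minimal sketch of the spider-middleware approach. It is a simplified stand-in for the actual middleware merged in this PR, showing only the request-logging part; the class name and log layout below are illustrative.

import json
import logging

from scrapy import Request
from scrapy.utils.request import request_fingerprint

logger = logging.getLogger(__name__)

# Simplified stand-in: because process_spider_output() receives the
# Requests the spider actually yielded for a given response, any
# filtering done by spider overrides or other middlewares is already
# reflected in what gets logged.
class CrawlingLogsMiddleware:
    def process_spider_output(self, response, result, spider):
        to_crawl = []
        for entry in result:
            if isinstance(entry, Request):
                to_crawl.append({
                    "request_url": entry.url,
                    "request_priority": entry.priority,
                    # Same fingerprint function used by
                    # scrapinghub-entrypoint-scrapy, so the hash matches
                    # Scrapy Cloud's Request-Tab (see the note above).
                    "request_fingerprint": request_fingerprint(entry),
                })
            yield entry
        # response.url is the final URL after any redirects, which is
        # why url and request_url are logged separately.
        logger.info(
            "Crawling Logs for %s:\n%s",
            response.url,
            json.dumps({"to_crawl": to_crawl}, indent=2),
        )

A middleware like this would be enabled through the standard SPIDER_MIDDLEWARES setting in the project settings.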

Other things we can do:

(These can be done in another PR, since this PR focuses on removing the slowdowns in how we currently analyze crawling experiments.)

  • A setting to switch the middleware on/off
  • Possibly also a way to toggle its logging level (e.g. from INFO to DEBUG)
  • Perhaps YAML is easier to read than JSON? (it doesn't have the braces, which can introduce noise)
  • Should we store the crawling logs in another resource? (though I think the default logs are good enough)
  • Handle cases where the crawling log exceeds 1 MB

@BurnzZ merged commit d1a5bbd into main on Nov 2, 2023
7 checks passed
@BurnzZ deleted the crawling-logs branch on November 2, 2023