
TODO: https://doc.rust-lang.org/rustdoc/

database schema

crawl job

  • get job with lease (sketched below)
  • update lease (could be done by the work manager by observing the URL frontier of the crawl job?)
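
A minimal leasing sketch, assuming a crawl_job table with id and lease_until columns (both names are placeholders) and tokio-postgres as the client; FOR UPDATE SKIP LOCKED keeps concurrent workers from grabbing the same job:

#+begin_src rust
use tokio_postgres::Client;

// Assumed schema: crawl_job(id BIGINT, lease_until TIMESTAMPTZ, ...).
// FOR UPDATE SKIP LOCKED makes concurrent workers skip rows that are
// already locked by another leasing transaction.
async fn lease_job(client: &Client) -> Result<Option<i64>, tokio_postgres::Error> {
    let row = client
        .query_opt(
            "UPDATE crawl_job
                SET lease_until = now() + interval '60 seconds'
              WHERE id = (SELECT id FROM crawl_job
                           WHERE lease_until IS NULL OR lease_until < now()
                           ORDER BY id
                           FOR UPDATE SKIP LOCKED
                           LIMIT 1)
              RETURNING id",
            &[],
        )
        .await?;
    Ok(row.map(|r| r.get::<_, i64>(0)))
}
#+end_src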

url frontier

url priority

  • domain-crossing hops instead of external depth
  • track depth in parallel to priority to get an absolute crawl frontier
  • signals (combined in the scoring sketch after this list):
    • sitemap.xml itself and linked sitemaps have the constant priority 1
    • priority of the find location (the page the URL was found on)
    • number of URLs on the find location
    • number of query params
    • number of path elements
    • outlink context
  • properties
    • priority from sitemaps is in the range 0 < p <= 1
    • query params are worse than path elements
    • remaining depth goes down
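
One way the signals and properties could combine, as a sketch; every weight below is a placeholder, not a decided formula:

#+begin_src rust
/// Signals from the list above; field names are placeholders.
struct UrlSignals {
    sitemap_priority: Option<f64>, // 0 < p <= 1 when the URL came from a sitemap
    find_location_priority: f64,   // priority of the page the URL was found on
    query_params: usize,
    path_elements: usize,
    domain_hops: u32,              // domain-crossing hops instead of external depth
}

fn priority(s: &UrlSignals) -> f64 {
    // sitemap.xml and linked sitemaps themselves get the constant priority 1
    // elsewhere; URLs listed in one inherit the sitemap's priority value.
    let base = s.sitemap_priority.unwrap_or(0.5 * s.find_location_priority);
    // query params are worse than path elements, so they are penalized harder
    let shape_penalty = 0.05 * s.path_elements as f64 + 0.15 * s.query_params as f64;
    let hop_penalty = 0.2 * s.domain_hops as f64;
    (base - shape_penalty - hop_penalty).clamp(0.0, 1.0)
}
#+end_src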

url to crawl map

  • work stealing possible if crawl A already crawls a URL that is external to crawl B? Or rather work injection, if a crawl for that domain already exists?

queries

  • insert initial URL(s)
  • insert URLs found (see the query sketches after this list)
    • only if the URL does not exist yet
  • get next uncrawled URL
    • order by priority
    • round robin over URLs
  • set URL to crawled
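
Sketches of those queries against an assumed url_frontier table (url, domain, priority, crawled_at; all names placeholders). ON CONFLICT DO NOTHING covers the "only if the URL does not exist yet" case, and DISTINCT ON (domain) approximates the round robin by returning at most one URL per domain per batch:

#+begin_src rust
// insert URLs found, only if the URL does not exist yet
const INSERT_URL: &str =
    "INSERT INTO url_frontier (url, domain, priority)
     VALUES ($1, $2, $3)
     ON CONFLICT (url) DO NOTHING";

// get next uncrawled URLs: the highest-priority URL of every domain
const NEXT_URLS: &str =
    "SELECT DISTINCT ON (domain) url
       FROM url_frontier
      WHERE crawled_at IS NULL
      ORDER BY domain, priority DESC";

// set URL to crawled
const MARK_CRAWLED: &str =
    "UPDATE url_frontier SET crawled_at = now() WHERE url = $1";
#+end_src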

crawl archive

  • get archived pages for a URL
  • get all archived URLs for a domain below a path (both sketched below)
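
Possible shapes for the two archive queries, assuming a crawl_archive table with url, domain, path, fetched_at, and body columns (all placeholders):

#+begin_src rust
// get archived pages for a URL, newest capture first
const PAGES_FOR_URL: &str =
    "SELECT fetched_at, body FROM crawl_archive
      WHERE url = $1 ORDER BY fetched_at DESC";

// get all archived URLs for a domain below a path prefix
const URLS_BELOW_PATH: &str =
    "SELECT DISTINCT url FROM crawl_archive
      WHERE domain = $1 AND path LIKE $2 || '%'";
#+end_src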

crates to use

MIME Types

Input:

  1. MIME type declaration pointing to the URL:
    • <link type="text/css" rel="stylesheet" href="…">
    • <a href="…" type="foo/bar">
    • <img → image sniffer
  2. MIME type from the HTTP header, overriding 1.? What about the file extension?
  3. Sniffing, if there is no HTTP header (precedence sketched below)
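
The precedence above as a sketch; the sniffer is a stub where a real implementation would follow the WHATWG MIME Sniffing Standard or use a crate:

#+begin_src rust
/// HTTP Content-Type header wins, then the type= attribute at the link
/// location, then sniffing the body.
fn resolve_mime(header: Option<&str>, declared: Option<&str>, body: &[u8]) -> String {
    header
        .or(declared)
        .map(|s| s.to_string())
        .unwrap_or_else(|| sniff(body))
}

/// Stub sniffer: only distinguishes markup from everything else.
fn sniff(body: &[u8]) -> String {
    if body.starts_with(b"<") {
        "text/html".to_string()
    } else {
        "application/octet-stream".to_string()
    }
}
#+end_src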

robots.txt

RFC 9309, Google, Yandex docs, robotstxt.org

The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, dates from before RFC 9309, and seems very “unrusty” (much mutable global state). After all, it is a simple transliteration from C++.
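
For orientation, a possible Rust-side shape for such a parsed object; matching is simplified to prefix matches (RFC 9309's * and $ wildcard handling is elided). This is a sketch of what the crate lacks, not its API:

#+begin_src rust
/// A robots.txt reduced to per-agent rule groups, matchable without re-parsing.
struct RobotsTxt {
    groups: Vec<Group>,    // one per run of User-agent lines
    sitemaps: Vec<String>, // Sitemap: lines apply globally
}

struct Group {
    user_agents: Vec<String>, // product tokens, lowercased; "*" for the default group
    rules: Vec<Rule>,
}

enum Rule {
    Allow(String),    // path pattern; * and $ handling elided here
    Disallow(String),
}

impl RobotsTxt {
    /// RFC 9309, simplified to prefix matches: the longest matching rule
    /// wins, and Allow beats Disallow on ties.
    fn is_allowed(&self, user_agent: &str, path: &str) -> bool {
        let ua = user_agent.to_ascii_lowercase();
        let group = self
            .groups
            .iter()
            .find(|g| g.user_agents.iter().any(|a| a != "*" && ua.contains(a.as_str())))
            .or_else(|| self.groups.iter().find(|g| g.user_agents.iter().any(|a| a == "*")));
        let Some(group) = group else { return true };
        let mut best: Option<(usize, bool)> = None; // (match length, allowed)
        for rule in &group.rules {
            let (pat, allowed) = match rule {
                Rule::Allow(p) => (p.as_str(), true),
                Rule::Disallow(p) => (p.as_str(), false),
            };
            if path.starts_with(pat) {
                let cand = (pat.len(), allowed);
                if best.map_or(true, |(l, a)| cand.0 > l || (cand.0 == l && allowed && !a)) {
                    best = Some(cand);
                }
            }
        }
        best.map_or(true, |(_, allowed)| allowed)
    }
}
#+end_src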

sitemaps

Crates

Canonical Link Element

URL normalization

remove tracking URL parameters

crates

  • query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
  • clearurl - implementation for ClearURL
  • clearurls - rm tracking params
  • qstring - query string parser
  • shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
  • urlnorm - url normalization
  • url-cleaner - rm tracking garbage
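
Before committing to one of the crates above, the core of the job can also be sketched with the url crate alone; the blocklist here is a tiny illustrative subset:

#+begin_src rust
use url::Url;

// illustrative subset of a real tracking-parameter blocklist
const TRACKING_PARAMS: &[&str] = &["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];

fn strip_tracking(u: &Url) -> Url {
    let kept: Vec<(String, String)> = u
        .query_pairs()
        .filter(|(k, _)| !TRACKING_PARAMS.contains(&k.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    let mut clean = u.clone();
    if kept.is_empty() {
        clean.set_query(None); // drop the lone "?" as well
    } else {
        clean.query_pairs_mut().clear().extend_pairs(kept);
    }
    clean
}

fn main() {
    let u = Url::parse("https://example.com/a?id=1&utm_source=news").unwrap();
    assert_eq!(strip_tracking(&u).as_str(), "https://example.com/a?id=1");
}
#+end_src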

compiling with openssl on Debian

sfackler/rust-openssl#2333

#+begin_src sh
sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h
#+end_src

interesting stuff

protocols in general

postgres

postgres crates

LISTEN/NOTIFY with postgres, diesel
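
A sketch of the listening side with the plain postgres crate (the notifying side could be a trigger, or diesel issuing NOTIFY via sql_query). The channel name and payload are assumptions, and the fallible-iterator crate is needed for the iterator trait:

#+begin_src rust
use fallible_iterator::FallibleIterator;
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=crawler", NoTls)?;
    // the writing side runs e.g.: NOTIFY frontier_changed, '<url>'
    client.batch_execute("LISTEN frontier_changed")?;

    let mut notifications = client.notifications();
    let mut iter = notifications.blocking_iter();
    while let Some(note) = iter.next()? {
        println!("{}: {}", note.channel(), note.payload());
    }
    Ok(())
}
#+end_src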

crates

HTML content / article extraction