
TODO: https://doc.rust-lang.org/rustdoc/

database schema

crawl job

  • get job with lease (sketched below)
  • update lease (could be done by the work manager by observing the URL frontier of the crawl job?)
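
A minimal leasing sketch, assuming a crawl_job table with id and lease_until columns (both names are placeholders) and tokio-postgres as the client; FOR UPDATE SKIP LOCKED keeps concurrent workers from grabbing the same job:

#+begin_src rust
use tokio_postgres::Client;

// Assumed schema: crawl_job(id BIGINT, lease_until TIMESTAMPTZ, ...).
// FOR UPDATE SKIP LOCKED makes concurrent workers skip rows that are
// already locked by another leasing transaction.
async fn lease_job(client: &Client) -> Result<Option<i64>, tokio_postgres::Error> {
    let row = client
        .query_opt(
            "UPDATE crawl_job
                SET lease_until = now() + interval '60 seconds'
              WHERE id = (SELECT id FROM crawl_job
                           WHERE lease_until IS NULL OR lease_until < now()
                           ORDER BY id
                           FOR UPDATE SKIP LOCKED
                           LIMIT 1)
              RETURNING id",
            &[],
        )
        .await?;
    Ok(row.map(|r| r.get::<_, i64>(0)))
}
#+end_src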

url frontier

url priority

  • domain-crossing hops instead of external depth
  • track depth in parallel to priority to get an absolute crawl frontier
  • signals (combined in the scoring sketch after this list):
    • sitemap.xml itself and linked sitemaps have the constant priority 1
    • priority of the find location (the page the URL was found on)
    • number of URLs on the find location
    • number of query params
    • number of path elements
    • outlink context
  • properties
    • priority from sitemaps is in the range 0 < p <= 1
    • query params are worse than path elements
    • remaining depth goes down
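
One way the signals and properties could combine, as a sketch; every weight below is a placeholder, not a decided formula:

#+begin_src rust
/// Signals from the list above; field names are placeholders.
struct UrlSignals {
    sitemap_priority: Option<f64>, // 0 < p <= 1 when the URL came from a sitemap
    find_location_priority: f64,   // priority of the page the URL was found on
    query_params: usize,
    path_elements: usize,
    domain_hops: u32,              // domain-crossing hops instead of external depth
}

fn priority(s: &UrlSignals) -> f64 {
    // sitemap.xml and linked sitemaps themselves get the constant priority 1
    // elsewhere; URLs listed in one inherit the sitemap's priority value.
    let base = s.sitemap_priority.unwrap_or(0.5 * s.find_location_priority);
    // query params are worse than path elements, so they are penalized harder
    let shape_penalty = 0.05 * s.path_elements as f64 + 0.15 * s.query_params as f64;
    let hop_penalty = 0.2 * s.domain_hops as f64;
    (base - shape_penalty - hop_penalty).clamp(0.0, 1.0)
}
#+end_src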

url to crawl map

  • work stealing possible if crawl A already crawls a URL that is external to crawl B? Or rather work injection, if a crawl for that domain already exists?

queries

  • insert initial URL(s)
  • insert URLs found (see the query sketches after this list)
    • only if the URL does not exist yet
  • get next uncrawled URL
    • order by priority
    • round robin over URLs
  • set URL to crawled
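
Sketches of those queries against an assumed url_frontier table (url, domain, priority, crawled_at; all names placeholders). ON CONFLICT DO NOTHING covers the "only if the URL does not exist yet" case, and DISTINCT ON (domain) approximates the round robin by returning at most one URL per domain per batch:

#+begin_src rust
// insert URLs found, only if the URL does not exist yet
const INSERT_URL: &str =
    "INSERT INTO url_frontier (url, domain, priority)
     VALUES ($1, $2, $3)
     ON CONFLICT (url) DO NOTHING";

// get next uncrawled URLs: the highest-priority URL of every domain
const NEXT_URLS: &str =
    "SELECT DISTINCT ON (domain) url
       FROM url_frontier
      WHERE crawled_at IS NULL
      ORDER BY domain, priority DESC";

// set URL to crawled
const MARK_CRAWLED: &str =
    "UPDATE url_frontier SET crawled_at = now() WHERE url = $1";
#+end_src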

crawl archive

  • get archived pages for a URL
  • get all archived URLs for a domain below a path (both sketched below)
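
Possible shapes for the two archive queries, assuming a crawl_archive table with url, domain, path, fetched_at, and body columns (all placeholders):

#+begin_src rust
// get archived pages for a URL, newest capture first
const PAGES_FOR_URL: &str =
    "SELECT fetched_at, body FROM crawl_archive
      WHERE url = $1 ORDER BY fetched_at DESC";

// get all archived URLs for a domain below a path prefix
const URLS_BELOW_PATH: &str =
    "SELECT DISTINCT url FROM crawl_archive
      WHERE domain = $1 AND path LIKE $2 || '%'";
#+end_src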

crates to use

MIME Types

Input:

  1. MIME type declaration pointing to the URL:
    • <link type="text/css" rel="stylesheet" href="…">
    • <a href="…" type="foo/bar">
    • <img → image sniffer
  2. MIME type from the HTTP header, overriding 1.? What about the file extension?
  3. Sniffing, if there is no HTTP header (precedence sketched below)
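
The precedence above as a sketch; the sniffer is a stub where a real implementation would follow the WHATWG MIME Sniffing Standard or use a crate:

#+begin_src rust
/// HTTP Content-Type header wins, then the type= attribute at the link
/// location, then sniffing the body.
fn resolve_mime(header: Option<&str>, declared: Option<&str>, body: &[u8]) -> String {
    header
        .or(declared)
        .map(|s| s.to_string())
        .unwrap_or_else(|| sniff(body))
}

/// Stub sniffer: only distinguishes markup from everything else.
fn sniff(body: &[u8]) -> String {
    if body.starts_with(b"<") {
        "text/html".to_string()
    } else {
        "application/octet-stream".to_string()
    }
}
#+end_src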

robots.txt

RFC 9309, Google, Yandex docs, robotstxt.org

The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, dates from before RFC 9309, and seems very “unrusty” (much mutable global state). After all, it is a simple transliteration from C++.
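
For orientation, a possible Rust-side shape for such a parsed object; matching is simplified to prefix matches (RFC 9309's * and $ wildcard handling is elided). This is a sketch of what the crate lacks, not its API:

#+begin_src rust
/// A robots.txt reduced to per-agent rule groups, matchable without re-parsing.
struct RobotsTxt {
    groups: Vec<Group>,    // one per run of User-agent lines
    sitemaps: Vec<String>, // Sitemap: lines apply globally
}

struct Group {
    user_agents: Vec<String>, // product tokens, lowercased; "*" for the default group
    rules: Vec<Rule>,
}

enum Rule {
    Allow(String),    // path pattern; * and $ handling elided here
    Disallow(String),
}

impl RobotsTxt {
    /// RFC 9309, simplified to prefix matches: the longest matching rule
    /// wins, and Allow beats Disallow on ties.
    fn is_allowed(&self, user_agent: &str, path: &str) -> bool {
        let ua = user_agent.to_ascii_lowercase();
        let group = self
            .groups
            .iter()
            .find(|g| g.user_agents.iter().any(|a| a != "*" && ua.contains(a.as_str())))
            .or_else(|| self.groups.iter().find(|g| g.user_agents.iter().any(|a| a == "*")));
        let Some(group) = group else { return true };
        let mut best: Option<(usize, bool)> = None; // (match length, allowed)
        for rule in &group.rules {
            let (pat, allowed) = match rule {
                Rule::Allow(p) => (p.as_str(), true),
                Rule::Disallow(p) => (p.as_str(), false),
            };
            if path.starts_with(pat) {
                let cand = (pat.len(), allowed);
                if best.map_or(true, |(l, a)| cand.0 > l || (cand.0 == l && allowed && !a)) {
                    best = Some(cand);
                }
            }
        }
        best.map_or(true, |(_, allowed)| allowed)
    }
}
#+end_src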

sitemaps

Crates

Canonical Link Element

URL normalization

remove tracking URL parameters

crates

  • query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
  • clearurl - implementation for ClearURL
  • clearurls - rm tracking params
  • qstring - query string parser
  • shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
  • urlnorm - url normalization
  • url-cleaner - rm tracking garbage
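
Before committing to one of the crates above, the core of the job can also be sketched with the url crate alone; the blocklist here is a tiny illustrative subset:

#+begin_src rust
use url::Url;

// illustrative subset of a real tracking-parameter blocklist
const TRACKING_PARAMS: &[&str] = &["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];

fn strip_tracking(u: &Url) -> Url {
    let kept: Vec<(String, String)> = u
        .query_pairs()
        .filter(|(k, _)| !TRACKING_PARAMS.contains(&k.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    let mut clean = u.clone();
    if kept.is_empty() {
        clean.set_query(None); // drop the lone "?" as well
    } else {
        clean.query_pairs_mut().clear().extend_pairs(kept);
    }
    clean
}

fn main() {
    let u = Url::parse("https://example.com/a?id=1&utm_source=news").unwrap();
    assert_eq!(strip_tracking(&u).as_str(), "https://example.com/a?id=1");
}
#+end_src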

compiling with openssl on Debian

sfackler/rust-openssl#2333

#+begin_src sh
sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h
#+end_src

interesting stuff

protocols in general

postgres

postgres crates

LISTEN/NOTIFY with postgres, diesel
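
A sketch of the listening side with the plain postgres crate (the notifying side could be a trigger, or diesel issuing NOTIFY via sql_query). The channel name and payload are assumptions, and the fallible-iterator crate is needed for the iterator trait:

#+begin_src rust
use fallible_iterator::FallibleIterator;
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=crawler", NoTls)?;
    // the writing side runs e.g.: NOTIFY frontier_changed, '<url>'
    client.batch_execute("LISTEN frontier_changed")?;

    let mut notifications = client.notifications();
    let mut iter = notifications.blocking_iter();
    while let Some(note) = iter.next()? {
        println!("{}: {}", note.channel(), note.payload());
    }
    Ok(())
}
#+end_src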

crates

HTML content / article extraction