TODO: https://doc.rust-lang.org/rustdoc/
- get job with lease
- update lease (could be done by the work manager by observing the URL frontier of the crawl job?)
- domain crossing hops instead of external depth
- depth in parallel to priority to have absolute crawl frontier
- signals:
- sitemap.xml itself and linked sitemaps have constant prio 1
- priority of the location where the URL was found
- number of URLs on the location where the URL was found
- number of query params
- number of path elements
- outlink context
- properties
- priority from sitemaps satisfies 0 < p <= 1
- query params are worse than path elements
- remaining depth goes down
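The signals and properties above could be folded into a single score roughly like this; all weights are made-up placeholders, not a committed scheme:

```rust
// Sketch of a priority score combining the signals above.
// The weights are arbitrary placeholders and would need tuning.
fn url_priority(
    sitemap_prio: Option<f64>, // 0 < p <= 1 when the URL came from a sitemap
    parent_prio: f64,          // priority of the location the URL was found on
    path_elements: usize,
    query_params: usize,
) -> f64 {
    // sitemap.xml itself and linked sitemaps would get the constant 1.0 upstream
    let base = sitemap_prio.unwrap_or(parent_prio);
    // query params are penalized more than path elements
    let penalty = 0.05 * path_elements as f64 + 0.15 * query_params as f64;
    (base - penalty).max(0.0)
}

fn main() {
    let a = url_priority(None, 0.8, 2, 0); // two path elements
    let b = url_priority(None, 0.8, 0, 2); // two query params
    assert!(a > b); // query params hurt more than path elements
}
```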
- stealing of work possible if crawl A already crawls a URL that is external to crawl B? Or rather work injection, if a crawl for that domain already exists?
- insert initial URL(s)
- Insert URLs found
- only if the URL does not exist yet
- get next uncrawled url
- order by priority
- round robin over URLs
- set url to crawled
- get archived pages for URL
- get all archived URLs for domain below path
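The frontier operations above, sketched as a minimal in-memory structure. The round robin is done per host here, which is an assumption (the note only says "round robin over URLs"); persistence, leases and crawled-flags are left out:

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Minimal in-memory sketch of the frontier operations listed above.
#[derive(Default)]
struct Frontier {
    seen: HashSet<String>,                        // every URL ever inserted
    queues: BTreeMap<String, Vec<(f64, String)>>, // host -> (priority, url)
    rotation: VecDeque<String>,                   // hosts in round-robin order
}

impl Frontier {
    /// Insert a URL only if it does not exist yet.
    fn insert(&mut self, host: &str, url: &str, prio: f64) -> bool {
        if !self.seen.insert(url.to_string()) {
            return false; // already known
        }
        if !self.queues.contains_key(host) {
            self.rotation.push_back(host.to_string());
        }
        self.queues
            .entry(host.to_string())
            .or_default()
            .push((prio, url.to_string()));
        true
    }

    /// Get the next uncrawled URL: next host in the rotation, best priority first.
    fn next(&mut self) -> Option<String> {
        let host = self.rotation.pop_front()?;
        let queue = self.queues.get_mut(&host)?;
        // sort ascending so pop() takes the highest priority
        queue.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        let (_, url) = queue.pop()?;
        if queue.is_empty() {
            self.queues.remove(&host);
        } else {
            self.rotation.push_back(host); // host stays in the rotation
        }
        Some(url)
    }
}

fn main() {
    let mut f = Frontier::default();
    assert!(f.insert("a.example", "https://a.example/1", 0.9));
    assert!(!f.insert("a.example", "https://a.example/1", 0.9)); // duplicate
    f.insert("a.example", "https://a.example/2", 0.5);
    f.insert("b.example", "https://b.example/1", 0.1);
    assert_eq!(f.next().as_deref(), Some("https://a.example/1"));
    assert_eq!(f.next().as_deref(), Some("https://b.example/1")); // round robin
    assert_eq!(f.next().as_deref(), Some("https://a.example/2"));
    assert_eq!(f.next(), None);
}
```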
- https://crates.io/crates/scraper - HTML parsing and querying with CSS selectors
- https://crates.io/crates/anyhow - Flexible concrete Error type built on std::error::Error
- https://github.com/utkarshkukreti/select.rs - extract data from HTML
- https://crates.io/crates/rouille - mini HTTP server for status and control pages
- There’s a standard: MIME Sniffing (found via the Wikipedia article on Content Sniffing)
- https://crates.io/crates/mime_classifier - implementation of the standard, from Servo
- How Mozilla determines MIME Types
Input:
- Mime type declaration pointing to URL:
- <link type="text/css" rel="stylesheet" href="…">
- <a href="…" type="foo/bar">
- <img> -> image sniffer
- MIME type from HTTP header, overriding 1.? What about the file extension?
- Sniffing, if no HTTP header
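A rough sketch of that resolution order, assuming the HTTP header wins over the declared type (the open question above) and sniffing is only a fallback:

```rust
/// Sketch of a MIME resolution order. Whether the header should really
/// override the declared type is still an open question in the notes.
fn resolve_mime(
    declared: Option<&str>, // e.g. type="text/css" on <link> or <a>
    header: Option<&str>,   // Content-Type from the HTTP response
    sniffed: Option<&str>,  // result of content sniffing
) -> Option<String> {
    header
        .or(declared)
        .or(sniffed)
        // drop parameters like "; charset=utf-8" and normalize case
        .map(|m| m.split(';').next().unwrap_or(m).trim().to_ascii_lowercase())
}

fn main() {
    assert_eq!(
        resolve_mime(Some("text/css"), Some("text/HTML; charset=utf-8"), None).as_deref(),
        Some("text/html")
    );
    assert_eq!(resolve_mime(None, None, Some("image/png")).as_deref(), Some("image/png"));
    assert_eq!(resolve_mime(None, None, None), None);
}
```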
RFC 9309, Google, Yandex docs, robotstxt.org
The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, predates RFC 9309, and seems very "unrusty" (lots of mutable global state). After all, it is a straight transliteration from C++.
- texting_robots
- forked by Spire-rs’ kit/exclusion
- robotparser-rs
- forked by spider
- robots_txt - unstable, WIP, no activity for 4+ years
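Regardless of which crate wins, the core RFC 9309 precedence rule is small enough to sketch: among matching rules the longest match wins, and Allow wins a tie. No '*'/'$' wildcard handling here; the crates above cover that:

```rust
/// Minimal illustration of RFC 9309 precedence for one user-agent group.
/// Each rule is (is_allow, path_prefix).
fn allowed(rules: &[(bool, &str)], path: &str) -> bool {
    rules
        .iter()
        .filter(|(_, rule)| path.starts_with(*rule))
        // sort key (length, is_allow): longest match wins, Allow wins a tie
        .max_by_key(|(allow, rule)| (rule.len(), *allow))
        .map(|(allow, _)| *allow)
        .unwrap_or(true) // no matching rule: allowed by default
}

fn main() {
    let rules = [(false, "/private"), (true, "/private/public")];
    assert!(allowed(&rules, "/private/public/page")); // longer Allow wins
    assert!(!allowed(&rules, "/private/other"));
    assert!(allowed(&rules, "/elsewhere")); // no rule matches
    // equal-length Allow and Disallow: Allow wins
    assert!(allowed(&[(false, "/a"), (true, "/a")], "/ab"));
}
```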
- https://developers.google.com/search/docs/crawling-indexing/sitemaps
- https://sitemaps.org
- https://en.m.wikipedia.org/wiki/Sitemaps
- https://crates.io/crates/sitemap xml-rs, old but 8 dependents
- https://crates.io/crates/sitemap-iter roxmltree (2022-02)
- https://crates.io/crates/sitemaps quick-xml (2024-06), experimental learning project
- https://crates.io/crates/wls - check for ideas
- https://crates.io/crates/sitemapo quick-xml (2023-07), dead repo
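Just to pin down the data shape, a naive std-only extraction of <loc> entries; a real implementation should use one of the XML crates above (quick-xml, roxmltree, xml-rs):

```rust
/// Naive <loc> extraction from a sitemap, for illustration only.
/// No entity decoding, no namespace handling, no nested-sitemap support.
fn sitemap_locs(xml: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<loc>") {
        rest = &rest[start + "<loc>".len()..];
        match rest.find("</loc>") {
            Some(end) => {
                out.push(rest[..end].trim().to_string());
                rest = &rest[end..];
            }
            None => break, // unclosed tag, stop scanning
        }
    }
    out
}

fn main() {
    let xml = "<urlset><url><loc>https://example.com/</loc></url>\
               <url><loc> https://example.com/a </loc></url></urlset>";
    assert_eq!(
        sitemap_locs(xml),
        vec!["https://example.com/", "https://example.com/a"]
    );
}
```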
- search topics
- “crawling strategy”
- https://frontera.readthedocs.io
- Crawling strategies
- https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages
- “re-crawl strategy” or “page refresh policy”
- https://frontera.readthedocs.io
- https://ssrg.eecs.uottawa.ca/docs/Benjamin-Thesis.pdf Strategy for Efficient Crawling of Rich Internet Applications
- “focused crawling”
- “crawling strategy”
- https://developers.google.com/search/docs/crawling-indexing
- https://crates.io/crates/urlnorm
- https://en.wikipedia.org/wiki/URI_normalization
- “Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text)”
- https://github.com/brave/brave-browser/wiki/Query-String-Filter
- https://gitlab.com/ClearURLs/ClearUrls
- https://gitlab.com/ClearURLs/rules -> data.min.json -> “globalRules”
- query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
- clearurl - implementation for ClearURL
- clearurls - rm tracking params
- qstring - query string parser
- shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
- urlnorm - url normalization
- url-cleaner - rm tracking garbage
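In the spirit of these crates, stripping tracking parameters can be sketched like this; the blocklist is a tiny made-up subset, real rule sets like ClearURLs' globalRules are far larger and partly regex-based:

```rust
/// Sketch of tracking-parameter removal. BLOCKED is a placeholder subset.
fn strip_tracking(url: &str) -> String {
    const BLOCKED: &[&str] = &["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];
    let Some((base, query)) = url.split_once('?') else {
        return url.to_string(); // no query string at all
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or(pair);
            !BLOCKED.contains(&key)
        })
        .collect();
    if kept.is_empty() {
        base.to_string() // drop the '?' when nothing survives
    } else {
        format!("{base}?{}", kept.join("&"))
    }
}

fn main() {
    assert_eq!(
        strip_tracking("https://e.com/a?x=1&utm_source=nl&gclid=z"),
        "https://e.com/a?x=1"
    );
    assert_eq!(strip_tracking("https://e.com/a?utm_source=nl"), "https://e.com/a");
    assert_eq!(strip_tracking("https://e.com/a"), "https://e.com/a");
}
```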
sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h
- GOGGLES: Democracy dies in darkness, and so does the Web paper by Brave Search Team, via Spyglass
- https://github.com/spyglass-search
- https://github.com/iipc - International Internet Preservation Consortium
- https://sans-io.readthedocs.io/how-to-sans-io.html
- The Niri WM was written following sans-io principles (per the author's handwritten English subtitles)
- https://github.com/dhamaniasad/awesome-postgres
- https://www.postgresguide.com
- https://github.com/elierotenberg/coding-styles/blob/master/postgres.md
- https://github.com/sfackler/rust-postgres
- Rust implementation of the wire protocol, but uses tokio even in the synchronous client
- probably problems due to async? sfackler/rust-postgres#725
- postgres-protocol, postgres-types do not depend on tokio
- https://crates.io/crates/pgwire
- recommends sfackler's rust-postgres for clients, focuses on servers
- depends on tokio
- diesel
- uses the pq-sys C wrapper for libpq
- not pub
- no support for notifications
- previous request for LISTEN diesel-rs/diesel#2166
- https://docs.diesel.rs/2.2.x/src/diesel/pg/connection/raw.rs.html
- issues
- Removing libpq (to enable async)
- Async I/O
- Postgres: We should avoid sending one query per custom type bind enum!
- PostgreSQL Large Objects - would require access to internals?
- testing diesel-rs/diesel#1549
- diesel-rs/diesel#4420
- waiting for notifications is more involved as it requires selecting a fd
- https://blog.pjam.me/posts/select-syscall-in-rust/
- crates nix or rustix help
- https://github.com/rinja-rs/askama Type-safe, compiled Jinja-like templates
- https://crates.io/crates/fetcher Automatic news fetching and parsing
- https://crates.io/crates/httptest HTTP testing facilities including a mock server
- https://github.com/lipanski/mockito HTTP mocking for Rust! https://zupzup.org/rust-http-testing/
- https://crates.io/crates/tempfile
- https://crates.io/crates/pretty_assertions
- https://crates.io/crates/nonzero
- https://crates.io/crates/webpage
- https://crates.io/crates/warc
- https://crates.io/crates/feedfinder Auto-discovery of feeds in HTML content
- https://crates.io/crates/governor - A rate-limiting implementation in Rust
- https://crates.io/crates/thiserror
- https://crates.io/crates/tracing https://gist.github.com/oliverdaff/d1d5e5bc1baba087b768b89ff82dc3ec
- https://crates.io/crates/governor - complex rate limiting algorithm, used in spyglass-search/netrunner
- https://crates.io/crates/apalis - background job processing
- https://github.com/poem-web/poem - web framework
- https://crates.io/crates/metrics-dashboard uses poem and metrics
- https://crates.io/crates/metrics_server
- https://crates.io/crates/memberlist-core - Gossip protocol for cluster membership
- displaydoc derive macro for the standard library’s core::fmt::Display, especially for errors
- scopeguard run a given closure when it goes out of scope (like defer in D)