diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
new file mode 100644
index 0000000..9c54df5
--- /dev/null
+++ b/.github/workflows/deploy.yml
@@ -0,0 +1,52 @@
+name: Deploy
+on:
+  push:
+    branches:
+      - doc # TODO: change to tag only
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write # To push a branch
+      pages: write # To push to a GitHub Pages site
+      id-token: write # To update the deployment status
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Install latest mdbook
+        run: |
+          tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
+          url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
+          mkdir mdbook
+          curl -sSL $url | tar -xz --directory=./mdbook
+          echo `pwd`/mdbook >> $GITHUB_PATH
+      # - name: Install latest mdbook-pagetoc
+      #   run: |
+      #     tag=$(curl 'https://api.github.com/repos/slowsage/mdbook-pagetoc/releases/latest' | jq -r '.tag_name')
+      #     url="https://github.com/slowsage/mdbook-pagetoc/releases/download/${tag}/mdbook-pagetoc-${tag}-x86_64-unknown-linux-gnu.tar.gz"
+      #     curl -sSL $url | tar -xz --directory=./mdbook
+      - name: Install latest mdbook-pagetoc
+        uses: baptiste0928/cargo-install@v2
+        with:
+          crate: mdbook-pagetoc
+          locked: false
+      - name: Run tests
+        run: |
+          cd doc
+          mdbook test
+      - name: Build Book
+        run: |
+          cd doc
+          mdbook build
+      - name: Setup Pages
+        uses: actions/configure-pages@v2
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v1
+        with:
+          # Upload the built book only
+          path: 'doc/book'
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v1
\ No newline at end of file
diff --git a/README.md b/README.md
index 3a0d204..9ff6444 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1 @@
 # Sitemap Web Scraper
-
-## Bash completion
-
-Source the completion script in your `~/.bashrc` file:
-
-```bash
-echo 'source <(sws completion)' >> ~/.bashrc
-```
diff --git a/doc/.gitignore b/doc/.gitignore
new file mode 100644
index 0000000..927206b
--- /dev/null
+++ b/doc/.gitignore
@@ -0,0 +1,4 @@
+book
+theme/index.hbs
+theme/pagetoc.css
+theme/pagetoc.js
\ No newline at end of file
diff --git a/doc/book.toml b/doc/book.toml
new file mode 100644
index 0000000..d155b15
--- /dev/null
+++ b/doc/book.toml
@@ -0,0 +1,12 @@
+[book]
+authors = ["Romain Leroux"]
+language = "en"
+multilingual = false
+src = "src"
+title = "Sitemap Web Scraper"
+
+# https://crates.io/crates/mdbook-pagetoc
+[preprocessor.pagetoc]
+[output.html]
+additional-css = ["theme/pagetoc.css"]
+additional-js = ["theme/pagetoc.js"]
\ No newline at end of file
diff --git a/doc/src/README.md b/doc/src/README.md
new file mode 100644
index 0000000..c4aaf31
--- /dev/null
+++ b/doc/src/README.md
@@ -0,0 +1,39 @@
+# Introduction
+
+Sitemap Web Scraper, or [sws][], is a tool for simple, flexible, yet performant web
+page scraping. It consists of a [CLI][] that executes a [Lua][] [JIT][lua-jit] script
+and outputs a [CSV][] file.
+
+All the logic for crawling/scraping is defined in Lua and executed on multiple threads
+in [Rust][]. The actual parsing of HTML is done in Rust. Standard [CSS
+selectors][css-sel] are also implemented in Rust (using Servo's [html5ever][] and
+[selectors][]). Both functionalities are accessible through a Lua API for flexible
+scraping logic.
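+
+To give a first idea of what a scraping script looks like, here is a minimal,
+illustrative sketch. The names used below (`scrapPage`, `page:select`,
+`context:sendRecord`, `innerText`) are placeholders for this introduction, not a
+reference for the actual API; see the [Lua scraper][lua-scraper] chapter for the real
+definitions.
+
+```lua
+-- Illustrative sketch only: identifiers are placeholders, not the documented API.
+-- A scraping script defines a callback that receives a parsed HTML page together
+-- with a context used to emit CSV records.
+function scrapPage(page, context)
+    -- CSS selectors (backed by Rust) are exposed to Lua
+    for h1 in page:select("h1.title"):iter() do
+        context:sendRecord({ h1:innerText() })
+    end
+end
+```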
+
+As for the crawling logic, multiple seeding options are available: [robots.txt][robots],
+[sitemaps][], or a custom list of HTML pages. By default, sitemaps (either provided or
+extracted from `robots.txt`) will be crawled recursively and the discovered HTML pages
+will be scraped with the provided Lua script. It's also possible to dynamically add page
+links to the crawling queue when scraping an HTML page. See the [crawl][sub-crawl]
+subcommand and the [Lua scraper][lua-scraper] for more details.
+
+Besides, the Lua scraping script can be used on HTML pages stored as local files,
+without any crawling. See the [scrap][sub-scrap] subcommand doc for more details.
+
+Furthermore, the CLI is composed of `crates` that can be used independently in a custom
+Rust program.
+
+[sws]: https://github.com/lerouxrgd/sws
+[cli]: https://en.wikipedia.org/wiki/Command-line_interface
+[rust]: https://www.rust-lang.org/
+[lua]: https://www.lua.org/
+[lua-jit]: https://luajit.org/
+[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
+[css-sel]: https://www.w3schools.com/cssref/css_selectors.asp
+[html5ever]: https://crates.io/crates/html5ever
+[selectors]: https://crates.io/crates/selectors
+[robots]: https://en.wikipedia.org/wiki/Robots.txt
+[sitemaps]: https://www.sitemaps.org/
+[sub-crawl]: ./crawl_overview.html
+[sub-scrap]: ./scrap_overview.html
+[lua-scraper]: ./lua_scraper.html
diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md
new file mode 100644
index 0000000..3176c5e
--- /dev/null
+++ b/doc/src/SUMMARY.md
@@ -0,0 +1,13 @@
+# Summary
+
+[Introduction](README.md)
+
+[Getting Started](getting_started.md)
+
+- [Subcommand: crawl](./crawl_overview.md)
+  - [Crawler Configuration](./crawl_config.md)
+
+- [Subcommand: scrap](./scrap_overview.md)
+
+- [Lua Scraper](./lua_scraper.md)
+  - [Lua API Overview](./lua_api_overview.md)
diff --git a/doc/src/crawl_config.md b/doc/src/crawl_config.md
new file mode 100644
index 0000000..395b7f1
--- /dev/null
+++ b/doc/src/crawl_config.md
@@ -0,0 +1,83 @@
+# Crawler Config
+
+The crawler's configurable parameters are:
+
+| Parameter      | Default | Description |
+|----------------|---------|-------------|
+| user_agent     | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests |
+| page_buffer    | 10_000 | The size of the pages download queue. When the queue is full, new downloads are put on hold. This parameter is particularly relevant when using concurrent throttling. |
+| throttle       | `Concurrent(100)` if `robot` is `None`<br><br>Otherwise `Delay(N)` where `N` is read from the `robots.txt` field `Crawl-delay: N` | The throttling strategy for HTML page downloads.<br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, and `Delay(N)` means wait for `N` seconds between downloads |
+| num_workers    | max(1, num_cpus-2) | The number of CPU cores that will be used for scraping pages in parallel using the provided Lua script. |
+| on_dl_error    | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. The other possible value is `Fail`. |
+| on_xml_error   | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. The other possible value is `Fail`. |
+| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. The other possible value is `Fail`. |
+| robot          | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`.<br><br>⚠ Conflicts with `seedRobotsTxt` in the [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robots.txt URL as well. |
+
+These parameters can be changed through the Lua script or CLI arguments.
+
+The priority order is: `CLI (highest priority) > Lua > Default values`
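+
+For example (values chosen purely for illustration), if the Lua script sets
+`pageBuffer` and the CLI also passes `--page-buffer`, the CLI value wins:
+
+```lua
+-- In the Lua script (illustrative value):
+sws.crawlerConfig = { pageBuffer = 5000 }
+-- Running the CLI with `--page-buffer 10000` on top of this script overrides
+-- the Lua value, so the crawler effectively uses a page buffer of 10000.
+```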

+[lua-scraper]: ./lua_scraper.html#seed-definition
+
+## Lua override
+
+You can override parameters in Lua through the global variable `sws.crawlerConfig`.
+
+| Parameter      | Lua name     | Example Lua value                   |
+|----------------|--------------|-------------------------------------|
+| user_agent     | userAgent    | "SWSbot"                            |
+| page_buffer    | pageBuffer   | 10000                               |
+| throttle       | throttle     | { Concurrent = 100 }                |
+| num_workers    | numWorkers   | 4                                   |
+| on_dl_error    | onDlError    | "SkipAndLog"                        |
+| on_xml_error   | onXmlError   | "Fail"                              |
+| on_scrap_error | onScrapError | "SkipAndLog"                        |
+| robot          | robot        | "https://www.google.com/robots.txt" |
+
+Here is an example of crawler configuration parameters set using Lua:
+
+```lua
+-- You don't have to specify all parameters, only the ones you want to override.
+sws.crawlerConfig = {
+    userAgent = "SWSbot",
+    pageBuffer = 10000,
+    throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
+    numWorkers = 4,
+    onDlError = "SkipAndLog", -- or: "Fail"
+    onXmlError = "SkipAndLog",
+    onScrapError = "SkipAndLog",
+    robot = nil,
+}
+```
+
+## CLI override
+
+You can override parameters through CLI arguments.
+
+| Parameter             | CLI argument name | Example CLI argument value          |
+|-----------------------|-------------------|-------------------------------------|
+| user_agent            | --user-agent      | 'SWSbot'                            |
+| page_buffer           | --page-buffer     | 10000                               |
+| throttle (Concurrent) | --conc-dl         | 100                                 |
+| throttle (PerSecond)  | --rps             | 10                                  |
+| throttle (Delay)      | --delay           | 2                                   |
+| num_workers           | --num-workers     | 4                                   |
+| on_dl_error           | --on-dl-error     | skip-and-log                        |
+| on_xml_error          | --on-xml-error    | fail                                |
+| on_scrap_error        | --on-scrap-error  | skip-and-log                        |
+| robot                 | --robot           | 'https://www.google.com/robots.txt' |
+
+Here is an example of crawler configuration parameters set using CLI arguments:
+
+```sh
+sws --script path/to/scrape_logic.lua -o results.csv \
+    --user-agent 'SWSbot' \
+    --page-buffer 10000 \
+    --conc-dl 100 \
+    --num-workers 4 \
+    --on-dl-error skip-and-log \
+    --on-xml-error fail \
+    --on-scrap-error skip-and-log \
+    --robot 'https://www.google.com/robots.txt'
+```
diff --git a/doc/src/crawl_overview.md b/doc/src/crawl_overview.md
new file mode 100644
index 0000000..109b56f
--- /dev/null
+++ b/doc/src/crawl_overview.md
@@ -0,0 +1,23 @@
+# Subcommand: crawl
+
+```text
+Crawl sitemaps and scrap pages content
+
+Usage: sws crawl [OPTIONS] --script