Add mdbook doc
lerouxrgd committed Dec 16, 2023
1 parent d43550a commit 7f30833
Showing 13 changed files with 748 additions and 12 deletions.
44 changes: 44 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,44 @@
name: Deploy
on:
  push:
    branches:
      - doc # TODO: change to tag only

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write # To push a branch
      pages: write # To push to a GitHub Pages site
      id-token: write # To update the deployment status
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install latest mdbook
        run: |
          tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
          url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          mkdir mdbook
          curl -sSL $url | tar -xz --directory=./mdbook
          echo `pwd`/mdbook >> $GITHUB_PATH
      - name: Install latest mdbook-toc
        run: |
          tag=$(curl 'https://api.github.com/repos/badboy/mdbook-toc/releases/latest' | jq -r '.tag_name')
          url="https://github.com/badboy/mdbook-toc/releases/download/${tag}/mdbook-toc-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          echo $url
          curl -sSL $url | tar -xz --directory=./mdbook
      - name: Build Book
        run: |
          cd doc
          mdbook build
      - name: Setup Pages
        uses: actions/configure-pages@v2
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload the built book (mdbook outputs to doc/book)
          path: 'doc/book'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1
8 changes: 0 additions & 8 deletions README.md
@@ -1,9 +1 @@
# Sitemap Web Scraper

## Bash completion

Source the completion script in your `~/.bashrc` file:

```bash
echo 'source <(sws completion)' >> ~/.bashrc
```
4 changes: 4 additions & 0 deletions doc/.gitignore
@@ -0,0 +1,4 @@
book
theme/index.hbs
theme/pagetoc.css
theme/pagetoc.js
12 changes: 12 additions & 0 deletions doc/book.toml
@@ -0,0 +1,12 @@
[book]
authors = ["Romain Leroux"]
language = "en"
multilingual = false
src = "src"
title = "Sitemap Web Scraper"

# https://crates.io/crates/mdbook-pagetoc
[preprocessor.pagetoc]
[output.html]
additional-css = ["theme/pagetoc.css"]
additional-js = ["theme/pagetoc.js"]
39 changes: 39 additions & 0 deletions doc/src/README.md
@@ -0,0 +1,39 @@
# Introduction

Sitemap Web Scraper, or [sws][], is a tool for simple, flexible, yet performant web
page scraping. It consists of a [CLI][] that executes a [Lua][] [JIT][lua-jit] script
and outputs a [CSV][] file.

All the crawling/scraping logic is defined in Lua and executed on multiple threads in
[Rust][]. The actual HTML parsing is done in Rust, and standard [CSS
selectors][css-sel] are also implemented in Rust (using Servo's [html5ever][] and
[selectors][]). Both functionalities are exposed through a Lua API for flexible
scraping logic.

As for the crawling logic, multiple seeding options are available: [robots.txt][robots],
[sitemaps][], or a custom list of HTML pages. By default, sitemaps (either provided or
extracted from `robots.txt`) are crawled recursively and the discovered HTML pages are
scraped with the provided Lua script. It's also possible to dynamically add page links
to the crawling queue when scraping an HTML page. See the [crawl][sub-crawl] subcommand
and the [Lua scraper][lua-scraper] for more details.
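
For instance, seeding can be declared directly in the Lua script. A minimal sketch
(`sws.seedPages` is used in the Getting Started example and `seedRobotsTxt` is
referenced in the crawler config; the exact form of the other fields shown here is an
assumption, see the [Lua scraper][lua-scraper] seed definition for the actual API):

```lua
-- Pick one seeding option (sketch only):
sws.seedPages = { "https://example.org/some/page.html" }    -- explicit list of HTML pages
-- sws.seedRobotsTxt = "https://example.org/robots.txt"     -- crawl sitemaps listed in robots.txt
-- sws.seedSitemaps = { "https://example.org/sitemap.xml" } -- assumed name for direct sitemap seeding
```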

The Lua scraping script can also be used on HTML pages stored as local files, without
any crawling. See the [scrap][sub-scrap] subcommand for more details.

Furthermore, the CLI is composed of `crates` that can be used independently in a custom
Rust program.

[sws]: https://github.com/lerouxrgd/sws
[cli]: https://en.wikipedia.org/wiki/Command-line_interface
[rust]: https://www.rust-lang.org/
[lua]: https://www.lua.org/
[lua-jit]: https://luajit.org/
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[css-sel]: https://www.w3schools.com/cssref/css_selectors.asp
[html5ever]: https://crates.io/crates/html5ever
[selectors]: https://crates.io/crates/selectors
[robots]: https://en.wikipedia.org/wiki/Robots.txt
[sitemaps]: https://www.sitemaps.org/
[sub-crawl]: ./crawl_overview.html
[sub-scrap]: ./scrap_overview.html
[lua-scraper]: ./lua_scraper.html
13 changes: 13 additions & 0 deletions doc/src/SUMMARY.md
@@ -0,0 +1,13 @@
# Summary

[Introduction](README.md)

[Getting Started](getting_started.md)

- [Subcommand: crawl](./crawl_overview.md)
  - [Crawler Configuration](./crawl_config.md)

- [Subcommand: scrap](./scrap_overview.md)

- [Lua Scraper](./lua_scraper.md)
  - [Lua API Overview](./lua_api_overview.md)
83 changes: 83 additions & 0 deletions doc/src/crawl_config.md
@@ -0,0 +1,83 @@
# Crawler Config

The configurable crawler parameters are:

| Parameter | Default | Description |
|----------------|--------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| user_agent | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests |
| page_buffer | 10_000 | The size of the pages download queue. When the queue is full new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. |
| throttle | `Concurrent(100)` if `robot` is `None` <br><br>Otherwise `Delay(N)` where `N` is read from `robots.txt` field `Crawl-delay: N` | The throttling strategy for HTML page downloads. <br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, `Delay(N)` means wait for `N` seconds between downloads |
| num_workers | max(1, num_cpus-2) | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script. |
| on_dl_error | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. Other possible value is `Fail`. |
| on_xml_error | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. Other possible value is `Fail`. |
| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. Other possible value is `Fail`. |
| robot | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`. <br><br>⚠ Conflicts with `seedRobotsTxt` in [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robot too. |

These parameters can be changed through the Lua script or CLI arguments.

The priority order is: `CLI (highest priority) > Lua > Default values`

[lua-scraper]: ./lua_scraper.html#seed-definition
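
For example, a value set in Lua is used unless the corresponding CLI argument is given.
A minimal sketch (assuming a `scrape_logic.lua` script that sets `numWorkers` as shown
in the Lua override below):

```sh
# scrape_logic.lua contains: sws.crawlerConfig = { numWorkers = 2 }
# The CLI flag takes precedence, so the crawl runs with 8 workers:
sws crawl --script scrape_logic.lua --num-workers 8
```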

## Lua override

You can override parameters in Lua through the global variable `sws.crawlerConfig`.

| Parameter | Lua name | Example Lua value |
|----------------|--------------|-------------------------------------|
| user_agent | userAgent | "SWSbot" |
| page_buffer | pageBuffer | 10000 |
| throttle | throttle | { Concurrent = 100 } |
| num_workers | numWorkers | 4 |
| on_dl_error | onDlError | "SkipAndLog" |
| on_xml_error | onXmlError | "Fail" |
| on_scrap_error | onScrapError | "SkipAndLog" |
| robot | robot | "https://www.google.com/robots.txt" |


Here is an example of crawler configuration parameters set using Lua:

```lua
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
    userAgent = "SWSbot",
    pageBuffer = 10000,
    throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
    numWorkers = 4,
    onDlError = "SkipAndLog", -- or: "Fail"
    onXmlError = "SkipAndLog",
    onScrapError = "SkipAndLog",
    robot = nil,
}
```

## CLI override

You can override parameters through the CLI arguments.

| Parameter | CLI argument name | Example CLI argument value |
|----------------------|-------------------|-------------------------------------|
| user_agent | --user-agent | 'SWSbot' |
| page_buffer | --page-buffer | 10000 |
| throttle (Concurrent) | --conc-dl | 100 |
| throttle (PerSecond) | --rps | 10 |
| throttle (Delay) | --delay | 2 |
| num_workers | --num-workers | 4 |
| on_dl_error | --on-dl-error | skip-and-log |
| on_xml_error | --on-xml-error | fail |
| on_scrap_error | --on-scrap-error | skip-and-log |
| robot | --robot | 'https://www.google.com/robots.txt' |

Here is an example of crawler configuration parameters set using CLI arguments:

```sh
sws crawl --script path/to/scrape_logic.lua -o results.csv \
    --user-agent 'SWSbot' \
    --page-buffer 10000 \
    --conc-dl 100 \
    --num-workers 4 \
    --on-dl-error skip-and-log \
    --on-xml-error fail \
    --on-scrap-error skip-and-log \
    --robot 'https://www.google.com/robots.txt'
```
23 changes: 23 additions & 0 deletions doc/src/crawl_overview.md
@@ -0,0 +1,23 @@
# Subcommand: crawl

```text
Crawl sitemaps and scrap pages content

Usage: sws crawl [OPTIONS] --script <SCRIPT>

Options:
  -s, --script <SCRIPT>
          Path to the Lua script that defines scraping logic
  -o, --output-file <OUTPUT_FILE>
          Optional file that will contain scraped data, stdout otherwise
      --append
          Append to output file
      --truncate
          Truncate output file
  -q, --quiet
          Don't output logs
  -h, --help
          Print help information
```

More options are described in [CLI override](./crawl_config.md#cli-override).
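
For example, a minimal invocation combining the options above (the script path is just
a placeholder):

```sh
sws crawl --script scrape_logic.lua -o results.csv --append --quiet
```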
85 changes: 85 additions & 0 deletions doc/src/getting_started.md
@@ -0,0 +1,85 @@
# Getting Started

## Get the binary

Download the latest standalone binary for your OS on the [release][] page, and put it in
a location available in your `PATH`.

[release]: https://github.com/lerouxrgd/sws/releases

## Basic example

Let's create a simple `urbandict.lua` scraper for [Urban Dictionary][ud]. Copy and
paste the following command:

```sh
cat << 'EOF' > urbandict.lua
sws.seedPages = {
    "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
    for defIndex, def in page:select("section .definition"):enumerate() do
        local word = def:select("h1 a.word"):iter()()
        if not word then
            word = def:select("h2 a.word"):iter()()
        end
        if not word then
            goto continue
        end
        word = word:innerHtml()

        local contributor = def:select(".contributor"):iter()()
        local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
        date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

        local meaning = def:select(".meaning"):iter()()
        meaning = meaning:innerText():gsub("[\n\r]+", " ")

        local example = def:select(".example"):iter()()
        example = example:innerText():gsub("[\n\r]+", " ")

        if word and date and meaning and example then
            local record = sws.Record()
            record:pushField(word)
            record:pushField(defIndex)
            record:pushField(date)
            record:pushField(meaning)
            record:pushField(example)
            context:sendRecord(record)
        end

        ::continue::
    end
end
EOF
```

You can then run it with:

```sh
sws crawl --script urbandict.lua
```

As we have defined `sws.seedPages` to be a single page (that is [Urban Dictionary's
Lua][ud-lua] definition), the `scrapPage` function will be run on that single page
only. There are multiple seeding options which are detailed in the [Lua scraper - Seed
definition][lua-scraper] section.

By default the resulting CSV is written to stdout, but the `-o` (or `--output-file`)
option lets us specify a proper output file. Note that this file can also be appended
to or truncated, using the additional flags `--append` or `--truncate` respectively.
See the [crawl subcommand][crawl-doc] section for more details.
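
For example, to write the results of the run above to a CSV file that is overwritten on
each run:

```sh
sws crawl --script urbandict.lua --output-file urbandict.csv --truncate
```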

[ud]: https://www.urbandictionary.com/
[ud-lua]: https://www.urbandictionary.com/define.php?term=Lua
[lua-scraper]: ./lua_scraper.html#seed-definition
[crawl-doc]: ./crawl_overview.html

## Bash completion

You can source the completion script in your `~/.bashrc` file with:

```bash
echo 'source <(sws completion)' >> ~/.bashrc
```