Add mdbook doc
lerouxrgd committed Dec 16, 2023
1 parent d43550a commit 7f30833
Showing 13 changed files with 748 additions and 12 deletions.
44 changes: 44 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,44 @@
name: Deploy
on:
  push:
    branches:
      - doc # TODO: change to tag only

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write # To push a branch
      pages: write # To push to a GitHub Pages site
      id-token: write # To update the deployment status
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install latest mdbook
        run: |
          tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
          url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          mkdir mdbook
          curl -sSL $url | tar -xz --directory=./mdbook
          echo `pwd`/mdbook >> $GITHUB_PATH
      - name: Install latest mdbook-toc
        run: |
          tag=$(curl 'https://api.github.com/repos/badboy/mdbook-toc/releases/latest' | jq -r '.tag_name')
          url="https://github.com/badboy/mdbook-toc/releases/download/${tag}/mdbook-toc-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          echo $url
          curl -sSL $url | tar -xz --directory=./mdbook
      - name: Build Book
        run: |
          cd doc
          mdbook build
      - name: Setup Pages
        uses: actions/configure-pages@v2
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload the built book (mdbook outputs to doc/book)
          path: 'doc/book'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1
8 changes: 0 additions & 8 deletions README.md
@@ -1,9 +1 @@
# Sitemap Web Scraper

## Bash completion

Source the completion script in your `~/.bashrc` file:

```bash
echo 'source <(sws completion)' >> ~/.bashrc
```
4 changes: 4 additions & 0 deletions doc/.gitignore
@@ -0,0 +1,4 @@
book
theme/index.hbs
theme/pagetoc.css
theme/pagetoc.js
12 changes: 12 additions & 0 deletions doc/book.toml
@@ -0,0 +1,12 @@
[book]
authors = ["Romain Leroux"]
language = "en"
multilingual = false
src = "src"
title = "Sitemap Web Scraper"

# https://crates.io/crates/mdbook-pagetoc
[preprocessor.pagetoc]
[output.html]
additional-css = ["theme/pagetoc.css"]
additional-js = ["theme/pagetoc.js"]
39 changes: 39 additions & 0 deletions doc/src/README.md
@@ -0,0 +1,39 @@
# Introduction

Sitemap Web Scraper, or [sws][], is a tool for simple, flexible, yet performant web
page scraping. It consists of a [CLI][] that executes a [Lua][] [JIT][lua-jit] script
and outputs a [CSV][] file.

All the crawling/scraping logic is defined in Lua and executed on multiple threads in
[Rust][]. The actual HTML parsing is done in Rust, and standard [CSS
selectors][css-sel] are also implemented in Rust (using Servo's [html5ever][] and
[selectors][]). Both functionalities are exposed through a Lua API for flexible
scraping logic.

As for the crawling logic, multiple seeding options are available: [robots.txt][robots],
[sitemaps][], or a custom list of HTML pages. By default, sitemaps (either provided or
extracted from `robots.txt`) are crawled recursively and the discovered HTML pages are
scraped with the provided Lua script. It's also possible to dynamically add page links
to the crawling queue when scraping an HTML page. See the [crawl][sub-crawl] subcommand
and the [Lua scraper][lua-scraper] for more details.
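
For instance, seeding can be declared directly in the Lua script. A minimal sketch
(`sws.seedPages` is used in the Getting Started example and `seedRobotsTxt` is
referenced in the crawler config; the exact form of the other fields shown here is an
assumption, see the [Lua scraper][lua-scraper] seed definition for the actual API):

```lua
-- Pick one seeding option (sketch only):
sws.seedPages = { "https://example.org/some/page.html" }    -- explicit list of HTML pages
-- sws.seedRobotsTxt = "https://example.org/robots.txt"     -- crawl sitemaps listed in robots.txt
-- sws.seedSitemaps = { "https://example.org/sitemap.xml" } -- assumed name for direct sitemap seeding
```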

The Lua scraping script can also be used on HTML pages stored as local files, without
any crawling. See the [scrap][sub-scrap] subcommand for more details.

Furthermore, the CLI is composed of `crates` that can be used independently in a custom
Rust program.

[sws]: https://github.com/lerouxrgd/sws
[cli]: https://en.wikipedia.org/wiki/Command-line_interface
[rust]: https://www.rust-lang.org/
[lua]: https://www.lua.org/
[lua-jit]: https://luajit.org/
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[css-sel]: https://www.w3schools.com/cssref/css_selectors.asp
[html5ever]: https://crates.io/crates/html5ever
[selectors]: https://crates.io/crates/selectors
[robots]: https://en.wikipedia.org/wiki/Robots.txt
[sitemaps]: https://www.sitemaps.org/
[sub-crawl]: ./crawl_overview.html
[sub-scrap]: ./scrap_overview.html
[lua-scraper]: ./lua_scraper.html
13 changes: 13 additions & 0 deletions doc/src/SUMMARY.md
@@ -0,0 +1,13 @@
# Summary

[Introduction](README.md)

[Getting Started](getting_started.md)

- [Subcommand: crawl](./crawl_overview.md)
  - [Crawler Configuration](./crawl_config.md)

- [Subcommand: scrap](./scrap_overview.md)

- [Lua Scraper](./lua_scraper.md)
  - [Lua API Overview](./lua_api_overview.md)
83 changes: 83 additions & 0 deletions doc/src/crawl_config.md
@@ -0,0 +1,83 @@
# Crawler Config

The configurable crawler parameters are:

| Parameter | Default | Description |
|----------------|--------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| user_agent | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests |
| page_buffer | 10_000 | The size of the pages download queue. When the queue is full new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. |
| throttle | `Concurrent(100)` if `robot` is `None` <br><br>Otherwise `Delay(N)` where `N` is read from `robots.txt` field `Crawl-delay: N` | The throttling strategy for HTML page downloads. <br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, `Delay(N)` means wait for `N` seconds between downloads |
| num_workers | max(1, num_cpus-2) | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script. |
| on_dl_error | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. Other possible value is `Fail`. |
| on_xml_error | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. Other possible value is `Fail`. |
| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. Other possible value is `Fail`. |
| robot | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`. <br><br>⚠ Conflicts with `seedRobotsTxt` in [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robot too. |

These parameters can be changed through the Lua script or CLI arguments.

The priority order is: `CLI (highest priority) > Lua > Default values`

[lua-scraper]: ./lua_scraper.html#seed-definition
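
For example, a value set in Lua is used unless the corresponding CLI argument is given.
A minimal sketch (assuming a `scrape_logic.lua` script that sets `numWorkers` as shown
in the Lua override below):

```sh
# scrape_logic.lua contains: sws.crawlerConfig = { numWorkers = 2 }
# The CLI flag takes precedence, so the crawl runs with 8 workers:
sws crawl --script scrape_logic.lua --num-workers 8
```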

## Lua override

You can override parameters in Lua through the global variable `sws.crawlerConfig`.

| Parameter | Lua name | Example Lua value |
|----------------|--------------|-------------------------------------|
| user_agent | userAgent | "SWSbot" |
| page_buffer | pageBuffer | 10000 |
| throttle | throttle | { Concurrent = 100 } |
| num_workers | numWorkers | 4 |
| on_dl_error | onDlError | "SkipAndLog" |
| on_xml_error | onXmlError | "Fail" |
| on_scrap_error | onScrapError | "SkipAndLog" |
| robot | robot | "https://www.google.com/robots.txt" |


Here is an example of crawler configuration parameters set using Lua:

```lua
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
    userAgent = "SWSbot",
    pageBuffer = 10000,
    throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
    numWorkers = 4,
    onDlError = "SkipAndLog", -- or: "Fail"
    onXmlError = "SkipAndLog",
    onScrapError = "SkipAndLog",
    robot = nil,
}
```

## CLI override

You can override parameters through the CLI arguments.

| Parameter | CLI argument name | Example CLI argument value |
|----------------------|-------------------|-------------------------------------|
| user_agent | --user-agent | 'SWSbot' |
| page_buffer | --page-buffer | 10000 |
| throttle (Concurrent) | --conc-dl | 100 |
| throttle (PerSecond) | --rps | 10 |
| throttle (Delay) | --delay | 2 |
| num_workers | --num-workers | 4 |
| on_dl_error | --on-dl-error | skip-and-log |
| on_xml_error | --on-xml-error | fail |
| on_scrap_error | --on-scrap-error | skip-and-log |
| robot | --robot | 'https://www.google.com/robots.txt' |

Here is an example of crawler configuration parameters set using CLI arguments:

```sh
sws crawl --script path/to/scrape_logic.lua -o results.csv \
    --user-agent 'SWSbot' \
    --page-buffer 10000 \
    --conc-dl 100 \
    --num-workers 4 \
    --on-dl-error skip-and-log \
    --on-xml-error fail \
    --on-scrap-error skip-and-log \
    --robot 'https://www.google.com/robots.txt'
```
23 changes: 23 additions & 0 deletions doc/src/crawl_overview.md
@@ -0,0 +1,23 @@
# Subcommand: crawl

```text
Crawl sitemaps and scrap pages content

Usage: sws crawl [OPTIONS] --script <SCRIPT>

Options:
  -s, --script <SCRIPT>
          Path to the Lua script that defines scraping logic
  -o, --output-file <OUTPUT_FILE>
          Optional file that will contain scraped data, stdout otherwise
      --append
          Append to output file
      --truncate
          Truncate output file
  -q, --quiet
          Don't output logs
  -h, --help
          Print help information
```

More options are described in [CLI override](./crawl_config.md#cli-override).
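
For example, a minimal invocation combining the options above (the script path is just
a placeholder):

```sh
sws crawl --script scrape_logic.lua -o results.csv --append --quiet
```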
85 changes: 85 additions & 0 deletions doc/src/getting_started.md
@@ -0,0 +1,85 @@
# Getting Started

## Get the binary

Download the latest standalone binary for your OS on the [release][] page, and put it in
a location available in your `PATH`.

[release]: https://github.com/lerouxrgd/sws/releases

## Basic example

Let's create a simple `urbandict.lua` scraper for [Urban Dictionary][ud]. Copy and
paste the following command:

```sh
cat << 'EOF' > urbandict.lua
sws.seedPages = {
    "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
    for defIndex, def in page:select("section .definition"):enumerate() do
        local word = def:select("h1 a.word"):iter()()
        if not word then
            word = def:select("h2 a.word"):iter()()
        end
        if not word then
            goto continue
        end
        word = word:innerHtml()

        local contributor = def:select(".contributor"):iter()()
        local date = string.match(contributor:innerHtml(), ".*</a>%s*(.*)")
        date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

        local meaning = def:select(".meaning"):iter()()
        meaning = meaning:innerText():gsub("[\n\r]+", " ")

        local example = def:select(".example"):iter()()
        example = example:innerText():gsub("[\n\r]+", " ")

        if word and date and meaning and example then
            local record = sws.Record()
            record:pushField(word)
            record:pushField(defIndex)
            record:pushField(date)
            record:pushField(meaning)
            record:pushField(example)
            context:sendRecord(record)
        end

        ::continue::
    end
end
EOF
```

You can then run it with:

```sh
sws crawl --script urbandict.lua
```

As we have defined `sws.seedPages` to be a single page (that is [Urban Dictionary's
Lua][ud-lua] definition), the `scrapPage` function will be run on that single page
only. There are multiple seeding options which are detailed in the [Lua scraper - Seed
definition][lua-scraper] section.

By default the resulting CSV is written to stdout, but the `-o` (or `--output-file`)
option lets us specify a proper output file. Note that this file can also be appended
to or truncated, using the additional flags `--append` or `--truncate` respectively.
See the [crawl subcommand][crawl-doc] section for more details.
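
For example, to write the results of the run above to a CSV file that is overwritten on
each run:

```sh
sws crawl --script urbandict.lua --output-file urbandict.csv --truncate
```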

[ud]: https://www.urbandictionary.com/
[ud-lua]: https://www.urbandictionary.com/define.php?term=Lua
[lua-scraper]: ./lua_scraper.html#seed-definition
[crawl-doc]: ./crawl_overview.html

## Bash completion

You can source the completion script in your `~/.bashrc` file with:

```bash
echo 'source <(sws completion)' >> ~/.bashrc
```