
Commit 74d028e

Add mdbook doc
1 parent d43550a commit 74d028e

13 files changed: +747, -12 lines

.github/workflows/deploy.yml

+43
@@ -0,0 +1,43 @@
name: Deploy
on:
  push:
    branches:
      - doc # TODO: change to tag only

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write # To push a branch
      pages: write # To push to a GitHub Pages site
      id-token: write # To update the deployment status
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install latest mdbook
        run: |
          tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
          url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          mkdir mdbook
          curl -sSL $url | tar -xz --directory=./mdbook
          echo `pwd`/mdbook >> $GITHUB_PATH
      - name: Install latest mdbook-pagetoc
        run: |
          tag=$(curl 'https://api.github.com/repos/slowsage/mdbook-pagetoc/releases/latest' | jq -r '.tag_name')
          url="https://github.com/slowsage/mdbook-pagetoc/releases/download/${tag}/mdbook-pagetoc-${tag}-x86_64-unknown-linux-gnu.tar.gz"
          curl -sSL $url | tar -xz --directory=./mdbook
      - name: Build Book
        run: |
          cd doc
          mdbook build
      - name: Setup Pages
        uses: actions/configure-pages@v2
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload the built book
          path: 'doc/book'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1

README.md

-8
@@ -1,9 +1 @@
 # Sitemap Web Scraper
-
-## Bash completion
-
-Source the completion script in your `~/.bashrc` file:
-
-```bash
-echo 'source <(sws completion)' >> ~/.bashrc
-```

doc/.gitignore

+4
@@ -0,0 +1,4 @@
book
theme/index.hbs
theme/pagetoc.css
theme/pagetoc.js

doc/book.toml

+12
@@ -0,0 +1,12 @@
[book]
authors = ["Romain Leroux"]
language = "en"
multilingual = false
src = "src"
title = "Sitemap Web Scraper"

# https://crates.io/crates/mdbook-pagetoc
[preprocessor.pagetoc]
[output.html]
additional-css = ["theme/pagetoc.css"]
additional-js = ["theme/pagetoc.js"]

doc/src/README.md

+39
@@ -0,0 +1,39 @@
# Introduction

Sitemap Web Scraper, or [sws][], is a tool for simple, flexible, yet performant web
page scraping. It consists of a [CLI][] that executes a [Lua][] [JIT][lua-jit] script
and outputs a [CSV][] file.

All the logic for crawling/scraping is defined in Lua and executed on multiple threads
in [Rust][]. The actual parsing of HTML is done in Rust, and standard [CSS
selectors][css-sel] are also implemented in Rust (using Servo's [html5ever][] and
[selectors][]). Both functionalities are accessible through a Lua API for flexible
scraping logic.
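
For a quick feel of that Lua API, here is a minimal, illustrative sketch; it only
reuses calls that appear in the Getting Started example of this doc (`select`,
`enumerate`, `innerText`, `sws.Record`, `sendRecord`), and the CSS selector itself is
arbitrary:

```lua
-- Illustrative sketch: turn every <h1> of a page into one CSV record.
function scrapPage(page, context)
  for _, h1 in page:select("h1"):enumerate() do
    local record = sws.Record()
    record:pushField(h1:innerText()) -- text extraction is done by the Rust HTML parser
    context:sendRecord(record)       -- becomes one CSV row
  end
end
```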

As for the crawling logic, multiple seeding options are available: [robots.txt][robots],
[sitemaps][], or a custom list of HTML pages. By default, sitemaps (either provided or
extracted from `robots.txt`) are crawled recursively and the discovered HTML pages are
scraped with the provided Lua script. It is also possible to dynamically add page links
to the crawling queue when scraping an HTML page. See the [crawl][sub-crawl] subcommand
and the [Lua scraper][lua-scraper] for more details.
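
As a sketch of these seeding options (`seedPages` is used verbatim in the Getting
Started example; `seedRobotsTxt` is the field referenced in the Crawler Configuration
section, and its exact usage here is an assumption):

```lua
-- Seed with an explicit list of HTML pages (as in the Getting Started example):
sws.seedPages = {
  "https://www.urbandictionary.com/define.php?term=Lua",
}

-- Or seed from a robots.txt whose sitemaps are then crawled recursively
-- (assumed usage of the `seedRobotsTxt` field mentioned in the Crawler Configuration):
-- sws.seedRobotsTxt = "https://www.urbandictionary.com/robots.txt"
```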

Besides, the Lua scraping script can be used on HTML pages stored as local files,
without any crawling. See the [scrap][sub-scrap] subcommand doc for more details.

Furthermore, the CLI is composed of `crates` that can be used independently in a custom
Rust program.

[sws]: https://github.com/lerouxrgd/sws
[cli]: https://en.wikipedia.org/wiki/Command-line_interface
[rust]: https://www.rust-lang.org/
[lua]: https://www.lua.org/
[lua-jit]: https://luajit.org/
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[css-sel]: https://www.w3schools.com/cssref/css_selectors.asp
[html5ever]: https://crates.io/crates/html5ever
[selectors]: https://crates.io/crates/selectors
[robots]: https://en.wikipedia.org/wiki/Robots.txt
[sitemaps]: https://www.sitemaps.org/
[sub-crawl]: ./crawl_overview.html
[sub-scrap]: ./scrap_overview.html
[lua-scraper]: ./lua_scraper.html

doc/src/SUMMARY.md

+13
@@ -0,0 +1,13 @@
# Summary

[Introduction](README.md)

[Getting Started](getting_started.md)

- [Subcommand: crawl](./crawl_overview.md)
  - [Crawler Configuration](./crawl_config.md)

- [Subcommand: scrap](./scrap_overview.md)

- [Lua Scraper](./lua_scraper.md)
  - [Lua API Overview](./lua_api_overview.md)

doc/src/crawl_config.md

+83
@@ -0,0 +1,83 @@
# Crawler Config

The crawler's configurable parameters are:

| Parameter      | Default | Description |
|----------------|---------|-------------|
| user_agent     | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests |
| page_buffer    | 10_000 | The size of the pages download queue. When the queue is full, new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. |
| throttle       | `Concurrent(100)` if `robot` is `None` <br><br>Otherwise `Delay(N)` where `N` is read from the `robots.txt` field `Crawl-delay: N` | The throttling strategy for HTML page downloads. <br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, and `Delay(N)` means waiting `N` seconds between downloads |
| num_workers    | max(1, num_cpus-2) | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script |
| on_dl_error    | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. The other possible value is `Fail`. |
| on_xml_error   | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. The other possible value is `Fail`. |
| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. The other possible value is `Fail`. |
| robot          | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`. <br><br>⚠ Conflicts with `seedRobotsTxt` in the [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robot too. |

These parameters can be changed through the Lua script or through CLI arguments.

The priority order is: `CLI (highest priority) > Lua > Default values`

[lua-scraper]: ./lua_scraper.html#seed-definition

## Lua override

You can override parameters in Lua through the global variable `sws.crawlerConfig`.

| Parameter      | Lua name     | Example Lua value                   |
|----------------|--------------|-------------------------------------|
| user_agent     | userAgent    | "SWSbot"                            |
| page_buffer    | pageBuffer   | 10000                               |
| throttle       | throttle     | { Concurrent = 100 }                |
| num_workers    | numWorkers   | 4                                   |
| on_dl_error    | onDlError    | "SkipAndLog"                        |
| on_xml_error   | onXmlError   | "Fail"                              |
| on_scrap_error | onScrapError | "SkipAndLog"                        |
| robot          | robot        | "https://www.google.com/robots.txt" |

Here is an example of crawler configuration parameters set using Lua:

```lua
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}
```

## CLI override

You can override parameters through CLI arguments.

| Parameter             | CLI argument name | Example CLI argument value          |
|-----------------------|-------------------|-------------------------------------|
| user_agent            | --user-agent      | 'SWSbot'                            |
| page_buffer           | --page-buffer     | 10000                               |
| throttle (Concurrent) | --conc-dl         | 100                                 |
| throttle (PerSecond)  | --rps             | 10                                  |
| throttle (Delay)      | --delay           | 2                                   |
| num_workers           | --num-workers     | 4                                   |
| on_dl_error           | --on-dl-error     | skip-and-log                        |
| on_xml_error          | --on-xml-error    | fail                                |
| on_scrap_error        | --on-scrap-error  | skip-and-log                        |
| robot                 | --robot           | 'https://www.google.com/robots.txt' |

Here is an example of crawler configuration parameters set using CLI arguments:

```sh
sws crawl --script path/to/scrape_logic.lua -o results.csv \
  --user-agent 'SWSbot' \
  --page-buffer 10000 \
  --conc-dl 100 \
  --num-workers 4 \
  --on-dl-error skip-and-log \
  --on-xml-error fail \
  --on-scrap-error skip-and-log \
  --robot 'https://www.google.com/robots.txt'
```

doc/src/crawl_overview.md

+23
@@ -0,0 +1,23 @@
# Subcommand: crawl

```text
Crawl sitemaps and scrap pages content

Usage: sws crawl [OPTIONS] --script <SCRIPT>

Options:
  -s, --script <SCRIPT>
          Path to the Lua script that defines scraping logic
  -o, --output-file <OUTPUT_FILE>
          Optional file that will contain scraped data, stdout otherwise
      --append
          Append to output file
      --truncate
          Truncate output file
  -q, --quiet
          Don't output logs
  -h, --help
          Print help information
```

More options in [CLI override](./crawl_config.md#cli-override)

doc/src/getting_started.md

+85
@@ -0,0 +1,85 @@
# Getting Started

## Get the binary

Download the latest standalone binary for your OS from the [release][] page, and put it
in a location available in your `PATH`.

[release]: https://github.com/lerouxrgd/sws/releases

## Basic example

Let's create a simple `urbandict.lua` scraper for [Urban Dictionary][ud]. Copy-paste
the following command:

```sh
cat << 'EOF' > urbandict.lua
sws.seedPages = {
  "https://www.urbandictionary.com/define.php?term=Lua"
}

function scrapPage(page, context)
  for defIndex, def in page:select("section .definition"):enumerate() do
    local word = def:select("h1 a.word"):iter()()
    if not word then
      word = def:select("h2 a.word"):iter()()
    end
    if not word then
      goto continue
    end
    word = word:innerHtml()

    local contributor = def:select(".contributor"):iter()()
    local date = string.match(contributor:innerHtml(), ".*\\?</a>%s*(.*)\\?")
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d")

    local meaning = def:select(".meaning"):iter()()
    meaning = meaning:innerText():gsub("[\n\r]+", " ")

    local example = def:select(".example"):iter()()
    example = example:innerText():gsub("[\n\r]+", " ")

    if word and date and meaning and example then
      local record = sws.Record()
      record:pushField(word)
      record:pushField(defIndex)
      record:pushField(date)
      record:pushField(meaning)
      record:pushField(example)
      context:sendRecord(record)
    end

    ::continue::
  end
end
EOF
```

You can then run it with:

```sh
sws crawl --script urbandict.lua
```

As we have defined `sws.seedPages` to be a single page (namely [Urban Dictionary's
Lua][ud-lua] definition), the `scrapPage` function will be run on that single page
only. There are multiple seeding options, which are detailed in the [Lua scraper - Seed
definition][lua-scraper] section.

By default the resulting CSV is written to stdout; the `-o` (or `--output-file`) option
lets us specify a proper output file. Note that this file can also be appended to or
truncated, using the additional `--append` or `--truncate` flags respectively. See the
[crawl subcommand][crawl-doc] section for more details.

[ud]: https://www.urbandictionary.com/
[ud-lua]: https://www.urbandictionary.com/define.php?term=Lua
[lua-scraper]: ./lua_scraper.html#seed-definition
[crawl-doc]: ./crawl_overview.html

## Bash completion

You can source the completion script in your `~/.bashrc` file with:

```bash
echo 'source <(sws completion)' >> ~/.bashrc
```
