Showing 13 changed files with 749 additions and 12 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
name: Deploy | ||
on: | ||
  push: | ||
    branches: | ||
      - doc # TODO: change to tag only | ||
|
||
jobs: | ||
  deploy: | ||
    runs-on: ubuntu-latest | ||
    permissions: | ||
      contents: write # To push a branch | ||
      pages: write # To push to a GitHub Pages site | ||
      id-token: write # To update the deployment status | ||
    steps: | ||
      - uses: actions/checkout@v4 | ||
        with: | ||
          fetch-depth: 0 | ||
      - name: Install latest mdbook | ||
        run: | | ||
          tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name') | ||
          url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz" | ||
          mkdir mdbook | ||
          curl -sSL $url | tar -xz --directory=./mdbook | ||
          echo `pwd`/mdbook >> $GITHUB_PATH | ||
      - name: Install latest mdbook-pagetoc | ||
        run: | | ||
          tag=$(curl 'https://api.github.com/repos/slowsage/mdbook-pagetoc/releases/latest' | jq -r '.tag_name') | ||
          url="https://github.com/slowsage/mdbook-pagetoc/releases/download/${tag}/mdbook-pagetoc-${tag}-x86_64-unknown-linux-gnu.tar.gz" | ||
          curl -sSL $url | tar -xz --directory=./mdbook | ||
      - name: Run tests | ||
        run: mdbook test | ||
      - name: Build Book | ||
        run: | | ||
          cd doc | ||
          mdbook build | ||
      - name: Setup Pages | ||
        uses: actions/configure-pages@v2 | ||
      - name: Upload artifact | ||
        uses: actions/upload-pages-artifact@v1 | ||
        with: | ||
          # Upload entire repository | ||
          path: 'doc/book' | ||
      - name: Deploy to GitHub Pages | ||
        id: deployment | ||
        uses: actions/deploy-pages@v1 |
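Both install steps in the workflow above build a release tarball URL from the latest tag in the same way. As a minimal sketch of that shared pattern, the helper below factors it out. The `release_url` function name and the sample tag are hypothetical and are not part of this commit; only the URL layout is taken from the workflow.

```sh
# Hypothetical helper: build the Linux x86_64 release tarball URL
# following the naming pattern used by the workflow's install steps.
release_url() {
  # $1 = owner/repo, $2 = binary name, $3 = release tag
  printf 'https://github.com/%s/releases/download/%s/%s-%s-x86_64-unknown-linux-gnu.tar.gz\n' \
    "$1" "$3" "$2" "$3"
}

# Illustrative tag only; the workflow resolves the real one from the GitHub API.
release_url rust-lang/mdbook mdbook v0.4.40
```

In a workflow step, the result would be quoted when passed on, for example `curl -sSL "$(release_url ...)" | tar -xz --directory=./mdbook`, so that an unexpected character in the tag cannot split the argument.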
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1 @@ | ||
# Sitemap Web Scraper | ||
|
||
## Bash completion | ||
|
||
Source the completion script in your `~/.bashrc` file: | ||
|
||
```bash | ||
echo 'source <(sws completion)' >> ~/.bashrc | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
book | ||
theme/index.hbs | ||
theme/pagetoc.css | ||
theme/pagetoc.js |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
[book] | ||
authors = ["Romain Leroux"] | ||
language = "en" | ||
multilingual = false | ||
src = "src" | ||
title = "Sitemap Web Scraper" | ||
|
||
# https://crates.io/crates/mdbook-pagetoc | ||
[preprocessor.pagetoc] | ||
[output.html] | ||
additional-css = ["theme/pagetoc.css"] | ||
additional-js = ["theme/pagetoc.js"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# Introduction | ||
|
||
Sitemap Web Scraper, or [sws][], is a tool for simple, flexible, and yet performant web | ||
page scraping. It consists of a [CLI][] that executes a [Lua][] [JIT][lua-jit] script | ||
and outputs a [CSV][] file. | ||
|
||
All the logic for crawling/scraping is defined in Lua and executed on multiple threads | ||
in [Rust][]. The actual parsing of HTML is done in Rust. Standard [CSS | ||
selectors][css-sel] are also implemented in Rust (using Servo's [html5ever][] and | ||
[selectors][]). Both functionalities are accessible through a Lua API for flexible | ||
scraping logic. | ||
|
||
As for the crawling logic, multiple seeding options are available: [robots.txt][robots], | ||
[sitemaps][], or a custom HTML pages list. By default, sitemaps (either provided or | ||
extracted from `robots.txt`) will be crawled recursively and the discovered HTML pages | ||
will be scraped with the provided Lua script. It's also possible to dynamically add page | ||
links to the crawling queue when scraping an HTML page. See the [crawl][sub-crawl] | ||
subcommand and the [Lua scraper][lua-scraper] for more details. | ||
|
||
The Lua scraping script can also be used on HTML pages stored as local files, | ||
without any crawling. See the [scrap][sub-scrap] subcommand documentation for more details. | ||
|
||
Furthermore, the CLI is composed of `crates` that can be used independently in a custom | ||
Rust program. | ||
|
||
[sws]: https://github.com/lerouxrgd/sws | ||
[cli]: https://en.wikipedia.org/wiki/Command-line_interface | ||
[rust]: https://www.rust-lang.org/ | ||
[lua]: https://www.lua.org/ | ||
[lua-jit]: https://luajit.org/ | ||
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values | ||
[css-sel]: https://www.w3schools.com/cssref/css_selectors.asp | ||
[html5ever]: https://crates.io/crates/html5ever | ||
[selectors]: https://crates.io/crates/selectors | ||
[robots]: https://en.wikipedia.org/wiki/Robots.txt | ||
[sitemaps]: https://www.sitemaps.org/ | ||
[sub-crawl]: ./crawl_overview.html | ||
[sub-scrap]: ./scrap_overview.html | ||
[lua-scraper]: ./lua_scraper.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Summary | ||
|
||
[Introduction](README.md) | ||
|
||
[Getting Started](getting_started.md) | ||
|
||
- [Subcommand: crawl](./crawl_overview.md) | ||
- [Crawler Configuration](./crawl_config.md) | ||
|
||
- [Subcommand: scrap](./scrap_overview.md) | ||
|
||
- [Lua Scraper](./lua_scraper.md) | ||
- [Lua API Overview](./lua_api_overview.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Crawler Config | ||
|
||
The configurable crawler parameters are: | ||
|
||
| Parameter | Default | Description | | ||
|----------------|--------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| user_agent | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests | | ||
| page_buffer | 10_000 | The size of the pages download queue. When the queue is full new downloads are on hold. This parameter is particularly relevant when using concurrent throttling. | | ||
| throttle | `Concurrent(100)` if `robot` is `None` <br><br>Otherwise `Delay(N)` where `N` is read from `robots.txt` field `Crawl-delay: N` | A throttling strategy for HTML page downloads. <br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, `Delay(N)` means wait for `N` seconds between downloads | | ||
| num_workers | max(1, num_cpus-2) | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script. | | ||
| on_dl_error | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. Other possible value is `Fail`. | | ||
| on_xml_error | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. Other possible value is `Fail`. | | ||
| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. Other possible value is `Fail`. | | ||
| robot | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`. <br><br>⚠ Conflicts with `seedRobotsTxt` in [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robot too. | | ||
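
The `num_workers` default in the table is a small piece of arithmetic. The sketch below mirrors the documented formula `max(1, num_cpus-2)` in shell so the resulting value can be checked on a given machine. The `default_num_workers` function is hypothetical and only illustrates the formula; it is not how sws computes the value internally.

```sh
# Illustrative only: mirror the documented default max(1, num_cpus - 2).
default_num_workers() {
  # $1 = number of CPU cores
  n=$(( $1 - 2 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  printf '%s\n' "$n"
}

# getconf _NPROCESSORS_ONLN reports the number of online cores on Linux and macOS.
default_num_workers "$(getconf _NPROCESSORS_ONLN)"
```

On an 8-core machine this yields 6 workers, while machines with one or two cores still get a single worker.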
|
||
These parameters can be changed through the Lua script or through CLI arguments. | ||
|
||
The priority order is: `CLI (highest priority) > Lua > Default values` | ||
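
To make the priority order concrete, here is a minimal shell sketch of the rule itself: the first value that is set wins, checking the CLI first, then Lua, then the default. The `resolve_param` function and the sample values `CliBot` and `LuaBot` are hypothetical; this illustrates the documented precedence only and is not the sws implementation.

```sh
# Illustrative only: pick the first non-empty value among
# the CLI argument, the Lua script value, and the built-in default.
resolve_param() {
  # $1 = CLI value, $2 = Lua value, $3 = default value
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "$2" ]; then
    printf '%s\n' "$2"
  else
    printf '%s\n' "$3"
  fi
}

resolve_param ""       ""       "SWSbot"   # nothing overridden: default applies
resolve_param ""       "LuaBot" "SWSbot"   # Lua overrides the default
resolve_param "CliBot" "LuaBot" "SWSbot"   # CLI overrides Lua
```

In practice this means a value set in `sws.crawlerConfig` acts as a per-script default that can still be adjusted for a single run from the command line.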
|
||
[lua-scraper]: ./lua_scraper.html#seed-definition | ||
|
||
## Lua override | ||
|
||
You can override parameters in Lua through the global variable `sws.crawlerConfig`. | ||
|
||
| Parameter | Lua name | Example Lua value | | ||
|----------------|--------------|-------------------------------------| | ||
| user_agent | userAgent | "SWSbot" | | ||
| page_buffer | pageBuffer | 10000 | | ||
| throttle | throttle | { Concurrent = 100 } | | ||
| num_workers | numWorkers | 4 | | ||
| on_dl_error | onDlError | "SkipAndLog" | | ||
| on_xml_error | onXmlError | "Fail" | | ||
| on_scrap_error | onScrapError | "SkipAndLog" | | ||
| robot | robot | "https://www.google.com/robots.txt" | | ||
|
||
|
||
Here is an example of crawler configuration parameters set using Lua: | ||
|
||
```lua | ||
-- You don't have to specify all parameters, only the ones you want to override. | ||
sws.crawlerConfig = { | ||
  userAgent = "SWSbot", | ||
  pageBuffer = 10000, | ||
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 } | ||
  numWorkers = 4, | ||
  onDlError = "SkipAndLog", -- or: "Fail" | ||
  onXmlError = "SkipAndLog", | ||
  onScrapError = "SkipAndLog", | ||
  robot = nil, | ||
} | ||
``` | ||
|
||
## CLI override | ||
|
||
You can override parameters through the CLI arguments. | ||
|
||
| Parameter | CLI argument name | Example CLI argument value | | ||
|----------------------|-------------------|-------------------------------------| | ||
| user_agent | --user-agent | 'SWSbot' | | ||
| page_buffer | --page-buffer | 10000 | | ||
| throttle (Concurrent) | --conc-dl | 100 | | ||
| throttle (PerSecond) | --rps | 10 | | ||
| throttle (Delay) | --delay | 2 | | ||
| num_workers | --num-workers | 4 | | ||
| on_dl_error | --on-dl-error | skip-and-log | | ||
| on_xml_error | --on-xml-error | fail | | ||
| on_scrap_error | --on-scrap-error | skip-and-log | | ||
| robot | --robot | 'https://www.google.com/robots.txt' | | ||
|
||
Here is an example of crawler configuration parameters set using CLI arguments: | ||
|
||
```sh | ||
sws crawl --script path/to/scrape_logic.lua -o results.csv \ | ||
    --user-agent 'SWSbot' \ | ||
    --page-buffer 10000 \ | ||
    --conc-dl 100 \ | ||
    --num-workers 4 \ | ||
    --on-dl-error skip-and-log \ | ||
    --on-xml-error fail \ | ||
    --on-scrap-error skip-and-log \ | ||
    --robot 'https://www.google.com/robots.txt' | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Subcommand: crawl | ||
|
||
```text | ||
Crawl sitemaps and scrap pages content | ||
Usage: sws crawl [OPTIONS] --script <SCRIPT> | ||
Options: | ||
  -s, --script <SCRIPT> | ||
          Path to the Lua script that defines scraping logic | ||
  -o, --output-file <OUTPUT_FILE> | ||
          Optional file that will contain scraped data, stdout otherwise | ||
      --append | ||
          Append to output file | ||
      --truncate | ||
          Truncate output file | ||
  -q, --quiet | ||
          Don't output logs | ||
  -h, --help | ||
          Print help information | ||
``` | ||
|
||
More options in [CLI override](./crawl_config.md#cli-override) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
# Getting Started | ||
|
||
## Get the binary | ||
|
||
Download the latest standalone binary for your OS on the [release][] page, and put it in | ||
a location available in your `PATH`. | ||
|
||
[release]: https://github.com/lerouxrgd/sws/releases | ||
|
||
## Basic example | ||
|
||
Let's create a simple `urbandict.lua` scraper for [Urban Dictionary][ud]. Copy and paste | ||
the following command: | ||
|
||
```sh | ||
cat << 'EOF' > urbandict.lua | ||
sws.seedPages = { | ||
  "https://www.urbandictionary.com/define.php?term=Lua" | ||
} | ||
function scrapPage(page, context) | ||
  for defIndex, def in page:select("section .definition"):enumerate() do | ||
    local word = def:select("h1 a.word"):iter()() | ||
    if not word then | ||
      word = def:select("h2 a.word"):iter()() | ||
    end | ||
    if not word then | ||
      goto continue | ||
    end | ||
    word = word:innerHtml() | ||
    local contributor = def:select(".contributor"):iter()() | ||
    local date = string.match(contributor:innerHtml(), ".*\\?</a>%s*(.*)\\?") | ||
    date = sws.Date(date, "%B %d, %Y"):format("%Y-%m-%d") | ||
    local meaning = def:select(".meaning"):iter()() | ||
    meaning = meaning:innerText():gsub("[\n\r]+", " ") | ||
    local example = def:select(".example"):iter()() | ||
    example = example:innerText():gsub("[\n\r]+", " ") | ||
    if word and date and meaning and example then | ||
      local record = sws.Record() | ||
      record:pushField(word) | ||
      record:pushField(defIndex) | ||
      record:pushField(date) | ||
      record:pushField(meaning) | ||
      record:pushField(example) | ||
      context:sendRecord(record) | ||
    end | ||
    ::continue:: | ||
  end | ||
end | ||
EOF | ||
``` | ||
|
||
You can then run it with: | ||
|
||
```sh | ||
sws crawl --script urbandict.lua | ||
``` | ||
|
||
As we have defined `sws.seedPages` to be a single page (that is [Urban Dictionary's | ||
Lua][ud-lua] definition), the `scrapPage` function will be run on that single page | ||
only. There are multiple seeding options which are detailed in the [Lua scraper - Seed | ||
definition][lua-scraper] section. | ||
|
||
By default the resulting CSV output is written to stdout; the `-o` (or | ||
`--output-file`) option lets us specify an output file instead. Note that this file can | ||
also be appended to or truncated, using the additional flags `--append` or `--truncate` | ||
respectively. See the [crawl subcommand][crawl-doc] section for more details. | ||
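
As a sketch of how the output flags combine, the two wrappers below call `sws crawl` with only the options listed in the crawl subcommand help. The function names `crawl_fresh` and `crawl_append` are hypothetical, and the assumption that `--truncate` and `--append` are meant to be used together with `--output-file` is inferred from the paragraph above rather than stated by the help text.

```sh
# Hypothetical wrappers around the documented `sws crawl` flags.

# Write results to a file, truncating any previous content.
crawl_fresh() {
  # $1 = Lua script, $2 = output CSV file
  sws crawl --script "$1" --output-file "$2" --truncate
}

# Write results to a file, appending to any previous content.
crawl_append() {
  # $1 = Lua script, $2 = output CSV file
  sws crawl --script "$1" --output-file "$2" --append
}
```

For example, `crawl_fresh urbandict.lua urbandict.csv` would start a new result file, and a later `crawl_append urbandict.lua urbandict.csv` would add records to it.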
|
||
[ud]: https://www.urbandictionary.com/ | ||
[ud-lua]: https://www.urbandictionary.com/define.php?term=Lua | ||
[lua-scraper]: ./lua_scraper.html#seed-definition | ||
[crawl-doc]: ./crawl_overview.html | ||
|
||
## Bash completion | ||
|
||
You can source the completion script in your `~/.bashrc` file with: | ||
|
||
```bash | ||
echo 'source <(sws completion)' >> ~/.bashrc | ||
``` |