# Crawler Config

The crawler's configurable parameters are:

| Parameter      | Default | Description |
|----------------|---------|-------------|
| user_agent     | "SWSbot" | The `User-Agent` header that will be used in all HTTP requests |
| page_buffer    | 10_000 | The size of the page download queue. When the queue is full, new downloads are put on hold. This parameter is particularly relevant when using concurrent throttling. |
| throttle       | `Concurrent(100)` if `robot` is `None` <br><br>Otherwise `Delay(N)` where `N` is read from the `robots.txt` field `Crawl-delay: N` | The throttling strategy used when downloading HTML pages. <br><br>`Concurrent(N)` means at most `N` downloads at the same time, `PerSecond(N)` means at most `N` downloads per second, and `Delay(N)` means wait `N` seconds between downloads |
| num_workers    | max(1, num_cpus-2) | The number of CPU cores that will be used to scrape pages in parallel using the provided Lua script. |
| on_dl_error    | `SkipAndLog` | Behaviour when an error occurs while downloading an HTML page. The other possible value is `Fail`. |
| on_xml_error   | `SkipAndLog` | Behaviour when an error occurs while processing an XML sitemap. The other possible value is `Fail`. |
| on_scrap_error | `SkipAndLog` | Behaviour when an error occurs while scraping an HTML page in Lua. The other possible value is `Fail`. |
| robot          | `None` | An optional `robots.txt` URL used to retrieve a specific `Throttle::Delay`. <br><br>⚠ Conflicts with `seedRobotsTxt` in the [Lua Scraper][lua-scraper], meaning that when `robot` is defined the `seed` cannot be a robot too. |

These parameters can be changed through the Lua script or through CLI arguments.

The priority order is: `CLI (highest priority) > Lua > Default values`

[lua-scraper]: ./lua_scraper.html#seed-definition
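To make the priority order concrete, here is a hypothetical invocation (the script name and values are illustrative): even if `scrape_logic.lua` sets `numWorkers = 2` in `sws.crawlerConfig`, the CLI flag below takes precedence, so the crawler runs with 4 workers.

```sh
# CLI beats Lua, which in turn beats the default values
sws --script path/to/scrape_logic.lua -o results.csv --num-workers 4
```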

## Lua override

You can override parameters in Lua through the global variable `sws.crawlerConfig`.

| Parameter      | Lua name     | Example Lua value                   |
|----------------|--------------|-------------------------------------|
| user_agent     | userAgent    | "SWSbot"                            |
| page_buffer    | pageBuffer   | 10000                               |
| throttle       | throttle     | { Concurrent = 100 }                |
| num_workers    | numWorkers   | 4                                   |
| on_dl_error    | onDlError    | "SkipAndLog"                        |
| on_xml_error   | onXmlError   | "Fail"                              |
| on_scrap_error | onScrapError | "SkipAndLog"                        |
| robot          | robot        | "https://www.google.com/robots.txt" |

Here is an example of crawler configuration parameters set using Lua:

```lua
-- You don't have to specify all parameters, only the ones you want to override.
sws.crawlerConfig = {
  userAgent = "SWSbot",
  pageBuffer = 10000,
  throttle = { Concurrent = 100 }, -- or: { PerSecond = 100 }, { Delay = 2 }
  numWorkers = 4,
  onDlError = "SkipAndLog", -- or: "Fail"
  onXmlError = "SkipAndLog",
  onScrapError = "SkipAndLog",
  robot = nil,
}
```

## CLI override

You can override parameters through CLI arguments.

| Parameter              | CLI argument name | Example CLI argument value          |
|------------------------|-------------------|-------------------------------------|
| user_agent             | --user-agent      | 'SWSbot'                            |
| page_buffer            | --page-buffer     | 10000                               |
| throttle (Concurrent)  | --conc-dl         | 100                                 |
| throttle (PerSecond)   | --rps             | 10                                  |
| throttle (Delay)       | --delay           | 2                                   |
| num_workers            | --num-workers     | 4                                   |
| on_dl_error            | --on-dl-error     | skip-and-log                        |
| on_xml_error           | --on-xml-error    | fail                                |
| on_scrap_error         | --on-scrap-error  | skip-and-log                        |
| robot                  | --robot           | 'https://www.google.com/robots.txt' |

Here is an example of crawler configuration parameters set using CLI arguments:

```sh
sws --script path/to/scrape_logic.lua -o results.csv \
    --user-agent 'SWSbot' \
    --page-buffer 10000 \
    --conc-dl 100 \
    --num-workers 4 \
    --on-dl-error skip-and-log \
    --on-xml-error fail \
    --on-scrap-error skip-and-log \
    --robot 'https://www.google.com/robots.txt'
```
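As noted in the parameter table, when `robot` is set the default throttle becomes `Delay(N)`, with `N` read from the `Crawl-delay` field of the downloaded `robots.txt`. For illustration, a hypothetical `robots.txt` like the one below would translate into a `Delay(2)` throttle:

```text
User-agent: *
Crawl-delay: 2
```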