v4.6 - see CHANGELOG.md for details

xnl-h4ck3r · Nov 23, 2024 · 604ec29 · 604ec29
1 parent 67c26c6
commit 604ec29
Show file tree

Hide file tree

Showing 4 changed files with 225 additions and 69 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,22 @@
 ## Changelog
 
+- v4.6
+
+  - New
+
+    - Add argument `-ft` to specify a list of MIME Types to filter. This will override the `FILTER_MIME` list in `config.yml`. **NOTE: This will NOT be applied to Alien Vault OTX and Virus Total because they don't have the ability to filter on MIME Type. Sometimes URLScan does not have a MIME Type defined - these will always be included. Consider excluding sources if this matters to you.**.
+    - Add argument `-mt` to specify a list of MIME Types to match. This will be used instead of the default filtering using `FILTER_MIME` list in `config.yml`, or filtering using `-ft`. **NOTE: This will NOT be applied to Alien Vault OTX and Virus Total because they don't have the ability to filter on MIME Type. Sometimes URLScan does not have a MIME Type defined - these will always be included. Consider excluding sources if this matters to you.**.
+    - Add argument `--providers` in the same way as `gau`. A comma separated list of source providers that you want to get URLs from. The values can be `wayback`,`commoncrawl`,`otx`,`urlscan` and `virustotal`. Passing this will override any exclude arguments (e.g. `-xwm`,`-xcc`, etc.) passed to exclude sources, and reset those based on what was passed with this argument.
+
+  - Changed
+
+    - When argument `--verbose` has been used and the options are shown, show the name of providers that will be searched instead of the exclude arguments, e.g.`-xwm`, `-xcc`, etc.
+    - Change `HTTP_ADAPTER_CC` used for Common Crawl requests to use `retries+3` instead of `reties+20`. This was originally suggested by Common Crawl, but there are so many issues it can just take forever to get anything from their API, and often fail anyway.
+    - Change the default of `-lcc` to 1 instead of 3 because of so many problems with Common Crawl.
+    - BUG FIX: If a connection error occurs when getting the Common Crawl index file, then error `ERROR getCommonCrawlUrls 1: object of type 'NoneType' has no len()` is displayed. This will now be suppressed.
+    - BUG FIX: If arg `-mc` was not passed and `-ft` was, when options were shown to the user (in `showOptions` function), the value of `-mc` was shown for `-ft`.
+    - BUG FIX: When a MIME type is used in a filter for Wayback Machine that has a `+` in it (e.g. `image/svg+xml`), then the `+` was replaced because that'#s the only way Wayback recognises it. However, it was being escaped first and was being converted to `image/svg\.xml` instead of `image/svg.xml` so was not recognised in the filter.
+
 - v4.5
 
   - Change

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 <center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/title.png"></center>
 
-## About - v4.5
+## About - v4.6
 
 The idea behind **waymore** is to find even more links from the Wayback Machine than other existing tools.
 
@@ -68,7 +68,9 @@ pipx install git+https://github.com/xnl-h4ck3r/waymore.git
 | -n            | --no-subs                  | Don't include subdomains of the target domain (only used if input is not a domain with a specific path).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
 | -f            | --filter-responses-only    | The initial links from sources will not be filtered, only the responses that are downloaded, e.g. it maybe useful to still see all available paths from the links, even if you don't want to check the content.                                                                                                                                                                                                                                                                                                                                                                                                                                                |
 | -fc           |                            | Filter HTTP status codes for retrieved URLs and responses. Comma separated list of codes (default: the `FILTER_CODE` values from `config.yml`). Passing this argument will override the value from `config.yml`                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+| -ft           |                            | Filter MIME Types for retrieved URLs and responses. Comma separated list of MIME Types (default: the `FILTER_MIME` values from `config.yml`). Passing this argument will override the value from `config.yml`. **NOTE: This will NOT be applied to Alien Vault OTX and Virus Total because they don't have the ability to filter on MIME Type. Sometimes URLScan does not have a MIME Type defined - these will always be included. Consider excluding sources if this matters to you.**.                                                                                                                                                                      |
 | -mc           |                            | Only Match HTTP status codes for retrieved URLs and responses. Comma separated list of codes. Passing this argument overrides the config `FILTER_CODE` and `-fc`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| -mt           |                            | Only MIME Types for retrieved URLs and responses. Comma separated list of MIME types. Passing this argument overrides the config `FILTER_MIME` and `-ft`. **NOTE: This will NOT be applied to Alien Vault OTX and Virus Total because they don't have the ability to filter on MIME Type. Sometimes URLScan does not have a MIME Type defined - these will always be included. Consider excluding sources if this matters to you.**.                                                                                                                                                                                                                           |
 | -l            | --limit                    | How many responses will be saved (if `-mode R` or `-mode B` is passed). A positive value will get the **first N** results, a negative value will get the **last N** results. A value of 0 will get **ALL** responses (default: 5000)                                                                                                                                                                                                                                                                                                                                                                                                                           |
 | -from         | --from-date                | What date to get responses from. If not specified it will get from the earliest possible results. A partial value can be passed, e.g. `2016`, `201805`, etc.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
 | -to           | --to-date                  | What date to get responses to. If not specified it will get to the latest possible results. A partial value can be passed, e.g. `2021`, `202112`, etc.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
@@ -80,7 +82,7 @@ pipx install git+https://github.com/xnl-h4ck3r/waymore.git
 | -xav          |                            | Exclude checks for links from alienvault.com                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
 | -xus          |                            | Exclude checks for links from urlscan.io                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
 | -xvt          |                            | Exclude checks for links from virustotal.com                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
-| -lcc          |                            | Limit the number of Common Crawl index collections searched, e.g. `-lcc 10` will just search the latest `10` collections (default: 3). As of July 2023 there are currently 95 collections. Setting to `0` (default) will search **ALL** collections. If you don't want to search Common Crawl at all, use the `-xcc` option.                                                                                                                                                                                                                                                                                                                                   |
+| -lcc          |                            | Limit the number of Common Crawl index collections searched, e.g. `-lcc 10` will just search the latest `10` collections (default: 1). As of November 2024 there are currently 106 collections. Setting to `0` will search **ALL** collections. If you don't want to search Common Crawl at all, use the `-xcc` option.                                                                                                                                                                                                                                                                                                                                        |
 | -lcy          |                            | Limit the number of Common Crawl index collections searched by the year of the index data. The earliest index has data from 2008. Setting to 0 (default) will search collections or any year (but in conjuction with `-lcc`). For example, if you are only interested in data from 2015 and after, pass `-lcy 2015`. This will override the value of `-lcc` if passed. If you don't want to search Common Crawl at all, use the `-xcc` option.                                                                                                                                                                                                                 |
 | -t            | --timeout                  | This is for archived responses only! How many seconds to wait for the server to send data before giving up (default: 30)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
 | -p            | --processes                | Basic multithreading is done when getting requests for a file of URLs. This argument determines the number of processes (threads) used (default: 1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
@@ -139,16 +141,16 @@ If the input is just a domain, e.g. `redbull.com` then the `-mode` defaults to `
 
 The `config.yml` file (typically in `~/.config/waymore/`) have values that can be updated to suit your needs. Filters are all provided as comma separated lists:
 
-- `FILTER_CODE` - Exclusions used to exclude responses we will try to get from web.archive.org, and also for file names when `-i` is a directory, e.g. `301,302`. This can be overridden with the `-fc` argument. Passing the `-mc` (to match status codes instead of filter) will override any value in `FILTER_CODE` or `-fc`
-- `FILTER_MIME` - MIME Content-Type exclusions used to filter links and responses from web.archive.org through their API, e.g. `'text/css,image/jpeg`
+- `FILTER_CODE` - Exclusions used to exclude responses we will try to get from web.archive.org, and also for file names when `-i` is a directory, e.g. `301,302`. This can be overridden with the `-fc` argument. Passing the `-mc` (to match status codes instead of filter) will override any value in `FILTER_CODE` or `-fc`.
+- `FILTER_MIME` - MIME Content-Type exclusions used to filter links and responses from web.archive.org through their API, e.g. `'text/css,image/jpeg`. This can be overridden with the `-ft` argument. . Passing the `-mt` (to match MIME types instead of filter) will override any value in `FILTER_MIME` or `-ft`.
 - `FILTER_URL` - Response code exclusions we will use to filter links and responses from web.archive.org through their API, e.g. `.css,.jpg`
 - `FILTER_KEYWORDS` - Only links and responses will be returned that contain the specified keywords if the `-ko`/`--keywords-only` argument is passed (without providing an explicit value on the command line), e.g. `admin,portal`
 - `URLSCAN_API_KEY` - You can sign up to [urlscan.io](https://urlscan.io/user/signup) to get a **FREE** API key (there are also paid subscriptions available). It is recommended you get a key and put it into the config file so that you can get more back (and quicker) from their API. NOTE: You will get rate limited unless you have a full paid subscription.
 - `CONTINUE_RESPONSES_IF_PIPED` - If retrieving archive responses doesn't complete, you will be prompted next time whether you want to continue with the previous run. However, if `stdout` is piped to another process it is assumed you don't want to have an interactive prompt. A value of `True` (default) will determine assure the previous run will be continued. if you want a fresh run every time then set to `False`.
 - `WEBHOOK_DISCORD` - If the `--notify-discord` argument is passed, `knoxnl` will send a notification to this Discord wehook when a successful XSS is found.
 - `DEFAULT_OUTPUT_DIR` - This is the default location of any output files written if the `-oU` and `-oR` arguments are not used. If the value of this key is blank, then it will default to the location of the `config.yml` file.
 
-  **NOTE: The MIME types cannot be filtered for Alien Vault results because they do not return that in the API response.**
+  **NOTE: The MIME types cannot be filtered for Alien Vault OTX and Virus Total because they don't have the ability to filter on MIME Type. Sometimes URLScan does not have a MIME Type defined for a URL. In these cases, URLs will be included regardless of filter or match. Bear this in mind and consider excluding certain providers if this is important.**
 
 ## Output
 
@@ -186,6 +188,8 @@ The archive.org Wayback Machine CDX API can sometimes can sometimes require a hu
 
 There is also a problem with the Wayback Machine CDX API where the number of pages returned is not correct when filters are applied and can cause issues (see https://github.com/internetarchive/wayback/issues/243). Until that issue is resolved, setting the `-lr` argument to a sensible value can help with that problem in the short term.
 
+The Common Crawl API has had a lot of issues for a long time. Including this source could make waymore take a lot longer to run and may not yield any extra results. You can check if tere is an issue by visiting http://index.commoncrawl.org/collinfo.json and seeing if this is successful. Consider excluding Common Crawl altogether using the `--providers` argument and not including `commoncrawl`, or using the `-xcc` argument.
+
 **The provider API servers aren't designed to cope with huge volumes, so be sensible and considerate about what you hit them with!**
 
 When downloading archived responses, this can take a long time and can sometimes be killed by the machine for some reason, or manually killed by the user.
@@ -203,7 +207,7 @@ The URLs are saved in the same path as `config.yml` (typically `~/.config/waymor
 
 ### Example 2
 
-Get ALL the URLs from Wayback for `redbull.com` (no filters are applied in `mode U` with `-f`, and no URLs are retrieved from Commone Crawl, Alien Vault, URLScan and Virus Total, because `-xcc`, `-xav`, `-xus`, `-xvt` are passed respectively).
+Get ALL the URLs from Wayback for `redbull.com` (no filters are applied in `mode U` with `-f`, and no URLs are retrieved from Commone Crawl, Alien Vault, URLScan and Virus Total, because `-xcc`, `-xav`, `-xus`, `-xvt` are passed respectively. This can also be achieved by passing `--providers wayback` instead of the exclude arguments).
 Save the FIRST 200 responses that are found starting from 2022 (`-l 200 -from 2022`):
 
 <center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/example2.png"></center>

diff --git a/waymore/__init__.py b/waymore/__init__.py
@@ -1 +1 @@
-__version__="4.5"
+__version__="4.6"