v4.3 - see CHANGELOG.md for details
xnl-h4ck3r committed May 8, 2024
1 parent 7f70e0d commit 57dab04
Showing 4 changed files with 50 additions and 30 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,18 @@
## Changelog

- v4.3

- Changed

- Wayback Machine seems to have made some changes to their CDX API without any notice or documentation. This caused problems getting URLs for `-mode U` because the API pagination no longer worked. If the number of pages cannot be retrieved, all links will be retrieved in one request instead (a minimal sketch of this fallback is shown after this list). However, if they "fix" the problem and pagination starts working again, it will revert to the previous code that gets results a page at a time.
- Although the bug fix for [Github Issue #45](https://github.com/xnl-h4ck3r/waymore/issues/45) appeared to have been working fine since the last version, the "changes" made by Wayback Machine seem to have broken that too. The code had to be refactored to work (i.e. the `collapse` parameter is now omitted entirely when the interval is `none`), but it also no longer works with multiple collapse fields.
- When `-co` is used, there is no way to tell how long the Wayback Machine results will take now, because all the data is retrieved in one request. While pagination is broken this will just return `Unknown`, but it will revert to the previous functionality if pagination is fixed.
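
A minimal sketch of the pagination fallback described above; the function name and bare `requests` call are illustrative, while the `showNumPages=True` parameter and the `-1` sentinel come from the waymore.py diff further down:

```python
# Minimal sketch of the pagination fallback, assuming a plain requests call;
# the real logic lives in getWaybackUrls() in waymore/waymore.py.
import requests

def get_total_pages(cdx_url, user_agent='waymore'):
    """Return the number of CDX pages, or -1 if pagination appears broken."""
    resp = requests.get(cdx_url + '&showNumPages=True', headers={'User-Agent': user_agent})
    try:
        # Pagination works: fetch results one page at a time
        return int(resp.text.strip())
    except ValueError:
        # Pagination broken: signal that everything should be fetched in one request
        return -1
```

When `-1` comes back, the page list is skipped and the base URL is processed in a single request.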

- v4.2

- Changed

- BUG FIX: [Github Issue #45](https://github.com/xnl-h4ck3r/waymore/issues/45) - When getting archived responses from wayback machine, by default it is supposed to get one capture per day per URL (thi interval can be changed with `-ci`). But, it was only getting one response per day, not for all the different URLs per day. Thanks to @zakaria_ounissi for raising this.
- BUG FIX: [Github Issue #45](https://github.com/xnl-h4ck3r/waymore/issues/45) - When getting archived responses from wayback machine, by default it is supposed to get one capture per day per URL (this interval can be changed with `-ci`). But, it was only getting one response per day, not for all the different URLs per day. Thanks to @zakaria_ounissi for raising this.
- BUG FIX: [Github Issue #46](https://github.com/xnl-h4ck3r/waymore/issues/46) - The config `FILTER_URL` list was being applied to links found from all sources except the Wayback Machine. So if the MIME type wasn't correct, links that matched `FILTER_URL` could still be included in the output (see the sketch after this list for how the filter is intended to apply). Thanks to @brutexploiter for raising this.
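
As a rough illustration of where the filter should apply, a sketch of checking a link against `FILTER_URL`; the filter values and helper name below are assumed examples, not the shipped defaults:

```python
# Hypothetical example of applying FILTER_URL to links from every source,
# including the Wayback Machine.
FILTER_URL = ['blog.', '/docs/', '.css']  # assumed example values

def keep_link(link: str) -> bool:
    """Return True if the link does not match any FILTER_URL exclusion."""
    return not any(f in link.lower() for f in FILTER_URL)

links = ['https://example.com/app/login', 'https://example.com/style.css']
print([l for l in links if keep_link(l)])  # -> ['https://example.com/app/login']
```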

- v4.1
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
<center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/title.png"></center>

## About - v4.2
## About - v4.3

The idea behind **waymore** is to find even more links from the Wayback Machine than other existing tools.

2 changes: 1 addition & 1 deletion waymore/__init__.py
@@ -1 +1 @@
__version__="4.2"
__version__="4.3"
66 changes: 39 additions & 27 deletions waymore/waymore.py
@@ -83,7 +83,7 @@ class StopProgram(enum.Enum):
responseOutputDirectory = ''

# Source Provider URLs
WAYBACK_URL = 'https://web.archive.org/cdx/search/cdx?url={DOMAIN}&collapse={COLLAPSE}&fl=timestamp,original,mimetype,statuscode,digest'
WAYBACK_URL = 'https://web.archive.org/cdx/search/cdx?url={DOMAIN}{COLLAPSE}&fl=timestamp,original,mimetype,statuscode,digest'
CCRAWL_INDEX_URL = 'https://index.commoncrawl.org/collinfo.json'
ALIENVAULT_URL = 'https://otx.alienvault.com/api/v1/indicators/{TYPE}/{DOMAIN}/url_list?limit=500'
URLSCAN_URL = 'https://urlscan.io/api/v1/search/?q=domain:{DOMAIN}&size=10000'
@@ -1851,11 +1851,15 @@ def getWaybackUrls():
session.mount('https://', HTTP_ADAPTER)
session.mount('http://', HTTP_ADAPTER)
resp = session.get(url+'&showNumPages=True', headers={"User-Agent":userAgent})
totalPages = int(resp.text.strip())

# If the argument to limit the requests was passed and the total pages is larger than that, set to the limit
if args.limit_requests != 0 and totalPages > args.limit_requests:
totalPages = args.limit_requests
# Try to get the total number of pages. If there is a problem, we'll set totalPages = -1 which means we'll get everything back in one request
try:
totalPages = int(resp.text.strip())

# If the argument to limit the requests was passed and the total pages is larger than that, set to the limit
if args.limit_requests != 0 and totalPages > args.limit_requests:
totalPages = args.limit_requests
except:
totalPages = -1
except Exception as e:
try:
# If the rate limit was reached end now
@@ -1880,31 +1884,39 @@ def getWaybackUrls():
else:
writerr(colored(getSPACER('[ ERR ] Unable to get links from Wayback Machine (archive.org): ' + str(e)), 'red'))
return

if args.check_only:
checkWayback = totalPages
write(colored('Get URLs from Wayback Machine: ','cyan')+colored(str(checkWayback)+' requests','white'))
if totalPages < 0:
write(colored('Due to a change in Wayback Machine API, all URLs will be retrieved in one request and it is not possible to determine how long it will take, so please ignore this.','cyan'))
else:
checkWayback = totalPages
write(colored('Get URLs from Wayback Machine: ','cyan')+colored(str(checkWayback)+' requests','white'))
else:
if verbose():
write(colored('The archive URL requested to get links: ','magenta')+colored(url+'\n','white'))

# if the page number was found then display it, but otherwise we will just try to increment until we have everything
write(colored('\rGetting links from ' + str(totalPages) + ' Wayback Machine (archive.org) API requests (this can take a while for some domains)...\r','cyan'))
if totalPages < 0:
write(colored('\rGetting links from Wayback Machine (archive.org) with one request (this can take a while for some domains)...\r','cyan'))

# Get a list of all the page URLs we need to visit
pages = []
if totalPages == 1:
pages.append(url)
processWayBackPage(url)
else:
for page in range(0, totalPages):
pages.append(url+str(page))

# Process the URLs from web archive
if stopProgram is None:
p = mp.Pool(args.processes)
p.map(processWayBackPage, pages)
p.close()
p.join()
# if the page number was found then display it, but otherwise we will just try to increment until we have everything
write(colored('\rGetting links from ' + str(totalPages) + ' Wayback Machine (archive.org) API requests (this can take a while for some domains)...\r','cyan'))

# Get a list of all the page URLs we need to visit
pages = []
if totalPages == 1:
pages.append(url)
else:
for page in range(0, totalPages):
pages.append(url+str(page))

# Process the URLs from web archive
if stopProgram is None:
p = mp.Pool(args.processes)
p.map(processWayBackPage, pages)
p.close()
p.join()

# Show the MIME types found (in case user wants to exclude more)
if verbose() and len(linkMimes) > 0 :
@@ -2431,11 +2443,11 @@ def processResponses():
if args.capture_interval == 'none': # get all
collapse = ''
elif args.capture_interval == 'h': # get at most 1 capture per URL per hour
collapse = 'timestamp:10,original'
collapse = '&collapse=timestamp:10'
elif args.capture_interval == 'd': # get at most 1 capture per URL per day
collapse = 'timestamp:8,original'
collapse = '&collapse=timestamp:8'
elif args.capture_interval == 'm': # get at most 1 capture per URL per month
collapse = 'timestamp:6,original'
collapse = '&collapse=timestamp:6'

url = WAYBACK_URL.replace('{DOMAIN}',subs + quote(argsInput) + path).replace('{COLLAPSE}',collapse) + filterMIME + filterCode + filterLimit + filterFrom + filterTo + filterKeywords

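As a rough illustration of the CDX queries the refactored code above now builds, a small sketch follows; it is simplified to drop the MIME, status code and date filters that the real code appends, and `example.com` is an assumed target rather than a value from the commit:

```python
# Illustrative only: combine the new WAYBACK_URL template with each collapse value.
WAYBACK_URL = 'https://web.archive.org/cdx/search/cdx?url={DOMAIN}{COLLAPSE}&fl=timestamp,original,mimetype,statuscode,digest'

collapse_by_interval = {
    'none': '',                       # -ci none: no collapse parameter at all
    'h': '&collapse=timestamp:10',    # -ci h: collapse on the first 10 timestamp digits (hourly)
    'd': '&collapse=timestamp:8',     # -ci d: first 8 digits (daily)
    'm': '&collapse=timestamp:6',     # -ci m: first 6 digits (monthly)
}

for interval, collapse in collapse_by_interval.items():
    url = WAYBACK_URL.replace('{DOMAIN}', 'example.com').replace('{COLLAPSE}', collapse)
    print(interval, url)
```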
