fast-link-extractor

Python package to quickly extract links in HTML


Documentation: https://fast-link-extractor.readthedocs.io

Source Code: https://github.com/lgloege/fast-link-extractor


A Python 3.7+ package to extract links from a webpage. Asynchronous functions allow the code to run fast when extracting from many sub-directories. One use case for this tool is extracting download links for use with wget or fsspec.
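
To illustrate why asynchronous requests help with many sub-directories, here is a minimal sketch using only the standard library. It is not the package's implementation: `fake_fetch`, `fetch_all`, and the example.com URLs are hypothetical, and `asyncio.sleep` stands in for a real HTTP request so that nothing touches the network.

```python
import asyncio
import time
from typing import List


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an HTTP request; sleeps instead of fetching."""
    await asyncio.sleep(delay)
    return f"<html>listing for {url}</html>"


async def fetch_all(urls: List[str]) -> List[str]:
    # gather() schedules every coroutine at once, so the waits overlap
    # instead of adding up; results come back in the order of `urls`.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


# Ten made-up sub-directory URLs (assumption, for illustration only).
sub_dirs = [f"https://example.com/data/{year}/" for year in range(2000, 2010)]

start = time.perf_counter()
pages = asyncio.run(fetch_all(sub_dirs))
elapsed = time.perf_counter() - start

print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Fetched one after another, ten 0.1-second waits would take about a second; run concurrently, the total stays close to the longest single wait.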

Installation

Install using PyPI

pip install fast-link-extractor

Install using GitHub

pip install git+https://github.com/lgloege/fast-link-extractor.git

Example

Import the package and call link_extractor(), which returns a list of extracted links.

import fast_link_extractor as fle

# url to extract links from
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

# extract all links ending with .nc, searching sub-directories too
# this may take ~10 seconds, there are a lot of sub-directories
# the dot is escaped so the pattern matches a literal ".nc" suffix
links = fle.link_extractor(base_url,
                           search_subs=True,
                           regex=r'\.nc$')
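
The `regex` argument narrows the result to links that match a pattern. The sketch below shows how a suffix pattern behaves, assuming the filter works like `re.search` (an assumption, not confirmed by the package documentation). `filter_links` and the sample URLs are hypothetical. In a pattern like this, an unescaped dot matches any character, so escaping it is the stricter choice.

```python
import re

# Hypothetical sample links, for illustration only.
sample_links = [
    "https://example.com/data/a.nc",
    "https://example.com/data/b.nc.md5",
    "https://example.com/data/sync",
    "https://example.com/data/index.html",
]


def filter_links(links, pattern):
    """Keep links where the pattern matches anywhere in the string."""
    compiled = re.compile(pattern)
    return [link for link in links if compiled.search(link)]


# Unescaped dot: "." matches any character, so ".../sync" also passes.
loose = filter_links(sample_links, ".nc$")

# Escaped dot: only links that literally end in ".nc" pass.
strict = filter_links(sample_links, r"\.nc$")

print(loose)
print(strict)
```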

When running inside Jupyter or IPython, set ipython=True:

import fast_link_extractor as fle

# url to extract links from
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

# extract all links ending with .nc, searching sub-directories too
# this may take ~10 seconds, there are a lot of sub-directories
# the dot is escaped so the pattern matches a literal ".nc" suffix
links = fle.link_extractor(base_url,
                           search_subs=True,
                           ipython=True,
                           regex=r'\.nc$')
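
For the wget use case, one option is to save the extracted links to a text file, one URL per line, which is the format `wget -i` reads. This is a minimal sketch: `write_url_list`, the output file name, and the sample links are hypothetical, and in practice `links` would be the list returned by `link_extractor()`.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the list returned by link_extractor().
links = [
    "https://example.com/data/1981/file-19810901.nc",
    "https://example.com/data/1981/file-19810902.nc",
]


def write_url_list(urls, path):
    """Write one URL per line so the file can be passed to `wget -i`."""
    out = Path(path)
    out.write_text("\n".join(urls) + "\n")
    return out


# Written to the system temp directory here; any writable path works.
url_file = write_url_list(links, Path(tempfile.gettempdir()) / "urls.txt")
print(f"wrote {len(links)} URLs to {url_file}")
```

The files can then be downloaded from a shell with `wget -i` followed by the path of the written file.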

License

This project is licensed under the terms of the MIT license.
