Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network

General
- urllib - network library (stdlib)
- requests - network library
- grab - network library (pycurl based)
- pycurl - network library (binding to libcurl)
- urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
- httplib2 - network library
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MechanicalSoup - A Python library for automating interaction with websites.
- mechanize - Stateful programmatic web browsing.
- socket - low-level networking interface (stdlib)
- Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
- hyper - HTTP/2 Client for Python
- PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
Asynchronous
- treq - requests like API (twisted based)
- aiohttp - http client/server for asyncio (PEP-3156)

Web-Scraping Frameworks

Full Featured Crawlers
- grab - web-scraping framework (pycurl/multicurl based)
- scrapy - web-scraping framework (twisted based).
- pyspider - A powerful spider system.
- cola - A distributed crawling framework.
Other
- portia - Visual scraping for Scrapy.
- restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
- demiurge - PyQuery-based scraping micro-framework.

HTML/XML Parsing

General
- lxml - efficient HTML/XML processing library. Supports XPath. Built on the C libraries libxml2 and libxslt.
- cssselect - working with DOM tree with CSS selectors
- pyquery - working with DOM tree with jQuery-like selectors
- BeautifulSoup - slow HTML/XML processing library, written in pure Python
- html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
- feedparser - parsing of RSS/ATOM feeds.
- MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
- xmltodict - Makes working with XML feel like you are working with JSON.
- xhtml2pdf - HTML/CSS to PDF converter.
- untangle - Converts XML documents to Python objects for easy access.
- hodor - Configuration driven wrapper around lxml and cssselect.
Sanitizing
- Bleach - cleaning of HTML (requires html5lib)
- sanitize - Bringing sanity to world of messed-up data.

Text Processing

Libraries for parsing and manipulating plain texts.

General
- difflib - (Python standard library) Helpers for computing deltas.
- Levenshtein - Fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy - Fuzzy String Matching.
- esmre - Regular expression accelerator.
- ftfy - Makes Unicode text less broken and more consistent automagically.
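
As a minimal sketch of the "computing deltas" idea behind difflib, the snippet below scores string similarity and picks the closest candidate, which can help when scraped names contain typos. The helper names `similarity` and `closest_match` are hypothetical, chosen only for this illustration; only `difflib.SequenceMatcher` and `difflib.get_close_matches` come from the standard library.

```python
import difflib
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest_match(query: str, candidates: List[str]) -> Optional[str]:
    """Return the candidate most similar to query, or None if nothing is close.

    The 0.6 cutoff is difflib's default and is an arbitrary choice here.
    """
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    libraries = ["requests", "urllib3", "pycurl"]
    print(closest_match("requsts", libraries))
    print(round(similarity("requsts", "requests"), 3))
```

For large inputs, the Levenshtein and fuzzywuzzy entries in this list are likely to be a better fit than pure-Python difflib.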
Transliteration
- unidecode - ASCII transliterations of Unicode text.
Character encoding
- uniout - Print readable chars instead of the escaped string.
- chardet - Python 2/3 compatible character encoding detector.
- xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
- pangu.py - Spacing texts for CJK and alphanumerics.
Slugify
- awesome-slugify - A Python slugify library that can preserve unicode.
- python-slugify - A Python slugify library that translates unicode to ASCII.
- unicode-slugify - A slugifier that generates unicode slugs.
- pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)
General Parser
- PLY - Implementation of lex and yacc parsing tools for Python
- pyparsing - A general purpose framework for generating parsers.
Human names
- python-nameparser - Parsing human names into their individual components.
Phone Number
- phonenumbers - Parsing, formatting, storing and validating international phone numbers.
User-agent string
- python-user-agents - Browser user agent parser.
- HTTP Agent Parser - Python HTTP Agent Parser
- fake-useragent - Python user agent string faker, based on real-world browser usage statistics
- user_agent - Generator of User-Agent data

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

General
- tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
- textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
- messytables - Tools for parsing messy tabular data
- rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
Office
- python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
- xlwt / xlrd - Writing and reading data and formatting information from Excel files.
- XlsxWriter - A Python module for creating Excel .xlsx files.
- xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir - Takes Python data structures and turns them into spreadsheets.
PDF
- PDFMiner - A tool for extracting information from PDF documents.
- PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
- ReportLab - Allows rapid creation of rich PDF documents.
- pdftables - Extract tables from PDF files directly
Markdown
- Python-Markdown - A Python implementation of John Gruber’s Markdown.
- Mistune - A fast, full-featured pure Python Markdown parser.
- markdown2 - A fast and complete Python implementation of Markdown
YAML
- PyYAML - YAML implementations for Python.
CSS
- cssutils - A CSS library for Python.
ATOM/RSS
- feedparser - Universal feed parser.
SQL
- sqlparse - A non-validating SQL parser.
HTTP
- http-parser - HTTP request/response parser for python in C
Microformats
- opengraph - A Python module to parse the Open Graph Protocol tags
Portable Executable
- pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.
PSD
- psd-tools - reading Adobe Photoshop PSD files (as described in specification) to Python data structures.

Natural Language Processing

Libraries for working with human languages.

NLTK - A leading platform for building Python programs to work with human language data.
Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
jieba - Chinese Words Segmentation Utilities.
SnowNLP - A library for processing Chinese text.
loso - Another Chinese segmentation library.
genius - Chinese word segmentation based on conditional random fields.
langid.py - Stand-alone language identification system.
Korean - A library for Korean morphology.
pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
langdetect - Port of Google's language-detection library to Python

Browser automation and emulation

Browsers
- selenium - automating real browsers (Chrome, Firefox, Opera, IE)
- Ghost.py - wrapper of QtWebKit (requires PyQT)
- Spynner - wrapper of QtWebKit (requires PyQT)
- Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
Headless tools
- xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

threading - standard Python library to run threads. Effective for I/O-bound tasks. Gives little or no speedup for CPU-bound tasks because of the Python GIL.
multiprocessing - standard Python library to run processes.
celery - An asynchronous task queue/job queue based on distributed message passing.
concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
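
To illustrate why threads suit I/O-bound scraping work, here is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`. The function `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real request, and the URLs and worker count are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> str:
    """Stand-in for a network request: sleeps instead of doing real I/O."""
    time.sleep(0.1)
    return f"fetched {url}"


def fetch_all(urls, max_workers=4):
    """Run fake_fetch over urls in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input iterable.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(8)]
    start = time.perf_counter()
    results = fetch_all(urls)
    elapsed = time.perf_counter() - start
    print(len(results), "pages in", round(elapsed, 2), "seconds")
```

Because the threads spend their time waiting, the eight simulated requests should overlap rather than run one after another. For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same interface with processes instead of threads.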

Asynchronous

Libraries for asynchronous networking programming.

asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
Twisted - An event-driven networking engine.
Tornado - A Web framework and asynchronous networking library.
pulsar - Event-driven concurrent framework for Python.
diesel - Greenlet-based event I/O Framework for Python.
gevent - A coroutine-based Python networking library that uses greenlet.
eventlet - Asynchronous framework with WSGI support.
Tomorrow - Magic decorator syntax for asynchronous code.
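
As a rough sketch of the event loop, coroutine and task model that asyncio provides, the snippet below runs several simulated downloads concurrently on one thread. `fake_download` and `crawl` are hypothetical names, and `asyncio.sleep` stands in for real network I/O (which a client such as aiohttp would perform). It assumes Python 3.7+ for `asyncio.run`.

```python
import asyncio


async def fake_download(url: str, delay: float = 0.05) -> str:
    """Stand-in for an async HTTP request: yields to the event loop while 'waiting'."""
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls):
    """Schedule all downloads at once and collect results in input order."""
    return await asyncio.gather(*(fake_download(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
    print(pages)
```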

Queue

celery - An asynchronous task queue/job queue based on distributed message passing.
huey - Little multi-threaded task queue.
mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
RQ - lightweight task queue manager based on redis
simpleq - A simple, infinitely scalable, Amazon SQS based queue.
python-gearman - python API for Gearman

Cloud Computing

picloud - executing Python code in the cloud
dominoup.com - executing R, Python and MATLAB code in the cloud

Email

Libraries for parsing email.

flanker - An email address and MIME parsing library.
Talon - Mailgun library to extract message quotations and signatures.

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

URL
- furl - A small Python library that makes manipulating URLs simple.
- purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
- urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
- tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
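
The three urllib.parse operations described above (splitting a URL into components, recombining them, and resolving a relative URL against a base) can be sketched as follows. The helper names `describe` and `absolutize` and the example.com URLs are assumptions made for illustration; `urlsplit`, `urlunsplit` and `urljoin` are the standard library functions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def describe(url: str) -> dict:
    """Break a URL into its main components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }


def absolutize(base: str, link: str) -> str:
    """Convert a relative link found on a page into an absolute URL."""
    return urljoin(base, link)


if __name__ == "__main__":
    url = "https://example.com/a/b?x=1"
    print(describe(url))
    # Splitting and recombining round-trips this URL unchanged.
    print(urlunsplit(urlsplit(url)) == url)
    print(absolutize("https://example.com/docs/index.html", "page2.html"))
```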
Network Address
- netaddr - A Python library for representing and manipulating network addresses.

Web Content Extracting

Libraries for extracting web contents.

Text and Meta Data from HTML pages
- newspaper - News extraction, article extraction and content curation in Python.
- html2text - Convert HTML to Markdown-formatted text.
- python-goose - HTML Content/Article Extractor.
- lassie - Web Content Retrieval for Humans.
- micawber - A small library for extracting rich content from URLs.
- sumy - A module for automatic summarization of text documents and HTML pages.
- Haul - An Extensible Image Crawler.
- python-readability - Fast Python port of arc90's readability tool.
- scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
- libextract - Extract data from websites.
Video
- youtube-dl - A small command-line program to download videos from YouTube.
- you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
Wiki
- WikiTeam - Tools for downloading and preserving wikis.

WebSocket

Libraries for working with WebSocket.

Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

dnsyo - Check your DNS against over 1500 global DNS servers.
pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

OpenCV - Open Source Computer Vision Library.
SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
tproxy - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python

Misc

user_agent - this module is for generating random, valid web navigator's configs & User-Agent HTTP headers.

Other python lists

awesome-python
pycrumbs
python-github-projects
python_reference
pythonidae
