It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.
It's based on the great and simple scraping tool written by Jeroen Janssens.
A CSS selector query like this
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'
or an XPATH query like this one:
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be '//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a'
gives you back:
<html>
<head>
</head>
<body>
<a href="/wiki/Afghanistan" title="Afghanistan">
Afghanistan
</a>
<a href="/wiki/Albania" title="Albania">
Albania
</a>
<a href="/wiki/Algeria" title="Algeria">
Algeria
</a>
<a href="/wiki/Andorra" title="Andorra">
Andorra
</a>
<a href="/wiki/Angola" title="Angola">
Angola
</a>
<a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
Antigua and Barbuda
</a>
<a href="/wiki/Argentina" title="Argentina">
Argentina
</a>
<a href="/wiki/Armenia" title="Armenia">
Armenia
</a>
...
...
</body>
</html>
Some notes on the commands:
-e
to set the query-b
to add<html>
,<head>
and<body>
tags to the HTML output.
# go in example to the home folder
cd ~
# download scrape-cli
wget "https://github.com/aborruso/scrape-cli/releases/download/v1.0/scrape"
# move it in a folder of your PATH as /usr/bin
sudo mv ./scrape /usr/bin
# give it execute permission
sudo chmod +x /usr/bin/scrape
# use it
Please note: in OSX it seems not to work (#8).
The original source is written in Python 2, then I have built it in Python 2 environment.
There are two modules requirements: install in this environment cssselect
and then lxml
, in this order (using pip).
I have built it using pyinstaller and this command: pyinstaller --onefile scrape.py
.
Once you have built it, it's an executable, and it's possible to use it in any environment.