Install amcat-scraping and amcatclient directly from github using pip:
pip install git+https://github.com/amcat/amcatclient git+https://github.com/amcat/amcat-scraping
(Note: you should probably either work in a Python virtual environment, or use sudo to install system-wide.)
The scrapers can be run using the amcatscraping.scrape module (see below for configuration and options):
python -m amcatscraping.scrape --help
AmCAT 3.5 included the transition from Python 2 to Python 3, along with some fundamental changes in the way articles are handled. This is an overview of the scrapers that have been fixed so far:
- newspapers.ad
- newspapers.fd
- newspapers.nrc
- newspapers.nrchandelsblad
- newspapers.nrcnext
- newspapers.pcm
- newspapers.telegraaf
- newspapers.trouw
- newspapers.volkskrant
- blogs.geenstijl
- news.nu
Configuration is stored in ~/.scrapers.conf:
```ini
[store]
# Project and articleset defined per scraper
host: amcat.nl
port: 443
username: amcat
password: amcat
ssl: true

[mail]
host: mail.hmbastiaan.nl
port: 587
from: [email protected]
to: [email protected]
username: martijn
password: xxxxxxx
tls: true

[*]
# Section with defaults for all scrapers
articleset: 37
project: 1

[AD]
username: xxxxxxxxxxxxxxx
password: xxxxxxxxxxxxxxx
class: newspapers.ad.AlgemeenDagbladScraper
```
Defaults can be found [here](https://github.com/amcat/amcat-scraping/blob/master/amcatscraping/maintenance/default.conf).
Specific options:

[store]

Defines where articles are saved after scraping.

- host: hostname or IP-address of the AmCAT instance [default: amcat.nl]
- port: port to connect to [default: 80]
- username / password: credentials to use when logging in
- ssl: use SSL when connecting (port should probably be 443) [default: no]
[mail]

- use_django_settings: use default Django settings for mail; see the Django documentation on e-mail settings [default: false]
- host: SMTP server (outgoing) hostname / IP-address
- port: port to connect to [default: 587]
- ssl: use SSL [default: no]
- tls: use TLS [default: true]
- username / password: credentials to use when logging in
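To illustrate how these mail settings map onto Python's standard library, here is a minimal sketch of a report mailer. The helper names (build_report_mail, send_report) are hypothetical, not part of amcat-scraping, and the actual mailer may work differently:

```python
# Sketch: sending a scraper report using values from the [mail] section.
# build_report_mail / send_report are illustrative helpers, not amcat-scraping API.
import smtplib
from email.mime.text import MIMEText

def build_report_mail(settings, body):
    """Build a plain-text report message from a dict of [mail] settings."""
    msg = MIMEText(body)
    msg["Subject"] = "Scraper report"
    msg["From"] = settings["from"]
    msg["To"] = settings["to"]
    return msg

def send_report(settings, body):
    """Send the report; port 587 normally implies STARTTLS, 465 implicit SSL."""
    msg = build_report_mail(settings, body)
    with smtplib.SMTP(settings["host"], int(settings["port"])) as smtp:
        if settings.get("tls") == "true":
            smtp.starttls()  # upgrade the connection, matching 'tls: true'
        smtp.login(settings["username"], settings["password"])
        smtp.send_message(msg)
```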
[*]

All settings in this section are used as defaults for all scrapers. See the following section.
Per-scraper sections (e.g. [AD])

- username / password: credentials to use when logging in
- class: class relative to amcatscraping.scrapers
- is_absolute_classpath: if this option is enabled, class is considered an absolute classpath [default: no]
- articleset: id of the articleset in which to store scraped articles
- project: id of the project in which to store scraped articles
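The interaction between [*] and the per-scraper sections can be sketched with the standard configparser module. This is a simplified illustration of the merging behaviour described above, not the actual amcat-scraping code:

```python
import configparser

SAMPLE = """
[*]
articleset: 37
project: 1

[AD]
class: newspapers.ad.AlgemeenDagbladScraper
"""

def scraper_options(config, section):
    """Merge the [*] defaults with a specific scraper's section;
    the scraper's own settings win on conflict."""
    options = dict(config["*"]) if config.has_section("*") else {}
    options.update(config[section])
    return options

config = configparser.ConfigParser()
config.read_string(SAMPLE)

options = scraper_options(config, "AD")
print(options["articleset"])  # inherited from [*]: 37
print(options["class"])       # scraper-specific setting
```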
You can call the amcatscraping.scrape module directly to invoke specific scrapers, or all of them.
```
$ python -m amcatscraping.scrape --help
Run scraper

Usage:
  scrape.py run [options] [<scraper>...]
  scrape.py list
  scrape.py -h | --help

Options:
  -h --help        Show this screen.
  --from=<date>    Scrape articles from date (default: today)
  --to=<date>      Scrape articles up to and including date (default: today)
  --dry-run        Do not commit to database
  --report         Send report to e-mail address after scraping
```
You can use list to list all scrapers configured in ~/.scrapers.conf. You can run all of the listed scrapers by specifying none:
python -m amcatscraping.scrape run
or specific ones by listing them:
python -m amcatscraping.scrape run AD FD
You can mix various options; for example:
python -m amcatscraping.scrape run AD FD --report --dry-run

The latter will e-mail a report after scraping, without committing any articles to the database.
You can use cron to install periodic jobs on Linux-based systems. To view or edit your current jobs, run crontab -e. To run all scrapers each morning at 11 A.M., add:

0 11 * * * python -m amcatscraping.scrape run --report
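Since --from and --to accept explicit dates, a cron job can also pin each run to the previous day. A possible crontab entry, assuming GNU date and that the scraper accepts ISO-formatted dates (note that % must be escaped as \% inside a crontab):

```
0 11 * * * python -m amcatscraping.scrape run --from="$(date -d yesterday +\%Y-\%m-\%d)" --report
```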