Install amcat-scraping and amcatclient directly from github using pip:
pip install git+https://github.com/amcat/amcatclient git+https://github.com/amcat/amcat-scraping
(Note: you should probably either work in a Python virtual environment, or use sudo to install system-wide.)
The scrapers can be run using the amcatscraping.scrape module (see below for configuration and options):
python -m amcatscraping.scrape --help
AmCAT 3.5 included the transition from Python 2 to Python 3, along with some fundamental changes in the way articles are handled. This is an overview of the scrapers that have been fixed so far:
- newspapers.ad
- newspapers.fd
- newspapers.nrc
- newspapers.nrchandelsblad
- newspapers.nrcnext
- newspapers.pcm
- newspapers.telegraaf
- newspapers.trouw
- newspapers.volkskrant
- blogs.geenstijl
- news.nu
Configuration is stored in ~/.scrapers.conf:
```ini
[store]
# Project and articleset defined per scraper
host: amcat.nl
port: 443
username: amcat
password: amcat
ssl: true

[mail]
host: mail.hmbastiaan.nl
port: 587
from: [email protected]
to: [email protected]
username: martijn
password: xxxxxxx
tls: true

[*]
# Section with defaults for all scrapers
articleset: 37
project: 1

[AD]
username: xxxxxxxxxxxxxxx
password: xxxxxxxxxxxxxxx
class: newspapers.ad.AlgemeenDagbladScraper
```
Defaults can be found [here](https://github.com/amcat/amcat-scraping/blob/master/amcatscraping/maintenance/default.conf).
Specific options:

[store]

Defines where articles are saved after scraping.

- host: hostname or IP-address of the AmCAT instance [default: amcat.nl]
- port: port to connect to [default: 80]
- username / password: credentials to use when logging in
- ssl: use SSL when connecting (port should probably be 443) [default: no]
[mail]

- use_django_settings: use default Django settings for mail; see the Django documentation on e-mail settings [default: false]
- host: SMTP server (outgoing) hostname / IP-address
- port: port to connect to [default: 587]
- ssl: use SSL [default: no]
- tls: use TLS [default: true]
- username / password: credentials to use when logging in
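To illustrate how these mail settings map onto Python's standard library, here is a minimal sketch of a report mailer. The helper names (build_report_mail, send_report) are hypothetical, not part of amcat-scraping, and the actual mailer may work differently:

```python
# Sketch: sending a scraper report using values from the [mail] section.
# build_report_mail / send_report are illustrative helpers, not amcat-scraping API.
import smtplib
from email.mime.text import MIMEText

def build_report_mail(settings, body):
    """Build a plain-text report message from a dict of [mail] settings."""
    msg = MIMEText(body)
    msg["Subject"] = "Scraper report"
    msg["From"] = settings["from"]
    msg["To"] = settings["to"]
    return msg

def send_report(settings, body):
    """Send the report; port 587 normally implies STARTTLS, 465 implicit SSL."""
    msg = build_report_mail(settings, body)
    with smtplib.SMTP(settings["host"], int(settings["port"])) as smtp:
        if settings.get("tls") == "true":
            smtp.starttls()  # upgrade the connection, matching 'tls: true'
        smtp.login(settings["username"], settings["password"])
        smtp.send_message(msg)
```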
[*]

All settings in this section are used as defaults for all scrapers. See the following section.
Per-scraper sections (e.g. [AD])

- username / password: credentials to use when logging in
- class: class relative to amcatscraping.scrapers
- is_absolute_classpath: if this option is enabled, class is considered an absolute classpath [default: no]
- articleset: id of the articleset in which to store scraped articles
- project: id of the project in which to store scraped articles
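The interaction between [*] and the per-scraper sections can be sketched with the standard configparser module. This is a simplified illustration of the merging behaviour described above, not the actual amcat-scraping code:

```python
import configparser

SAMPLE = """
[*]
articleset: 37
project: 1

[AD]
class: newspapers.ad.AlgemeenDagbladScraper
"""

def scraper_options(config, section):
    """Merge the [*] defaults with a specific scraper's section;
    the scraper's own settings win on conflict."""
    options = dict(config["*"]) if config.has_section("*") else {}
    options.update(config[section])
    return options

config = configparser.ConfigParser()
config.read_string(SAMPLE)

options = scraper_options(config, "AD")
print(options["articleset"])  # inherited from [*]: 37
print(options["class"])       # scraper-specific setting
```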
You can call the amcatscraping.scrape module directly to invoke specific scrapers, or all of them.
```
$ python -m amcatscraping.scrape --help
Run scraper

Usage:
  scrape.py run [options] [<scraper>...]
  scrape.py list
  scrape.py -h | --help

Options:
  -h --help        Show this screen.
  --from=<date>    Scrape articles from date (default: today)
  --to=<date>      Scrape articles up to and including date (default: today)
  --dry-run        Do not commit to database
  --report         Send report to e-mail address after scraping
```
You can use list to list all scrapers configured in ~/.scrapers.conf. You can run all of the listed scrapers by specifying none:
python -m amcatscraping.scrape run
or specific ones by listing them:
python -m amcatscraping.scrape run AD FD
You can mix various options; for example:
python -m amcatscraping.scrape run AD FD --report --dry-run

The latter will e-mail a report after scraping, without committing any articles to the database.
You can use cron to install periodic jobs on Linux-based systems. To view or edit your current jobs, run crontab -e. To run all scrapers each morning at 11 A.M., add:

0 11 * * * python -m amcatscraping.scrape run --report
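Since --from and --to accept explicit dates, a cron job can also pin each run to the previous day. A possible crontab entry, assuming GNU date and that the scraper accepts ISO-formatted dates (note that % must be escaped as \% inside a crontab):

```
0 11 * * * python -m amcatscraping.scrape run --from="$(date -d yesterday +\%Y-\%m-\%d)" --report
```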