amcat/amcat-scraping

Installation

Install amcat-scraping and amcatclient directly from GitHub using pip:

pip install git+https://github.com/amcat/amcatclient git+https://github.com/amcat/amcat-scraping

(Note: you should either work in a Python virtual environment or use sudo to install system-wide.)

The scrapers can be run using the amcatscraping.scrape module (see below for configuration and options):

python -m amcatscraping.scrape --help

AmCAT 3.5

AmCAT 3.5 included the transition from Python 2 to Python 3. Additionally, some fundamental changes were made in the way it handles articles. The following scrapers have been fixed so far:

  • newspapers.ad
  • newspapers.fd
  • newspapers.nrc
  • newspapers.nrchandelsblad
  • newspapers.nrcnext
  • newspapers.pcm
  • newspapers.telegraaf
  • newspapers.trouw
  • newspapers.volkskrant
  • blogs.geenstijl
  • news.nu

Configuration

Configuration is stored in ~/.scrapers.conf.

[store]
# Project and articleset defined per scraper
host: amcat.nl
port: 443
username: amcat
password: amcat
ssl: true

[mail]
host: mail.hmbastiaan.nl
port: 587
from: [email protected]
to: [email protected]
username: martijn
password: xxxxxxx
tls: true

[*]
# Section with defaults for all scrapers
articleset: 37
project: 1

[AD]
username: xxxxxxxxxxxxxxx
password: xxxxxxxxxxxxxxx
class: newspapers.ad.AlgemeenDagbladScraper

Defaults can be found [here](https://github.com/amcat/amcat-scraping/blob/master/amcatscraping/maintenance/default.conf).

Specific options:

[store]

Defines where articles are saved after scraping.

  • host: hostname or IP address of the AmCAT instance [default: amcat.nl]
  • port: port to connect to [default: 80]
  • username / password: credentials to use when logging in
  • ssl: use SSL when connecting (the port should then probably be 443) [default: no]
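
As an illustration of how these options combine, the sketch below reads a [store] section with Python's standard configparser and applies the defaults documented above. The function name store_base_url and the idea of turning the options into an HTTP(S) base URL are assumptions made for this example; they are not taken from amcat-scraping itself.

```python
import configparser


def store_base_url(parser):
    """Build a base URL from [store], falling back to the documented defaults."""
    store = parser["store"]
    host = store.get("host", fallback="amcat.nl")
    port = store.getint("port", fallback=80)
    # getboolean accepts values such as "true", "yes", "no" and "false".
    use_ssl = store.getboolean("ssl", fallback=False)
    scheme = "https" if use_ssl else "http"
    return f"{scheme}://{host}:{port}"


parser = configparser.ConfigParser(interpolation=None)
parser.read_string("[store]\nhost: amcat.nl\nport: 443\nssl: true\n")
print(store_base_url(parser))
```

Note that enabling ssl does not change the port by itself, which is why the port is set to 443 explicitly in the example.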

[mail]

  • use_django_settings: use the default Django settings for mail; see the Django documentation on e-mail settings [default: false]
  • host: SMTP server (outgoing) hostname or IP address
  • port: port to connect to [default: 587]
  • ssl: use SSL [default: no]
  • tls: use TLS [default: true]
  • username / password: credentials to use when logging in
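
To show what these options typically mean for an SMTP client, here is a minimal sketch using Python's standard smtplib and email modules. The functions build_report and send_report, the subject line, and the example.org addresses are hypothetical; the way amcat-scraping actually sends its report may differ.

```python
import smtplib
from email.message import EmailMessage


def build_report(sender, recipient, body):
    """Create a plain-text report message (subject chosen for illustration)."""
    msg = EmailMessage()
    msg["Subject"] = "Scraper report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send_report(msg, host, port=587, username=None, password=None,
                tls=True, ssl=False):
    """Send msg, mirroring the host/port/ssl/tls/username/password options."""
    smtp_class = smtplib.SMTP_SSL if ssl else smtplib.SMTP
    with smtp_class(host, port) as smtp:
        if tls and not ssl:
            smtp.starttls()  # upgrade the plain connection to TLS
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)


# Building the message needs no network access; send_report is not called here.
msg = build_report("scraper@example.org", "admin@example.org", "Example report body")
```

With ssl the connection is encrypted from the start, whereas tls starts unencrypted and upgrades via STARTTLS, which is the usual setup on port 587.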

[*]

All settings in this section will be used as defaults for all scrapers. See the following section.
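
The effect of the [*] section can be sketched with Python's standard configparser: start from the [*] values and let the scraper's own section override them. The helper names (load_config, scraper_labels, scraper_settings) and the placeholder credentials are assumptions for this example, not the loader amcat-scraping uses.

```python
import configparser

SAMPLE = """\
[store]
host: amcat.nl
port: 443
ssl: true

[*]
# Section with defaults for all scrapers
articleset: 37
project: 1

[AD]
username: someuser
password: secret
class: newspapers.ad.AlgemeenDagbladScraper
"""

# Sections that configure the tool itself rather than a scraper.
RESERVED = {"store", "mail", "*"}


def load_config(text):
    # interpolation=None keeps "%" characters in passwords intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def scraper_labels(parser):
    """Return the section names that describe scrapers."""
    return [name for name in parser.sections() if name not in RESERVED]


def scraper_settings(parser, label):
    """Merge the [*] defaults with the scraper's own section."""
    settings = dict(parser["*"]) if parser.has_section("*") else {}
    settings.update(parser[label])  # scraper-specific values win
    return settings


parser = load_config(SAMPLE)
print(scraper_labels(parser))
print(scraper_settings(parser, "AD"))
```

All values come back as strings, so numeric options such as articleset and project still need converting before use.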

[scraper_label]

  • username / password: credentials to use when logging in
  • class: class relative to amcatscraping.scrapers
  • is_absolute_classpath: if this option is enabled, class will be considered an absolute classpath [default: no]
  • articleset: id of the articleset in which to store scraped articles
  • project: id of the project in which to store scraped articles
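
The relationship between class and is_absolute_classpath can be sketched as follows. The functions full_classpath and load_class are hypothetical names for this example; the sketch only assumes that a relative class is prefixed with amcatscraping.scrapers, as described above.

```python
import importlib

BASE_PACKAGE = "amcatscraping.scrapers"


def full_classpath(class_option, is_absolute=False):
    """Return the dotted path to import for a given class option."""
    if is_absolute:
        return class_option
    return f"{BASE_PACKAGE}.{class_option}"


def load_class(classpath):
    """Import the module part of classpath and return the named class."""
    module_name, _, class_name = classpath.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


print(full_classpath("newspapers.ad.AlgemeenDagbladScraper"))
print(full_classpath("collections.OrderedDict", is_absolute=True))
```

An absolute classpath makes it possible to point a section at a scraper class that lives outside the amcatscraping.scrapers package.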

Running

You can call amcatscraping.scrape directly to invoke specific scrapers or all of them.

$ python -m amcatscraping.scrape --help
Run scraper

Usage:
  scrape.py run [options] [<scraper>...]
  scrape.py list
  scrape.py -h | --help

Options:
  -h --help        Show this screen.
  --from=<date>    Scrape articles from date (default: today)
  --to=<date>      Scrape articles up to and including date (default: today)
  --dry-run        Do not commit to database
  --report         Send report to e-mailaddress after scraping
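
The --from and --to options describe an inclusive date range that defaults to today on both ends. A minimal sketch of that behaviour, assuming the dates have already been parsed into datetime.date objects (the function name scrape_dates is hypothetical):

```python
from datetime import date, timedelta


def scrape_dates(from_date=None, to_date=None):
    """Yield every date from from_date up to and including to_date."""
    today = date.today()
    start = from_date or today
    end = to_date or today
    if start > end:
        raise ValueError("--from must not be later than --to")
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


for day in scrape_dates(date(2016, 1, 30), date(2016, 2, 1)):
    print(day.isoformat())
```

Omitting both options yields exactly one date, which matches the documented default of scraping only today's articles.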

Use list to show all scrapers configured in ~/.scrapers.conf. To run all scrapers listed there, specify none:

PYTHONPATH=. python amcatscraping/maintenance/run.py

or specific ones by listing them:

PYTHONPATH=. python amcatscraping/maintenance/run.py AD FD

You can mix various options; for example:

PYTHONPATH=. python amcatscraping/maintenance/run.py AD FD --report --dry-run

The last command emails a report similar to the one shown below:

[Image: Scraper report]

Running periodically

You can use Cron to install periodic jobs on Linux-based systems. To view / edit your current jobs, run crontab -e. To run all scrapers each morning at 11 A.M., add:

0 11 * * * python -m amcatscraping.scrape run --report
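
A crontab entry consists of five time fields (minute, hour, day of month, month, day of week) followed by the command. A `*` in the day-of-week field matches every day, whereas a value such as `1` restricts the job to Mondays. The sketch below splits an entry into those parts; split_cron_entry and the script path are hypothetical and serve only to illustrate the field order.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_entry(entry):
    """Split a crontab line into its five time fields and the command."""
    parts = entry.split(None, 5)  # at most five splits: five fields + command
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


# Hypothetical entry: run a wrapper script every day at 11:00.
schedule, command = split_cron_entry("0 11 * * * /path/to/run-scrapers.sh --report")
print(schedule)
print(command)
```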

About

Contains scraper logic and scrapers themselves for AmCAT
