Border Wait is a Python web crawler and Facebook/Twitter bot. It crawls data from Qendra Kombëtare për Menaxhim Kufitar (QKMK), then posts the delays at each border crossing, together with a specific GIF, on Facebook and Twitter.
- Operating System: Ubuntu 14.04 LTS
- Language: Python
- Crawling Framework: Scrapy
- Scrapy Deploy and Control Spider API: Scrapyd
- Python Twitter API: Tweepy
- Database: MongoDB
- MongoDB Driver: PyMongo
- Web Server: Nginx
- Time-based scheduler: Cron (optional)
GPG (GNU Privacy Guard) is the tool used in secure apt to sign files and check their signatures.
apt-key is a program that is used to manage a keyring of gpg keys for secure apt. The keyring is kept in the file /etc/apt/trusted.gpg (not to be confused with the related but not very interesting /etc/apt/trustdb.gpg). apt-key can be used to show the keys in the keyring, and to add or remove a key.
NOTE!!! These packages are currently not updated and may not work on Ubuntu 16.04 or later versions.
Import the GPG key used to sign Scrapy packages into the APT keyring:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
Create the /etc/apt/sources.list.d/scrapy.list file using the following command:
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
Install these dependencies in order to install Scrapy:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
Update package lists and install the scrapy package:
sudo apt-get update && sudo apt-get install scrapy
Install the Scrapyd service for deploying and managing spiders:
sudo apt-get install scrapyd
Install scrapyd-deploy for deploying the Scrapy project:
sudo apt-get install scrapyd-deploy
We need to install the Scrapy Python package as well:
sudo pip install scrapy==1.0.5
Install Git:
sudo apt-get install git
Go inside the project's folder and install Scrapyd using pip:
sudo pip install -e git+https://github.com/scrapy/scrapyd.git@32be9b85b0ba496e5a5d983ee492f1116f9cfbb9#egg=scrapyd
NOTE!!! In order to avoid some errors, you have to downgrade the Twisted package to version 16.4.1:
sudo pip install Twisted==16.4.1
Install dependencies:
sudo pip install pyasn1 tweepy --upgrade
Clone the project onto your machine:
git clone https://github.com/opendatakosovo/border-wait.git
cd border-wait
Add the Facebook and Twitter API access tokens:
cd borderwait
sudo nano settings.py
...
# TWITTER ACCESS TOKENS FOR TWITTER BOT
TWITTER_CONSUMER_KEY = ''
TWITTER_CONSUMER_SECRET = ''
TWITTER_ACCESS_TOKEN = ''
TWITTER_ACCESS_TOKEN_SECRET = ''
# FACEBOOK ACCESS TOKENS FOR FACEBOOK BOT
FACEBOOK_ACCESS_TOKEN = ''
...
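Once the tokens are filled in, you can quickly check that the Twitter credentials work with a short Tweepy sketch like the one below (illustrative only, not part of the project; run it from the project root so borderwait.settings is importable):

import tweepy
from borderwait import settings

# Sketch: authenticate with the tokens from settings.py.
auth = tweepy.OAuthHandler(settings.TWITTER_CONSUMER_KEY,
                           settings.TWITTER_CONSUMER_SECRET)
auth.set_access_token(settings.TWITTER_ACCESS_TOKEN,
                      settings.TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Raises an error if the tokens are wrong.
print(api.verify_credentials().screen_name)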
Get the global root project path (run the pwd command from the project root to get the absolute path):
pwd
/home/<user>/border-wait
Set the global root project path in settings.py:
cd borderwait
sudo nano settings.py
...
GLOBAL_PROJECT_DIRECTORY = '/home/<user>/border-wait'
...
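An absolute path is needed here because Scrapyd runs the spiders from its own working directory, so relative paths to files inside the project would break. Below is a purely illustrative sketch of how such a setting is typically used to build paths to project assets such as the GIFs (the gifs folder and file name are hypothetical, not the project's actual layout):

import os

GLOBAL_PROJECT_DIRECTORY = '/home/<user>/border-wait'
# Hypothetical example: build an absolute path to an asset file.
gif_path = os.path.join(GLOBAL_PROJECT_DIRECTORY, 'gifs', 'example.gif')
print(gif_path)  # /home/<user>/border-wait/gifs/example.gif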
Install nginx for the reverse proxy and apache2-utils for basic-auth login security:
sudo apt-get install nginx apache2-utils
Now create a user and password:
sudo htpasswd -c /etc/nginx/.htpasswd <type your username>
Hit enter and it will ask you for a password. Then you need to edit the nginx configuration file to create a reverse proxy to the Scrapyd web service.
sudo nano /etc/nginx/sites-available/default
Then edit the config file according to this:
server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    root /usr/share/nginx/html;

    # Make site accessible from http://localhost/
    server_name _;

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    location ~ /\.ht {
        deny all;
    }

    location / {
        proxy_pass http://localhost:6800;
        auth_basic "Restricted Content";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        #try_files $uri $uri/ =404;
    }
}
Now reload the nginx service:
sudo service nginx reload
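To confirm that the reverse proxy and the basic auth work, request the Scrapyd web UI through nginx. Here is a small sketch using the Python requests library (assumed to be available; install it with sudo pip install requests, and replace the credentials with the ones you created above):

import requests

# Without credentials nginx should answer 401; with them it should proxy
# through to Scrapyd and return 200.
print(requests.get('http://localhost/').status_code)
print(requests.get('http://localhost/', auth=('myuser', 'mypassword')).status_code)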
First of all, you need to block Scrapyd from being accessed from outside the server. To do this, edit the Scrapyd configuration file:
sudo nano /etc/scrapyd/conf.d/000-default
Then add a bind_address above the port setting and set it to 127.0.0.1 (the default is 0.0.0.0):
bind_address = 127.0.0.1
Then restart the Scrapyd service:
sudo service scrapyd restart
Go inside the project's folder and run the command below to deploy the project:
scrapyd-deploy -p borderwait
Note: If you get errors when running the command above, remove the project.egg-info and build folders inside the project folder.
If you want to delete the deployed project:
curl http://localhost:6800/delproject.json -d project=borderwait
With the bind_address set to 127.0.0.1 above, Scrapyd no longer accepts requests from outside the server (the default 0.0.0.0 listens on all interfaces). Schedule a spider run locally:
curl http://localhost:6800/schedule.json -d project=borderwait -d spider=borderwait
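The same scheduling request can also be made from Python, which is handy if you want to trigger crawls from another script. Here is a sketch equivalent to the curl command above (assumes the requests library is installed):

import requests

# POST to Scrapyd's schedule.json endpoint, exactly like the curl call above.
response = requests.post('http://localhost:6800/schedule.json',
                         data={'project': 'borderwait', 'spider': 'borderwait'})
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}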
There are three item pipelines in the Scrapy project (FacebookPipeline, MongoPipeline and TwitterPipeline). You can disable an item pipeline by commenting out (putting a # in front of) its line in settings.py, as below:
sudo nano borderwait/settings.py
...
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'borderwait.pipelines.facebook.FacebookPipeline': 900,
    # 'borderwait.pipelines.mongo.MongoPipeline': 1100,
    # 'borderwait.pipelines.twitter.TwitterPipeline': 250
}
...
Note: If your project is already deployed, you must delete the deployed project and deploy it again for the new settings to take effect.
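For context, each entry in ITEM_PIPELINES points to a class that receives every crawled item, and the number is its priority (lower values run first). Below is a minimal, purely illustrative sketch of what such a pipeline looks like; it is not the project's actual MongoPipeline, and the database and collection names are made up:

import pymongo

class ExampleMongoPipeline(object):
    """Illustrative pipeline that stores every crawled item in MongoDB."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['borderwait']['border_waits']

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        # Called for every item; each pipeline must return the item so the
        # remaining pipelines (e.g. FacebookPipeline) can process it too.
        self.collection.insert_one(dict(item))
        return item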
If you want the Scrapy project to run automatically, consider configuring Cron as a time-based scheduler. Let's dive into the configuration!
Installing cron:
sudo apt-get install cron
Write tasks that will be repeated periodically at a specified time:
crontab -e
At the end of the crontab file, add a cron expression. First specify the time at which the command will be invoked automatically; the time format is (minute, hour, day of the month, month, day of the week). Use this online tool to build your time expression more easily: https://crontab.guru/
...
# m h dom mon dow command
0 */6 * * * curl http://localhost:6800/schedule.json -d project=borderwait -d spider=borderwait
Note: The crontab expression above will execute the curl command every 6 hours!
Restart cron:
sudo service cron restart
Stop cron:
sudo service cron stop