peace-machine

Setting up the PC (Instructions below assume Ubuntu 18)

Install the Anaconda distribution of Python 3.8. For remote installation, see: https://kengchichang.com/post/conda-linux/
Install awscli
Install gcc
Install ruby
Create conda environment: conda create -n peace python=3.8 and then conda activate peace
Install wayback-machine-downloader: sudo gem install wayback_machine_downloader
Install NVIDIA GPU drivers (depending on the machine)
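
Before moving on, it can help to confirm that the tools installed above are reachable from the shell. The following is a minimal sketch using only the Python standard library. The command names in REQUIRED_TOOLS are assumptions about what each install step places on PATH (for example, awscli is assumed to provide the aws command); adjust them to match your machine.

```python
import shutil

# Assumed command names for the prerequisites listed above; adjust as needed.
REQUIRED_TOOLS = ["conda", "aws", "gcc", "ruby", "gem", "wayback_machine_downloader"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All prerequisite commands were found.")
```

Run it inside the activated peace environment so that the PATH it inspects matches the one the pipeline will use.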

Setting up the Pipeline

Install the required packages

Install peace-machine from github: pip install -U git+https://github.com/ssdorsey/peace-machine.git

Setup the Mongo Database

Download + Install MongoDB Server
(Optional) Set the local database path in the mongod.conf file
In Command Prompt / Terminal, start the server with mongod
- To start the server on a specific drive (ex: D:/), either run mongod --dbpath D:/ or:
  - Make sure the drive you are starting the database on contains the folders /data/db/
  - Change your directory: cd D:
  - Launch the server: mongod
Create database and collection:
- In a new Terminal, enter the mongo shell: mongo
- Create the database (this one named "ml4p"): use ml4p
- Create a new collection (this one named "articles"): db.createCollection('articles')
- Create an index for the collection on the "url" field: db.articles.createIndex({'url':1}, {unique: true})
  - This prevents duplicate URLs from being inserted and allows quick searches on the url field
Set up access control
Set up firewall permissions
Create known actors collection
Set up actors system [Akanksha]
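
The database, collection, and index steps above can also be done from Python instead of the mongo shell. This is a minimal sketch that assumes the pymongo package is installed (pip install pymongo) and that a server is reachable at the URI you pass in. The function names ensure_articles_collection and connect_and_setup are hypothetical helpers for illustration and are not part of peace-machine.

```python
def ensure_articles_collection(db, name="articles"):
    """Create the collection if it is missing and add a unique index on "url".

    `db` is expected to behave like a pymongo Database object.
    """
    if name not in db.list_collection_names():
        db.create_collection(name)
    # Python equivalent of: db.articles.createIndex({'url': 1}, {unique: true})
    return db[name].create_index([("url", 1)], unique=True)


def connect_and_setup(uri="mongodb://localhost:27017", db_name="ml4p"):
    """Connect to the server at `uri` (an assumed default) and set up the collection."""
    from pymongo import MongoClient  # imported here so the helper above has no hard dependency

    client = MongoClient(uri)
    return ensure_articles_collection(client[db_name])
```

Calling connect_and_setup() with a running mongod should return the name of the index that was created. With the unique index in place, inserting a second document with the same url raises a duplicate key error rather than storing a copy.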

Setup ElasticSearch wikipedia

Install ElasticSearch
Import Wikipedia
- wiki_to_elastic.py
Attach whatever pageview stats you want.
- download_wikimedia_pageviews.py
- attach_wikipedia_pageviews.py

Run the Scrapers

Site-direct scrapers

Set up MongoDB (see above)
Install the fork of news-please with Mongo integration: pip install -U git+https://github.com/ssdorsey/news-please.git
Follow the news-please documentation for initial run / config
Edit the config.cfg and sitelist.hjson created in the above step:
- sitelist.hjson can be configured automatically using the create_sitelist function in scrape_direct.py
- Set the MongoDB URI
- Ensure MongoStorage is in the pipeline at the bottom of the config.cfg file
Re-launch the scrapers with: news-please -c [CONFIG_LOCATION]
- Include -resume flag if the code has been run previously
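
If you prefer to write sitelist.hjson yourself, the sketch below shows one way to generate it. It is not the create_sitelist function from scrape_direct.py; render_sitelist is a hypothetical helper, and the domains in the usage note are placeholders. It relies on two things: news-please reads its start URLs from the base_urls list in the sitelist, and any valid JSON document is also valid Hjson, so the standard json module is enough.

```python
import json


def render_sitelist(urls):
    """Return sitelist text with one base_urls entry per unique URL, in input order."""
    unique_urls = list(dict.fromkeys(urls))  # drop repeats while keeping order
    return json.dumps({"base_urls": [{"url": url} for url in unique_urls]}, indent=2)


def write_sitelist(urls, path):
    """Write the rendered sitelist to `path` (for example .../config/sitelist.hjson)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_sitelist(urls) + "\n")
```

For example, write_sitelist(["https://example.com", "https://example.org"], "sitelist.hjson") produces a file with two entries that can then be edited by hand.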

CC-News scrapers

Set up MongoDB (see above)
Check/edit the sitelist ~/DIRECTORY_FOR_STORING_LOGS/config/sitelist.hjson
Edit the directory and settings in the scrape_ccnews.py file inside of peace-machine
Execute the CC-News parser: python scrape_ccnews.py
- This will track the .warc files you have already downloaded/parsed. If you add a new domain (and thus need to rerun the files), set my_continue_process = False in commoncrawl.py
Rerun commoncrawl.py whenever you want to collect new data
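
To illustrate what tracking already-parsed .warc files means in practice, here is a minimal sketch of the idea. It is not the code in commoncrawl.py: the log file format (one file name per line) and the function names are assumptions made for illustration. The continue_process argument mirrors the role described for my_continue_process: when it is False, the log is ignored and every file is processed again.

```python
from pathlib import Path


def load_processed(log_path):
    """Return the set of file names recorded in the log (empty if the log is missing)."""
    path = Path(log_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def pending_files(all_files, log_path, continue_process=True):
    """Return the files still to be parsed, skipping logged ones when continuing."""
    done = load_processed(log_path) if continue_process else set()
    return [name for name in all_files if name not in done]


def mark_processed(name, log_path):
    """Append a file name to the log once it has been parsed."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(name + "\n")
```

This is why adding a new domain requires turning the continue behavior off: files that were already parsed for the old sitelist would otherwise be skipped and never checked against the new domain.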

Run De-Duplication and Patch Script

Open a terminal and go to peace-machine/peacemachine/scripts
Run the patch file: python3 patch_tools.py
Note: Make sure you run this script before running the location, translation and event extractor pipeline

Run translation

Open a new terminal
Run: peace-machine -t [ISO2 LANGUAGE CODE]
- Ex: peace-machine -t es
- Other options (such as batch sizing) are available using peace-machine --help
- The ISO2 language code must be in the languages collection of the DB
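
When translating several languages in a row, the command above can be assembled from a script. The sketch below only builds the argument list for peace-machine -t and checks that the code looks like a two-letter lowercase code. It does not check the languages collection, and translation_command is a hypothetical helper rather than part of peace-machine.

```python
import re


def translation_command(code, extra_args=()):
    """Build the argument list for `peace-machine -t <code>`.

    `extra_args` is for any other options listed by `peace-machine --help`.
    """
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError("expected a two-letter lowercase language code, got %r" % code)
    return ["peace-machine", "-t", code, *extra_args]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(translation_command("es"), check=True), once per language.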

Run the Event Extraction

Open a new terminal
Run the extractor: peace-machine -u [MongoDB URI] -e [MODEL NAME] -b [BATCH SIZE] -ml [LOCATION OF MODEL ON DRIVE]
- Ex: peace-machine -u mongodb://username:[PASSWORD]@[HOST] -e civic1 -b 768 -ml "/mnt/d/peace-machine/peacemachine/data/finetuned-transformers"
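
A MongoDB URI passed to -u breaks if the username or password contains characters such as @, : or /, because those characters have a meaning inside the URI. The following minimal sketch percent-encodes the credentials with the standard library before assembling the URI. build_mongo_uri is a hypothetical helper, and the default port 27017 is an assumption based on MongoDB's usual default.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, port=27017):
    """Return a mongodb:// URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}".format(
        quote_plus(username), quote_plus(password), host, port
    )
```

The returned string can be used as the value of -u, and also as the MongoDB URI in the news-please config.cfg.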

Automate

Linux
