- Install the Anaconda distribution of Python 3.8. For remote installation, see: https://kengchichang.com/post/conda-linux/
- Install awscli
- Install gcc
- Install ruby
- Create and activate a conda environment:
conda create -n peace python=3.8
conda activate peace
- Install wayback-machine-downloader:
sudo gem install wayback_machine_downloader
- Install NVIDIA GPU drivers (depending on the machine)
- Install peace-machine from github:
pip install -U git+https://github.com/ssdorsey/peace-machine.git
- Download + Install MongoDB Server
- (Optional) Set the local database path in the mongod.conf file
- In Command Prompt / Terminal, start server with
mongod
- If you want to start the server on a specific drive (ex: D:/):
mongod --dbpath D:/
- OR: make sure you're starting the database in a path containing the folders /data/db/:
- Change your directory:
cd D:
- Launch the server:
mongod
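If you prefer setting the data directory in mongod.conf rather than on the command line, the relevant option is storage.dbPath. The paths and values below are illustrative, not the project's actual config:

```yaml
# mongod.conf (YAML) -- illustrative values, adjust for your machine
storage:
  dbPath: "D:/data/db"
net:
  bindIp: 127.0.0.1
  port: 27017
```

Start the server with `mongod --config /path/to/mongod.conf` to pick up these settings.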
- Create database and collection:
- In a new Terminal, enter the mongo shell:
mongo
- Create the database (this one named "ml4p"):
use ml4p
- Create a new collection (this one named "articles"):
db.createCollection('articles')
- Create an index for the collection on the "url" field:
db.articles.createIndex({'url':1}, {unique: true})
- This is so duplicate URLs aren't inserted and we can do quick searches on the url field
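The unique index makes MongoDB reject any insert whose url already exists. When bulk-inserting, it can help to pre-filter a batch client-side the same way; the helper below is a hypothetical sketch, not part of peace-machine:

```python
def dedupe_by_url(articles):
    """Keep only the first article seen for each URL, mirroring the
    server-side unique index on the "url" field."""
    seen = set()
    unique = []
    for article in articles:
        if article["url"] not in seen:
            seen.add(article["url"])
            unique.append(article)
    return unique
```

Passing a deduplicated batch avoids duplicate-key errors aborting an ordered bulk insert.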
- Set up access control
- Set up firewall permissions
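A minimal access-control sketch: create an admin user in the mongo shell, then enable authorization. The user name, password, and role below are placeholders; firewall rules depend on your OS and hosting:

```javascript
// In the mongo shell -- placeholder credentials, replace with your own
use admin
db.createUser({
  user: "peaceAdmin",
  pwd: "CHANGE_ME",
  roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
})
```

After creating the user, set `security.authorization: enabled` in mongod.conf and restart the server so connections require authentication.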
- Create known actors collection
- Set up actors system [Akanksha]
- Install ElasticSearch
- Import Wikipedia
- wiki_to_elastic.py
- Attach whatever pageview stats you want.
- download_wikimedia_pageviews.py
- attach_wikipedia_pageviews.py
- Set up MongoDB (see above)
- Install the fork of news-please with Mongo integration:
pip install -U git+https://github.com/ssdorsey/news-please.git
- Follow the news-please documentation for initial run / config
- Edit the config.cfg and sitelist.hjson created in the above step:
- sitelist.hjson can be configured automatically using the create_sitelist function in scrape_direct.py
- Set the MongoDB URI
- Ensure MongoStorage is in the pipeline at the bottom of the config.cfg file
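As a sketch, the pipeline section of config.cfg should list the Mongo storage module. The module paths and priority numbers below are assumptions based on stock news-please and may differ in the fork:

```ini
[Scrapy]
ITEM_PIPELINES = {
    'newsplease.pipeline.pipelines.ArticleMasterExtractor': 100,
    'newsplease.pipeline.pipelines.MongoStorage': 350,
}
```

Lower numbers run earlier, so extraction must come before storage.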
- Re-launch the scrapers with:
news-please -c [CONFIG_LOCATION]
- Include the --resume flag if the code has been run previously
- Set up MongoDB (see above)
- Check/edit the sitelist ~/DIRECTORY_FOR_STORING_LOGS/config/sitelist.hjson
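For reference, a minimal sitelist.hjson follows the stock news-please format; the domain below is a placeholder:

```hjson
{
  base_urls: [
    {
      url: "www.example.com"
    }
  ]
}
```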
- Edit the directory and settings in the scrape_ccnews.py file inside of peace-machine
- Execute the CC-News parser:
python scrape_ccnews.py
- This will track the .warc files you have already downloaded/parsed. If you add a new domain (and thus need to rerun the files), set
my_continue_process = False
in commoncrawl.py
- Rerun commoncrawl.py whenever you want to collect new data
- Open a terminal and go to peace-machine/peacemachine/scripts
- Run the patch file:
python3 patch_tools.py
Note: make sure you run this script before running the location, translation, and event-extractor pipelines
- Open a new terminal
- Run:
peace-machine -t [ISO2 LANGUAGE CODE]
- Ex:
peace-machine -t es
- Other options (such as batch sizing) are available using
peace-machine --help
- The ISO2 language code must be in the languages collection of the DB
- Open a new terminal
- Run the extractor:
peace-machine -u [MongoDB URI] -e [MODEL NAME] -b [BATCH SIZE] -ml [LOCATION OF MODEL ON DRIVE]
- EX:
peace-machine -u mongodb://username:[email protected] -e civic1 -b 768 -ml "/mnt/d/peace-machine/peacemachine/data/finetuned-transformers"