- Python 3.6.5
- WWW2sf
- detectblocks
- Elasticsearch
To prepare the environment:

```
pipenv sync
```

Copy the `config.json.sample` file to `config.json` and edit it according to your settings.
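For example, from the repository root:

```
cp config.json.sample config.json
```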
Parameter | Description |
---|---|
crawled_data_repository | URL of the harvested page databases. |
twitter | Settings for Twitter data extraction. See below for more information. |
default_similarity_threshold | Similarity threshold above which articles are considered too similar and ignored. |
domains | Domains to consider. Each domain contains several parameters. See below for more information. |
domains_ignored | Domains to ignore. For various reasons, a domain might be irrelevant; it can be put into this section so that it is ignored. |
domains_disabled | Disabled domains. For various reasons, a domain might become irrelevant after a while; for instance, a domain might suddenly block our crawler. In such a case, it is no longer useful to process it, so it can be moved from domains to domains_disabled. Previous data will be kept and shown on the stats page, but no new data will be extracted or processed. |
url_black_list | Text file containing URLs that must be ignored. |
db_dir | Directory where the databases are stored during processing. |
html_dir | Directory where the extracted html files are stored. |
xml_dir | Directory where the converted xml files are stored. |
run_dir | Directory containing running time info. |
WWW2sf_dir | Installation directory of the WWW2sf tool. |
detectblocks_dir | Installation directory of the detectblocks tool. |
mail | Parameters needed for email notifications. See below for more information. |
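Putting these parameters together, the overall shape of `config.json` is roughly the following (a hypothetical sketch; refer to `config.json.sample` for the authoritative structure, all values here are made up):

```
{
  "crawled_data_repository": "https://example.org/crawled-dbs/",
  "default_similarity_threshold": 0.9,
  "url_black_list": "url_black_list.txt",
  "db_dir": "/data/covid19/db",
  "html_dir": "/data/covid19/html",
  "xml_dir": "/data/covid19/xml",
  "run_dir": "/data/covid19/run",
  "WWW2sf_dir": "/opt/WWW2sf",
  "detectblocks_dir": "/opt/detectblocks",
  "domains": {},
  "domains_ignored": {},
  "domains_disabled": {},
  "twitter": {},
  "mail": {}
}
```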
Each domain entry in the domains section can contain several parameters:
Parameter | Description |
---|---|
region | 2- or 3-letter code corresponding to the geographical area of the domain. It is usually an ISO 3166 alpha-2 code, but "int" means international. |
language | 2-letter code corresponding to the language of the resource. It is usually an ISO 639-1 code. |
subdomains | An array of regular expressions that the URL must match to be considered. |
sources | The official sources associated with the domain. Most of the time it is equal to the domain name; however, when domains are grouped into a bundle, it lists the sources contained in the bundle. |
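For example, a single entry under domains might look like this (a hypothetical sketch, assuming entries are keyed by domain name; all values are made up):

```
"example-news.com": {
  "region": "int",
  "language": "en",
  "subdomains": ["^https?://(www\\.)?example-news\\.com/"],
  "sources": ["example-news.com"]
}
```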
The mail section might contain several parameters:
Parameter | Description |
---|---|
from | The From: field value of the notification email. |
to | The To: field value where the notification mail will be sent. |
cc | The Cc: field value where the notification mail will also be sent. |
bcc | The Bcc: field value where the notification mail will also be sent. |
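For instance, the mail section might look like this (the addresses are hypothetical):

```
"mail": {
  "from": "covid19-pipeline@example.org",
  "to": "maintainer@example.org",
  "cc": "team@example.org",
  "bcc": ""
}
```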
In addition to the mail parameters, the OAuth settings must also be configured. Mail notifications are sent through a Google account using OAuth. From the https://console.developers.google.com page, create an OAuth 2.0 Client ID on the Credentials page. Then download the credential data and copy it into the `credentials.json` file.
The `send_mail()` function in `utils.py` must be called from a desktop computer with access to the Internet. Doing so will lead the user to authenticate to Google and authorize mail notifications. A resulting `token.pickle` file will be generated. This file must be copied into the same directory as the `config.json` file. It will then enable mail notifications without user interaction. This procedure is not very convenient, but I know of no alternative at the moment.
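For reference, the first-time authorization typically looks like the following (a minimal sketch based on the standard google-auth-oauthlib / Gmail API pattern, not the project's actual `send_mail()` code; the Gmail send scope and file paths are assumptions):

```
import os
import pickle

from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# Scope allowing the application to send mail through the Gmail API (assumed).
SCOPES = ['https://www.googleapis.com/auth/gmail.send']

def get_credentials(token_path='token.pickle', secrets_path='credentials.json'):
    """Load cached credentials, refreshing or re-running the OAuth flow if needed."""
    creds = None
    if os.path.exists(token_path):
        with open(token_path, 'rb') as f:
            creds = pickle.load(f)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            # Opens a browser window so the user can authorize the application.
            flow = InstalledAppFlow.from_client_secrets_file(secrets_path, SCOPES)
            creds = flow.run_local_server(port=0)
        # Cache the credentials next to config.json for unattended use.
        with open(token_path, 'wb') as f:
            pickle.dump(creds, f)
    return creds
```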
The twitter section contains settings related to data extraction from Twitter:
Parameter | Description |
---|---|
crawled_data_repository | URL of the harvested page databases. |
user | Login of the user authorized to access the data. |
password | Password of the user authorized to access the data. |
html_dir | Directory where the extracted Twitter html files are stored. |
xml_dir | Directory where the converted Twitter xml files are stored. |
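The twitter section could then look like this (a hypothetical sketch; all values are made up):

```
"twitter": {
  "crawled_data_repository": "https://example.org/twitter-dbs/",
  "user": "twitter_crawl_user",
  "password": "********",
  "html_dir": "/data/covid19/twitter/html",
  "xml_dir": "/data/covid19/twitter/xml"
}
```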
This project uses Elasticsearch to implement full-text search queries.
Download the latest version and install it on 3 different hosts:
```
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz
tar -xvf elasticsearch-7.9.2-linux-x86_64.tar.gz
```
From now on, we will assume that `$ES_HOME` is the location where Elasticsearch has been installed.
Each node must be configured properly beforehand.
Copy the default `$ES_HOME/config` directory under a different name like `$ES_HOME/config-covid19-node-$NN`, where `$NN` is the number of the node (for instance, 01, 02, and 03). Edit the `elasticsearch.yml` file in the new directory and adjust it to specify the following information:
```
cluster.name: covid19
node.name: node-$NN
path.data: $ES_NODE_NN_DIR/data
path.logs: $ES_NODE_NN_DIR/logs
path.repo: $ES_REPO_DIR
network.host: 0.0.0.0
discovery.seed_hosts: ["$HOST_NODE_01", "$HOST_NODE_02", "$HOST_NODE_03"]
```
where:
- `$NN` is the number of the node (for instance, 01, 02, and 03);
- `$ES_NODE_NN_DIR` is the location where the data and logs of this particular node will be stored;
- `$ES_REPO_DIR` is the shared location where the snapshots of the whole repository will be stored;
- `$HOST_NODE_01`, `$HOST_NODE_02`, and `$HOST_NODE_03` are the hostnames of the 3 nodes.
The analysis-kuromoji plugin must also be installed into your Elasticsearch instance(s) like this:
```
cd $ES_HOME
./bin/elasticsearch-plugin install analysis-kuromoji
```
To start an Elastic Search node:
```
cd $ES_HOME/bin
ES_PATH_CONF=$ES_HOME/config-covid19-node-$NN ./elasticsearch
```
To run the node as a daemon, use the `-d -p pid-node-$NN` options.
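For instance, node 01 could be started as a daemon like this:

```
cd $ES_HOME/bin
ES_PATH_CONF=$ES_HOME/config-covid19-node-01 ./elasticsearch -d -p pid-node-01
```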
To create the indices, data streams, and other Elasticsearch objects, refer to this document.
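Before creating those objects, it may be worth checking that the 3 nodes have actually formed a cluster, for instance with the standard cluster health API (the hostname is a placeholder):

```
curl -X GET "http://$HOST_NODE_01:9200/_cluster/health?pretty"
```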