Reports data package and entity read counts from the EDI data repository, and more...
counter analyzes download metrics for each data package in the EDI data repository during the period of time selected using the "start" and "end" date options. Specifically, counter determines what data entities were available for download during this period and then "counts" the number of times they were requested for download through the PASTA+ REST API. This value is recorded in a SQLite database, along with the data entity resource identifier (its unique identifier within PASTA+), the package identifier of the associated data package, the common name of the data entity as found in metadata, and the creation date and time of the data entity.
Using the information obtained for each data entity (specifically, the data package identifier), counter then collects the data package metadata to obtain the title and DOI of the data package, and then calculates the total downloads for all data entities in the data package. This information is also recorded in a SQLite database.
counter has two options for obtaining the above information: first, and by default, counter uses the PASTA+ REST API to collate and aggregate the download metrics. Alternatively, counter can use direct connections to PASTA+ databases to generate the same download metrics -- this alternate approach is much faster, but does require access privileges to PASTA+ databases.
The most direct way to install counter is to clone the counter GitHub repository, create and activate a Python virtual environment, install the necessary Python dependencies found in environment-min.yml or requirements.txt, and then copy the file config.py.template to config.py.
For Conda:
git clone https://github.com/PASTAplus/counter.git
cd counter
conda env create --file environment-min.yml
conda activate counter
cp ./src/counter/config.py.template ./src/counter/config.py
pip install -e .
For other Python virtual environments (assumes one is already installed and active):
git clone https://github.com/PASTAplus/counter.git
cd counter
pip install -r requirements.txt
cp ./src/counter/config.py.template ./src/counter/config.py
pip install -e .
If everything is installed correctly, you should be able to run counter --help and see the counter help information (see below).
Note: Development was performed in a conda virtual environment using the PyCharm IDE. To replicate and run counter in this manner, you must first install anaconda3 from https://www.anaconda.com/products/individual, and then create a working conda virtual environment using conda env create --file environment-min.yml. conda will use the dependency specifications in the environment-min.yml file to install the appropriate Python 3 packages.
Once installed in this manner, you may execute counter by first activating the counter virtual environment (conda activate counter), and then either running python counter.py <OPTIONS> SCOPE CREDENTIALS, or installing counter using pip install . and running it directly from the command line as counter <OPTIONS> SCOPE CREDENTIALS. See below for specific options and required arguments.
Usage: counter [OPTIONS] SCOPE CREDENTIALS
Perform analysis of data entity downloads for the given PASTA+ SCOPE from
START_DATE to END_DATE.
SCOPE: PASTA+ scope value
CREDENTIALS: User credentials in the form 'DN:PW', where DN is the
full EDI LDAP distinguished name (e.g., uid=USER,o=EDI,
dc=edirepository,dc=org) and PW is the corresponding password
Options:
  -s, --start TEXT  Start date from which to begin search in ISO 8601
                    format (default is 2013-01-01T00:00:00)
  -e, --end TEXT    End date from which to end search in ISO 8601 format
                    (default is today)
  -p, --path TEXT   Directory path for which to write SQLite database and CSVs
  -n, --newest      Report only on newest data package entities
  -d, --db          Use the PASTA+ database directly (must have authorization)
  -c, --csv         Write out CSV tables in addition to the SQLite database
  -v, --verbose     Send output to standard out (-v or -vv or -vvv for
                    increasing output)
  -o, --one         Include downloads from DataONE
  -h, --help        Show this message and exit.
Running counter requires only the SCOPE of the data packages of interest, the CREDENTIALS for your EDI LDAP account, and the start and end dates of the time period you would like to analyze. It is most informative to use either the -v or -vv flag to provide runtime feedback. For a short-running example:
counter -s "2019-01-01T00:00:00" -e "2020-01-01T00:00:00" -v knb-lter-nin "uid=msobel,o=EDI,dc=edirepository,dc=org:PASSWORD"
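Typing the password directly on the command line, as above, leaves it in the shell history. The following is a minimal sketch of one way to avoid that, assuming a small wrapper script that is not part of counter. The helper names build_credentials, build_command, and main are hypothetical; only the 'DN:PW' form and the -s, -e, and -v options come from the help text above.

```python
import getpass
import subprocess


def build_credentials(uid: str, password: str) -> str:
    """Return CREDENTIALS in the 'DN:PW' form described in the help text."""
    dn = f"uid={uid},o=EDI,dc=edirepository,dc=org"
    return f"{dn}:{password}"


def build_command(scope, credentials, start=None, end=None, verbosity=1):
    """Assemble the argument list for a counter run (hypothetical helper)."""
    cmd = ["counter"]
    if start:
        cmd += ["-s", start]
    if end:
        cmd += ["-e", end]
    if verbosity > 0:
        cmd.append("-" + "v" * verbosity)  # -v, -vv, or -vvv
    cmd += [scope, credentials]
    return cmd


def main():
    # Prompt for the password instead of typing it into the shell.
    password = getpass.getpass("EDI LDAP password: ")
    cmd = build_command(
        "knb-lter-nin",
        build_credentials("msobel", password),
        start="2019-01-01T00:00:00",
        end="2020-01-01T00:00:00",
    )
    # Passing a list (no shell) avoids quoting problems with the DN.
    subprocess.run(cmd, check=True)
```

Calling main() runs the same analysis as the example above. Note that the password is still passed to counter as a command-line argument, so it may be visible to other users of the same machine while the process runs.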
Analysis times depend on the number of data entities found within the time period and on how busy PASTA+ is when counter runs. In general, you can expect counter to take between 10 and 30 seconds per entity, which means that typical runs may take hours. Because counter makes a considerable number of PASTA+ REST API calls to perform the analysis, running it places a noticeable load on PASTA+. With this in mind, please be considerate of other users when running counter. Thanks!
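To get a rough sense of how long a run might take, the per-entity figure above can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the function name estimate_runtime_hours and the entity count of 500 are made up for illustration, and actual times depend on PASTA+ load.

```python
def estimate_runtime_hours(n_entities, low_s=10, high_s=30):
    """Return a (low, high) run-time estimate in hours.

    Assumes the rough 10 to 30 seconds per entity quoted above.
    """
    return (n_entities * low_s / 3600, n_entities * high_s / 3600)


# Hypothetical scope with 500 data entities in the selected period.
low, high = estimate_runtime_hours(500)
print(f"{low:.1f} to {high:.1f} hours")  # prints "1.4 to 4.2 hours"
```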
The data collected by counter are motivated by the needs of information managers who must report download statistics to colleagues and funding agencies. Two sets of data are collected: 1) download metrics at the data entity level and 2) basic metadata at the data package level, including an aggregated sum of all data entity counts within the data package (see table schemas below).
The entities table:
- rid (resource identifier) - string, primary key
- pid (package identifier) - string
- date_created (date of entity creation in PASTA+) - datetime
- count (download count) - integer
- name (entity common name) - string
The packages table:
- pid (package identifier) - string, primary key
- doi (package digital object identifier) - string
- title (package title) - string
- count (aggregated download count) - integer
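The relationship between the two schemas can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not counter's actual code: the table names entities and packages follow the names used in this document, the SQL column types are guesses based on the schema lists, an in-memory database stands in for the file counter writes, and every row is invented sample data.

```python
import sqlite3

# In-memory stand-in for the SQLite database that counter writes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (
        rid TEXT PRIMARY KEY,
        pid TEXT,
        date_created TEXT,
        "count" INTEGER,
        name TEXT
    );
    CREATE TABLE packages (
        pid TEXT PRIMARY KEY,
        doi TEXT,
        title TEXT,
        "count" INTEGER
    );
""")

# Invented sample rows; real rid values are PASTA+ resource identifiers.
entity_rows = [
    ("example-rid-1", "knb-lter-nin.1.1", "2019-03-01T00:00:00", 12, "Table A"),
    ("example-rid-2", "knb-lter-nin.1.1", "2019-03-01T00:00:00", 5, "Table B"),
    ("example-rid-3", "knb-lter-nin.2.1", "2019-12-15T14:03:12", 1, "Table C"),
]
conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?, ?)", entity_rows)

# The package-level count is the sum of its entities' counts.
conn.execute("""
    INSERT INTO packages (pid, doi, title, "count")
    SELECT pid, NULL, 'placeholder title', SUM("count")
    FROM entities
    GROUP BY pid
""")

# A simple summary report that uses only the packages table.
summary = conn.execute(
    'SELECT pid, "count" FROM packages ORDER BY "count" DESC'
).fetchall()
for pid, total in summary:
    print(pid, total)
```

With the sample rows above, package knb-lter-nin.1.1 totals 17 downloads (12 + 5) and knb-lter-nin.2.1 totals 1.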
The two tables provide a means for users to generate any number of reports, including a simple summary report that uses only the packages table. One key aspect of the entities table is the date_created value: one can better understand count values by placing the data entity on a timeline, especially if counts seem unusually low. For example, if your end date is 2020-01-01T00:00:00 and the date_created of a data entity is 2019-12-15T14:03:12, then a low download count may be reasonable, since the data entity was available for download for only 16 days. If, however, the date_created was 2013-12-15T14:03:12, a low count would be suspicious.
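The timeline check described above is simple date arithmetic. A minimal sketch, assuming date_created is stored as an ISO 8601 string; the function name days_available is hypothetical and not part of counter:

```python
from datetime import datetime


def days_available(date_created: str, end_date: str) -> int:
    """Whole days an entity was available before the end of the period."""
    created = datetime.fromisoformat(date_created)
    end = datetime.fromisoformat(end_date)
    return (end - created).days


# The example from the text: created in mid-December 2019, period ends 2020-01-01.
print(days_available("2019-12-15T14:03:12", "2020-01-01T00:00:00"))  # prints 16
```

Dividing an entity's count by this value gives a downloads-per-day figure that is easier to compare across entities created at different times.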