This repository was archived by the owner on Mar 19, 2024. It is now read-only.

Comparing changes

base repository: CDLUC3/counter-processor
base: master
head repository: IQSS/counter-processor
compare: goto-gdcc
Can’t automatically merge.
  • 2 commits
  • 3 files changed
  • 2 contributors

Commits on Mar 21, 2024

  1. Update README.md

    pdurbin authored Mar 21, 2024
    faebe61

Commits on Jan 30, 2025

  1. 61ca44a
Showing with 239 additions and 150 deletions.
  1. +1 −150 README.md
  2. +36 −0 user-agents/lists/machine.txt
  3. +202 −0 user-agents/lists/robots.txt
151 changes: 1 addition & 150 deletions README.md
@@ -1,150 +1 @@
# Counter Processor

## Introduction

The Counter Processor is a Python 3 script (written against Python 3.6.4) that processes dataset access statistics from logs
according to the COUNTER Code of Practice for Research Data.

The software assumes you are already logging your COUNTER dataset *investigations* and *requests* to a log file in a format somewhat similar to extended log format. The COUNTER Code of Practice requires that descriptive metadata be submitted along with statistics; these items are included in the logs to ease later processing.

Log items are separated by tabs (\t), and any missing value may be logged as a dash (-) or as an empty string. An illustrative log line follows the field list below.

## Items to log per line for processing
- Event datetime in ISO8601 format
- Client IP address
- Session cookie id (if available, otherwise blank)
- User cookie id (if available, otherwise blank)
- User identifier (if available, otherwise blank)
- Requested URL
- Identifier of requested item (likely a DOI)
- Filename (optional)
- Size of request (required for requests)
- User-agent sent in the request
- Title
- Publisher
- Publisher ID (such as ISNI or GRID)
- Authors, separated by a pipe character (|) if there is more than one author
- Publication Date (ISO8601 format)
- Version
- Other/Alternate ID
- Target URL that the identifier (such as the DOI) would resolve to
- Year of Publication
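
For illustration, here is a hypothetical log line with the nineteen fields above in order. Every value is invented, and `<TAB>` stands in for a literal tab character:

```
2018-03-13T14:02:56Z<TAB>192.0.2.10<TAB>sess-abc123<TAB>user-def456<TAB>-<TAB>https://example.org/dataset/xyz<TAB>doi:10.1234/ABCD<TAB>-<TAB>2048<TAB>Mozilla/5.0<TAB>My Dataset Title<TAB>Example Publisher<TAB>grid.1234.5<TAB>Smith, J.|Doe, A.<TAB>2018-01-15<TAB>1.0<TAB>-<TAB>https://example.org/dataset/xyz<TAB>2018
```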

## Overview of processing logs

- Make your logs available to the script on the file system
- Set up the configuration file
- Override any settings you need to change with environment variables
- Run the script; it will go through these stages:
- Log processing
- Statistic generation and output
- Submission of statistics

## Make logs available
You will need to run the script on a computer where the log files you're trying to process are available on the file system.

## Download the free IP to geolocation database
This product includes GeoLite2 data created by MaxMind, available from [http://www.maxmind.com](http://www.maxmind.com).

GeoLite2 is a free IP geolocation database that the script requires. You can download it from [https://dev.maxmind.com/geoip/geoip2/geolite2/](https://dev.maxmind.com/geoip/geoip2/geolite2/). Choose the GeoLite2 Country database (binary, gzipped) and extract it into the maxmind_geoip directory inside the application to use the default configuration, or put it elsewhere and configure the path as mentioned below.
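
For example, assuming the downloaded archive unpacks into a `GeoLite2-Country_*` directory (MaxMind's packaging may differ, and downloads now require a free MaxMind account):

```
# extract the gzipped archive downloaded from MaxMind and copy the database
# into the default location; the directory names here are assumptions
tar -xzf GeoLite2-Country.tar.gz
cp GeoLite2-Country_*/GeoLite2-Country.mmdb maxmind_geoip/
```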

## Set up the configuration file
The script takes a number of configuration parameters in order to run correctly. See **config/config.yaml** for an example, and the sketch after the option list below. To change the configuration, you can edit config/config.yaml in place, or put a configuration file at a different location and specify it with an environment variable when starting the script, as in the example below.

```CONFIG_FILE=path/to/my/config.yaml ./main.py```

If you don't set a CONFIG_FILE the script will use the one at *config/config.yaml*.

### The options
- **log\_name\_pattern**: the pattern for the daily log files the script should look for. Include the string "(yyyy-mm-dd)" in your log file pattern; when locating log files, the script replaces this string (without the parentheses) with the actual year, month, and day.
- **path_types** has two sub-keys, *investigations* and *requests*. Each sub-key holds an array of regular expressions that classify the path portion (and everything after it) of a requested URL as either an *investigation* or a *request* in your system.
- **robots_url** is a URL from which to download a list of regular expressions (one per line in a text file) that the script uses to classify a user-agent as a robot/crawler.
- **machines_url** is a URL from which to download a list of regular expressions (one per line in a text file) that the script uses to classify a user-agent as machine (rather than human) access.
- **year_month** is the year and month for which you want to create a report. For example, 2018-05.
- **output_file**: the path and file to write the report to. Leave off the extension because it will be automatically supplied based on the *output_format*.
- **output_format**: Choose either *tsv* or *json* for this value. Currently only json is fully functional.
- **platform** is the name of your platform which is used in the report output.
- **hub\_api\_token**: set this value in a *secrets.yaml* file in the same directory as your config.yaml if you are committing your config.yaml to a public repository. Be sure to exclude secrets.yaml from being committed to a public place, for example with a .gitignore file.
- **hub\_base\_url**: a value such as https://metrics.datacite.org that will be used as the base URL for submitting data.
- **upload\_to\_hub**: True/False. If True, it will attempt to POST the data you generate to the hub. If False, the script will simply generate the output files and will not attempt uploading (could be useful for troubleshooting).
- **simulate_date**: a yyyy-mm-dd date to simulate running the report on that day. Normally the script processes logs and creates output through the previous day, based on the system time; a report run after a reporting period has ended processes everything up to the end of the month given by *year_month*. Setting this option simulates a run on a different day and is mostly for testing. See the section on maintaining state below to understand what happens when a different date is specified; the processor expects logs to be processed in an orderly, chronological way, such as by a nightly or weekly run.
- **maxmind\_geoip\_country\_path**: set the path to the GeoLite2-Country.mmdb binary geolocation database file. You may need to periodically download updates to this file from MaxMind.
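
Pulled together, a minimal *config.yaml* might look like the sketch below. The key names are the ones documented above; every value, including the regular expressions and URLs, is illustrative only:

```
# a minimal sketch; all values are illustrative
log_name_pattern: "/path/to/my/logs/counter_(yyyy-mm-dd).log"
path_types:
  investigations:
    - "^/dataset"          # hypothetical regex for investigation URLs
  requests:
    - "^/api/access"       # hypothetical regex for request URLs
robots_url: "https://example.org/lists/robots.txt"     # hypothetical list location
machines_url: "https://example.org/lists/machine.txt"  # hypothetical list location
year_month: "2018-05"
output_file: "tmp/my-report"   # extension is added based on output_format
output_format: "json"
platform: "My Data Platform"
# hub_api_token is better kept in a secrets.yaml next to this file
hub_base_url: "https://metrics.datacite.org"
upload_to_hub: False
maxmind_geoip_country_path: "maxmind_geoip/GeoLite2-Country.mmdb"
```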

## Maintaining State Between Runs

If you process your logs in an orderly way by running the script in chronological order, such as each night, it should correctly maintain state about any previous identifiers used for report submission and also about the last log file that has been processed.

The state is maintained in the state/ directory. You'll see some sqlite database files and a statefile.json file.

The script maintains a separate sqlite database file for each reporting month in here. Deleting a month's file will delete any previously processed data in the database for that month.

The statefile.json contains simple json key/value pairs. There is a section for each month that a report has been run for. Under each month there is a "last\_processed\_day" key whose value is the last day for which log processing has been completed.

There is also an id key for each month, holding the identifier returned by the server on the initial POST request. This id needs to be reused later for PUT requests that replace data (for example, when the script is run nightly to update the month's statistics).
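
A *statefile.json* matching that description might look like the sketch below; the exact value formats are assumptions, and the id shown is invented:

```
{
  "2018-05": {
    "last_processed_day": 14,
    "id": "0a1b2c3d-made-up-id"
  }
}
```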

The state allows data to be added to the database from the logs, for example each night, without reprocessing every log for the month every night.

For example, if the script is run on May 2nd for a May 2018 report, it would process the log file for May 1st and put entries in the 2018-05 database for that log file (from which stats can be calculated).

If run again on May 3rd, it would only need to process the May 2nd log into the database because May 1st has already been processed.

If you don't process every night, the script will process every log file after the last processed one for the reporting period, up to the previous day or to the end of the reporting period, whichever comes first.

If you run the script multiple times in one day, it will not reprocess log files that have already been processed (for example, when the previous day is already marked as processed); it will simply recalculate stats from what is already in the database and submit them again.

It might be important to understand how this works if there is an unusual situation such as an error while processing logs.

If you wish to completely reprocess and submit a month's data from log files, you can follow the steps below (sketched as shell commands after the list):

1. Manually send a DELETE request to the hub for an id to remove a report.
2. Remove the state data from the json file for a particular year-month.
3. Remove the appropriate month's sqlite database from the file system.
4. Reprocess the month. If the month is already over, set *year_month* to the month whose report you want.
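
A sketch of those steps as shell commands; the hub's DELETE endpoint, the auth scheme, and the sqlite file name are assumptions, not documented here:

```
# 1. remove the month's report from the hub
#    (assumes bearer-token auth and a DELETE /reports/{id} endpoint)
curl -X DELETE -H "Authorization: Bearer $HUB_API_TOKEN" \
  "https://metrics.datacite.org/reports/<report-id>"
# 2. edit state/statefile.json by hand and remove the object for the year-month
# 3. remove that month's sqlite database (the exact filename may differ)
rm state/2018-05.sqlite3
# 4. reprocess the month
YEAR_MONTH="2018-05" ./main.py
```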

Understanding how state works also matters if you move the script to a different system, so that you carry the state files over as needed.


## Override selected options in environment variables when running the script
You will want to set the options in the *config.yaml* file that you use, but some options may change every time you run the script.

Most options listed in the previous section can be overridden for a single execution of the program by setting them as environment variables (in all UPPERCASE letters). The options most likely to be overridden when generating reports are:

- **YEAR_MONTH** -- use when you want to change which month you're generating the report for.
- **SIMULATE_DATE** -- use when you want to run a report as if on a different day than the one your computer's clock indicates.

An example of overriding:

```YEAR_MONTH="2018-05" LOG_NAME_PATTERN="/path/to/my/logs/counter_(yyyy-mm-dd).log" ./main.py```

## Example run

I'm sure this will change.

```
$ LOG_GLOB="sample_logs/counter_2018-03-*.log" START_DATE="2018-03-01" END_DATE="2018-03-31" ./main.py
Running report for 2018-03-01T00:00:00 to 2018-04-01T00:00:00
processing sample_logs/counter_2018-03-13.log
processing sample_logs/counter_2018-03-14.log
Calculating stats for doi:10.6071/Z7WC73
Calculating stats for doi:10.5060/D8H59D
Calculating stats for doi:10.7280/D1MW2M
...
Calculating stats for doi:10.6078/D11S3N
Writing JSON report to tmp/test.json
submitted
```

## Submitting to the hub

To submit to the hub, set *upload\_to\_hub* to True in either the configuration or an environment variable. You must also set *hub\_api\_token* and *hub\_base\_url* in the config.yaml or secrets.yaml. The script will then send reports to the hub for you.

If there are errors or problems submitting to the hub, check the tmp/datacite_response_body.txt file. The first line of this file contains the HTTP response code from the server, the second line contains the response headers, and the rest of the file contains the response body received from the server.
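
Given that layout, you can pull the pieces apart with standard tools, for example:

```
head -n 1 tmp/datacite_response_body.txt   # HTTP response code
sed -n 2p tmp/datacite_response_body.txt   # response headers
tail -n +3 tmp/datacite_response_body.txt  # response body
```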

Some possible submission problems:
- Submitting to the server with a POST request for a month that has already had data submitted and been assigned an ID (where a PUT using that ID would be more appropriate).
- The report contains missing or invalid data (or data for features not yet implemented in the hub, such as country counts).
- The hub server may be down or not functioning properly.
Please use <https://github.com/gdcc/counter-processor> instead.
36 changes: 36 additions & 0 deletions user-agents/lists/machine.txt
@@ -0,0 +1,36 @@
# this file is generated from a master file, please modify it and regenerate it with the generate_lists.rb script
^ruby$
AddThis
aria2\/\d
CakePHP
ColdFusion
curl\/
^\%?default\%?$
Dispatch\/\d
EBSCO\sEJS\sContent\sServer
Fetch(\s|\+)API(\s|\+)Request
geturl
gvfs\/
HttpComponents\/1.1
http.?client
Indy Library
^java\/\d{1,2}.\d
libcurl
libhttp
libwww
lwp
Microsoft(\s|\+)URL(\s|\+)Control
Microsoft Office Existence Discovery
ng\/2\.
no_user_agent
pear.php.net
PHP\/
PycURL
python
rss
^undefined$
^unknown$
URL2File
urllib
Wget
wordpress
202 changes: 202 additions & 0 deletions user-agents/lists/robots.txt
@@ -0,0 +1,202 @@
# this file is generated from a master file, please modify it and regenerate it with the generate_lists.rb script
bot
spider
crawl
[^a]fish
^voyager\/
ADmantX
alexa
Alexandria(\s|\+)prototype(\s|\+)project
AllenTrack
almaden
appie
API[\+\s]scraper
Arachmo
architext
ArchiveTeam
arks
asterias
atomz
BDFetch
baidu
biglotron
BingPreview
binlar
Blackboard[\+\s]Safeassign
blaiz\-bee
bloglines
blogpulse
boitho\.com\-dc
bookmark\-manager
Brutus\/AET
BUbiNG
bwh3_user_agent
celestial
cfnetwork
checklink
checkprivacy
China\sLocal\sBrowse\s2\.6
cloakDetect
coccoc\/1\.0
collection@infegy.com
com\.plumanalytics
combine
contentmatch
ContentSmartz
convera
core
CoverScout
cursor
custo
DataCha0s\/2\.0
daumoa
DeuSu\/
Docoloc
docomo
DSurf
DTS Agent
easydl
EmailSiphon
EmailWolf
Embedly
EThOS\+\(British\+Library\)
facebookexternalhit\/
feedburner
FeedFetcher
feedreader
ferret
findlinks
Fulltext
Funnelback
G-i-g-a-b-o-t
Goldfire(\s|\+)Server
google
Grammarly
grub
gulliver
harvest
heritrix
holmes
htdig
htmlparser
HTTPFetcher
httrack
ia_archiver
ichiro
iktomi
ilse
^integrity\/\d
internetseer
intute
iSiloX
iskanie
jeeves
jobo
kyluka
larbin
lilina
link.?check
LinkLint-checkonly
^LinkParser\/
^LinkSaver\/
linkscan
LinkTiger
linkwalker
lipperhey
livejournal\.com
LOCKSS
ltx71
lycos[\_\+]
mail.ru
mediapartners\-google
megite
MetaURI[\+\s]API\/\d\.\d
mimas
mnogosearch
moget
motor
MuscatFerre
myweb
nagios
^NetAnts\/\d
netcraft
netluchs
Ning
nomad
nutch
^oaDOI$
ocelli
Offline(\s|\+)Navigator
onetszukaj
OurBrowser
panscient
parsijoo
EasyBib[\+\s]AutoCite[\+\s]
perman
pioneer
playmusic\.com
playstarmusic\.com
^Postgenomic(\s|\+)v2
powermarks
proximic
Qwantify
Readpaper
redalert
Riddler
robozilla
scan4mail
scientificcommons
scirus
scooter
Scrapy\/\d
^scrutiny\/\d
SearchBloxIntra
shoutcast
SkypeUriPreview
slurp
sogou
speedy
Strider
summify
sunrise
Sysomos
T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E
tailrank
Teleport(\s|\+)Pro
Teoma
titan
^Traackr\.com$
Trove
twiceler
ucsd
ultraseek
urlaliasbuilder
validator
virus.detector
voila
^voltron$
voyager\/
w3af.org
Wanadoo
Web(\s|\+)Downloader
WebCloner
webcollage
WebCopier
Webinator
weblayers
Webmetrics
webmirror
webmon
webreaper
WebStripper
WebZIP
worm
www.gnip.com
WWW\-Mechanize
xenu
y!j
yacy
yahoo
yandex
zeus
zyborg