Merge pull request #194 from jpappel/counts_scripts
Scripts to Get, Correct, and Set Docket, Document, and Comment values
Showing 8 changed files with 741 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
# Script Documentation | ||
|
||
## Summary | ||
|
||
Some tasks are small enough that the project architecture should not change, but large enough that they should not be performed by hand. | ||
Files in the `scripts` directory exist to fill this space. | ||
|
||
Currently, the following scripts are provided. | ||
|
||
* `get_counts.py` | ||
  * get docket, document, and comment counts from regulations.gov, a mirrulations dashboard, or a mirrulations Redis instance as JSON | ||
  * when using regulations.gov, a timestamp can be given so that all dockets, documents, and comments from before the timestamp count as if they were downloaded | ||
* `correct_counts.py` | ||
  * correct possible errors within a counts JSON file generated by `get_counts.py` | ||
* `set_counts.py` | ||
  * set values in a mirrulations Redis instance using JSON generated by `get_counts.py` | ||
* `get_correct_set.sh` | ||
  * run `get_counts.py`, `correct_counts.py`, and `set_counts.py`, logging relevant information | ||
|
||
All of the scripts above share a common JSON format: | ||
<details> | ||
<summary><code>get_counts.py</code> common format</summary> | ||
|
||
```json | ||
{ | ||
    "creation_timestamp": "2024-10-16 15:00:00", | ||
    "dockets": { | ||
        "downloaded": 253807, | ||
        "jobs": 0, | ||
        "total": 253807, | ||
        "last_timestamp": "2024-10-13 04:04:18" | ||
    }, | ||
    "documents": { | ||
        "downloaded": 1843774, | ||
        "jobs": 0, | ||
        "total": 1843774, | ||
        "last_timestamp": "2024-10-13 04:04:18" | ||
    }, | ||
    "comments": { | ||
        "downloaded": 22240501, | ||
        "jobs": 10, | ||
        "total": 22240511, | ||
        "last_timestamp": "2024-10-13 04:04:18" | ||
    } | ||
} | ||
``` | ||
|
||
</details> | ||
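
As a rough illustration of consuming this format, the sketch below parses a counts document with only the standard library and converts timestamp strings to `datetime` objects. The timestamp layout `%Y-%m-%d %H:%M:%S` matches the sample above. The helper names `parse_counts` and `_parse_timestamps` and the trimmed sample are assumptions made for illustration and are not part of the scripts.

```python
import datetime as dt
import json

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout used in the sample above

# Trimmed, illustrative sample containing only the "comments" entry.
SAMPLE = """
{
    "creation_timestamp": "2024-10-16 15:00:00",
    "comments": {
        "downloaded": 22240501,
        "jobs": 10,
        "total": 22240511,
        "last_timestamp": "2024-10-13 04:04:18"
    }
}
"""


def _parse_timestamps(obj: dict) -> dict:
    """Convert any string value in the timestamp layout to a datetime."""
    for key, value in obj.items():
        if isinstance(value, str):
            try:
                obj[key] = dt.datetime.strptime(value, TIMESTAMP_FORMAT)
            except ValueError:
                pass  # not a timestamp, leave the string untouched
    return obj


def parse_counts(text: str) -> dict:
    """Hypothetical helper: load a counts document from a JSON string."""
    return json.loads(text, object_hook=_parse_timestamps)


counts = parse_counts(SAMPLE)
comments = counts["comments"]
print(comments["total"] - comments["downloaded"])  # 10
print(counts["creation_timestamp"].isoformat())    # 2024-10-16T15:00:00
```

In this sample the gap between `total` and `downloaded` for comments equals the number of queued `jobs`.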
|
||
## Description | ||
|
||
### `get_correct_set.sh` | ||
|
||
`get_correct_set.sh` gets counts using `get_counts.py` from Redis, corrects them using `correct_counts.py`, and on success sets them using `set_counts.py`. | ||
It attempts to log to `/var/log/mirrulations_counts.log`. | ||
By default, it expects a virtual environment with all required dependencies in `/home/cs334/mirrulations/scripts/.venv`. | ||
|
||
### `get_counts.py` | ||
|
||
`get_counts.py` gets counts from one of three sources: regulations.gov, a Mirrulations Redis instance, or a Mirrulations dashboard via HTTP. | ||
|
||
When reading from regulations.gov, a UTC timestamp can be specified to mock having downloaded all dockets, documents, and comments from before that timestamp. | ||
|
||
When reading from a dashboard, a UTC timestamp must be specified, since the dashboard API does not provide one. | ||
|
||
### `correct_counts.py` | ||
|
||
`correct_counts.py` corrects counts from `get_counts.py` using one of two strategies: set the downloaded count for each type to the minimum of `downloaded` and `total` for that type, or set it to the minimum of `total - jobs` and `downloaded`. | ||
By default, any queued jobs cause the script to exit without producing output; this behavior can be changed with the `--ignore-queue` flag. | ||
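
The arithmetic behind the two strategies can be sketched for a single entity type as follows. The function names mirror the `--strategy` choices `cap_with_total` and `diff_total_with_jobs`; the input numbers are made up to show an over-counted `downloaded` value and do not come from a real instance.

```python
def cap_with_total(downloaded: int, total: int) -> int:
    """Cap the downloaded count at the total."""
    return min(total, downloaded)


def diff_total_with_jobs(downloaded: int, total: int, jobs: int) -> int:
    """Cap the downloaded count at the total minus the queued jobs."""
    return min(total - jobs, downloaded)


# Hypothetical over-count: 105 reported as downloaded, but only 100 exist
# and 3 of those are still waiting in the job queue.
print(cap_with_total(downloaded=105, total=100))                # 100
print(diff_total_with_jobs(downloaded=105, total=100, jobs=3))  # 97

# A count that is already consistent is left unchanged by either strategy.
print(cap_with_total(downloaded=90, total=100))                 # 90
```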
|
||
### `set_counts.py` | ||
|
||
`set_counts.py` sets values from `get_counts.py` in a Redis instance. | ||
By default the script will prompt for user input before changing any values. | ||
This behavior can be changed using the `--yes` flag, which should be used **WITH GREAT CARE, ESPECIALLY IN PRODUCTION!!!**. | ||
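
A confirmation gate of this kind might look like the minimal sketch below. It is not the implementation in `set_counts.py`, which is not shown here; the function `confirm`, its parameters, and the prompt text are hypothetical. The `ask` parameter defaults to the built-in `input` and is injectable so the example runs without a terminal.

```python
from typing import Callable


def confirm(
    prompt: str,
    assume_yes: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the user agrees or assume_yes is set."""
    if assume_yes:
        # Comparable to passing a `--yes` style flag: skip the prompt entirely.
        return True
    answer = ask(f"{prompt} [y/N] ").strip().lower()
    return answer in ("y", "yes")


# Simulated answers stand in for interactive input in this sketch.
print(confirm("Overwrite counts in Redis?", assume_yes=True))    # True
print(confirm("Overwrite counts in Redis?", ask=lambda _: "n"))  # False
print(confirm("Overwrite counts in Redis?", ask=lambda _: "Y"))  # True
```

Defaulting to "no" on anything other than an explicit yes keeps an accidental keypress from overwriting values.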
|
||
## Setup | ||
|
||
First, create a virtual environment to install the dependencies into. | ||
|
||
```bash | ||
cd scripts | ||
python3 -m venv .venv | ||
source .venv/bin/activate | ||
pip install -r requirements.txt | ||
``` | ||
|
||
Make sure you are in the correct environment when running the scripts. | ||
|
||
## Examples | ||
|
||
### Cap Docket, Document, and Comment downloaded counts by the counts from Regulations.gov | ||
|
||
```bash | ||
./get_counts.py redis | ./correct_counts.py | ./set_counts.py -y | ||
``` | ||
|
||
### Set Docket, Document, Comment downloaded counts while jobs are in the queue | ||
|
||
```bash | ||
./get_counts.py dashboard | ./correct_counts.py --ignore-queue --strategy diff_total_with_jobs | ./set_counts.py -y | ||
``` | ||
|
||
### Download Counts for a Certain Time from Regulations.gov | ||
|
||
```bash | ||
./get_counts.py --api-key $API_KEY -o aug_6_2022.json -t 2024-08-06T06:20:50Z | ||
|
||
export API_KEY=<REGULATIONS.GOV_API_KEY> | ||
./get_counts.py regulations -o oct_01_2024.json --last-timestamp 2024-10-01T15:30:10Z | ||
./set_counts.py -i oct_01_2024.json | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
#!/usr/bin/env python3 | ||
|
||
from copy import deepcopy | ||
import json | ||
import pathlib | ||
import sys | ||
from json import JSONDecodeError | ||
from counts import Counts, CountsEncoder, CountsDecoder | ||
|
||
import argparse | ||
|
||
|
||
class JobsInQueueException(Exception): | ||
    pass | ||
|
||
|
||
def strategy_cap(received: Counts, ignore_queue: bool) -> Counts: | ||
    filtered = deepcopy(received) | ||
    if filtered["queue_size"] != 0 and not ignore_queue: | ||
        raise JobsInQueueException(f'Found jobs in job queue: {filtered["queue_size"]}') | ||
    for entity_type in ("dockets", "documents", "comments"): | ||
        total_ = filtered[entity_type]["total"] | ||
        downloaded = filtered[entity_type]["downloaded"] | ||
        filtered[entity_type]["downloaded"] = min(total_, downloaded) | ||
|
||
    return filtered | ||
|
||
|
||
def strategy_diff(received: Counts, ignore_queue: bool) -> Counts: | ||
    filtered = deepcopy(received) | ||
    for entity_type in ("dockets", "documents", "comments"): | ||
        total_ = filtered[entity_type]["total"] | ||
        downloaded = filtered[entity_type]["downloaded"] | ||
        jobs = filtered[entity_type]["jobs"] | ||
        if jobs > 0 and not ignore_queue: | ||
            raise JobsInQueueException( | ||
                f'{entity_type} has {filtered[entity_type]["jobs"]} in queue' | ||
            ) | ||
        filtered[entity_type]["downloaded"] = min(total_ - jobs, downloaded) | ||
|
||
    return filtered | ||
|
||
|
||
if __name__ == "__main__": | ||
    parser = argparse.ArgumentParser( | ||
        "Correct Counts", | ||
        description="Correct counts in json format by either capping downloaded with `total` or capping with `total - jobs`", | ||
    ) | ||
    parser.add_argument( | ||
        "-o", | ||
        "--output", | ||
        metavar="OUTPUT_PATH", | ||
        type=str, | ||
        default="-", | ||
        help="file to output to, use '-' for stdout (default '%(default)s')", | ||
    ) | ||
    parser.add_argument( | ||
        "-i", | ||
        "--input", | ||
        metavar="INPUT_PATH", | ||
        type=str, | ||
        default="-", | ||
        help="file to read from, use '-' for stdin (default '%(default)s')", | ||
    ) | ||
    parser.add_argument( | ||
        "-s", | ||
        "--strategy", | ||
        type=str, | ||
        default="cap_with_total", | ||
        choices=("cap_with_total", "diff_total_with_jobs"), | ||
        help="the correction strategy to use (default '%(default)s')", | ||
    ) | ||
    parser.add_argument( | ||
        "--ignore-queue", | ||
        action="store_true", | ||
        help="continue even if there are queued jobs", | ||
    ) | ||
|
||
    args = parser.parse_args() | ||
|
||
    try: | ||
        if args.input == "-": | ||
            input_counts: Counts = json.load(sys.stdin, cls=CountsDecoder) | ||
        else: | ||
            try: | ||
                with open(pathlib.Path(args.input), "r") as fp: | ||
                    input_counts = json.load(fp, cls=CountsDecoder) | ||
            except FileNotFoundError: | ||
                print(f"Missing file {args.input}, exiting", file=sys.stderr) | ||
                sys.exit(2) | ||
    except JSONDecodeError: | ||
        print(f"Malformed input file {args.input}, exiting", file=sys.stderr) | ||
        sys.exit(2) | ||
|
||
    try: | ||
        if args.strategy == "cap_with_total": | ||
            modified_counts = strategy_cap(input_counts, args.ignore_queue) | ||
        elif args.strategy == "diff_total_with_jobs": | ||
            modified_counts = strategy_diff(input_counts, args.ignore_queue) | ||
        else: | ||
            print(f"Unrecognized strategy {args.strategy}, exiting", file=sys.stderr) | ||
            sys.exit(1) | ||
    except JobsInQueueException as e: | ||
        print( | ||
            f"Found jobs in queue: {e}\nUse `--ignore-queue` to continue", | ||
            file=sys.stderr, | ||
        ) | ||
        sys.exit(2) | ||
|
||
    if args.output == "-": | ||
        json.dump(modified_counts, sys.stdout, cls=CountsEncoder) | ||
    else: | ||
        with open(pathlib.Path(args.output), "w") as fp: | ||
            json.dump(modified_counts, fp, cls=CountsEncoder) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
import json | ||
import datetime as dt | ||
from typing import Any, TypedDict | ||
|
||
|
||
class EntityCount(TypedDict): | ||
    downloaded: int | ||
    jobs: int | ||
    total: int | ||
    last_timestamp: dt.datetime | ||
|
||
|
||
class Counts(TypedDict): | ||
    creation_timestamp: dt.datetime | ||
    queue_size: int | ||
    dockets: EntityCount | ||
    documents: EntityCount | ||
    comments: EntityCount | ||
|
||
|
||
class CountsEncoder(json.JSONEncoder): | ||
    def default(self, o: Any) -> Any: | ||
        if isinstance(o, dt.datetime): | ||
            return o.strftime("%Y-%m-%d %H:%M:%S") | ||
        return super().default(o) | ||
|
||
|
||
class CountsDecoder(json.JSONDecoder): | ||
    def __init__(self, *args, **kwargs): | ||
        super().__init__(object_hook=self.object_hook, *args, **kwargs) | ||
|
||
    def object_hook(self, obj: Any) -> Any: | ||
        for key, value in obj.items(): | ||
            try: | ||
                obj[key] = dt.datetime.strptime(value, "%Y-%m-%d %H:%M:%S") | ||
            except (ValueError, TypeError): | ||
                pass | ||
        return obj |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
#!/bin/bash | ||
|
||
WORK_DIR="/home/cs334/mirrulations/scripts/" | ||
LOG_FILE=/var/log/mirrulations_counts.log | ||
START_TIME=$(date -u -Iseconds) | ||
echo "$START_TIME: Running" > "$LOG_FILE" | ||
cd "$WORK_DIR" || exit 1 | ||
|
||
PYTHON=".venv/bin/python3" | ||
|
||
$PYTHON get_counts.py redis -o "/tmp/mirrulations_$START_TIME.json" 2>> "$LOG_FILE" && | ||
$PYTHON correct_counts.py -i "/tmp/mirrulations_$START_TIME.json" -o "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> "$LOG_FILE" && | ||
$PYTHON set_counts.py -y -i "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> "$LOG_FILE" | ||
|
||
rm "/tmp/mirrulations_${START_TIME}_corrected.json" "/tmp/mirrulations_$START_TIME.json" |