Scripts to Get, Correct, and Set Docket, Document, and Comment values #194

Merged
merged 14 commits into from
Oct 29, 2024
112 changes: 112 additions & 0 deletions docs/scripts.md
@@ -0,0 +1,112 @@
# Script Documentation

## Summary

Some tasks are small enough that the project architecture should not change, but large enough that they should not be performed by hand.
Files in the `scripts` directory exist to fill this space.

Currently, the following scripts are provided.

* `get_counts.py`
  * get docket, document, and comment counts from regulations.gov, a mirrulations dashboard, or a mirrulations Redis instance as JSON
  * when using regulations.gov, a timestamp can be given so that all dockets, documents, and comments from before the timestamp count as if they were downloaded
* `correct_counts.py`
  * correct possible errors within a counts JSON file generated by `get_counts.py`
* `set_counts.py`
  * set values in a mirrulations Redis instance using JSON generated by `get_counts.py`
* `get_correct_set.sh`
  * run `get_counts.py`, `correct_counts.py`, and `set_counts.py`, logging relevant information

All of the scripts above share a common JSON format:
<details>
<summary><code>get_counts.py</code> common format</summary>

```json
{
  "creation_timestamp": "2024-10-16 15:00:00",
  "dockets": {
    "downloaded": 253807,
    "jobs": 0,
    "total": 253807,
    "last_timestamp": "2024-10-13 04:04:18"
  },
  "documents": {
    "downloaded": 1843774,
    "jobs": 0,
    "total": 1843774,
    "last_timestamp": "2024-10-13 04:04:18"
  },
  "comments": {
    "downloaded": 22240501,
    "jobs": 10,
    "total": 22240511,
    "last_timestamp": "2024-10-13 04:04:18"
  }
}
```

</details>
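
As a rough illustration of how a consumer might read this format, the sketch below parses a counts document and converts the timestamp strings to `datetime` objects. It mirrors the approach of `CountsDecoder` in `scripts/counts.py` but is self-contained; the names `parse_timestamps`, `load_counts`, `remaining`, and `SAMPLE` are hypothetical and exist only for this example.

```python
import datetime as dt
import json

# Timestamp layout used by the common format, e.g. "2024-10-16 15:00:00"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
ENTITY_TYPES = ("dockets", "documents", "comments")

# Sample document copied from the common format shown above
SAMPLE = """
{
  "creation_timestamp": "2024-10-16 15:00:00",
  "dockets": {"downloaded": 253807, "jobs": 0, "total": 253807,
              "last_timestamp": "2024-10-13 04:04:18"},
  "documents": {"downloaded": 1843774, "jobs": 0, "total": 1843774,
                "last_timestamp": "2024-10-13 04:04:18"},
  "comments": {"downloaded": 22240501, "jobs": 10, "total": 22240511,
               "last_timestamp": "2024-10-13 04:04:18"}
}
"""


def parse_timestamps(obj: dict) -> dict:
    """JSON object hook: turn values that look like timestamps into datetimes."""
    for key, value in obj.items():
        try:
            obj[key] = dt.datetime.strptime(value, TIMESTAMP_FORMAT)
        except (ValueError, TypeError):
            # Not a timestamp string (an int, a nested dict, ...): leave as-is
            pass
    return obj


def load_counts(text: str) -> dict:
    """Parse a counts document, converting timestamps along the way."""
    return json.loads(text, object_hook=parse_timestamps)


def remaining(counts: dict) -> dict:
    """Items not yet downloaded per entity type (total - downloaded)."""
    return {
        kind: counts[kind]["total"] - counts[kind]["downloaded"]
        for kind in ENTITY_TYPES
    }


counts = load_counts(SAMPLE)
print(remaining(counts))
```

In the scripts themselves, the same conversion is done by passing `cls=CountsDecoder` to `json.load`.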

## Description

### `get_correct_set.sh`

`get_correct_set.sh` gets counts using `get_counts.py` from Redis, corrects them using `correct_counts.py`, and on success sets them using `set_counts.py`.
It attempts to log to `/var/log/mirrulations_counts.log`.
By default, it expects a virtual environment with all required dependencies in `/home/cs334/mirrulations/scripts/.venv`.

### `get_counts.py`

`get_counts.py` gets counts from one of three sources: regulations.gov, a Mirrulations Redis instance, or a Mirrulations dashboard via HTTP.

When reading from regulations.gov, a UTC timestamp can be specified to mock having downloaded all dockets, documents, and comments from before that timestamp.

When reading from a dashboard, a UTC timestamp must be specified, since the dashboard API does not provide one.

### `correct_counts.py`

`correct_counts.py` corrects counts from `get_counts.py` using one of two strategies: `cap_with_total` sets the downloaded count for each type to the minimum of `downloaded` and `total` for that type, and `diff_total_with_jobs` sets it to the minimum of `total - jobs` and `downloaded`.
By default, any queued jobs cause the script to exit without producing output; this behavior can be changed with the `--ignore-queue` flag.
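
To make the difference between the two strategies concrete, here is a minimal, self-contained sketch applied to made-up numbers. It leaves out the queue check and the `queue_size` field that the real script uses, and the `example` data is invented purely for illustration.

```python
from copy import deepcopy

ENTITY_TYPES = ("dockets", "documents", "comments")


def cap_with_total(counts: dict) -> dict:
    """Cap `downloaded` at `total` for every entity type."""
    fixed = deepcopy(counts)
    for kind in ENTITY_TYPES:
        entry = fixed[kind]
        entry["downloaded"] = min(entry["total"], entry["downloaded"])
    return fixed


def diff_total_with_jobs(counts: dict) -> dict:
    """Cap `downloaded` at `total - jobs` for every entity type."""
    fixed = deepcopy(counts)
    for kind in ENTITY_TYPES:
        entry = fixed[kind]
        entry["downloaded"] = min(entry["total"] - entry["jobs"], entry["downloaded"])
    return fixed


# Invented counts: dockets are over-counted, comments still have queued jobs
example = {
    "dockets": {"downloaded": 105, "jobs": 0, "total": 100},
    "documents": {"downloaded": 50, "jobs": 0, "total": 60},
    "comments": {"downloaded": 98, "jobs": 5, "total": 100},
}

capped = cap_with_total(example)
diffed = diff_total_with_jobs(example)

for kind in ENTITY_TYPES:
    print(kind, capped[kind]["downloaded"], diffed[kind]["downloaded"])
```

With these numbers both strategies reduce the docket count from 105 to 100 and leave documents at 50. They differ only for comments, where 5 jobs are still queued: `cap_with_total` keeps 98, while `diff_total_with_jobs` lowers it to 95.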

### `set_counts.py`

`set_counts.py` sets values from `get_counts.py` in a Redis instance.
By default the script will prompt for user input before changing any values.
This behavior can be changed using the `--yes` flag, which should be used **WITH GREAT CARE, ESPECIALLY IN PRODUCTION!!!**.

## Setup

First, create a virtual environment to install the dependencies into.

```bash
cd scripts
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Make sure you are in the correct environment when running the scripts.

## Examples

### Cap Docket, Document, and Comment downloaded counts by the counts from Regulations.gov

```bash
./get_counts.py redis | ./correct_counts.py | ./set_counts.py -y
```

### Set Docket, Document, and Comment downloaded counts while jobs are in the queue

```bash
./get_counts.py dashboard | ./correct_counts.py --ignore-queue --strategy diff_total_with_jobs | ./set_counts.py -y
```

### Download Counts for a Certain Time from Regulations.gov

```bash
./get_counts.py --api-key $API_KEY -o aug_6_2022.json -t 2024-08-06T06:20:50Z

export API_KEY=<REGULATIONS.GOV_API_KEY>
./get_counts.py regulations -o oct_01_2024.json --last-timestamp 2024-10-01T15:30:10Z
./set_counts.py -i oct_01_2024.json
```
114 changes: 114 additions & 0 deletions scripts/correct_counts.py
@@ -0,0 +1,114 @@
#!/usr/bin/env python3

import argparse
import json
import pathlib
import sys
from copy import deepcopy
from json import JSONDecodeError

from counts import Counts, CountsEncoder, CountsDecoder


class JobsInQueueException(Exception):
    """Raised when queued jobs are found and --ignore-queue was not given."""

    pass


def strategy_cap(received: Counts, ignore_queue: bool) -> Counts:
    """Cap each entity's `downloaded` count at its `total`."""
    filtered = deepcopy(received)
    if filtered["queue_size"] != 0 and not ignore_queue:
        raise JobsInQueueException(f'Found jobs in job queue: {filtered["queue_size"]}')
    for entity_type in ("dockets", "documents", "comments"):
        total_ = filtered[entity_type]["total"]
        downloaded = filtered[entity_type]["downloaded"]
        filtered[entity_type]["downloaded"] = min(total_, downloaded)

    return filtered


def strategy_diff(received: Counts, ignore_queue: bool) -> Counts:
    """Cap each entity's `downloaded` count at `total - jobs`."""
    filtered = deepcopy(received)
    for entity_type in ("dockets", "documents", "comments"):
        total_ = filtered[entity_type]["total"]
        downloaded = filtered[entity_type]["downloaded"]
        jobs = filtered[entity_type]["jobs"]
        if jobs > 0 and not ignore_queue:
            raise JobsInQueueException(
                f'{entity_type} has {filtered[entity_type]["jobs"]} in queue'
            )
        filtered[entity_type]["downloaded"] = min(total_ - jobs, downloaded)

    return filtered


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        "Correct Counts",
        description="Correct counts in json format by either capping downloaded with `total` or capping with `total - jobs`",
    )
    parser.add_argument(
        "-o",
        "--output",
        metavar="OUTPUT_PATH",
        type=str,
        default="-",
        help="file to output to, use '-' for stdout (default '%(default)s')",
    )
    parser.add_argument(
        "-i",
        "--input",
        metavar="INPUT_PATH",
        type=str,
        default="-",
        help="file to read from, use '-' for stdin (default '%(default)s')",
    )
    parser.add_argument(
        "-s",
        "--strategy",
        type=str,
        default="cap_with_total",
        choices=("cap_with_total", "diff_total_with_jobs"),
        help="the correction strategy to use (default '%(default)s')",
    )
    parser.add_argument(
        "--ignore-queue",
        action="store_true",
        help="continue even if there are queued jobs",
    )

    args = parser.parse_args()

    # Read the counts from stdin or from the given file
    try:
        if args.input == "-":
            input_counts: Counts = json.load(sys.stdin, cls=CountsDecoder)
        else:
            try:
                with open(pathlib.Path(args.input), "r") as fp:
                    input_counts = json.load(fp, cls=CountsDecoder)
            except FileNotFoundError:
                print(f"Missing file {args.input}, exiting", file=sys.stderr)
                sys.exit(2)
    except JSONDecodeError:
        print(f"Malformed input file {args.input}, exiting", file=sys.stderr)
        sys.exit(2)

    # Apply the selected correction strategy
    try:
        if args.strategy == "cap_with_total":
            modified_counts = strategy_cap(input_counts, args.ignore_queue)
        elif args.strategy == "diff_total_with_jobs":
            modified_counts = strategy_diff(input_counts, args.ignore_queue)
        else:
            print(f"Unrecognized strategy {args.strategy}, exiting", file=sys.stderr)
            sys.exit(1)
    except JobsInQueueException as e:
        print(
            f"Found jobs in queue: {e}\nUse `--ignore-queue` to continue",
            file=sys.stderr,
        )
        sys.exit(2)

    # Write the corrected counts to stdout or to the given file
    if args.output == "-":
        json.dump(modified_counts, sys.stdout, cls=CountsEncoder)
    else:
        with open(pathlib.Path(args.output), "w") as fp:
            json.dump(modified_counts, fp, cls=CountsEncoder)
38 changes: 38 additions & 0 deletions scripts/counts.py
@@ -0,0 +1,38 @@
import json
import datetime as dt
from typing import Any, TypedDict


class EntityCount(TypedDict):
    downloaded: int
    jobs: int
    total: int
    last_timestamp: dt.datetime


class Counts(TypedDict):
    creation_timestamp: dt.datetime
    queue_size: int
    dockets: EntityCount
    documents: EntityCount
    comments: EntityCount


class CountsEncoder(json.JSONEncoder):
    def default(self, o: Any) -> Any:
        # Serialize datetimes as "YYYY-MM-DD HH:MM:SS"
        if isinstance(o, dt.datetime):
            return o.strftime("%Y-%m-%d %H:%M:%S")
        return super().default(o)


class CountsDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj: Any) -> Any:
        # Convert any value that parses as a timestamp; leave the rest untouched
        for key, value in obj.items():
            try:
                obj[key] = dt.datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            except (ValueError, TypeError):
                pass
        return obj
15 changes: 15 additions & 0 deletions scripts/get_correct_set.sh
@@ -0,0 +1,15 @@
#!/bin/bash

WORK_DIR="/home/cs334/mirrulations/scripts/"
LOG_FILE=/var/log/mirrulations_counts.log
START_TIME=$(date -u -Iseconds)
echo "$START_TIME: Running" > "$LOG_FILE"
# Stop early if the working directory is missing
cd "$WORK_DIR" || exit 1

PYTHON=".venv/bin/python3"

# Get counts from Redis, correct them, and set them; each step runs only if the previous one succeeded
$PYTHON get_counts.py redis -o "/tmp/mirrulations_$START_TIME.json" 2>> "$LOG_FILE" &&
    $PYTHON correct_counts.py -i "/tmp/mirrulations_$START_TIME.json" -o "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> "$LOG_FILE" &&
    $PYTHON set_counts.py -y -i "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> "$LOG_FILE"

# Remove temporary files; -f avoids an error when an earlier step failed before creating them
rm -f "/tmp/mirrulations_${START_TIME}_corrected.json" "/tmp/mirrulations_$START_TIME.json"