Merge pull request #194 from jpappel/counts_scripts
Scripts to Get, Correct, and Set Docket, Document, and Comment values
OnToNothing authored Oct 29, 2024
2 parents 043cacb + a4d141e commit 06a22fe
Showing 8 changed files with 741 additions and 0 deletions.
112 changes: 112 additions & 0 deletions docs/scripts.md
@@ -0,0 +1,112 @@
# Script Documentation

## Summary

Some tasks are small enough that the project architecture should not change to accommodate them, but large enough that they should not be performed by hand.
Files in the `scripts` directory exist to fill this space.

Currently, the following scripts are provided.

* `get_counts.py`
* get docket, document, and comment counts from regulations.gov, a mirrulations dashboard, or a mirrulations Redis instance as json
  * when using regulations.gov, a timestamp can be given so that all dockets, documents, and comments from before the timestamp are counted as if they had already been downloaded
* `correct_counts.py`
* correct possible errors within a counts json file generated by `get_counts.py`
* `set_counts.py`
* set values in a mirrulations Redis instance using json generated by `get_counts.py`
* `get_correct_set.sh`
* run `get_counts.py`, `correct_counts.py`, and `set_counts.py`, logging relevant information

All of the scripts above share a common JSON format:
<details>
<summary><code>get_counts.py</code> common format</summary>

```json
{
  "creation_timestamp": "2024-10-16 15:00:00",
  "queue_size": 10,
  "dockets": {
    "downloaded": 253807,
    "jobs": 0,
    "total": 253807,
    "last_timestamp": "2024-10-13 04:04:18"
  },
  "documents": {
    "downloaded": 1843774,
    "jobs": 0,
    "total": 1843774,
    "last_timestamp": "2024-10-13 04:04:18"
  },
  "comments": {
    "downloaded": 22240501,
    "jobs": 10,
    "total": 22240511,
    "last_timestamp": "2024-10-13 04:04:18"
  }
}
```

</details>
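
For reference, a minimal sketch of reading this format from Python, assuming a file written with `get_counts.py ... -o counts.json` and run from the `scripts` directory so that the `CountsDecoder` helper from `counts.py` (added in this commit) is importable; the decoder turns the timestamp strings back into `datetime` objects:

```python
import json

from counts import CountsDecoder  # helper from scripts/counts.py

# "counts.json" is a hypothetical file produced by `./get_counts.py ... -o counts.json`
with open("counts.json") as fp:
    counts = json.load(fp, cls=CountsDecoder)

# Integer fields stay ints; timestamp strings are parsed into datetime objects
remaining = counts["comments"]["total"] - counts["comments"]["downloaded"]
print(f"comments remaining: {remaining}")
print(f"snapshot taken at: {counts['creation_timestamp']}")
```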

## Description

### `get_correct_set.sh`

`get_correct_set.sh` gets counts from Redis using `get_counts.py`, corrects them using `correct_counts.py`, and, on success, sets them using `set_counts.py`.
It attempts to log to `/var/log/mirrulations_counts.log`.
By default, it expects a virtual environment with all required dependencies in `/home/cs334/mirrulations/scripts/.venv`.

### `get_counts.py`

`get_counts.py` gets counts from one of three sources: regulations.gov, a Mirrulations Redis instance, or a Mirrulations dashboard via HTTP.

When reading from regulations.gov, a UTC timestamp can be specified to mock having downloaded all dockets, documents, and comments from before that timestamp.

When reading from a dashboard, a UTC timestamp must be specified, since the dashboard API does not provide one.

### `correct_counts.py`

`correct_counts.py` corrects counts from `get_counts.py` using one of two strategies: `cap_with_total` (the default), which sets the downloaded count for each type to the minimum of `downloaded` and `total` for that type, or `diff_total_with_jobs`, which sets it to the minimum of `total - jobs` and `downloaded`.
By default, any queued jobs will cause the script to exit and output nothing; this behavior can be changed with the `--ignore-queue` flag.
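
As an illustration with made-up numbers (not taken from a real run), the two strategies disagree when `downloaded` exceeds `total` and jobs are queued:

```python
# Illustrative values only; real counts come from get_counts.py
downloaded, jobs, total = 22_240_515, 10, 22_240_511

cap_with_total = min(total, downloaded)               # 22240511
diff_total_with_jobs = min(total - jobs, downloaded)  # 22240501
```

Under `diff_total_with_jobs`, queued jobs are treated as not yet downloaded, so the corrected value is lower.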

### `set_counts.py`

`set_counts.py` sets values from `get_counts.py` in a Redis instance.
By default the script will prompt for user input before changing any values.
This behavior can be changed using the `--yes` flag, which should be used **WITH GREAT CARE, ESPECIALLY IN PRODUCTION!!!**.

## Setup

First, create a virtual environment and install the required dependencies into it.

```bash
cd scripts
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Make sure you are in the correct virtual environment when running the scripts.

## Examples

### Cap Docket, Document, and Comment downloaded counts by the counts from Regulations.gov

```bash
./get_counts.py redis | ./correct_counts.py | ./set_counts.py -y
```

### Set Docket, Document, and Comment downloaded counts while jobs are in the queue

```bash
./get_counts.py dashboard | ./correct_counts.py --ignore-queue --strategy diff_total_with_jobs | ./set_counts.py -y
```

### Download Counts for a Certain Time from Regulations.gov

```bash
export API_KEY=<REGULATIONS.GOV_API_KEY>

./get_counts.py regulations --api-key $API_KEY -o aug_6_2022.json -t 2024-08-06T06:20:50Z

./get_counts.py regulations -o oct_01_2024.json --last-timestamp 2024-10-01T15:30:10Z
./set_counts.py -i oct_01_2024.json
```
114 changes: 114 additions & 0 deletions scripts/correct_counts.py
@@ -0,0 +1,114 @@
#!/usr/bin/env python3

from copy import deepcopy
import json
import pathlib
import sys
from json import JSONDecodeError
from counts import Counts, CountsEncoder, CountsDecoder

import argparse


class JobsInQueueException(Exception):
    pass


def strategy_cap(received: Counts, ignore_queue: bool) -> Counts:
    filtered = deepcopy(received)
    if filtered["queue_size"] != 0 and not ignore_queue:
        raise JobsInQueueException(f'Found jobs in job queue: {filtered["queue_size"]}')
    for entity_type in ("dockets", "documents", "comments"):
        total_ = filtered[entity_type]["total"]
        downloaded = filtered[entity_type]["downloaded"]
        filtered[entity_type]["downloaded"] = min(total_, downloaded)

    return filtered


def strategy_diff(received: Counts, ignore_queue: bool) -> Counts:
    filtered = deepcopy(received)
    for entity_type in ("dockets", "documents", "comments"):
        total_ = filtered[entity_type]["total"]
        downloaded = filtered[entity_type]["downloaded"]
        jobs = filtered[entity_type]["jobs"]
        if jobs > 0 and not ignore_queue:
            raise JobsInQueueException(
                f'{entity_type} has {filtered[entity_type]["jobs"]} in queue'
            )
        filtered[entity_type]["downloaded"] = min(total_ - jobs, downloaded)

    return filtered


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        "Correct Counts",
        description="Correct counts in json format by either capping downloaded with `total` or capping with `total - jobs`",
    )
    parser.add_argument(
        "-o",
        "--output",
        metavar="OUTPUT_PATH",
        type=str,
        default="-",
        help="file to output to, use '-' for stdout (default '%(default)s')",
    )
    parser.add_argument(
        "-i",
        "--input",
        metavar="INPUT_PATH",
        type=str,
        default="-",
        help="file to read from, use '-' for stdin (default '%(default)s')",
    )
    parser.add_argument(
        "-s",
        "--strategy",
        type=str,
        default="cap_with_total",
        choices=("cap_with_total", "diff_total_with_jobs"),
        help="the correction strategy to use (default '%(default)s')",
    )
    parser.add_argument(
        "--ignore-queue",
        action="store_true",
        help="continue even if there are queued jobs",
    )

    args = parser.parse_args()

    try:
        if args.input == "-":
            input_counts: Counts = json.load(sys.stdin, cls=CountsDecoder)
        else:
            try:
                with open(pathlib.Path(args.input), "r") as fp:
                    input_counts = json.load(fp, cls=CountsDecoder)
            except FileNotFoundError:
                print(f"Missing file {args.input}, exiting", file=sys.stderr)
                sys.exit(2)
    except JSONDecodeError:
        print(f"Malformed input file {args.input}, exiting", file=sys.stderr)
        sys.exit(2)

    try:
        if args.strategy == "cap_with_total":
            modified_counts = strategy_cap(input_counts, args.ignore_queue)
        elif args.strategy == "diff_total_with_jobs":
            modified_counts = strategy_diff(input_counts, args.ignore_queue)
        else:
            print(f"Unrecognized strategy {args.strategy}, exiting", file=sys.stderr)
            sys.exit(1)
    except JobsInQueueException as e:
        print(
            f"Found jobs in queue: {e}\nUse `--ignore-queue` to continue",
            file=sys.stderr,
        )
        sys.exit(2)

    if args.output == "-":
        json.dump(modified_counts, sys.stdout, cls=CountsEncoder)
    else:
        with open(pathlib.Path(args.output), "w") as fp:
            json.dump(modified_counts, fp, cls=CountsEncoder)
38 changes: 38 additions & 0 deletions scripts/counts.py
@@ -0,0 +1,38 @@
import json
import datetime as dt
from typing import Any, TypedDict


class EntityCount(TypedDict):
    downloaded: int
    jobs: int
    total: int
    last_timestamp: dt.datetime


class Counts(TypedDict):
    creation_timestamp: dt.datetime
    queue_size: int
    dockets: EntityCount
    documents: EntityCount
    comments: EntityCount


class CountsEncoder(json.JSONEncoder):
    def default(self, o: Any) -> Any:
        if isinstance(o, dt.datetime):
            return o.strftime("%Y-%m-%d %H:%M:%S")
        return super().default(o)


class CountsDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj: Any) -> Any:
        for key, value in obj.items():
            try:
                obj[key] = dt.datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            except (ValueError, TypeError):
                pass
        return obj
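
A small round-trip sketch of how these helpers fit together; the values below are illustrative only, not real counts:

```python
import datetime as dt
import json

from counts import Counts, CountsDecoder, CountsEncoder, EntityCount

# Illustrative Counts value; real data comes from get_counts.py
entity: EntityCount = {
    "downloaded": 1,
    "jobs": 0,
    "total": 1,
    "last_timestamp": dt.datetime(2024, 10, 13, 4, 4, 18),
}
counts: Counts = {
    "creation_timestamp": dt.datetime(2024, 10, 16, 15, 0, 0),
    "queue_size": 0,
    "dockets": entity,
    "documents": entity,
    "comments": entity,
}

encoded = json.dumps(counts, cls=CountsEncoder)   # datetimes become "YYYY-MM-DD HH:MM:SS" strings
decoded = json.loads(encoded, cls=CountsDecoder)  # matching strings are parsed back into datetimes
assert decoded["dockets"]["last_timestamp"] == counts["dockets"]["last_timestamp"]
```
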
15 changes: 15 additions & 0 deletions scripts/get_correct_set.sh
@@ -0,0 +1,15 @@
#!/bin/bash

WORK_DIR="/home/cs334/mirrulations/scripts/"
LOG_FILE=/var/log/mirrulations_counts.log
START_TIME=$(date -u -Iseconds)
echo "$START_TIME: RUnning" > $LOG_FILE
cd "$WORK_DIR" || exit 1

PYTHON=".venv/bin/python3"

$PYTHON get_counts.py redis -o "/tmp/mirrulations_$START_TIME.json" 2>> $LOG_FILE &&
$PYTHON correct_counts.py -i "/tmp/mirrulations_$START_TIME.json" -o "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> $LOG_FILE &&
$PYTHON set_counts.py -y -i "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> $LOG_FILE

rm "/tmp/mirrulations_${START_TIME}_corrected.json" "/tmp/mirrulations_$START_TIME.json"