Merge pull request #194 from jpappel/counts_scripts

Scripts to Get, Correct, and Set Docket, Document, and Comment values
MoravianUniversity · Oct 29, 2024 · 06a22fe · 06a22fe
2 parents 043cacb + a4d141e
commit 06a22fe
Showing 8 changed files with 741 additions and 0 deletions.
diff --git a/docs/scripts.md b/docs/scripts.md
@@ -0,0 +1,112 @@
+# Script Documentation
+
+## Summary
+
+Some tasks are small enough that the project architecture should not change, but large enough that they should not be performed by hand.
+Files in the `scripts` directory exist to fill this space.
+
+Currently, the following scripts are provided.
+
+* `get_counts.py`
+    * get docket, document, and comment counts from regulations.gov, a mirrulations dashboard, or a mirrulations Redis instance as json
+    * when using regulations.gov a timestamp can be given to make all dockets, documents, and comments before the timestamp count as if they were downloaded
+* `correct_counts.py`
+    * correct possible errors within a counts json file generated by `get_counts.py`
+* `set_counts.py`
+    * set values in a mirrulations Redis instance using json generated by `get_counts.py`
+* `get_correct_set.sh`
+    * run `get_counts.py`, `correct_counts.py`, and `set_counts.py`, logging relevant information
+
+All of the scripts above share a common JSON format:
+<details>
+<summary><code>get_counts.py</code> common format</summary>
+
+```json
+{
+  "creation_timestamp": "2024-10-16 15:00:00",
+  "dockets": {
+    "downloaded": 253807,
+    "jobs": 0,
+    "total": 253807,
+    "last_timestamp": "2024-10-13 04:04:18"
+  },
+  "documents": {
+    "downloaded": 1843774,
+    "jobs": 0,
+    "total": 1843774,
+    "last_timestamp": "2024-10-13 04:04:18"
+  },
+  "comments": {
+    "downloaded": 22240501,
+    "jobs": 10,
+    "total": 22240511,
+    "last_timestamp": "2024-10-13 04:04:18"
+  }
+}
+```
+
+</details>
+
+## Description
+
+### `get_correct_set.sh`
+
+`get_correct_set.sh` gets counts using `get_counts.py` from Redis, corrects them using `correct_counts.py`, and on success sets them using `set_counts.py`.
+It attempts to log to `/var/log/mirrulations_counts.log`.
+By default, it expects a virtual environment with all required dependencies in `/home/cs334/mirrulations/scripts/.venv`.
+
+### `get_counts.py`
+
+`get_counts.py` gets counts from one of three sources: regulations.gov, a Mirrulations Redis instance, or a Mirrulations dashboard via HTTP.
+
+When reading from regulations.gov, a UTC timestamp can be specified to mock having downloaded all dockets, documents, and comments from before that timestamp.
+
+When reading from a dashboard, a UTC timestamp must be specified, since the dashboard API does not provide one.
+
+### `correct_counts.py`
+
+`correct_counts.py` corrects counts from `get_counts.py` using one of two strategies: `cap_with_total` sets the `downloaded` count for each type to the minimum of `downloaded` and `total`, and `diff_total_with_jobs` sets it to the minimum of `total - jobs` and `downloaded`.
+By default, any queued jobs cause the script to exit without writing any counts; pass the `--ignore-queue` flag to continue anyway.
+
+### `set_counts.py`
+
+`set_counts.py` sets values from `get_counts.py` in a Redis instance.
+By default the script will prompt for user input before changing any values.
+This behavior can be changed using the `--yes` flag, which should be used **WITH GREAT CARE, ESPECIALLY IN PRODUCTION!!!**.
+
+## Setup
+
+First, create a virtual environment to install the dependencies into.
+
+```bash
+cd scripts
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+Make sure you are in the correct environment when running scripts.
+
+## Examples
+
+### Cap Docket, Document, and Comment downloaded counts by the counts from Regulations.gov
+
+```bash
+./get_counts.py redis | ./correct_counts.py | ./set_counts.py -y
+```
+
+### Set Docket, Document, and Comment downloaded counts while jobs are in the queue
+
+```bash
+./get_counts.py dashboard | ./correct_counts.py --ignore-queue --strategy diff_total_with_jobs | ./set_counts.py -y
+```
+
+### Download Counts for a Certain Time from Regulations.gov
+
+```bash
+./get_counts.py --api-key $API_KEY -o aug_6_2022.json -t 2024-08-06T06:20:50Z
+
+export API_KEY=<REGULATIONS.GOV_API_KEY>
+./get_counts.py regulations -o oct_01_2024.json --last-timestamp 2024-10-01T15:30:10Z
+./set_counts.py -i oct_01_2024.json
+```
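
Review note on the two correction strategies described in `docs/scripts.md`: the arithmetic can be checked on a single entry. The sketch below restates both rules as standalone functions; the function names mirror the `--strategy` choices, and the over-counted `comments` entry is a hypothetical input, not real data from the project.

```python
def cap_with_total(entity: dict) -> int:
    """Downloaded count capped by the total."""
    return min(entity["total"], entity["downloaded"])


def diff_total_with_jobs(entity: dict) -> int:
    """Downloaded count capped by the total minus the queued jobs."""
    return min(entity["total"] - entity["jobs"], entity["downloaded"])


# Hypothetical over-counted entry: more downloads recorded than exist.
comments = {"downloaded": 22240520, "jobs": 10, "total": 22240511}

print(cap_with_total(comments))        # 22240511
print(diff_total_with_jobs(comments))  # 22240501
```

With jobs still queued, the second rule yields the lower figure because it assumes the queued items have not been downloaded yet.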
diff --git a/scripts/correct_counts.py b/scripts/correct_counts.py
@@ -0,0 +1,114 @@
+#!/usr/bin/env python3
+
+from copy import deepcopy
+import json
+import pathlib
+import sys
+from json import JSONDecodeError
+from counts import Counts, CountsEncoder, CountsDecoder
+
+import argparse
+
+
+class JobsInQueueException(Exception):
+    pass
+
+
+def strategy_cap(received: Counts, ignore_queue: bool) -> Counts:
+    filtered = deepcopy(received)
+    if filtered["queue_size"] != 0 and not ignore_queue:
+        raise JobsInQueueException(f'Found jobs in job queue: {filtered["queue_size"]}')
+    for entity_type in ("dockets", "documents", "comments"):
+        total_ = filtered[entity_type]["total"]
+        downloaded = filtered[entity_type]["downloaded"]
+        filtered[entity_type]["downloaded"] = min(total_, downloaded)
+
+    return filtered
+
+
+def strategy_diff(received: Counts, ignore_queue: bool) -> Counts:
+    filtered = deepcopy(received)
+    for entity_type in ("dockets", "documents", "comments"):
+        total_ = filtered[entity_type]["total"]
+        downloaded = filtered[entity_type]["downloaded"]
+        jobs = filtered[entity_type]["jobs"]
+        if jobs > 0 and not ignore_queue:
+            raise JobsInQueueException(
+                f'{entity_type} has {filtered[entity_type]["jobs"]} in queue'
+            )
+        filtered[entity_type]["downloaded"] = min(total_ - jobs, downloaded)
+
+    return filtered
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        "Correct Counts",
+        description="Correct counts in json format by either capping downloaded with `total` or capping with `total - jobs`",
+    )
+    parser.add_argument(
+        "-o",
+        "--output",
+        metavar="OUTPUT_PATH",
+        type=str,
+        default="-",
+        help="file to output to, use '-' for stdout (default '%(default)s')",
+    )
+    parser.add_argument(
+        "-i",
+        "--input",
+        metavar="INPUT_PATH",
+        type=str,
+        default="-",
+        help="file to read from, use '-' for stdin (default '%(default)s')",
+    )
+    parser.add_argument(
+        "-s",
+        "--strategy",
+        type=str,
+        default="cap_with_total",
+        choices=("cap_with_total", "diff_total_with_jobs"),
+        help="the correction strategy to use (default '%(default)s')",
+    )
+    parser.add_argument(
+        "--ignore-queue",
+        action="store_true",
+        help="continue even if there are queued jobs",
+    )
+
+    args = parser.parse_args()
+
+    try:
+        if args.input == "-":
+            input_counts: Counts = json.load(sys.stdin, cls=CountsDecoder)
+        else:
+            try:
+                with open(pathlib.Path(args.input), "r") as fp:
+                    input_counts = json.load(fp, cls=CountsDecoder)
+            except FileNotFoundError:
+                print(f"Missing file {args.input}, exiting", file=sys.stderr)
+                sys.exit(2)
+    except JSONDecodeError:
+        print(f"Malformed input file {args.input}, exiting", file=sys.stderr)
+        sys.exit(2)
+
+    try:
+        if args.strategy == "cap_with_total":
+            modified_counts = strategy_cap(input_counts, args.ignore_queue)
+        elif args.strategy == "diff_total_with_jobs":
+            modified_counts = strategy_diff(input_counts, args.ignore_queue)
+        else:
+            print(f"Unrecognized strategy {args.strategy}, exiting", file=sys.stderr)
+            sys.exit(1)
+    except JobsInQueueException as e:
+        print(
+            f"Found jobs in queue: {e}\nUse `--ignore-queue` to continue",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+
+    if args.output == "-":
+        json.dump(modified_counts, sys.stdout, cls=CountsEncoder)
+    else:
+        with open(pathlib.Path(args.output), "w") as fp:
+            json.dump(modified_counts, fp, cls=CountsEncoder)
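
Review note on the queue check: `strategy_cap` reads a top-level `queue_size` key, while the sample JSON in `docs/scripts.md` shows only per-type `jobs` fields. Whether `get_counts.py` always emits `queue_size` is not visible in this part of the diff, so the following is only a sketch of one possible guard. The helpers `queued_jobs` and `ensure_queue_empty` are hypothetical names introduced for illustration, and falling back to the sum of per-type `jobs` is an assumption, not the author's method.

```python
ENTITY_TYPES = ("dockets", "documents", "comments")


class JobsInQueueException(Exception):
    """Raised when queued jobs are present and the caller did not opt out."""


def queued_jobs(counts: dict) -> int:
    """Hypothetical helper: prefer `queue_size`, else sum the per-type `jobs`."""
    if "queue_size" in counts:
        return counts["queue_size"]
    return sum(counts[entity_type]["jobs"] for entity_type in ENTITY_TYPES)


def ensure_queue_empty(counts: dict, ignore_queue: bool = False) -> None:
    """Raise unless the queue is empty or the caller chose to ignore it."""
    jobs = queued_jobs(counts)
    if jobs != 0 and not ignore_queue:
        raise JobsInQueueException(f"Found jobs in job queue: {jobs}")


# Shaped like the documented sample, which has no top-level `queue_size`.
sample = {
    "dockets": {"downloaded": 253807, "jobs": 0, "total": 253807},
    "documents": {"downloaded": 1843774, "jobs": 0, "total": 1843774},
    "comments": {"downloaded": 22240501, "jobs": 10, "total": 22240511},
}

print(queued_jobs(sample))  # 10
ensure_queue_empty(sample, ignore_queue=True)  # does not raise
```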
diff --git a/scripts/counts.py b/scripts/counts.py
@@ -0,0 +1,38 @@
+import json
+import datetime as dt
+from typing import Any, TypedDict
+
+
+class EntityCount(TypedDict):
+    downloaded: int
+    jobs: int
+    total: int
+    last_timestamp: dt.datetime
+
+
+class Counts(TypedDict):
+    creation_timestamp: dt.datetime
+    queue_size: int
+    dockets: EntityCount
+    documents: EntityCount
+    comments: EntityCount
+
+
+class CountsEncoder(json.JSONEncoder):
+    def default(self, o: Any) -> Any:
+        if isinstance(o, dt.datetime):
+            return o.strftime("%Y-%m-%d %H:%M:%S")
+        return super().default(o)
+
+
+class CountsDecoder(json.JSONDecoder):
+    def __init__(self, *args, **kwargs):
+        super().__init__(object_hook=self.object_hook, *args, **kwargs)
+
+    def object_hook(self, obj: Any) -> Any:
+        for key, value in obj.items():
+            try:
+                obj[key] = dt.datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
+            except (ValueError, TypeError):
+                pass
+        return obj
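
Review note on `counts.py`: the encoder and decoder are meant to round-trip timestamps through the `%Y-%m-%d %H:%M:%S` string form. The sketch below checks that behavior with plain functions passed to `json.dumps` and `json.loads` instead of the subclasses in the diff; `encode_default`, `decode_hook`, and `TIMESTAMP_FORMAT` are illustrative names, and the sample values are taken from the documented JSON.

```python
import datetime as dt
import json

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # same format string as counts.py


def encode_default(o):
    """Serialize datetimes as strings; reject anything else json cannot handle."""
    if isinstance(o, dt.datetime):
        return o.strftime(TIMESTAMP_FORMAT)
    raise TypeError(f"Cannot serialize {type(o).__name__}")


def decode_hook(obj):
    """Turn every string value that matches the format back into a datetime."""
    for key, value in obj.items():
        try:
            obj[key] = dt.datetime.strptime(value, TIMESTAMP_FORMAT)
        except (ValueError, TypeError):
            pass  # not a timestamp string (int, dict, or other text)
    return obj


original = {
    "creation_timestamp": dt.datetime(2024, 10, 16, 15, 0, 0),
    "dockets": {
        "downloaded": 253807,
        "jobs": 0,
        "total": 253807,
        "last_timestamp": dt.datetime(2024, 10, 13, 4, 4, 18),
    },
}

text = json.dumps(original, default=encode_default)
restored = json.loads(text, object_hook=decode_hook)
print(restored == original)  # True
```

Two consequences of this design are worth keeping in mind: the hook converts any string value that matches the format, whatever its key, and the format carries neither sub-second precision nor a time zone.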
diff --git a/scripts/get_correct_set.sh b/scripts/get_correct_set.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+WORK_DIR="/home/cs334/mirrulations/scripts/"
+LOG_FILE=/var/log/mirrulations_counts.log
+START_TIME=$(date -u -Iseconds)
+echo "$START_TIME: Running" > $LOG_FILE
+cd $WORK_DIR
+
+PYTHON=".venv/bin/python3"
+
+$PYTHON get_counts.py redis -o "/tmp/mirrulations_$START_TIME.json" 2>> $LOG_FILE &&
+    $PYTHON correct_counts.py -i "/tmp/mirrulations_$START_TIME.json" -o "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> $LOG_FILE &&
+    $PYTHON set_counts.py -y -i "/tmp/mirrulations_${START_TIME}_corrected.json" 2>> $LOG_FILE
+
+rm "/tmp/mirrulations_${START_TIME}_corrected.json" "/tmp/mirrulations_$START_TIME.json"