
Manual Download Count Scripts #193

Closed · wants to merge 7 commits
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ If you are interested in becoming a developer, see `docs/developers.md`.

To run Mirrulations, you need Python 3.9 or greater ([MacOSX](https://docs.python-guide.org/starting/install3/osx/) or [Windows](https://docs.python-guide.org/starting/install3/win/)) on your machine, as well as [redis](https://redis.io/) if you are running a server.

You will also need a valid API key from Regulations.gov to participate. To apply for a key, you must simply [contact the Regulations Help Desk]([email protected]) and provide your name, email address, organization, and intended use of the API. If you are not with any organizations, just say so in your message. They will email you with a key once they've verified you and activated the key.
You will also need a valid API key from Regulations.gov to participate. To apply for a key, simply complete the [API key request form](https://open.gsa.gov/api/regulationsgov/) and provide your name, email address, organization, and intended use of the API. After review, the key will be sent by email.

To download the actual project, you will need to go to our [GitHub page](https://github.com/MoravianUniversity/mirrulations) and [clone](https://help.github.com/articles/cloning-a-repository/) the project to your computer.

15 changes: 15 additions & 0 deletions docs/client.md
@@ -10,3 +10,18 @@ The goal is
that the client will request and complete work in order to download data from
[regulations.gov](https://www.regulations.gov/).

## Attributes
- `api_key`: Used to authenticate requests made to the API.
- `client_id`: An ID included in the `client.env` file.
- `path_generator`: An instance of `PathGenerator` that returns a path for saving job results.
- `saver`: An instance of `Saver` that handles saving files either to disk or Amazon S3.
- `redis`: A connection to the Redis server for managing job states.
- `job_queue`: A queue from which the client pulls jobs.
- `cache`: An instance of `JobStatistics` for caching job statistics.

## Workflow
1. Initialization: The client is initialized with a Redis server and a job queue.
2. Fetching jobs: The client attempts to fetch a job from the job queue using `_get_job_from_job_queue`.
3. Performing jobs: Depending on the job type, the client performs the job by calling an API endpoint to request a JSON object.
4. Saving results: The client saves the job results and any included attachments using the `Saver` class.
5. Updating Redis: The client updates the Redis server with job states.
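
A minimal sketch of one pass through this loop, assuming hypothetical job fields (`job_id`, `url`) and standing in for the real `Saver` and `PathGenerator` classes (the queue and hash names come from `docs/database.md` in this PR):

```python
import json

import redis
import requests


def run_one_job(db: redis.Redis, api_key: str) -> None:
    """Process a single job from the waiting queue (illustrative only)."""
    raw = db.lpop("jobs_waiting_queue")        # fetch the next job, if any
    if raw is None:
        return                                 # the queue is empty
    job = json.loads(raw)
    db.hset("jobs_in_progress", job["job_id"], raw)

    # Perform the job: request the JSON object from the API endpoint.
    response = requests.get(job["url"], headers={"X-Api-Key": api_key})
    response.raise_for_status()

    # The real client hands the result to Saver; here we just keep it in memory.
    result = response.json()

    # Record the completed job in Redis.
    db.hdel("jobs_in_progress", job["job_id"])
    db.hset("jobs_done", job["job_id"], json.dumps(result))
```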
64 changes: 53 additions & 11 deletions docs/database.md
@@ -2,23 +2,53 @@

## Database Format

We use [Redis](https://redis.io/) to store jobs as well as key values that must
be remembered.
We use [Redis](https://redis.io/) to store jobs as well as key values such as counters and timestamps.

## Database Structure

The Redis database is structured with the following keys:

- `regulations_total_comments`
- `num_dockets_done`
- `num_documents_done`
- `num_attachments_done`
- `last_job_id`
- `jobs_in_progress`
- `num_pdf_attachments_done`
- `num_jobs_documents_waiting`
- `num_jobs_comments_waiting`
- `dockets_last_timestamp`
- `invalid_jobs`
- `regulations_total_dockets`
- `client_jobs`
- `num_extractions_done`
- `regulations_total_documents`
- `mirrulations_bucket_size`
- `num_comments_done`
- `documents_last_timestamp`
- `num_jobs_dockets_waiting`
- `comments_last_timestamp`
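
These keys can be read directly; for example, `scripts/get_counts.py` in this PR reads the `num_*_done` counters with plain `GET` calls:

```python
import redis

db = redis.Redis()

# Redis returns bytes, so convert the counters to integers after reading.
dockets_done = int(db.get("num_dockets_done"))
documents_done = int(db.get("num_documents_done"))
comments_done = int(db.get("num_comments_done"))

print(f"dockets={dockets_done} documents={documents_done} comments={comments_done}")
```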


## Job Management

The Redis database has three "queues":
`jobs_waiting_queue`, `jobs_in_progress`, and `jobs_done`.

`jobs_waiting_queue` is a list, while `jobs_in_progress` and `jobs_done` are
hashes. Each stores jobs for clients to process.
The keys serve the following functions:

- `jobs_waiting_queue`: A list holding JSON strings representing each job.
- `jobs_in_progress`: A hash storing jobs currently being processed.
  Its keys are integers (the job ids), each mapped to the job to be processed.
- `jobs_done`: A hash storing completed jobs.

The keys `client_jobs` and `total_num_client_ids` are used for storing client
information:

- `client_jobs`: A hash mapping job IDs to client IDs.
- `total_num_client_ids`: An integer value storing the number of clients.
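
As a sketch of how these structures fit together (the `job_id` field name is an assumption, not the exact job schema):

```python
import json

import redis

db = redis.Redis()

# Enqueue a job: jobs_waiting_queue is a plain Redis list of JSON strings.
job = {"job_id": 1, "url": "https://api.regulations.gov/v4/dockets/EXAMPLE-0001"}
db.rpush("jobs_waiting_queue", json.dumps(job))

# A client pops the job and records it as in progress, keyed by its job id.
raw = db.lpop("jobs_waiting_queue")
job = json.loads(raw)
db.hset("jobs_in_progress", job["job_id"], raw)
db.hset("client_jobs", job["job_id"], 7)  # job 1 was handed to client 7

# On completion, the job moves from jobs_in_progress to jobs_done.
db.hdel("jobs_in_progress", job["job_id"])
db.hset("jobs_done", job["job_id"], raw)
```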

## Redis Format
## `jobs_waiting_queue`
Expand Down Expand Up @@ -54,7 +84,19 @@ timestamp seen when querying regulations.gov.
The `last_job_id` variable is used by the work generator to ensure it generates
unique ids for each job.
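
In Redis this kind of counter is typically advanced atomically with `INCR`; a sketch (not necessarily the work generator's exact code):

```python
import redis

db = redis.Redis()


def next_job_id(db: redis.Redis) -> int:
    # INCR is atomic, so two concurrent generators can never hand out the same id.
    return db.incr("last_job_id")
```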

## Client IDs

The `last_client_id` variable is used by the work server to ensure that it
generates unique client ids.

## Job Statistics Keys

- `DOCKETS_DONE`: Tracks the number of completed dockets.
- `DOCUMENTS_DONE`: Tracks the number of completed documents.
- `COMMENTS_DONE`: Tracks the number of completed comments.
- `ATTACHMENTS_DONE`: Tracks the number of completed attachments.
- `PDF_ATTACHMENTS_DONE`: Tracks the number of completed PDF attachments.
- `EXTRACTIONS_DONE`: Tracks the number of completed extractions.
- `MIRRULATION_BUCKET_SIZE`: Stores the size of the mirrulations bucket.
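
These names presumably correspond to the `num_*_done` and `mirrulations_bucket_size` keys listed under Database Structure; a hypothetical mapping, shown only to make that correspondence concrete:

```python
# Assumed mapping from statistics constants to Redis keys
# (not the project's actual constants module).
DOCKETS_DONE = "num_dockets_done"
DOCUMENTS_DONE = "num_documents_done"
COMMENTS_DONE = "num_comments_done"
ATTACHMENTS_DONE = "num_attachments_done"
PDF_ATTACHMENTS_DONE = "num_pdf_attachments_done"
EXTRACTIONS_DONE = "num_extractions_done"
MIRRULATION_BUCKET_SIZE = "mirrulations_bucket_size"
```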
222 changes: 222 additions & 0 deletions scripts/get_counts.py
@@ -0,0 +1,222 @@
#!/usr/bin/env python3

import argparse
import datetime as dt
import json
import os
import pathlib
import sys
from typing import Any, TypedDict

import redis
import requests

REGULATIONS_BASE_URL = "https://api.regulations.gov/v4/"


class Counts(TypedDict):
dockets: int
documents: int
comments: int


class EntityCount(TypedDict):
downloaded: int
total: int


class Output(TypedDict):
start_timestamp: dt.datetime
stop_timestamp: dt.datetime
dockets: EntityCount
documents: EntityCount
comments: EntityCount


class OutputEncoder(json.JSONEncoder):
def default(self, o: Any) -> Any:
if isinstance(o, dt.datetime):
return o.strftime("%Y-%m-%d %H:%M:%S")
return super().default(o)


def get_regulation_count(api_key: str, start: dt.datetime, end: dt.datetime) -> Counts:
headers = {"X-Api-Key": api_key}
counts: Counts = {"dockets": -1, "documents": -1, "comments": -1}

params = {
"filter[lastModifiedDate][ge]": start.strftime("%Y-%m-%d %H:%M:%S"),
"filter[lastModifiedDate][le]": end.strftime("%Y-%m-%d %H:%M:%S"),
}

for type_ in counts:
response = requests.get(
REGULATIONS_BASE_URL + type_, headers=headers, params=params
)
response.raise_for_status()
counts[type_] = response.json()["meta"]["totalElements"]

return counts


def get_true_prod_count(dashboard_url: str) -> Counts:
"""Dumbly get the counts of a running mirrulations instance"""
response = requests.get(dashboard_url + "/data")
response.raise_for_status()

stats = response.json()
counts = Counts(
dockets=stats["num_dockets_done"],
documents=stats["num_documents_done"],
comments=stats["num_comments_done"],
)

return counts


def clamp_counts(counts: Counts, max_counts: Counts) -> Counts:
    """Clamp each count into the range [0, max_counts[key]]."""
    clamped = Counts(dockets=0, documents=0, comments=0)
    for key in counts:
        # Clamp to the official total from above and to zero from below.
        clamped[key] = max(min(counts[key], max_counts[key]), 0)
    return clamped


def get_accurate_prod_count(
db: redis.Redis, max_counts: Counts, ignore_queue: bool = False
) -> Counts:
"""Get the counts of a running mirrulations instance, ignoring duplicated downloads

Args:
db: a redis database connection
        max_counts: the official Regulations.gov counts, used as an upper bound
ignore_queue: continue even if jobs are in the queue
"""
counts = Counts(
dockets=int(db.get("num_dockets_done")),
documents=int(db.get("num_documents_done")),
comments=int(db.get("num_comments_done")),
)
jobs_waiting = {
"dockets": int(db.get("num_jobs_dockets_waiting")),
"documents": int(db.get("num_jobs_documents_waiting")),
"comments": int(db.get("num_jobs_comments_waiting")),
}

if any(jobs_waiting.values()):
if not ignore_queue:
print("Jobs in queue, exitting", file=sys.stderr)
sys.exit(1)
for k in counts:
counts[k] = min(max_counts[k] - jobs_waiting[k], counts[k])

return clamp_counts(counts, max_counts)


def make_output(
start: dt.datetime, end: dt.datetime, downloaded: Counts, total: Counts
) -> Output:
output: Output = {
"start_timestamp": start,
"stop_timestamp": end,
"dockets": {
"downloaded": downloaded["dockets"],
"total": total["dockets"],
},
"documents": {
"downloaded": downloaded["documents"],
"total": total["documents"],
},
"comments": {
"downloaded": downloaded["comments"],
"total": total["comments"],
},
}

return output


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Get Correct Mirrulation Counts as a json document",
epilog="Can be used in conjunction with update_counts.py",
)
parser.add_argument(
"-f",
"--from",
metavar="START_DATETIME",
dest="start",
type=dt.datetime.fromisoformat,
default=dt.datetime(1776, 7, 4).isoformat(timespec="seconds"),
help="start time (inclusive) for counts in ISO 8601 format 'YYYY-MM-DDTHH:mm:ss' (default '%(default)s')",
)
parser.add_argument(
"-t",
"--to",
metavar="END_DATETIME",
dest="end",
type=dt.datetime.fromisoformat,
default=dt.datetime.now().isoformat(timespec="seconds"),
help="end time (exclusive) for counts in ISO 8601 format 'YYYY-MM-DDTHH:mm:ss' (default '%(default)s')",
)
parser.add_argument(
"-o",
"--output",
metavar="PATH",
help="file to output to, use '-' for stdout (default '%(default)s')",
type=str,
default="-",
)
parser.add_argument(
"-c",
"--correct",
help="Get corrected counts download counts",
action="store_true",
)
parser.add_argument(
"--ignore-queue", help="Continue if jobs are in the job queue", action="store_true"
)
parser.add_argument(
"-a",
"--api-key",
help="Regulations.gov api key, defaults to value of `API_KEY` environment variable",
default=os.getenv("API_KEY"),
type=str,
)
parser.add_argument(
"--dashboard",
metavar="URL",
help="URL of dashboard to use, mutually exclusive with '-c', (default '%(default)s')",
default="http://localhost",
)

args = parser.parse_args()

start: dt.datetime = args.start
end: dt.datetime = args.end
out_path: str = args.output
correct: bool = args.correct
dashboard_url: str = args.dashboard
ignore_queue: bool = args.ignore_queue

api_key = args.api_key
if api_key is None or api_key == "":
print("No api key found, exitting", file=sys.stderr)
sys.exit(1)

regulations = get_regulation_count(api_key, start, end)
if correct:
mirrulations = get_accurate_prod_count(redis.Redis(), regulations, ignore_queue)
else:
mirrulations = get_true_prod_count(dashboard_url)

output = make_output(start, end, mirrulations, regulations)

if out_path == "-":
json.dump(output, sys.stdout, cls=OutputEncoder)
else:
with open(pathlib.Path(out_path), "w") as fp:
json.dump(output, fp, cls=OutputEncoder)
3 changes: 3 additions & 0 deletions scripts/requirements.txt
@@ -0,0 +1,3 @@
requests
docker
redis