From 00fdaf2b24957b3c530cf3712d2ca3188cdd1079 Mon Sep 17 00:00:00 2001
From: cavis
Date: Tue, 7 May 2024 15:30:59 -0600
Subject: [PATCH] The docs

---
 README.md | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e801d1d..3fd18b6 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,75 @@
-# dovetail-cdn-usage
-Lambda to query Dovetail CloudFront usage and insert into BigQuery
+# Dovetail CDN Usage
+
+AWS Lambda to query Dovetail CloudFront usage and insert into BigQuery
+
+## Overview
+
+1. Requests to the [Dovetail CDN](https://github.com/PRX/Infrastructure/tree/main/cdn/dovetail-cdn) are logged to an S3 bucket.
+2. This lambda queries BigQuery for the `MAX(day) FROM dt_bytes`, and processes days >= that result (or all the way back to the S3 expiration date). See the first sketch in the appendix below.
+3. Then we query Athena for a day of logs, grouping by path and summing the bytes sent (also sketched in the appendix).
+4. Paths are parsed and grouped as `///...` or `///episode/...`. Unrecognized paths that use a bunch of bandwidth are logged as warnings.
+5. Resulting bytes usage is inserted back into BigQuery:
+
+   ```
+   {day: "2024-04-23", feeder_podcast: 123, feeder_episode: "abcd-efgh", feeder_feed: null, bytes: 123456789}
+   ```
+
+## Development
+
+Local development is dependency free! Just:
+
+```sh
+yarn install
+yarn test
+yarn lint
+```
+
+However, if you actually want to hit Athena/BigQuery, you'll need to `cp env-example .env` and fill in several environment variables:
+
+- `ATHENA_DB` the Athena database you're using
+- `ATHENA_TABLE` the Athena table configured to [query the Dovetail CDN S3 logs](https://docs.aws.amazon.com/athena/latest/ug/cloudfront-logs.html#create-cloudfront-table-standard-logs)
+  - **NOTE:** you must have your AWS credentials set up locally to reach/query Athena
+- `BQ_DATASET` the BigQuery dataset to load the `dt_bytes` table into. Locally, you should use `development` or something similar (not `staging` or `production`)
+
+Then run `yarn start` and you're off!
+
+## Deployment
+
+This function's code is deployed as part of the usual
+[PRX CI/CD](https://github.com/PRX/Infrastructure/tree/main?tab=readme-ov-file#cicd) process.
+The lambda zip is built via `yarn build`, uploaded to S3, and deployed into the wild.
+
+While that's all straightforward, there are some gotchas setting up access:
+
+1. AWS permissions (Athena, S3, Glue, etc.) are documented in the [CloudFormation stack](https://github.com/PRX/Infrastructure/blob/main/spire/templates/apps/dovetail-cdn-usage.yml) for this app.
+2. Google access is configured via the `BQ_CLIENT_CONFIG` env var and [Federated Access](https://github.com/PRX/internal/wiki/Guide:-Google-Cloud-Workload-Identity-Federation).
+3. _In addition to the steps documented in (2)_, the Service Account you create must have the following permissions:
+   - `BigQuery Job User` in your BigQuery project
+   - _Any_ role on the BigQuery dataset that provides `bigquery.tables.create`, so the table load jobs can execute. We have a custom role to provide this minimal access, but any role with that create permission will work.
+   - `BigQuery Data Editor` _only_ on the `dt_bytes` table in the dataset for this environment (in the BigQuery UI, click the table name -> Share -> Manage Permissions)
+
+## License
+
+[AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html)
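+
+## Appendix: example queries
+
+These sketches are illustrative only, not the exact SQL this lambda runs. First, the step-2 check for where we left off, assuming your `BQ_DATASET` is a hypothetical dataset named `development` containing the `dt_bytes` table:
+
+```sql
+-- Find the most recent day already loaded into BigQuery, so the
+-- lambda can process days >= this result.
+SELECT MAX(day) AS max_day
+FROM `development.dt_bytes`;
+```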
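+
+And the step-3 Athena aggregation, with column names taken from the AWS example CloudFront table linked above (`date`, `uri`, `bytes`) and a hypothetical `usage_db.cloudfront_logs` standing in for your `ATHENA_DB`/`ATHENA_TABLE`:
+
+```sql
+-- Sum the bytes sent per path for a single day of CloudFront logs.
+-- "date" is double-quoted because it is a reserved word in Athena.
+SELECT uri, SUM(bytes) AS total_bytes
+FROM usage_db.cloudfront_logs
+WHERE "date" = DATE '2024-04-23'
+GROUP BY uri;
+```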