
Google Logging API Limit not reasonable for Snakemake #14

Open
vsoch opened this issue Nov 25, 2023 · 5 comments

@vsoch (Collaborator)

vsoch commented Nov 25, 2023

Problem: we need to be able to stream logs for the job to a .snakemake logs file. Currently, we get job statuses and print to the console, but for use cases where it's entirely in Python (e.g., a test) and for provenance / keeping of this metadata, we need the file written.

Google has a Logging API, but its quotas make it unreasonable to use for this case. The entries.list quota is only 60 requests per minute, and in our testing each request returns just one line of a log that can be many thousands of lines long (for reference, a hello world test workflow produces over 3K lines).

Number of entries.list requests: 60 per minute, per Google Cloud project

This means that, in practice, anything > 1 job (e.g., Snakemake running multiple steps), or anything that does not sufficiently sleep, will hit the limit very quickly (I did in a hello world test run). Because of this limit and the needs of Snakemake, this is currently not reasonable to add. We are adding a helper script in #13 that can (with a sleep) stream logs (very slowly) for a job, but ideally we can add this to the core of the executor to stream all logs for all steps.
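To make the quota math concrete, here is a back-of-the-envelope sketch. It assumes one log line per entries.list request, as observed in our testing (that per-request yield is this thread's observation, not a documented guarantee):

```python
# Back-of-the-envelope: minimum wall-clock time to page through a log when the
# entries.list quota is 60 requests per minute.
# Assumes one log line per request (as observed in our testing).

def minutes_to_fetch(total_lines, requests_per_minute=60, lines_per_request=1):
    """Minimum minutes to fetch a log without exceeding the request quota."""
    requests_needed = total_lines / lines_per_request
    return requests_needed / requests_per_minute

# A ~3,000-line "hello world" workflow log takes ~50 minutes at this rate.
print(minutes_to_fetch(3000))  # 50.0
```

So even a trivial workflow log cannot be streamed in anything close to real time under this quota.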

I see that we are able to create "sinks" (https://cloud.google.com/logging/docs/routing/overview#sinks), and I suspect the behavior above is because Google wants us to route logs to, say, Pub/Sub and then retrieve them more reasonably (but also pay more for the extra service). If this is the only option we won't have a choice, but it adds complexity to the executor by requiring use (and activation) of another API, and it introduces another source of billing. We will ping Google for advice on our options here.

@joshgc

joshgc commented Nov 28, 2023

@vsoch It sounds like your use case may be covered by "live-tailing" logs, which has these quotas:

Number of open live-tailing sessions: 10 per Google Cloud project
Number of live-tailing entries returned: 60,000 per minute

https://cloud.google.com/logging/quotas

Also, I'm not sure I understand what you mean by entries.list returning only one log line; can you please share your request body? FWIW, I'm just a big GCP logging user here.

@vsoch (Collaborator, Author)

vsoch commented Nov 28, 2023

> Also I'm not sure I understand what you mean by entries.list returns only one log line; can you please share your request body? FWIW I'm just a big GCP logging user here

Sure! Each entry is actually just one line in a log.

{
insertId: "rh56tufqu0wjl"
labels: {4}
logName: "projects/llnl-flux/logs/batch_task_logs"
receiveTimestamp: "2023-11-25T16:48:00.758973170Z"
resource: {2}
severity: "ERROR"
textPayload: "WARNING: Target directory /.snakemake/pip-deployments/snakemake_storage_plugin_s3-0.2.5.dist-info already exists. Specify --upgrade to force replacement."
timestamp: "2023-11-25T16:48:00.139653089Z"
}

If you were to be streaming an actual log, it would be just one line:

WARNING: Target directory /.snakemake/pip-deployments/snakemake_storage_plugin_s3-0.2.5.dist-info already exists. Specify --upgrade to force replacement.

A "hello world" workflow would have 3-4k of these lines. So when the logging quota is 60 requests per minute, I can only see 60 of those lines per minute, which barely covers the hello world job (and only works if you add a sleep between requests). Does that make sense?
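In other words, the useful log line is just the textPayload field of each returned entry. A minimal sketch, using an entry dict shaped like the example above (the collapsed labels/resource objects from the console view are omitted here):

```python
# Each Logging API entry wraps a single log line in metadata; the line itself
# is the textPayload field. This sample mirrors the entry shown above.
entry = {
    "insertId": "rh56tufqu0wjl",
    "logName": "projects/llnl-flux/logs/batch_task_logs",
    "severity": "ERROR",
    "textPayload": (
        "WARNING: Target directory /.snakemake/pip-deployments/"
        "snakemake_storage_plugin_s3-0.2.5.dist-info already exists. "
        "Specify --upgrade to force replacement."
    ),
    "timestamp": "2023-11-25T16:48:00.139653089Z",
}

def entry_to_line(entry):
    """Reduce an API entry to the one log line it carries."""
    return entry.get("textPayload", "")

print(entry_to_line(entry))  # prints the single WARNING line
```

So every entry costs one unit of metadata-heavy response for one line of actual log output.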

> Number of open live-tailing sessions: 10 per Google Cloud project

That could be an option, but 10 sessions (for an entire project) would also be a problem for even one large workflow or multiple users.

@joshgc

joshgc commented Nov 28, 2023

Hmm, that seems suspicious. Can you share your request for entries.list too? Also, can you specify what your latency requirements are? I'm having a hard time believing the log costs are going to even make a dent compared to the compute costs.

Pub/Sub is $40 or $50/TiB and each message is billed at a minimum of 1KiB, but the first 10GiB are free. Gory details here: https://cloud.google.com/pubsub/pricing, but notice their example costs are for 10MiBps (or more) sustained. That's a lot of logs!

IIUC, you're looking at $5 / 100 million log messages. And $5 is only 17 hours on the smallest general-purpose C3 GCE machine: https://cloud.google.com/compute/all-pricing#general_purpose. For Pub/Sub costs to approach GCE costs you'd have to produce >1 kHz of logs. That's a lot of logs; sounds like it's data that should be on a filesystem instead anyway...
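The arithmetic above can be checked directly. A sketch using the ~$50/TiB price and 1 KiB minimum message size quoted in this comment (figures from the thread, not current list prices):

```python
# Rough Pub/Sub cost estimate: each message is billed at a minimum of 1 KiB,
# at roughly $50/TiB (figures quoted in this thread; check current pricing).

def pubsub_cost_usd(messages, price_per_tib=50.0, min_message_kib=1.0):
    """Estimated cost for minimum-size messages, ignoring the free tier."""
    tib = messages * min_message_kib / 1024**3  # 1 TiB = 2**30 KiB
    return tib * price_per_tib

# ~100 million minimum-size log messages cost on the order of $5.
print(round(pubsub_cost_usd(100_000_000), 2))  # 4.66
```

This ignores the free tier, so the real bill for small workloads would be even lower.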

@vsoch (Collaborator, Author)

vsoch commented Nov 28, 2023

Here is the script (derived from your example). When I ran it the first time (without a sleep) for a hello world example, I hit the quota.

https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/13/files#diff-3534da2f75f6fb70c7607be1e9659e6051c519e627697c21690ddc02ae9fd83c

> Pub/Sub is $40 or $50/TiB and each messages is at least 1KiB but the first 10GiB are free. Gory details here: https://cloud.google.com/pubsub/pricing but notice they're example costs are for 10MiBps (or more) sustained. Thats a lot of logs!

This is a bit avoiding the discussion, and is telling me that (putting on my user hat) I should be OK with activating another service and paying more. If we are imagining an entire institution generating logs for workflows in Google Batch (for some future when this is happening), then yes, I agree it would create a lot of logs. :)

And yes, the intention is to stream the logs from Batch to a filesystem (where Snakemake is being run from) for reproducibility. We do produce inordinate numbers of logs on our HPC filesystems. It's not always great! Pinging @johanneskoester for discussion because I don't love any of these solutions.

@vsoch (Collaborator, Author)

vsoch commented Nov 28, 2023

And my thinking so far from this discussion is that I'm not comfortable adding anything to the executor that requires enabling an extra service, especially one that can accrue costs (albeit slowly) over time. It's not transparent enough, and it adds additional complexity for the user to set up (including requesting a quota increase or other Pub/Sub setup). I would rather direct the user to the console and then provide a helper script to get a particular log.
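For reference, the helper-script approach amounts to paging through entries while sleeping enough to stay under the 60 requests/minute quota. A minimal sketch with a pluggable fetch function (fetch_page and its page/token shape are hypothetical placeholders for the real API call, not the actual client signature):

```python
import time

def stream_log(fetch_page, requests_per_minute=60, sleep=time.sleep):
    """Page through a log, sleeping between requests to respect the quota.

    fetch_page(token) -> (list_of_lines, next_token_or_None) is a placeholder
    for whatever actually calls the Logging API; sleep is injectable so the
    pacing can be tested without waiting.
    """
    delay = 60.0 / requests_per_minute
    lines, token = [], None
    while True:
        page, token = fetch_page(token)
        lines.extend(page)
        if token is None:
            return lines
        sleep(delay)

# Usage with a fake two-page log (sleep disabled for the demo):
pages = {None: (["line 1", "line 2"], "t1"), "t1": (["line 3"], None)}
print(stream_log(pages.__getitem__, sleep=lambda s: None))
# ['line 1', 'line 2', 'line 3']
```

The one-second-per-request pacing is exactly why streaming a multi-thousand-line log this way is so slow.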

vsoch added a commit that referenced this issue Dec 8, 2023
This PR will add the following:

- an example script to show logs (not added to the executor due to problems with the API limit, described in #14)
- ability for the true test in CI to take environment variables to
specify project /region (for easiest local running)
 - a missing NOTICE file required for HPCIC work

I'm working on an issue that describes the logging problem and will be sharing it with Google; I will update here afterward. TBA.

Update: I don't see a solution that I like for logging. It either
requires enabling an extra service, or risking hitting an API limit. For
now we can easily see status updates in the UI and direct the interested
user to this script for dumping a log.

---------

Signed-off-by: vsoch <[email protected]>
Co-authored-by: vsoch <[email protected]>