feat: add a CLI entrypoint for find_in_gmail (#5)
* feat: add a CLI entrypoint for find_in_gmail
clintval authored Dec 14, 2024
1 parent 4f15064 commit d179c55
Showing 6 changed files with 282 additions and 14 deletions.
112 changes: 109 additions & 3 deletions README.md
@@ -18,13 +18,13 @@ The package can be installed with `pip`:
pip install tp53
```

## Upload a VCF to Seshat

Upload a VCF to the [Seshat TP53 annotation server](http://vps338341.ovh.net/) using a headless browser.

```bash
❯ python -m tp53.seshat.upload_vcf \
--input "sample.library.vcf" \
--email "[email protected]"
```
```console
@@ -52,7 +52,9 @@ One solution that has worked in the past is to remove SVs.
The following command will exclude all variants with a non-empty SVTYPE INFO key:

```bash
❯ bcftools view sample.library.vcf \
--exclude 'SVTYPE!="."' \
> sample.library.noSV.vcf
```
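
Where `bcftools` is not available, the same idea can be approximated in plain Python. The following is a minimal sketch and not part of this package: the function name `drop_structural_variants` is hypothetical, and it assumes an uncompressed, tab-delimited VCF in which the INFO column is the eighth field. Unlike the `bcftools` expression, it drops any record that carries an `SVTYPE` key at all, whatever its value.

```python
from typing import Iterable, Iterator


def drop_structural_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping records whose INFO column has an SVTYPE key."""
    for line in lines:
        if line.startswith("#"):
            # Header lines pass through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # INFO is the eighth column; treat a short record as having no INFO.
        info = fields[7] if len(fields) > 7 else "."
        keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if "SVTYPE" not in keys:
            yield line
```

To use it, open the input and output files and write each yielded line to the output handle.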

###### Automation
@@ -75,6 +77,110 @@ This script relies on Google Chrome:

Distributions of macOS may require you to authenticate the Chrome driver ([link](https://stackoverflow.com/a/60362134)).

## Download a Seshat Annotation from Gmail

Download [Seshat](http://vps338341.ovh.net/) VCF annotations by awaiting a server-generated email.

```bash
❯ python -m tp53.seshat.find_in_gmail \
--input "sample.library.vcf" \
--output "sample.library" \
--credentials "~/.secrets/credentials.json"
```
```console
INFO:tp53.seshat.find_in_gmail:Successfully logged into the Gmail service.
INFO:tp53.seshat.find_in_gmail:Querying for a VCF named: sample.library.vcf
INFO:tp53.seshat.find_in_gmail:Searching Gmail messages with: sample.library.vcf from:[email protected] newer_than:5h subject:"Results of batch analysis"
INFO:tp53.seshat.find_in_gmail:Message found with the following metadata: {'id': '193c310d2714b389', 'threadId': '193c30b7244e2662'}
INFO:tp53.seshat.find_in_gmail:Message contents are as follows:
INFO:tp53.seshat.find_in_gmail: Results of batch analysis
INFO:tp53.seshat.find_in_gmail: Analyzed batch file:
INFO:tp53.seshat.find_in_gmail: sample.library.vcf
INFO:tp53.seshat.find_in_gmail: Time taken to run the analysis:
INFO:tp53.seshat.find_in_gmail: 0 minutes 10 seconds
INFO:tp53.seshat.find_in_gmail: Summary:
INFO:tp53.seshat.find_in_gmail: The input file contained
INFO:tp53.seshat.find_in_gmail: 23 mutations out of which
INFO:tp53.seshat.find_in_gmail: 23 were TP53 mutations.
INFO:tp53.seshat.find_in_gmail:Writing attachment to ZIP archive: sample.library.vcf.seshat.zip
INFO:tp53.seshat.find_in_gmail:Extracting ZIP archive: sample.library.vcf.seshat.zip
INFO:tp53.seshat.find_in_gmail:Output file renamed to: sample.library.seshat.short-20241214_034753_129732.tsv
INFO:tp53.seshat.find_in_gmail:Output file renamed to: sample.library.seshat.long-20241214_034753_217420.tsv
```

This tool is used to programmatically wait for, and retrieve, a batch results email from the Seshat TP53 annotation server.
The tool works by searching a user-controlled Gmail inbox for a recent Seshat email that contains the result annotations for a given VCF input file, by name.
It is critically important to be aware that there is no way to prove which annotation files, as they arrive via email, are to be linked with which VCF file on disk.

This tool assists in the correct pairing of VCF input files, and subsequent annotation files, by letting you specify how many hours back in time you will let the Gmail query search (`--newer-than`).
Limiting the window of time in which an email should have arrived minimizes the chance of discovering stale annotation files from an old Seshat execution in the cases where VCF filenames may be non-unique.
If the batch results email from the Seshat annotation server has not yet arrived, this tool will wait a set number of seconds (`--wait-for`) before exiting with an exception.
It normally takes less than 1 minute for the Seshat server to annotate an average TP53-only VCF.
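
The waiting behaviour described above can be pictured as a polling loop with a deadline. This is a hedged sketch rather than the tool's actual implementation: `poll_until`, `fetch`, and `interval` are hypothetical names, and the real tool may poll on a different schedule or raise a different exception.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    wait_for: float,
    interval: float = 5.0,
) -> T:
    """Call `fetch` until it returns a non-None value or `wait_for` seconds pass."""
    deadline = time.monotonic() + wait_for
    while True:
        result = fetch()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"No result arrived within {wait_for} seconds.")
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

Here `fetch` stands in for whatever function runs the Gmail search once and returns a message, or `None` when nothing has arrived yet.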

###### Search Criteria

The following rules are used to find annotation files:

1. The email contains the filename of the input VCF
2. The email subject line must contain "Results of batch analysis"
3. The email is no more than `--newer-than` hours old
4. The email is from the address [[email protected]](mailto:[email protected])
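
The four rules above map onto a single Gmail search string, as the log output earlier in this section suggests. The sketch below rebuilds that string under stated assumptions: `build_query` is a hypothetical helper, and the `sender` argument is a placeholder for the Seshat address, which is not reproduced here.

```python
from pathlib import Path


def build_query(infile: Path, sender: str, newer_than: int = 5) -> str:
    """Compose a Gmail search string from the four search criteria."""
    return (
        f"{infile.name} "  # Rule 1: the email mentions the VCF filename.
        f"from:{sender} "  # Rule 4: the email comes from the Seshat address.
        f"newer_than:{newer_than}h "  # Rule 3: the email is recent enough.
        'subject:"Results of batch analysis"'  # Rule 2: the subject matches.
    )
```

Only the filename is used, not the full path, because the email refers to the VCF by name.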

###### Outputs

- `<output>.seshat.long-\d{8}_\d{6}_\d{6}.tsv`: The long format Seshat annotations for the input VCF
- `<output>.seshat.short-\d{8}_\d{6}_\d{6}.tsv`: The short format Seshat annotations for the input VCF
- `<output>.seshat.zip`: The original ZIP archive from Seshat
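
Because the TSV names carry a timestamp, downstream code usually has to match them by pattern rather than by exact name. The following sketch shows one way to do that; `is_seshat_output` is a hypothetical helper, and the pattern assumes the timestamp is always eight, six, and six digits separated by underscores, as in the log output above.

```python
import re


def is_seshat_output(filename: str, output: str, kind: str) -> bool:
    """Return True if `filename` is a `kind` ("long" or "short") TSV for `output`."""
    pattern = (
        rf"{re.escape(output)}\.seshat\.{re.escape(kind)}"
        r"-\d{8}_\d{6}_\d{6}\.tsv"
    )
    return re.fullmatch(pattern, filename) is not None
```

The output prefix is escaped so that dots in names such as `sample.library` are matched literally.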

###### Gmail Authentication

After installing all Python dependencies, you must create a Google developer's OAuth file.
First-time 2FA may be required depending on the configuration of your Gmail service.
If 2FA is required, then this script will block until you acknowledge your 2FA prompt.
A 2FA prompt is often delivered through an auto-opening web browser.

To create a Google developer's OAuth file, navigate to the following URL and follow the instructions.

- [Authorize Credentials for a Desktop Application](https://developers.google.com/gmail/api/quickstart/python#authorize_credentials_for_a_desktop_application)

Ensure your OAuth file is configured as a "Desktop app" and then download the credentials as JSON.
Save your credentials file somewhere safe, ideally in a secure user folder with restricted permissions (`chmod 700`).
Set your OAuth file permissions to also restrict unwarranted access (`chmod 600`).
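
The two `chmod` steps can also be applied from Python, which may be convenient in a setup script. This is a small sketch assuming a POSIX filesystem; `restrict_credentials` is a hypothetical helper, not something this package provides.

```python
import stat
from pathlib import Path


def restrict_credentials(credentials: Path) -> None:
    """Limit the credentials file and its parent folder to the owning user."""
    # Equivalent to `chmod 700` on the containing folder.
    credentials.parent.chmod(stat.S_IRWXU)
    # Equivalent to `chmod 600` on the credentials file itself.
    credentials.chmod(stat.S_IRUSR | stat.S_IWUSR)
```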

This script will store a cached token after first-time authentication is successful.
This cached token can be found in the user's home directory within a hidden directory.
Token caching greatly speeds up continued executions of this script.
As of now, the token is cached at the following location:

```bash
"~/.tp53/seshat/seshat-gmail-find-token.pickle"
```

If the cached token is missing, or becomes stale, then you will need to provide your OAuth credentials file.

A typical Google developer's OAuth file is of the format:

```json
{
  "installed": {
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "client_id": "272111863110-csldkfjlsdkfjlksdjflksdincie.apps.googleusercontent.com",
    "client_secret": "sdlfkjsdlkjfijciejijcei",
    "project_id": "gmail-access-2398293892838",
    "redirect_uris": [
      "urn:ietf:wg:oauth:2.0:oob",
      "http://localhost"
    ],
    "token_uri": "https://oauth2.googleapis.com/token"
  }
}
```
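
Before running the tool, it can help to confirm that a downloaded file has this shape. The check below is only a sketch: `check_oauth_file` and `REQUIRED_KEYS` are hypothetical names, and the set of required keys is an assumption drawn from the example above, not a rule enforced by this package or by Google.

```python
import json
from pathlib import Path

# Assumed minimum; taken from the example file, not from an official schema.
REQUIRED_KEYS = {"client_id", "client_secret", "auth_uri", "token_uri"}


def check_oauth_file(credentials: Path) -> dict:
    """Load an OAuth client file and verify it looks like a "Desktop app" client."""
    with credentials.expanduser().open() as handle:
        payload = json.load(handle)
    installed = payload.get("installed")
    if not isinstance(installed, dict):
        raise ValueError('Expected a top-level "installed" object.')
    missing = REQUIRED_KEYS - installed.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return installed
```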

###### Server Failures

If Seshat fails to annotate the VCF file but still emails the user a response, then this tool will emit the email body to standard error and exit with a non-zero status.

## Development and Testing

See the [contributing guide](./CONTRIBUTING.md) for more information.
Empty file.
File renamed without changes.
160 changes: 159 additions & 1 deletion tp53/seshat/find_in_gmail/__main__.py
@@ -1,2 +1,160 @@
r"""
Download Seshat VCF annotations by awaiting a server-generated email.

This tool is used to programmatically wait for, and retrieve, a batch results
email from the Seshat TP53 annotation server. The tool works by searching a
user-controlled Gmail inbox for a recent Seshat email that contains the result
annotations for a given VCF input file, by name. It is critically important to
be aware that there is no way to prove which annotation files, as they arrive
via email, are to be linked with which VCF file on disk. This tool assists in
the correct pairing of VCF input files, and subsequent annotation files, by
letting you specify how many hours back in time you will let the Gmail query
search (`--newer-than`). Limiting the window of time in which an email should
have arrived minimizes the chance of discovering stale annotation files from an
old Seshat execution in the cases where VCF filenames may be non-unique.

If the batch results email from the Seshat annotation server has not yet
arrived, this tool will wait a set number of seconds (`--wait-for`) before
exiting with an exception. It normally takes less than 1 minute for the Seshat
server to annotate an average TP53-only VCF.

#### Search Criteria

The following rules are used to find annotation files:

1. The email contains the filename of the input VCF
2. The email subject line must contain "Results of batch analysis"
3. The email is no more than `--newer-than` hours old
4. The email is from the address "[email protected]"

#### Outputs

* <output>.seshat.long-\d{8}_\d{6}_\d{6}.tsv:
    The long format Seshat annotations for the input VCF.
* <output>.seshat.short-\d{8}_\d{6}_\d{6}.tsv:
    The short format Seshat annotations for the input VCF.
* <output>.seshat.zip:
    The original ZIP archive from Seshat.

#### Gmail Authentication

After installing all Python dependencies, you must create a Google developer's
OAuth file. First-time 2FA may be required depending on the configuration of
your Gmail service. If 2FA is required, then this script will block until you
acknowledge your 2FA prompt. A 2FA prompt is often delivered through an
auto-opening web browser.

To create a Google developer's OAuth file, navigate to the following URL and
follow the instructions.

- https://developers.google.com/gmail/api/quickstart/python#authorize_credentials_for_a_desktop_application

Ensure your OAuth file is configured as a "Desktop app" and then download the
credentials as JSON. Save your credentials file somewhere safe, ideally in a
secure user folder with restricted permissions (chmod 700). Set your OAuth file
permissions to also restrict unwarranted access (chmod 600).

This script will store a cached token after first-time authentication is
successful. This cached token can be found in the user's home directory within
a hidden directory. Token caching greatly speeds up continued executions of this
script. As of now, the token is cached at the following location:

- '~/.tp53/seshat/seshat-gmail-find-token.pickle'

If the cached token is missing, or becomes stale, then you will need to provide
your OAuth credentials file.

A typical Google developer's OAuth file is of the format:

    {
      "installed": {
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "client_id": "272111863110-csldkfjlsdkfjlksdjflksdincie.apps.googleusercontent.com",
        "client_secret": "sdlfkjsdlkjfijciejijcei",
        "project_id": "gmail-access-2398293892838",
        "redirect_uris": [
          "urn:ietf:wg:oauth:2.0:oob",
          "http://localhost"
        ],
        "token_uri": "https://oauth2.googleapis.com/token"
      }
    }

#### Server Failures

If Seshat fails to annotate the VCF file but still emails the user a response,
then this tool will emit the email body to STDERR and exit with a non-zero
status.

#### References

1. Soussi, Thierry, et al. “Recommendations for Analyzing and Reporting TP53
   Gene Variants in the High-Throughput Sequencing Era.” Human Mutation,
   vol. 35, no. 6, 2014, pp. 766–778., doi:10.1002/humu.22561.
───────
"""

import argparse
import logging
import sys
from pathlib import Path

from ._find_in_gmail import find_in_gmail

if __name__ == "__main__":
    # Preserve the docstring's line breaks when it is printed as help text.
    formatter = argparse.RawTextHelpFormatter

    cli_args = sys.argv[1:]

    parser = argparse.ArgumentParser(
        prog="find_in_gmail",
        description=__doc__,
        add_help=True,
        formatter_class=formatter,
        epilog=r"Copyright © Clint Valentine 2024",
    )

    _ = parser.add_argument(
        "--input",
        required=True,
        type=Path,
        help="The path to the original VCF which was uploaded.",
    )
    _ = parser.add_argument(
        "--output",
        required=True,
        type=Path,
        help="The path to write the TP53 annotations to.",
    )
    _ = parser.add_argument(
        "--newer-than",
        default=5,
        type=int,
        help="Limit search to emails newer than this many hours.\n(default: 5).",
    )
    _ = parser.add_argument(
        "--wait-for",
        default=200,
        type=int,
        help="Seconds to wait for an email to arrive.\n(default: 200).",
    )
    _ = parser.add_argument(
        "--credentials",
        default=None,
        type=Path,
        help="The path to the Gmail authentication credentials JSON.",
    )
    args = parser.parse_args(cli_args)

    # Log at INFO, but silence the noisy Google API discovery cache logger.
    logging.basicConfig(datefmt="[%X]", level=logging.INFO)
    logging.getLogger("googleapiclient.discovery_cache").setLevel(logging.ERROR)

    find_in_gmail(
        infile=args.input,
        output=args.output,
        newer_than=args.newer_than,
        wait_for=args.wait_for,
        credentials=args.credentials,
    )