rosewater assigns genes to (super-) enhancer output from ROSE in an expression-aware manner. It allows users to set a TPM threshold to filter genes that are not expressed on a sample-by-sample basis.
rosewater
can be installed via pip. For use, it requires bedtools
be available on your PATH.
pip install rosewater
rosewater
is fairly simple to use. It requires a GENCODE annotation GTF file, a TSV file of TPMs with the gene name column named 'gene' (these should match the 'gene_name' attributes in the annotation GTF), the name of the sample column in the TPM file, and an output file from ROSE. Optionally, users can set a TPM threshold (set to 5 by default) for filtering out lowly/non-expressed genes prior to assignment.
Usage: rosewater [OPTIONS]
rosewater assigns genes to ROSE output in an expression-aware manner.
Options:
-a, --annotation PATH Path to annotation GTF file. [required]
-s, --sample TEXT Sample name that should match a column in the
TPM file. [required]
-e, --enh_file PATH Output from ROSE ending with
'ENHANCER_TO_GENE.txt'. [required]
-t, --tpm_file PATH A file containing a matrix of TPMs with genes as
rows and samples as columns. The gene label
column should be named 'gene'. [required]
-th, --tpm_threshold FLOAT The minimum TPM to retain genes for assignment.
[default: 5]
-o, --output_dir PATH The output directory. [default:
./EnhancerGeneAssignments]
--version Show the version and exit.
-h, --help Show this message and exit.
Two output files will be generated, named after the ROSE enhancer input file appended with either .rosewater.GeneAssignment.log
or .rosewater.GeneAssignment.bed
. The log file will contain useful stats such as how many TSSes are filtered by the TPM threshold, how many TSSes overlap each enhancer, the average enhancer size, and how many assignments change from the original ROSE assignments.
The BED-like file will contain the assignments for each enhancer. Two assignments are made for each enhancer - one utilizing all TSSes in the annotation file that meet the TPM threshold and another utilizing only the protein-coding TSSes. These assignments are the last 4 columns of the file. The additional columns are fairly self-explanatory. In short, they contain the overlapping TSSes, the closest TSS using all transcripts that meet the TPM threshold, the closest TSS using only protein-coding transcripts that meet the TPM threshold, and the TPMs for each of those.
The original ROSE gene mapper just assigns the TSS that is closest to the center of the enhancer. rosewater
takes a more sophisticated (and therefore complicated approach):
- If the enhancer overlaps no TSSes for a gene that meets the TPM threshold:
- The "final_allgene_assignment" will be set to the gene that meets the TPM threshold for the closest TSS while "final_proteincoding_assignment" will be set to the gene that meets the TPM threshold for the closest 'protein_coding' TSS.
- If the enhancer overlaps a single TSS for a gene that meets the TPM threshold:
- If the 'gene_type' of the gene is
protein_coding
, the "final_allgene_assignment" and "final_proteincoding_assignment" will both be set to that gene. - If the 'gene_type' of the gene is not
protein_coding
, the "final_allgene_assignment" will be set to that gene while the "final_proteincoding_assignment" will be set to the gene for the closest 'protein_coding' TSS.
- If the 'gene_type' of the gene is
- If the enhancer overlaps two or more TSS for a gene that meets the TPM threshold:
- If the 'gene_type' of the most highly-expressed gene is
protein_coding
, the "final_allgene_assignment" and "final_proteincoding_assignment" will both be set to that gene. - If the 'gene_type' of the most highly-expressed gene is not
protein_coding
, the "final_allgene_assignment" will be set to that gene while the "final_proteincoding_assignment" will be set to the most highly-expressed overlapping 'protein_coding' gene. If there are no overlapping TSSes for 'protein_coding' genes, the "final_proteincoding_assignment" will be set to the gene for the closest 'protein_coding' TSS.
- If the 'gene_type' of the most highly-expressed gene is
Users are free to use whichever assignment they feel is most appropriate for their use case.
Users may get a warning like RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode
depending on their version of python. This can be safely ignored. It stems from pybedtools
and should be fixed in their next release.
Feel free to submit a pull request or open an issue if you have an idea to enhance this tool.
rosewater
is available under the GNU-GPLv3 license. It is provided as-is, with no warranty or guarantees.