Skip to content

Commit

Permalink
Microsatbed: new tool for reporting short tandem repeats as bed track…
Browse files Browse the repository at this point in the history
… features. (galaxyproject#6145)

* Preparing draft PR

* typo

* typo

* redundant copy

* fix flakery

* sheesh. local flake8 was fine, I swear....

* Add comments explaining two non-obvious issues being dealt with - 1 based reporting and decorated sorting for bed output

* flake8 strikes again.

* flake8 a trailing space. Oy.

* added a native pytrf mode to run findstr and make csv or tsv or gtf format output for all perfect STRs.
This makes the tool more generic and makes better use of pytrf...

* update readme with stuff from the tool help

* forgot the new test output for the new native pytrf findstr command line test

* make native test output smaller than 43MB. Eeesh.

* add missing smaller test fa

* remove bogus print

* Add test for built-in genome and paraphenaliae

* reverted the inbuilt test because cannot get the output labelled informatively about the provenance otherwise.
Cannot pass the element_identifier from a test parameter it seems - it's a built-in so not surprisingly over-ridden.

* rationalise minima for everything.
fix dimer minimum of 1
fix tests to conform to new structures

* make bed sample small enough

* add option for windowed density bigwig output with selectable window size for size selected and specified motifs
not so easy for native pytrf operation

* fix flake8 issue

* more flakery

* remove pybigtools to use ucsc-bedgraphtobigwig on a bedgraph instead
jbrowse2 simply will not read the bigwigs made using pybigtools :(

* Fixes suggested by Bjoern's review

* Update tools/microsatbed/microsatbed.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update microsatbed.xml

* add help for windowed density bigwig option

* fix logic for multiple flags
add contents assertions to all tests

* remove bogus old merge mark

---------

Co-authored-by: Björn Grüning <[email protected]>
  • Loading branch information
fubar2 and bgruening authored Jul 21, 2024
1 parent fff23da commit 275acb7
Show file tree
Hide file tree
Showing 16 changed files with 37,732 additions and 0 deletions.
17 changes: 17 additions & 0 deletions tools/microsatbed/.shed.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
name: microsatbed
owner: iuc
categories:
- Sequence Analysis
description: Select microsatellites for a bed file
homepage_url: https://github.com/lmdu/pytrf
long_description: |
See https://pytrf.readthedocs.io/en/latest/ for the pytrf documentation.
A Tandem repeat (TR) in genomic sequence is a set of adjacent short DNA sequence repeated consecutively. The core sequence or repeat unit is generally called motif.
According to the motif length, tandem repeats can be classified as microsatellites and minisatellites. Microsatellites are also known as simple sequence repeats (SSRs)
or short tandem repeats (STRs) with motif length of 1-6 bp. Minisatellites are also sometimes referred to as variable number of tandem repeats (VNTRs) has longer
motif length than microsatellites. Pytrf is a lightweight Python C extension for identification of tandem repeats, for both exact or perfect SSRs.
It also can find generic tandem repeats with any size of motif, such as with maximum motif length of 100 bp. Additionally, it has capability of finding approximate or imperfect tandem repeats.
A fasta file must be supplied for processing. Different subsets of STR may be selected for output. Perfect STRs are the default, but any combination
with one or more of pefect, approxinate and generic. Designed to build some of the microsatellite tracks from https://github.com/arangrhie/T2T-Polish/tree/master/pattern for the VGP.
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/microsatbed
type: unrestricted
53 changes: 53 additions & 0 deletions tools/microsatbed/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
## microsatellites to bed features

**Convert short repetitive sequences to bed features**

Microsatellites are usually defined as repeated short DNA patterns in an unbroken sequence.
A microsatellite pattern or *motif* can be any combination nucleotides, typically from 1 to 6nt in length.

This tool allows microsatellite and related features to be selected from a fasta sequence input file, and output into a single bed track, suitable for viewing in a genome browser such as JBrowse2.

All motifs of selected lengths can be reported as individual features in the output bed file, or specific motifs can be provided and all
others will be ignored. In all cases, a minimum required number of repeats can be specified. For example, requiring 2 or more repeats of the trimer *ACG* will report
every sequence of *ACGACG* or *ACGACGACG* or *ACGACGACGACG* and so on, as individual bed features. Similarly, requiring 3 repeats of any trimer will
report every distinct 3 nucleotide pattern, including *ACGACGACG* as well as every other unique 3 nucleotide pattern with 3 sequential repeats or more such, as "CTCCTCCTC*.

For other output formats, the pytrf native command line *findstr* can be used to produce a gff, csv or tsv output containing all exact short tandem repeats, as
described at the end of https://pytrf.readthedocs.io/en/latest

A fasta file must be supplied for processing. A built in genome can be selected, or a fasta file of any kind can be selected from the current history. Note that all
symbols are treated as valid nucleotides by pytrf, so extraneous characters such as *-* or *N* in the input fasta may appear as unexpected bed features. Lower case fasta symbols will be converted
to uppercase, to prevent them being reported as distinct motifs.


**Filter motifs by length**

The default tool form setting is to select all dimer motif patterns.

Additional motif lengths from 1 to 6nt can be selected in the multiple-select drop-down list. All features will be returned in a single bed file. For each selected motif length,
the minimum number of repeats required for reporting can be adjusted. **Tandem repeats** are defined as at least 2 of any pattern. This tool allows singleton motifs to be reported,
so is not restricted to short tandem repeats (STR)

**Filter motifs by pattern**

This option allows a motif pattern to be specified as a text string such as *CG* or *ATC*. Multiple motifs can be specified as a comma separated string such as *CG,ATC*.
All features will be returned as a single bed file.

The minimum number of repeats for all motifs can be set to match specific requirements.

For example, technical sequencing read bias may be influenced by the density of specific dimers, whether they are repeated or not
such as in https://github.com/arangrhie/T2T-Polish/tree/master/pattern

**Run pytrf findstr to create a csv, tsv or gff format output with all perfect STR**

This selection runs the pytrf *findstr* option to create gff/csv/tsv outputs as described at the end of https://pytrf.readthedocs.io/en/latest/.

Quoted here:

*A Tandem repeat (TR) in genomic sequence is a set of adjacent short DNA sequence repeated consecutively. The core sequence or repeat unit is generally called motif.
According to the motif length, tandem repeats can be classified as microsatellites and minisatellites. Microsatellites are also known as simple sequence repeats (SSRs)
or short tandem repeats (STRs) with motif length of 1-6 bp. Minisatellites are also sometimes referred to as variable number of tandem repeats (VNTRs) has longer motif length than microsatellites.
Pytrf is a lightweight Python C extension for identification of tandem repeats. The pytrf enables to fastly identify both exact or perfect SSRs.
It also can find generic tandem repeats with any size of motif, such as with maximum motif length of 100 bp. Additionally, it has capability of finding approximate or imperfect tandem repeats*


18 changes: 18 additions & 0 deletions tools/microsatbed/all_fasta.loc.sample
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#This file lists the locations and dbkeys of all the fasta files
#under the "genome" directory (a directory that contains a directory
#for each build). The script extract_fasta.py will generate the file
#all_fasta.loc. This file has the format (white space characters are
#TAB characters):
#
#<unique_build_id> <dbkey> <display_name> <file_path>
#
#So, all_fasta.loc could look something like this:
#
#apiMel3 apiMel3 Honeybee (Apis mellifera): apiMel3 /path/to/genome/apiMel3/apiMel3.fa
#hg19canon hg19 Human (Homo sapiens): hg19 Canonical /path/to/genome/hg19/hg19canon.fa
#hg19full hg19 Human (Homo sapiens): hg19 Full /path/to/genome/hg19/hg19full.fa
#
#Your all_fasta.loc file should contain an entry for each individual
#fasta file. So there will be multiple fasta files for each build,
#such as with hg19 above.
#
127 changes: 127 additions & 0 deletions tools/microsatbed/find_str.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
import argparse
import subprocess

import pytrf # 1.3.0
from pyfastx import Fastx # 0.5.2

"""
Allows all STR or those for a subset of motifs to be written to a bed file
Designed to build some of the microsatellite tracks from https://github.com/arangrhie/T2T-Polish/tree/master/pattern for the VGP.
"""


def getDensity(name, bed, chrlen, winwidth):
"""
pybigtools can write bigwigs and they are processed by other ucsc tools - but jb2 will not read them.
Swapped the conversion to use a bedgraph file processed by bedGraphToBigWig
"""
nwin = int(chrlen / winwidth)
d = [0.0 for x in range(nwin + 1)]
for b in bed:
nt = b[5]
bin = int(b[1] / winwidth)
d[bin] += nt
bedg = [
(name, (x * winwidth), ((x + 1) * winwidth) - 1, float(d[x]))
for x in range(nwin + 1)
if (x + 1) * winwidth <= chrlen
]
return bedg


def write_ssrs(args):
"""
The integers in the call change the minimum repeats for mono-, di-, tri-, tetra-, penta-, hexa-nucleotide repeats
ssrs = pytrf.STRFinder(name, seq, 10, 6, 4, 3, 3, 3)
NOTE: Dinucleotides GA and AG are reported separately by https://github.com/marbl/seqrequester.
The reversed pair STRs are about as common in the documentation sample.
Sequence read bias might be influenced by GC density or some other specific motif.
"""
bed = []
wig = []
chrlens = {}
specific = None
if args.specific:
specific = args.specific.upper().split(",")
fa = Fastx(args.fasta, uppercase=True)
for name, seq in fa:
chrlen = len(seq)
chrlens[name] = chrlen
cbed = []
for ssr in pytrf.STRFinder(
name,
seq,
args.monomin,
args.dimin,
args.trimin,
args.tetramin,
args.pentamin,
args.hexamin,
):
row = (
ssr.chrom,
ssr.start,
ssr.end,
ssr.motif,
ssr.repeat,
ssr.length,
)
if args.specific and ssr.motif in specific:
cbed.append(row)
elif args.mono and len(ssr.motif) == 1:
cbed.append(row)
elif args.di and len(ssr.motif) == 2:
cbed.append(row)
elif args.tri and len(ssr.motif) == 3:
cbed.append(row)
elif args.tetra and len(ssr.motif) == 4:
cbed.append(row)
elif args.penta and len(ssr.motif) == 5:
cbed.append(row)
elif args.hexa and len(ssr.motif) == 6:
cbed.append(row)
if args.bigwig:
w = getDensity(name, cbed, chrlen, args.winwidth)
wig += w
bed += cbed
if args.bigwig:
wig.sort()
bedg = ["%s %d %d %.2f" % x for x in wig]
with open("temp.bedg", "w") as bw:
bw.write("\n".join(bedg))
chroms = ["%s\t%s" % (x, chrlens[x]) for x in chrlens.keys()]
with open("temp.chromlen", "w") as cl:
cl.write("\n".join(chroms))
cmd = ["bedGraphToBigWig", "temp.bedg", "temp.chromlen", args.bed]
subprocess.run(cmd)
else:
bed.sort()
obed = ["%s\t%d\t%d\t%s_%d\t%d" % x for x in bed]
with open(args.bed, "w") as outbed:
outbed.write("\n".join(obed))
outbed.write("\n")


if __name__ == "__main__":
parser = argparse.ArgumentParser()
a = parser.add_argument
a("--di", action="store_true")
a("--tri", action="store_true")
a("--tetra", action="store_true")
a("--penta", action="store_true")
a("--hexa", action="store_true")
a("--mono", action="store_true")
a("--dimin", default=2, type=int)
a("--trimin", default=2, type=int)
a("--tetramin", default=2, type=int)
a("--pentamin", default=2, type=int)
a("--hexamin", default=2, type=int)
a("--monomin", default=2, type=int)
a("-f", "--fasta", default="humsamp.fa")
a("-b", "--bed", default="humsamp.bed")
a("--bigwig", action="store_true")
a("--winwidth", default=128, type=int)
a("--specific", default=None)
a("--minreps", default=2, type=int)
args = parser.parse_args()
write_ssrs(args)
Loading

0 comments on commit 275acb7

Please sign in to comment.