Skip to content

Commit 091caba

Browse files
fubar2bgruening
andauthored
Make bed features from nonparametric continuous extreme bigwig value regions. (galaxyproject#6121)
* adding new tool bigwig_outlier_bed * remove redundant script * replace test.bw with test-data/1.bigwig to decrease test outputs below 1MB maximum for linter. * added bigtools to bio.tools for this new entry added edam * Add a proper version command by importing pybedtools * fix doi for pybigtools * Incorporate Bjoern's ideas. Single select to configure 3 possible outputs. Upper quantile now required. Chide if no low quantile and no output because low not in the choice. Help vastly expanded. * Added the bigwig metadata name as the label for bed feature name so no need for the user to supply one. * Clearing out old test data. Passes tests here but diffs in CI. Something odd. * readding fixed test outputs again again. * Ah. was overwriting a bed with the two bigwig test. * remove print leftovers * Separated python script to enable access for linting in CI * fix flake8 complaints with black on the new separate python script * fix imports * Do not make the output contig statistics table if there's no table output needed. * make the table calculations optional and mostly as a separate function since they may not be needed. * Clean up some comments. makeTableRow renamed * Update tools/bigwig_outlier_bed/.shed.yml Co-authored-by: Björn Grüning <[email protected]> * Update tools/bigwig_outlier_bed/.shed.yml Co-authored-by: Björn Grüning <[email protected]> * Clean up all field prompt text and consolidate the help text into chunks. * remove test for both qhi/qlo because qhi is not optional --------- Co-authored-by: Björn Grüning <[email protected]>
1 parent d7cada7 commit 091caba

14 files changed

+1016
-0
lines changed

tools/bigwig_outlier_bed/.shed.yml

+20
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
categories:
2+
- Sequence Analysis
3+
description: pybigtools and numpy code to find continuous runs above a high or below a low quantile cutpoint in bigwig files
4+
homepage_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/bigwig_outlier_bed
5+
long_description: |
6+
Bigwigs are great. Peaks are easy to see. Small runs of very low values not so much
7+
8+
Using numpy to segment bigwig values into regions of contiguous high and low values, this tool writes bed files
9+
containing those regions, with a score set to 1 or -1 depending on whether above the top quantile or below the
10+
low quantile.
11+
12+
Quantile values recommended are 0.01 and 0.99. They are calculated for each chromosome and vary a bit.
13+
Ideally should be estimated over the entire assembly but not feasible without sampling due to RAM hoggery.
14+
Minimum window size of 10 will give a very, very low risk of random false positives at about 0.01**10 for 0.01
15+
quantile cutoff for example.
16+
17+
name: bigwig_outlier_bed
18+
owner: iuc
19+
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/bigwig_outlier_bed
20+
type: unrestricted

tools/bigwig_outlier_bed/README.md

+31
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
## bigwig peak bed maker
2+
3+
### July 30 2024 for the VGP
4+
5+
This code will soon become a Galaxy tool, for building some of the [NIH MARBL T2T assembly polishing](https://github.com/marbl/training) tools as Galaxy workflows.
6+
7+
JBrowse2 2.12.3 update will include a plugin for optional colours to distinguish bed features, shown being tested in the screenshots below.
8+
9+
### Find and mark BigWig peaks to a bed file for display
10+
11+
In the spirit of DeepTools, but finding contiguous regions where the bigwig value is either above or below a given centile.
12+
0.99 and 0.01 for example. These quantile cut point values are found and applied over each chromosome using some [cunning numpy code](http://gregoryzynda.com/python/numpy/contiguous/interval/2019/11/29/contiguous-regions.html)
13+
14+
![image](https://github.com/fubar2/bigwig_peak_bed/assets/6016266/cdee3a2b-ae31-4282-b744-992c15fb49db)
15+
16+
![image](https://github.com/fubar2/bigwig_peak_bed/assets/6016266/59d1564b-0c34-42a3-b437-44332cf1b2f0)
17+
18+
Big differences between chromosomes 14,15,21,22 and Y in this "all contigs" view - explanations welcomed:
19+
20+
![image](https://github.com/fubar2/bigwig_peak_bed/assets/6016266/162bf681-2977-4eb8-8d6f-9dad5b3931f8)
21+
22+
23+
[pybedtools](https://github.com/jackh726/bigtools) is used for the bigwig interface. Optionally allow
24+
multiple bigwigs to be processed into a single bed - the bed features have the bigwig name in the label for viewing.
25+
26+
### Note on quantiles per chromosome rather than quantiles for the whole bigwig
27+
28+
It is just not feasible to hold all contigs in the entire decoded bigwig in RAM to estimate quantiles. It may be
29+
better to sample across all chromosomes so as not to lose any systematic differences between them - the current method will hide those
30+
differences unfortunately. Sampling might be possible. Looking at the actual quantile values across a couple of test bigwigs suggests that
31+
there is not much variation between chromosomes but there's now a tabular report to check them for each input bigwig.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
"""
2+
Ross Lazarus June 2024 for VGP
3+
Bigwigs are great, but hard to reliably "see" small low coverage or small very high coverage regions.
4+
Colouring in JB2 tracks will need a new plugin, so this code will find bigwig regions above and below a chosen percentile point.
5+
0.99 and 0.01 work well in testing with a minimum span of 10 bp.
6+
Multiple bigwigs **with the same reference** can be combined - bed segments will be named appropriately
7+
Combining multiple references works but is silly because only display will rely on one reference so others will not be shown...
8+
Tricksy numpy method from http://gregoryzynda.com/python/numpy/contiguous/interval/2019/11/29/contiguous-regions.html
9+
takes about 95 seconds for a 17MB test wiggle
10+
JBrowse2 bed normally displays ignore the score, so could provide separate low/high bed file outputs as an option.
11+
Update june 30 2024: wrote a 'no-build' plugin for beds to display red/blue if >0/<0 so those are used for scores
12+
Bed interval naming must be short for JB2 but needs input bigwig name and (lo or hi).
13+
"""
14+
15+
import argparse
16+
import copy
17+
import os
18+
import sys
19+
from pathlib import Path
20+
21+
import numpy as np
22+
import pybigtools
23+
24+
25+
class findOut:
26+
27+
def __init__(self, args):
28+
self.bwnames = args.bigwig
29+
self.bwlabels = args.bigwiglabels
30+
self.bedwin = args.minwin
31+
self.qlo = args.qlo
32+
self.qhi = args.qhi
33+
self.outbeds = args.outbeds
34+
self.bedouthi = args.bedouthi
35+
self.bedoutlo = args.bedoutlo
36+
self.bedouthilo = args.bedouthilo
37+
self.tableoutfile = args.tableoutfile
38+
self.bedwin = args.minwin
39+
self.qhi = args.qhi
40+
self.qlo = args.qlo
41+
nbw = len(args.bigwig)
42+
nlab = len(args.bigwiglabels)
43+
if nlab < nbw:
44+
self.bwlabels += ["Nolabel"] * (nbw - nlab)
45+
self.makeBed()
46+
47+
def processVals(self, bw, isTop):
48+
"""
49+
idea from http://gregoryzynda.com/python/numpy/contiguous/interval/2019/11/29/contiguous-regions.html
50+
Fast segmentation into regions by taking np.diff on the boolean array of over (under) cutpoint indicators in bwex.
51+
This only gives non-zero values at the segment boundaries where there's a change, so those zeros are all removed in bwexdnz
52+
leaving an array of segment start/end positions. That's twisted around into an array of start/end coordinates.
53+
Magical. Fast. Could do the same for means or medians over windows for sparse bigwigs like repeat regions.
54+
"""
55+
if isTop:
56+
bwex = np.r_[False, bw >= self.bwtop, False] # extend with 0s
57+
else:
58+
bwex = np.r_[False, bw <= self.bwbot, False]
59+
bwexd = np.diff(bwex)
60+
bwexdnz = bwexd.nonzero()[0]
61+
bwregions = np.reshape(bwexdnz, (-1, 2))
62+
return bwregions
63+
64+
def writeBed(self, bed, bedfname):
65+
"""
66+
potentially multiple
67+
"""
68+
bed.sort()
69+
beds = ["%s\t%d\t%d\t%s\t%d" % x for x in bed]
70+
with open(bedfname, "w") as bedf:
71+
bedf.write("\n".join(beds))
72+
bedf.write("\n")
73+
74+
def makeTableRow(self, bw, bwlabel, chr):
75+
"""
76+
called for every contig, but messy inline
77+
"""
78+
bwmean = np.mean(bw)
79+
bwstd = np.std(bw)
80+
bwmax = np.max(bw)
81+
nrow = np.size(bw)
82+
bwmin = np.min(bw)
83+
row = "%s\t%s\t%d\t%f\t%f\t%f\t%f" % (
84+
bwlabel,
85+
chr,
86+
nrow,
87+
bwmean,
88+
bwstd,
89+
bwmin,
90+
bwmax,
91+
)
92+
if self.qhi is not None:
93+
row += "\t%f" % self.bwtop
94+
else:
95+
row += "\t"
96+
if self.qlo is not None:
97+
row += "\t%f" % self.bwbot
98+
else:
99+
row += "\t"
100+
return row
101+
102+
def makeBed(self):
103+
bedhi = []
104+
bedlo = []
105+
bwlabels = self.bwlabels
106+
bwnames = self.bwnames
107+
if self.tableoutfile:
108+
restab = ["bigwig\tcontig\tn\tmean\tstd\tmin\tmax\tqtop\tqbot"]
109+
for i, bwname in enumerate(bwnames):
110+
bwlabel = bwlabels[i].replace(" ", "")
111+
fakepath = "in%d.bw" % i
112+
if os.path.isfile(fakepath):
113+
os.remove(fakepath)
114+
p = Path(fakepath)
115+
p.symlink_to(bwname) # required by pybigtools (!)
116+
bwf = pybigtools.open(fakepath)
117+
chrlist = bwf.chroms()
118+
chrs = list(chrlist.keys())
119+
chrs.sort()
120+
for chr in chrs:
121+
bw = bwf.values(chr)
122+
bw = bw[~np.isnan(bw)] # some have NaN if parts of a contig not covered
123+
if self.qhi is not None:
124+
self.bwtop = np.quantile(bw, self.qhi)
125+
bwhi = self.processVals(bw, isTop=True)
126+
for j, seg in enumerate(bwhi):
127+
if seg[1] - seg[0] >= self.bedwin:
128+
bedhi.append((chr, seg[0], seg[1], "%s_hi" % (bwlabel), 1))
129+
if self.qlo is not None:
130+
self.bwbot = np.quantile(bw, self.qlo)
131+
bwlo = self.processVals(bw, isTop=False)
132+
for j, seg in enumerate(bwlo):
133+
if seg[1] - seg[0] >= self.bedwin:
134+
bedlo.append((chr, seg[0], seg[1], "%s_lo" % (bwlabel), -1))
135+
if self.tableoutfile:
136+
row = self.makeTableRow(bw, bwlabel, chr)
137+
restab.append(copy.copy(row))
138+
if self.tableoutfile:
139+
stable = "\n".join(restab)
140+
with open(self.tableoutfile, "w") as t:
141+
t.write(stable)
142+
t.write("\n")
143+
some = False
144+
if self.qlo:
145+
if self.outbeds in ["outall", "outlo", "outlohi"]:
146+
self.writeBed(bedlo, self.bedoutlo)
147+
some = True
148+
if self.qhi:
149+
if self.outbeds in ["outall", "outlohi", "outhi"]:
150+
self.writeBed(bedhi, self.bedouthi)
151+
some = True
152+
if self.outbeds in ["outall", "outhilo"]:
153+
allbed = bedlo + bedhi
154+
self.writeBed(allbed, self.bedouthilo)
155+
some = True
156+
if not some:
157+
sys.stderr.write(
158+
"Invalid configuration - no output could be created. Was qlo missing and only low output requested for example?"
159+
)
160+
sys.exit(2)
161+
162+
163+
if __name__ == "__main__":
164+
parser = argparse.ArgumentParser()
165+
a = parser.add_argument
166+
a("-m", "--minwin", default=10, type=int)
167+
a("-l", "--qlo", default=None, type=float)
168+
a("-i", "--qhi", default=None, type=float)
169+
a("--bedouthi", default=None)
170+
a("--bedoutlo", default=None)
171+
a("--bedouthilo", default=None)
172+
a("-w", "--bigwig", nargs="+")
173+
a("-n", "--bigwiglabels", nargs="+")
174+
a("-o", "--outbeds", default="outhilo", help="optional high and low combined bed")
175+
a("-t", "--tableoutfile", default=None)
176+
args = parser.parse_args()
177+
findOut(args)

0 commit comments

Comments
 (0)