Summarizing data quality/coverage across many runs #219

jeffreybarrick · 2019-10-02T13:03:34Z

Motivation: It would be very useful to have a script that can take many runs and create a dashboard for evaluating and comparing their quality/coverage.

It might:

Generate a spreadsheet/table with summary statistics, like the number of reads/bases and the % mapping.
Show thumbnail coverage graphs across genomes.
Display the UN evidence concerning how much of the genome had enough coverage for calling mutations in each sample.
etc.

Implementation: Most likely as Python/R scripts that generate HTML output. For example, they can parse the summary.json files for statistics and use breseq BAM2COV to generate input files for graphing.
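As a starting point, collecting the per-run statistics could look something like the sketch below. It only assumes that each run lives in its own folder under a common root and contains a summary.json somewhere inside it. The function names and the dotted key in the usage note are hypothetical; the real keys would need to be checked against an actual summary.json.

```python
import json
from pathlib import Path


def load_summaries(root):
    """Return {run_name: parsed JSON} for every summary.json found under root."""
    summaries = {}
    for path in sorted(Path(root).rglob("summary.json")):
        # Assumption: the first directory below root names the run/sample.
        run_name = path.relative_to(root).parts[0]
        with path.open() as handle:
            summaries[run_name] = json.load(handle)
    return summaries


def get_nested(record, dotted_key, default=None):
    """Look up 'a.b.c' in nested dicts; return default if any level is missing."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value
```

Something like `get_nested(summaries["runA"], "reads.total_reads", "NA")` would then fill one table cell, with "NA" shown whenever a run is missing that statistic (the key name here is a placeholder, not a confirmed field).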


jeffreybarrick · 2020-02-01T13:32:40Z

Here are some example summary files that can be used for testing:
https://barricklab.org/release/tmp/ADP1-summary.tgz

jeffreybarrick · 2020-02-08T13:08:20Z

HTML table as output.

Could eventually color some cells green/yellow/red to flag suspect files/samples.
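A minimal sketch of how the flagging might work, assuming a "higher is better" metric such as % mapping. The thresholds and function names are made up for illustration; sensible cutoffs would have to be chosen per column.

```python
from html import escape


def flag_color(value, warn_below, fail_below):
    """Map a 'higher is better' metric to green/yellow/red."""
    if value is None or value < fail_below:
        return "red"  # missing or clearly bad
    if value < warn_below:
        return "yellow"  # suspect, worth a look
    return "green"


def colored_cell(value, warn_below, fail_below):
    """Render one HTML table cell with a background color for its flag."""
    color = flag_color(value, warn_below, fail_below)
    shown = "NA" if value is None else f"{value:.1f}"
    return f'<td style="background-color: {color}">{escape(shown)}</td>'
```

For example, `colored_cell(80.0, warn_below=90, fail_below=70)` gives a yellow cell showing 80.0. Metrics where lower is better (dispersion, perhaps) would need the comparison flipped.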

In general, the output should have most of the same columns as the READ and REFERENCE tables generated for a single breseq run, plus additional information. Example:

https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/REL8593A_output/summary.html

Columns to include in the REFERENCE TABLE:

sample
multiple lines within sample for each reference sequence
length
average coverage
average fit coverage
dispersion of fit coverage

Columns to include in the READ TABLE:

sample
multiple lines within sample for each read file
number of reads
minimum-maximum [average] read length
% mapping
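To illustrate the "multiple lines within sample" layout, here is a rough sketch of the READ TABLE rendered as HTML, with the sample cell spanning one row per read file. The input structure and its dictionary keys (`file`, `num_reads`, `min_len`, `max_len`, `avg_len`, `pct_mapped`) are hypothetical and would need to be mapped from whatever the breseq output actually provides. The REFERENCE TABLE could follow the same pattern.

```python
from html import escape


def read_table_html(samples):
    """samples: {sample_name: [one dict per read file]} -> HTML table string."""
    header = ["sample", "read file", "reads", "read length", "% mapping"]
    rows = ["<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>"]
    for sample, entries in samples.items():
        for i, entry in enumerate(entries):
            cells = []
            if i == 0:
                # The sample name spans all rows for its read files.
                cells.append(f'<td rowspan="{len(entries)}">{escape(sample)}</td>')
            cells.append(f"<td>{escape(entry['file'])}</td>")
            cells.append(f"<td>{entry['num_reads']:,}</td>")
            # minimum-maximum [average] read length
            cells.append(
                f"<td>{entry['min_len']}-{entry['max_len']} [{entry['avg_len']:.1f}]</td>"
            )
            cells.append(f"<td>{entry['pct_mapped']:.1f}%</td>")
            rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```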

jeffreybarrick · 2020-11-25T04:09:24Z

@ginnymortensen
Here is a newer set of breseq output that, unlike the one linked above, preserves all of the output folders. The output.json files are still the main place to pull information from.

https://barricklab.org/release/tmp/Ara-1-summary.tgz

jeffreybarrick added the coding-project label Oct 2, 2019

alexayala08 self-assigned this Feb 7, 2020

jeffreybarrick assigned ginnymortensen and unassigned alexayala08 Nov 25, 2020
