Summarizing data quality/coverage across many runs #219

Open
jeffreybarrick opened this issue Oct 2, 2019 · 3 comments

@jeffreybarrick

Motivation: It would be very useful to have a script that can take many runs and create a dashboard for evaluating and comparing their quality/coverage.

It might:

  • Generate a spreadsheet/table with summary statistics, like the number of reads/bases and the %mapping.
    – Show thumbnail coverage graphs across genomes
  • Display the UN (unknown base) evidence to show how much of the genome had enough coverage for calling mutations in each sample.
  • etc.

Implementation: Most likely Python/R scripts that generate HTML output. For example, they can parse the summary.json files for statistics and use breseq BAM2COV to generate input files for graphing.
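
As a starting point, the collection step could look something like the sketch below. It assumes each run lives in its own folder under a common root and that the summary file sits at `output/summary.json` inside it; that layout, the `load_summaries` name, and the glob pattern are assumptions for illustration, not something confirmed in this thread. The JSON contents are returned as-is, with no assumptions about their keys.

```python
import json
from pathlib import Path


def load_summaries(root, pattern="*/output/summary.json"):
    """Return {run_name: parsed_json} for every summary file found under root.

    The default pattern is an assumed layout (one folder per run);
    adjust it to wherever the summary files actually live.
    """
    root = Path(root)
    summaries = {}
    for path in sorted(root.glob(pattern)):
        # Use the first folder below root as the run/sample name.
        run_name = path.relative_to(root).parts[0]
        with path.open() as handle:
            summaries[run_name] = json.load(handle)
    return summaries
```

Calling `load_summaries("path/to/runs")` would then give one dictionary entry per run, which later steps can turn into table rows.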

@jeffreybarrick

Here are some example summary files that can be used for testing:
https://barricklab.org/release/tmp/ADP1-summary.tgz

@alexayala08 self-assigned this Feb 7, 2020

@jeffreybarrick

HTML table as output.

Could eventually color some cells green/yellow/red to flag suspect files/samples.
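
One possible way to do the flagging is a small helper that maps a value to a color and wraps it in a table cell. The thresholds, the "lower is worse" direction, and the function names below are all hypothetical; real cutoffs would need to be chosen per column.

```python
from html import escape


def flag_color(value, warn_below, fail_below):
    """Return 'red', 'yellow', or 'green' for a metric where lower is worse."""
    if value < fail_below:
        return "red"
    if value < warn_below:
        return "yellow"
    return "green"


def flagged_cell(value, warn_below, fail_below, fmt="{:.1f}"):
    """Render one <td> whose background color reflects the flag."""
    color = flag_color(value, warn_below, fail_below)
    text = escape(fmt.format(value))
    return f'<td style="background-color: {color}">{text}</td>'
```

For instance, `flagged_cell(82.4, warn_below=90, fail_below=70)` would produce a yellow cell for a % mapping value, using those made-up cutoffs.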

In general, the output should have most of the same columns as the READ and REFERENCE tables generated for a single breseq run, plus some additional information. Example:

https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/REL8593A_output/summary.html

Columns to include in the REFERENCE TABLE:

  • sample
  • multiple lines within sample for each reference sequence
  • length
  • average coverage
  • average fit coverage
  • dispersion of fit coverage

Columns to include in the READ TABLE:

  • sample
  • multiple lines within sample for each read file
  • number of reads
  • minimum-maximum [average] read length
  • % mapping
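
The READ TABLE layout above could be sketched roughly as follows, with the sample name spanning one row per read file. The input structure and its keys (`file`, `reads`, `min_len`, `max_len`, `avg_len`, `pct_mapped`) are hypothetical; they would need to be filled in from whatever the summary files actually contain.

```python
from html import escape


def read_length_cell(min_len, max_len, avg_len):
    """Format read lengths as 'minimum-maximum [average]'."""
    return f"{min_len}-{max_len} [{avg_len:.1f}]"


def read_table_html(samples):
    """Build an HTML READ table.

    samples: {sample_name: [per-read-file dicts]} using the hypothetical
    keys file, reads, min_len, max_len, avg_len, pct_mapped.
    """
    header = ["sample", "read file", "reads", "read length", "% mapping"]
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>")
    for sample, read_files in samples.items():
        for i, rf in enumerate(read_files):
            cells = []
            if i == 0:
                # The sample name spans all of its read-file rows.
                span = len(read_files)
                cells.append(f'<td rowspan="{span}">{escape(sample)}</td>')
            length = read_length_cell(rf["min_len"], rf["max_len"], rf["avg_len"])
            cells.append(f"<td>{escape(rf['file'])}</td>")
            cells.append(f"<td>{rf['reads']:,}</td>")
            cells.append(f"<td>{length}</td>")
            cells.append(f"<td>{rf['pct_mapped']:.1f}%</td>")
            lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The REFERENCE TABLE could follow the same pattern, with one row per reference sequence instead of per read file.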

@jeffreybarrick

@ginnymortensen
Here is a newer set of breseq output that, unlike the one linked above, preserves all of the output folders. The output.json files are still the main place to pull information from.

https://barricklab.org/release/tmp/Ara-1-summary.tgz
