LCDB test data

This repository stores small amounts of example data that can be used for testing along with the code to reproduce the data.

Downloading data

See the data directory in this repo to download some already-prepared data from Drosophila melanogaster. Use the URL format https://github.com/lcdb/lcdb-test-data/blob/master/$PATH?raw=true to download, where $PATH is the file's path relative to the repository root. For example, to download the file data/rnaseq_samples/sample1/sample1.tiny_R1.fastq.gz, use the following URL:

https://github.com/lcdb/lcdb-test-data/blob/master/data/rnaseq_samples/sample1/sample1.tiny_R1.fastq.gz?raw=true
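The URL pattern can also be wrapped in a small helper. This is only a minimal Python sketch: the names raw_url and download are illustrative and are not part of this repository, and download needs network access.

```python
from urllib.request import urlretrieve

BASE = "https://github.com/lcdb/lcdb-test-data/blob/master"


def raw_url(path):
    """Build the download URL for a file path relative to the repo root."""
    return f"{BASE}/{path.lstrip('/')}?raw=true"


def download(path, dest):
    """Fetch one file to a local destination (requires network access)."""
    urlretrieve(raw_url(path), dest)
    return dest
```

Calling raw_url("data/rnaseq_samples/sample1/sample1.tiny_R1.fastq.gz") gives the example URL shown above.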

Building data sets

The Snakefile assumes a conda environment called lcdb-test-data, created like this (assuming the bioconda channel is set up):

conda create -n lcdb-test-data --file requirements.txt python=3

Run the Snakefile as normal. Note that this will download and create tens of GB of data, so specify a working directory with plenty of free space (use Snakemake's --directory or -d argument).
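Before starting a run, it may help to confirm that the chosen working directory has enough free space. The following is a hedged sketch using only the Python standard library; the function name has_room and any threshold you pass are assumptions, since the exact space needed is not stated here.

```python
import shutil


def has_room(directory, required_gb):
    """Return True if `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1024**3
```

For example, has_room("/path/to/full/data", 50) checks for 50 GB, where 50 is an arbitrary illustrative value.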

You can use the WRAPPER_SLURM file to submit the workflow to a SLURM cluster. For example, on NIH's Biowulf:

sbatch WRAPPER_SLURM Snakefile -d /path/to/full/data

When complete, run the cp-data-to-repo.sh script to copy only the small example data into this repo, and then commit the changes.

Strategy

The full reference genome, annotations, and transcriptome are downloaded. FASTQs are aligned to the full reference. The resulting BAMs are parsed and downsampled across the region indicated in the LIMITS.bed file (created by the workflow; configured in the limits rule) in such a way that read pairs are handled correctly.

The reference genome, transcriptome, and annotations are similarly subset to the region indicated in LIMITS.bed.
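To illustrate the idea of restricting annotations to one region, here is a minimal Python sketch. It is not the workflow's actual code: the function names, the chromosome name, and the coordinates are all hypothetical, and it assumes a feature is kept only when it lies entirely inside the region (the workflow may treat partial overlaps differently).

```python
def parse_bed_line(line):
    """Parse the first three BED columns: chrom, 0-based start, exclusive end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def gtf_within(gtf_lines, region):
    """Yield GTF lines whose feature lies entirely inside the region."""
    chrom, start, end = region
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # GTF coordinates are 1-based and inclusive; convert to 0-based half-open.
        f_start, f_end = int(fields[3]) - 1, int(fields[4])
        if fields[0] == chrom and f_start >= start and f_end <= end:
            yield line


# Hypothetical region and annotation lines, for illustration only.
region = parse_bed_line("chr2L\t1000\t2000\n")
gtf = [
    'chr2L\tsrc\texon\t1001\t1500\t.\t+\t.\tgene_id "a";\n',
    'chr2L\tsrc\texon\t1900\t2100\t.\t+\t.\tgene_id "b";\n',
    'chr3R\tsrc\texon\t1200\t1300\t.\t+\t.\tgene_id "c";\n',
]
kept = list(gtf_within(gtf, region))
```

Here only the first feature is kept: the second extends past the region's end and the third is on a different chromosome.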

To avoid too-sparse data, the BAMs are first subset by the restricted region and then downsampled. The amount of downsampling and the ratio of mapped-to-unmapped reads are set by the mapped_n_config and unmapped_n_config dictionaries. There are "small" and "tiny" versions of each.
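One common way to downsample without splitting read pairs is to sample read names rather than individual records, then keep every record that carries a chosen name. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not a copy of the Snakefile's logic; the function name, the tuple layout, and the toy read names are hypothetical.

```python
import random


def downsample_pairs(reads, n, seed=0):
    """Keep all mates of n randomly chosen read names.

    `reads` is an iterable of (read_name, record) tuples; both mates of a
    pair share the same read_name.
    """
    reads = list(reads)
    # Sort the unique names so a fixed seed gives a reproducible choice.
    names = sorted({name for name, _ in reads})
    rng = random.Random(seed)
    chosen = set(rng.sample(names, min(n, len(names))))
    return [(name, rec) for name, rec in reads if name in chosen]


# Toy input: three read pairs, each with an R1 and an R2 record.
mapped = [
    ("r1", "R1"), ("r1", "R2"),
    ("r2", "R1"), ("r2", "R2"),
    ("r3", "R1"), ("r3", "R2"),
]
subset = downsample_pairs(mapped, n=2)
```

Because selection happens at the level of read names, every name that survives keeps both of its mates. Running the same function separately on mapped and unmapped reads, with different values of n, would control the mapped-to-unmapped ratio.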

These newly-subset FASTQs are then re-aligned to the genome.

In some cases, you may find that tests do not have enough reads. In this case, change the values in those dictionaries and then force-rerun the chipseq_small_fastq and rnaseq_small_fastq rules:

snakemake -d /path/to/output --forcerun chipseq_small_fastq rnaseq_small_fastq

A slightly more extreme adjustment would be to change the coordinates in the limits rule and then force-rerun that rule:

snakemake -d /path/to/output --forcerun limits

RNA-seq data

RNA-seq test data currently consists of four 48-bp paired-end (PE) samples from GEO accession GSE49587, an RNA-seq experiment on WT and Smn mutant Drosophila larvae.

label	SRA accession	treatment
sample1	SRR948304	WT, rep 1
sample2	SRR948305	WT, rep 2
sample3	SRR948306	Smn mutant, rep 1
sample4	SRR948307	Smn mutant, rep 2

ChIP-seq data

ChIP-seq test data consists of three IP/input pairs from GEO accession GSE38594.

label	SRA accession	description
input1	SRR504958	wing disc input, rep1
input2	SRR504959	wing disc input, rep2
input3	SRR504959	embryo input, rep1
ip1	SRR504955	GAF ChIP in wing disc, rep1
ip2	SRR504956	GAF ChIP in wing disc, rep2
ip3	SRR504946	GAF ChIP in embryo, rep1
ip4	SRR504947	GAF ChIP in embryo, rep2
