File and Test Organization

####Data Sets####

Root: brain:~/local_projects/paladin/data_sets

Description

This directory contains:

- References
  - 6 MCBS913 data sets (FASTA and GFF)
  - The NT and AA translations of both versions of the UniProt DBs (full, and filtered to exclude the 6 data sets above)
- Reads
  - PE sets for each of the 6 MCBS913 data sets
  - SE concatenation of the previous 6 sets, used as simulated metagenomic reads (metareads.fq)
  - PE of a real metagenomic set (Jun_MW4)

All testing uses symbolic links to these files. Read-mapping index files (PAC/BWT/SA) are stored in the individual test's directory, not with the data set.
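As a minimal sketch of this linking convention, a test directory could be populated as shown here. The helper name `link_data_set` and the paths in the usage note are hypothetical and are not part of the existing scripts.

```python
from pathlib import Path


def link_data_set(data_root, test_dir, names):
    """Symlink the named data set files from data_root into test_dir.

    Hypothetical helper: index files (PAC/BWT/SA) are expected to be
    generated next to the links, inside test_dir, not in data_root.
    """
    data_root = Path(data_root).expanduser().resolve()
    test_dir = Path(test_dir).expanduser()
    test_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for name in names:
        target = data_root / name
        if not target.exists():
            raise FileNotFoundError(target)
        link = test_dir / target.name
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        link.symlink_to(target)
        links.append(link)
    return links
```

For example, `link_data_set("~/local_projects/paladin/data_sets", "my-test-dir", ["metareads.fq"])` would leave the original read file untouched and place only a link in the (illustrative) `my-test-dir`.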

####Seed Testing####

Root: brain:~/local_projects/paladin/test-seed_length

Description

Tests the relationship between the percentage of reads mapped and seed length.

Instructions

- genIndices.sh indexes all references.
- alignSeed.sh runs the tests for all single-genome read sets (1-3 below).
- alignMetagenome.sh runs the tests for the metagenome read set (4 below).

Notes

Each subdirectory under the root directory is named with a numeral identifying the read set being run against the references. Each of the 3 references is also stored within each subdirectory. Outputs are written to each directory as samstat files, which should be compiled into a single CSV file with the sam2csv script. Values are as follows (Reads, References):

1. AcidovoraxAvenaeATCC19860
   - Acidovorax_citrulli_AAC00_1_uid58429_NC_008752 (0.4%)
   - Variovorax_paradoxus_EPS_uid62107_NC_014931 (15.3%)
   - Thiomonas_intermedia_K12_uid48825_NC_014153 (31.1%)
2. EscherichiaColiStrK-12SubstrMG1655
   - Escherichia_coli_042_uid161985_NC_017626 (0.5%)
   - Yersinia_pestis_A1122_uid158119_NC_017168 (15.4%)
   - Haemophilus_parainfluenzae_T3T1_uid72801_NC_015964 (31%)
3. StaphylococcusEpidermidisATCC12228
   - Staphylococcus_pasteuri_SP1_NC_022737 (3.8%)
   - Macrococcus_caseolyticus_JCSC5402_NC_011995 (17%)
   - Bacillus_cellulosilyticus_DSM2522_NC_014829 (N/A)
4. Metagenome
   - Iterates through directories/sets above
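The compilation of samstat files into one CSV could look roughly like the sketch below. This is not the sam2csv script itself. It assumes the samstat files hold plain `samtools flagstat` text output and use a `.samstat` extension; the function names and CSV columns are hypothetical.

```python
import csv
import re
from pathlib import Path

# Matches flagstat lines such as "900 + 0 mapped (90.00% : N/A)".
# "primary mapped" and "with mate mapped ..." lines do not match.
_COUNT = re.compile(r"^(\d+) \+ (\d+) (in total|mapped) \(")


def parse_flagstat(text):
    """Return (total, mapped) QC-passed read counts from flagstat text."""
    counts = {}
    for line in text.splitlines():
        match = _COUNT.match(line)
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts["in total"], counts["mapped"]


def compile_samstats(root, out_csv):
    """Write one CSV row per *.samstat file found under root."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "total", "mapped", "mapped_pct"])
        for path in sorted(Path(root).rglob("*.samstat")):
            total, mapped = parse_flagstat(path.read_text())
            pct = 100.0 * mapped / total if total else 0.0
            writer.writerow([str(path), total, mapped, f"{pct:.2f}"])
```

Because the search is recursive, running `compile_samstats` once on a test root would pick up the samstat files from every numbered subdirectory.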

####ORF Length Testing####

Root: brain:~/local_projects/paladin/test-orf_length

Description

Tests the relationship between the percentage of reads mapped and minimum ORF length filtering. NOTE: this test is likely deprecated by the newer algorithm variants.

Instructions

- genIndices.sh indexes all references.
- alignOrfs.sh runs the tests for all single-genome read sets (1-3 below).
- alignMetagenome.sh runs the tests for the metagenome read set (4 below).

Notes

Each subdirectory under the root directory is named with a numeral identifying the read set being run against the references. Each of the 3 references is also stored within each subdirectory. Outputs are written to each directory as samstat files, which should be compiled into a single CSV file with the sam2csv script. Values are as follows (Reads, References):

1. AcidovoraxAvenaeATCC19860
   - Acidovorax_citrulli_AAC00_1_uid58429_NC_008752 (0.4%)
   - Variovorax_paradoxus_EPS_uid62107_NC_014931 (15.3%)
   - Thiomonas_intermedia_K12_uid48825_NC_014153 (31.1%)
2. EscherichiaColiStrK-12SubstrMG1655
   - Escherichia_coli_042_uid161985_NC_017626 (0.5%)
   - Yersinia_pestis_A1122_uid158119_NC_017168 (15.4%)
   - Haemophilus_parainfluenzae_T3T1_uid72801_NC_015964 (31%)
3. StaphylococcusEpidermidisATCC12228
   - Staphylococcus_pasteuri_SP1_NC_022737 (3.8%)
   - Macrococcus_caseolyticus_JCSC5402_NC_011995 (17%)
   - Bacillus_cellulosilyticus_DSM2522_NC_014829 (N/A)
4. Metagenome
   - Iterates through directories/sets above

####No Hidden Stop Count per Frame Testing####

Root: brain:~/local_projects/paladin/test-no_hidden_stop_count

Description

Via PALADIN variant 1, index all 6 frames for the combined MCBS913 dataset, as well as the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The number of frames with no hidden stop codons is then counted.

Instructions

1. Run a PALADIN index using the all-6-frame index variant
2. Run ~/repos/paladin/Scripts/countNoHiddenStop.py file.pro startLength, endLength, stepLength
3. Redirect the output to a CSV file, which will contain column headings

Notes

The results of this test can be found in "No Hidden Stop Counts.xlsx".
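The counting idea in this test can be sketched as follows. This is an illustration, not countNoHiddenStop.py itself, and it does not model the startLength, endLength, and stepLength arguments. It assumes the .pro file is FASTA-formatted, that `*` marks a stop codon in the AA sequence, and that a single trailing `*` does not count as hidden; the function names are hypothetical.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_no_hidden_stop(lines):
    """Count, per frame, the sequences that contain no internal '*'.

    Assumptions: the frame is the first header character and '*' marks
    a stop codon; a single trailing '*' is not treated as hidden.
    """
    counts = Counter()
    for header, seq in read_fasta(lines):
        if not header:
            continue  # skip records with an empty header
        frame = header[0]
        body = seq[:-1] if seq.endswith("*") else seq
        if "*" not in body:
            counts[frame] += 1
    return counts
```

Passing an open file handle, for example `count_no_hidden_stop(open("file.pro"))`, returns a `Counter` keyed by frame character.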

####Order of Likelihood of Stop Codons by Frame####

Root: brain:~/local_projects/paladin/test-stop_likelihood

Description

Via PALADIN variant 1, index all 6 frames for the combined MCBS913 dataset, as well as the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The likelihood of stop codons per frame is then reported in a matrix view.

Instructions

1. Run a PALADIN index using the all-6-frame index variant
2. Run ~/repos/paladin/Scripts/stopLikelihoodCounts.py file.pro
3. Redirect the output to a CSV file

Notes

The results of this test can be found in "Stop Stats.xlsx".

####Order of Likelihood of Stop Codons by GC Content####

Root: brain:~/local_projects/paladin/test-stoplikelihood2

Description

Via PALADIN variant 1, index all 6 frames of the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The GC content is used as the second field in the sequence header. The likelihood of stop codons per GC content is then reported in a matrix view.
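For reference, GC content is conventionally the fraction of G and C bases in a nucleotide sequence. The sketch below assumes ambiguous bases such as N are ignored; how the index variant actually computes and formats the header field is not specified here, and the function name is hypothetical.

```python
def gc_content(seq):
    """Return the fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # An empty or fully ambiguous sequence has no defined GC content;
    # this sketch reports 0.0 in that case.
    return gc / total if total else 0.0
```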

Instructions

1. Run a PALADIN index using the all-6-frame index variant and testing index protein generation function
2. Run ~/repos/paladin/Scripts/stopLikelihoodCountsGC.py file.pro Order
3. Redirect the output to a CSV file

Notes

The results of this test can be found in "Stop Stats.xlsx".

####ALL ALIGNMENT TESTS####

Root: brain:~/local_projects/paladin/test-alignXXX

Description

All alignment tests are run using the following automated pipeline:

1. Run a PALADIN index using the appropriate variant (stderr and runtime are reported in a .LOG file)
2. Run the alignment using the appropriate variant, with output redirected into a .SAM file (stderr and runtime are reported in a .LOG file)
3. Convert .SAM to .BAM
4. Flagstats are saved to a .SAMSTAT file
5. listMappedCDS.py looks up the corresponding GFF CDS entry for each mapped read in the SAM file, and saves this list to a .CDS file
6. listMappedCDS.py looks up the corresponding GFF CDS entry for each mapped read in the SAM file and, for each corresponding mapping in UniProt, saves the UniProt and RefSeq IDs to a .CDSMAP file

Notes

`cat file.cds | sort | uniq | wc -l` can be run to see the number of CDS entries corresponding to reads that were successfully mapped. Other operations can be performed on the .CDSMAP file for UniProt mapping info.
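The same unique-entry count can be obtained in Python, which also makes per-entry read counts available. This is a minimal sketch that assumes the .CDS file holds one CDS entry per line for each mapped read; the function name is hypothetical.

```python
from collections import Counter


def cds_read_counts(path):
    """Map each CDS entry in a .cds file to the number of lines listing it."""
    with open(path) as handle:
        return Counter(line.rstrip("\n") for line in handle)
```

`len(cds_read_counts("file.cds"))` should give the same number as the shell pipeline for typical files, and `cds_read_counts("file.cds").most_common(10)` lists the entries hit by the most reads.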

The results of these tests can be found in "PALADIN Test Stats.xlsx"

- Align1 - PALADIN variant 1, MCBS913 metagenome reads, UniProt DB (full and filtered), seed length 9 and 11
- Align2 - PALADIN variant 2, MCBS913 metagenome reads, UniProt DB (full and filtered), seed length 9 and 11
- BWA - BWA, MCBS913 metagenome reads, UniProt DB (full and filtered)
