-
Notifications
You must be signed in to change notification settings - Fork 7
Discussion Items
1: BLAST improvements like RAPSearch (and LAST?) improve orders-of-magnitude on BLAST speeds - is PALADIN still novel enough? (See Gittir conversation). Should we incorporate any of these elements in, like a reduced amino acid set?
2: Currently, PALADIN accounts for the issue of not always choosing the proper frame during ORF detection (i.e. when the stop codon is outside the read) by indexing the reference in all six frames for each CDS entry, - the effect of this is regardless of what frame the codons of the read get translated to AA in, that AA sequence will map to one of the 6 reference entries. However, a primary metagenomic goal is to be able to align reads against a protein reference database, like UniProt. These databases are typically distributed as protein sequences (by design or necessity) and not the corresponding nucleotide transcripts, and therefor a frameshifted reference cannot be created (due to the inherent ambiguity of the original codons for any two contiguous amino acids). When I did the testing in the past, I didn't account for this when using a protein database that was created in just 1 frame, and sometimes the alignments would work really well (if the ORFs happened to be in the correct frames for that read set), and other times they would be awful. When I aligned against the metagenomic readset last night, overall it did badly, as more were misaligned.
Because of this, my thought is to instead move the process of encoding across multiple frames to the read alignment side (and not during the index process). During the ORF detection process, the .PRO file (protein sequence FASTA) will be written with proteins encoded from any valid ORF, not just the first one when they tie in length. Then, after alignment has taken place, before the SAM file is generated, PALADIN will internally check competing (per frame) alignments per aligned sequence, and pick the one that aligned the best. That way, we aren't artificially inflating/deflating the percentage of mapped reads we get.
Additionally, the efficiency may actually be helped, and memory usage will be 6 times less, as currently I'm always translating and indexing all 6 frames of each CDS during the index (and hence during the MEM process it has 6 times more BWT and SA to traverse). The read process will only translate when necessary, and the BWT/SA space will be 6 times smaller, so overall it should be faster.
In response to items above - PALADIN is still novel as RAPSearch and associated tools are BLAST based, and slower. Changes to protein generation algorithm will be implemented as discussed above.
Focus of project was reiterated - we want to ensure our message for the paper (and development direction of PALADIN) is oriented toward quick metagenomic alignment for the purpose of discovering effective functionality (genes and associated metabolic pathways - both specific and GO-based groupings) - and not for taxonomic identification and classification (though PALADIN could be used in many ways (even unpredicted) other than our primary goal, including this).
To accomplish this, our target reference database will be UniProt. This presents a few technical challenges:
-
PALADIN currently does not align well directly from a protein database without the original DNA transcripts (per the issue mentioned above). This is to be fixed by implementing the change above.
-
For baseline/testing purposes, we want to compare BWA/DNA to Novo/Degen DNA to PALADIN. Since we will be using the Uniprot DB as the reference in PALADIN, we will need the corresponding DNA sequences for the first two tests. However, UniProt does not directly offer the corresponding nucleotide sequence for each protein sequence (as mentioned on their site here: http://www.uniprot.org/help/canonical_nucleotide. However, these DNA sequences are technically obtainable via included source/database information for each entry. Jordan will be using previous work he's done with the UniProt database on another project to obtain, check for the correct corresponding entry, and compile into the corresponding complete multi FASTA with matching sequence headers. Then all three tests will be using the equivalent starting reference.
-
We will also want to remove (if present) our source genomes from which we generated our reads from the UniProt references (both protein and DNA) while testing. We know these sections will align extremely well since they are identical (sans the simulated read errors), and we want to reduce this bias. We may want to experiment by extending this out to also exclude up to certain similiar species for the same reason - to be determined empirically.
After this step (impelementing fixes/additional functionality to PALADIN and compiling this nucleotide reference and testing), we'll need to actual look at how well PALADIN is matching genes in the UniProt database. For those that don't directly match, we'll want to use Gene Ontology information associated with each UniProt entry to determine how closely and how they matched for GO domain/attribute. There are a number of ways of directions we could go with analysis at this point - to revisit when closer.
Also, additional items to be implemented in PALADIN (To-Do list has been updated with these - some new, some existing):
- Fixes to multi-frame alignment mentioned above.
- Functionality to directly specify protein database instead of FASTA/GFF during the index phase. If protein reference directly given, CDS parsing stage and .PRO generation will be skipped. Lack of .PRO file will also indicate to alignment phase that original sequence is AA
- Command line argument to indicate BAM generation instead of SAM
- Command line argument to generate a gene or GO oriented output instead of alignment data
- Re-alignment from protein to nucleotide in SAM/BAM. May not be used often, but easy to implement and potentially useful.
To review, one major implementation item was to move the encoding of codons to AA across each read frame (to allow false-positive ORF detected AA sequences to match) from the index process to the alignment ORF generation process. This was necessary if we wanted to allow direct use of a protein database as a reference, such as UniProt - as we wouldn't be able to encode AA sequences from all six frames of a database already in AA format.
Moving the process to when ORFs are encoded into protein strings proved to be difficult for many reasons which we can discuss during our meeting (I have an extensive excel spreadsheet with results and reasons for each result). However, long story short, I implemented 10 different algorithms to test out how effective paladin was. I used 3 overview metrics to judge effectiveness:
- If the original nucleotide sequence was required for the reference, or if a protein database could be used
- Speed (alignment time in CPU seconds)
- Reads mapped
And tried these 10 different algorithms (descriptions a little cryptic, will explain in person):
1 Reference 6 frames, first full ORF per read seq, SE
2 Reference correct frame, first full ORF across 6 frames per read seq, filtered, SE
3 Reference correct frame, first full ORF, SE
4 Reference correct frame, all full ORFs per read seq, filtered, SE
5 Reference correct frame, all full ORFs per read seq allowing 1 stop, filtered, SE
6 Reference correct frame, best ORF by stats (best triplet) per read seq, SE
7 Reference correct frame, best ORF by stats (best triplet) (>33%) per read seq , SE
8 Reference correct frame, best ORF by stats (GC) (>33%), SE
9 Reference correct frame, all full ORFs per read seq, filtered, PE (PEAR), min 400nt
10 Reference correct frame, all full ORFs per read seq, filtered, PE (Hint)
It came down to being a tradeoff between the three metrics, 1 will always be worse than the other three.
More importantly though, I realized after doing some research into codon frequencies and ORF statistics (especially with overlapping ORFs on the antisense strand and erroneous ORFs in short sequences):
http://www.ncbi.nlm.nih.gov/pubmed/9544406
http://www.ncbi.nlm.nih.gov/pubmed/8208617
http://www.ncbi.nlm.nih.gov/pubmed/9266684
That even 250 bp reads are still too short to accurately filter out erroneous ORFs that exist due to codon bias in CDS - many read sequences have 3-4 frames that are open, even at 250 bp length. Because of this, detected ORFs will often be on the wrong frame or even in an intragenic region. This causes different issues for each method.
While the end result of this means anything less than all 6 frames encoded (either during indexing or during ORF detection) involved will always lead to fewer reads mapped, the number of mapped reads isn't reflective on how many genes are mapped. Even in scenarios where only 20% of reads are mapped, almost 100% of genes are still mapped and found. There are enough in-frame, correct ORFs detected to map to the genes. I am guessing the exact relationship between reads mapped and genes mapped will depend on the coverage levels of the particular genome in the metagenomic sample, combined with the particular codon bias that species has.
Given this, depending on what an individual's goals are (either gene detection, or read alignment at the protein level), we can tune the algorithm one of two ways. Out of the 10 algorithms, 3 of them represent the best of each evaluation metric, and the user can choose which one they want to use depending on their needs.
I think the actual numbers will demonstrate what is described above a lot better - we can discuss tomorrow.
Most of the session was spent reviewing statistics from the trial runs of the 10 different algorithm variants mentioned above. A few highlights though:
-
As seen on the Excel sheet, algorithm 1, 2, and 3 represent the extreme end of each metric (a. requiring nucleotide sequence for index, b. speed, and c. read-mapping accuracy). It's likely we'll program paladin to allow selection of one of these algorithm types at runtime depending on the user's needs (e.g. if they're more gene finding oriented, or read alignment oriented). Dan also suggested a multi-pass/pipeline approach in which variants can be combined in a filtering manner to reduce input domain of latter stages in the pipeline - thereby speeding up the normally slower algorithm. Essentially, we'd be combining algorithms to even out the extremes of each metric. E.g. (abstractly speaking) if algorithm is 100 speed, 50 readmapping, and the other was 50 speed, 100 readmapping, pipelined could result in an overall 75 speed 75 readmapping (or perhaps better) This approach will be another runtime option in paladin.
-
The erroneous ORF detection was discussed - initial trials show that even simple strategies like taking into account codon frequencies can help. Continue to research this and implement this - find the fastest ways of incorporating ORF detection "hints" for algorithm variants other than #1. Codon frequencies, GC content and arrangement, palindrome avoidance, simple machine learning (HMM/NN/Bayes), etc.
-
Though result differences between 6 frame variants (#1 & #2) and the rest are highly demonstrative of ORF misalignment being the primary issue with missed read alignments, it would be nice to clearly prove this with longer reads as a test (many papers have used 100 codons as the low-end limit on ORF detection, for these very reasons). ART appears to have an artificial limit of 250 BP for reads simulation/generation due to the limitation of its Illumina error model. Either see if this can be overridden and generate at a 400 BP minimum, or find another / write a simple read generator (doesn't need to contain errors just for testing purposes) and ensure the resulting.
-
As discovered through the trials runs, regardless of read mapping effectiveness described above, the most important metric for our focus is how well it maps genes. While the tests show that gene mapping does well, we still haven't confirmed that the CORRECT genes are being mapped. It would be good to devise a test verification method that can translate and map the UniProt IDs gathered from aligned reads in the SAM file (aligned against the UniProt database) with the original CDS entries that correspond to the read location (as obtained via the alignment data in the sequence header created by ART that we inserted at the start of MCBS913). I have the latter functionality done already (which is how I got the statistics for the excel spreadsheet), but not the part that matches anything in the CDS entry to the UniProt entry to ensure it's the correct protein (but really, it would need to be a little smarter than this, and check for close homology as opposed to exact matching).
-
Another item we had talked about the previous week that came up again - now that Jordan has been able to mine and compile 97% of the corresponding nucleotide sequences for the protein sequences of the UniProt database (which is the upper limit as documented by UniProt themselves - 3-4% of the proteins were obtained in some other manner - maybe mass spec), we can do two things: 1) Run the metagenome test on the Excel spreadsheet for algorithm variant #1, which requires of the original nucleotide sequence. 2) Run the BWA and Novo tests, which will be important comparison points for the paper. However, before we do this, we probably want to have two versions of the UniProt nucleotide database - the version we have right now, and a version with any genes pulled out that originate from our 6 original references (A. avenae, E. coli, etc), since having the exact protein sequences in the reference will bias the results. Jordan, since you have the infrastructure and scripts setup for most of this already, do you have time to take on this task?
-
Additionally, we want to basically do all the tests we've done/will do with our real metagenomic reads.
-
Dependent that no further major gotchas are found in any of the testing, we can start back into finalizing paladin and polishing the code. Right now there are essentially 10 versions of it, so these need to be condensed with command-line arguments. (plus adding in any refinement features as mentioned in item #2 above)
Next week to discuss results found from the tasks above...
Both sessions consisted of going over the further testing results, including more trial runs and the stop codon order likelihood relationships (to potentially be used to speed up ORF detection). The 18th session also included a lengthy discussion about Jordan and Louisa's upcoming work on analyzing gene ontology relationships of mapped reads between the known proteins and the reference proteins (i.e. UniProt reference). Further paladin testing will wait until this analysis is in place, as this will gauge the true effectiveness of the the protein mapping process (all metrics so far are just estimates).
Additionally, since testing is being put on hold, Toni will take this opportunity to finish some paladin coding tasks.