show read fate - headers from fastq #48
Hi @handibles, Thanks for your interest in mOTUs! If I understand correctly, you are suggesting how to improve the tool/wiki.
This depends on how many reads there are in the fasta file. With few reads, yes, the majority is header.
It would be possible to do it, but this is a standard format (BAM or SAM). Printing only the reads would make the file invalid. It is cleaner to save a correct SAM/BAM file and then visualise it with samtools (as you showed in your example).
Sure, I added a comment at the end of this wiki page:
Ha, nice!
Agreed, the last thing we (I) need is invalid files being output into the workflow. However, sometimes it is useful to be able to (e.g.) easily export all the reads that weren't assigned to a separate Anyone who is interested in scrutinising their reads further could try web-searching for samtools or e.g. see the samtools & SAM format guide for info on investigating where reads ended up. As you say - suggestions. Thanks again for all the diligence.
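For anyone following along, here is a minimal sketch of the SAM layout discussed above. Header lines start with @, and every other line is one alignment record whose first column is the read name (taken from the fastq header) and whose second column is the bitwise FLAG, where bit 0x4 means unmapped. The function name and the example records are hypothetical, and whether unmapped reads are kept at all in the file mOTUs writes is an assumption to check against your own output.

```python
def split_sam(lines):
    """Separate SAM header lines from alignment records.

    Returns (header_lines, mapped_read_names, unmapped_read_names).
    """
    header, mapped, unmapped = [], set(), set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header section: @HD, @SQ (one per reference sequence), @PG, ...
            header.append(line)
            continue
        fields = line.split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & 0x4:  # 0x4 = segment unmapped (SAM specification)
            unmapped.add(qname)
        else:
            mapped.add(qname)
    return header, mapped, unmapped


# Hypothetical records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:geneA\tLN:900",
    "read1/1\t0\tgeneA\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read2/1\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
header, mapped, unmapped = split_sam(example)
print(len(header), sorted(mapped), sorted(unmapped))
```

The input can be any iterable of SAM text lines, for instance the output of samtools view -h on a BAM file piped into a script.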
Apologies for the dread return. Realised after some more digging that those commands don't return the outputs I expected/described. To clarify:
My end goal is to determine what reference each successfully aligned read was ultimately assigned to - is there a different portion of the output that I should be looking at? H
So, we parse the result from
These reads are then used to determine which marker gene cluster (MGC) they belong to with Moreover, if there are paired-end data, it will select the pair with the best score. Hence, the read with the best score is not always selected. If, for example, Also, we choose all reads that map correctly, and then we distribute the multiple mappers proportionally. For example, if
It is not so easy to understand exactly what happens to the reads, because at a certain point we abstract from the single reads and have just read counts for MGCs, and do some counts/splits on those. That said, all the reads produced by If you try to run: where There are 3 reads (
I would like to point out that:
Hi @AlessioMilanese , Sorry to comment on an old post, but I don't see the need to ask this separately. If I understand correctly from some of your answers (in this one and other issues), multiple mappers are counted, distributing the counts among the different genes/MGCs. However, looking at the code, it seems that there are cases where multiple mappers are completely ignored (map_genes_to_mOTUs.py, line 956). However, from the code and the comments alone I cannot fully understand what is happening there. So, my question is: are there cases where multiple mapper reads are discarded? Thank you. Best,
Hi Carlos, can you send me one or two of these reads? Best,
But basically, if there are more than 3 multiple mappings and there is no unique mapper, then the multiple mappers are ignored. The idea is that if you have a read that maps equally well to E. coli, E. albertii and E. fergusonii, but you don't have any unique mapper to any of these species, then it's highly probable that it is an error. And even if you would like to count it, you would have to split the read in three and count 1/3 for each species. So you would have a really low read count for these 3 species, which would be removed in further analyses. In case you have many reads that are all more than 3 multiple mappers, but there is no single mapper for that species, then something fishy is happening.
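To make the rule above concrete, here is a rough sketch in Python. It is not the code from map_genes_to_mOTUs.py. It assumes that "proportionally" means proportional to the unique-mapper counts of each target, that multiple mappers with no unique support and at most 3 targets are split equally, and that the cut-off is strictly "more than 3". The function name, parameter names and read names are all hypothetical.

```python
from collections import defaultdict


def count_reads(read_targets, max_targets=3):
    """Toy model of multiple-mapper handling (assumptions, not mOTUs code).

    read_targets maps a read name to the set of targets (e.g. MGCs)
    it aligns to equally well. Returns (counts, discarded_reads).
    """
    counts = defaultdict(float)
    # Pass 1: a unique mapper counts fully for its single target.
    for targets in read_targets.values():
        if len(targets) == 1:
            counts[next(iter(targets))] += 1.0
    unique = dict(counts)  # snapshot of unique-mapper support

    discarded = []
    # Pass 2: distribute the multiple mappers.
    for read, targets in read_targets.items():
        if len(targets) <= 1:
            continue
        support = {t: unique.get(t, 0.0) for t in targets}
        total = sum(support.values())
        if total > 0:
            # Assumption: split proportionally to unique-mapper counts.
            for target, weight in support.items():
                if weight:
                    counts[target] += weight / total
        elif len(targets) > max_targets:
            # No unique support and too many targets: ignore the read.
            discarded.append(read)
        else:
            # Assumption: equal split when there is no unique support.
            for target in targets:
                counts[target] += 1.0 / len(targets)
    return dict(counts), discarded


reads = {
    "r1": {"A"},
    "r2": {"A"},
    "r3": {"B"},
    "r4": {"A", "B"},            # split 2/3 to A and 1/3 to B
    "r5": {"C", "D", "E", "F"},  # 4 targets, no unique support: ignored
    "r6": {"G", "H"},            # no unique support, split equally
}
counts, discarded = count_reads(reads)
print(counts, discarded)
```

In this toy input r5 contributes nothing, which is the situation described above where a taxon can disappear if all of its reads are ambiguous.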
I uploaded all the reads (actually, the SAM mappings; 74022) mapping to the genes (17) from a single MGC (ref_mOTU_v3_03673) which is not appearing after calc_mgc.
Thank you very much for the explanation. EDIT: ignore the next sentences below. I checked the same read with all the mappings (not only the uploaded ones to the single MGC), and it seems to be mapping to genes from other MGCs, so I guess it is a multiple mapper also. However, after inspecting the mappings I uploaded I am not completely sure this is the case. For example, the read "ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1" seems to be mapping uniquely to "refMG0012003.COG0172". It is true that its pair is mapping with flag 272 to the same gene, so I am not sure if this is enough to classify it as a multiple mapper... If you could give me a clue of what is happening here it would be great. Thank you @AlessioMilanese . Best,
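On the flag 272 question: SAM flags are bit fields, and 272 = 256 + 16, i.e. a secondary alignment on the reverse strand. A secondary record is how an aligner reports an alternative location for a read that also has a primary alignment elsewhere. A small decoder, using the bit meanings from the SAM specification (the label strings are shorthand chosen for this sketch):

```python
# Bit meanings as defined in the SAM specification; labels are shorthand.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(272))  # ['reverse_strand', 'secondary']
```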
Hi again, From the file I uploaded before, I extracted the list of reads, and from this list I obtained all the mappings from those reads (i.e. those mapping to genes from any MGC). reads_example_motus_no_mgc.all_mappings.sam.gz I counted how many times I see each read, and none of them seem to be found only once (ranging from 3 to 54 mappings). So yes, this could be the case where no unique mapping is found. Now I have to wonder why this is happening xD Best,
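The per-read count described above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the example records are made up, and reads are keyed on the exact name in the first SAM column, so a /1 and /2 suffix would be counted as two different reads.

```python
import gzip
from collections import Counter


def count_mappings(lines):
    """Count how many alignment records each read name has."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def count_mappings_in_file(path):
    """Same as count_mappings, reading a gzip-compressed SAM file."""
    with gzip.open(path, "rt") as handle:
        return count_mappings(handle)


# Hypothetical records, for illustration only.
example = [
    "@SQ\tSN:geneA\tLN:900",
    "read1/1\t0\tgeneA\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read1/1\t256\tgeneB\t30\t0\t50M\t*\t0\t0\t*\t*",
    "read2/1\t16\tgeneA\t70\t60\t50M\t*\t0\t0\t*\t*",
]
counts = count_mappings(example)
seen_once = sorted(read for read, n in counts.items() if n == 1)
print(counts, seen_once)
```

Calling count_mappings_in_file on reads_example_motus_no_mgc.all_mappings.sam.gz and checking that seen_once is empty would correspond to the observation that no read is found only once.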
I'm wondering if the problem is another one. I created a small fastq file (paired end) with the read you mentioned (
where I save the genes ( I see then that it is counted as a multiple mapper (from the motus log):
Now if we look at what genes it maps to (
If I check
It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported). But then there are no mOTUs reported:
Because there is only one MGC detected. It needs at least 3 MGCs from one mOTU (default So changing the command to
And we can check with
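As an illustration of the threshold discussed above (at least 3 MGCs from one mOTU by default, which is what -g changes in this thread), here is a toy filter in Python. It is a sketch under stated assumptions, not the mOTUs implementation: the function name, the MGC and mOTU names and the counts are all hypothetical, and an MGC is treated as detected whenever its count is above zero.

```python
from collections import defaultdict


def detected_motus(mgc_counts, mgc_to_motu, min_mgcs=3):
    """Return {mOTU: number of detected MGCs} for mOTUs passing the threshold."""
    per_motu = defaultdict(set)
    for mgc, count in mgc_counts.items():
        # Assumption: an MGC is "detected" if it has a non-zero count.
        if count > 0 and mgc in mgc_to_motu:
            per_motu[mgc_to_motu[mgc]].add(mgc)
    return {
        motu: len(mgcs)
        for motu, mgcs in per_motu.items()
        if len(mgcs) >= min_mgcs
    }


# Hypothetical mapping and counts, for illustration only.
mgc_to_motu = {
    "mgc_a1": "motu_A",
    "mgc_a2": "motu_A",
    "mgc_a3": "motu_A",
    "mgc_b1": "motu_B",
}
mgc_counts = {"mgc_a1": 4.0, "mgc_a2": 1.5, "mgc_a3": 2.0, "mgc_b1": 0.5}

strict = detected_motus(mgc_counts, mgc_to_motu, min_mgcs=3)
relaxed = detected_motus(mgc_counts, mgc_to_motu, min_mgcs=1)
print(strict, relaxed)
```

With the stricter setting only motu_A is kept, while lowering the threshold to 1 also reports motu_B from its single MGC, which mirrors the single-read test above.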
Hi @AlessioMilanese , Thank you very much for testing this. When I run it with all the reads, it is reported with neither -g 1 nor -g 3 (obviously, since no MGC is reported after all). So I guess (let me know if I am wrong), that what you suggest is that the problem is this one "It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported)." So the problem is that I have multiple mapping reads to genes without unique mappers, and that those genes belong to more than 2 MGCs. If this is the case, I guess I am just losing this information, because these reads are ambiguous (map to conserved regions of the genes?), and they are impossible to assign with certain confidence. I know this is difficult to address (maybe impossible), but in this case it is not only one or more Species which are not detected, but an Order which is not present when these mappings are discarded. I wonder whether these mappings should/could be assigned to the Order (assuming that all the mapped genes belong to such Order), even when no Species is assigned. Thank you very much once more, for your time and patience. Best,
I checked whether the reads map to genes only from a single order. To do it properly I guess I should keep only the best mappings (discard those with lower identity, etc.), but to make things easier I included all the hits. I obtained 2 orders though (Burkholderiales and Pseudomonadales, both from class Betaproteobacteria if I am not mistaken). I was also wondering now... if I had a sample with less sequencing depth, then in some cases (like this example we are discussing) I would detect things just because there would not be enough depth (just by chance) to have reads mapping to more than 2 MGCs? edit: sorry, I am thinking now the last paragraph makes no sense, because we are considering individual reads here
I checked my mappings for the same read, and I have the same results as you, so I wonder why I don't get any MGC count nor detect mOTU ref_mOTU_v3_03673 (meta_mOTU_v3_12785 is detected though):
If I understand correctly, the MGC table is empty? It seems there is a problem in the way the code is running. Can you try to run
Hi @AlessioMilanese , No, the MGC file is not empty. It is just missing the results that you showed for the single read you tested. Actually, I ran again
If I did everything right, none of those MGCs are from mOTU ref_mOTU_v3_03673, which includes the following MGCs:
When I run, as you did, only the mappings from the single read above (ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0), I get the following results
COG0172.mOTU.v2b.0000252 belongs to mOTU ref_mOTU_v3_03673. Why is this result only present when I use a single read, and not when the same read mappings are alongside mappings from other reads? Maybe I am making some error which is leading to this confusion, I don't know. Thank you once more. Best,
Hey devs,
Thanks for all the work on this. I'm checking the fate of the fastq reads moving through mOTUs (i.e., which specific reads are assigned, which are not etc.), so interested in the headers from the fastq input. These headers make up the bottom 99% of the BAM/SAM file and can be obtained as below - but could this header output be set as a flag during the pipeline? Or perhaps in the wiki, outline the general format & composition of a BAM file w.r.t. the work mOTUs is doing? Could help the next person.