long reads #18

Open · xgnusr opened this issue Nov 6, 2020 · 4 comments

xgnusr commented Nov 6, 2020

Can the use of long genomic reads improve the accuracy?

Thanks!

muellan (Owner) commented Nov 6, 2020

In theory, the longer the reads are, the more accurate the results should become. This is certainly what we have seen for Illumina reads of different lengths.

The most crucial setting is -winlen, the "window length", which defaults to 128. With this default, even reads shorter than 128 bp can be mapped accurately, and speed remains acceptable even for reads a few times longer than that.

For very long reads (thousands or tens of thousands of bp) the whole pipeline will still work, but mapping speed will decrease. It could therefore be beneficial to set -winlen to a (much) larger value when building a database for use with such reads. This will not only increase mapping speed but also decrease the database's memory footprint (on disk and in RAM). It could, however, lead to lower accuracy, especially if the error rate of the reads is very high.

Since longer reads usually have more errors than Illumina reads, I would probably set -winlen so that even the shortest reads are divided into multiple windows, something along the lines of -winlen 512 if your shortest reads are 1000 bp long.
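
A minimal sketch of what that could look like (a hypothetical invocation: apart from -winlen, the database/file names and the exact build/query forms shown here are assumptions based on typical MetaCache usage, so check the MetaCache documentation for your version):

```sh
# Build a reference database with a larger window length for long reads.
# "longreads-db" and "genomes/*.fna" are placeholder names.
metacache build longreads-db genomes/*.fna -winlen 512

# Query the long-read dataset against that database.
metacache query longreads-db reads.fq -out results.txt
```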

If you had calibration reads with ground-truth data, you could use those to optimize the tradeoff between speed plus memory consumption on one side and accuracy on the other. If you are only interested in accuracy, and speed and memory consumption don't matter, just stick to the default settings.

xgnusr (Author) commented Nov 6, 2020

I am interested in precision at the strain level.
What values would be the most appropriate?
My N50 is 3500-4000 bp.

Thank you!

muellan (Owner) commented Nov 6, 2020

If you want to distinguish reads at strain level, then mapping accuracy is of course extremely important.

I would first try the default settings, which promise the highest accuracy, and see if you can live with the mapping speed and memory consumption. As an added bonus, you can process shorter reads with the same database without having to change anything.

If, however, you plan to process many more such datasets in the future, it might be worth optimizing for speed.
I would try -winlen 1024 and compare the fraction of unmapped reads and of reads mapped to the lower taxonomic levels against the default setting of 128. If they are roughly the same, I would double -winlen and continue until the results get worse; a sketch of such a sweep is shown below. If 1024 is already clearly worse than 128, I would try 512 and 256.
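
A rough sketch of that sweep as a shell loop (hypothetical database and file names; only -winlen itself comes from this thread, the rest is assumed from typical MetaCache usage):

```sh
# Build one database per window length and query the same read set
# against each, keeping the per-read output files for comparison.
for w in 128 256 512 1024; do
    metacache build db-w$w genomes/*.fna -winlen $w
    metacache query db-w$w reads.fq -out results-w$w.txt
done
# Afterwards, compare the fraction of unmapped reads and the number of
# reads assigned at low taxonomic ranks across the results-w*.txt files.
```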

muellan (Owner) commented Mar 11, 2024

Just to give a little update on this issue: I've recently processed Oxford Nanopore datasets with read lengths varying from 100 up to 21,000 bp. The N50 was relatively low (in the 500s). It turned out that MetaCache's default settings yielded the best results.
