List of files / multiple files as input #9

Open
SergejN opened this issue Nov 22, 2020 · 5 comments

Comments


SergejN commented Nov 22, 2020

Dear maintainers,

Is it possible to specify a list of input files instead of a single file? I work with the axolotl genome and have quite a few long reads. Therefore, I have two options:

    1. either I zcat the input files into a single huge FASTQ file, which is a bit wasteful given the amount of data, OR
    2. I zcat the input files and pipe the data to winnowmap.

However, since the genome is too large, minimap2 has to split the index. Therefore, if I pipe the data, winnowmap ends up mapping the reads only to the first 5 scaffolds, which are included in the first index chunk. The other scaffolds are processed afterwards as well, but by then there is no more data in the pipe.
It would be nice to be able to specify multiple input files, all of which can be read multiple times if necessary.

I also tried creating the index first by setting -d scaffolds.mmi, and then running winnowmap, but in this case I get a segmentation fault.

thanks!

Contributor

cjain7 commented Nov 22, 2020

You can run the mapper as:

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...

Will this resolve your issue?

Contributor

cjain7 commented Nov 22, 2020

BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.

See https://lh3.github.io/minimap2/minimap2.html

Author

SergejN commented Nov 22, 2020

> You can run the mapper as:
>
> winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...
>
> Will this resolve your issue?

In theory, yes, but it's also super inconvenient to specify the names of 137 files on the command line.

> BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.
>
> See https://lh3.github.io/minimap2/minimap2.html

Yes, I saw this parameter, but I had the impression that minimap2 cannot process sequences longer than 4G. I now see that this was incorrect: the limit applies only to a single sequence within the dataset, not to the total length of the sequences. I will give it a try and set -I to the whole genome size (32Gb). Thanks!
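
For anyone wanting to try the same thing, here is a minimal sketch of how the total reference size could be computed before choosing a value for -I. It assumes a samtools-style index (`ref.fa.fai`, a placeholder name) already exists; the helper name `total_bases` is made up for illustration and is not part of winnowmap or minimap2. The printed total is only a lower bound, so passing a somewhat larger value to -I is probably safer.

```shell
# Sum the sequence lengths in a samtools-style .fai index (column 2 holds the
# length of each sequence) and print the total number of bases.
# total_bases is a hypothetical helper, not a winnowmap or minimap2 command.
total_bases() {
    awk '{ sum += $2 } END { printf "%.0f\n", sum }' "$1"
}

# Hypothetical usage (ref.fa.fai is assumed to come from `samtools faidx ref.fa`):
# total_bases ref.fa.fai
```

If the total is well above the default chunk size, the index will be split unless -I is raised accordingly, at the cost of more memory.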


jelber2 commented Nov 23, 2020

You might be able to do

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(ls -1 *.fq.gz|tr '\n' ' ')

Not tested
*assumes all FASTQ files are desired and have the extension .fq.gz

Author

SergejN commented Nov 23, 2020

Yes, sure. This will also work, unless you have to specify so many files that the command line becomes too long (2MB on my system, so quite a few file names):

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(find . -name "*.fq.gz" | grep -v 'whatever_you_want_to_exclude' | tr '\n' ' ')

But I wanted to propose a more elegant way. Of course, I can also put the file names into a text file and then run (assuming there are no spaces or other weird characters)

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(cat filelist | tr '\n' ' ')
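
The file-list approach can be made tolerant of spaces in file names with a small wrapper. The following is only a sketch: `map_from_list` is a hypothetical helper name, and it assumes the list holds one path per line. Each line is appended to the argument list as a separate argument instead of relying on word splitting.

```shell
# Run a command with every path listed in a file (one path per line)
# appended as a separate argument. Paths containing spaces stay intact
# because each line is added to "$@" individually, without word splitting.
# map_from_list is a hypothetical helper, not part of winnowmap.
map_from_list() {
    list=$1
    shift                              # the remaining "$@" is the command and its options
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue     # skip blank lines
        set -- "$@" "$path"
    done < "$list"
    "$@"
}

# Hypothetical usage (file names are placeholders):
# map_from_list filelist winnowmap -W repetitive_k15.txt -ax map-ont ref.fa > out.sam
```

The system limit on command-line length still applies, since all file names end up on a single command line; `getconf ARG_MAX` reports that limit.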
