List of files / multiple files as input #9

SergejN · 2020-11-22T12:59:06Z

Dear maintainers,

Would it be possible to add an option to specify a list of input files instead of a single file? I work with the axolotl genome and have quite a few long reads. As it stands, I have two options:

1. either I zcat the input files into a single huge fastq file, which is a bit wasteful given the amount of data, OR
2. I zcat the input files and pipe the data to winnowmap.

However, since the genome is too large, minimap2 has to split the index. Therefore, if I pipe the data, winnowmap ends up mapping the reads only to the first 5 scaffolds, which are included in the first index chunk. The remaining scaffolds are processed afterwards as well, but by then there is no more data in the pipe.
It would be nice to be able to specify multiple input files, which all can be read multiple times if necessary.

I also tried creating the index first by setting -d scaffolds.mmi, and then running winnowmap, but in this case I get a segmentation fault.

thanks!


cjain7 · 2020-11-22T13:46:52Z

You can run the mapper as:

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...

Will this resolve your issue?

cjain7 · 2020-11-22T13:49:27Z

BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.

See https://lh3.github.io/minimap2/minimap2.html

SergejN · 2020-11-22T17:05:29Z

> You can run the mapper as:
> winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...
> Will this resolve your issue?

In theory, yes, but it's also super inconvenient to specify the names of 137 files on the command line.

> BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.
>
> See https://lh3.github.io/minimap2/minimap2.html

Yes, I saw this parameter, but I had the impression that minimap2 cannot process sequences longer than 4G. I now see that this was incorrect: the limit applies only to a single sequence within the dataset, not to the total length of all sequences. I will give it a try and set -I to the whole genome size (32Gb). Thanks!
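
For reference, a rough sketch of how one might choose the -I value: count the bases in the reference and pass a value at least that large, so the whole reference should fit in a single index chunk. The helper name fasta_total_bases is hypothetical, and it assumes an uncompressed FASTA; the 32G figure simply mirrors the genome size mentioned above.

```shell
# Count reference bases: every character on non-header lines of an
# uncompressed FASTA file (header lines start with '>').
# fasta_total_bases is a hypothetical helper, not part of winnowmap.
fasta_total_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Usage (not executed here; file names are the ones used in this thread):
#   fasta_total_bases ref.fa    # choose an -I value at least this large
#   winnowmap -W repetitive_k15.txt -I 32G -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz
```

A larger -I means more memory is needed for the index, so this only helps if the machine can hold the whole index at once.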

jelber2 · 2020-11-23T09:48:30Z

You might be able to do

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(ls -1 *.fq.gz | tr '\n' ' ')

Not tested
*assumes all FASTQ files are desired and have the extension .fq.gz

SergejN · 2020-11-23T19:12:58Z

Yes, sure. This will also work, unless you have to specify so many files that the command line becomes too long (2MB on my system, so quite a few file names):

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(find . -name "*.fq.gz" | grep -v 'whatever_you_want_to_exclude' | tr '\n' ' ')

But I wanted to propose a more elegant way. Of course, I can also put the file names into a text file and then run (assuming there are no spaces or other weird characters)

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(cat filelist | tr '\n' ' ')
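
If file names may contain spaces, the list can be read line by line instead of relying on word splitting. This is a minimal POSIX shell sketch, assuming filelist holds one path per line; run_with_file_list is a hypothetical helper name, not something winnowmap provides.

```shell
# Append every non-empty line of a list file to a command as a separate
# argument, then run that command. Each line becomes exactly one argument,
# so paths containing spaces are passed through intact.
# run_with_file_list is a hypothetical helper, not part of winnowmap.
run_with_file_list() {
    rwfl_list=$1
    shift
    # "|| [ -n ... ]" also picks up a final line without a trailing newline
    while IFS= read -r rwfl_line || [ -n "$rwfl_line" ]; do
        if [ -n "$rwfl_line" ]; then
            set -- "$@" "$rwfl_line"
        fi
    done < "$rwfl_list"
    "$@"
}

# Usage (not executed here):
#   run_with_file_list filelist winnowmap -W repetitive_k15.txt -ax map-ont ref.fa
```

The resulting command line is still subject to the system's argument length limit, so this does not help once the list itself becomes too long.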
