Run `sort` command without pre-calculated minimizer index #1554
Labels: help wanted (extra attention is needed), package: nextclade_cli, package: nextclade, t:feat (type: request of a new feature, functionality, or enhancement)
Context: https://discussion.nextstrain.org/t/nextclade-sort-using-a-serverless-local-dataset-for-sorting/1777
The Nextclade CLI `sort` command currently requires a `minimizer_index.json`, either fetched from the dataset server or provided with the `-m` parameter. The `minimizer_index.json` contains a mapping from dataset/ref names to lists of minimizers (hashes of ref sequence fragments).

In principle, Nextclade CLI could calculate this index itself if given a set of reference sequences. The code already exists: it has to calculate minimizers of query sequences anyway, so it might as well calculate minimizers for reference sequences.
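To make the idea concrete, here is a minimal Python sketch of building such a mapping from reference sequences. It is only an illustration: the k-mer length, the hash function (BLAKE2b truncated to 32 bits), the cutoff fraction, and the names `minimizers` and `build_index` are all assumptions made for this example, and are not taken from Nextclade's implementation or from the actual `minimizer_index.json` format.

```python
import hashlib

K = 17                     # hypothetical k-mer length, chosen for illustration
CUTOFF_FRACTION = 1 / 16   # hypothetical fraction of hash space that is kept


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 32-bit integer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int = K, cutoff_fraction: float = CUTOFF_FRACTION) -> set[int]:
    """Return hashes of all unambiguous k-mers whose hash falls below the cutoff."""
    seq = seq.upper()
    cutoff = int(cutoff_fraction * 2**32)
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous characters such as N.
        if set(kmer) <= set("ACGT"):
            h = kmer_hash(kmer)
            if h < cutoff:
                kept.add(h)
    return kept


def build_index(refs: dict[str, str]) -> dict[str, list[int]]:
    """Map each reference name to the sorted list of its minimizer hashes."""
    return {name: sorted(minimizers(seq)) for name, seq in refs.items()}
```

Because the same `minimizers` function would be applied to both query and reference sequences, the two sets are directly comparable, which is the point made above about reusing the existing query-side code.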
We could add a parameter to the `sort` command (e.g. `--input-ref`, `-r`) to provide one or more FASTA files with ref sequences. In this case Nextclade CLI would require neither `-m` nor fetching the index from a server. It would instead calculate the minimizer index from the provided sequences and then immediately proceed to sorting. This could also be implemented as a separate command which only generates the minimizer index file. This way the produced index file could be reused (with `sort -m`), without repeating the minimizer index calculation on every run.

This should improve the user experience for people who want to use the `sort` command with their own reference sequences. Currently they have to build a custom `minimizer_index.json` themselves, which is a non-trivial task.