Upgrade to nextclade v3 & update default dataset tags #375

Merged 24 commits on Apr 4, 2024
Changes from 10 commits
Commits
6e34d2d
added new WDL task for nextclade v3. tested w miniwdl. not added to w…
kapsakcj Mar 5, 2024
1507c6e
added common miniwdl output directories to .gitignore
kapsakcj Mar 5, 2024
734700a
update sars-cov-2 nextclade defaults; removed unnecessary nextclade_d…
kapsakcj Mar 6, 2024
c2703bf
updates to nextclade v3 task
kapsakcj Mar 6, 2024
cb84f4c
update theiacov_fasta to use nextclade v3 task. tested successfully w…
kapsakcj Mar 6, 2024
0dac722
update nextclade defaults for non-sc2 organisms. Have not tested at a…
kapsakcj Mar 6, 2024
f93e691
update to nextclade 3.3.1 and implement --verbosity flag for nextclad…
kapsakcj Mar 6, 2024
4c3c21f
updated WDL task for adding samples to nextclade ref tree. tested fin…
kapsakcj Mar 6, 2024
dc08e61
update Sample_to_ref_tree_PHB workflow: removed old inputs and made a…
kapsakcj Mar 6, 2024
cd2afec
updated theiacov_fasta_batch, ilmn pe, ilmn se, and ont to use nextcl…
kapsakcj Mar 7, 2024
477a216
update theiacov_clearlabs to use nextclade_v3. did not test with mini…
kapsakcj Mar 21, 2024
7e8e9ab
Merge remote-tracking branch 'origin/main' into cjk-nextclade-v3
kapsakcj Mar 22, 2024
6982979
fix import path for organism_parameters subwf in theiacov_clearlabs …
kapsakcj Mar 22, 2024
fbf0b49
shellcheck lied to me. reverting last commit
kapsakcj Mar 22, 2024
303a9b4
update theiacov_fasta CI
kapsakcj Mar 22, 2024
f89b0bd
update theiacov_clearlabs CI
kapsakcj Mar 22, 2024
ef1a6ac
update theiacov_ont CI
kapsakcj Mar 22, 2024
e654a22
re-enable theiacov_illumina_pe and se CI workflows; update them for n…
kapsakcj Mar 22, 2024
728504a
Merge remote-tracking branch 'origin/main' into cjk-nextclade-v3
kapsakcj Mar 28, 2024
6324136
update CI
kapsakcj Mar 28, 2024
2b7470a
nextclade_v3 task: removed unused pcr_primers_csv input; added back i…
kapsakcj Apr 4, 2024
0800fa0
nextclade_addToRefTree task and wf change: remove input-pcr-primers o…
kapsakcj Apr 4, 2024
84d506a
Merge remote-tracking branch 'origin/main' into cjk-nextclade-v3
kapsakcj Apr 4, 2024
7381cb6
corrected input file type for input-ref
kapsakcj Apr 4, 2024
4 changes: 3 additions & 1 deletion .gitignore
@@ -1 +1,3 @@
cromwell*
cromwell*
_LAST
2024*
101 changes: 82 additions & 19 deletions tasks/taxon_id/task_nextclade.wdl
@@ -60,6 +60,72 @@ task nextclade {
}
}

task nextclade_v3 {
meta {
description: "Nextclade classification of one sample. Leaving optional inputs unspecified will use SARS-CoV-2 defaults."
}
input {
File genome_fasta
File? auspice_reference_tree_json
File? gene_annotations_gff
File? pcr_primers_csv
File? nextclade_pathogen_json
String docker = "us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1"
String dataset_name
String verbosity = "warn" # other options are: "off" "error" "info" "debug" and "trace"
String dataset_tag
Int disk_size = 50
Int memory = 4
Int cpu = 2
}
String basename = basename(genome_fasta, ".fasta")
command <<<
# track version & print to log
nextclade --version | tee NEXTCLADE_VERSION

# --reference no longer used in v3. consolidated into --name and --tag
nextclade dataset get \
--name="~{dataset_name}" \
--tag="~{dataset_tag}" \
-o nextclade_dataset_dir \
--verbosity ~{verbosity}

# exit script/task upon error
set -e

# not necessary to include `--jobs <jobs>` in v3. Nextclade will use all available CPU threads by default. It's fast so I don't think we will need to change unless we see errors
nextclade run \
--input-dataset nextclade_dataset_dir/ \
~{"--input-tree " + auspice_reference_tree_json} \
~{"--input-pathogen-json " + nextclade_pathogen_json} \
~{"--input-annotation " + gene_annotations_gff} \
~{"--input-pcr-primers " + pcr_primers_csv} \
Contributor: Looks as if --input-pcr-primers was also removed as an input flag.

Contributor: The --input-root-seq flag (and the associated root_sequence task input) was removed from the task, but it looks as if it was just renamed in Nextclade v3. The info is in the same docs linked above.

Contributor Author: Good catch, thank you. I will remove --input-pcr-primers and add --input-root-seq back as --input-ref, as specified in their docs.

Contributor Author: Resolved in 2b7470a.

--output-json "~{basename}".nextclade.json \
--output-tsv "~{basename}".nextclade.tsv \
--output-tree "~{basename}".nextclade.auspice.json \
--output-all . \
--verbosity ~{verbosity} \
"~{genome_fasta}"
>>>
runtime {
docker: "~{docker}"
memory: "~{memory} GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
dx_instance_type: "mem1_ssd1_v2_x2"
maxRetries: 3
}
output {
String nextclade_version = read_string("NEXTCLADE_VERSION")
File nextclade_json = "~{basename}.nextclade.json"
File auspice_json = "~{basename}.nextclade.auspice.json"
File nextclade_tsv = "~{basename}.nextclade.tsv"
String nextclade_docker = docker
String nextclade_dataset_tag = "~{dataset_tag}"
}
}
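
Note on the optional flags in nextclade_v3: each ~{"--flag " + optional_input} placeholder renders to an empty string when the input is undefined, so the flag drops out of the command. A minimal POSIX-shell sketch of the same idea, which may help when trying the command outside of WDL, is shown here. The function name build_nextclade_args and its positional arguments are hypothetical and not part of this PR; only the three flag names are taken from the task above.

```shell
# Hypothetical helper, not part of this PR: print one argument per line,
# including a flag only when its value is non-empty. This mimics how WDL
# renders ~{"--flag " + optional_input} to nothing when the input is undefined.
build_nextclade_args() {
  tree="$1"
  pathogen_json="$2"
  annotation="$3"

  # Reuse the positional parameters as an argument list (works in sh and bash)
  set --
  if [ -n "$tree" ]; then set -- "$@" --input-tree "$tree"; fi
  if [ -n "$pathogen_json" ]; then set -- "$@" --input-pathogen-json "$pathogen_json"; fi
  if [ -n "$annotation" ]; then set -- "$@" --input-annotation "$annotation"; fi

  # Guard the empty case so that nothing, not even a blank line, is printed
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  fi
}

# Example: only a custom annotation is supplied (the file name is a placeholder)
build_nextclade_args "" "" "genes.gff"
```

Printing one argument per line keeps paths containing spaces intact, which a space-joined string would not.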

task nextclade_output_parser {
meta {
description: "Python and bash codeblocks for parsing the output files from Nextclade."
@@ -182,52 +248,49 @@ task nextclade_add_ref {
}
input {
File genome_fasta
File? root_sequence
File? reference_tree_json
File? qc_config_json
File? nextclade_pathogen_json
File? gene_annotations_gff
File? pcr_primers_csv
File? virus_properties
String docker = "us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:2.14.0"
String docker = "us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1"
String dataset_name
String? dataset_reference
String? dataset_tag
Int disk_size = 50
Int memory = 8
String verbosity = "warn" # other options are: "off" "error" "info" "debug" and "trace"
Int disk_size = 100
Int memory = 4
Int cpu = 2
}
String basename = basename(genome_fasta, ".fasta")
command <<<
NEXTCLADE_VERSION="$(nextclade --version)"
echo $NEXTCLADE_VERSION > NEXTCLADE_VERSION
# track version & print to log
nextclade --version | tee NEXTCLADE_VERSION

echo "DEBUG: downloading nextclade dataset..."
nextclade dataset get \
--name="~{dataset_name}" \
~{"--reference " + dataset_reference} \
~{"--tag " + dataset_tag} \
-o nextclade_dataset_dir \
--verbose
--verbosity ~{verbosity}

# If no referece sequence is provided, use the reference tree from the dataset
# If no reference sequence is provided, use the reference tree from the dataset
if [ -z "~{reference_tree_json}" ]; then
echo "Default dataset reference tree JSON will be used"
cp nextclade_dataset_dir/tree.json reference_tree.json
cp -v nextclade_dataset_dir/tree.json reference_tree.json
else
echo "User reference tree JSON will be used"
cp ~{reference_tree_json} reference_tree.json
cp -v ~{reference_tree_json} reference_tree.json
fi

tree_json="reference_tree.json"

set -e
echo "DEBUG: running nextclade..."
nextclade run \
--input-dataset=nextclade_dataset_dir/ \
~{"--input-root-seq " + root_sequence} \
--input-dataset nextclade_dataset_dir/ \
--input-tree ${tree_json} \
~{"--input-qc-config " + qc_config_json} \
~{"--input-gene-map " + gene_annotations_gff} \
~{"--input-pathogen-json " + nextclade_pathogen_json} \
~{"--input-annotation " + gene_annotations_gff} \
~{"--input-pcr-primers " + pcr_primers_csv} \
~{"--input-virus-properties " + virus_properties} \
--output-json "~{basename}".nextclade.json \
--output-tsv "~{basename}".nextclade.tsv \
--output-tree "~{basename}".nextclade.auspice.json \
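
The flag changes raised in the review comments and visible in this diff can be summarised as a small lookup. The sketch below is a hypothetical helper (the name nextclade_v3_flag is not part of this PR or of Nextclade). It covers only the flags touched here, and treating any unlisted flag as unchanged is an assumption rather than something this PR confirms.

```shell
# Hypothetical lookup, not part of this PR or of Nextclade itself.
# Prints the v3 spelling of a v2 flag, or "removed" where this PR drops the flag.
nextclade_v3_flag() {
  case "$1" in
    --input-root-seq)
      printf '%s\n' "--input-ref" ;;          # renamed, per the review comments
    --input-gene-map)
      printf '%s\n' "--input-annotation" ;;   # renamed in this diff
    --input-qc-config | --input-virus-properties | --input-pcr-primers)
      printf '%s\n' "removed" ;;              # dropped from `nextclade run` in this PR
    --reference)
      printf '%s\n' "removed" ;;              # `dataset get` now relies on --name and --tag
    *)
      printf '%s\n' "$1" ;;                   # assumption: anything else is unchanged
  esac
}

# Example: show the mapping for a few flags used by the tasks in this file
for flag in --input-root-seq --input-gene-map --input-pcr-primers --input-tree; do
  printf '%s -> %s\n' "$flag" "$(nextclade_v3_flag "$flag")"
done
```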
14 changes: 4 additions & 10 deletions workflows/phylogenetics/wf_nextclade_addToRefTree.wdl
@@ -8,28 +8,22 @@ workflow nextclade_addToRefTree {
description: "Nextclade workflow that adds samples to a curated JSON tree from Augur."
}
input {
File assembly_fastas
File? root_sequence_fasta
File assembly_fasta
File? gene_annotations_gff
File? reference_tree_json
File? qc_config_json
File? nextclade_pathogen_json
File? pcr_primers_csv
File? virus_properties
String nextclade_dataset_name
String? dataset_reference
String? dataset_tag
}
call nextclade_analysis.nextclade_add_ref { # nextclade analysis
input:
genome_fasta = assembly_fastas,
root_sequence = root_sequence_fasta,
genome_fasta = assembly_fasta,
reference_tree_json = reference_tree_json,
qc_config_json = qc_config_json,
nextclade_pathogen_json = nextclade_pathogen_json,
gene_annotations_gff = gene_annotations_gff,
pcr_primers_csv = pcr_primers_csv,
Contributor: Same comment as above regarding the removed input.

Contributor Author: Resolved in 0800fa0.

virus_properties = virus_properties,
dataset_name = nextclade_dataset_name,
dataset_reference = dataset_reference,
dataset_tag = dataset_tag
}
call versioning.version_capture {
17 changes: 7 additions & 10 deletions workflows/theiacov/wf_theiacov_fasta.wdl
@@ -25,7 +25,6 @@ workflow theiacov_fasta {
File? reference_genome
Int? genome_length
# nextclade inputs (default SC2)
String? nextclade_dataset_reference
String? nextclade_dataset_tag
String? nextclade_dataset_name
# sequencing values
@@ -53,7 +52,6 @@ workflow theiacov_fasta {
flu_subtype = select_first([flu_subtype, abricate_subtype, "N/A"]),
reference_genome = reference_genome,
genome_length_input = genome_length,
nextclade_dataset_reference_input = nextclade_dataset_reference,
nextclade_dataset_tag_input = nextclade_dataset_tag,
nextclade_dataset_name_input = nextclade_dataset_name,
vadr_max_length = maxlen,
@@ -75,16 +73,15 @@ workflow theiacov_fasta {
}
if (organism_parameters.standardized_organism == "sars-cov-2" || organism_parameters.standardized_organism == "MPXV" || organism_parameters.standardized_organism == "rsv_a" || organism_parameters.standardized_organism == "rsv_b" || organism_parameters.standardized_organism == "flu") {
if (organism_parameters.nextclade_dataset_tag != "NA") {
call nextclade_task.nextclade {
call nextclade_task.nextclade_v3 {
input:
genome_fasta = assembly_fasta,
dataset_name = organism_parameters.nextclade_dataset_name,
dataset_reference = organism_parameters.nextclade_dataset_reference,
dataset_tag = organism_parameters.nextclade_dataset_tag
}
call nextclade_task.nextclade_output_parser {
input:
nextclade_tsv = nextclade.nextclade_tsv,
nextclade_tsv = nextclade_v3.nextclade_tsv,
organism = organism_parameters.standardized_organism
}
}
@@ -138,11 +135,11 @@ workflow theiacov_fasta {
String? pangolin_docker = pangolin4.pangolin_docker
String? pangolin_versions = pangolin4.pangolin_versions
# Nextclade outputs
File? nextclade_json = nextclade.nextclade_json
File? auspice_json = nextclade.auspice_json
File? nextclade_tsv = nextclade.nextclade_tsv
String? nextclade_version = nextclade.nextclade_version
String? nextclade_docker = nextclade.nextclade_docker
File? nextclade_json = nextclade_v3.nextclade_json
File? auspice_json = nextclade_v3.auspice_json
File? nextclade_tsv = nextclade_v3.nextclade_tsv
String? nextclade_version = nextclade_v3.nextclade_version
String? nextclade_docker = nextclade_v3.nextclade_docker
String nextclade_ds_tag = organism_parameters.nextclade_dataset_tag
String? nextclade_clade = nextclade_output_parser.nextclade_clade
String? nextclade_aa_subs = nextclade_output_parser.nextclade_aa_subs
17 changes: 7 additions & 10 deletions workflows/theiacov/wf_theiacov_fasta_batch.wdl
@@ -16,7 +16,6 @@ workflow theiacov_fasta_batch {
Array[File] assembly_fastas
String organism = "sars-cov-2"
# nextclade inputs
String? nextclade_dataset_reference
String? nextclade_dataset_tag
String? nextclade_dataset_name
# pangolin inputs
@@ -30,7 +29,6 @@ workflow theiacov_fasta_batch {
call set_organism_defaults.organism_parameters {
input:
organism = organism,
nextclade_dataset_reference_input = nextclade_dataset_reference,
nextclade_dataset_tag_input = nextclade_dataset_tag,
nextclade_dataset_name_input = nextclade_dataset_name,
pangolin_docker_image = pangolin_docker
@@ -52,11 +50,10 @@ workflow theiacov_fasta_batch {
}
if (organism == "MPXV" || organism == "sars-cov-2"){
# tasks specific to either MPXV or sars-cov-2
call nextclade_task.nextclade {
call nextclade_task.nextclade_v3 {
input:
genome_fasta = cat_files_fasta.concatenated_files,
dataset_name = organism_parameters.nextclade_dataset_name,
dataset_reference = organism_parameters.nextclade_dataset_reference,
dataset_tag = organism_parameters.nextclade_dataset_tag
}
}
@@ -71,11 +68,11 @@ workflow theiacov_fasta_batch {
bucket_name = bucket_name,
samplenames = samplenames,
organism = organism,
nextclade_tsv = nextclade.nextclade_tsv,
nextclade_docker = nextclade.nextclade_docker,
nextclade_version = nextclade.nextclade_version,
nextclade_tsv = nextclade_v3.nextclade_tsv,
nextclade_docker = nextclade_v3.nextclade_docker,
nextclade_version = nextclade_v3.nextclade_version,
nextclade_ds_tag = nextclade_dataset_tag,
nextclade_json = nextclade.nextclade_json,
nextclade_json = nextclade_v3.nextclade_json,
pango_lineage_report = pangolin4.pango_lineage_report,
pangolin_docker = pangolin4.pangolin_docker,
theiacov_fasta_analysis_date = version_capture.date,
@@ -88,8 +85,8 @@ workflow theiacov_fasta_batch {
# Pangolin outputs
File? pango_lineage_report = pangolin4.pango_lineage_report
# Nextclade outputs
File? nextclade_json = nextclade.nextclade_json
File? nextclade_tsv = nextclade.nextclade_tsv
File? nextclade_json = nextclade_v3.nextclade_json
File? nextclade_tsv = nextclade_v3.nextclade_tsv
# Wrangling outputs
File datatable = sm_theiacov_fasta_wrangling.terra_table
}
22 changes: 8 additions & 14 deletions workflows/theiacov/wf_theiacov_illumina_pe.wdl
@@ -45,7 +45,6 @@ workflow theiacov_illumina_pe {
Float consensus_min_freq = 0.6 # minimum frequency for a variant to be called as SNP in consensus genome
Float variant_min_freq = 0.6 # minimum frequency for a variant to be reported in ivar outputs
# nextclade inputs
String? nextclade_dataset_reference
String? nextclade_dataset_tag
String? nextclade_dataset_name
# vadr parameters
@@ -72,7 +71,6 @@ workflow theiacov_illumina_pe {
reference_gff_file = reference_gff,
reference_genome = reference_genome,
genome_length_input = genome_length,
nextclade_dataset_reference_input = nextclade_dataset_reference,
nextclade_dataset_tag_input = nextclade_dataset_tag,
nextclade_dataset_name_input = nextclade_dataset_name,
vadr_max_length = vadr_max_length,
@@ -182,7 +180,6 @@ workflow theiacov_illumina_pe {
reference_gff_file = reference_gff,
reference_genome = reference_genome,
genome_length_input = genome_length,
nextclade_dataset_reference_input = nextclade_dataset_reference,
nextclade_dataset_tag_input = nextclade_dataset_tag,
nextclade_dataset_name_input = nextclade_dataset_name,
vadr_max_length = vadr_max_length,
@@ -201,7 +198,6 @@ workflow theiacov_illumina_pe {
reference_gff_file = reference_gff,
reference_genome = reference_genome,
genome_length_input = genome_length,
nextclade_dataset_reference_input = nextclade_dataset_reference,
nextclade_dataset_tag_input = nextclade_dataset_tag,
nextclade_dataset_name_input = nextclade_dataset_name,
vadr_max_length = vadr_max_length,
@@ -239,26 +235,24 @@ workflow theiacov_illumina_pe {
# run organism-specific typing
if (organism_parameters.standardized_organism == "MPXV" || organism_parameters.standardized_organism == "sars-cov-2" || (organism_parameters.standardized_organism == "flu" && defined(irma.seg_ha_assembly) && ! defined(do_not_run_flu_ha_nextclade))) {
# tasks specific to either MPXV, sars-cov-2, or flu
call nextclade_task.nextclade {
call nextclade_task.nextclade_v3 {
input:
genome_fasta = select_first([irma.seg_ha_assembly, ivar_consensus.assembly_fasta]),
dataset_name = select_first([set_flu_ha_nextclade_values.nextclade_dataset_name, organism_parameters.nextclade_dataset_name]),
dataset_reference = select_first([set_flu_ha_nextclade_values.nextclade_dataset_reference, organism_parameters.nextclade_dataset_reference]),
dataset_tag = select_first([set_flu_ha_nextclade_values.nextclade_dataset_tag, organism_parameters.nextclade_dataset_tag])
}
call nextclade_task.nextclade_output_parser {
input:
nextclade_tsv = nextclade.nextclade_tsv,
nextclade_tsv = nextclade_v3.nextclade_tsv,
organism = organism_parameters.standardized_organism
}
}
if (organism_parameters.standardized_organism == "flu" && defined(irma.seg_na_assembly) && ! defined(do_not_run_flu_na_nextclade)) {
# tasks specific to flu NA - run nextclade a second time
call nextclade_task.nextclade as nextclade_flu_na {
call nextclade_task.nextclade_v3 as nextclade_flu_na {
input:
genome_fasta = select_first([irma.seg_na_assembly]),
dataset_name = select_first([set_flu_na_nextclade_values.nextclade_dataset_name, organism_parameters.nextclade_dataset_name]),
dataset_reference = select_first([set_flu_na_nextclade_values.nextclade_dataset_reference, organism_parameters.nextclade_dataset_reference]),
dataset_tag = select_first([set_flu_na_nextclade_values.nextclade_dataset_tag, organism_parameters.nextclade_dataset_tag])
}
call nextclade_task.nextclade_output_parser as nextclade_output_parser_flu_na {
@@ -422,11 +416,11 @@ workflow theiacov_illumina_pe {
String? pangolin_docker = pangolin4.pangolin_docker
String? pangolin_versions = pangolin4.pangolin_versions
# Nextclade outputs
String nextclade_json = select_first([nextclade.nextclade_json, ""])
String auspice_json = select_first([ nextclade.auspice_json, ""])
String nextclade_tsv = select_first([nextclade.nextclade_tsv, ""])
String nextclade_version = select_first([nextclade.nextclade_version, ""])
String nextclade_docker = select_first([nextclade.nextclade_docker, ""])
String nextclade_json = select_first([nextclade_v3.nextclade_json, ""])
String auspice_json = select_first([ nextclade_v3.auspice_json, ""])
String nextclade_tsv = select_first([nextclade_v3.nextclade_tsv, ""])
String nextclade_version = select_first([nextclade_v3.nextclade_version, ""])
String nextclade_docker = select_first([nextclade_v3.nextclade_docker, ""])
String nextclade_ds_tag = select_first([ha_na_nextclade_ds_tag, set_flu_ha_nextclade_values.nextclade_dataset_tag, organism_parameters.nextclade_dataset_tag, ""])
String nextclade_aa_subs = select_first([ha_na_nextclade_aa_subs, nextclade_output_parser.nextclade_aa_subs, ""])
String nextclade_aa_dels = select_first([ha_na_nextclade_aa_dels, nextclade_output_parser.nextclade_aa_dels, ""])