Skip to content

Conversation

@skrakau
Copy link
Member

@skrakau skrakau commented Nov 28, 2025

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).

Description of changes

  • Protein download improvements:

    • Filtered out assemblies that do not contain sequences to prevent errors during downstream processing (with debug log prints).
    • Added retry logic for Entrez handles to make downloads more robust against intermittent network or server errors.
    • Protein chunks are now correctly appended rather than overwriting previous chunks in the protein list -> Fixed protein chunk overwriting.
  • Peptide generation improvements:

    • Debug logging now prints all proteins containing invalid amino acids (determined with extended amino acid alphabet).
    • Filtered out all peptides that do not match the basic amino acid alphabet to ensure compatibility with MHCflurry and maintain standardisation across prediction tools.

Tests

  • Pipeline tested with: nf-test test tests/pipeline/test_mouse.nf.test --profile test_mouse,m3c
  • Pipeline tested on a test condition with 10 Taxa that trigger all added different cases
  • Pipeline tested on a bigger dataset of around 150 taxa (For testing protein download and epitope prediction parallelisation)

@nf-core-bot
Copy link
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@github-actions
Copy link

github-actions bot commented Nov 28, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9b97e7d

+| ✅ 223 tests passed       |+
#| ❔   2 tests were ignored |#
!| ❗  23 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: conf/igenomes_ignored.config
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • schema_description - No description provided in schema for parameter: igenomes_ignore
  • local_component_structure - finalize_microbiome_entities.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - predict_epitopes.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - unify_model_lengths.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - download_proteins.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - unpack_bin_archives.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - create_protein_tsv.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - prepare_score_distribution.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - split_pred_tasks.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - generate_protein_and_entity_ids.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - merge_predictions_buffer.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - prepare_entity_binding_ratios.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - collect_stats.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - merge_predictions.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - check_samplesheet_create_tables.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - plot_entity_binding_ratios.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - plot_score_distribution.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - assign_nucl_entity_weights.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - generate_peptides.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - epytope_show_supported_models.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - process_input.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure

❔ Tests ignored:

  • files_exist - File is ignored: conf/igenomes.config
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore

✅ Tests passed:

Run details

  • nf-core/tools version 3.3.2
  • Run at 2025-12-11 10:24:29

@skrakau skrakau marked this pull request as draft November 28, 2025 14:57
Comment on lines +59 to +61
# TODO: Maybe sys.exit(1)
return False
return True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should still return an error if the letter in not in the extended alphabet. Or why did you remove this?

Maybe you can use both alphabets here

  1. return an error if a letter is not in the extended alphabet
  2. return True for counting proteins with letters not in the not-extended one

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this because then the pipeline would stop entirely, just because one Protein has invalid AAs in its alphabet. This function is just used for debug log purposes (How many proteins are valid based on Extended alphabet), so I think just having it in the log should be okay. Imagine if we have over 1m proteins, there will be by chance some invalid protein sequences (that couldn't be removed by hand), so I think its even necessary to keep the pipeline running here

Copy link
Member Author

@skrakau skrakau Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So did it happen for your data? Depends I guess on the QC of the NCBI protein upload, and how this is handled


####################
# generate peptides
# generate peptides (Filter out all peptides with invalid letters)
Copy link
Member Author

@skrakau skrakau Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# generate peptides (Filter out all peptides with invalid letters)
# generate peptides (Filter out all peptides with invalid letters, i.e. containing extended AA codes)

valid_proteins = protid_protseq_protlen[protid_protseq_protlen["protein_sequence"].apply(validate_letters, alphabet=aa_list_extended)]
filtered_count = len(valid_proteins)
print(f"Info: {filtered_count} valid proteins.")
print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids.")
print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids with an extended AA codes.")

Copy link
Member Author

@skrakau skrakau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xXFloBaerXx !
I have a few small remarks regarding the code changes.

In general, in this PR are different topics mixed, e.g. you changed the error handling with Entrez, fixed the processing of chunks and touched the issue with AAs with extended AA code. Ideally, such different topics are separated in different PRs (better for keeping a useful commit history, for the traceability and also for reviewing).
It's not critical know, but just if it's not too much work for you, could you split the Entrez error handling and the other changes (fixing processing of chunks and handling AA codes are more related I would say)?

@skrakau
Copy link
Member Author

skrakau commented Dec 5, 2025

You can also switch it from Draft PR to Ready for review when you thinks it's done.

@SusiJo
Copy link

SusiJo commented Dec 8, 2025

Hi @xXFloBaerXx, @skrakau,
Tests can be triggered when one adds 'bin/*' here:

triggers 'nextflow.config', 'nf-test.config', 'conf/test.config', 'tests/nextflow.config', 'tests/.nftignore'

@xXFloBaerXx xXFloBaerXx force-pushed the Fix_Protein_Download branch 2 times, most recently from 4b89f02 to cd47447 Compare December 11, 2025 10:02

####################
# generate peptides
# generate peptides (Filter out all peptides with invalid letters, i.e. containing extended AA codes)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this depends now how it is handled above, as it is currently implemented it could also contain letters not part of the extended code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants