Fix protein download #167
base: dev
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.3.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
05b4e49 to
cd47447
Compare
| # TODO: Maybe sys.exit(1) | ||
| return False | ||
| return True |
I think it should still return an error if a letter is not in the extended alphabet. Or why did you remove this?
Maybe you can use both alphabets here:
- return an error if a letter is not in the extended alphabet
- return True for counting proteins with letters not in the non-extended one
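A minimal sketch of how the two-alphabet idea above could look. The alphabet contents and the name classify_sequence are assumptions made for illustration, not the pipeline's actual definitions, and the sketch returns a label rather than raising, so the caller can decide whether "invalid" should be treated as an error.

```python
# Assumed alphabets: the 20 standard amino acids plus the common extended
# one-letter codes. The pipeline's own alphabet lists may differ.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")
EXTENDED_AA = STANDARD_AA | frozenset("BJOUXZ")


def classify_sequence(sequence):
    """Hypothetical helper: label a protein sequence by the alphabet it fits.

    Returns "invalid" if any letter is outside the extended alphabet,
    "extended" if all letters are in the extended alphabet but at least one
    is outside the standard one, and "standard" otherwise.
    """
    letters = set(sequence.upper())
    if not letters <= EXTENDED_AA:
        return "invalid"
    if not letters <= STANDARD_AA:
        return "extended"
    return "standard"
```

With a three-way label like this, the count of "extended" proteins can go to the log while "invalid" ones are handled separately.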
I removed this because otherwise the pipeline would stop entirely just because one protein has invalid AAs in its sequence. This function is only used for debug logging (how many proteins are valid based on the extended alphabet), so I think just having it in the log should be okay. Imagine we have over 1M proteins: by chance there will be some invalid protein sequences (that couldn't be removed by hand), so I think it's even necessary to keep the pipeline running here.
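The log-and-continue behaviour described here could be sketched roughly as follows. This is plain Python rather than the DataFrame code in the PR; filter_valid_proteins is a hypothetical name, and the validate_letters body and the alphabet contents are assumptions that only mirror the call shape used in bin/generate_peptides.py.

```python
import sys

# Assumed extended alphabet; the pipeline's aa_list_extended may differ.
EXTENDED_AA = frozenset("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def validate_letters(sequence, alphabet):
    """Return True if every letter of the sequence is in the alphabet."""
    return all(letter in alphabet for letter in sequence)


def filter_valid_proteins(proteins, alphabet=EXTENDED_AA):
    """Hypothetical helper: keep valid proteins and report how many were dropped.

    proteins is a list of (protein_id, sequence) tuples. Invalid entries are
    counted and logged instead of aborting the run.
    """
    valid = [(pid, seq) for pid, seq in proteins if validate_letters(seq, alphabet)]
    dropped = len(proteins) - len(valid)
    print(f"Info: {len(valid)} valid proteins.", file=sys.stderr)
    print(f"Info: {dropped} proteins have invalid amino acids.", file=sys.stderr)
    return valid, dropped
```

Returning the dropped count alongside the filtered list keeps the decision to abort, for example above some threshold, with the caller.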
So did this happen with your data? I guess it depends on the QC of the NCBI protein upload and how this is handled.
bin/generate_peptides.py
Outdated
| #################### | ||
| # generate peptides | ||
| # generate peptides (Filter out all peptides with invalid letters) |
| # generate peptides (Filter out all peptides with invalid letters) | |
| # generate peptides (Filter out all peptides with invalid letters, i.e. containing extended AA codes) |
bin/generate_peptides.py
Outdated
| valid_proteins = protid_protseq_protlen[protid_protseq_protlen["protein_sequence"].apply(validate_letters, alphabet=aa_list_extended)] | ||
| filtered_count = len(valid_proteins) | ||
| print(f"Info: {filtered_count} valid proteins.") | ||
| print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids.") |
| print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids.") | |
| print(f"Info: {initial_count - filtered_count} proteins have invalid amino acids with an extended AA codes.") |
skrakau left a comment
Thanks @xXFloBaerXx!
I have a few small remarks regarding the code changes.
In general, different topics are mixed in this PR, e.g. you changed the error handling with Entrez, fixed the processing of chunks, and touched the issue of AAs with extended AA codes. Ideally, such different topics are separated into different PRs (better for keeping a useful commit history, for traceability, and also for reviewing).
It's not critical now, but if it's not too much work for you, could you split the Entrez error handling from the other changes (fixing the processing of chunks and handling AA codes are more closely related, I would say)?
You can also switch it from |
Hi @xXFloBaerXx, @skrakau, Line 18 in cd47447
4b89f02 to
cd47447
Compare
| #################### | ||
| # generate peptides | ||
| # generate peptides (Filter out all peptides with invalid letters, i.e. containing extended AA codes) |
This now depends on how it is handled above; as currently implemented, it could also contain letters that are not part of the extended code.
PR checklist
nextflow run . -profile test,docker --outdir <OUTDIR>
nextflow run . -profile debug,test,docker --outdir <OUTDIR>
Description of changes
Protein download improvements:
Peptide generation improvements:
Tests