[Improvement suggestion] Optimization of step 4 in phagcn #56

Open
valentynbez opened this issue Feb 3, 2025 · 3 comments

@valentynbez

Currently, I am trying to classify ~200k vOTUs. The software has been stuck for two weeks.
The issue lies in this chunk of code:

logger.info("[4/8] generating phagcn networks...")
# FLAGS: no proteins aligned to the database
if os.path.getsize(f'{rootpth}/{midfolder}/db_results.abc') == 0:
    Accession = []
    Length_list = []
    for record in SeqIO.parse(f'{contigs}', 'fasta'):
        Accession.append(record.id)
        Length_list.append(len(record.seq))
    df = pd.DataFrame({"Accession": Accession, "Length": Length_list, "Lineage":['unknown']*len(Accession), "PhaGCNScore":[0]*len(Accession), "Genus": ['-']*len(Accession), "GenusCluster": ['-']*len(Accession)})
    df.to_csv(f"{rootpth}/{out_dir}/phagcn_prediction.tsv", index = None, sep='\t')
    exit()
# add the genome size
genome_size = defaultdict(int)
for index, r in enumerate(SeqIO.parse(f'{rootpth}/{midfolder}/ALLprotein.fa', 'fasta')):
    genome_id = r.id.rsplit('_', 1)[0]
    genome_size[genome_id] += 1
compute_aai(f'{rootpth}/{midfolder}', 'db_results', genome_size)
compute_aai(f'{rootpth}/{midfolder}', 'self_results', genome_size)
# filter the network
df1 = pd.read_csv(f'{rootpth}/{midfolder}/db_results_aai.tsv', sep='\t')
df2 = pd.read_csv(f'{rootpth}/{midfolder}/self_results_aai.tsv', sep='\t')
df3 = pd.read_csv(f'{db_dir}/database_aai.tsv', sep='\t')
df = pd.concat([df1, df2, df3])
sub_df = df[((df['aai']>=aai)&((df['qcov']>=pcov)|(df['tcov']>=pcov)|(df['sgenes']>=share)))].copy()
sub_df['score'] = sub_df['aai']/100.0 * sub_df[['qcov', 'tcov']].max(axis=1)/100.0
# write the network
sub_df.drop(['qcov', 'tcov', 'qgenes', 'tgenes', 'sgenes', 'aai'], axis=1, inplace=True)
sub_df.to_csv(f'{rootpth}/{midfolder}/phagcn_network.tsv', sep='\t', index=False, header=False)
# export the weighted edge list for the supplementary output
sub_df.rename(columns={'query':'Source', 'target':'Target', 'score':'Weight'}, inplace=True)
sub_df.to_csv(f"{rootpth}/{out_dir}/{supplementary}/phagcn_network_edges.tsv", index=False, sep='\t')
run_command(f"mcl {rootpth}/{midfolder}/phagcn_network.tsv -te {threads} -I 2.0 --abc -o {rootpth}/{midfolder}/phagcn_genus_clusters.txt > /dev/null 2>&1")

Optimizing the underlying steps would be helpful for large-scale data.
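
For example, the three AAI tables could be filtered chunk by chunk before concatenation, so the unfiltered edge list never has to sit in memory at once. A rough sketch, untested; the threshold variables and column names are taken from the code above:

import itertools
import pandas as pd

def filtered_edges(path, aai, pcov, share, chunksize=5_000_000):
    # Yield only rows that pass the network thresholds, so the
    # full unfiltered table is never materialized in memory.
    for chunk in pd.read_csv(path, sep='\t', chunksize=chunksize):
        mask = (chunk['aai'] >= aai) & ((chunk['qcov'] >= pcov) |
                                        (chunk['tcov'] >= pcov) |
                                        (chunk['sgenes'] >= share))
        yield chunk[mask]

paths = [f'{rootpth}/{midfolder}/db_results_aai.tsv',
         f'{rootpth}/{midfolder}/self_results_aai.tsv',
         f'{db_dir}/database_aai.tsv']
sub_df = pd.concat(itertools.chain.from_iterable(
    filtered_edges(p, aai, pcov, share) for p in paths))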

@KennthShang
Owner

Hi @valentynbez ,

In our own test, a FASTA file of one million sequences (each >10 kbp) took no more than five hours to produce predictions with 40 threads on our HPC.

I am not sure what happened to your run, but you can check whether the program is still making progress by seeing whether intermediate files are still being generated in your 'midfolder'. If not, it may be stuck because of the memory system (I suppose).

But in any case, you can first split your vOTUs into smaller files and run them separately, as in the sketch below. Hope this helps.
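
A minimal splitting sketch (the file names and chunk size are just placeholders; it assumes Biopython, which the pipeline already uses):

from Bio import SeqIO

# Split a large vOTU FASTA into files of 20,000 records each
# (the chunk size is arbitrary; adjust it to your data)
chunk, idx = [], 0
for record in SeqIO.parse('vOTUs.fasta', 'fasta'):
    chunk.append(record)
    if len(chunk) == 20000:
        SeqIO.write(chunk, f'vOTUs_part{idx}.fasta', 'fasta')
        chunk, idx = [], idx + 1
if chunk:
    SeqIO.write(chunk, f'vOTUs_part{idx}.fasta', 'fasta')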

Best,
Jiayu

@valentynbez
Author

valentynbez commented Feb 3, 2025

Thanks for your answer.
Yes, the program is still running, and self_results is a huge file (4,235,658,962 rows in my dataset) that is processed inefficiently by pandas on a single thread inside the compute_aai function. It is not a memory issue in this case.
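
To illustrate what I mean, the per-pair aggregation could be streamed chunk by chunk with vectorized groupbys instead of loading everything at once. A rough sketch; the separator and column names below are guesses, not the actual compute_aai internals:

import pandas as pd
from collections import defaultdict

# Hypothetical streaming version: accumulate identity sums and
# shared-gene counts per genome pair without materializing all rows
ident_sum, gene_count = defaultdict(float), defaultdict(int)
for chunk in pd.read_csv('self_results.tsv', sep='\t',
                         names=['query', 'target', 'identity'],
                         chunksize=10_000_000):
    # collapse protein IDs to genome IDs, as the pipeline does elsewhere
    chunk['q'] = chunk['query'].str.rsplit('_', n=1).str[0]
    chunk['t'] = chunk['target'].str.rsplit('_', n=1).str[0]
    agg = chunk.groupby(['q', 't'])['identity'].agg(['sum', 'count'])
    for pair, row in agg.iterrows():
        ident_sum[pair] += row['sum']
        gene_count[pair] += row['count']

# AAI per genome pair = mean identity over the shared genes
aai_per_pair = {pair: ident_sum[pair] / gene_count[pair] for pair in ident_sum}
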
Could your local version be different from the one in the repo?

@KennthShang
Owner

Sure, the code is indeed the same.

Another potential solution, I guess, is to replace the original to_csv call with the code below:

import csv

# Save as TSV using the csv module instead of DataFrame.to_csv
# (tmp_df, pth, and file_name refer to the variables inside compute_aai)
with open(f'{pth}/{file_name}_aai.tsv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter='\t')
    # Write the header
    writer.writerow(tmp_df.columns)
    # Write the data
    writer.writerows(tmp_df.values)
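
One caveat on this sketch: writer.writerows(tmp_df.values) converts the frame to a NumPy object array first, so it trades pandas' to_csv machinery for the standard csv module; whether that actually removes the bottleneck depends on whether the time is spent in the write step or in the aggregation before it.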
