[Improvement suggestion] Optimization of step 4 in phagcn #56

Open
valentynbez opened this issue Feb 3, 2025 · 3 comments

@valentynbez

Currently, I am trying to classify ~200k vOTUs. The software has been stuck for two weeks.
The issue lies in this chunk of code:

logger.info("[4/8] generating phagcn networks...")
# FLAGS: no proteins aligned to the database
if os.path.getsize(f'{rootpth}/{midfolder}/db_results.abc') == 0:
    Accession = []
    Length_list = []
    for record in SeqIO.parse(f'{contigs}', 'fasta'):
        Accession.append(record.id)
        Length_list.append(len(record.seq))
    df = pd.DataFrame({"Accession": Accession, "Length": Length_list, "Lineage":['unknown']*len(Accession), "PhaGCNScore":[0]*len(Accession), "Genus": ['-']*len(Accession), "GenusCluster": ['-']*len(Accession)})
    df.to_csv(f"{rootpth}/{out_dir}/phagcn_prediction.tsv", index = None, sep='\t')
    exit()
# add the genome size
genome_size = defaultdict(int)
for index, r in enumerate(SeqIO.parse(f'{rootpth}/{midfolder}/ALLprotein.fa', 'fasta')):
    genome_id = r.id.rsplit('_', 1)[0]
    genome_size[genome_id] += 1
compute_aai(f'{rootpth}/{midfolder}', 'db_results', genome_size)
compute_aai(f'{rootpth}/{midfolder}', 'self_results', genome_size)
# filter the network
df1 = pd.read_csv(f'{rootpth}/{midfolder}/db_results_aai.tsv', sep='\t')
df2 = pd.read_csv(f'{rootpth}/{midfolder}/self_results_aai.tsv', sep='\t')
df3 = pd.read_csv(f'{db_dir}/database_aai.tsv', sep='\t')
df = pd.concat([df1, df2, df3])
sub_df = df[((df['aai']>=aai)&((df['qcov']>=pcov)|(df['tcov']>=pcov)|(df['sgenes']>=share)))].copy()
sub_df['score'] = sub_df['aai']/100.0 * sub_df[['qcov', 'tcov']].max(axis=1)/100.0
# write the network
sub_df.drop(['qcov', 'tcov', 'qgenes', 'tgenes', 'sgenes', 'aai'], axis=1, inplace=True)
sub_df.to_csv(f'{rootpth}/{midfolder}/phagcn_network.tsv', sep='\t', index=False, header=False)
# export the weighted edge list for the supplementary output
sub_df.rename(columns={'query':'Source', 'target':'Target', 'score':'Weight'}, inplace=True)
sub_df.to_csv(f"{rootpth}/{out_dir}/{supplementary}/phagcn_network_edges.tsv", index=False, sep='\t')
run_command(f"mcl {rootpth}/{midfolder}/phagcn_network.tsv -te {threads} -I 2.0 --abc -o {rootpth}/{midfolder}/phagcn_genus_clusters.txt > /dev/null 2>&1")

Optimizing the underlying steps would be helpful for large-scale data.
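
For example, the three AAI tables could be filtered chunk by chunk before concatenation, so the unfiltered edge list never has to sit in memory at once. A rough sketch, untested; the threshold variables and column names are taken from the code above:

import itertools
import pandas as pd

def filtered_edges(path, aai, pcov, share, chunksize=5_000_000):
    # Yield only rows that pass the network thresholds, so the
    # full unfiltered table is never materialized in memory.
    for chunk in pd.read_csv(path, sep='\t', chunksize=chunksize):
        mask = (chunk['aai'] >= aai) & ((chunk['qcov'] >= pcov) |
                                        (chunk['tcov'] >= pcov) |
                                        (chunk['sgenes'] >= share))
        yield chunk[mask]

paths = [f'{rootpth}/{midfolder}/db_results_aai.tsv',
         f'{rootpth}/{midfolder}/self_results_aai.tsv',
         f'{db_dir}/database_aai.tsv']
sub_df = pd.concat(itertools.chain.from_iterable(
    filtered_edges(p, aai, pcov, share) for p in paths))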

@KennthShang
Owner

Hi @valentynbez ,

In our own test, a FASTA file of one million sequences (each >10 kbp) took no more than five hours to produce predictions with 40 threads on our HPC.

I am not sure what happened to your run, but you can check whether the program is still making progress by seeing whether intermediate files are still being generated in your 'midfolder'. If not, it may be stuck because of the memory system (I suppose).

But in any case, you can first split your vOTUs into smaller files and run them separately, as in the sketch below. Hope this helps.
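
A minimal splitting sketch (the file names and chunk size are just placeholders; it assumes Biopython, which the pipeline already uses):

from Bio import SeqIO

# Split a large vOTU FASTA into files of 20,000 records each
# (the chunk size is arbitrary; adjust it to your data)
chunk, idx = [], 0
for record in SeqIO.parse('vOTUs.fasta', 'fasta'):
    chunk.append(record)
    if len(chunk) == 20000:
        SeqIO.write(chunk, f'vOTUs_part{idx}.fasta', 'fasta')
        chunk, idx = [], idx + 1
if chunk:
    SeqIO.write(chunk, f'vOTUs_part{idx}.fasta', 'fasta')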

Best,
Jiayu

@valentynbez
Author

valentynbez commented Feb 3, 2025

Thanks for your answer.
Yes, the program is still running, and self_results is a huge file (4,235,658,962 rows in my dataset) that is processed inefficiently by pandas on a single thread inside the compute_aai function. It is not a memory issue in this case.
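
To illustrate what I mean, the per-pair aggregation could be streamed chunk by chunk with vectorized groupbys instead of loading everything at once. A rough sketch; the separator and column names below are guesses, not the actual compute_aai internals:

import pandas as pd
from collections import defaultdict

# Hypothetical streaming version: accumulate identity sums and
# shared-gene counts per genome pair without materializing all rows
ident_sum, gene_count = defaultdict(float), defaultdict(int)
for chunk in pd.read_csv('self_results.tsv', sep='\t',
                         names=['query', 'target', 'identity'],
                         chunksize=10_000_000):
    # collapse protein IDs to genome IDs, as the pipeline does elsewhere
    chunk['q'] = chunk['query'].str.rsplit('_', n=1).str[0]
    chunk['t'] = chunk['target'].str.rsplit('_', n=1).str[0]
    agg = chunk.groupby(['q', 't'])['identity'].agg(['sum', 'count'])
    for pair, row in agg.iterrows():
        ident_sum[pair] += row['sum']
        gene_count[pair] += row['count']

# AAI per genome pair = mean identity over the shared genes
aai_per_pair = {pair: ident_sum[pair] / gene_count[pair] for pair in ident_sum}
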
Could your local version be different from the one in the repo?

@KennthShang
Owner

Sure, the code is indeed the same.

Another potential solution, I guess, is to replace the original to_csv call with the code below:

import csv

# Save as TSV using the csv module instead of DataFrame.to_csv
# (tmp_df, pth, and file_name refer to the variables inside compute_aai)
with open(f'{pth}/{file_name}_aai.tsv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter='\t')
    # Write the header
    writer.writerow(tmp_df.columns)
    # Write the data
    writer.writerows(tmp_df.values)
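
One caveat on this sketch: writer.writerows(tmp_df.values) converts the frame to a NumPy object array first, so it trades pandas' to_csv machinery for the standard csv module; whether that actually removes the bottleneck depends on whether the time is spent in the write step or in the aggregation before it.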
