
Error when updating PopPUNK database without reference data #325

Open
HarryHung opened this issue Sep 11, 2024 · 4 comments

Comments

@HarryHung
Contributor

Versions
poppunk 2.6.3

Command used and output returned

poppunk_assign --update-db --db GPS_v9 --query query_for_v9_1_update.txt --external-clustering GPS_v9_external_clusters.csv  --output GPS_v9_1 --threads 32

An updated database is generated in the output directory, accompanied by this error:

Traceback (most recent call last):
  File "/opt/conda/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 726, in assign_query_hdf5
    with open(ref_file_name) as refFile:
FileNotFoundError: [Errno 2] No such file or directory: 'GPS_v9/GPS_v9.refs'

Describe the bug

This is related to using a full-size-only database (one with no pruned reference set) for GPS Project GPSC assignment and its updates.

The source database does not contain any reference data:

tree GPS_v9

GPS_v9
├── GPS_v9_clusters.csv
├── GPS_v9.dists.npy
├── GPS_v9.dists.pkl
├── GPS_v9_fit.npz
├── GPS_v9_fit.pkl
├── GPS_v9_graph.gt
├── GPS_v9.h5
└── GPS_v9_unword_clusters.csv

The database works fine for normal assignment, thanks to the automatic fallback in a typical poppunk_assign run:

If the .refs file is missing, all of the samples in the sketch database will be used in the distance calculations.

PopPUNK/PopPUNK/assign.py

Lines 455 to 468 in 38b5d18

use_ref_graph = \
    os.path.isfile(ref_file_name) and not update_db and model.type != 'lineage'
if use_ref_graph:
    with open(ref_file_name) as refFile:
        for reference in refFile:
            rNames.append(reference.rstrip())
else:
    if os.path.isfile(distances + ".pkl"):
        rNames = readPickle(distances, enforce_self = True, distances=False)[0]
    elif update_db:
        sys.stderr.write("Distance order .pkl missing, cannot use --update-db\n")
        sys.exit(1)
    else:
        rNames = getSeqsInDb(os.path.join(ref_db, os.path.basename(ref_db) + ".h5"))

However, when running --update-db, even though the updated full-size database is saved in the first half of the function, the later part of the function does not check whether the .refs file exists before attempting clique pruning, causing the error.

PopPUNK/PopPUNK/assign.py

Lines 755 to 791 in 38b5d18

# Clique pruning
if model.type != 'lineage':
    existing_ref_list = []
    with open(ref_file_name) as refFile:
        for reference in refFile:
            existing_ref_list.append(reference.rstrip())
    # Extract references from graph
    newRepresentativesIndices, newRepresentativesNames, \
        newRepresentativesFile, genomeNetwork = \
            extractReferences(genomeNetwork,
                              combined_seq,
                              output,
                              outSuffix = file_extension_string,
                              existingRefs = existing_ref_list,
                              type_isolate = qc_dict['type_isolate'],
                              threads = threads,
                              use_gpu = gpu_graph)
    # intersection that maintains order
    newQueries = [x for x in qNames if x in frozenset(newRepresentativesNames)]
    # could also have newRepresentativesNames in this diff (should be the same) - but want
    # to ensure consistency with the network in case of bad input/bugs
    nodes_to_remove = set(range(len(combined_seq))).difference(newRepresentativesIndices)
    names_to_remove = [combined_seq[n] for n in nodes_to_remove]
    if (len(names_to_remove) > 0):
        graph_suffix = file_extension_string + '.refs_graph'
        save_network(genomeNetwork,
                     prefix = output,
                     suffix = graph_suffix,
                     use_gpu = gpu_graph)
        removeFromDB(output, output, names_to_remove)
        db_suffix = file_extension_string + '.refs.h5'
        os.rename(output + "/" + os.path.basename(output) + ".tmp.h5",
                  output + "/" + os.path.basename(output) + db_suffix)

My question is: is the updated database safe to use despite the error? If so, could you please update the poppunk_assign --update-db process to handle this situation more gracefully?

Thanks!

@johnlees
Member

Can you try with PopPUNK 2.7.0 and see if you get the same? I'm also working on a 'fast' update that should meet many of your preferred criteria; it will become available in 2.7.1, but it might be another couple of months before that is ready.

@HarryHung
Contributor Author

Sadly, the result is even worse on 2.7.0.

I created two fresh conda environments for a fair comparison.

The error output of 2.6.3:

Traceback (most recent call last):
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 726, in assign_query_hdf5
    with open(ref_file_name) as refFile:
FileNotFoundError: [Errno 2] No such file or directory: 'GPS_v9/GPS_v9.refs'

The output directory of 2.6.3:

GPS_v9_1-pp2.6.3/
├── GPS_v9_1-pp2.6.3_clusters.csv
├── GPS_v9_1-pp2.6.3.dists.npy
├── GPS_v9_1-pp2.6.3.dists.pkl
├── GPS_v9_1-pp2.6.3_external_clusters.csv
├── GPS_v9_1-pp2.6.3_fit.npz
├── GPS_v9_1-pp2.6.3_fit.pkl
├── GPS_v9_1-pp2.6.3_graph.gt
├── GPS_v9_1-pp2.6.3.h5
└── GPS_v9_1-pp2.6.3_unword_clusters.csv

0 directories, 9 files

The error output of 2.7.0:

Traceback (most recent call last):
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 213, in main
    assign_query(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 297, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 646, in assign_query_hdf5
    {'combined': printClusters(genomeNetwork,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/network.py", line 1520, in printClusters
    unword = next(unword_generator)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 31, in gen_unword
    word += "".join(syllable()[0]())
                    ^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 20, in <lambda>
    cv = lambda: consonant() + vowel()
                 ^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 19, in <lambda>
    consonant = lambda: random.sample(consonants, 1)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/random.py", line 413, in sample
    raise TypeError("Population must be a sequence.  "
TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

The output directory of 2.7.0:

GPS_v9_1-pp2.7.0/
└── GPS_v9_1-pp2.7.0.h5

0 directories, 1 file

2.7.0 raises a different error and leaves an incomplete output database, while 2.6.3 at least appears to have generated a complete updated database.

I understand a new version of PopPUNK could take a while to come out, especially for our niche use case. I am just wondering: do you think the output of 2.6.3 is safe to use despite the error?

Thank you!

@johnlees
Member

The error above in 2.7.0 was also reported in #321 (it seems to be related to newer versions of Python, as the affected code hadn't changed). I have PR #322 being prepared to fix it.

You could:

  • Make the quick fix suggested here and try again: --fit-model error #321 (comment)
  • Wait until that PR is merged, which will become v2.7.1 (although to manage expectations, I think this will actually be Nov/Dec before it's done)
  • Use the output from 2.6.3, which, as you diagnosed, I think should be fine if you just want the assignments. I am not confident it would create a correct updated database for further assignment, however, and I would need to check this further (which I don't currently have capacity to do)
  • As another workaround, add a .refs file which just lists all the sample names
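For the last option, the .refs file is just a plain-text list of sample names, one per line. One way to build it is from the database's _clusters.csv, assuming its first column holds the sample names under a Taxon,Cluster header (as in the GPS_v9 listing above); a toy CSV stands in for the real file here:

```shell
# Toy stand-in for GPS_v9/GPS_v9_clusters.csv; use the real file in practice.
printf 'Taxon,Cluster\nsample1,1\nsample2,2\n' > GPS_v9_clusters.csv

# Drop the header row and keep the first column (the sample names).
tail -n +2 GPS_v9_clusters.csv | cut -d, -f1 > GPS_v9.refs
```

The resulting GPS_v9.refs would then sit alongside GPS_v9.h5 in the database directory.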

@HarryHung
Contributor Author

HarryHung commented Sep 12, 2024

I think the last suggestion is the safest and least finicky route before v2.7.1 is out, so I tested it (using v2.6.3, as it works without further modification).

Adding a .refs file that lists all samples allows poppunk_assign --update-db to complete without error.

A small side effect is that "reference" data is generated in the output directory, which is actually identical in size to the full-size data (easily solved by running rm *.refs* in the output directory afterwards).

I further cross-checked this newly generated database against the previously generated database (the one generated without .refs and prematurely terminated by error).

All non-CSV files have identical md5sum results except the .h5 file. Upon further inspection with h5diff, the contents of both .h5 files are identical despite the differing md5sums; h5diff is generally considered more reliable here, as it ignores metadata differences. For the CSV files, the differences are caused by a different sort order: the content is identical after sorting the rows (except _unword_clusters.csv, but I think that does not matter as long as _clusters.csv is okay?).
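The row-order-insensitive CSV comparison can be sketched like this (the file names are illustrative, with toy CSVs standing in for the real cluster files):

```shell
# Two toy cluster CSVs with identical content in a different row order;
# their md5sums differ, but sorting the rows shows the content matches.
printf 'Taxon,Cluster\na,1\nb,2\n' > run1_clusters.csv
printf 'Taxon,Cluster\nb,2\na,1\n' > run2_clusters.csv

sort run1_clusters.csv > run1.sorted
sort run2_clusters.csv > run2.sorted
diff run1.sorted run2.sorted && echo "identical after sorting"
```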

So I think the existing updated database is suitable for further assignment. Of course, I could just replace it with the database generated without error, just in case.
