
Error when updating PopPUNK database without reference data #325

Open
HarryHung opened this issue Sep 11, 2024 · 4 comments

Comments

@HarryHung
Contributor

Versions
poppunk 2.6.3

Command used and output returned

poppunk_assign --update-db --db GPS_v9 --query query_for_v9_1_update.txt --external-clustering GPS_v9_external_clusters.csv  --output GPS_v9_1 --threads 32

An updated database is generated in the output directory, accompanied by this error:

Traceback (most recent call last):
  File "/opt/conda/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/opt/conda/lib/python3.10/site-packages/PopPUNK/assign.py", line 726, in assign_query_hdf5
    with open(ref_file_name) as refFile:
FileNotFoundError: [Errno 2] No such file or directory: 'GPS_v9/GPS_v9.refs'

Describe the bug

This is related to using a full-size-only database (one with no pruned reference set) for GPS Project GPSC assignment and its updates.

The source database does not contain any reference data:

tree GPS_v9

GPS_v9
├── GPS_v9_clusters.csv
├── GPS_v9.dists.npy
├── GPS_v9.dists.pkl
├── GPS_v9_fit.npz
├── GPS_v9_fit.pkl
├── GPS_v9_graph.gt
├── GPS_v9.h5
└── GPS_v9_unword_clusters.csv

The database works fine for normal assignment, thanks to the automatic fallback in a typical poppunk_assign run:

If the .refs file is missing, all of the samples in the sketch database will be used in the distance calculations.

PopPUNK/PopPUNK/assign.py

Lines 455 to 468 in 38b5d18

use_ref_graph = \
    os.path.isfile(ref_file_name) and not update_db and model.type != 'lineage'
if use_ref_graph:
    with open(ref_file_name) as refFile:
        for reference in refFile:
            rNames.append(reference.rstrip())
else:
    if os.path.isfile(distances + ".pkl"):
        rNames = readPickle(distances, enforce_self = True, distances=False)[0]
    elif update_db:
        sys.stderr.write("Distance order .pkl missing, cannot use --update-db\n")
        sys.exit(1)
    else:
        rNames = getSeqsInDb(os.path.join(ref_db, os.path.basename(ref_db) + ".h5"))

However, when running --update-db, even though the updated full-size database is saved in the first half of the function, the later part of the function does not check whether the .refs file exists before attempting clique pruning, causing the error.

PopPUNK/PopPUNK/assign.py

Lines 755 to 791 in 38b5d18

# Clique pruning
if model.type != 'lineage':
    existing_ref_list = []
    with open(ref_file_name) as refFile:
        for reference in refFile:
            existing_ref_list.append(reference.rstrip())
    # Extract references from graph
    newRepresentativesIndices, newRepresentativesNames, \
        newRepresentativesFile, genomeNetwork = \
            extractReferences(genomeNetwork,
                              combined_seq,
                              output,
                              outSuffix = file_extension_string,
                              existingRefs = existing_ref_list,
                              type_isolate = qc_dict['type_isolate'],
                              threads = threads,
                              use_gpu = gpu_graph)
    # intersection that maintains order
    newQueries = [x for x in qNames if x in frozenset(newRepresentativesNames)]
    # could also have newRepresentativesNames in this diff (should be the same) - but want
    # to ensure consistency with the network in case of bad input/bugs
    nodes_to_remove = set(range(len(combined_seq))).difference(newRepresentativesIndices)
    names_to_remove = [combined_seq[n] for n in nodes_to_remove]
    if (len(names_to_remove) > 0):
        graph_suffix = file_extension_string + '.refs_graph'
        save_network(genomeNetwork,
                     prefix = output,
                     suffix = graph_suffix,
                     use_gpu = gpu_graph)
        removeFromDB(output, output, names_to_remove)
        db_suffix = file_extension_string + '.refs.h5'
        os.rename(output + "/" + os.path.basename(output) + ".tmp.h5",
                  output + "/" + os.path.basename(output) + db_suffix)

My question is: is the updated database safe to use despite the error? If so, could you please update the poppunk_assign --update-db process to handle this situation more gracefully?

Thanks!

@johnlees
Member

Can you try with PopPUNK 2.7.0 and see if you get the same? I'm also working on a 'fast' update that should meet many of your preferred criteria; it will become available in 2.7.1, but it might be another couple of months before that is ready.

@HarryHung
Contributor Author

Sadly, the result is even worse on 2.7.0.

I created two fresh conda environments for a fair comparison.

The error output of 2.6.3:

Traceback (most recent call last):
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk_2.6.3/lib/python3.10/site-packages/PopPUNK/assign.py", line 726, in assign_query_hdf5
    with open(ref_file_name) as refFile:
FileNotFoundError: [Errno 2] No such file or directory: 'GPS_v9/GPS_v9.refs'

The output directory of 2.6.3:

GPS_v9_1-pp2.6.3/
├── GPS_v9_1-pp2.6.3_clusters.csv
├── GPS_v9_1-pp2.6.3.dists.npy
├── GPS_v9_1-pp2.6.3.dists.pkl
├── GPS_v9_1-pp2.6.3_external_clusters.csv
├── GPS_v9_1-pp2.6.3_fit.npz
├── GPS_v9_1-pp2.6.3_fit.pkl
├── GPS_v9_1-pp2.6.3_graph.gt
├── GPS_v9_1-pp2.6.3.h5
└── GPS_v9_1-pp2.6.3_unword_clusters.csv

0 directories, 9 files

The error output of 2.7.0:

Traceback (most recent call last):
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 213, in main
    assign_query(dbFuncs,
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 297, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/assign.py", line 646, in assign_query_hdf5
    {'combined': printClusters(genomeNetwork,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/network.py", line 1520, in printClusters
    unword = next(unword_generator)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 31, in gen_unword
    word += "".join(syllable()[0]())
                    ^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 20, in <lambda>
    cv = lambda: consonant() + vowel()
                 ^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/site-packages/PopPUNK/unwords.py", line 19, in <lambda>
    consonant = lambda: random.sample(consonants, 1)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/pathogen/miniconda/users/ch31/envs/poppunk/lib/python3.12/random.py", line 413, in sample
    raise TypeError("Population must be a sequence.  "
TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

The output directory of 2.7.0:

GPS_v9_1-pp2.7.0/
└── GPS_v9_1-pp2.7.0.h5

0 directories, 1 file

2.7.0 raises a different error and leaves an incomplete output database, while 2.6.3 at least appears to have generated a complete updated database.

I understand a new version of PopPUNK could take a while to come out, especially for our niche use case. I am just wondering: do you think the output of 2.6.3 is safe to use despite the error?

Thank you!

@johnlees
Member

The error above in 2.7.0 was also reported in #321 (it seems to be related to newer versions of Python, as the affected code hadn't changed). I have PR #322 being prepared to fix it.

You could:

  • Make the quick fix suggested here and try again: --fit-model error #321 (comment)
  • Wait until that PR is merged, which will become v2.7.1 (although to manage expectations, I think this will actually be Nov/Dec before it's done)
  • Use the output from 2.6.3, which, as you diagnosed, I think should be fine if you just want the assignments. I am not confident it would create a correct updated database for further assignment, however, and I would need to check this further (which I don't currently have capacity to do)
  • As another workaround, add a .refs file which just lists all the sample names
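For the last option, the .refs file is just a plain-text list of sample names, one per line. One way to build it is from the database's _clusters.csv, assuming its first column holds the sample names under a Taxon,Cluster header (as in the GPS_v9 listing above); a toy CSV stands in for the real file here:

```shell
# Toy stand-in for GPS_v9/GPS_v9_clusters.csv; use the real file in practice.
printf 'Taxon,Cluster\nsample1,1\nsample2,2\n' > GPS_v9_clusters.csv

# Drop the header row and keep the first column (the sample names).
tail -n +2 GPS_v9_clusters.csv | cut -d, -f1 > GPS_v9.refs
```

The resulting GPS_v9.refs would then sit alongside GPS_v9.h5 in the database directory.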

@HarryHung
Contributor Author

HarryHung commented Sep 12, 2024

I think the last suggestion is the safest and least finicky route before v2.7.1 is out, so I tested it (using v2.6.3, as it works without further modification).

Adding a .refs file that lists all samples allows poppunk_assign --update-db to complete without error.

A small side effect is that "reference" data is generated in the output directory, which is actually identical in size to the full-size data (easily solved by running rm *.refs* in the output directory afterwards).

I further cross-checked this newly generated database against the previously generated database (the one generated without .refs and prematurely terminated by error).

All non-CSV files have identical md5sum results except the .h5 file. Upon further inspection with h5diff, the contents of both .h5 files are identical despite the differing md5sums; h5diff is generally considered more reliable here, as it ignores metadata differences. For the CSV files, the differences are caused by a different sort order: the content is identical after sorting the rows (except _unword_clusters.csv, but I think that does not matter as long as _clusters.csv is okay?).
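The row-order-insensitive CSV comparison can be sketched like this (the file names are illustrative, with toy CSVs standing in for the real cluster files):

```shell
# Two toy cluster CSVs with identical content in a different row order;
# their md5sums differ, but sorting the rows shows the content matches.
printf 'Taxon,Cluster\na,1\nb,2\n' > run1_clusters.csv
printf 'Taxon,Cluster\nb,2\na,1\n' > run2_clusters.csv

sort run1_clusters.csv > run1.sorted
sort run2_clusters.csv > run2.sorted
diff run1.sorted run2.sorted && echo "identical after sorting"
```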

So I think the existing updated database is suitable for further assignment. Of course, I could just replace it with the database generated without error, just in case.
