I just created a Babel. Here is what babel_downloads/NCBIGene looks like:
nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 18G
drwxr-xr-x. 2 nru nru 4.0K Jan 15 21:18 .
drwxrwxrwx. 110 nobody nogroup 20K Jan 18 18:58 ..
-rw-r--r--. 1 nru nru 239M Jan 15 21:18 gene2ensembl.gz
-rw-r--r--. 1 nru nru 1.2G Jan 15 21:18 gene_info.gz
-rw-r--r--. 1 nru nru 88M Jan 15 21:18 gene_orthologs.gz
-rw-r--r--. 1 nru nru 1.2G Jan 15 21:18 gene_refseq_uniprotkb_collab.gz
-rw-r--r--. 1 nru nru 2.6M Jan 16 15:44 labels
-rw-r--r--. 1 nru nru 835K Jan 15 21:18 mim2gene_medgen
-rw-r--r--. 1 nru nru 14G Jan 15 21:25 synonyms
-rw-r--r--. 1 nru nru 1.9G Jan 15 21:25 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels
79869 239607 2704154 labels
I deleted this folder and recreated it, and got the following files:
gene_info.gz gene_refseq_uniprotkb_collab.gz mim2gene_medgen taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 20G
drwxr-xr-x. 2 nru nru 4.0K Jan 21 20:07 .
drwxrwxrwx. 110 nobody nogroup 20K Jan 21 20:07 ..
-rw-r--r--. 1 nru nru 239M Jan 21 20:07 gene2ensembl.gz
-rw-r--r--. 1 nru nru 1.2G Jan 21 20:07 gene_info.gz
-rw-r--r--. 1 nru nru 88M Jan 21 20:07 gene_orthologs.gz
-rw-r--r--. 1 nru nru 1.2G Jan 21 20:07 gene_refseq_uniprotkb_collab.gz
-rw-r--r--. 1 nru nru 1.7G Jan 21 20:13 labels
-rw-r--r--. 1 nru nru 835K Jan 21 20:07 mim2gene_medgen
-rw-r--r--. 1 nru nru 14G Jan 21 20:13 synonyms
-rw-r--r--. 1 nru nru 1.9G Jan 21 20:13 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels
58008023 116027268 1736342424 labels
So, even though all the other input files appear to be exactly the same size in both runs, the first run somehow ended up with 57,928,154 fewer gene labels than expected (79,869 lines instead of 58,008,023). This is particularly bad because it looks like unlabeled genes don't make it into Babel, so we end up with a shortfall of 56,728,492 genes compared with the previous Babel release.
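The label shortfall follows directly from the two `wc labels` outputs above, and a simple line-count guard could flag a truncated labels file before anything downstream consumes it. The following is a minimal sketch, not code from Babel: `count_lines` and `check_min_lines` are hypothetical helper names, and the minimum line count is an assumption that would need to be chosen per release.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def check_min_lines(path, min_expected):
    """Hypothetical guard: raise if the file has fewer lines than min_expected."""
    n = count_lines(path)
    if n < min_expected:
        raise ValueError(f"{path}: {n} lines, expected at least {min_expected}")
    return n


# Line counts reported by `wc labels` in the two runs above.
bad_run_labels = 79_869
good_run_labels = 58_008_023
missing_labels = good_run_labels - bad_run_labels  # 57,928,154
```

With a guard like this, a call such as `check_min_lines("labels", 50_000_000)` (threshold assumed for illustration) would have failed loudly on the first run instead of letting the short file through.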
This has happened before. The problem is clearly somewhere in get_ncbigene_labels_synonyms_and_taxa, since the downloaded files appear to be the same size and this is the only other job that runs to produce these files. However, I don't see a code path that would cause writing of the files to be interrupted midway without raising some kind of exception or error:
Babel/src/datahandlers/ncbigene.py
Lines 10 to 88 in 5bc43c8
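One general way to make an interrupted write detectable is to write to a temporary file and rename it into place only after the writer has closed cleanly, so a partial file never appears under the final name. The sketch below illustrates that pattern only; it is not taken from ncbigene.py, `write_lines_atomically` is a hypothetical helper, and whether it would address the root cause here is unknown.

```python
import os
import tempfile


def write_lines_atomically(path, lines):
    """Write lines to a temp file in the target directory, then rename into place.

    If anything fails before the rename, the temp file is removed and the
    final path is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as out:
            for line in lines:
                out.write(line + "\n")
            out.flush()
            os.fsync(out.fileno())  # push data to disk before renaming
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temporary file is created in the same directory as the target so that `os.replace` stays on one filesystem, which is what makes the rename atomic.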