NCBIGene sometimes has an unreported download issue #398

gaurav · 2025-01-21T20:28:23Z

I just created a Babel. Here is what babel_downloads/NCBIGene looks like:

nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 18G
drwxr-xr-x.   2 nru    nru     4.0K Jan 15 21:18 .
drwxrwxrwx. 110 nobody nogroup  20K Jan 18 18:58 ..
-rw-r--r--.   1 nru    nru     239M Jan 15 21:18 gene2ensembl.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 15 21:18 gene_info.gz
-rw-r--r--.   1 nru    nru      88M Jan 15 21:18 gene_orthologs.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 15 21:18 gene_refseq_uniprotkb_collab.gz
-rw-r--r--.   1 nru    nru     2.6M Jan 16 15:44 labels
-rw-r--r--.   1 nru    nru     835K Jan 15 21:18 mim2gene_medgen
-rw-r--r--.   1 nru    nru      14G Jan 15 21:25 synonyms
-rw-r--r--.   1 nru    nru     1.9G Jan 15 21:25 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels 
  79869  239607 2704154 labels

I deleted this folder and recreated it, and got the following files:

gene_info.gz	 gene_refseq_uniprotkb_collab.gz  mim2gene_medgen  taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 20G
drwxr-xr-x.   2 nru    nru     4.0K Jan 21 20:07 .
drwxrwxrwx. 110 nobody nogroup  20K Jan 21 20:07 ..
-rw-r--r--.   1 nru    nru     239M Jan 21 20:07 gene2ensembl.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 21 20:07 gene_info.gz
-rw-r--r--.   1 nru    nru      88M Jan 21 20:07 gene_orthologs.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 21 20:07 gene_refseq_uniprotkb_collab.gz
-rw-r--r--.   1 nru    nru     1.7G Jan 21 20:13 labels
-rw-r--r--.   1 nru    nru     835K Jan 21 20:07 mim2gene_medgen
-rw-r--r--.   1 nru    nru      14G Jan 21 20:13 synonyms
-rw-r--r--.   1 nru    nru     1.9G Jan 21 20:13 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels
  58008023  116027268 1736342424 labels

So, even though all the other input files appear to be exactly the same size, we somehow end up with 57,928,154 fewer gene labels than we expect... which is particularly bad because it looks like unlabeled genes don't make it into Babel, and we end up with a shortfall of 56,728,492 genes when compared with the previous Babel release.

This has happened before. The problem is clearly in get_ncbigene_labels_synonyms_and_taxa somewhere, since the download files appear to be the same size and this is the only other job that executes to produce these files. However, I don't see a codepath that would cause writing of the files to be interrupted midway without some kind of exception or error:

Babel/src/datahandlers/ncbigene.py

Lines 10 to 88 in 5bc43c8

    
           def pull_ncbigene_labels_synonyms_and_taxa(): 
        
               # File format described here: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README 
        
               ifname = make_local_name('gene_info.gz', subpath='NCBIGene') 
        
               labelname = make_local_name('labels', subpath='NCBIGene') 
        
               synname = make_local_name('synonyms', subpath='NCBIGene') 
        
               taxaname = make_local_name('taxa', subpath='NCBIGene') 
        
               bad_gene_types = set(['biological-region', 'other', 'unknown']) 
        
               with (gzip.open(ifname, 'r') as inf, 
        
                     open(labelname, 'w') as labelfile, 
        
                     open(synname, 'w') as synfile, 
        
                     open(taxaname, 'w') as taxafile): 
        
                   # Make sure the gene_info.gz columns haven't changed from under us. 
        
                   header = inf.readline().decode('utf-8').strip().split("\t") 
        
                   assert header == [ 
        
                       "#tax_id", 
        
                       "GeneID", 
        
                       "Symbol", 
        
                       "LocusTag", 
        
                       "Synonyms", 
        
                       "dbXrefs", 
        
                       "chromosome", 
        
                       "map_location", 
        
                       "description", 
        
                       "type_of_gene", 
        
                       "Symbol_from_nomenclature_authority", 
        
                       "Full_name_from_nomenclature_authority", 
        
                       "Nomenclature_status", 
        
                       "Other_designations", 
        
                       "Modification_date", 
        
                       "Feature_type"] 
        
                   def get_field(row, field_name): 
        
                       """ 
        
                       A helper function for returning the value of a field in gene_info.gz by name. 
        
                       The value is `-` if no value is present; we need to convert that into an empty string. 
        
                       :param row: A row from gene_info.gz. 
        
                       :param field_name: A field name in the header of gene_info.gz. 
        
                       :return: The value in this column in this row, otherwise the empty string (''). 
        
                       """ 
        
                       index = header.index(field_name) 
        
                       value = row[index].strip() 
        
                       if value == '-': 
        
                           return '' 
        
                       return value 
        
                   for line in inf: 
        
                       sline = line.decode('utf-8') 
        
                       row = sline.strip().split('\t') 
        
                       gene_id = f'NCBIGene:{get_field(row, "GeneID")}' 
        
                       symbol = get_field(row, "Symbol") 
        
                       gene_type = get_field(row, "type_of_gene") 
        
                       if gene_type in bad_gene_types: 
        
                           continue 
        
                       labelfile.write(f'{gene_id}\t{symbol}\n') 
        
                       taxafile.write(f'{gene_id}\tNCBITaxon:{get_field(row, "#tax_id")}\n') 
        
                       syns = set(get_field(row, "Synonyms").split('|')) 
        
                       syns.add(symbol) 
        
                       description = get_field(row, "description") 
        
                       syns.add(description) 
        
                       authoritative_symbol = get_field(row, "Symbol_from_nomenclature_authority") 
        
                       syns.add(authoritative_symbol) 
        
                       authoritative_full_name = get_field(row, "Full_name_from_nomenclature_authority") 
        
                       syns.add(authoritative_full_name) 
        
                       others = set(get_field(row, "Other_designations").split('|')) 
        
                       syns.update(others) 
        
                       for syn in syns: 
        
                           # Skip empty synonym. 
        
                           if syn.strip() == '': 
        
                               continue 
        
                           # gene_info.gz uses `-` to indicate a blank field -- if we're seeing that here, that means 
        
                           # we've misread the file somehow! 
        
                           if syn == '-': 
        
                               raise RuntimeError(f"Synonym '-' should not be present in NCBIGene output!") 
        
                           synfile.write(f'{gene_id}\thttp://www.geneontology.org/formats/oboInOwl#hasSynonym\t{syn}\n')

The text was updated successfully, but these errors were encountered:

gaurav self-assigned this Jan 21, 2025

gaurav added the Priority: High label Jan 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NCBIGene sometimes has an unreported download issue #398

NCBIGene sometimes has an unreported download issue #398

gaurav commented Jan 21, 2025

NCBIGene sometimes has an unreported download issue #398

NCBIGene sometimes has an unreported download issue #398

Comments

gaurav commented Jan 21, 2025