Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NCBIGene sometimes has an unreported download issue #398

Open
gaurav opened this issue Jan 21, 2025 · 0 comments
Open

NCBIGene sometimes has an unreported download issue #398

gaurav opened this issue Jan 21, 2025 · 0 comments
Assignees

Comments

@gaurav
Copy link
Collaborator

gaurav commented Jan 21, 2025

I just created a Babel. Here is what babel_downloads/NCBIGene looks like:

nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 18G
drwxr-xr-x.   2 nru    nru     4.0K Jan 15 21:18 .
drwxrwxrwx. 110 nobody nogroup  20K Jan 18 18:58 ..
-rw-r--r--.   1 nru    nru     239M Jan 15 21:18 gene2ensembl.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 15 21:18 gene_info.gz
-rw-r--r--.   1 nru    nru      88M Jan 15 21:18 gene_orthologs.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 15 21:18 gene_refseq_uniprotkb_collab.gz
-rw-r--r--.   1 nru    nru     2.6M Jan 16 15:44 labels
-rw-r--r--.   1 nru    nru     835K Jan 15 21:18 mim2gene_medgen
-rw-r--r--.   1 nru    nru      14G Jan 15 21:25 synonyms
-rw-r--r--.   1 nru    nru     1.9G Jan 15 21:25 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels 
  79869  239607 2704154 labels

I deleted this folder and recreated it, and got the following files:

gene_info.gz	 gene_refseq_uniprotkb_collab.gz  mim2gene_medgen  taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ ls -alh
total 20G
drwxr-xr-x.   2 nru    nru     4.0K Jan 21 20:07 .
drwxrwxrwx. 110 nobody nogroup  20K Jan 21 20:07 ..
-rw-r--r--.   1 nru    nru     239M Jan 21 20:07 gene2ensembl.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 21 20:07 gene_info.gz
-rw-r--r--.   1 nru    nru      88M Jan 21 20:07 gene_orthologs.gz
-rw-r--r--.   1 nru    nru     1.2G Jan 21 20:07 gene_refseq_uniprotkb_collab.gz
-rw-r--r--.   1 nru    nru     1.7G Jan 21 20:13 labels
-rw-r--r--.   1 nru    nru     835K Jan 21 20:07 mim2gene_medgen
-rw-r--r--.   1 nru    nru      14G Jan 21 20:13 synonyms
-rw-r--r--.   1 nru    nru     1.9G Jan 21 20:13 taxa
nru@babel:/code/babel/babel_downloads/NCBIGene$ wc labels
  58008023  116027268 1736342424 labels

So, even though all the other input files appear to be exactly the same size, we somehow end up with 57,928,154 fewer gene labels than we expect... which is particularly bad because it looks like unlabeled genes don't make it into Babel, and we end up with a shortfall of 56,728,492 genes when compared with the previous Babel release.

This has happened before. The problem is clearly in get_ncbigene_labels_synonyms_and_taxa somewhere, since the download files appear to be the same size and this is the only other job that executes to produce these files. However, I don't see a codepath that would cause writing of the files to be interrupted midway without some kind of exception or error:

def pull_ncbigene_labels_synonyms_and_taxa():
# File format described here: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README
ifname = make_local_name('gene_info.gz', subpath='NCBIGene')
labelname = make_local_name('labels', subpath='NCBIGene')
synname = make_local_name('synonyms', subpath='NCBIGene')
taxaname = make_local_name('taxa', subpath='NCBIGene')
bad_gene_types = set(['biological-region', 'other', 'unknown'])
with (gzip.open(ifname, 'r') as inf,
open(labelname, 'w') as labelfile,
open(synname, 'w') as synfile,
open(taxaname, 'w') as taxafile):
# Make sure the gene_info.gz columns haven't changed from under us.
header = inf.readline().decode('utf-8').strip().split("\t")
assert header == [
"#tax_id",
"GeneID",
"Symbol",
"LocusTag",
"Synonyms",
"dbXrefs",
"chromosome",
"map_location",
"description",
"type_of_gene",
"Symbol_from_nomenclature_authority",
"Full_name_from_nomenclature_authority",
"Nomenclature_status",
"Other_designations",
"Modification_date",
"Feature_type"]
def get_field(row, field_name):
"""
A helper function for returning the value of a field in gene_info.gz by name.
The value is `-` if no value is present; we need to convert that into an empty string.
:param row: A row from gene_info.gz.
:param field_name: A field name in the header of gene_info.gz.
:return: The value in this column in this row, otherwise the empty string ('').
"""
index = header.index(field_name)
value = row[index].strip()
if value == '-':
return ''
return value
for line in inf:
sline = line.decode('utf-8')
row = sline.strip().split('\t')
gene_id = f'NCBIGene:{get_field(row, "GeneID")}'
symbol = get_field(row, "Symbol")
gene_type = get_field(row, "type_of_gene")
if gene_type in bad_gene_types:
continue
labelfile.write(f'{gene_id}\t{symbol}\n')
taxafile.write(f'{gene_id}\tNCBITaxon:{get_field(row, "#tax_id")}\n')
syns = set(get_field(row, "Synonyms").split('|'))
syns.add(symbol)
description = get_field(row, "description")
syns.add(description)
authoritative_symbol = get_field(row, "Symbol_from_nomenclature_authority")
syns.add(authoritative_symbol)
authoritative_full_name = get_field(row, "Full_name_from_nomenclature_authority")
syns.add(authoritative_full_name)
others = set(get_field(row, "Other_designations").split('|'))
syns.update(others)
for syn in syns:
# Skip empty synonym.
if syn.strip() == '':
continue
# gene_info.gz uses `-` to indicate a blank field -- if we're seeing that here, that means
# we've misread the file somehow!
if syn == '-':
raise RuntimeError(f"Synonym '-' should not be present in NCBIGene output!")
synfile.write(f'{gene_id}\thttp://www.geneontology.org/formats/oboInOwl#hasSynonym\t{syn}\n')

@gaurav gaurav self-assigned this Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant