You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The software liftoff compares two assemblies of very closely related genomes, one of which has gene annotations, and the other does not. It then tries to find which regions in the "bare" genomes align well with the "annotated" genome, and then uses the alignment and the gene coordinates of the "annotated" genome to predict the gene coordinates of the equivalent gene in the "bare" genome.
When a template gene is found once in the "annotated" genome but multiple times in the "bare" genome, the attribute ID of the template gene is retained for the first copy in the "bare" genome, but for the second copy a _1 is appended to the attribute ID.
See for example an excerpt from a resultant liftoff gff3 file (toy.edit.gff3):
The QZVMAR_2781_gene of the original genome has been found twice in the target genome, and the gene and CDS features of the second copy (here on top of the file) have been appended with a _1
As you can imagine, since the merge_strategy='create_unique' uses a similar strategy, this leads to perhaps some unexpected behavior. Applying (trials.py)
db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
to this GFF3 file leads to the following error:
Traceback (most recent call last):
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 622, in _populate_from_lines
self._insert(f, c)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./trials.py", line 5, in <module>
db = gffutils.create_db('toy.edit.gff3', ':memory:', merge_strategy='create_unique')
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 1401, in create_db
c.create()
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 543, in create
self._populate_from_lines(self.iterator)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 656, in _populate_from_lines
self._insert(f, c)
File "/Users/joran/miniconda3/envs/gffutils/lib/python3.6/site-packages/gffutils/create.py", line 566, in _insert
cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id
My guess of what is going on here is that once the script gets to the second CDS of the second gene (the 9th line), it will try to append ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54 with a _1. However, its result, ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1 is already an ID of a CDS of the first gene! Hence, the unique requirement fails and the script crashes.
Part of the challenge with gffutils is covering all the corner cases simultaneously...your solution will work for this GFF, but I think it may fail in cases where there are identical IDs and positions (but perhaps the feature type or other attributes differ).
Another solution would be to keep trying to increment until you hit something that doesn't exist yet. In that case I think I'd want to have all sorts of warnings because it would be easy to have duplicate features (basically any time you add anything you'd get new entries).
Honestly I think the best solution is to to use the id_spec exactly as you have.
But maybe we could make it easier to discover, like an id_spec module where you could do:
I'm not sure what other id_specs would be useful, though I'd imagine digging through past issues could give some clues. Actually, now that I think about it some more, maybe instead of just id_specs it should be entire sets of tool-specific kwargs, so something like:
Such a module would be reasonably straightforward to maintain. The tricky part would be identifying the tools to support and tracking down example files to use for testing.
Part of the challenge with gffutils is covering all the corner cases simultaneously...your solution will work for this GFF, but I think it may fail in cases where there are identical IDs and positions (but perhaps the feature type or other attributes differ).
Good point.
A set of tool specific functions / arguments sounds like a good idea to me. Even with the GFF3 standardization, lots of tools still seem to have their own quircky behaviours. Like you said it may be impossible to have a universal function that works for all of them.
The software liftoff compares two assemblies of very closely related genomes, one of which has gene annotations, and the other does not. It then tries to find which regions in the "bare" genomes align well with the "annotated" genome, and then uses the alignment and the gene coordinates of the "annotated" genome to predict the gene coordinates of the equivalent gene in the "bare" genome.
When a template gene is found once in the "annotated" genome but multiple times in the "bare" genome, the attribute ID of the template gene is retained for the first copy in the "bare" genome, but for the second copy a
_1
is appended to the attribute ID.See for example an excerpt from a resultant liftoff gff3 file (
toy.edit.gff3
):The
QZVMAR_2781_gene
of the original genome has been found twice in the target genome, and thegene
andCDS
features of the second copy (here on top of the file) have been appended with a_1
As you can imagine, since the
merge_strategy='create_unique'
uses a similar strategy, this leads to perhaps some unexpected behavior. Applying (trials.py
)to this GFF3 file leads to the following error:
My guess of what is going on here is that once the script gets to the second CDS of the second gene (the 9th line), it will try to append
ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54
with a_1
. However, its result,ID=Seq_8_Seq_8_Manual.BlastoST7C_030513:00:54_1
is already an ID of a CDS of the first gene! Hence, the unique requirement fails and the script crashes.I've come up with following workaround:
which basically creates a custom ID for each CDS feature, using the ID, the start and end coordinates to make sure it is unique.
In any case, I just wanted to point out this unfortunate coincidence between Liftoff and gffutils which leads to errors.
Perhaps it would be possible to update the
create_unique
function to take into account such cases?The text was updated successfully, but these errors were encountered: