Gene association representation in omim.ttl
#156
Comments
@joeflack4 let me know if I can help with anything. Since you updated that code recently, it's probably best that you take care of this.
I know that if there's more than 1 association, we don't call it causal. But I don't know why we would not list all associations otherwise. Will look into it.
@sabrinatoro @matentzn Just want to confirm how this is supposed to work (Trish edit: for modeling OMIM in the first file created to model the OMIM content, which is omim.ttl). Gastric Cancer (OMIM:613659) has 11 Phenotype-Gene Relationships. In this case, we should declare the following property on all 11: But neither of the following properties should be used at all:
@joeflack4 We are talking about Mondo, right? (i.e. NOT the Monarch KG; I need to mention this in case I am confusing myself.) In Mondo, we allow only 1 gene per disease, because we know that in OMIM the disease is then defined based on variation in that gene. If a disease is associated with more than one gene, the genes are not defining the disease, and therefore we do not bring these gene annotations into Mondo. We documented this in multiple places; I don't have time to look for the links, sorry. Note: the 11 Phenotype-Gene Relationships for Gastric Cancer (OMIM:613659) would get into the Monarch KG, but NOT into Mondo.
@sabrinatoro Joe's question is about how OMIM should initially be modeled as an ontology (e.g. omim.ttl), as the content exists in OMIM itself. What we do with it from there, i.e. processing omim.ttl to bring it into Mondo, involves further steps that are currently out of scope for this question. In the omim.ttl file, even an entry like https://omim.org/entry/613659 for 'gastric cancer' has only 1 gene association viewable in Protege (the other 10 are visible when the ttl file is opened in a text editor), while OMIM itself has 11 associations. Here is a screenshot of 'gastric cancer' in the omim.ttl file. While the difference between what is viewable in Protege and in the ttl file itself is not that important, it's not clear why only 1 of the 11 genes listed in OMIM for the 'gastric cancer' entry has the association RO:0004013, which is later converted to RO:0004003. Is there a flag in the OMIM entry or in the files used to create this association that determines that IL1B is the causal gene out of the 11 listed, or is this representation in the omim.ttl file incorrect? My concern is that if the initial modeling of OMIM content in the omim.ttl file is not correct, the further transformations that bring this content into Mondo will also not be correct, since the starting content is incorrect. This is related to your (Sabrina's) comments about issues with the omim pipeline/gene2disease pipeline for Mondo.
FYI - there is now a thread in Slack in mondo-ingest about this too.
Joe and I reviewed this further and my suspicion is that there is an issue in how associations are counted, therefore leading to incorrect application of the RO property in the omim.ttl file. More to come soon.
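To make the counting rule discussed in this thread concrete, here is a minimal sketch of how the predicate could be chosen per phenotype. It is not the omim2obo implementation: the function name `assign_predicates`, the input shape, and the placeholder gene IDs are hypothetical, and the use of RO:0003302 for the non-causal case is an assumption based on the ttl excerpt quoted in this issue.

```python
from collections import defaultdict

# Predicates as they appear in this thread. Treating RO:0003302 as the
# "non-causal" choice is an assumption drawn from the quoted ttl excerpt.
CAUSAL = "RO:0004013"      # reportedly flipped to RO:0004003 in a later step
NON_CAUSAL = "RO:0003302"


def assign_predicates(pairs):
    """Pick one predicate per phenotype from its total number of genes.

    pairs: iterable of (phenotype_id, gene_id) tuples (hypothetical shape).
    Returns a list of (phenotype_id, predicate, gene_id) tuples.
    """
    genes_by_phenotype = defaultdict(set)
    for phenotype, gene in pairs:
        genes_by_phenotype[phenotype].add(gene)

    triples = []
    for phenotype, genes in genes_by_phenotype.items():
        # Count first, then apply the same predicate to every gene of the
        # phenotype, so no single gene is singled out as causal.
        predicate = CAUSAL if len(genes) == 1 else NON_CAUSAL
        for gene in sorted(genes):
            triples.append((phenotype, predicate, gene))
    return triples
```

Under this sketch, a phenotype with 11 genes would get the same non-causal predicate on all 11, and only a phenotype with exactly one gene would get RO:0004013.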
omim.ttl
I think this is a bug; if not, please explain the design decision. For some OMIM disease entries, e.g. https://omim.org/entry/613659, the omim.ttl file contains only 1 disease to gene association ('has material basis in germline mutation in' IL1B). However, on the OMIM entry page there are 11. I do see all associations in the various data files that are downloaded when creating the omim.ttl file using `python3 -m omim2obo`.
This issue is not limited to OMIM disease/phenotype entries that contain an INCLUDED entry, since it also happens with https://omim.org/entry/605074, where the omim.ttl file contains only 1 disease to gene association ('has material basis in germline mutation in' PRCC).
I believe this is one of the issues that were reported in the PR for the OMIM g2d pipeline in Mondo, specifically point (3): genes should not be added if the OMIM record is associated with multiple genes.
UPDATE:
For https://omim.org/entry/613659, I looked in the ttl file directly rather than in Protege and do see 11 entries in this format, where values like `_:Nbcf6a815046747ee9fe5bc8f3891b1c5` look to point back to a gene:
In 10 of the 11 entries, it has `owl:onProperty RO:0003302 ;`. None have RO:0004003 as displayed in Protege.
UPDATE 2: I do see that the OMIM gene entry with RO:0004013 is for IL1B, and there is some code that flips RO:0004013 to RO:0004003, so those further transformations are more clear.
LATEST QUESTION --> However, it's not clear why only this one gene has RO:0004013 to start with and the others listed for 'gastric cancer' have a different RO property.
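For anyone who wants to repeat the check on the ttl text without opening Protege, here is a rough text-level sketch that tallies the `owl:onProperty` values in a Turtle snippet. The function name `tally_on_property` and the sample text are made up for illustration; it only matches prefixed names written literally as `owl:onProperty <value> ;`, so a proper RDF parser would be more robust.

```python
import re
from collections import Counter

# Matches lines such as "owl:onProperty RO:0003302 ;" and captures the value.
ON_PROPERTY = re.compile(r"owl:onProperty\s+(\S+)\s*;")


def tally_on_property(turtle_text):
    """Count how often each property appears after owl:onProperty."""
    return Counter(ON_PROPERTY.findall(turtle_text))


# Illustrative text only, not copied from omim.ttl.
sample = """
owl:onProperty RO:0003302 ;
owl:onProperty RO:0003302 ;
owl:onProperty RO:0004013 ;
"""

for prop, count in sorted(tally_on_property(sample).items()):
    print(prop, count)
```

Run against the block of restrictions for a single OMIM entry, this gives a quick view of how many associations use each RO property.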
Also, did anyone have a chance to document the earlier design decisions? See #75 (comment)