
Different scores? #123

Open
cbizon opened this issue Sep 15, 2023 · 3 comments

@cbizon
Contributor

cbizon commented Sep 15, 2023

Two queries are run in ROBOKOP. One is (Ozone)-(gene)-(asthma) and the other is (asthma)-(gene)-(Ozone). The same answers are returned, but the scores differ slightly between the two. Attached are the two messages. The first result in each (NCBIGene:7412) shows the different scores.

ROBOKOP_message_asthma-gene-ozone_trapi1.4dev.json.txt
ROBOKOP_message_ozone-gene-asthma_trapi1.4dev.json.txt

As far as I can tell, the two results are the same in terms of the number of edges bound and the parameters of the Omnicorp support edges.

This suggests to me a bug somewhere in the ranker, but the differences are small enough that perhaps it is just something numerical?

I also notice that every weight I saw has a value of 1. Is this accurate, or are these weights no longer used in ranking?

@kennethmorton
Contributor

Interesting case!

Looking at the first result in both sets, they are basically the same, but not exactly. I wrote some code to take a quick-and-dirty look at the content of the edges between the different curies in the result. I confirmed that if you remove directionality and only consider a symmetric weight matrix, the same edges are all present. The disagreements are between the subjects and objects on otherwise directionless edges. I believe this is due to Omnicorp, which must make some arbitrary choice of subject and object.
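The kind of quick-and-dirty check described above could look roughly like this (a sketch, not the actual comparison code; the edge dicts are a simplified stand-in for TRAPI messages, and the curies are illustrative):

```python
def undirected_edge_keys(edges):
    """Collapse each edge to a direction-free key: {subject, object} plus predicate."""
    return {(frozenset((e["subject"], e["object"])), e["predicate"]) for e in edges}

# The same support edge, recorded with subject/object flipped between the two queries.
edges_a = [{"subject": "MONDO:0004979", "object": "NCBIGene:7412", "predicate": "related_to"}]
edges_b = [{"subject": "NCBIGene:7412", "object": "MONDO:0004979", "predicate": "related_to"}]

# Once directionality is removed, the two edge sets match exactly.
assert undirected_edge_keys(edges_a) == undirected_edge_keys(edges_b)
```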

This is fine, except in how it impacts the ranker and weight calculation. Roughly, for each subject/object pair, each source can contribute only a single weight for each property type. If there are multiple edges from the same source with the same property (e.g. CTD publications), the maximum property value is taken. Once the edges are collapsed for each unique subject/object/source/property, there can be subtle differences if the subjects and objects flip around.
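A small sketch of why the collapse is direction-sensitive (hypothetical edge records, not the ranker's actual data structures): with a directional (subject, object, source, property) key, flipped edges land in separate buckets, so the max is never taken across them.

```python
from collections import defaultdict

def collapse(edges):
    """Keep the max value per directional (subject, object, source, property) key."""
    best = defaultdict(float)
    for e in edges:
        key = (e["subject"], e["object"], e["source"], e["property"])
        best[key] = max(best[key], e["value"])
    return dict(best)

# Two edges between the same node pair from the same source, subject/object flipped.
edges = [
    {"subject": "A", "object": "B", "source": "omnicorp", "property": "publications", "value": 3.0},
    {"subject": "B", "object": "A", "source": "omnicorp", "property": "publications", "value": 5.0},
]

# Directional keys keep the flipped edges separate: two buckets, two weights.
assert len(collapse(edges)) == 2

def collapse_symmetric(edges):
    """Same collapse, but with a direction-free {subject, object} key."""
    best = defaultdict(float)
    for e in edges:
        key = (frozenset((e["subject"], e["object"])), e["source"], e["property"])
        best[key] = max(best[key], e["value"])
    return dict(best)

# A direction-free key merges them into one bucket and takes the max.
assert collapse_symmetric(edges) == {(frozenset(("A", "B")), "omnicorp", "publications"): 5.0}
```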

Once we have a weight matrix, we make it symmetric when we calculate the graph Laplacian. If we instead make the matrix symmetric while checking for subject/object/source/property collisions, it should clear up the discrepancy.
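For intuition, here is a toy illustration (assumed shapes and weights, not the ranker's actual matrices) of the symmetrization at the Laplacian step: once the same per-pair weights reach the matrix, symmetrizing erases the direction choice, so identical Laplacians fall out of both query orderings. The discrepancy can therefore only arise upstream, in the collapse.

```python
import numpy as np

# The same two weighted edges, with the 0.7 edge recorded A->B in one query
# and B->A in the other (nodes ordered A, B, C).
W_query1 = np.array([[0.0, 0.7, 0.0],
                     [0.0, 0.0, 0.4],
                     [0.0, 0.0, 0.0]])
W_query2 = np.array([[0.0, 0.0, 0.0],
                     [0.7, 0.0, 0.4],
                     [0.0, 0.0, 0.0]])

def laplacian(W):
    """Symmetrize (here via W + W.T), then form the graph Laplacian L = D - S."""
    S = W + W.T
    return np.diag(S.sum(axis=1)) - S

# After symmetrization, both orderings yield the same Laplacian.
assert np.allclose(laplacian(W_query1), laplacian(W_query2))
```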

I believe this is fixed in #124 but I'd like a few more eyes on it. @uhbrar @maximusunc

@maximusunc
Contributor

Please disregard if this is too much of an edge case, but I'd like to toss in a small wrench. In ICEES-KG, we have many edges with the same subject/object/source/property that come from different datasets and different years. It sounds like the current ranker would not handle this case.

@kennethmorton
Contributor

I think that's an interesting point. We should consider what other aspects of TRAPI we should be using to identify unique edges.
