This is now available at https://stars.renci.org/var/babel_outputs/2024aug18/reports/duckdb/identically_labeled_cliques.tsv.gz -- note that column 3 lists the files that these clique leaders come from, while column 4 lists the Biolink types for these clique leaders. There are 1,716,212 preferred labels with more than one clique. Here is the frequency distribution for labels by Biolink types in Babel 2024aug18:
These are pretty fascinating. My general feeling is that we'll probably need to do some evaluation per group here to really decide what to do, and we probably only care about the biggest groupings (over 1000?).
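To pick out the biggest groupings, something like the following rough sketch might help. It assumes a local copy of the report is tab-separated with the Biolink types in column 4 (as described above) and that the first row is a header; the function names, the header assumption, and the 1000 cutoff are all hypothetical, not taken from Babel itself.

```python
import csv
import gzip
from collections import Counter


def count_labels_by_types(path, types_col=3, has_header=True):
    """Count report rows per distinct value of the Biolink types column.

    types_col is zero-based, so 3 means "column 4". Whether the file
    really has a header row is an assumption; pass has_header=False if not.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: labels may contain literal quote characters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)
        for row in reader:
            if len(row) > types_col:
                counts[row[types_col]] += 1
    return counts


def big_groupings(counts, threshold=1000):
    """Return (types, count) pairs with count above threshold, largest first."""
    return [(types, n) for types, n in counts.most_common() if n > threshold]
```

Calling `big_groupings(count_labels_by_types("identically_labeled_cliques.tsv.gz"))` would then list only the type groupings worth a per-group evaluation.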
With my quick poking around, for instance, I think that the Gene groupings seem very much like they're all orthologs. We might think about whether we want an ortholog conflation going forward...
For OrganismTaxon, on the other hand, it looks like we have lots and lots of cases where our NCBI/UMLS grouping is just failing, so that's one that we should probably consider a bug.
The small molecules seem to be mostly isotope differences, something that we've also thought about as a conflation.
For Molecular Mixture, it looks like PubChem has many different mixtures, e.g. of propane and ethane in different ratios. Each ratio is a different ID, but they all get the same name. Seems fine, maybe a possibility for conflation?
I think Chemical Entity vs Protein is another one where we probably have real work to do, figuring out how and when to merge these things.
(That one entry with all the classes is, not surprisingly, "", which shows up for 12,387,395 cliques.)