Support for prefix declarations in OBO format #1102

Open
wants to merge 18 commits into
base: version4

balhoff
Contributor

@balhoff balhoff commented May 12, 2023

This adds support for reading and writing prefix declarations in OBO format (idspace tags), and using them for expanding and compacting identifiers. It builds on initial work done by @gouttegd in #1072, moved to the version4 branch. These changes also treat values of replaced_by and consider tags as IRIs rather than strings.

OBODocumentFormat now extends PrefixDocumentFormatImpl.

This should stay as a draft PR until some tests are added.
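For illustration, here is a minimal sketch of what "using idspace tags to expand identifiers" could look like, independent of the OWL API. The names `IdspaceSketch`, `parseIdspaces` and `expand` are hypothetical and not taken from this PR. The fallback assumes the default OBO translation of canonical IDs to `http://purl.obolibrary.org/obo/PREFIX_LOCALID` and ignores non-canonical IDs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the OWL API or of this PR. */
class IdspaceSketch {
    static final String OBO_BASE = "http://purl.obolibrary.org/obo/";

    /** Collects "idspace: PREFIX URL-PREFIX [description]" header tags. */
    static Map<String, String> parseIdspaces(List<String> headerLines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : headerLines) {
            if (!line.startsWith("idspace:")) {
                continue;
            }
            String[] parts = line.substring("idspace:".length()).trim().split("\\s+", 3);
            if (parts.length >= 2) {
                map.put(parts[0], parts[1]);
            }
        }
        return map;
    }

    /** Expands a prefixed ID with a declared idspace, else with the default OBO rule. */
    static String expand(String id, Map<String, String> idspaces) {
        int colon = id.indexOf(':');
        if (colon < 0) {
            return id; // unprefixed IDs are out of scope for this sketch
        }
        String prefix = id.substring(0, colon);
        String local = id.substring(colon + 1);
        String urlPrefix = idspaces.get(prefix);
        if (urlPrefix != null) {
            return urlPrefix + local;
        }
        // Assumption: canonical ID, so the default OBO expansion applies.
        return OBO_BASE + prefix + "_" + local;
    }
}
```

With a header line such as `idspace: sw http://example.org/sw/` (a made-up URL), `sw:123` would expand through the declared prefix, while `CL:0001` would still go through the default rule.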

gouttegd and others added 5 commits May 10, 2023 13:14
The value of the `replaced_by` tag in an OBO file should be an ID
according to the OBO Flat File Format specification, so we treat it as
such.
When the header frame of an OBO file contains `idspace` tags, use them to
translate Prefixed-IDs (aka CURIEs) into full IRIs.
Process `consider` tags in an OBO file the same way as `replaced_by`
tags.
@balhoff balhoff marked this pull request as ready for review August 1, 2023 18:54
@balhoff
Contributor Author

balhoff commented Aug 1, 2023

@ignazio1977 @cmungall @gouttegd I think this is ready.

@balhoff
Contributor Author

balhoff commented Aug 1, 2023

I built a ROBOT jar which can be used to test the prefixes support: https://github.com/balhoff/owlapi/releases/download/prefixes-test/robot.jar

@cmungall
Member

cmungall commented Aug 1, 2023

java -jar ~/tmp/robot.jar convert -I $OBO/hsapdv.owl -o /tmp/foo.obo

results in a very large owl-axioms header for untranslatable axioms

owl-axioms: Prefix(owl:=<http://www.w3.org/2002/07/owl#>)\nPrefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)\nPrefix(xml:=<http://www.w3.org/XML/1998/namespace>)\nPrefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)\nPrefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)\n\n\nOntology(\nDeclaration(AnnotationProperty(<http://www.geneontology.org/formats/oboInOwl#id>))\n\n\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/BFO_0000050> \"part_of\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/BFO_0000062> \"preceded_by\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000000> \"HsapDv:0000000\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000001> \"HsapDv:0000001\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000002> \"HsapDv:0000002\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000003> \"HsapDv:0000003\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000004> \"HsapDv:0000004\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000005> \"HsapDv:0000005\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000006> \"HsapDv:0000006\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000007> \"HsapDv:0000007\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000008> \"HsapDv:0000008\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> 
<http://purl.obolibrary.org/obo/HsapDv_0000009> \"HsapDv:0000009\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000010> \"HsapDv:0000010\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000011> \"HsapDv:0000011\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000012> \"HsapDv:0000012\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000013> \"HsapDv:0000013\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000014> \"HsapDv:0000014\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000015> \"HsapDv:0000015\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000016> \"HsapDv:0000016\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000017> \"HsapDv:0000017\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000018> \"HsapDv:0000018\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000019> \"HsapDv:0000019\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000020> \"HsapDv:0000020\")\nAnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000021>

and also has ID CURIEs translated to obo:...

[Term]
id: obo:HsapDv_0000001
name: human life cycle
namespace: human_developmental_stage
def: "Temporal interval that defines human life from the prenatal stage until late adulthood." [Bgee:curator]
xref: UBERON:0000104
is_a: obo:HsapDv_0000000 ! human life cycle stage
property_value: hsapdv:start_dpf "0.0" xsd:float

I don't think there is anything unusual about this ontology - cl.owl exhibits the same behavior.

@cmungall
Member

cmungall commented Aug 1, 2023

If the ontology is converted from .obo, there is no problem

I thought that somehow the oboInOwl:id annotations were the issue; removing them gets rid of the giant owl-axioms header, but the ID issue remains.

@gouttegd
Contributor

gouttegd commented Aug 2, 2023

> If the ontology is converted from .obo, there is no problem

In fact this seems to happen only if the source ontology is in RDF/XML. I don’t see anything wrong if the source ontology is in Manchester or functional syntax.

@gouttegd
Contributor

gouttegd commented Aug 2, 2023

The ID issue seems more precisely caused by the xmlns:obo=http://purl.obolibrary.org/obo/ namespace declaration in the RDF/XML file, coupled with a behaviour of the OWL API prefix manager that I don’t quite understand:

DefaultPrefixManager pm = new DefaultPrefixManager();
pm.setPrefix("obo:", "http://purl.obolibrary.org/obo/");
pm.setPrefix("CL:", "http://purl.obolibrary.org/obo/CL_");

IRI iri = IRI.create("http://purl.obolibrary.org/obo/CL_0001");

System.err.printf("IRI to CURIE: %s -> %s\n", iri, pm.getPrefixIRI(iri));

This gives:

IRI to CURIE: http://purl.obolibrary.org/obo/CL_0001 -> obo:CL_0001

Not sure why the prefix manager does not recognise the longer, more specific CL URL prefix.

@gouttegd
Contributor

gouttegd commented Aug 2, 2023

If I understand correctly, this is because the OWL API prefix manager does not simply search for the longest URL prefix in the prefix map. Before doing that, it first searches for a URL prefix that corresponds to the namespace of the IRI. If it finds one, it always uses that prefix without searching any further.

The IRI namespace is determined completely independently of any prefix map, solely on the basis of some XML rules about names (it’s whatever precedes the longest suffix of the IRI that is a valid XML NCName). With http://purl.obolibrary.org/obo/CL_0001 for example, the namespace is http://purl.obolibrary.org/obo/.

Now because the RDF/XML file contains an xmlns:obo=http://purl.obolibrary.org/obo/ namespace declaration, when the prefix manager first looks up the namespace of http://purl.obolibrary.org/obo/CL_0001, it finds the http://purl.obolibrary.org/obo/ URL prefix, and it does not search for any other prefix in the prefix map that could be a better match.
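The lookup order described here can be mimicked in plain Java. This is a simplified model, not the OWL API's actual code. `namespaceOf` only approximates the NCName split with ASCII letters, digits, `_`, `-` and `.`, and all class and method names are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of a namespace-first prefix lookup (hypothetical names). */
class NamespaceFirstSketch {

    /** Returns the part of the IRI that precedes its longest NCName-like tail. */
    static String namespaceOf(String iri) {
        int i = iri.length();
        // Walk back over characters allowed inside an NCName (ASCII subset only).
        while (i > 0 && isNameChar(iri.charAt(i - 1))) {
            i--;
        }
        // An NCName must start with a letter or an underscore.
        while (i < iri.length() && !isNameStart(iri.charAt(i))) {
            i++;
        }
        return iri.substring(0, i);
    }

    static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    /** A namespace match wins; the longest-prefix search is only a fallback. */
    static String compact(String iri, Map<String, String> prefixToUrl) {
        String ns = namespaceOf(iri);
        for (Map.Entry<String, String> e : prefixToUrl.entrySet()) {
            if (e.getValue().equals(ns)) {
                return e.getKey() + iri.substring(ns.length());
            }
        }
        String bestName = null;
        String bestUrl = "";
        for (Map.Entry<String, String> e : prefixToUrl.entrySet()) {
            if (iri.startsWith(e.getValue()) && e.getValue().length() > bestUrl.length()) {
                bestName = e.getKey();
                bestUrl = e.getValue();
            }
        }
        return bestName == null ? iri : bestName + iri.substring(bestUrl.length());
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("obo:", "http://purl.obolibrary.org/obo/");
        prefixes.put("CL:", "http://purl.obolibrary.org/obo/CL_");
        // Prints obo:CL_0001 because the namespace match short-circuits the search.
        System.out.println(compact("http://purl.obolibrary.org/obo/CL_0001", prefixes));
    }
}
```

Removing the `obo:` entry from the map makes the same call return `CL:0001`, which matches the observation that the problem only appears when the source file declares the `obo` namespace.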

@balhoff
Contributor Author

balhoff commented Aug 2, 2023

@gouttegd thanks for looking into that. I actually had originally implemented my own search for the longest URL prefix, and then swapped in the prefix manager because it seemed proper to reuse code. But I much prefer always using the longest match!

I wonder if I should put in special case handling of namespaces that are http://purl.obolibrary.org/obo/ or substrings. If such a prefix is defined, currently the code will not have a chance to fall back to the built-in OBO compaction (this is a separate matter from also defining a CL prefix; I don't want to force OBO files to declare every OBO prefix).
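One possible reading of that special-casing idea, as a rough sketch only: ignore declared prefixes whose URL is the OBO base or a leading part of it, pick the longest remaining match, and otherwise fall back to the built-in compaction. `OboFallbackSketch` is a hypothetical name, and splitting at the last underscore is an assumption about the built-in rule, not something confirmed in this thread.

```java
import java.util.Map;

/** Hypothetical sketch of "longest match, but never shadow the OBO default". */
class OboFallbackSketch {
    static final String OBO_BASE = "http://purl.obolibrary.org/obo/";

    static String compact(String iri, Map<String, String> idspaces) {
        String bestPrefix = null;
        String bestUrl = "";
        for (Map.Entry<String, String> e : idspaces.entrySet()) {
            String url = e.getValue();
            // Skip declarations that are the OBO base or a leading part of it.
            if (OBO_BASE.startsWith(url)) {
                continue;
            }
            if (iri.startsWith(url) && url.length() > bestUrl.length()) {
                bestPrefix = e.getKey();
                bestUrl = url;
            }
        }
        if (bestPrefix != null) {
            return bestPrefix + ":" + iri.substring(bestUrl.length());
        }
        // Assumed built-in OBO compaction: .../obo/PREFIX_LOCAL -> PREFIX:LOCAL
        if (iri.startsWith(OBO_BASE)) {
            String rest = iri.substring(OBO_BASE.length());
            int underscore = rest.lastIndexOf('_');
            if (underscore > 0 && underscore < rest.length() - 1) {
                return rest.substring(0, underscore) + ":" + rest.substring(underscore + 1);
            }
        }
        return iri; // nothing applies, keep the full IRI
    }
}
```

With only `obo` declared, `http://purl.obolibrary.org/obo/HsapDv_0000001` would then come out as `HsapDv:0000001` rather than `obo:HsapDv_0000001`. The cost is exactly the kind of special case that the comment is reluctant to introduce.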

@cthoyt

cthoyt commented Aug 2, 2023

The ‘curies’ package for Java is available on Maven and has an implementation of URI compression that correctly handles getting the longest match.

@gouttegd
Contributor

gouttegd commented Aug 2, 2023

To be clear, the OWL API prefix manager does correctly handle getting the longest match. It’s just that it does so only if it does not find a prefix that matches the namespace of the IRI (where the namespace is defined as explained above) – if a match for the namespace is found, this takes precedence over everything else.

Overall, it seems like the OWL API DefaultPrefixManager has been designed with the particular constraints of XML tags in mind. Nothing wrong with that, but I think that makes it unsuitable for our particular use case here. I’d suggest either reverting to the original custom longest prefix search (which was working just fine last time I tested this PR, if I recall correctly) or adopting curies.

@balhoff
Contributor Author

balhoff commented Aug 3, 2023

I think the main problem @cmungall is seeing is the result of automatic insertion of oboInOwl:id annotations when parsing. When a term like this is parsed:

[Term]
id: HsapDv:0000001
name: human life cycle

An annotation is injected into the OWL:

AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000001> "HsapDv:0000001")

If there is an idspace defined like obo http://purl.obolibrary.org/obo/, this won't affect the parsing of the term, but it will be matched when the term is written. So the stanza gets created as:

[Term]
id: obo:HsapDv_0000001
name: human life cycle

And then the OBO writer doesn't know what to do with the id annotation, since there is now no HsapDv:0000001 frame, so it puts it in the owl-axioms header. If you read this OBO file back into OWL, there will then be two id annotations:

AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000001> "HsapDv:0000001")
AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/HsapDv_0000001> "obo:HsapDv_0000001")

I'm not sure, but I don't think the id annotations serve any purpose (the business of mapping relations to unprefixed names is handled by an inserted oboInOwl:shorthand annotation). If I suppress their creation, it fixes the owl-axioms header problem. But maybe that's not an acceptable change to make in this PR (would be good to do in OBO 1.6). But it will probably be super surprising to end up with obo: prefixed terms everywhere when converting from OWL files that happen to have that prefix defined. Should I make OBO format handle this namespace in a special way? I think that would also avoid these particular oboInOwl:id problems. But I don't like making special cases. And I suspect I can cause similar id annotation problems by putting other overlapping id spaces in the header (update: I can).

@balhoff
Contributor Author

balhoff commented Oct 20, 2023

@gouttegd @cmungall do you understand the point of this check for underscore? Trying to wrap up this PR and this is a sticking point:

if (localId.contains("_")) {
    uriPrefix += "#";

I don't want to append the # to the uriPrefix.

@cmungall
Member

I think this is for subset "IDs".

@gouttegd
Contributor

I believe this is a (poor) way of checking whether the ID is a “non-canonical prefixed ID” (canonical prefixed IDs don’t contain an underscore in the local ID part). Such IDs MUST be expanded with a '#' character between the URL prefix and the local ID, according to the OBO Flat File Format specification (§5.9.2 Translation of identifiers).

I say a “poor way” because a “non-canonical prefixed ID” is defined (§2.5 Identifiers) basically as any prefixed ID that contains characters that are not allowed in a “canonical prefixed ID”: that is, any prefixed ID where the prefix part contains something other than alphanumeric characters and underscores and/or where the local part contains something other than digits.

In other words, that check fails to identify many “non-canonical prefixed IDs” (all those that do not contain underscores in their local part, but may contain any other character not allowed in a canonical ID), which are then translated as if they were canonical.

Whether it would be a good idea to fix the parser (implement a proper check for “non-canonical prefixed IDs”) to make it really compliant with that section of the OBO Flat File Format specification, I am not sure.
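To make the gap concrete, here is a small sketch contrasting the underscore heuristic with a stricter check. The regular expression follows the wording of this comment (prefix made of alphanumerics and underscores, local part made of digits) rather than the exact grammar of the specification, and `PrefixedIdSketch` and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical classifier for prefixed IDs, based on the description in this thread. */
class PrefixedIdSketch {
    // Canonical: alphanumeric/underscore prefix, colon, digits-only local part.
    private static final Pattern CANONICAL = Pattern.compile("[A-Za-z0-9_]+:[0-9]+");
    // Prefixed: any non-empty prefix without colon or whitespace, followed by a colon.
    private static final Pattern PREFIXED = Pattern.compile("[^:\\s]+:.*");

    static boolean isCanonical(String id) {
        return CANONICAL.matcher(id).matches();
    }

    static boolean isNonCanonicalPrefixed(String id) {
        return PREFIXED.matcher(id).matches() && !isCanonical(id);
    }

    /** The heuristic quoted above: an underscore in the local part. */
    static boolean underscoreHeuristic(String id) {
        int colon = id.indexOf(':');
        return colon >= 0 && id.substring(colon + 1).contains("_");
    }
}
```

For `FOO:bar_baz` both checks agree, but for `FOO:bar` the stricter check reports a non-canonical ID while the underscore heuristic does not, which is the case described above.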

@gouttegd
Contributor

gouttegd commented Feb 9, 2024

Seems to be working just fine at least on Uberon. :) I’ll do some more tests with other ontologies later.

@gouttegd
Contributor

FWIW I didn’t notice any issue with any of the ontologies I work with.

@cmungall
Member

cmungall commented May 1, 2024

I have tested this jar and it looks like it injects labels for oio properties. While this is not so harmful and even mildly useful in some circumstances, it's generally undesirable and not supported by the spec.

E.g.

format-version: 1.4
ontology: comment

[Term]
id: X:1
comment: "This is a comment about term X:1."

generates

<?xml version="1.0"?>
<rdf:RDF xmlns="http://purl.obolibrary.org/obo/comment.owl#"
     xml:base="http://purl.obolibrary.org/obo/comment.owl"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/comment.owl">
        <oboInOwl:hasOBOFormatVersion>1.4</oboInOwl:hasOBOFormatVersion>
    </owl:Ontology>
    


    <!-- 
    ///////////////////////////////////////////////////////////////////////////////////////
    //
    // Annotation properties
    //
    ///////////////////////////////////////////////////////////////////////////////////////
     -->

    


    <!-- http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion -->

    <owl:AnnotationProperty rdf:about="http://www.geneontology.org/formats/oboInOwl#hasOBOFormatVersion">
        <rdfs:label>has_obo_format_version</rdfs:label>
    </owl:AnnotationProperty>
    


    <!-- http://www.geneontology.org/formats/oboInOwl#id -->

    <owl:AnnotationProperty rdf:about="http://www.geneontology.org/formats/oboInOwl#id">
        <rdfs:label>id</rdfs:label>
    </owl:AnnotationProperty>
    


    <!-- http://www.w3.org/2000/01/rdf-schema#comment -->

    <owl:AnnotationProperty rdf:about="http://www.w3.org/2000/01/rdf-schema#comment"/>
    


    <!-- 
    ///////////////////////////////////////////////////////////////////////////////////////
    //
    // Classes
    //
    ///////////////////////////////////////////////////////////////////////////////////////
     -->

    


    <!-- http://purl.obolibrary.org/obo/X_1 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/X_1">
        <oboInOwl:id>X:1</oboInOwl:id>
        <rdfs:comment>&quot;This is a comment about term X:1.&quot;</rdfs:comment>
    </owl:Class>
</rdf:RDF>



<!-- Generated by the OWL API (version 4.5.26) https://github.com/owlcs/owlapi -->

@gouttegd
Contributor

gouttegd commented May 1, 2024

@cmungall This does not seem to be caused by this PR. I observe exactly the same behaviour with the current released version of ROBOT (1.9.5).

cmungall added a commit to INCATools/ontology-access-kit that referenced this pull request May 1, 2024
@gouttegd
Contributor

gouttegd commented May 6, 2024

Trying to merge the PR with the tip of the version4 branch results in the RoundTripTestCase#shouldRoundTripVersionInfo() test failing.

The test loads a test ontology in functional syntax, which contains the default prefix declarations (OWL, RDFS, etc.), and converts it to OBO by calling the OWLAPIOwl2Obo.convert() method, which directly translates the prefixes from the source ontology into IDSPACE tags – resulting in the aforementioned default prefix declarations being injected into the OBO header frame, which the test doesn’t expect.

What I absolutely do not understand is why this test only fails upon merging this PR with the version4 branch. From what I understand it should also fail right here, before we even try to merge…

@gouttegd
Contributor

gouttegd commented May 6, 2024

Not sure what the correct behaviour of the OWLAPIOwl2Obo.convert() method should be.

For now, it is set to “preserve prefix mappings loaded from a previous serialization”, which automatically results in the typical default prefixes (OWL, RDF, etc.) being converted into IDSPACE tags. Should we actively prevent those prefixes from being converted?

@gouttegd
Contributor

gouttegd commented May 6, 2024

> What I absolutely do not understand is why this test only fails upon merging this PR with the version4 branch. From what I understand it should also fail right here, before we even try to merge…

So it seems this has to do with the version of the SureFire plugin used to run the test suite.

Jim’s replaced-by-value-as-iri-v4 branch still uses version 2.20 (same version as the one used in OWLAPI 4.5.26). The current version4 branch (upcoming 4.5.27) uses SureFire 3.2.5.

Changing just the version of SureFire in Jim’s branch to use the same 3.2.5 as in the current version4 branch leads to the aforementioned test failure.

@ignazio1977
Contributor

ignazio1977 commented May 6, 2024 via email

@ignazio1977
Contributor

My view:

OWLAPIOwl2Obo.convert()

needs to exclude the standard prefixes, as you do not want them as idspace declarations (please confirm if I've misunderstood this). Implementing a simple list of namespaces to exclude from idspace declarations makes the test pass.
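As a rough illustration of such an exclusion list (not the actual patch; `StandardPrefixFilterSketch` and `idspacesToWrite` are hypothetical names), the filter could operate on a plain prefix-to-namespace map. The namespaces listed are the usual OWL, RDF, RDFS, XSD and XML ones.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical filter that keeps standard namespaces out of idspace tags. */
class StandardPrefixFilterSketch {
    static final Set<String> STANDARD_NAMESPACES = new HashSet<>(Arrays.asList(
            "http://www.w3.org/2002/07/owl#",
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "http://www.w3.org/2000/01/rdf-schema#",
            "http://www.w3.org/2001/XMLSchema#",
            "http://www.w3.org/XML/1998/namespace"));

    /** Returns the prefixes that should be written as idspace declarations. */
    static Map<String, String> idspacesToWrite(Map<String, String> prefixToNamespace) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : prefixToNamespace.entrySet()) {
            // Filter on the namespace, so a renamed standard prefix is excluded too.
            if (!STANDARD_NAMESPACES.contains(e.getValue())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

Filtering on the namespace rather than on the prefix name is a design choice of this sketch. It cannot distinguish a standard prefix that was declared on purpose from one that was injected automatically.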

Another failing test has to do with an idspace declaration not being included (sw in this case). I believe this is because the idspace map built during the ontology parse is not used to put the prefixes in the fresh OBO document format object created for the parsing.

I'm experimenting with these two fixes to see if that will allow the tests to pass. Please let me know if you think I'm going the wrong way about this.

@gouttegd
Contributor

gouttegd commented May 7, 2024

> OWLAPIOwl2Obo.convert() needs to exclude the standard prefixes, as you do not want them as idspace declarations (please confirm if I've misunderstood this).

I’ve come to the same conclusion.

> Please let me know if you think I'm going the wrong way about this.

No, I think you’re right. @balhoff what do you think?

@ignazio1977
Contributor

Regarding the failure/non-failure due to Surefire: I think the problem might be that this file is named Test rather than TestCase and is a JUnit 5 test (as all other test cases now are); perhaps the old Surefire was missing those altogether.

@ignazio1977
Contributor

I got a clean build and have pushed up the commit, any feedback welcome.

I'll check other existing PRs and bug reports, if there are any easy wins I'll add them, then I can release later today or tomorrow

@gouttegd
Contributor

gouttegd commented May 7, 2024

I built the latest ROBOT with the current tip of the version4 branch and did some quick tests on my ontologies, it seems to behave as intended.

(FWIW I ran into some Log4J-related issues when building ROBOT against the new OWLAPI, but this has nothing to do with this PR and is almost certainly a problem on the ROBOT side.)

@balhoff
Contributor Author

balhoff commented May 7, 2024

Thanks @ignazio1977! Where can I see your changes? My intent with the standard prefixes was that I didn't want them automatically injected into a fresh namespace manager, but if they were actually purposely declared then I would include them. Maybe it's an impossible distinction to work out.

@ignazio1977
Contributor

@balhoff I think it would be quite hard to tell automatically added namespaces apart from purposely declared standard ones, especially on existing ontologies. On freshly created ones the prefix map would be empty, so the presence of namespaces would mean manual insertion, but at renderer level there would be no way of knowing that.

Latest patch set is a squashed commit: 1f4764d

@ignazio1977
Contributor

I'll check the current libraries for known vulnerabilities and proceed with the release.

@ignazio1977
Contributor

No known vulnerabilities, but quite a bit of pain updating GPG keys for the new release. Now I'm stuck with credentials for Sonatype, to release to Central. My credentials don't seem to work.

I'll try again tomorrow, if no joy I'll raise a ticket with sonatype.

@ignazio1977
Contributor

ignazio1977 commented May 8, 2024

Release claims to have completed, after two or three unsuccessful attempts.

Maven Central will take a bit to synchronise, I'll check tomorrow.

cmungall added a commit to INCATools/ontology-access-kit that referenced this pull request May 8, 2024
@matentzn

matentzn commented May 9, 2024

What's the significance of all these "conflicting files" in this PR? Is this PR merged / taken into account for the release?

@ignazio1977
Contributor

Not significant. Part of the PR was already merged, I've rebased and squashed locally before merging the fixed branch, that left GitHub with two conflicting branches.

The PR was merged. The release doesn't seem to have made it to Maven Central though. I'll go check Sonatype.

@ignazio1977
Contributor

Looks like yesterday my attempts managed to publish only one artifact, and today that artifact is stopping the release from continuing.

I've incremented the version to 4.5.28 and released again. This seems to have been successful; I can see the release in Sonatype. It is not visible yet on Maven Central, which usually takes a couple of hours.

@jamesaoverton

I can see 4.5.28 on Maven Central now: https://central.sonatype.com/artifact/net.sourceforge.owlapi/owlapi-api/overview. Thanks!

@matentzn

matentzn commented May 9, 2024

THAAAANK you @ignazio1977!! It works: ontodev/robot#1200

7 participants