Clean up code in jena.ext.xerces #2828

Open
Ostrzyciel opened this issue Nov 6, 2024 · 0 comments
Labels
task Task

Change

This is a continuation of #2797: trimming the bloat in datatyped literal validation.

That issue tackled the worst offender: allocating two hash maps for each validated literal. That is not the whole story, however. For every literal validation we still allocate a new ValidationState:

    public Object parse(String lexicalForm) throws DatatypeFormatException {
        try {
            ValidationContext context = new ValidationState();
            ValidatedInfo resultInfo = new ValidatedInfo();
            typeDeclaration.validate(lexicalForm, context, resultInfo);
            // ...

This allocation is unnecessary, because the object is never really mutated (or, to be precise, does not need to be mutable), yet it has quite a few fields. The JVM can probably optimize the allocation away with escape analysis, but it is still needless bloat.
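To illustrate the idea, here is a minimal sketch of validating against one shared, immutable context instead of allocating a new one per literal. Every name in it (`ReadOnlyContext`, `SharedContextSketch`, `validateInteger`) is a hypothetical stand-in and not the actual Jena or Xerces API; the real fix may well look different.

```java
// Hypothetical stand-in for a validation context; NOT the Jena/Xerces class.
final class ReadOnlyContext {
    // One shared instance is safe because the object holds no mutable state.
    static final ReadOnlyContext INSTANCE = new ReadOnlyContext();

    private ReadOnlyContext() {}

    // A fixed answer standing in for a flag a literal validator might query.
    boolean needToNormalize() { return true; }
}

final class SharedContextSketch {
    // Validates an integer-like lexical form using the shared context,
    // so no context object is allocated per call.
    static boolean validateInteger(String lexicalForm) {
        ReadOnlyContext ctx = ReadOnlyContext.INSTANCE;
        String s = ctx.needToNormalize() ? lexicalForm.trim() : lexicalForm;
        return s.matches("[+-]?[0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(validateInteger(" 42 "));
        System.out.println(validateInteger("4x2"));
    }
}
```

Because the context is immutable, sharing it across threads needs no synchronization in this sketch.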

The entire org.apache.jena.ext.xerces package contains a lot of unused code carried over from Xerces. Much of the infrastructure is not needed in Jena, because the more complex XML features make no sense in the context of RDF literals. For example, a large part of the original job of ValidationState was to check that ID and IDREF attributes are consistent with one another. This, along with a few other XML-specific features, "SHOULD NOT" be used in RDF according to the spec, so I think we can safely remove it.

The plan

  • Remove code for special handling of datatypes: QName, ENTITY, ID, IDREF, NOTATION, IDREFS, ENTITIES, NMTOKENS, none of which should be used in RDF.
    • Note that although ID/IDREF validation is implemented in Jena, it currently does not work, because the ValidationState is allocated once per literal, not once per document. And what would a "document" even mean in RDF?
  • Create a single ValidationState per Jena instance, or something to a similar effect.
  • Remove the remaining unused code from the xerces package to improve maintainability, make the JARs a bit smaller, etc.
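
The note about ID/IDREF validation can be sketched as follows. This is a simplified, hypothetical model of an ID table (`IdTableSketch` and its methods are invented for illustration, not the Xerces implementation). It shows why a uniqueness check that gets fresh state for every literal can never find a duplicate.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified ID table; NOT the Xerces implementation.
final class IdTableSketch {
    private final Set<String> seenIds = new HashSet<>();

    // Returns false when the ID was already declared in this table.
    boolean declareId(String id) {
        return seenIds.add(id);
    }

    // Counts duplicates when every literal gets its own fresh table.
    static int duplicatesWithFreshTablePerLiteral(List<String> ids) {
        int duplicates = 0;
        for (String id : ids) {
            IdTableSketch table = new IdTableSketch(); // new state per literal
            if (!table.declareId(id)) duplicates++;
        }
        return duplicates;
    }

    // Counts duplicates when one table is shared across all literals.
    static int duplicatesWithSharedTable(List<String> ids) {
        IdTableSketch table = new IdTableSketch();
        int duplicates = 0;
        for (String id : ids) {
            if (!table.declareId(id)) duplicates++;
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a", "b", "a");
        System.out.println(duplicatesWithFreshTablePerLiteral(ids));
        System.out.println(duplicatesWithSharedTable(ids));
    }
}
```

With a fresh table per literal the duplicate "a" goes unnoticed; only the shared table catches it, and a shared table presupposes a document-like scope that RDF does not define.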

How do I do this?

What is the process for making breaking changes to Jena APIs? I assume that even if a public class is not used in the Jena codebase, it can be removed only in a MAJOR release. So, should I do something like:

  • Make a PR deprecating the affected public methods and classes and marking them for removal in Jena 6
  • Before the Jena 6.0 release (but after the last Jena 5.x release), make a PR that actually removes the code

?
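
If the deprecate-then-remove route turns out to be the accepted one, the first step might look roughly like the sketch below. It uses the standard Java `@Deprecated(forRemoval = true)` annotation (available since Java 9). The class `UnusedValidator` and the helper `isMarkedForRemoval` are hypothetical, and whether this matches Jena's actual process for breaking changes is exactly the open question above.

```java
final class DeprecationSketch {
    /**
     * Hypothetical class standing in for an unused public type in the xerces package.
     *
     * @deprecated Unused in Jena; intended for removal in a future major release.
     */
    @Deprecated(forRemoval = true)
    static final class UnusedValidator {
        /** @deprecated Unused in Jena; intended for removal in a future major release. */
        @Deprecated(forRemoval = true)
        static void resetIDTables() { }
    }

    // Reports whether a type is annotated as deprecated for removal.
    // @Deprecated has runtime retention, so reflection can see it.
    static boolean isMarkedForRemoval(Class<?> type) {
        Deprecated d = type.getAnnotation(Deprecated.class);
        return d != null && d.forRemoval();
    }

    public static void main(String[] args) {
        System.out.println(isMarkedForRemoval(UnusedValidator.class));
        System.out.println(isMarkedForRemoval(String.class));
    }
}
```

Marking an element with `forRemoval = true` makes the compiler emit removal warnings at call sites outside the declaring class, which gives downstream users notice before the code disappears.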

Notes on unused code

  • According to my IntelliJ, these methods of ValidationState are not used anywhere in the Jena codebase:
    • All set* methods, except setEntityState, which is used only in ValidationManager, which is in turn an unused class.
    • resetIDTables, reset, useNamespaces
    • Of course, this needs to be double-checked.
  • Other unused code:
    • ConfigurableValidationState class
    • DatatypeValidator interface
    • EntityState interface – it is only used in unused methods of ValidationState and ValidationManager
    • SchemaDVFactory methods: createTypeRestriction, createTypeList, createTypeUnion – this refers to some advanced XSD features... I think? Anyway, this doesn't seem to be used in RDF.
      • Implementations of these methods in child classes (BaseSchemaDVFactory and BaseDVFactory) are, by extension, also unused.
    • XSDDatatype: there is a huge comment block that should probably be removed. The inner static class XSDGenericType is also unused.
    • ValidatedInfo methods: stringValue, isComparable, copyFrom
    • NamespaceContext fields: XML_URI, XMLNS_URI and methods pushContext, popContext, declarePrefix, getDeclaredPrefixCount, getDeclaredPrefixAt
      • Actually, the whole interface is effectively unused... there are no classes implementing it. There is some code passing around instances of NamespaceContext, but these must always be null.

Are you interested in contributing a pull request for this task?

Yes
