Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add "unevaluatedProperties": false, (or "additionalProperties": false,) to jsonschema dump #75

Open
yarikoptic opened this issue Aug 10, 2021 · 13 comments · May be fixed by #266
Open

add "unevaluatedProperties": false, (or "additionalProperties": false,) to jsonschema dump #75

yarikoptic opened this issue Aug 10, 2021 · 13 comments · May be fixed by #266
Assignees

Comments

@yarikoptic
Copy link
Member

Per my

investigative dump in slack
may be about
            "affiliation": [
                {
                    "name": "Department of Neurosurgery, Cedars-Sinai Medical Center, Los Angeles, CA, USA",
                    "schemaKey": "Affiliation",
                    "includeInCitation": false
                }
            ],
            "includeInCitation": true

Yarik  1 hour ago
we have includeInCitation for Person and Organization, not for Affiliation per se

Yarik  40 minutes ago
and "schemaVersion": "0.4.4", ... but I don't see any immediately relevant change in diff -Naur releases/0.{4.4,5.1}/published-dandiset.json

Yarik  26 minutes ago
my foo in jsonvalidation is weak -- but it seems that at least Python's validator doesn't care about "extra" attributes -- validates just fine...

Yarik  18 minutes ago
the same here (see having first bogus field in the data): https://www.jsonschemavalidator.net/s/pKySmfD1 -- all green.  So I guess we somehow missed producing those extra attributes which aren't described by the model.  But I think it might have been in the past and @satra's migration cleaned it up
dandischema/tests/data/metadata/meta_000004.json-      "name": "initiative, BRAIN",
dandischema/tests/data/metadata/meta_000004.json-      "roleName": [
dandischema/tests/data/metadata/meta_000004.json-        "dcite:Sponsor"
dandischema/tests/data/metadata/meta_000004.json-      ],
dandischema/tests/data/metadata/meta_000004.json:      "includeInCitation": false,
--
dandischema/tests/data/metadata/meta_000004old.json-      "affiliation": [
dandischema/tests/data/metadata/meta_000004old.json-        {
dandischema/tests/data/metadata/meta_000004old.json-          "name": "Department of Neurosurgery, Cedars-Sinai Medical Center, Los Angeles, CA, USA",
dandischema/tests/data/metadata/meta_000004old.json-          "schemaKey": "Organization",
dandischema/tests/data/metadata/meta_000004old.json:          "includeInCitation": false
dandischema/tests/data/metadata/meta_000004old.json-        }
dandischema/tests/data/metadata/meta_000004old.json-      ],
dandischema/tests/data/metadata/meta_000004old.json:      "includeInCitation": true
(git)lena:~/proj/dandi/dandischema[master]
$> grep -A5 '"affiliation' dandischema/tests/data/metadata/meta_000004.json | head
      "affiliation": [
        {
          "schemaKey": "Affiliation",
          "name": "Department of Neurosurgery, Cedars-Sinai Medical Center, Los Angeles, CA, USA"
        }
      ]
...
$> grep schemaVersion dandischema/tests/data/metadata/meta_000004*.json
dandischema/tests/data/metadata/meta_000004.json:  "schemaVersion": "0.4.0",
dandischema/tests/data/metadata/meta_000004old.json:  "schemaVersion": "0.3.0",
so a I  am just a bit  confused why non-old one is just 0.4.0 here

Yarik  16 minutes ago
stopping here with a question: how to make jsonschema to error out on encountering keys which are not in the model?

Yarik  9 minutes ago
HA "additionalProperties": false,  but it seems to be not "inherited", so would need to be defined for every "level" (edited) 

I think the fact that we do not restrict exported jsonschema to not allow extra attributes (fields) is what is behind the koumoul-dev/vuetify-jsonschema-form#284 (comment) and might lead to some data loss (whenever an arbitrary extra attribute is simply not accompanied with corresponding UI ) or just cause inefficiencies or crashes.

I guess we could easily add "unevaluatedProperties": false to every record thus making validation more stringent etc.

@satra
Copy link
Member

satra commented Aug 10, 2021

we may be able to use extra during validation: https://pydantic-docs.helpmanual.io/usage/model_config/ - i need to check if this adds anything to the json schema.

@satra
Copy link
Member

satra commented Aug 10, 2021

we will also have to explicitly drop "@context"

@yarikoptic
Copy link
Member Author

we will also have to explicitly drop "@context"

can't we add/define it to jsonschema export?

@satra
Copy link
Member

satra commented Aug 11, 2021

the issue is on the ingestion side for validate/migrate/process

@yarikoptic
Copy link
Member Author

yarikoptic commented Nov 6, 2024

@candleindark please look into how we could make our exported jsonschema more stringent and why pydantic by default does not define "additionalProperties": false or how should we change DandiBaseModel so that everything is more stringent. we might also explicitly add allowing for @context which ATM for some reason is added at dandi-archive level (as we figured out).

ATM we have only 1 object type defined with additional restriction -- that all additional properties of a digest should contain a string value.

❯ grep -4 additionalProper ../schema/releases/0.6.8/asset.json
      "nskey": "schema",
      "title": "File encoding format"
    },
    "digest": {
      "additionalProperties": {
        "type": "string"
      },
      "nskey": "dandi",
      "title": "A map of dandi digests to their values",

And indeed we would likely need to explicitly define @context as available field in json serialization (not sure how in pydantic since as cannot start with @, @candleindark mentions some "alias" feature).

@candleindark
Copy link
Member

candleindark commented Nov 7, 2024

@yarikoptic @satra Below is a demo of including "@context" as a field in the model and making the model not to accept any extra/additional field values.

import json

from pydantic import BaseModel, Field, AnyHttpUrl, ConfigDict, ValidationError
from jsonschema import validate, Draft202012Validator
import jsonschema


class M(BaseModel):
    context: AnyHttpUrl = Field(..., alias="@context")

    model_config = ConfigDict(extra="forbid")


# Validate a model expressed in JSON
m = M.model_validate_json('{"@context": "https://schema.org"}')

# Dump validated model
print(m.model_dump_json(by_alias=True))
'{"@context":"https://schema.org/"}'

# Provide extra field in validation input
try:
    M.model_validate_json('{"@context": "https://schema.org", "e": 42}')
except ValidationError as e:
    print("====================================")
    print(e)
    """
    1 validation error for M
    e
      Extra inputs are not permitted [type=extra_forbidden, input_value=42, input_type=int]
        For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden
    """

# JSON schema of the model
schema = M.model_json_schema()
with open("json_schema.json", "w") as f:
    json.dump(schema, f, indent=2)

instance = {"@context": "https://schema.org"}
instance_with_invalid_context = {"@context": "invalid context"}
instance_with_extra = {"@context": "https://schema.org", "e": 42}

validate(instance, schema, format_checker=Draft202012Validator.FORMAT_CHECKER)

try:
    validate(
        instance_with_invalid_context,
        schema,
        format_checker=Draft202012Validator.FORMAT_CHECKER,
    )
except jsonschema.exceptions.ValidationError as e:
    print("====================================")
    print(e)
    """
    'invalid context' is not a 'uri'
    Failed validating 'format' in schema['properties']['@context']:
        {'format': 'uri', 'minLength': 1, 'title': '@Context', 'type': 'string'}
    On instance['@context']:
        'invalid context'
    """

try:
    validate(
        instance_with_extra, schema, format_checker=Draft202012Validator.FORMAT_CHECKER
    )
except jsonschema.exceptions.ValidationError as e:
    print("====================================")
    print(e)
    """
    Additional properties are not allowed ('e' was unexpected)
    Failed validating 'additionalProperties' in schema:
        {'additionalProperties': False,
         'properties': {'@context': {'format': 'uri',
                                     'minLength': 1,
                                     'title': '@Context',
                                     'type': 'string'}},
         'required': ['@context'],
         'title': 'M',
         'type': 'object'}
    On instance:
        {'@context': 'https://schema.org', 'e': 42}
    """

The JSON schema of the model is the following.

The generated `json_schema.json`
{
  "additionalProperties": false,
  "properties": {
    "@context": {
      "format": "uri",
      "minLength": 1,
      "title": "@Context",
      "type": "string"
    }
  },
  "required": [
    "@context"
  ],
  "title": "M",
  "type": "object"
}

Please let me know if you have any objection to adding the "@context" field to the Dandiset (and PublishedDandiset) model and forbidding extra field values in the models in this manner.

Questions:

  1. Is the "@context" field only needed in the Dandiset and PublishedDandiset models? Is it also needed in the Asset and PublishedAsset models?

Note:
Adding the "@context" to the models may complicate the translation to LinkML, but let's worry about that later.

@yarikoptic
Copy link
Member Author

Very nice. Thanks!

  1. Is the "@context" field only needed in the Dandiset and PublishedDandiset models? Is it also needed in the Asset and PublishedAsset models?

yes, AFAIK in all the models if we want to unify it fully with jsonld

dandi@drogon:/tmp$ curl --silent https://dandiarchive.s3.amazonaws.com/dandisets/000003/0.230629.1955/assets.jsonld | jq .[0] | grep @context
  "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.4.4/context.json",

it seems we have somewhat of dichotomy now:

  • pydantic model -- no @context in any form
  • jsonschema exports: also no @context (besides context.json):
pwd
/home/yoh/proj/dandi/schema/releases/0.6.8
❯ grep -l @context *
context.json
 ❯ 
  • then we add @context into models on server,
❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/dandisets/000003/versions/draft/' -H 'accept: application/json' | jq . | grep @cont
  "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json",
❯ curl --silent -X GET 'https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/d426ff9a-bab3-446d-8104-e373ae188bd3/' | jq -r '."@context"'
https://raw.githubusercontent.com/dandi/schema/master/releases/0.4.4/context.json

so would be good to unify. Please check if anything else differs between "pure" json model and "jsonld" we are exporting

Please let me know if you have any objection to adding the "@context" field to the Dandiset (and PublishedDandiset) model and forbidding extra field values in the models in this manner.

ideally we would need to check all current models in the dandisets before jumping to such a change while avoid "minor" (which is our "major") version boost and some metadata migrations.

It is unfortunate that our json dumps of models we carry in https://github.com/dandisets/ and not exact copy of the manifests we dump on S3... so for such a check it would be needed to "quickly" get them all from S3 s3://dandiarchive/dandisets/ and test draft versions... the "difficulty" might be is that models might need to be migrated first before checking.

@satra
Copy link
Member

satra commented Nov 7, 2024

we can go from linkml to jsonld context (the jsonld context is only relevant to instances of data). for a pydantic model, we do want to autogenerate it from linkml, and thus we would rely on the linkml generator. i don't think we want to hand craft pydantic models, perhaps only hand patch them if needed. also we could consider the linkml generator creating an appropriate context file.

in the current schema we generate that context file by hand in a function.

@candleindark
Copy link
Member

we can go from linkml to jsonld context (the jsonld context is only relevant to instances of data). for a pydantic model, we do want to autogenerate it from linkml, and thus we would rely on the linkml generator. i don't think we want to hand craft pydantic models, perhaps only hand patch them if needed. also we could consider the linkml generator creating an appropriate context file.

in the current schema we generate that context file by hand in a function.

@yarikoptic In this case, may be we don't need to do anything regarding the @context field in the Pydantic models and allow extra field values in the data instances for now? I know that LinkML doesn't allow extra field values in validation of data instances by default.

@yarikoptic
Copy link
Member Author

so, with the goal to make export jsonschema more stringent lets

  • make pydantic model with model_config = ConfigDict(extra="forbid")
  • in the export to jsonschema, define @context to be permitted

BTW -- I stared to import from S3 , will give you access to all of those to see if anything gets broken by more stringent checks

@yarikoptic
Copy link
Member Author

@candleindark dump of manifests from S3 could be found on drogon:/mnt/backup/dandi/dandiset-manifests-s3cmd . Let me know if you have problem accessing them

@candleindark
Copy link
Member

candleindark commented Nov 12, 2024

so, with the goal to make export jsonschema more stringent lets

  • make pydantic model with model_config = ConfigDict(extra="forbid")
  • in the export to jsonschema, define @context to be permitted

This will work but only in the jsonschema level. If we do any validation at Pydantic level against data instances that have the "@context" value, then this solution doesn't work.

from pydantic import BaseModel, Field, AnyHttpUrl, ConfigDict, ValidationError


class M0(BaseModel):
    pass
    # model_config["extra"] defaults to "ignore"

class M1(BaseModel):
    context: AnyHttpUrl = Field(..., alias="@context")

    model_config = ConfigDict(extra="forbid")

class M2(BaseModel):
    model_config = ConfigDict(extra="forbid")

data_instance = {"@context": "https://schema.org"}


# Validate without any problem
m0 = M0.model_validate(data_instance)

# Validate without any problem
m1 = M1.model_validate(data_instance)

try:
    m2 = M2.model_validate(data_instance)
except ValidationError as e:
    print(e)
    """
    1 validation error for M2
    @context
      Extra inputs are not permitted [type=extra_forbidden, input_value='https://schema.org', input_type=str]
        For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden
    """

In other words, if you want the proposed change, then we must make sure all data instances have the "@context" key removed when they are validated against the corresponding Pydantic model. @yarikoptic Do we proceed with this change?

@candleindark
Copy link
Member

@yarikoptic Never mind about the last post. I think I have a way around the issue.

@candleindark candleindark linked a pull request Nov 18, 2024 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants