
Please submit pointers to your validator output formats #1

Open · yarikoptic opened this issue Oct 3, 2024 · 7 comments

@yarikoptic (Member) commented Oct 3, 2024

Please check out the README in https://github.com/con/validation for the motivations etc.

@rly commented Oct 3, 2024

The NWB validator can be called from the command line and as a Python function.

The CLI is described here. When validation passes, the output looks like:

Validating /Users/rly/Documents/NWB/scratch/test.nwb against cached namespace information using namespace 'core'.
 - no errors found.

with exit code 0.

When it fails, it looks like:

Validating /Users/rly/Documents/NWB/scratch/test.nwb against cached namespace information using namespace 'core'.
 - found the following errors:
SpatialSeries/data (processing/behavior/position/series_0/data): incorrect shape - expected '[[None], [None, 1], [None, 2], [None, 3]]', got '(27206, 6)'

with exit code 1.

The Python function is described here. Using the paths keyword argument returns a tuple (list of errors, exit code). Using the io keyword argument returns a list of errors.
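For reference, a minimal sketch of driving it from Python, relying only on the paths behavior described above (the file path is illustrative):

    from pynwb import validate

    # With the paths keyword the validator returns (list of errors, exit code),
    # per the description above; the file name here is just a placeholder.
    errors, status = validate(paths=["/path/to/test.nwb"])
    if status != 0:
        for err in errors:
            print(err)  # each error renders like the CLI messages shown above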

@yarikoptic (Member, Author) commented Oct 3, 2024

FWIW, here is where we are mapping pynwb.validate output into our ValidationResult: https://github.com/dandi/dandi-cli/blob/HEAD/dandi/pynwb_utils.py#L359 so it looks like

                ValidationResult(
                    origin=ValidationOrigin(
                        name="pynwb",
                        version=pynwb.__version__,
                    ),
                    severity=Severity.WARNING,
                    id=f"pywnb.{error_output}",
                    scope=Scope.FILE,
                    path=Path(path),
                    message="Failed to validate.",
                )

but that is quite suboptimal since there is only the error_output string, without better decomposition into the "path" within the file etc. For NWB Inspector outputs we seem to get some path within the asset (file) and map it as well: https://github.com/dandi/dandi-cli/blob/HEAD/dandi/files/bases.py#L546

@rly commented Oct 4, 2024

Thanks @yarikoptic. The pynwb validator errors are parseable. I tried to address this in dandi/dandi-cli#1513.
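Illustrative only (not necessarily what dandi/dandi-cli#1513 does): a sketch of splitting such an error string into the object name, the in-file path, and the reason:

    import re

    # Pattern inferred from the single example message above; real messages may differ.
    ERROR_RE = re.compile(r"^(?P<name>\S+) \((?P<location>[^)]+)\): (?P<reason>.+)$")

    msg = ("SpatialSeries/data (processing/behavior/position/series_0/data): "
           "incorrect shape - expected '[[None], [None, 1], [None, 2], [None, 3]]', got '(27206, 6)'")
    match = ERROR_RE.match(msg)
    if match:
        print(match.group("name"))      # SpatialSeries/data
        print(match.group("location"))  # processing/behavior/position/series_0/data
        print(match.group("reason"))    # incorrect shape - expected ..., got ...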

@VisLab commented Oct 4, 2024

The HED Python validator represents each issue as a dictionary, and the error list is a list of these dictionaries. The error code is used to select a template for composing the error message from the dictionary. Most of our calling functions then get a printable issue string for these. The issue list could trivially be output as JSON, but maximizing its usefulness would take some thought.

I guess the first step for us would be to define a JSON schema for the format of the issue list (which we currently don't have). We have a lot of fields, some of which correspond to yours and some don't.
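Purely as a strawman (the field names and types below are assumptions, not the actual HED issue fields), such a schema could start as simply as a typed list of issue objects:

    import jsonschema

    # Hypothetical first cut at an issue-list schema; every field here is a guess.
    ISSUE_LIST_SCHEMA = {
        "type": "array",
        "items": {
            "type": "object",
            "required": ["code", "message", "severity"],
            "properties": {
                "code": {"type": "string"},      # error code used to select the message template
                "message": {"type": "string"},   # formatted, human-readable message
                "severity": {"type": "integer"}, # numeric severity level
            },
            "additionalProperties": True,        # room for fields that don't map yet
        },
    }

    # Example check of a single-issue list against the strawman schema.
    jsonschema.validate(
        [{"code": "EXAMPLE_CODE", "message": "example message", "severity": 1}],
        ISSUE_LIST_SCHEMA,
    )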

@yarikoptic (Member, Author) commented:

Thank you @VisLab, of special interest would be those which don't currently have anything corresponding. Would you be so kind as to create a list/table of them?

@VisLab commented Oct 4, 2024

I went over the hed-python code. There are single-issue keys and context keys. The actual issue objects are nested in a tree structure so that context that applies to multiple issues can be printed once. This tree could be flattened.

Single issue keys:

| Key | Description |
| --- | --- |
| code | External HED error code from the specification, to be mapped to a web-page explanation |
| message | Formatted message after parameters are filled in, e.g. 'Invalid character "x08" at index 7' |
| severity | Numerical code: 1 = error, 10 = warning |
| index_in_tag | Position of the start of the problem in the tag |
| index_in_tag_end | Position of the end of the problem in the tag |
| source_tag | Pointer to an object with a lot of info, including a link to the object in the schema if available |

Context keys:

| Key | Description |
| --- | --- |
| ec_title | Overall title for the error report |
| ec_filename | File name this error applies to |
| ec_sidecarColumnName | Sidecar column name |
| ec_sidecarKeyName | Name of a categorical value |
| ec_row | Row number in a tsv file |
| ec_column | Column number in a tsv file |
| ec_line | Line number in the file for which the error is reported |
| ec_HedString | HED string in which the tag appears |
| ec_section | Section of the HED schema in which the error appears (for schema validation) |
| ec_schema_tag | Schema tag in which the error appears (for schema validation) |
| ec_attribute | Schema attribute for which the error appears (for schema validation) |

Sample tree structure of the issues:


{'children': [],
 ('ec_sidecarColumnName', 'defs'): {
     'children': [],
     ('ec_sidecarKeyName', 'def1'): {
         'children': [],
         ('ec_HedString', '(Definition/Apple, Definition/Banana, (Blue))'): {
             'children': [
                 {'code': 'DEFINITION_INVALID',
                  'message': "Too many tags found in definition for Apple.  Expected 1, found: ['Definition/Banana']",
                  'severity': 1,
                  'ec_sidecarColumnName': 'defs',
                  'ec_sidecarKeyName': 'def1',
                  'ec_HedString': <hed.models.hed_string.HedString object at 0x000002AFFF59B2E0>},
                 {'code': 'TAG_GROUP_ERROR',
                  'message': "Multiple top level tags found in a single group.  First one found: Definition/Apple. Remainder:['Definition/Banana']  Problem spans string indexes: 1, 17",
                  'severity': 1,
                  'source_tag': <hed.models.hed_tag.HedTag object at 0x000002AFFF4A2520>,
                  'ec_filename': '',
                  'ec_sidecarColumnName': 'defs',
                  'ec_sidecarKeyName': 'def1',
                  'ec_HedString': <hed.models.hed_string.HedString object at 0x000002AFFF54B7F0>,
                  'char_index': 1, 'char_index_end': 17}
             ]
         }
     }, ...
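As a hedged illustration of the flattening mentioned above (the tree layout is inferred from the sample; key names come from the tables), each leaf issue could be lifted out together with its inherited context keys:

    def flatten_issue_tree(node, context=None):
        """Flatten the nested issue tree sketched above into a flat list of dicts.

        Assumes each node maps (context_key, value) tuples to child nodes and
        keeps its direct issues under the 'children' key, as in the sample.
        """
        context = dict(context or {})
        flat = []
        for issue in node.get("children", []):
            flat.append({**context, **issue})  # inherited context plus the issue's own fields
        for key, child in node.items():
            if key == "children":
                continue
            ctx_key, ctx_value = key  # e.g. ('ec_sidecarColumnName', 'defs')
            flat.extend(flatten_issue_tree(child, {**context, ctx_key: ctx_value}))
        return flat

Applied to the sample above, the two leaf issues would come out with ec_sidecarColumnName, ec_sidecarKeyName, and ec_HedString filled in.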

@tgbugs commented Oct 7, 2024

The sparc validator is a bit weird: except in cases where something breaks due to a bug in the validator itself, we return the export as is or remove the malformed data (with a note in the validator output that it has been removed). The validator output is embedded as errors sections within objects, and a summary is lifted out; however, that summary is not currently described in the schema linked below because we don't validate that part of the export, so I will update the schema so that it is at least visible.

https://github.com/SciCrunch/sparc-curation/blob/master/sparcur/schemas.py

    "path_error_report": {
      "#/inputs/dataset_description_file": {
        "error_count": 6,
        "messages": [
          "'description' is a required property",
          "'name' is a required property",
          "'protocol_url_or_doi' is a required property"
        ]
      },
      "#/inputs/manifest_file/-1/checksums/-1": {
        "error_count": 1,
        "messages": [
          "{'type': 'checksum', ... 44 bytes later ... e3531d7671eab8911'} is not valid under any of the given schemas"
        ]
      },
      "#/inputs/submission_file/submission": {
        "error_count": 1,
        "messages": [
          "{'consortium_data_st ... 69 bytes later ... er': 'U19NS130608'} is not valid under any of the given schemas"
        ]
      },
      "#/meta/award_number": {
        "error_count": 1,
        "messages": [
          "'U19NS130608' does not match '^(OT2OD|OT3OD|U18|TR|U01)'"
        ]
      },
      "#/meta/techniques/-1": {
        "error_count": 2,
        "messages": [
          "'RNA -seq' is not a 'iri'",
          "'single-cell RNA sequencing' is not a 'iri'"
        ]
      }
    }
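A hedged sketch of how a consumer might flatten that path_error_report into per-path messages (the dict below is a trimmed copy of the sample above, not a sparcur API):

    # Assumed to be the "path_error_report" object parsed from the export JSON.
    path_error_report = {
        "#/meta/award_number": {
            "error_count": 1,
            "messages": ["'U19NS130608' does not match '^(OT2OD|OT3OD|U18|TR|U01)'"],
        },
    }

    # One record per (path within the export, message) pair.
    flat = [
        {"path": path, "message": message}
        for path, entry in path_error_report.items()
        for message in entry.get("messages", [])
    ]
    for issue in flat:
        print(f"{issue['path']}: {issue['message']}")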
