Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for GeneratedBy in any sidecar .json file to collect their provenance #1970

Open
yarikoptic opened this issue Oct 25, 2024 · 19 comments
Open
Labels
good first issue Good for newcomers

Comments

@yarikoptic
Copy link
Collaborator

Your idea

and reflecting on

for GeneratedBy to be universally adopted at the Dataset level.

To accommodate provenance "per file", e.g. for dcm2niix and alike, we better allow for GeneratedBy at individual file level.

@Remi-Gau
Copy link
Collaborator

@cmaumet @satra
isn't that falling under the scope of the provenance BEP?

@effigies
Copy link
Collaborator

Not to speak for dcm2niix developers, but I am skeptical that a request to replace two JSON keys encoding (name, version) with a provenance sidecar will be well received. The complexity increase seems difficult to justify in a do-one-thing-correctly tool.

I think of prov as the domain of wrappers. Heudiconv could implement the proposal in this PR without any cooperation from dcm2niix.

It could also upgrade dcm2niix's fields to BIDS-prov, and skip over this step altogether.

@satra
Copy link
Collaborator

satra commented Oct 25, 2024

one option is to add the provenance keys to any sidecar json as an option.

@effigies
Copy link
Collaborator

@satra Could you elaborate how, for example, https://github.com/OpenNeuroDatasets/ds000224/blob/master/task-rest_bold.json would be changed to use the provenance keys?

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Oct 25, 2024

What provenance keys exactly @satra and @effigies? In the scope of this issue I would see that file instead of

	"ConversionSoftware": "dcm2niix",
	"ConversionSoftwareVersion": "v1.0.20170411 GCC4.4.7",

having

     "GeneratedBy": [
          {
             "Name": "dcm2niix",
             "Version": "v1.0.20170411 GCC4.4.7",
             "Description": "Conversion from DICOM"
          }
     ]

or alike. Ref: https://bids-specification.readthedocs.io/en/stable/glossary.html#generatedby-metadata for overall specification and other fields possible (CodeURL, Container) ATM,
.
edit 1: replaced "message" with "Description" in the example transformed record and extended with url etc

@yarikoptic
Copy link
Collaborator Author

@cmaumet @satra isn't that falling under the scope of the provenance BEP?

Provenance BEP is formalizing specification of .prov.jsonld. I am talking about expanding the use of already formalized in BIDS metadata field.

Similarly to adoption of _desc as an "easier quick way to describe provenance" and already having "GeneratedBy" side-car file we indeed in part would duplicate formalization in BEP028: Provenance which has much "wider" scope and an unknown ETA, but it could have a chance to take advantage of already existing in BIDS formalizations to produce minimal .prov.jsonld from those GeneratedBy fields if we adopt them.

@yarikoptic
Copy link
Collaborator Author

Heudiconv could implement the proposal in this PR without any cooperation from dcm2niix.

It could also upgrade dcm2niix's fields to BIDS-prov, and skip over this step altogether.

a neat idea, I like it! We would just need to merge BIDS-prov first. But also the question here was triggered by dcm2niix good desire to produce "BIDS standard" metadata fields to start with.

@satra
Copy link
Collaborator

satra commented Oct 28, 2024

we wanted to separate it into a different file simply for simplification that a tool could write out provenance without worrying about the bids json spec for an entity. this issue itself is bringing these two streams together.

there is currently a semantic conflict between the generatedBy keyword in bids spec versus bids prov. the prov keyword maps to an activity with potentially multiple agents (software or humans) being part of that activity. whereas the bids keyword maps directly to software agents that produced it. the simplification is fine, but it does mean the semantics of the keys are different, contextualized by specs, and will be in conflict with each other when brought together from a machine understanding perspective.

for keywords that could be brought in from prov, please see the Alternative representation for file-level provenance JSON-LD section of the BEP28 document. we did consider that eventually such a set of keys could be added to the json, which describes information about the blob to which it is connected in the first place.

@yarikoptic
Copy link
Collaborator Author

yarikoptic commented Oct 28, 2024

the simplification is fine, but it does mean the semantics of the keys are different, contextualized by specs, and will be in conflict with each other when brought together from a machine understanding perspective.

since we already have generatedBy formalized in BIDS, if we decide to rename or change format, we would need to slate it for BIDS 2.0 or whatever other major release to come and automate that change in migration/upgrade script. I initiated

so please chime in, or even better implement (e.g. for BIDS 2.0 targetting #1775) on that aspect there.

This issue is largely orthogonal from that discussion and just expands adoption of already formalized in BIDS generatedBy.

@yarikoptic
Copy link
Collaborator Author

@effigies WDYT about the example I have provided above? I have improved it slightly now.

  • I maintain my opinion that it would be great to formalize GeneratedBy to the file level metadata
    • question: I failed to identify file in the schema where we would define common sidecar JSON fields . Do we even have any so far?
  • This could be a nice reason to extract/formalize "migrate" functionality from BIDS-2.0 #1775 and make it easy for users to convert already present fields into this new standardized form from earlier versions.

@cmaumet
Copy link
Collaborator

cmaumet commented Nov 14, 2024

@yarikoptic @effigies @satra @Remi-Gau Sorry for arriving late in the game, I missed this thread (thanks for the ping Remi!).

As we did for _desc as an "easier quick way to describe provenance", I'd like to better understand how GeneratedBy at the file level would position itself w.r.t BIDS-Prov. I think that we should as much as possible avoid having the same info duplicated in multiple places and maybe what we can do is try and adopt the BIDS-Prov modelling directly in the jsons rather than adding more GeneratedBy info that will later enter into conflict with BIDS-Prov as a proposal.

If that's of interest I am happy to work on this with @bclenet to propose a minimal BIDS-Prov-like example including the info you have in the example above @yarikoptic

@effigies
Copy link
Collaborator

Yes, I'd like to see what the inline prov option is. I couldn't guess it from the bep text.

@yarikoptic
Copy link
Collaborator Author

I agree that it would be great to see BIDS-Prov BEP finalized an a proper PROV records appear in BIDS. Also examples would be great -- in particular the one for above information on DICOM -> nii.gz conversion.

But I want to reiterate that regardless of the destiny of BIDS-Prov and its records, I am talking about already existing/formalized GeneratedBy. When BIDS-Prov is accepted we would ideally need to move/transform existing GeneratedBy into BIDS-Prov. But regardless of that -- meanwhile, we keep getting BIDS datasets proliferated with a completely ad-hoc and non-formalized fields. This issue is an effort to formalize them at least to the form already present in the standard. That could be done pretty rapidly. Seeing a totally new formalization accepted to the standard would take much longer IMHO, and thus make even more datasets proliferated with non-standardized metadata.

@satra
Copy link
Collaborator

satra commented Nov 14, 2024

in prov we separate the agent/software and the activity/process as separate constructs as each has its utility. here is an example of such a generatedby (more fields exist for processes and software).

"GeneratedBy":
    	  {
      	    "Id": "urn:filter-00f3a18f",
      	    "Type": "Activity",
      	    "Label": "Filter",
      	    "Used": "bids::sub-001/eeg/myfile_desc-filtered_eeg.set",
      	    "AssociatedWith":
      	    {
      	      "Id": "urn:eeglab-4a586b50",
      	      "Type": "Software",
      	      "Label": "EEGLAB",
      	      "Version": "v2023"
      	    }
    	}

it could also be simplified, but wanted to give the broader picture of the graph of provenance (what it used, how long did the process take, what software agent(s) were used).

@effigies
Copy link
Collaborator

So Yarik's example would become:

"GeneratedBy": {
  "Id": "urn:conversion-<nonce>",
  "Type": "Activity",
  "Label": "Conversion",
  "Used": "bids:sourcedata:somefile.dcm",
  "AssociatedWith": {
    "Id": "urn:dcm2niix-<nonce>",
    "Type": "Software",
    "Label": "dcm2niix",
    "Version": "v1.0.20170411 GCC4.4.7"
  }
}

Is "Used" required? That may be problematic if filenames are not anonymized. Is "Id" required? If so, guidance on how to generate the nonce would probably be required for users to accept it. And I suppose if Heudiconv does some kind of additional adjustment, it would look like:

"GeneratedBy": {
  "Id": "urn:curation-<nonce>",
  "Type": "Activity",
  "Label": "Curation",
  "Used": {
    "Id": "urn:converted-tmp-file-<nonce>",
    "GeneratedBy": {
      "Id": "urn:conversion-<nonce>",
      "Type": "Activity",
      "Label": "Conversion",
      "Used": "bids:sourcedata:somefile.dcm",
      "AssociatedWith": {
        "Id": "urn:dcm2niix-<nonce>",
        "Type": "Software",
        "Label": "dcm2niix",
        "Version": "v1.0.20170411 GCC4.4.7"
      }
    }
  },
  "AssociatedWith": {
    "Id": "urn:heudiconv-<nonce>",
    "Type": "Software",
    "Label": "heudiconv",
    "Version": "1.3.2"
  }
}

If we were able to forego Id and Used:

"GeneratedBy": {
  "Type": "Activity",
  "Label": "Curation",
  "Used": {
    "GeneratedBy": {
      "Type": "Activity",
      "Label": "Conversion",
      "AssociatedWith": {
        "Type": "Software",
        "Label": "dcm2niix",
        "Version": "v1.0.20170411 GCC4.4.7"
      }
    }
  },
  "AssociatedWith": {
    "Type": "Software",
    "Label": "heudiconv",
    "Version": "1.3.2"
  }
}

@yarikoptic:

But I want to reiterate that regardless of the destiny of BIDS-Prov and its records, I am talking about already existing/formalized GeneratedBy. When BIDS-Prov is accepted we would ideally need to move/transform existing GeneratedBy into BIDS-Prov. But regardless of that -- meanwhile, we keep getting BIDS datasets proliferated with a completely ad-hoc and non-formalized fields. This issue is an effort to formalize them at least to the form already present in the standard. That could be done pretty rapidly. Seeing a totally new formalization accepted to the standard would take much longer IMHO, and thus make even more datasets proliferated with non-standardized metadata.

A couple responses:

  1. I would rather not proliferate standards we intend to deprecate. If we know BIDS-prov is coming down the pipeline, it seems counterproductive to me to standardize on something else. Nothing is stopping Heudiconv from adopting GeneratedBy as a convention, as nonstandard metadata is always permitted.
  2. Why not use BIDS-prov? Has it been changing? My impression was that it's more-or-less done.

@yarikoptic
Copy link
Collaborator Author

  1. In general I am with you this. But note that there is currently no explicitly stated intent to deprecate GeneratedBy. If it would take BIDS-prov another year to get accepted -- I think we better of with intermediate solution meanwhile. Main target here is actually not heudiconv, but dcm2niix and its custom metadata fields -- there is motion to make .json produced sidecars and these are 2 fields already in use and IMHO worth approaching first.
  2. I am all for it, but it needs to be finalized/accepted first. Needs a strong active push though (prior PR BEP028 - Provenance #487 is closed, current state is in google doc). @cmaumet @satra - any realistic timeline there?

@satra
Copy link
Collaborator

satra commented Nov 14, 2024

any realistic timeline there?

the basic form of that doc has not changed in a while, and from my perspective it's done unless someone has issues. i think @cmaumet was creating more examples of practical use in a repo and then going to submit it as a PR.

@effigies - the id is graph specific and if that is removed (it will simply assume a blank random id). whether anyone can compare that would depend on the various entities/processes/software. your example is completely valid as a stripped down version for json, not jsonld, which would need some kind of id.

@cmaumet
Copy link
Collaborator

cmaumet commented Nov 21, 2024

@yarikoptic - with respect to timeline: to me the discussion halted in this issue bids-standard/BEP028_BIDSprov#125. This was after we (@satra, myself) shared with steering that BIDS-Prov was ready for community review. To me the path forward was not clear but I still believe BIDS-Prov is ready for community review :) And I am more than happy to push that now if someone can clarify how.

@yarikoptic
Copy link
Collaborator Author

Followed up on

Overall, I think there is a chance to meld what @effigies suggested (extracting from larger one) above to allow for .json sidecar to include smth like

    "GeneratedBy": {
      "Type": "Activity",
      "Label": "Conversion",
      "AssociatedWith": {
        "Type": "Software",
        "Label": "dcm2niix",
        "Version": "v1.0.20170411 GCC4.4.7"
      }
    }

but may be even with further defaults (where could I see schema on e.g. what other Types could there be in GeneratedBy etc) and thus minimizing the actual structure? Then in BEP we could start introducing recommended vocabulary , e.g. for "Activity" Type, Labels: "Conversion", "QA", "Preprocessing".

My point is that IMHO at the level of BIDS .json file it does not really need to be a "fully fledged" jsonld record as long as it is "compatible" (could be converted to) and not confusingly similar (e.g. re-using attributes for something else of a different type etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

5 participants