allow PROV to be carried forward with update in certain scenarios #131
Comments
this needs much more testing
Thanks for chatting about this earlier today and writing this up. Stepping back, I see two general patterns: (1) Keeping PROV by default and (2) dropping PROV by default unless certain scenarios are met. You've outlined option (2) which I think is a reasonable approach. If we go that route, I think we want two things:
I think, for (2), we could go with a
Does something like that seem like a good solution? Oh, and as a general note, the way I've been thinking about this isn't retaining PROV statements but actually being able to split the statements in the ORE into two groups: (1) Data Packaging statements and (2) all other statements. So in this ticket, any time I/we say PROV statements we'd mean "other statements". I think this is a better approach for a few reasons:
|
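The split described above (Data Packaging statements versus all other statements) can be sketched roughly as follows. This is a minimal Python illustration, not code from `arcticdatautils`: the `split_statements` helper is hypothetical, triples are simplified to plain string tuples, and the set of predicates treated as "packaging" is an assumption that would need checking against real resource maps.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumption: predicates that count as "Data Packaging" statements.
# The set actually needed for DataONE resource maps may be larger.
PACKAGING_PREDICATES = {
    "http://www.openarchives.org/ore/terms/aggregates",
    "http://www.openarchives.org/ore/terms/isAggregatedBy",
    "http://www.openarchives.org/ore/terms/describes",
    "http://www.openarchives.org/ore/terms/isDescribedBy",
    "http://purl.org/spar/cito/documents",
    "http://purl.org/spar/cito/isDocumentedBy",
}


def split_statements(triples: Iterable[Triple]) -> Tuple[List[Triple], List[Triple]]:
    """Partition triples into (packaging, other) by predicate."""
    packaging: List[Triple] = []
    other: List[Triple] = []
    for triple in triples:
        if triple[1] in PACKAGING_PREDICATES:
            packaging.append(triple)
        else:
            other.append(triple)
    return packaging, other


example = [
    ("resource_map", "http://www.openarchives.org/ore/terms/describes", "aggregation"),
    ("O3", "http://www.w3.org/ns/prov#wasDerivedFrom", "O1"),
]
packaging, other = split_statements(example)
```

With this framing, PROV is just one kind of "other" statement, so the same carry-forward logic would apply to any triples added to the ORE documents later.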
When updating a document like an RDF resource map, I think the default should be to preserve its contents and not be lossy unless there is a specific reason to drop something. Losing RDF triples should be considered a bug, as we will likely add other triples to our ORE docs over time. We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in. I think the default behavior for our R (and other) packages should be:
This does require the software to have a model of a package that understands its components. The |
@mbjones, I'm a little confused about your 3rd point since it seems in conflict with your comments here, but maybe I am misunderstanding it. Let's say we have a package with the following PROV trace:
OBJECT_2 <---derivedFrom--- OBJECT_1 (using a SCRIPT during an EXECUTION)
If OBJECT_2 is updated with a new version of the file from a different execution of the same script, I think it is clear that you cannot blindly add PROV triples associated with the new object (this was the conclusion we came to over a year ago). If we instead only remove the PROV triples associated with OBJECT_2 (because OBJECT_2 is not included in the data package anymore) and don't make any assumptions about how the new version of the object fits in, the triples would look like this, with the last three rows dropped.
I'm not sure I know enough about the PROV model to say whether the result of dropping these triples leaves us with a valid trace or not. We should also consider whether this approach would work if we dropped all of the triples associated with a script that got updated. From a user perspective it would make it much easier to update their provenance because they would only have to add it back in for files that they updated, as opposed to all of the files. |
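The "only remove the triples associated with the updated object" idea in this comment could look something like the sketch below. It is a simplified Python illustration under stated assumptions: `drop_statements_for` is a hypothetical helper, triples are string tuples using bare pids rather than resolvable URIs, and the three example triples are illustrative rather than the exact rows from the trace discussed above.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def drop_statements_for(triples: Iterable[Triple], removed_pids: Iterable[str]) -> List[Triple]:
    """Keep only triples whose subject and object are both still in the package."""
    removed = set(removed_pids)
    return [t for t in triples if t[0] not in removed and t[2] not in removed]


# Illustrative trace (assumed shape, not copied from a real resource map).
trace = [
    ("OBJECT_2", "prov:wasDerivedFrom", "OBJECT_1"),
    ("OBJECT_2", "prov:wasGeneratedBy", "EXECUTION"),
    ("EXECUTION", "prov:used", "OBJECT_1"),
]

# OBJECT_2 was replaced by a new version, so statements mentioning it go.
remaining = drop_statements_for(trace, ["OBJECT_2"])
```

In this example only the `EXECUTION prov:used OBJECT_1` statement survives, which leaves an execution with no recorded output. That is the kind of leftover the validity question above is about.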
I agree, and that's the way we had it when we started inserting PROV into DataONE packages, but there are plenty of cases where migrating PROV forward doesn't make sense (which we've outlined here and elsewhere). @jeanetteclark's example above illustrates this well. Re:
This sounds good but it doesn't sound like a good fit for the |
Yeah it is not set up very well to do this kind of thing at all at the moment. This is definitely something we should consider when we start work on refactoring the R packages. In the meantime...we should find some kind of solution for |
@amoeba I finally got around to this and have a tested solution. Do you mind checking it out here: https://github.com/jeanetteclark/arcticdatautils/tree/carry_prov @dmullen17 it would be good if you had a look too. If we like this, I can create a pull request for more formal review. I have tests written up, but to try out the functionality, install the package from that branch and then play around with:
|
Hey @jeanetteclark, thanks for putting this together. The warning with example code is 💯 btw. Is defaulting to removing provenance ( PS: I had a few other comments that'd be suitable during code review I could make when you file a PR. |
Certainly something that could be up for debate! I think what you describe is already integrated into my workflow, just need to change the default arg. I'll create a PR and we can see where we get from there |
(carryover from this issue: NCEAS/metacatui#310)
Scenario A: update package with new object not part of existing PROV trace
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
O4 added as new object
Metadata updated
Scenario B: update metadata for a package without changing objects
O1 <---derivedFrom--- O3 (using script S1 during execution E2)
Metadata updated
In both situations, all prov relationships between O1 and O3 should be included in the new version of the package.
There is some code in `arcticdatautils` already that will carry forward PROV (see this commit), but it needs to be expanded so that PROV is carried over in the scenarios above, and not in other scenarios where pids involved in the PROV trace are updated with new versions.
So, before adding the carried-over PROV statements from the old resource map to the new resource map, I think we need to check that the pids contained within those statements all exist in the `data_pids` argument for `update_resource_map`.
If not all of the pids involved in the PROV trace exist in the `data_pids` argument, the function will drop all of the PROV statements in the updated version of the resource map. In this case should the function:
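The all-or-nothing check proposed above can be sketched as follows. This is a Python illustration of the logic only, not the R implementation: `carry_forward_prov` and its `old_pids` parameter are hypothetical, triples are simplified to string tuples with bare pids, and terms that were never pids in the old package (such as an execution node) are assumed to be exempt from the check.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def carry_forward_prov(
    prov_triples: Iterable[Triple],
    old_pids: Iterable[str],
    data_pids: Iterable[str],
) -> List[Triple]:
    """Return the PROV triples unchanged if every old-package pid they mention
    is still present in data_pids; otherwise return an empty list."""
    triples = list(prov_triples)
    old, new = set(old_pids), set(data_pids)
    # Only terms that were pids in the old package are checked; other nodes
    # (for example executions) are assumed not to appear in data_pids.
    mentioned = {term for s, _, o in triples for term in (s, o) if term in old}
    return triples if mentioned <= new else []


prov = [
    ("O3", "prov:wasDerivedFrom", "O1"),
    ("O3", "prov:wasGeneratedBy", "E2"),
]
old = ["O1", "O3", "S1"]

# Scenario A: O4 is added, nothing in the trace changed, so PROV is kept.
kept = carry_forward_prov(prov, old, ["O1", "O3", "S1", "O4"])

# O3 replaced by a new version: a pid in the trace is missing, so PROV is dropped.
dropped = carry_forward_prov(prov, old, ["O1", "O3_v2", "S1"])
```

Scenario B (metadata-only update) behaves like Scenario A here, because `data_pids` is unchanged. What the function should do beyond silently returning nothing in the dropped case is the open question posed above.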