Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add more processing:expression examples for python, docker and generic URI #33

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

fmigneault
Copy link

fixes #31

@fmigneault
Copy link
Author

@emmanuelmathot @m-mohr FYI

Copy link
Contributor

@m-mohr m-mohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URI and Docker should be provided as links with the corresponding relation type.

How would the Python type be resolved?

@fmigneault
Copy link
Author

For python, text/x-python seems like a common occurrence to represent the media-type, although it is not really an official type.

I could see some combination like the following to be valid:

{
  "links": [
    {
      "href": "<uri-to-some-python-script>",
      "type": "text/x-python",
      "rel": "processing-expression",
      "processing:expression": {
        "format": "python",
        "expresion": "module:func"
      }
    }
  ]
}

I think it is valid to have either or both the links/field combinations.

One big issue about links though is that it is not useful for multiple assets.
Imagine for example that multiple assets are generated with different expressions. How do we map them properly to their respective links? It is preferable to have the expression available under the Item properties if it applies to all assets, or directly under respective assets for distinct combinations of expressions.

I don't see why there should be a distinction between gdal-calc vs python, docker, cwl or any other approach. In a similar fashion, an example corresponding to rel: processing-execution (OGC or openEO) might be relevant as well, considering that openeo is already listed as one possible processing:expression combination.

@m-mohr
Copy link
Contributor

m-mohr commented Apr 5, 2024

Actually, processing: field are only defined for / allowed in Item Properties and Collection Providers per the Readme:

Item Properties and Collection Provider Fields

The schema on the other hand, doesn't really reflect that. That's why links made sense to me due to the 1:1 mapping. So that's something to be discussed.

Edit: Updated in #32


Sorry for being ambiguos with regards to the "Python type", I meant how would I know to handle the module:func without having a reference to a script, especially as processing fields are not allowed in Links yet and in this case there's also no ambiguity in the schema.


The difference between the property and the link relation is just whether you can embed it into the JSON or not. openEO is JSON, can be embdedded. gdal-calc/rio-calc is a string, can be embdedded. CWL YAML need to be linked to (unless there's a JSON variant?). I guess you could embed code as a string, but that feels less useful than just linking to a folder or script. The format in the Expression Object is pretty much the equivivalent to the type in the links. So they are pretty much equivalent.


Conclusion: It feels like the main issue here is not updating the list of formats, but rather whether the processing:expression is allowed in more places compared to what the Readme tells us...

@fmigneault
Copy link
Author

I meant how would I know to handle the module:func without having a reference to a script

I was using a similar consideration to gdal-calc that assumes the package is preinstalled in the environment. Ideally, it should be a package that can be resolved the same way as pip install [the-package]. There could be a processing:software entry to add more explicit URI to some package registry/version if that is necessary.

CWL [...] (unless there's a JSON variant?)

Yes. YAML and JSON can be used interchangeably for CWL. Examples use YAML for readability and comments insertion, but JSON are part of its spec. They have (approved IANA) media-types as well: application/cwl, application/cwl+json.

It feels like the main issue here is not updating the list of formats, but rather whether the processing:expression is allowed in more places compared to what the Readme tells us...

I agree. I think supporting processing:expression in other places could be very useful because it allows us to provide the specific context at the right place where the processing occurred rather than leaving it up to interpretation.

Copy link
Member

@emmanuelmathot emmanuelmathot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with putting more examples as provided as long as it stays in the processing extension at item level.
@fmigneault I think you want to use processing for a larger use case that it was designed for (lineage tracing). I reckon the need for a better handling of processing in general but I am not sure that the processing extension is the right place for that.

@fmigneault
Copy link
Author

I don't understand why processing should be limited to Item level?
If distinct Assets are provided for respective bands, and that those bands are produced using a different combination of parameters with some processing function, then the lineage of the band data would be best described by the specific processing:expression under that Asset.

@emmanuelmathot
Copy link
Member

yes agree for the asset but the request was at link level.

@m-mohr
Copy link
Contributor

m-mohr commented May 8, 2024

What's the status of this @fmigneault? I didn't quite get the last comments.

My general thinking is still that for provenance purposes you can just link to Python scripts and Docker images using the relation type processing-expression (admittedly, the name is a bit weird).
So maybe this doesn't need a spec changes and only either a bit of description and/or examples?

The processing:expression field was for embeddable strings/objects, but you seem to point to external bits so links seem to be the right choice.

@m-mohr m-mohr added this to the 1.1.0 milestone May 8, 2024
@fmigneault
Copy link
Author

A link with processing-expression is not clear enough when the processing applies only to a specific asset.

There is also the case of docker that, technically, is not a URL (no protocol). Therefore, it cannot be used in links. The server and schema will sometime "accept" it, thinking it is a relative path (form org/image:tag), but an invalid https://<api-base.url>/org/image:tag will be resolved by the API, which will make it invalid for docker pull (see stac-extensions/ml-model#9).

The specification does not need any change. Users are still freely allowed to do what I am proposing. So, my recommendation would be to include the examples to provide guidance, inspiration, and hopefully align methodologies.

@m-mohr
Copy link
Contributor

m-mohr commented May 8, 2024

I see the point about assets, but on the other hand it would also encourage using these examples in the Item properties although they should be in the links. So I'm not convinved, but also don't have a direct alternative for assets (other than adding a links arrays). I have similar issues in the CEOS ARD extension where the connection between assets and links gets somewhat lost. So maybe this is a more generic question for STAC spec: How to assign additional links to assets?

There are three examples in the PR:

  • URI - this can clearly be expressed in a link
  • Docker - I still think this should potentially also be a link. This ways it can be resolved and for docker pull you indeed need to parse and extract the protocol. But isn't this good anyway? If it's http it directly tells that it needs to handle it as insecure registry. I wouldn't bind this directly to the docker pull implementation.
  • Python - How's that meant to be resolved? my_package.my_module:MyClass.my_method Are clients meant to extract my_package from the string, run pip install my_package and then run my_package.my_module:MyClass.my_method somehow without parameters?

I lean towards releasing 1.1 for now and then merge this later once we are more clear on the path forward.

@m-mohr m-mohr removed this from the 1.1.0 milestone May 8, 2024
@fmigneault
Copy link
Author

I find links to be relevant for the full Item for navigation or cross-reference purpose with other STAC documents, but describing specific behaviors within the Item or Asset with them always raises some confusing resolution via some intermediate property, such as matching the Asset key with some other property provided in the link. It is just clearer and easier to interpret the source of the processing operation on an Asset when it is directly embedded in the Asset instead of links, since it comes along with the full definition of the Asset.

For example, if a URI to an external Python script was provided in links, the user retrieving the Asset details with pystac-client or any other utility does not immediately get an indication about that script. They have to make sure to look at the other links source, and resolve (somehow) if a given link applies for their specific Asset. It is unintuitive.

For Docker, the issue is that the HTTP reference does not map to the specific URI reference of the registry/image:tag. A registry could have a certain endpoint for the HTTP access (with some HTML rendering), but the URI only tells docker the necessary information to perform relevant API requests with a compliant registry. How to resolve the endpoints is not trivial, and just adds an unnecessary burden on the developer on how to resolve it correctly and handle edge cases. Using an HTTP link could also cause insecure requests to be sent by web scrapers, while the URI is just a plain string that does not resolve to any specific endpoint directly.

For Python, the proposed format is the same used by many libraries to reference a function, such as pytest does or any WSGI/ASGI pointing at an app. This does not mean that the function is "runnable" by itself, just like a docker is not necessarily runnable by itself without parameters either.

Whether the processing expression is executable (or how to do so) is not a concern for its STAC representation. Note that examples already provided in the README have exactly the same concern. An openeo process does not indicate which/how inputs are mapped or which outputs produced which Asset. A gdal-calc or rio-calc assume that those utilities are installed, and that bands are "somehhow" mapped to their A/B/etc. or b1/b4 assets. The processing:expression remains an indication without any promise it can be executed directly, but should at least use an expression representation that looks familiar by users of the proposed tool, so they know easily what to do with it.

@m-mohr
Copy link
Contributor

m-mohr commented May 17, 2024

@fmigneault This is probably of interest for you: radiantearth/stac-spec#1284

@fmigneault
Copy link
Author

Links per asset is a nice new feature, but I believe the current examples are still relevant and better described using an embedded processing:expression within an asset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add more example expression objects
3 participants