Cordex extension and trying to build on higher level abstractions #64
Conversation
About the "embedded schema", I think an extension could be defined as such (using YAML for short, but convert to JSON for applying it):
type: object
required:
- type
- properties
properties:
type:
const: Feature
properties:
$ref: "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"
Then, you can reuse the JSON schema on its own or as STAC extension definition.
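For reference, the same definition converted mechanically to JSON (the form that would actually get applied) would be:

```json
{
  "type": "object",
  "required": ["type", "properties"],
  "properties": {
    "type": {"const": "Feature"},
    "properties": {
      "$ref": "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"
    }
  }
}
```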
As for the PR itself, I have a strong impression that `THREDDSCatalogDataModel` is essentially trying to accomplish what the "helpers" were trying to do (but is missing some interface to connect the dots). It is a bit hard to analyze the code path with all the abstractions involved, so if I misinterpreted something in my comments, please let me know.
Other todos:
- Need to add `Ouranos_CMIP6-CORDEX` to the table in the README.
- Update changelog.
```python
uri = cls._schema_uri.default
if uri is not None:
    schema = json.load(open(uri))
```
Could be improved with a `requests` file handler, allowing either a local or remote URI, but that's not "blocking" for the PR.
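For reference, a minimal sketch of what such a handler could look like (an assumption for illustration, not the PR's code): accept either a local path or an http(s) URI.

```python
import json
from urllib.parse import urlparse

import requests


def load_schema(uri: str) -> dict:
    """Load a JSON schema from a local file path or a remote http(s) URI."""
    if urlparse(uri).scheme in ("http", "https"):
        response = requests.get(uri, timeout=30)
        response.raise_for_status()
        return response.json()
    with open(uri) as f:
        return json.load(f)
```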
I was unsure how to deal with references within a schema if it was not local.
It's possible with the jsonschema library that we're using to have it resolve remote references with `requests`. See the example here (which uses `httpx`, not `requests`, but the idea is the same). I agree that this can be left for another PR though.
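A minimal sketch of that approach, assuming a recent `jsonschema` (≥ 4.18) with its companion `referencing` library; the function names here are illustrative, and `schema`/`instance` are assumed to be loaded elsewhere:

```python
import requests
from jsonschema import Draft202012Validator
from referencing import Registry, Resource


def retrieve_via_requests(uri: str) -> Resource:
    # Fetch a remote "$ref" target over HTTP; the registry caches resolved resources.
    response = requests.get(uri, timeout=30)
    response.raise_for_status()
    return Resource.from_contents(response.json())


def validate_with_remote_refs(instance: dict, schema: dict) -> None:
    registry = Registry(retrieve=retrieve_via_requests)
    Draft202012Validator(schema, registry=registry).validate(instance)
```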
STACpopulator/extensions/base.py
```python
# List of properties not meant to be validated by json schema.
_schema_exclude: list[str] = PrivateAttr([])
```
Can't the `model_config` be used for that?

```python
class Model(DataModel):
    model_config = ConfigDict(
        populate_by_name=True,
        extra="ignore",
        fields={"field-to-exclude": {"exclude": True}},
    )
```
Otherwise, reuse the same `PrivateAttr` approach, and filter by annotation/field-type?
I don't think so, because those were fields I wanted to exclude from the schema validation, but not from the model dump. I was thinking of a case where the schema strictly prohibits extra attributes, but I realize this might be a very narrow edge case.
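A rough sketch of that intent (`model` and `schema` are assumed to exist here): validate only the schema-covered fields, but keep everything in the serialized item.

```python
import jsonschema

# Drop the fields listed in _schema_exclude before running the jsonschema validation.
to_validate = model.model_dump(mode="json", by_alias=True, exclude=set(model._schema_exclude))
jsonschema.validate(to_validate, schema)

# The full dump, including the excluded fields, still goes into the STAC item properties.
item_properties = model.model_dump(mode="json", by_alias=True)
```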
```python
def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
    dm = self.data_model.from_data(item_data)
    return dm.stac_item()
```
Looks like this could be directly in `STACpopulatorBase`, since it only refers to `data_model`, which is overridden by the class. Especially if extensions are generalized, this might become redundant across implementations.

However, I'm noticing here that we are still limited by a single extension. If I want to define a dataset that uses `datacube` and `Cordex6DataModel` properties, I have to create yet another populator and define `create_stac_item` with my own custom set of operations.

What we might need instead is a list of helper extensions that apply onto the given data. The pattern is very consistent.

For example, `CMIP6populator` and `CORDEX_STAC_Populator` could have:
```python
class CMIP6populator(STACpopulatorBase):
    item_helpers = [CMIP6Helper, DatacubeHelper, THREDDSHelper]

class CORDEX_STAC_Populator(STACpopulatorBase):
    item_helpers = [Cordex6Helper]
```
And then, we would have:
```python
class STACpopulatorBase:
    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        item = pystac.Item(...)
        for helper_cls in self.item_helpers:
            helper = helper_cls(item_data)
            item = helper.apply(item)
        return item
```
Where each helper has something along the lines of:
```python
def apply(self, item: pystac.Item) -> pystac.Item:
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
    dc_ext.apply(dimensions=self.dimensions, variables=self.variables)
    return dc_ext.item

# or

def apply(self, item: pystac.Item) -> pystac.Item:
    valid_data = Cordex6DataModel(self.item_data)
    valid_json = json.loads(valid_data.model_dump_json(by_alias=True))
    item.properties.update(valid_json)
    return item
```
Using this "helper" approach, you wouldn't need to define all the boiler-plate code for a typical "stac extension classes". What apply()
does is up to the helper.
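For what it's worth, the shared surface of such helpers could be as small as a single method (a sketch; the `Helper` name here is only illustrative):

```python
from typing import Protocol

import pystac


class Helper(Protocol):
    """Anything that can add its extension's fields to a STAC item."""

    def apply(self, item: pystac.Item) -> pystac.Item:
        ...
```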
I like this idea, will look into it and come back with questions.
Note that `THREDDSCatalogDataModel` automatically applies the `datacube` and `thredds` extensions.
One issue I see with this is that the extension helpers have different `__init__` requirements. So either the helpers know how to parse the input data, or the object instantiating them provides that logic.
We could have an in-between solution.
The `item_helpers` list could define instances rather than type references:

```python
item_helpers = [
    HelperWithoutArg(),
    THREDDSHelper(["<url>"]),
]
```

Anything that can be supplied at init would be created right away, and the STAC `item` objects would be obtained during the `apply(item)` call.
I don't think there are any cases where the helpers would be missing references that would limit this approach, but that remains to be investigated...
Not sure I follow. You need the data to create instances of the helpers.
What we could do is something like this:
```python
@classmethod
def from_data(cls, data):
    """Instantiate class from data provided by THREDDS Loader."""
    # This is where we match the Loader's output to the STAC item and extensions inputs.
    # If we had multiple loaders, that's probably the only thing that would be different between them.
    return cls(
        data=data,
        start_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_start"],
        end_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_end"],
        geometry=ncattrs_to_geometry(data),
        bbox=ncattrs_to_bbox(data),
        properties=data["attributes"],
    )

@model_validator(mode="before")
@classmethod
def datacube_helper(cls, data):
    """Instantiate the DataCubeHelper."""
    data["datacube"] = DataCubeHelper(data["data"])
    return data

@model_validator(mode="before")
@classmethod
def thredds_helper(cls, data):
    """Instantiate the THREDDSHelper."""
    data["thredds"] = THREDDSHelper(data["data"]["access_urls"])
    return data
```
```python
data_model = Cordex6DataModel
item_geometry_model = None  # Unnecessary, but kept for consistency
```
This is defined for the `CMIP6populator`:

```python
class CMIP6populator(STACpopulatorBase):
    item_properties_model = CMIP6Properties
    item_geometry_model = GeoJSONPolygon
```

And `data_model = Cordex6DataModel` basically offers:

```
Cordex6DataModel.properties == CordexCmip6            # -> just like CMIP6Properties
Cordex6DataModel -> THREDDSCatalogDataModel.geometry  # -> just like item_geometry_model
```

I'm wondering if there's any duplication of the intended use of these properties?
Yes, because I didn't want to break the CMIP extension and implementation just yet. My idea was to try to generalize the CORDEX example, get a sense of where this is going, and once we're happy, then bring the changes to CMIP6.
STACpopulator/extensions/cordex6.py
```python
# This is generated using datamodel-codegen + manual edits
class CordexCmip6(DataModel):
```
Since the model is generated from the schema, why is the `@model_validator` needed to load and validate the JSON schema? I'm not seeing the subtlety from static code analysis.
It's a slimmed-down version of the schema without the actual CV validation. The schema includes enums with the CVs, while the pydantic DataModel does not.

This is a question I struggled with. I felt it didn't make a lot of sense to duplicate the jsonschema validation in pydantic. On the other hand, relying only on the schema and not even seeing the attributes in the code felt obscure and not admin-friendly. So I thought it would be useful to have a pydantic `DataModel` layer where you can add attributes to the data model, and exclude some that are in the schema but that you don't want in the STAC item.
I wonder if that is an issue with datamodel-codegen, or an option to provide? Normally, the enums should be possible using the `Literal` type with pydantic.

I think it makes sense to have the `DataModel` auto-generated from the schema to provide the attributes. It's easier to manipulate for users familiar with Python but not so much with JSON Schema.
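For illustration, a CV could be expressed directly in the generated model as a `Literal` annotation (the field name and values below are placeholders, not the actual CORDEX CV):

```python
from typing import Literal

from pydantic import BaseModel


class Example(BaseModel):
    # Placeholder values standing in for the controlled vocabulary from the schema.
    frequency: Literal["1hr", "day", "mon"]


Example(frequency="day")     # passes validation
Example(frequency="weekly")  # raises pydantic.ValidationError
```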
Yes, it's definitely possible. I just didn't include the Literals in the python code.
STACpopulator/extensions/base.py
```python
start_datetime: datetime
end_datetime: datetime

extensions: list = []
```
Is there a reason these can't be a list of the extensions themselves? Why have this as a list of strings referring to class attributes?
Well, I'm not sure then how we would pass the relevant attributes to each helper. I'll try to see if I can do something about this.
I think it is better to let the helpers add them automatically, as they might change over time (e.g.: new extension version with modified attributes).
Sorry, I'll try to explain a bit better (or maybe I'm not understanding the issue).
Right now we do something like this:
```python
class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            getattr(self, ext).apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    properties: ExtensionHelper
    datacube: DataCubeHelper
    thredds: THREDDSHelper
    ...
    extensions: list = ["properties", "datacube", "thredds"]
```
Why can't we do this?
```python
class BaseSTAC:
    ...
    def stac_item(self) -> "pystac.Item":
        ...
        for ext in self.extensions:
            ext.apply(item)

class THREDDSCatalogDataModel(BaseSTAC):
    ...
    extensions: list = [ExtensionHelper, DataCubeHelper, THREDDSHelper]
```
Because of data ingestion.
For example, the THREDDS extension needs to be passed `THREDDSHelper(data['data']["access_urls"])`. I currently manage this with a `model_validator`:
```python
@model_validator(mode="before")
@classmethod
def thredds_helper(cls, data):
    """Instantiate the THREDDSHelper."""
    data["thredds"] = THREDDSHelper(data['data']["access_urls"])
    return data
```
I'm not sure how we'd do that with your proposal without adding some obscure magic.
What I've now done is automatically detect extensions from the annotation (if it's a Helper subclass). Hope that's ok for now.
I was actually misinterpreting the question, thinking of `extensions` as the `STAC.Item.extensions` (i.e., the URIs of the applied extensions). I think we should rename the attribute, because that is very confusing. It should be `helpers`, to highlight the use of helpers that have extended capabilities for applying the STAC extensions (and sometimes non-extension attributes).

Ideally, we should have something like:

```python
helpers: list[Type[Helper]] = [ExtensionHelper, DataCubeHelper, THREDDSHelper]
```

Because only the classes of the helpers are used (not instances), they should be able to receive the item data dynamically for the `apply()` call.
If the `data` source is needed, the `Helper` base class could have it as a required argument for `apply()` or in its `__init__()`, whichever makes more sense.
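A sketch of how that class-based variant could be driven (names and signatures are assumptions for illustration, not the PR's code):

```python
from typing import Any

import pystac


class BaseSTAC:
    # Helper classes, not instances; each one is instantiated from the raw item data.
    helpers: list[type] = []

    def apply_helpers(self, item: pystac.Item, item_data: dict[str, Any]) -> pystac.Item:
        for helper_cls in self.helpers:
            helper = helper_cls(item_data)  # data supplied at __init__()
            item = helper.apply(item)       # each helper applies its own extension
        return item
```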
Renamed `_extensions` to `_helpers`.

I don't see how I could implement your proposal without hard-coding the `data` ingestion logic into the helpers themselves, which would couple them tightly with the `Loader`, which I thought we should avoid.
Minor nitpick.
I think the `BaseSTAC` class should define something like:

```python
class BaseSTAC(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def helpers(cls) -> list[Type[Helper]]:
        raise NotImplementedError
```

This way, any derived class (and any smart IDE) will flag right away that `helpers` must be overridden.
I think `helpers` should be used instead of `_helpers` because it is part of the "public" interface of that class, for anyone who derives a new implementation from it.
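Continuing that sketch, a derived class would then override it explicitly, e.g. (names reused from the discussion above, purely for illustration):

```python
class THREDDSCatalogDataModel(BaseSTAC):
    @classmethod
    def helpers(cls) -> list[Type[Helper]]:
        return [ExtensionHelper, DataCubeHelper, THREDDSHelper]
```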
STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py
I've added a mechanism to embed the schema into a STAC item schema, which I save in /tmp for now. This was just for testing; we can disable that feature for now. I'll be travelling starting tomorrow, so I won't be able to work on this much.
I added a CMIP6-CORDEX extension and implementation, trying to create base classes that would simplify the addition of other extensions. This simplifies the implementation part a bit, but you'll see that there is still some boilerplate code we could do without on the implementation side.

The main change is that I created a generic `THREDDSCatalogDataModel`. Extensions then only have to define the data model for their properties and how to construct a unique ID. If a jsonschema is provided, it will be used to validate the incoming data. I've disabled the validation done at the STAC extension level for now (see below).

I've struggled a bit with the role of the jsonschema here. In climate science, this is not a very popular tool. Even if scientific schemas appeared, we'd have to embed them into a STAC-specific schema. You'll see that I've created a schema directory with the CORDEX schema for global attributes, but this is not a STAC schema per se. A STAC schema would embed those attributes into a `properties` object, accompanied by a `type`. I didn't know how to embed one schema into another, which is why I disabled the extension schema validation.

To try it:

If you think this kind of abstraction is useful, I could port those changes to the CMIP6 case in another PR.