
draft migration path doc from ML Model to MLM Extension #19

Open · wants to merge 1 commit into `deprecate`
Conversation

@rbavery (Collaborator) commented Aug 7, 2024

This adds a migration doc that describes similarities and differences in capabilities between the ML Model Extension and the Machine Learning Model (MLM) Extension. It also highlights how some field names have changed, enums have been expanded, and some fields have been removed. It notes that the ML Model Extension is not maintained and its primary public-facing implementation is deprecated.

If this is approved and merged I think we should then approve and merge #16. Once the repo ownership of https://github.com/crim-ca/mlm-extension/tree/main is changed to STAC Extensions, then I can update the links in this migration document to point to the new repo url.

I think I will need assistance from the STAC Extension maintainers to make a request to CRIM to initiate a move of the entire repo since we want to preserve stars, github history, issues, etc. cc @fmigneault

@gadomski (Member) left a comment

This is not a strongly held opinion, but would we make lives easier on ourselves if we migrated the mlm-extension repo to the stac-extensions organization first, then updated the text here? To me it feels funny to merge this PR knowing that we'll have to go through and update a bunch of URLs shortly thereafter.

Another nice-to-have would be a migration script that we could point folks to ... maybe as part of the stac-model repo?

Other than those two points, LGTM.

@rbavery (Collaborator, Author) commented Aug 14, 2024

Thanks for the comments @gadomski

Unfortunately I don't have the bandwidth to provide a migration script right now. There might be some nontrivial decisions to handle in the migration. But I'm happy to review and work with others on PRs. stac-model is maintained alongside the spec here https://github.com/crim-ca/mlm-extension/blob/main/README_STAC_MODEL.md

I'm happy to merge this PR after the migration. If STAC maintainers initiate that @fmigneault can you accept the invite to do so for the https://github.com/crim-ca org? I'm not sure how full repo migrations work and who initiates.

@gadomski (Member)

> I'm not sure how full repo migrations work and who initiates.

I've done it a few times (for https://github.com/stactools-packages/) but it's been a minute. I believe @fmigneault can go to https://github.com/crim-ca/mlm-extension/transfer and then choose stac-extensions as the target organization — I'd try that first, at least.


## Shared Goals

Both the ML Model Extension and the Machine Learning Model (MLM) Extension aim to provide a standard way to catalog machine learning (ML) models that work with Earth observation (EO) data. Their main goals are:
Collaborator
I suggest adding something like "but not limited to" regarding the mention of working with EO data. This is one of the main goals of generalization provided by MLM, although using STAC often implies EO data is used somewhere.

Both the ML Model Extension and the Machine Learning Model (MLM) Extension aim to provide a standard way to catalog machine learning (ML) models that work with Earth observation (EO) data. Their main goals are:

1. **Search and Discovery**: Helping users find and use ML models.
2. **Describing Inference Requirements**: Making it easier to run these models by describing input requirements and outputs.
Collaborator

Consider replacing "inference" with "execution" or "usage", or adding "training" explicitly.
Technically, training requirements can be defined as well as inference ones, using `mlm:training-runtime`.


- The MLM Extension covers more details at both the Item and Asset levels, making it easier to describe and use model metadata.
- The MLM Extension covers Runtime requirements within the [Container Asset](https://github.com/crim-ca/mlm-extension?tab=readme-ov-file#container-asset), while the ML Model Extension records [similar information](./README.md#inferencetraining-runtimes) in the `ml-model:inference-runtime` or `ml-model:training-runtime` asset roles.
- The MLM extension has a corresponding Python library, [`stac-model`](https://pypi.org/project/stac-model/) which can be used to create and validate MLM metadata. An example of the library in action is [here](https://github.com/crim-ca/mlm-extension/blob/main/stac_model/examples.py#L14). The ML Model extension does not support this and requires the JSON to be written manually by interpreting the JSON Schema or existing examples.
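To make the metadata shape concrete, a minimal sketch of MLM-style Item properties as plain JSON-style dicts. The field values and the `check_required` helper are illustrative only; they are not part of `stac-model` or `pystac`.

```python
# Illustrative sketch: MLM properties as a plain dict, with a tiny helper
# to check for required fields. Values are placeholders, not a real model.
import json


def check_required(properties: dict, required: list) -> list:
    """Return the required MLM fields that are missing from the properties."""
    return [field for field in required if field not in properties]


mlm_properties = {
    "mlm:name": "example-classifier",       # required model name
    "mlm:architecture": "ResNet-50",        # Papers With Code identifier
    "mlm:framework": "pytorch",
    "mlm:framework_version": "2.3.0",
    "mlm:tasks": ["classification"],
    "mlm:pretrained": True,
}

missing = check_required(mlm_properties, ["mlm:name", "mlm:architecture"])
print(json.dumps(mlm_properties, indent=2))
print("missing required fields:", missing)  # → missing required fields: []
```

In practice `stac-model` builds and validates this structure for you; the point here is only the flat `mlm:`-prefixed property layout.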
Collaborator
Adding a mention of its direct interoperability with common utilities (ie: pydantic and pystac) would be a great justification to further motivate its use.

| `ml-model:type` | N/A | No direct equivalent, it is implied by the `mlm` prefix in MLM fields and directly specified by the schema identifier. |
| `ml-model:learning_approach` | `mlm:tasks` | Removed in favor of specifying specific `mlm:tasks`. |
| `ml-model:prediction_type` | `mlm:tasks` | `mlm:tasks` provides a more comprehensive enum of prediction types. |
| `ml-model:architecture` | `mlm:architecture` | The MLM provides specific guidance on using Papers With Code - Computer Vision identifiers for model architectures. No guidance is provided in ML Model. |
Collaborator
Add `*` around "Papers With Code - Computer Vision".

| `ml-model:learning_approach` | `mlm:tasks` | Removed in favor of specifying specific `mlm:tasks`. |
| `ml-model:prediction_type` | `mlm:tasks` | `mlm:tasks` provides a more comprehensive enum of prediction types. |
| `ml-model:architecture` | `mlm:architecture` | The MLM provides specific guidance on using Papers With Code - Computer Vision identifiers for model architectures. No guidance is provided in ML Model. |
| `ml-model:training-processor-type` | `mlm:accelerator` | MLM defines more choices for accelerators in an enum and specifies that this is the accelerator for inference (the focus of the MLM extension is inference). ML Model only accepts `cpu` or `gpu` but this isn't sufficient today where we have models optimized for different CPU architectures, CUDA GPUs, Intel GPUs, AMD GPUs, Mac Silicon, and TPUs. |
Collaborator
I suggest removing the focus on inference part.

Many examples use pretrained models in inference for simplicity, but there shouldn't be anything limiting the definition for training. If anything is missing, I would rather motivate users to submit issues for improvements of future MLM revisions to better handle training (if anything is actually missing).
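Since a migration script was floated earlier in this thread, a minimal sketch of what one could look like for the field mapping above. Everything here is illustrative: `FIELD_MAP`, `DROPPED`, and `migrate_properties` are hypothetical helpers based on this document's table, not an official tool, and fields without a direct MLM equivalent are simply dropped for manual review.

```python
# Hypothetical field-name migration from ml-model:* to mlm:* properties,
# following the mapping table in this document. Not an official script.
FIELD_MAP = {
    "ml-model:architecture": "mlm:architecture",
    "ml-model:prediction_type": "mlm:tasks",
    "ml-model:training-processor-type": "mlm:accelerator",
}

# Fields with no direct MLM equivalent; flagged for manual review.
DROPPED = {"ml-model:type", "ml-model:learning_approach"}


def migrate_properties(props: dict) -> dict:
    """Rename ml-model:* keys to their mlm:* equivalents."""
    migrated = {}
    for key, value in props.items():
        if key in DROPPED:
            continue  # no direct MLM equivalent
        new_key = FIELD_MAP.get(key, key)
        # ml-model:prediction_type was a single value; mlm:tasks is a list
        if new_key == "mlm:tasks" and not isinstance(value, list):
            value = [value]
        migrated[new_key] = value
    return migrated


old = {"ml-model:architecture": "unet", "ml-model:prediction_type": "segmentation"}
print(migrate_properties(old))
# → {'mlm:architecture': 'unet', 'mlm:tasks': ['segmentation']}
```

A real script would also need to handle enum translation (e.g. `gpu` to a specific accelerator value), which requires human judgment.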

Comment on lines +59 to +72
- **`mlm:name`**: A required name for the model.
- **`mlm:framework`**: The framework used to train the model.
- **`mlm:framework_version`**: The version of the framework. Useful in case a container runtime asset is not specified or if the consumer of the MLM wants to run the model outside of a container.
- **`mlm:memory_size`**: The in-memory size of the model.
- **`mlm:total_parameters`**: Total number of model parameters.
- **`mlm:pretrained`**: Indicates if the model is derived from a pretrained model.
- **`mlm:pretrained_source`**: Source of the pretrained model by name or URL if it is less well known.
- **`mlm:batch_size_suggestion`**: Suggested batch size for the given accelerator.
- **`mlm:accelerator_constrained`**: Indicates if the model requires a specific accelerator.
- **`mlm:accelerator_summary`**: Description of the accelerator. This might contain details on the exact accelerator version (TPUv4 vs TPUv5) and their configuration.
- **`mlm:accelerator_count`**: Minimum number of accelerator instances required.
- **`mlm:input`**: Describes the model's input shape, dtype, and normalization and resize transformations.
- **`mlm:output`**: Describes the model's output shape and dtype.
- **`mlm:hyperparameters`**: Additional hyperparameters relevant to the model.
Collaborator
I suggest using a table just to keep sections consistent and more readable.

Suggested change
- **`mlm:name`**: A required name for the model.
- **`mlm:framework`**: The framework used to train the model.
- **`mlm:framework_version`**: The version of the framework. Useful in case a container runtime asset is not specified or if the consumer of the MLM wants to run the model outside of a container.
- **`mlm:memory_size`**: The in-memory size of the model.
- **`mlm:total_parameters`**: Total number of model parameters.
- **`mlm:pretrained`**: Indicates if the model is derived from a pretrained model.
- **`mlm:pretrained_source`**: Source of the pretrained model by name or URL if it is less well known.
- **`mlm:batch_size_suggestion`**: Suggested batch size for the given accelerator.
- **`mlm:accelerator_constrained`**: Indicates if the model requires a specific accelerator.
- **`mlm:accelerator_summary`**: Description of the accelerator. This might contain details on the exact accelerator version (TPUv4 vs TPUv5) and their configuration.
- **`mlm:accelerator_count`**: Minimum number of accelerator instances required.
- **`mlm:input`**: Describes the model's input shape, dtype, and normalization and resize transformations.
- **`mlm:output`**: Describes the model's output shape and dtype.
- **`mlm:hyperparameters`**: Additional hyperparameters relevant to the model.
| Field Name | Description |
|----------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| **`mlm:name`** | A required name for the model. |
| **`mlm:framework`** | The framework used to train the model. |
| **`mlm:framework_version`** | The version of the framework. Useful in case a container runtime asset is not specified or if the consumer of the MLM wants to run the model outside of a container. |
| **`mlm:memory_size`** | The in-memory size of the model. |
| **`mlm:total_parameters`** | Total number of model parameters. |
| **`mlm:pretrained`** | Indicates if the model is derived from a pretrained model. |
| **`mlm:pretrained_source`** | Source of the pretrained model by name or URL if it is less well known. |
| **`mlm:batch_size_suggestion`** | Suggested batch size for the given accelerator. |
| **`mlm:accelerator_constrained`**| Indicates if the model requires a specific accelerator. |
| **`mlm:accelerator_summary`** | Description of the accelerator. This might contain details on the exact accelerator version (TPUv4 vs TPUv5) and their configuration. |
| **`mlm:accelerator_count`** | Minimum number of accelerator instances required. |
| **`mlm:input`** | Describes the model's input shape, dtype, and normalization and resize transformations. |
| **`mlm:output`** | Describes the model's output shape and dtype. |
| **`mlm:hyperparameters`** | Additional hyperparameters relevant to the model. |

Comment on lines +67 to +69
- **`mlm:accelerator_constrained`**: Indicates if the model requires a specific accelerator.
- **`mlm:accelerator_summary`**: Description of the accelerator. This might contain details on the exact accelerator version (TPUv4 vs TPUv5) and their configuration.
- **`mlm:accelerator_count`**: Minimum number of accelerator instances required.
Collaborator

Missing `mlm:accelerator`?

| `ml-model:inference-runtime` | `mlm:inference-runtime` | Direct conversion; same role and function. |
| `ml-model:training-runtime` | `mlm:training-runtime` | Direct conversion; same role and function. |
| `ml-model:checkpoint` | `mlm:checkpoint` | Direct conversion; same role and function. |
| N/A | `mlm:model` | New required role for model assets in MLM. This represents the asset that is loaded for inference. |
Collaborator
I think it would be better to describe this as the "source" of the model weights and definition (typically an ONNX, a checkpoint or similar). This would allow the flexibility for re-training/fine-tuning rather than only inference. However, the note about loading this asset can be left as well.

| N/A | `mlm:inference` | Recommended for inference pipelines. |
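The asset-role mapping above can be illustrated with a minimal model asset. The href, media type, and `find_model_asset` helper below are hypothetical placeholders, not part of any STAC library.

```python
# Hypothetical model asset carrying the new required `mlm:model` role,
# which marks the asset holding the model weights/definition.
model_asset = {
    "href": "https://example.com/models/classifier.onnx",  # placeholder URL
    "type": "application/octet-stream",
    "title": "ONNX model weights and definition",
    "roles": ["mlm:model"],
}


def find_model_asset(assets: dict):
    """Return the key of the first asset with the mlm:model role, if any."""
    for key, asset in assets.items():
        if "mlm:model" in asset.get("roles", []):
            return key
    return None


print(find_model_asset({"model": model_asset}))  # → model
```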


The MLM is focused on search, discovery, and reproducibility of inference. Nevertheless, the MLM provides a recommended asset role `mlm:training-runtime` and asset `mlm:training`, which can point to a container URL that has the training runtime requirements. The ML Model extension specifies a field for `ml-model:training-runtime`, but like `mlm:training` it only contains the default STAC Asset fields and the additional fields specified by the Container Asset. Training requirements typically differ from inference requirements, so we recommend that fields and assets for reproducing model training or fine-tuning be contained in a separate STAC extension.
Collaborator
Same as previous. Adjust "inference" for "runtime" or similar.
The intent would better be described as reproducibility of experiments, whether those are for training, fine-tuning or inference.

I agree with the start of the last sentence about differing requirements (hence the separate runtimes), but I strongly disagree about using another extension. This will only cause a lot of metadata duplication (already seeing a similar trend happening with OGC's TrainingDML-AI sub "model-extension"...). There are already examples of using MLM for both training and inference simultaneously. I would rather improve upon MLM to add any missing gaps to describe the harder "training" scenario.

Collaborator (Author)

Good feedback! With regard to training, Francis and I talked and we agree mlm:hyperparameters should be specified as a field that is for including training hyperparameters. Inference scalars or other "variables" that are not tensors/arrays (like temperature setting, text input) can be represented in the InputStructure. In the future we might want to consider inference parameters or inference variables as a separate field.

I'll tweak language throughout to remove the focus on inference at expense of training. Will note that required fields are marked required either for inference or search and discovery.

@fmigneault (Collaborator)

@rbavery @gadomski
I believe migration would be the right way to handle the transfer to preserve stars, issues, etc. I think GitHub even defines some redirects for people using the old links (but don't quote me on that 😅). This would also allow CRIM to re-fork the stac-extensions repo to indicate appropriate redirects and maintain a mirror of the old schemas (https://crim-ca.github.io/mlm-extension/), since those are employed by existing STAC deployments.

However, once this is done (and that I get the approvals from CRIM higher ups), I will need from an admin of stac-extensions to provide the necessary permissions to adjust the settings. The migration page displays a warning that all configured protection rules will be removed during the transfer. I would like to re-enable those, and make sure @rbavery, @languith and @fmigneault (at least) have the necessary access to manage the repository.

@gadomski (Member)

> I will need from an admin of stac-extensions to provide the necessary permissions to adjust the settings.

No problem at all ... when migrated, I'll be able to add you as a maintainer of the repo so you can PR (or just YOLO change) anything as needed. You will also be able to add others as needed.
