Come up with strategy for upgrading Is Part Of field values #43

Open
karenmajewicz opened this issue Aug 4, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@karenmajewicz
Collaborator

One of the main incompatibilities between Metadata 1.0 and Aardvark is the Is Part Of field. In 1.0, this was a string value. In Aardvark, it is an ID that the GeoBlacklight application reads to link records together.

To upgrade, users would need to create new collection records for each unique value and replace the strings with the new IDs.
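
The upgrade step described above could be sketched roughly as follows. This is only an illustration, not a tested migration: the `slugify` helper, the `inst` ID prefix, and the stub collection record are assumptions made for the example, and real collection records would need much fuller metadata.

```ruby
# Hypothetical helper: derive an ID from a collection name.
# The "inst" prefix is an assumption for this example.
def slugify(name, prefix: 'inst')
  "#{prefix}:#{name.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-|-\z/, '')}"
end

# Build one stub collection record per unique Is Part Of string and
# rewrite each child record to point at the new ID.
def upgrade_is_part_of(records)
  names = records.flat_map { |r| Array(r['dct_isPartOf_sm']) }.uniq
  id_for = names.to_h { |name| [name, slugify(name)] }

  collections = id_for.map do |name, id|
    { 'id' => id, 'dct_title_s' => name, 'gbl_resourceClass_sm' => ['Collections'] }
  end

  children = records.map do |r|
    next r unless r.key?('dct_isPartOf_sm')

    r.merge('dct_isPartOf_sm' => Array(r['dct_isPartOf_sm']).map { |n| id_for.fetch(n) })
  end

  [collections, children]
end

# Illustrative data only.
records = [
  { 'id' => 'a', 'dct_isPartOf_sm' => ['Land Cover Maps'] },
  { 'id' => 'b', 'dct_isPartOf_sm' => ['Land Cover Maps'] }
]
collections, children = upgrade_is_part_of(records)
```

Both child records end up pointing at the same generated ID, so renaming the collection later only touches the collection record.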

Pros:

  • collections can be fully described with their own metadata record
  • provides more stability by using nonliteral URIs instead of strings (if a name changed, only the collection record would need to be updated)

Cons:

  • The way GBL is set up, a user would have to make more "clicks" to get to other items in the same collection.
  • Since it is not a direct crosswalk, there is labor involved in creating new collection records and updating existing ones.

@karenmajewicz
Collaborator Author

Is Part Of

@kgjenkins
Collaborator

kgjenkins commented Aug 19, 2022

The metadata converter at https://kgjenkins.github.io/gbl2aardvark/ will now automatically create new "Collections" records, using information from all the existing child records. Some of the fields (subject, keyword, etc.) aggregate all the unique values found in the child records, and the bbox (dcat_bbox, locn_geometry) is automatically expanded to include all the child record bboxes.

I've documented the process a bit in the README

I think this could be a viable approach, although one would certainly want to review the new collection records -- the descriptions will certainly need editing to better reflect the whole collection. And you may not really want every placename from all the child records to be listed in the collection record.
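
For readers who want to see the idea in code, here is a rough Ruby sketch of the aggregation described above. It is not the converter's actual code; it assumes `dcat_bbox` is stored in the `ENVELOPE(west,east,north,south)` form, ignores boxes that cross the antimeridian, and the helper names are made up for the example.

```ruby
# Parse an "ENVELOPE(west,east,north,south)" string into four floats.
def parse_envelope(str)
  m = str.match(/ENVELOPE\(\s*([^,]+),\s*([^,]+),\s*([^,]+),\s*([^)]+)\)/)
  m.captures.map { |v| Float(v.strip) }
end

# Smallest envelope that covers every child envelope.
def union_envelope(envelopes)
  boxes = envelopes.map { |e| parse_envelope(e) }
  west  = boxes.map { |b| b[0] }.min
  east  = boxes.map { |b| b[1] }.max
  north = boxes.map { |b| b[2] }.max
  south = boxes.map { |b| b[3] }.min
  "ENVELOPE(#{west},#{east},#{north},#{south})"
end

# Unique values of a multivalued field across all child records.
def aggregate(children, field)
  children.flat_map { |c| Array(c[field]) }.uniq
end

# Illustrative child records.
children = [
  { 'dct_subject_sm' => ['Land Use'], 'dcat_bbox' => 'ENVELOPE(-80,-75,43,40)' },
  { 'dct_subject_sm' => ['Land Use', 'Tree canopy'], 'dcat_bbox' => 'ENVELOPE(-78,-72,45,41)' }
]
collection = {
  'dct_subject_sm' => aggregate(children, 'dct_subject_sm'),
  'dcat_bbox' => union_envelope(children.map { |c| c['dcat_bbox'] })
}
```

Because every unique value is kept, the resulting collection record can be noisy, which is why a manual review pass is still worthwhile.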

Date values may also require clean-up -- the script keeps every unique value (which works well for single years in gbl_indexYear_im) but dct_temporal_sm may have things like this:

   "dct_temporal_sm": [
      "1998-2013",
      "1998-2014",
      "1998-2015",
      "1998-2016",
      "1999-2013",
      "1999-2014", etc.

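One possible clean-up for the temporal values above, sketched in Ruby, is to collapse the year ranges into a single overall span. Whether a single span is the right description for a collection is an editorial decision; the `collapse_temporal` helper is hypothetical and only handles `YYYY` and `YYYY-YYYY` strings, leaving anything else untouched for manual review.

```ruby
# Collapse year and year-range strings into one overall span.
# Values that are not "YYYY" or "YYYY-YYYY" are passed through unchanged.
def collapse_temporal(values)
  parsed, other = values.partition { |v| v.match?(/\A\d{4}(-\d{4})?\z/) }
  return other if parsed.empty?

  years = parsed.flat_map { |v| v.split('-').map(&:to_i) }
  span = years.min == years.max ? years.min.to_s : "#{years.min}-#{years.max}"
  [span] + other
end

collapsed = collapse_temporal(%w[1998-2013 1998-2014 1998-2015 1998-2016 1999-2013 1999-2014])
# collapsed is ["1998-2016"]
```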
The collection records may also reveal spelling or capitalization inconsistencies in the child records. For example:

   "dct_subject_sm": [
      "Land Cover",
      "Land Use",
      "Land cover",
      "Land use",
      "Tree canopy", etc.

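A quick way to surface inconsistencies like these is to group values that differ only by letter case. The Ruby sketch below uses a hypothetical `case_variants` helper; it only reports the variants and leaves the choice of preferred spelling to a person.

```ruby
# Group values that differ only by letter case and return the groups
# that contain more than one spelling.
def case_variants(values)
  values.uniq
        .group_by(&:downcase)
        .values
        .select { |group| group.size > 1 }
end

variants = case_variants(['Land Cover', 'Land Use', 'Land cover', 'Land use', 'Tree canopy'])
# variants is [["Land Cover", "Land cover"], ["Land Use", "Land use"]]
```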
Of course, it could be nice to retain a "simple" collection field that just contains a string (similar to subject or keyword), but also have the option of the new relations-based dct_isPartOf_sm field.

@karenmajewicz
Collaborator Author

In this case, dct_isPartOf_sm probably maps better to pcdm_memberOf_sm.

From the OGM documentation:
Is Part Of: To link items that are a subset of another item (e.g. a page in a book)
Member Of: To link items that are part of a collection
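
If collection links were moved to `pcdm_memberOf_sm` as suggested, the per-record change might look something like this Ruby sketch. The `move_to_member_of` helper is hypothetical, and it assumes every existing `dct_isPartOf_sm` value refers to a collection rather than to a true parent item.

```ruby
# Move collection links from dct_isPartOf_sm to pcdm_memberOf_sm,
# keeping any memberOf values the record already has.
def move_to_member_of(record)
  return record unless record.key?('dct_isPartOf_sm')

  updated = record.reject { |key, _| key == 'dct_isPartOf_sm' }
  updated['pcdm_memberOf_sm'] =
    (Array(record['pcdm_memberOf_sm']) + Array(record['dct_isPartOf_sm'])).uniq
  updated
end

# Illustrative record only.
moved = move_to_member_of('id' => 'a', 'dct_isPartOf_sm' => ['inst:my-collection'])
```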

@kgjenkins kgjenkins added the enhancement New feature or request label Feb 27, 2023
@rmseifried rmseifried moved this to Todo in OGM issues Feb 27, 2023
@thatbudakguy
Member

thatbudakguy commented Mar 3, 2023

Another possible strategy that is supported by OpenGeoMetadata/GeoCombine#143 is to assume that it's possible to get a list of all collection records (in v1 format) before attempting the conversion from v1 to Aardvark. In Earthworks, we apparently use a layer_geom_type_s of "Collection" to indicate collections (which might not be valid in v1, but that's another story). You can export all the Collection records this way by querying Solr.

Once you have a list of collection records and their layer_slug_s, you can export it as any kind of structured data (JSON directly from Solr, CSV, etc.), then parse it and pass it into the converter:

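# Map each collection name (as it appears in v1 dct_isPartOf_sm) to its collection ID
id_map = {
  'My Collection 1' => 'institution:my-collection-1',
  'My Collection 2' => 'institution:my-collection-2'
}

# `record` is a single v1 record parsed into a Hash
GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map: id_map).run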
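
As a rough sketch of the "parse it" step, the map could be built from a Solr JSON response along these lines. This assumes the response has the usual `response.docs` shape and that each collection doc carries `dc_title_s` (the name child records use) and `layer_slug_s`; the `build_id_map` helper and the sample data are made up for illustration.

```ruby
require 'json'

# Build a { collection name => ID } map from a Solr JSON response.
# Assumes dc_title_s holds the name used in children's dct_isPartOf_sm
# and layer_slug_s holds the value to substitute for it.
def build_id_map(solr_json)
  docs = JSON.parse(solr_json).dig('response', 'docs') || []
  docs.each_with_object({}) do |doc, map|
    map[doc['dc_title_s']] = doc['layer_slug_s']
  end
end

# Illustrative response, e.g. from a query on layer_geom_type_s:Collection.
sample = <<~JSON
  { "response": { "docs": [
    { "dc_title_s": "My Collection 1", "layer_slug_s": "institution:my-collection-1" },
    { "dc_title_s": "My Collection 2", "layer_slug_s": "institution:my-collection-2" }
  ] } }
JSON

id_map = build_id_map(sample)
```

If two collection docs share a title, the later one wins in this sketch, since a Hash holds one value per key.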

This way, you can convert all records (including collections) at the same time:

  • Non-collection records will look up their collection IDs in the map and replace the collection name in dct_isPartOf_sm with the ID
  • Collection records will be converted to Aardvark just like the non-collection records

An interesting and debatably useful side-effect of this is that it collapses collections with the same name into a single collection. While testing out this strategy, I discovered that several collections in Earthworks are duplicated, probably accidentally. The "2010 China province population census data with GIS maps" collection has this version, with only one member, and this version with several members. While it's possible to have collections with the same name, it doesn't seem desirable from a user standpoint, so using this strategy is an easy way to consolidate duplicate collections at the same time you convert to Aardvark.
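
Since a name-to-ID map can hold only one ID per name, it may be worth listing same-named collections before converting. A minimal Ruby sketch, assuming v1 collection records with `dc_title_s` and `layer_slug_s` fields (the `duplicate_collections` helper and the sample records are hypothetical):

```ruby
# Return collection names that appear on more than one record,
# mapped to the slugs of the records that share the name.
def duplicate_collections(collections)
  collections.group_by { |c| c['dc_title_s'] }
             .select { |_, group| group.size > 1 }
             .transform_values { |group| group.map { |c| c['layer_slug_s'] } }
end

# Illustrative records only.
collections = [
  { 'dc_title_s' => 'Census Maps', 'layer_slug_s' => 'inst-census-1' },
  { 'dc_title_s' => 'Census Maps', 'layer_slug_s' => 'inst-census-2' },
  { 'dc_title_s' => 'Land Cover',  'layer_slug_s' => 'inst-land-cover' }
]
dupes = duplicate_collections(collections)
```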

@karenmajewicz
Collaborator Author

Do some research on a new field for this that would be a plain text value.

Projects
Status: In Progress

3 participants