Elasticsearch schema BODS data nested type #230

Open
27 tasks
tiredpixel opened this issue Dec 6, 2023 · 0 comments

tiredpixel commented Dec 6, 2023

Similar to #225, indexes such as bods_v2_psc_prod100 use the nested field type for publicationDetails. This makes the data harder to explore and complicates queries, since it prevents inner object flattening. I can't really see a reason why nested field types are used in this case; this would need a little investigation.

I only just realised that publicationDetails.publicationDate gets set when statements are republished. This is admittedly in accordance with BODS 0.2. Perhaps I should have spotted this sooner, but I didn't, because publicationDate is buried within publicationDetails as a nested object. Even now that I know it's there, it's still hard to use for data exploration or debugging, because of the field type.
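To illustrate the query complication, here is a minimal sketch of the same date filter written against a nested mapping and against a plain object mapping. The field names come from this issue; the helper names and the date value are hypothetical, and the bodies are only built and printed, not sent to a cluster.

```python
import json


def nested_date_range_query(path: str, field: str, since: str) -> dict:
    """Range filter when `path` is mapped as a nested field.

    The range clause has to be wrapped in a `nested` query that names the path.
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"range": {f"{path}.{field}": {"gte": since}}},
            }
        }
    }


def object_date_range_query(path: str, field: str, since: str) -> dict:
    """Same filter when `path` is a plain object: the dotted field is queried directly."""
    return {"query": {"range": {f"{path}.{field}": {"gte": since}}}}


# "2023-12-01" is an arbitrary example date, not taken from the data.
nested_query = nested_date_range_query("publicationDetails", "publicationDate", "2023-12-01")
object_query = object_date_range_query("publicationDetails", "publicationDate", "2023-12-01")

print(json.dumps(nested_query, indent=2))
print(json.dumps(object_query, indent=2))
```

With an object mapping, the dotted field can be used directly in queries; with a nested mapping, every query touching it needs the extra `nested` wrapper.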

I suggest that the use of all nested field types within BODS indexes in Elasticsearch is evaluated, to see whether each is in fact necessary or desirable. There might well be good reasons for some of them: identifiers comes to mind, for which a nested field type is not only desirable but critical to correct results being returned. But others, particularly those not modelled as arrays of objects (arrays need special treatment in Elasticsearch, since there is no dedicated array field type), would benefit from being re-evaluated.
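To show why identifiers probably does need the nested type, here is a small pure-Python simulation of what flattening does to an array of objects. It does not call Elasticsearch; it only mimics the matching behaviour, and the identifier values are made up for illustration.

```python
def flatten_objects(objects: list[dict]) -> dict[str, list]:
    """Mimic object flattening: each key becomes one multi-valued field,
    so the link between values from the same object is lost."""
    flat: dict[str, list] = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flattened(objects: list[dict], **criteria) -> bool:
    """Match if every criterion is satisfied by ANY object (object-type behaviour)."""
    flat = flatten_objects(objects)
    return all(value in flat.get(key, []) for key, value in criteria.items())


def matches_nested(objects: list[dict], **criteria) -> bool:
    """Match only if a SINGLE object satisfies all criteria (nested-type behaviour)."""
    return any(all(obj.get(k) == v for k, v in criteria.items()) for obj in objects)


# Made-up identifiers for one statement.
identifiers = [
    {"scheme": "GB-COH", "id": "111"},
    {"scheme": "DK-CVR", "id": "222"},
]

# Ask for scheme GB-COH with id 222: no single identifier has both.
print(matches_flattened(identifiers, scheme="GB-COH", id="222"))  # True (false positive)
print(matches_nested(identifiers, scheme="GB-COH", id="222"))     # False
```

A field that holds a single object, such as publicationDetails, has no other objects to mix values with, which is why it is a candidate for a plain object mapping.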

Fields to check

  • addresses
  • annotations
  • identifiers (almost certainly correct, as noted above)
  • incorporatedInJurisdiction
  • interestedParty
  • interestedParty.unspecified
  • interests
  • interests.share
  • names
  • nationalities
  • pepStatusDetails
  • pepStatusDetails.source
  • pepStatusDetails.source.assertedBy
  • placeOfBirth
  • placeOfResidence
  • publicationDetails (likely incorrect, as noted above)
  • publicationDetails.publisher
  • source
  • source.assertedBy
  • subject
  • taxResidencies
  • unspecifiedEntityDetails
  • unspecifiedPersonDetails

Indexes to migrate

  • bods_v2_psc_prod100
  • bods_v2_dk_prod100
  • bods_v2_sk_prod100
  • bods_v2_am_prod100

Index templates

Given the number of affected indexes, which all contain the same mappings, this is likely a good time to consider using Elasticsearch index templates instead. This would enable mappings to be maintained centrally and applied automatically to every newly created index matching the template's pattern. Doing so would also eliminate the need to run multiple 'create indexes' steps within the various transformers.
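As a rough sketch of what such a template could look like, the snippet below builds the body for a composable index template (`PUT _index_template/<name>`). The pattern `bods_v2_*`, the priority, and the two example mappings are assumptions for illustration only; the real mappings would depend on the outcome of the field review above.

```python
import json
from fnmatch import fnmatch

# Index names listed in this issue.
INDEXES = [
    "bods_v2_psc_prod100",
    "bods_v2_dk_prod100",
    "bods_v2_sk_prod100",
    "bods_v2_am_prod100",
]

# Assumed pattern; chosen only because it covers the names above.
INDEX_PATTERN = "bods_v2_*"


def build_index_template(pattern: str) -> dict:
    """Build a composable index template body with illustrative mappings."""
    return {
        "index_patterns": [pattern],
        "priority": 100,  # arbitrary example value
        "template": {
            "mappings": {
                "properties": {
                    # Single object: plain object mapping (no "type": "nested").
                    "publicationDetails": {
                        "properties": {"publicationDate": {"type": "date"}}
                    },
                    # Array of objects: kept as nested so queries match per object.
                    "identifiers": {
                        "type": "nested",
                        "properties": {
                            "scheme": {"type": "keyword"},
                            "id": {"type": "keyword"},
                        },
                    },
                }
            }
        },
    }


template_body = build_index_template(INDEX_PATTERN)

# fnmatch only approximates Elasticsearch's wildcard matching; used here as a sanity check.
covered = [name for name in INDEXES if fnmatch(name, INDEX_PATTERN)]

print(json.dumps(template_body, indent=2))
print(covered)
```

Note that a template only takes effect when an index is created, and an existing field cannot be switched from nested to object in place, so the indexes listed above would still need to be recreated or reindexed.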

References #189, during which this was re-discovered.
