PEP-4: General Discussion #8

Open
clnsmth opened this issue Oct 18, 2024 · 3 comments

@clnsmth
Collaborator

clnsmth commented Oct 18, 2024

During our recent meeting, we discussed PEP-4 and raised several important points that are summarized here for further consideration:

Item 1: Representation of Date and Time Components

  • Issue: Date and time components should not be described as dateTime types in EML.

  • Proposal: Individual components of date and/or time (e.g., year, hour) should be represented using numeric EML AttributeType / measurementScale rather than a dateTime type. This approach allows for the assignment of a unit from the EML standard unit dictionary to describe the date or time component (e.g., nominalYear, nominalHour).

  • Rationale: A year value, for example, is neither a complete date nor a time. Using dateTime for these components goes against the schema definition for that type. Assigning them to a numeric type allows for correct unit specification.

  • Action: The supported list of formats used by the ECC and ezEML congruence checkers needs to be reviewed and updated accordingly.
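As a rough illustration of Item 1, the sketch below flags format strings that consist of a single component and suggests a numeric unit for them. The function name `suggested_unit` and the lookup table are hypothetical; only the two units named above (nominalYear, nominalHour) are included, and other components would need units looked up in the EML standard unit dictionary.

```python
from typing import Optional

# Hypothetical lookup: only the two units named in the proposal are listed.
# Other lone components would need a unit from the EML standard unit
# dictionary, which this sketch does not attempt to enumerate.
COMPONENT_UNITS = {
    "YYYY": "nominalYear",
    "hh": "nominalHour",
}


def suggested_unit(format_string: str) -> Optional[str]:
    """Return a numeric unit if the format string is a lone date/time
    component covered by this sketch, otherwise None."""
    return COMPONENT_UNITS.get(format_string.strip())


if __name__ == "__main__":
    for fmt in ["YYYY", "hh", "YYYY-MM-DD"]:
        print(fmt, "->", suggested_unit(fmt))
```

A review of the supported format list could use a check like this to find entries that are better described as numeric attributes than as dateTime.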

Item 2: Promoting Automation in Data Reading

  • Goal: We aim to promote automation for reading data to streamline usage by applications and researchers.

  • Challenge: This requires converting the format string declared in the EML into one that is compatible with the target application. This conversion can be complex, as shown by our experience with ECC and DEX applications.

  • Recommendation: ISO-8601 is widely recognized across many applications and remains a strong candidate for a date and time standard. However, we need to survey other commonly used formats to evaluate whether they should also be supported, while balancing automation needs with format flexibility.

  • Tentative Agreement: We should aim for a balance between enabling automation and extending support to widely-used formats that may not be fully automatable.

Item 3: Zero-Padded Dates and Times

  • Current State: Zero padding for date and time values is required by the current list of supported formats. While some programs tend to drop leading zeros when writing to file, anecdotal evidence suggests this doesn’t impact the data’s readability.

  • Action: This behavior should be verified, as it may have implications for format checking.
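One way to start verifying this is to test a single reader directly. The sketch below checks whether Python's `datetime.strptime` accepts values with and without leading zeros; the helper name `parses` is made up for this example, and the result says nothing about R, PostgreSQL, or other readers, which would need their own checks.

```python
from datetime import datetime


def parses(value: str, strptime_format: str) -> bool:
    """Return True if Python's strptime can read the value with the format."""
    try:
        datetime.strptime(value, strptime_format)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    fmt = "%Y-%m-%d %H:%M:%S"
    print("padded:  ", parses("2024-01-05 07:03:09", fmt))
    print("unpadded:", parses("2024-1-5 7:3:9", fmt))
```

Note that this only covers formats with explicit separators. In compact formats without separators, dropped leading zeros make the component boundaries ambiguous.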

Item 4: Case Sensitivity in Date and Time Formats

  • Observation: Data packages in production at the repository show significant variance in case usage for date and time formats (e.g., yyyy-MM-dd vs. YYYY-MM-DD).

  • Standard: ISO 8601 notation writes date components in uppercase letters (e.g., YYYY-MM-DD) and time components in lowercase letters (e.g., hh:mm:ss), which keeps the two distinct for human and machine readers.

  • Discussion: We might not need to enforce strict case sensitivity, as context (e.g., MM in a date) can differentiate between date and time components without confusion.
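To illustrate how context could resolve case differences, here is a minimal sketch that normalizes a format string to uppercase date components and lowercase time components. The function name `normalize_case` and its heuristics are assumptions for this example: it splits once on the first "T" or space, treats a part containing y or d as a date, and treats a part containing h or s as a time.

```python
import re


def normalize_case(format_string: str) -> str:
    """Uppercase date letters and lowercase time letters, using context.

    Heuristic sketch: a part containing y/d is treated as a date, a part
    containing h/s as a time. Ambiguous parts (e.g. a lone "MM") are left
    unchanged.
    """
    # Split once on the first "T" or space, keeping the separator.
    parts = re.split(r"([T ])", format_string, maxsplit=1)
    normalized = []
    for part in parts:
        lowered = part.lower()
        if "y" in lowered or "d" in lowered:
            # Date part: uppercase only the y, m, d letters.
            part = re.sub(r"[ymd]", lambda m: m.group().upper(), part)
        elif "h" in lowered or "s" in lowered:
            # Time part: lowercase only the H, M, S letters.
            part = re.sub(r"[HMS]", lambda m: m.group().lower(), part)
        normalized.append(part)
    return "".join(normalized)


if __name__ == "__main__":
    for fmt in ["yyyy-MM-dd", "YYYY-MM-DD HH:MM:SS", "yyyy-mm-ddThh:mm:ssZ"]:
        print(fmt, "->", normalize_case(fmt))
```

A normalization step like this would let a checker accept mixed-case format strings while still comparing against a single canonical list.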

@clnsmth
Collaborator Author

clnsmth commented Oct 22, 2024

Item 5: Issuing Errors for Unsupported Formats

If we expand the list of supported formats, we could consider changing the handling of unsupported formats from warnings to errors. The rationale is that the expanded list would cover a wide range of commonly used, unambiguous date-time formats, so any remaining unsupported formats could be treated as invalid and rejected from publication.

@clnsmth
Collaborator Author

clnsmth commented Oct 22, 2024

Item 6: Distinguishing Preferred and Checked Formats

A key issue addressed by PEP-4 is the publication of date-time values in the repository that aren't checked due to unsupported formats. We've previously considered expanding the list of preferred formats to address this.

However, we could address this more directly by having two lists. One that defines the preferred formats, allowing us to maintain our focus on ISO 8601 as the preferred standard, and a second expanded list that is used when checking format-value congruence.
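A minimal sketch of the two-list idea follows. The list contents and the names `PREFERRED_FORMATS`, `CHECKED_FORMATS`, and `classify_format` are placeholders for illustration, not the actual ECC or ezEML lists.

```python
# Placeholder contents: the real lists would come from PEP-4.
PREFERRED_FORMATS = {"YYYY-MM-DD", "YYYY-MM-DDThh:mm:ss"}

# The checked list is a superset of the preferred list.
CHECKED_FORMATS = PREFERRED_FORMATS | {"YYYY/MM/DD", "MM/DD/YYYY"}


def classify_format(format_string: str) -> str:
    """Classify a format string against the two lists."""
    if format_string in PREFERRED_FORMATS:
        return "preferred"   # recommended and congruence-checked
    if format_string in CHECKED_FORMATS:
        return "checked"     # congruence-checked, but not recommended
    return "unsupported"     # values cannot be checked


if __name__ == "__main__":
    for fmt in ["YYYY-MM-DD", "YYYY/MM/DD", "DD.MM.YYYY"]:
        print(fmt, "->", classify_format(fmt))
```

Keeping the preferred list as a subset of the checked list means the recommendation can stay focused on ISO 8601 while the checker still covers more of the formats that reach the repository.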

@clnsmth
Collaborator Author

clnsmth commented Nov 1, 2024

Preliminary Decisions on PEP-4: Expanding Supported Date and Time Formats for ECC and ezEML Congruence Checks

Below are preliminary decisions on PEP-4, with associated action items.

Standardization of Date and Time Representations

Use uppercase letters for date components (e.g., YYYY, MM, DD for year, month, day) and lowercase letters for time components (e.g., hh:mm:ss for hours, minutes, seconds), for the sake of consistency.

Support "/" as a date component separator, in addition to the existing "-" separator. This accommodates formatting that is commonly submitted to the repository and used within the research community.

Actions

Component-Level Formatting

Represent individual date and time components (e.g., year, hour) as numeric EML AttributeType / measurementScale rather than using the dateTime type.

Actions

Best Practice Recommendation

Continue to recommend ISO 8601 in data packaging best practices.

Actions

Library for Date/Time Checking

Develop a date and time checker library to:

  1. Validate whether a date/time format is in the preferred list.
  2. Ensure congruence of date/time format with data values in data entities.
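The second function of such a library could look roughly like the sketch below, which reports the values in a column that do not parse under the declared format. The name `find_incongruent` and the two-entry `STRPTIME_CODES` table are assumptions for this example; a real library would cover the full checked list.

```python
from datetime import datetime
from typing import Iterable, List

# Tiny illustrative mapping from EML format strings to strptime codes.
STRPTIME_CODES = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DD hh:mm:ss": "%Y-%m-%d %H:%M:%S",
}


def find_incongruent(values: Iterable[str], eml_format: str) -> List[str]:
    """Return the values that do not parse under the declared EML format.

    Raises KeyError if the format is not covered by this sketch.
    """
    codes = STRPTIME_CODES[eml_format]
    incongruent = []
    for value in values:
        try:
            datetime.strptime(value, codes)
        except ValueError:
            incongruent.append(value)
    return incongruent


if __name__ == "__main__":
    column = ["2024-10-18", "10/18/2024", "2024-13-01"]
    print(find_incongruent(column, "YYYY-MM-DD"))
```

Returning the offending values, rather than a single pass/fail flag, would let the checker report how many values in a data entity disagree with the declared format.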

Actions

Support Automated Reading by Common Programming Languages

To facilitate programmatic data reads, and conversion between formats, we will provide mappings between EML format strings and common representations in languages like R and Python. This mapping could be included in the date and time checker library, made accessible as a web service, or provided as a resource in a PASTAplus GitHub repository.

Example:

EML format string      strftime/strptime format codes
YYYY-MM-DD hh:mm:ss    %Y-%m-%d %H:%M:%S
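A mapping like the example above could be generated by substituting tokens, as in this minimal sketch. The token table and the function name `eml_to_strptime` are assumptions for illustration; it covers only the six components named in this discussion and expects the uppercase-date, lowercase-time convention.

```python
import re

# Assumed token table: EML components mapped to strftime/strptime codes.
TOKEN_TO_CODE = {
    "YYYY": "%Y",
    "MM": "%m",
    "DD": "%d",
    "hh": "%H",
    "mm": "%M",
    "ss": "%S",
}

# Matches any known token; everything else (separators, "T") is left as-is.
_TOKEN_PATTERN = re.compile("|".join(TOKEN_TO_CODE))


def eml_to_strptime(eml_format: str) -> str:
    """Translate an EML format string into strftime/strptime codes."""
    return _TOKEN_PATTERN.sub(lambda m: TOKEN_TO_CODE[m.group()], eml_format)


if __name__ == "__main__":
    for fmt in ["YYYY-MM-DD", "YYYY/MM/DD", "YYYY-MM-DD hh:mm:ss"]:
        print(fmt, "->", eml_to_strptime(fmt))
```

Because the substitution is case-sensitive, "MM" (month) and "mm" (minute) map to different codes, which is one practical reason for standardizing case in the format strings.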

Actions

Zero-Padded Dates and Times

Zero-padding will not be required, as most programming languages can interpret these formats accurately without it. However, this may affect regex-based congruence checks, so we will verify this does not impact the PostgreSQL database used by the ECC.

Actions

  • Determine effect on regex-based congruence checks
  • Determine effect on PostgreSQL-based congruence checks
  • Confirm this can be supported
  • Add note to the section Unambiguous Date and Time Format, to highlight this caveat.
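To make the regex concern concrete, the sketch below builds a pattern from a format string with zero-padding either required or optional. The function `build_pattern` and its `require_padding` flag are hypothetical and are not how the ECC builds its checks; the PostgreSQL side is not covered here.

```python
import re

_TOKEN_RE = re.compile("(YYYY|MM|DD|hh|mm|ss)")


def build_pattern(eml_format: str, require_padding: bool = True) -> "re.Pattern[str]":
    """Build a regex that checks the shape of values for an EML format.

    With require_padding=False, two-digit components also accept a
    single digit (e.g. "5" as well as "05").
    """
    two_digit = r"\d{2}" if require_padding else r"\d{1,2}"
    token_patterns = {
        "YYYY": r"\d{4}",
        "MM": two_digit,
        "DD": two_digit,
        "hh": two_digit,
        "mm": two_digit,
        "ss": two_digit,
    }
    pieces = []
    # re.split with a capture group yields literals and tokens alternately.
    for piece in _TOKEN_RE.split(eml_format):
        if piece in token_patterns:
            pieces.append(token_patterns[piece])
        else:
            pieces.append(re.escape(piece))  # separators such as "-" or "/"
    return re.compile("".join(pieces))


if __name__ == "__main__":
    strict = build_pattern("YYYY-MM-DD")
    relaxed = build_pattern("YYYY-MM-DD", require_padding=False)
    print(bool(strict.fullmatch("2024-1-5")), bool(relaxed.fullmatch("2024-1-5")))
```

These patterns check only the shape of a value, not its range, so a month of 13 would still match. A strict pattern rejects unpadded values that a language-level parser may accept, which is the discrepancy the actions above are meant to assess.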

Abbreviated Month Formats

Formats that use abbreviated month names (e.g., dd-mon-yyyy) will not be included in the preferred list, because month abbreviations differ between languages and are difficult to support consistently.

Actions

  • None

Handling Unsupported Formats

Formats outside the newly expanded list of preferred formats will continue to be met with a warning rather than an error. This allows valid formats that were left off the expanded list by oversight to still enter the repository.

Actions

  • None

Seek Community Feedback

We will seek community feedback to review and help finalize these recommendations.

Actions
