Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCDE-293 pydbtools needs to be updated to latest awswrangler #50

Open
isichei opened this issue Nov 10, 2021 · 3 comments
Open

CCDE-293 pydbtools needs to be updated to latest awswrangler #50

isichei opened this issue Nov 10, 2021 · 3 comments
Assignees

Comments

@isichei
Copy link
Contributor

isichei commented Nov 10, 2021

There was an error where aws wrangler would not convert parquet dates correctly. This has been fixed in AWS Data Wrangler 2.12.0 (PR that fixed it for context) so we need to point pydbtools to at least this release.

Check that it functions correctly run the workaround that I posted in the PR:

import awswrangler as wr
df = wr.athena.read_sql_query(
    "SELECT mojap_end_datetime xhibit_v1.cases where mojap_latest_record LIMIT 10",
    database="database",
    pyarrow_additional_kwargs={"coerce_int96_timestamp_unit": "ms", "timestamp_as_object": True}
)
df.head()

You can use any SQL query that has an mojap_end_datetime as the default value is 2999-01-01 00:00:00 for latest records. This query should give back correct timestamps (2999-01-01 00:00:00).

If it works you will need to update the pyproject.toml to the relevant awrangler dependency.

@isichei isichei changed the title pydbtools needs to be updated to latest awswrangler CCDE-293 pydbtools needs to be updated to latest awswrangler Nov 10, 2021
@isichei
Copy link
Contributor Author

isichei commented Nov 10, 2021

Blocked by #51

@mratford mratford self-assigned this Nov 10, 2021
@mratford
Copy link
Contributor

I don't have access to the xhibit_v1 database at the moment but this fails on the following query.

releases = wr.athena.read_sql_query(
    """
    SELECT DISTINCT C.prison, C.offender_id, S.date_of_release
    FROM nomis_ao.core AS C
    LEFT JOIN nomis_ao.sentences AS S
    ON S.extract_datetime = C.extract_datetime AND S.record_number = C.record_number
    WHERE S.date_of_release IS NOT NULL
    """,
    database = "nomis_ao",
    ctas_approach = False,
    pyarrow_additional_kwargs={
        "coerce_int96_timestamp_unit": "ms", 
        "timestamp_as_object": True
    }
)

gives the AttributeError: Can only use .dt accessor with datetimelike values error.

For reference filtering the year does work

releases = wr.athena.read_sql_query(
    """
    SELECT DISTINCT C.prison, C.offender_id, S.date_of_release
    FROM nomis_ao.core AS C
    LEFT JOIN nomis_ao.sentences AS S
    ON S.extract_datetime = C.extract_datetime AND S.record_number = C.record_number
    WHERE S.date_of_release IS NOT NULL AND YEAR(S.date_of_release) < 2100
    """,
    database = "nomis_ao",
    ctas_approach = False,
    pyarrow_additional_kwargs={
        "coerce_int96_timestamp_unit": "ms", 
        "timestamp_as_object": True
    }
)

It seems not to like years after 2500 - this data looks like typos from prison staff.

@mratford
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants