
Add support for Hive (Spark) backends #32

Merged: daniel-thom merged 8 commits into main from dt/hive-support on Jan 17, 2025
Conversation

daniel-thom (Collaborator) commented on Dec 28, 2024:

  1. Add support for PyHive backends as an optional dependency. dsgrid will use it. Handling of timestamps when creating tables is not perfect (there are workarounds); I'd like to move forward as-is and make improvements later. (A sketch of connecting to a Hive backend follows this list.)
  2. Add an option to write the destination table of a map-table operation to a Parquet file instead of the database. This is primarily for dsgrid, but other users might want it.
  3. Add options to skip time checks on the mapped table. dsgrid won't want to do this extra work on huge tables. It shouldn't be required, but we can talk about it.
  4. Refactor commit/rollback handling in ingest_table. There were corner cases that weren't covered, especially with SQLite.
  5. Refactor write_database and read_database. The addition of Hive necessitated some reorganization. I found that we really don't need Polars anymore, so I removed it from the repo.
  6. Add ignore_columns to TableSchema. This allows users to include columns that chronify ignores.
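For item 1, a rough sketch of what using a Hive backend could look like. The hive:// URL is the standard PyHive/SQLAlchemy form with the Thrift server's default port; constructing a Store directly from a SQLAlchemy engine is an assumption about the API here, not a confirmed signature:

# Hypothetical usage sketch. Store(engine=...) is assumed; the hive://
# dialect is registered by PyHive, and 10000 is Thrift's default port.
from sqlalchemy import create_engine

from chronify.store import Store

engine = create_engine("hive://localhost:10000/default")
store = Store(engine=engine)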

codecov-commenter commented on Dec 28, 2024:

Codecov Report

Attention: Patch coverage is 92.41935% with 47 lines in your changes missing coverage. Please review.

Project coverage is 94.04%. Comparing base (ea97feb) to head (6068b97).

Files with missing lines                  Patch %   Lines
src/chronify/sqlalchemy/functions.py      81.08%    21 Missing ⚠️
src/chronify/store.py                     87.41%    18 Missing ⚠️
src/chronify/time_series_mapper_base.py   86.84%     5 Missing ⚠️
src/chronify/schema_manager.py            94.59%     2 Missing ⚠️
tests/conftest.py                         97.29%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #32      +/-   ##
==========================================
+ Coverage   93.17%   94.04%   +0.86%     
==========================================
  Files          34       37       +3     
  Lines        2214     2520     +306     
==========================================
+ Hits         2063     2370     +307     
+ Misses        151      150       -1     


@daniel-thom force-pushed the dt/hive-support branch 4 times, most recently from 90478ce to 9770b7e on December 30, 2024 at 17:25.
tar -xzf spark-3.5.4-bin-hadoop3.tgz
export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3
export PATH=$SPARK_HOME/sbin:$PATH
start-master.sh
daniel-thom (Collaborator, Author) commented:

Starting a real cluster is not needed. Starting Thrift by itself will run an in-process Spark cluster.
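For anyone reproducing this locally, a minimal sketch of that approach (host and port are Thrift's defaults; adjust for your setup):

# Launch only the Thrift JDBC/ODBC server; no start-master.sh needed,
# since it runs an in-process Spark cluster by itself:
#   $SPARK_HOME/sbin/start-thriftserver.sh
from pyhive import hive

# Quick connectivity check against the Thrift server's default port.
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())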

@daniel-thom force-pushed the dt/hive-support branch 4 times, most recently from 15ef17d to 97b6837 on January 7, 2025 at 16:39.
if isinstance(config, DatetimeRange):
if isinstance(df2[config.time_column].dtype, DatetimeTZDtype):
# Spark doesn't like ns.
# TODO: is there a better way to change from ns to us?
daniel-thom (Collaborator, Author) commented:

@lixiliu If you have suggestions, let me know.

lixiliu (Collaborator) commented:

No; since pandas doesn't support other unit types at the moment, there's little we can do.

https://pandas.pydata.org/docs/reference/api/pandas.DatetimeTZDtype.html
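For what it's worth, pandas 2.x did add non-nanosecond resolution. If the project can require pandas >= 2.0, one way to do the ns-to-us conversion is Series.dt.as_unit (a sketch, assuming a tz-aware column):

import pandas as pd

# datetime64[ns, UTC] -> datetime64[us, UTC]; .dt.as_unit requires pandas >= 2.0.
s = pd.Series(pd.to_datetime(["2020-01-01 00:00:00"], utc=True))
s_us = s.dt.as_unit("us")
print(s_us.dtype)  # datetime64[us, UTC]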

@daniel-thom marked this pull request as ready for review on January 7, 2025 at 16:40.
lixiliu (Collaborator) left a review:

Minor comments, I think.

to_schema: TableSchema,
scratch_dir: Optional[Path] = None,
output_file: Optional[Path] = None,
check_mapped_timestamps: bool = True,
lixiliu (Collaborator) commented on Jan 15, 2025:

I am inclined to suggest defaulting this to False for speed. While it's useful to have this feature, we already have many default checks to ensure the mapping process works. For example, the mapping creation starts with the to_time_config, so we know the timestamps will at least match up in terms of unique values.

daniel-thom (Collaborator, Author) commented:

Agreed on setting this to False. We should make sure that all tests set this to True (see the sketch below).
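A hypothetical call site for reference: the method name and the from_schema argument are assumptions, and only the keyword parameters below appear in the signature excerpt above.

# Hypothetical sketch; map_table and from_schema are assumed names.
from pathlib import Path

store.map_table(
    from_schema,
    to_schema,
    output_file=Path("mapped_table.parquet"),  # item 2: write to Parquet instead of the DB
    check_mapped_timestamps=True,  # tests should pass True explicitly once the default is False
)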

@@ -29,6 +29,10 @@ class TableSchemaBase(ChronifyBaseModel):
"Should not include time columns."
),
]
ignore_columns: list[str] = Field(
lixiliu (Collaborator) commented:

I like this. We probably don't need to do check_name here because we'll never query these columns via chronify, right?

daniel-thom (Collaborator, Author) commented:

I'm not checking that the ignore_columns are actually present. Looking again, this field has no effect other than providing possible clarity to the user. I should change it to one of these:

  1. Remove the field. Chronify only looks at the columns explicitly defined in the schema; any other columns are implicitly ignored.
  2. Keep the field but check that those columns are present.

lixiliu (Collaborator) commented:

This is similar to the mapping table config model, which has an "other_columns" field for this purpose.

We need [2] because model.list_columns() is used in the mapping process for schema consistency and to ensure all the required columns are output.

daniel-thom (Collaborator, Author) commented:

Go with 1.
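To illustrate option 1, a sketch: the import path, value_column, and the DatetimeRange arguments are assumptions for illustration; time_column and time_array_id_columns appear in this PR's diff excerpts.

# Sketch of option 1: any column not named in the schema is implicitly ignored.
from datetime import datetime, timedelta

from chronify.models import DatetimeRange, TableSchema  # import path assumed

schema = TableSchema(
    name="devices",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=datetime(2020, 1, 1),
        length=24,
        resolution=timedelta(hours=1),
    ),
    time_array_id_columns=["device_id"],
    value_column="value",
    # A column such as "notes" in the underlying table needs no mention here;
    # with option 1 it is ignored without any ignore_columns field.
)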

@@ -45,7 +47,12 @@ def _check_source_table_has_time_zone(self) -> None:
msg = f"time_zone is required for tz-aware representative time mapping and it is missing from source table: {self._from_schema.name}"
lixiliu (Collaborator) commented:

@daniel-thom - I think we need to check time_zone a different way here since list_columns() no longer captures all table columns.

lixiliu (Collaborator) commented on Jan 17, 2025:

Actually, we should enforce that time_zone is in the time_array_id_columns, so this check is still correct.

lixiliu (Collaborator) left a review:

One remaining issue to be resolved on Teams.

@daniel-thom merged commit 3885a65 into main on Jan 17, 2025. 4 checks passed.
@daniel-thom deleted the dt/hive-support branch on January 17, 2025 at 19:52.
github-actions bot pushed a commit that referenced this pull request on Jan 17, 2025:
* Add support for Hive (Spark) backends
* Run a local Spark cluster in CI
* Add handling of DuckDB types
* Add option to write mapped tables to Parquet