Add support for Hive (Spark) backends #32
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #32      +/-   ##
==========================================
+ Coverage   93.17%   94.04%   +0.86%
==========================================
  Files          34       37       +3
  Lines        2214     2520     +306
==========================================
+ Hits         2063     2370     +307
+ Misses        151      150       -1
```

☔ View full report in Codecov by Sentry.
Force-pushed from 90478ce to 9770b7e.
.github/workflows/ci.yml (outdated)

```
tar -xzf spark-3.5.4-bin-hadoop3.tgz
export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3
export PATH=$SPARK_HOME/sbin:$PATH
start-master.sh
```
Starting a real cluster is not needed. Starting the Thrift server by itself will run an in-process Spark cluster.
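As a hedged illustration of that suggestion: the start-thriftserver.sh script ships with Spark, but the connection URL and the PyHive dependency below are assumptions for this sketch, not part of this PR.

```python
# Minimal sketch, assuming the Thrift server was started on its own with
# $SPARK_HOME/sbin/start-thriftserver.sh (no start-master.sh needed) and
# that the PyHive SQLAlchemy dialect is installed (pip install "pyhive[hive]").
# Host and port below are the Thrift server defaults.
from sqlalchemy import create_engine, text

engine = create_engine("hive://localhost:10000/default")
with engine.connect() as conn:
    # Smoke test against the in-process Spark cluster behind Thrift.
    print(conn.execute(text("SELECT 1")).scalar())
```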
Force-pushed from 15ef17d to 97b6837.
src/chronify/sqlalchemy/functions.py (outdated)

```python
if isinstance(config, DatetimeRange):
    if isinstance(df2[config.time_column].dtype, DatetimeTZDtype):
        # Spark doesn't like ns.
        # TODO: is there a better way to change from ns to us?
```
@lixiliu If you have suggestions, let me know.
No. Since pandas doesn't support other unit types at the moment, there are only limited things we can do: https://pandas.pydata.org/docs/reference/api/pandas.DatetimeTZDtype.html
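For reference, a hedged sketch of the ns-to-us conversion the TODO asks about; it assumes pandas >= 2.0, where DatetimeTZDtype accepts non-nanosecond units and Series.dt.as_unit exists.

```python
import pandas as pd
from pandas import DatetimeTZDtype

# Sketch only: convert a tz-aware nanosecond column to microseconds,
# since Spark rejects ns-precision timestamps. Requires pandas >= 2.0.
df = pd.DataFrame(
    {"timestamp": pd.date_range("2020-01-01", periods=3, freq="h", tz="US/Eastern")}
)
assert isinstance(df["timestamp"].dtype, DatetimeTZDtype)

df["timestamp"] = df["timestamp"].dt.as_unit("us")
print(df["timestamp"].dtype)  # datetime64[us, US/Eastern]
```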
Force-pushed from eea219c to 88bd2f2.
Minor comments, I think.
src/chronify/time_series_mapper.py (outdated)

```python
to_schema: TableSchema,
scratch_dir: Optional[Path] = None,
output_file: Optional[Path] = None,
check_mapped_timestamps: bool = True,
```
I am inclined to suggest defaulting this to False for speed. While it's useful to have this feature, we already have many default checks to ensure the mapping process is correct. E.g., the mapping creation starts with the to_time_config, so we know they will at least match up in terms of unique values.
Agree on setting this to False. Make sure that all tests set this to True.
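A hypothetical sketch of the agreed change; the function name and the parameters not shown in the diff above are assumptions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical sketch: the timestamp verification becomes opt-in
# (default False for speed), and tests should pass True explicitly.
def map_time(
    from_schema: "TableSchema",
    to_schema: "TableSchema",
    scratch_dir: Optional[Path] = None,
    output_file: Optional[Path] = None,
    check_mapped_timestamps: bool = False,  # previously defaulted to True
) -> None:
    ...
```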
src/chronify/models.py (outdated)

```python
@@ -29,6 +29,10 @@ class TableSchemaBase(ChronifyBaseModel):
            "Should not include time columns."
        ),
    ]
    ignore_columns: list[str] = Field(
```
I like this. We probably don't need to do check_name here b/c we'll never query it via chronify, right?
I'm not checking that the ignore_columns are actually present. Looking again, this field has no effect other than providing possible clarity to the user. I should change it to one of these:
1. Remove the field. Chronify only looks at the columns explicitly defined in the schema; any other columns are implicitly ignored.
2. Keep the field but check that those columns are present.
This is also similar to the mapping table config model, which has an "other_columns" field for this purpose.
We need [2] because model.list_columns() is used in the mapping process for schema consistency and to ensure that all the required columns are output.
Go with 1.
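For concreteness, a hypothetical sketch of option 2; the model and helper below are illustrative assumptions, not this PR's code.

```python
from pydantic import BaseModel, Field

# Hypothetical sketch of option 2: keep ignore_columns on the schema
# and verify that those columns actually exist in the ingested table.
class TableSchemaSketch(BaseModel):
    name: str
    time_array_id_columns: list[str]
    value_column: str
    ignore_columns: list[str] = Field(
        default_factory=list,
        description="Columns present in the table that chronify should ignore.",
    )

def check_ignore_columns(schema: TableSchemaSketch, table_columns: set[str]) -> None:
    missing = set(schema.ignore_columns) - table_columns
    if missing:
        raise ValueError(f"ignore_columns not present in the table: {sorted(missing)}")
```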
Force-pushed from 88bd2f2 to 10fb488.
Force-pushed from 10fb488 to 6068b97.
```python
@@ -45,7 +47,12 @@ def _check_source_table_has_time_zone(self) -> None:
    msg = f"time_zone is required for tz-aware representative time mapping and it is missing from source table: {self._from_schema.name}"
```
@daniel-thom - I think we need to check time_zone a different way here since list_columns() no longer captures all table columns.
Actually, we should enforce that time_zone is in the time_array_id_columns, so this check is still correct.
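A hedged sketch of that enforcement, reusing the identifiers visible in the diff above; the surrounding class is assumed.

```python
# Hypothetical sketch: require the time_zone column to be declared in
# time_array_id_columns, so a list_columns()-based check still sees it.
def _check_source_table_has_time_zone(self) -> None:
    if "time_zone" not in self._from_schema.time_array_id_columns:
        msg = (
            "time_zone is required for tz-aware representative time mapping "
            f"and it is missing from source table: {self._from_schema.name}"
        )
        raise ValueError(msg)
```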
One remaining issue to be resolved on Teams.
* Add support for Hive (Spark) backends
* Run a local Spark cluster in CI
* Add handling of DuckDB types
* Add option to write mapped tables to Parquet
* ingest_table: there were corner cases not covered, especially with SQLite write_database and read_database. The addition of Hive necessitated some reorganization. I found that we really don't need Polars anymore, and so removed it from the repo.
* Add ignore_columns to TableSchema. This allows users to include columns that chronify ignores.
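To illustrate the new field, a minimal hypothetical sketch; the schema fields other than ignore_columns are assumptions about chronify's TableSchema, not taken from this PR.

```python
# Hypothetical usage sketch of ignore_columns: the "notes" column may
# exist in the ingested table, but chronify will not read or map it.
schema = TableSchema(
    name="devices",
    time_config=time_config,  # e.g., a DatetimeRange defined elsewhere
    time_array_id_columns=["device_id"],
    value_column="value",
    ignore_columns=["notes"],
)
```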