kedro-datasets: ibis.FileDataset w/ ibis.TableDataset in pipeline #935
Comments
I haven't looked into this at all, but my intuition is that it stems from #842 (comment): because the connections are different, temporary tables are not shared. I can try to look into creating a shared cache, but no other dataset follows this pattern (they are pretty independent the way they're currently designed). @merelcht @astrojuanlu @idanov @noklam, not sure if any of you have thoughts on this.
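To illustrate the suspected cause, here is a minimal sketch of how a temporary table created on one connection is invisible to a second, independent connection. It assumes nothing about the Ibis or kedro-datasets internals: the standard-library `sqlite3` module stands in for the DuckDB backend, and the table name `staged_csv` and the helper `can_see_table` are made up for the example.

```python
import sqlite3

# Two independent in-memory connections, standing in for the separate
# backend connections held by the two datasets (assumption: sqlite3 is
# used here only as a stand-in for DuckDB).
file_conn = sqlite3.connect(":memory:")
table_conn = sqlite3.connect(":memory:")

# The "file" side stages its data as a temporary table on its own connection.
file_conn.execute("CREATE TEMP TABLE staged_csv (id INTEGER)")
file_conn.execute("INSERT INTO staged_csv VALUES (1), (2)")


def can_see_table(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if `name` can be queried through `conn` (hypothetical helper)."""
    try:
        conn.execute(f"SELECT COUNT(*) FROM {name}")
        return True
    except sqlite3.OperationalError:
        # Raised when the table does not exist on this connection.
        return False


visible_to_owner = can_see_table(file_conn, "staged_csv")
visible_to_other = can_see_table(table_conn, "staged_csv")
```

The owning connection can query the table, while the other connection cannot, which is the same shape of failure as the "Table with name ... does not exist" error in the description.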
Makes sense to me. @deepyaman Maybe this can be solved in a similar way to
I think all of the datasets we provide inherit directly from I'm going to implement this as a mixin to not change that. The other benefit of using a mixin is that it should be reusable (e.g. for the Ibis datasets, and pandas SQL datasets), rather than defining a separate piece in the inheritance hierarchy for both. (The other alternative could be to throw this into the base class, but I don't know if that's necessary.)
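A rough sketch of what such a mixin could look like, under stated assumptions: `SharedConnectionMixin`, `FileLikeDataset`, and `TableLikeDataset` are hypothetical names, `sqlite3` again stands in for the real backend, and this is not the implementation that kedro-datasets uses. The idea is only that datasets with equal connection configuration reuse one cached connection, so temporary tables created by one are visible to the other.

```python
from __future__ import annotations

import sqlite3
from typing import Any, ClassVar


class SharedConnectionMixin:
    """Hypothetical mixin: equal connection configs share one cached connection."""

    # Class-level cache shared by every dataset class that uses the mixin.
    _connections: ClassVar[dict[tuple, Any]] = {}

    def _connect(self) -> Any:
        raise NotImplementedError

    @property
    def connection(self) -> Any:
        # Build a hashable key from the config (assumes hashable values).
        key = tuple(sorted(self._connection_config.items()))
        cache = SharedConnectionMixin._connections
        if key not in cache:
            cache[key] = self._connect()
        return cache[key]


class FileLikeDataset(SharedConnectionMixin):
    """Stand-in for a file dataset that stages data as a temporary table."""

    def __init__(self, connection_config: dict[str, str]) -> None:
        self._connection_config = connection_config

    def _connect(self) -> sqlite3.Connection:
        return sqlite3.connect(self._connection_config["database"])

    def load(self) -> str:
        conn = self.connection
        conn.execute("CREATE TEMP TABLE IF NOT EXISTS staged_csv (id INTEGER)")
        conn.execute("INSERT INTO staged_csv VALUES (1), (2)")
        return "staged_csv"


class TableLikeDataset(SharedConnectionMixin):
    """Stand-in for a table dataset that reads what the file dataset staged."""

    def __init__(self, connection_config: dict[str, str]) -> None:
        self._connection_config = connection_config

    def _connect(self) -> sqlite3.Connection:
        return sqlite3.connect(self._connection_config["database"])

    def count_rows(self, table_name: str) -> int:
        query = f"SELECT COUNT(*) FROM {table_name}"
        return self.connection.execute(query).fetchone()[0]


config = {"database": ":memory:"}
source = FileLikeDataset(config)
sink = TableLikeDataset(dict(config))  # equal config, different dict object
row_count = sink.count_rows(source.load())
```

Because the cache lives on the mixin rather than on either dataset class, it can be reused across dataset families without adding a new layer to the inheritance hierarchy, which is the benefit described above.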
This was my initial thought, but I dismissed it because it failed similarly when I materialized the FileDataset as a table. Shouldn't that work if it's just the connection or am I overlooking something?
Error:
This actually feels like an even simpler ask than ibis-project/ibis#8115 (I think it's a bit different, because DuckDB supports a way to load from other databases, and the ask is to expose it there). @cpcloud @gforsyth do you know if I'm either:
I think the answer I recall from some months ago was that the output of a
I'm missing some context, I think. If the I guess I'm not clear on what the ask is from the Ibis side?
You can think of If
Ibis currently doesn't support things across multiple connections -- that may happen in the future, but there's no work happening on that front at the moment. Either the
@deepyaman Sorry, this might be a stupid idea, but could we just do something similar to the SparkDataset? It uses a get_spark() function that gets or creates the active session in the load method. Pipelines using SparkDataset use a hook to create the Spark session when the pipeline loads.
Unfortunately, not really; getting the active session is a Spark thing.
Thanks! That's all I wanted to check.
Description
I'm trying to update a pipeline to use the new ibis.FileDataset. My pipeline reads in CSV files, but writes them to DuckDB for all data engineering operations. My current catalog is:
When I change the first catalog entry to FileDataset, it fails with the message:
Catalog Error: Table with name ibis_read_csv_24taho52bbdw5nhlthjptakvyu does not exist!
The FileDataset entry loads fine in a kedro ipython session:
Context
For now I can continue using TableDataset with no impact.
Steps to Reproduce
Expected Result
FileDataset should be able to read in a file, and that catalog entry should be usable as input to a TableDataset node.
Actual Result
Full Error message:
Your Environment
I'm using kedro 0.19.9, kedro-datasets 5.1.0, and ibis 9.5.0.