Data Lake features #14
@aaronsteers do you still have these needs? I have very similar needs. Did you fork and implement?
Hi @aroder - I do still have the same requirements, but our ultimate destination was Snowflake, and since Snowflake loads its data from S3 anyway, I ended up getting the S3 files as part of that pipeline.
Essentially the flow is Source -> S3 -> Snowflake, and I retain the S3 files, so the output is similar enough to having used this tap directly while also getting my downstream database populated. (A similar approach should also be doable for Redshift, since both platforms recommend ingesting data from S3.) Another challenge I found as a downside to using the target-s3-csv directly is that there doesn't appear to be a clear way of passing a catalog on to downstream consumers. So, for instance, if I'm pulling from Salesforce into S3-CSV, downstream consumers of those files have no straightforward way to recover the primary keys or data types from the original catalog.
I have not yet implemented any kind of date partitioning, but that's something I would like to eventually revisit.
We use Redshift on top of S3, and we use the Spectrum feature. This allows querying the data in S3 as if it were stored in tables in Redshift. It saves a lot of space and removes the need to make another copy of the data within Redshift. Controlling the naming scheme of the key is necessary to make this work. Date partitioning is also a feature of Spectrum, although not required.
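For anyone unfamiliar with the setup described above, a rough sketch of what the Spectrum side can look like follows - the schema, table, column, and bucket names are all made up, and the external schema (here `spectrum`) is assumed to already exist:

```python
# Rough sketch of a Redshift Spectrum external table over S3 data.
# All names are illustrative; the external schema "spectrum" is assumed to
# have been created already (CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG).
import psycopg2

DDL = """
CREATE EXTERNAL TABLE spectrum.salesforce_account (
    id         VARCHAR(18),
    name       VARCHAR(255),
    updated_at TIMESTAMP
)
PARTITIONED BY (dt DATE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-data-lake/prepped/salesforce/account/';
"""

conn = psycopg2.connect("host=my-cluster.example dbname=analytics user=etl")
conn.autocommit = True  # external DDL cannot run inside a transaction block
cur = conn.cursor()
cur.execute(DDL)
# Each date partition is registered explicitly and points at a dated S3 prefix,
# which is why controlling the key naming scheme matters.
cur.execute(
    "ALTER TABLE spectrum.salesforce_account "
    "ADD IF NOT EXISTS PARTITION (dt='2020-01-01') "
    "LOCATION 's3://my-data-lake/prepped/salesforce/account/2020/01/01/';"
)
```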
@aaronsteers good point on losing the metadata. target-redshift maintains some of this metadata on the tables it generates in Redshift, but it has some downsides too. For example, it can only create tables based on the schema definition, because the properties selected/not selected in the catalog are not available to it. So if I have a source table with 200 columns but only need a handful, all 200 columns are created and space is allocated to them, but most are null. My testing so far using this target with a data lake setup results in less data and far fewer columns, so I'm leaning toward this setup over using target-redshift.
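As an aside, if the catalog were available to a target, using the selection metadata could look roughly like this (the catalog layout follows the standard Singer format; the helper itself is purely illustrative):

```python
# Illustrative only: given a Singer catalog entry for one stream, keep just
# the selected properties so a target could create narrower tables/files.
def selected_properties(catalog_entry: dict) -> dict:
    schema_props = catalog_entry["schema"]["properties"]
    selected = set()
    for item in catalog_entry.get("metadata", []):
        breadcrumb = item.get("breadcrumb", [])
        if (len(breadcrumb) == 2 and breadcrumb[0] == "properties"
                and item.get("metadata", {}).get("selected")):
            selected.add(breadcrumb[1])
    return {name: spec for name, spec in schema_props.items() if name in selected}
```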
I'd like to follow your lead in transferwise/pipelinewise-target-snowflake#77. Did you leave out the s3_key_naming_scheme setting proposed here?
Just wondering, how do you guys feel about CSV in general?
@koszti - I do think CSV (especially when gzipped) has a lot of value. It's perhaps the most universally accepted cloud data format, supported natively by basically every cloud data platform (Snowflake, Redshift, Spark, and even Pandas). That said, I think Parquet would be a great addition and - perhaps combined with Avro schemas - would potentially resolve some of the issues around not retaining robust primary key and data type info when the target's output needs to be consumed again by downstream pipelines. I don't know much about Avro standalone, except that it's row-based instead of column-based and that Avro schemas are often used to describe Parquet datasets (as well as Avro ones).
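To make the trade-off a bit more concrete, here is a minimal sketch (pandas, with made-up records) of writing the same batch as gzipped CSV versus Parquet - the Parquet file carries its column types with it, while the CSV leaves type inference to whoever loads it:

```python
import pandas as pd

records = [
    {"Id": "0065x001", "Amount": 1200.50, "CloseDate": "2021-06-01"},
    {"Id": "0065x002", "Amount": 880.00, "CloseDate": "2021-06-03"},
]
df = pd.DataFrame(records)
df["CloseDate"] = pd.to_datetime(df["CloseDate"])

# Gzipped CSV: universally loadable, but types must be re-inferred downstream.
df.to_csv("opportunities.csv.gz", index=False, compression="gzip")

# Parquet: column types travel with the file (requires pyarrow or fastparquet).
df.to_parquet("opportunities.parquet", index=False)
```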
@aroder - Yes, the spec up above, where I proposed a value like data/raw/salesforce/{stream}/v1/*, isn't exactly what I implemented. What I found when working on the snowflake target was that, to fit into the existing paradigm, the best solve was a combination of an s3_path_prefix (required) and an s3_file_name_scheme (optional). Internally, we concatenate them, but this lets existing users keep getting the same behavior while allowing the extra customization for those who want it. I did not implement a single combined s3_key_naming_scheme setting.
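For illustration, a minimal sketch of that concatenation - the function name and the exact fallback behavior are my assumptions, not the actual implementation; token expansion along the lines of the original proposal is sketched further down the thread:

```python
# Sketch only: combine a required s3_path_prefix with an optional
# s3_file_name_scheme so existing configs keep producing the same keys.
def build_s3_key(config: dict, default_file_name: str) -> str:
    prefix = config["s3_path_prefix"].rstrip("/")
    # Omitting the optional setting falls back to the file name the target
    # already generates today, preserving backwards compatibility.
    file_name = config.get("s3_file_name_scheme") or default_file_name
    return f"{prefix}/{file_name}"

# build_s3_key({"s3_path_prefix": "data/raw/salesforce"}, "Account-001.csv")
# -> "data/raw/salesforce/Account-001.csv"
```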
I submitted #19 if you want to review. I took a similar approach to where you landed, @aaronsteers, of keeping backwards compatibility. @koszti I like the idea of other formats, especially if target-s3 is doing the heavy lifting. I like what Aaron was talking about with maintaining the metadata. I think finding an optimal format (Parquet + Avro sounds nice) and making that an option would be a great feature, but CSV is very workable too. Our data lake configuration doesn't do anything like that (because I never considered it) - we took the approach of just dumping the data into a raw bucket, then doing some cleanup and putting it in a prepped bucket. Redshift queries the prepped bucket data directly using Spectrum, without actually copying the data into Redshift.
Some reference material for the Parquet format: https://github.com/mirelagrigoras/target-parquet
Hello! I'd like to ask if one or more of the following features might be accepted in a PR. These features would enable a "data lake" type target, and would address some potential scalability limits in the present implementation.

1. Dynamic file naming schemes. An `s3_key_naming_scheme` setting for a 'salesforce' data pipeline might accept values like `data/raw/salesforce/{stream}/v1/*` - in which case the `{stream}` text would be replaced with the name of the stream (e.g. "Account", "Opportunities", etc.) and the `*` would be replaced by the text which is presently used in the target output. The setting might also accept `data/raw/{tap}/{stream}/v1/*`, in which `{tap}` is replaced with the text `salesforce`.
2. Date partitioning. Tokens such as `{yyyy}`, `{mm}`, and `{dd}` - as in `data/raw/salesforce/{stream}/v1/{yyyy}/{mm}/{dd}/*`. (A sketch of how these tokens could be expanded is shown after this list.)
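A minimal sketch of how these tokens could be expanded (the function and its signature are hypothetical, not something the target provides today):

```python
# Hypothetical expansion of a naming scheme such as
# "data/raw/{tap}/{stream}/v1/{yyyy}/{mm}/{dd}/*".
from datetime import datetime, timezone
from typing import Optional

def expand_key_naming_scheme(scheme: str, tap: str, stream: str,
                             default_file_name: str,
                             now: Optional[datetime] = None) -> str:
    now = now or datetime.now(timezone.utc)
    tokens = {
        "{tap}": tap,
        "{stream}": stream,
        "{yyyy}": f"{now:%Y}",
        "{mm}": f"{now:%m}",
        "{dd}": f"{now:%d}",
        "*": default_file_name,  # the file name the target generates today
    }
    for token, value in tokens.items():
        scheme = scheme.replace(token, value)
    return scheme

# expand_key_naming_scheme("data/raw/{tap}/{stream}/v1/{yyyy}/{mm}/{dd}/*",
#                          "salesforce", "Account", "Account-001.csv")
# -> e.g. "data/raw/salesforce/Account/v1/2021/06/01/Account-001.csv"
```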
The above has recently become relevant to my use cases, as I'm starting new projects with a long backfill requirement - between 2 and 7 years. The current behavior seems to be that these initial backfills would reach a very large size, with a single initial file potentially containing multiple years of data.