Arrow, writing nested types. #389

Open
archaic opened this issue Jan 22, 2024 · 8 comments
Comments

@archaic

archaic commented Jan 22, 2024

Hi, my work requires me to implement writing nested types in Arrow format. Currently
I use tech.ml.dataset to convert Clojure columnar data into the Arrow format for processing in C++.
I need to implement the writing of nested vectors and maps in particular (in arrow/dataset->stream!).
Is this something you would be interested in me contributing to this project?
If so, any advice is appreciated; otherwise I will have to do something like maintain
a private fork.

@cnuernber
Collaborator

cnuernber commented Jan 24, 2024

Definitely interested in contributions here - we have read-only support for a subset of this - lists, I believe. Let me know if you want to discuss details of the design or anything else. If not, and you are fine just researching and implementing, that works for us too.

@cnuernber
Collaborator

@archaic - is this still an issue for you? I would rather see this contribution in tmducken but arrow also makes sense.

@archaic
Author

archaic commented Apr 20, 2024

I wasn't able to come up with a good solution. I would love to have this functionality, as most of the datasets I work with are annoying enough to have small segments of nested data that are difficult to wrangle into columns. I think it is a difficult problem to handle generically - does the schema get inferred? That is hard in itself.

I ended up falling back to using metosin/malli to define schemas for the columns with complex schemas, then the raw Java Arrow library to convert the malli schemas to Arrow datatypes and column writers. However, this seems like a regression from just being able to use a dataset and arrow/write! automatically.

@cnuernber
Collaborator

cnuernber commented Apr 20, 2024

I think we can break the problem up a bit.

Are you writing the small portions of nested data into arrow as its generic map type or its struct type?

Structs are more arrow-friendly I would think but I would be curious which direction you went.

In any case it would be possible to add a map of column->simple schema type to describe either a generic map or a struct with a defined set of datatypes as its members. Then the writing system would respect this mapping if provided, else write out nested data using Arrow's generic map type. This isn't implemented in the writing layer yet and would be the new addition.
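
As a rough illustration of the column->schema mapping described above, the write options might look something like the following. The `:nested-schemas` key, the `:kind` values, and `nested-schema-for` are all hypothetical names made up for this sketch; none of them exist in tech.ml.dataset today.

```clojure
;; Hypothetical option shape: nothing here is an existing tech.ml.dataset API.
(def write-options
  {:nested-schemas
   {:address {:kind   :struct
              :fields {:street :string
                       :zip    :int32}}
    :scores  {:kind       :map
              :key-type   :string
              :value-type :float64}}})

(defn nested-schema-for
  "Returns the schema the caller declared for `column`, or the generic-map
  fallback when no schema was supplied for that column."
  [options column]
  (get-in options [:nested-schemas column] {:kind :generic-map}))

(nested-schema-for write-options :address)
;; => {:kind :struct, :fields {:street :string, :zip :int32}}

(nested-schema-for write-options :unlisted)
;; => {:kind :generic-map}
```

Keeping the description as plain data means it could be produced by hand or derived from malli without the writer needing to know which.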

Then it seems like the same pathway would apply where you could use malli to detect the schema type and then just provide some data definition of the schema type in the options to the write method. I think relying on malli to do this is totally reasonable and first class - we are big fans of malli here - and it could be an optional dependency and optional method that is available if malli is on the classpath.

So if you have some simple examples and test cases and could upload those along with some of the code, specifically the usage of malli and the mapping into arrow-schema-land, I could take it from there and get the low level read/write work done.

It would be good to have this done in a solid minimal way and then we can provide similar pathways for duckdb so at least the interface and potentially the datatype detection layers are shared between tmducken and tmd's arrow pathways.

@archaic
Author

archaic commented Apr 20, 2024

I will write a more detailed response and provide some code over the next few days, but essentially I want to be able to write arbitrary Clojure data that has some structure into Arrow (similar to what xtdb v2 does).

For example a column with {[42 86 95] 93.2, [88 104 23] 0.3 ... } as {uint8{3,} float32 ...} or {[45 86 95] [103 32], [991 42 58] [88 14], ...} as {uint16{3}, uint8{2} ...} etc. I played around with using both Map and Structs but actually ended up creating separate vectors for the keys and values and adding appropriate metadata (perhaps structs containing "keys" and "vals" would be more appropriate ..., from memory I don't think the arrow Map interface worked well for arbitrary data).
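
A minimal sketch of the "separate vectors for keys and values" layout, in plain Clojure with no Arrow dependency. The function name and the `:offsets`/`:keys`/`:vals` result shape are assumptions for illustration, not the code described above.

```clojure
(defn split-map-column
  "Flattens a sequence of maps (one map per row) into parallel key and value
  vectors. Row i owns the entries in the half-open range
  [(offsets i), (offsets (inc i))). A nil row contributes no entries."
  [rows]
  {:offsets (vec (reductions + 0 (map count rows)))
   :keys    (into [] (mapcat keys) rows)
   :vals    (into [] (mapcat vals) rows)})

(split-map-column [{[42 86 95] 93.2}
                   {[88 104 23] 0.3}])
;; => {:offsets [0 1 2],
;;     :keys    [[42 86 95] [88 104 23]],
;;     :vals    [93.2 0.3]}
```

`keys` and `vals` called on the same map return entries in the same order, so position n in `:keys` always pairs with position n in `:vals`.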

This would be the type of schema I have in malli:
{:foo/keys [:sequential [:int {:min 1 :max 160}]]
 :foo/vals [:maybe [:double {:min 0}]]}

Then the keys would be dispatched to the appropriate uint8 writer.
I also found that malli works well with :maybe for handling nil values.
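
One way the dispatch from schema bounds to a writer type could work is sketched below. It walks the schema in its plain vector form and does not call malli itself; `int-types`, `narrowest-int-type`, `schema->column-type`, and the returned map shape are hypothetical names for this sketch, and mapping `:double` to `:float64` is an assumption. Only the four schema tags used in the example above are handled.

```clojure
;; Candidate integer types, narrowest first, with inclusive bounds.
(def int-types
  [[:uint8 0 255]
   [:int8 -128 127]
   [:uint16 0 65535]
   [:int16 -32768 32767]
   [:uint32 0 4294967295]
   [:int32 -2147483648 2147483647]])

(defn narrowest-int-type
  "Picks the first type whose range covers [:min, :max]; falls back to
  :int64 when either bound is missing or nothing narrower fits."
  [{lo :min hi :max}]
  (or (when (and lo hi)
        (some (fn [[t tlo thi]] (when (<= tlo lo hi thi) t)) int-types))
      :int64))

(defn schema->column-type
  "Translates a small subset of malli's vector syntax into a type description."
  [schema]
  (let [[tag & args] (if (keyword? schema) [schema] schema)]
    (case tag
      :int        {:datatype (narrowest-int-type (first (filter map? args)))}
      :double     {:datatype :float64}
      :maybe      (assoc (schema->column-type (last args)) :nullable? true)
      :sequential {:list-of (schema->column-type (last args))})))

(schema->column-type [:sequential [:int {:min 1 :max 160}]])
;; => {:list-of {:datatype :uint8}}

(schema->column-type [:maybe [:double {:min 0}]])
;; => {:datatype :float64, :nullable? true}
```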

@margintop15px

Hello guys! I was looking for exactly that feature as well. I need to write and read data into/from Arrow with nested data structures like maps or vectors. I see there hasn't been much activity on this topic recently. Maybe I can help somehow with this feature? Also, I noticed that TMD is using a fairly old version of arrow vector. Maybe we can update it along the way?

@cnuernber
Collaborator

I think lists of a fixed datatype are supported but maps and structs are not at this point. Updating the arrow vector dependency won't fix this, as we don't use the Arrow serialization system, just its schema system.

@margintop15px

Yeah, I saw that lists are supported, but only for reading Arrow. Right now it's impossible to write a dataset like this:
(-> (ds/->dataset [{:id 1 :name "Alice" :score 95.5 :features [1 2 3]}
                   {:id 2 :name "Bob" :score 85.0 :features [3 4 5]}])
    (arrow/dataset->stream! "tmp/dataset-example.arrow"))

this code will throw
Execution error (IllegalArgumentException) at tech.v3.libs.arrow/datatype->field-type (arrow.clj:630). No matching clause: :persistent-vector
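
The message has the shape that Clojure's `case` produces when no clause matches and no default is given. That suggests (an assumption based on the message alone, not on reading arrow.clj) that the datatype-to-field mapping simply has no branch for `:persistent-vector`. The toy function below is a hypothetical stand-in, not the real `datatype->field-type`, and its result keywords are made up.

```clojure
;; Toy stand-in for illustration only; NOT the library's implementation.
(defn toy-datatype->field-type
  [datatype]
  (case datatype
    :int64   :toy-int64-field
    :float64 :toy-float64-field
    :string  :toy-utf8-field))

(try
  (toy-datatype->field-type :persistent-vector)
  (catch IllegalArgumentException e
    (.getMessage e)))
;; => "No matching clause: :persistent-vector"
```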
