Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic reads #132

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

robinbernon
Copy link
Contributor

Adding ability to read a subset of parquet data dynamically.

Copy link
Member

@alecmocatta alecmocatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,4 +1,4 @@
use hashlink::LinkedHashMap;
use hashlink::linked_hash_map::LinkedHashMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was getting some strange issues due to double struct use in hashlink lib.rs. Found this fix was necessary to stop the given issue when compiling.

/// Predicate for [`Group`]s
pub struct GroupPredicate(
/// Map of field names to predicates for the fields in the group
pub(super) LinkedHashMap<String, Option<<Value as ParquetData>::Predicate>, FxBuildHasher>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I realise I overlooked when you messaged about this. This would have a perf regression. Looks like it might be a simple fix to hashlink, would you be able to make that PR? We can patch until your PR is merged and a new release published:

[patch.crates-io]
hashlink = { version = "0.6", git = "...", branch = "..." }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have made the fix for this and submitted a PR: kyren/hashlink#7

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to evaluate ritelinked ? This is a version derived from hashlink . We use griddle to reduce the tail delay and have solved the serialization problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh sure, happy to have a look :) Do you have any example code available for a common use case you might encounter that wouldn't work directly with LinkedHashMap or BTreeMap?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robinbernon You can directly replace hashlink with ritelinked without any additional changes, and the APIs of the two are compatible. In fact, ritelinked is a hashlink that only provides LinkedHashMap and LinkedHashSet, we just did some work to make it more usable .

@@ -1,11 +1,6 @@
use amadeus::prelude::*;
use serde::{Deserialize, Serialize};

#[derive(Data, Clone, PartialEq, Debug)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry have put it back now, currently experiencing an issue with the predicate derives - Have got the specific issue highlighted in the following branch: https://github.com/robinbernon/amadeus/blob/temp2/amadeus-derive/src/lib.rs#L279-L290.

@mergify mergify bot dismissed alecmocatta’s stale review March 30, 2021 11:58

Pull request has been modified.

@alecmocatta
Copy link
Member

The helpers here unfortunately break amadeus on stable. Ideally they impl serde_closure::traits::FnMut (which work on stable) rather than std::ops::FnMut (which don't yet). Implementing that is currently pretty rough

FnMutNamed! {
pub type Closure<P, Row, E> = |self|partition=> P| -> Output<P, Row, E>
where
P: Partition,
Row: SerdeData,
E: 'static
{
I've been meaning to write another proc macro for serde_closure to make this easier but haven't had a chance yet.

The other alternative is to make helpers nightly-only, and hide it from the docs. What do you think?

@alecmocatta alecmocatta force-pushed the dynamic_reads branch 2 times, most recently from a19053b to b312e54 Compare April 5, 2021 12:45
@robinbernon
Copy link
Contributor Author

robinbernon commented Apr 8, 2021

The helpers here unfortunately break amadeus on stable. Ideally they impl serde_closure::traits::FnMut (which work on stable) rather than std::ops::FnMut (which don't yet). Implementing that is currently pretty rough

FnMutNamed! {
pub type Closure<P, Row, E> = |self|partition=> P| -> Output<P, Row, E>
where
P: Partition,
Row: SerdeData,
E: 'static
{

I've been meaning to write another proc macro for serde_closure to make this easier but haven't had a chance yet.
The other alternative is to make helpers nightly-only, and hide it from the docs. What do you think?

Issue with impl serde closure instead of normal closure is that it would mean all adapter operations for parallel streams would have to change to using serde closures to support the change. There is currently a noticeable lack of intellisense when working with macros even on the best available IDE's - due to this I'd personally prefer to keep the standard closures in parallel streams as I like having intellisense when working on the various adapter operations etc before moving them into the required serde macros for distributed processing. Any chaance I can include the helpers here as a nightly feature then?

@alecmocatta
Copy link
Member

it would mean all adapter operations for parallel streams would have to change to using serde closures to support the change

Oh right, I see the issue. Yes agreed it's strongly desirable to keep std::ops::Fn* on parallel streams. Given this, if we do merge these helpers they will need to be behind a cfg(nightly).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants