-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change TypeDataset#apply syntax to use a function #110
Conversation
Before/After: ```scala case class Foo(a: String, b: Int) case class Bar(foo: Foo, c: Double) val ds = TypedDataset.create(spark.createDataset(Seq.empty[Bar])) // After: val a = ds.select(ds(_.foo.a)) val b = ds.select(ds(_.foo.b)) val c = ds.select(ds(_.c)) // Before: val a = ??? // no equivalent val b = ??? // no equivalent val c = ds.select(ds('c)) ``` I am proposing this change because: 1. It typechecks before any macros are invoked 2. It makes editors like IntelliJ happy (because of #1) 3. It seems more idiomatic to me than using `Symbol`s 4. It allows specifying columns of nested structures, which is impossible with the current syntax The downside is that a macro is used. But, a macro is used (indirectly, through Witness.apply) for the existing syntax anyway. Also, that syntax uses implicit conversions, which the proposed syntax doesn't. The macro introduced for the new syntax is reasonably uncomplicated. I changed `apply` and left `col` intact, so that both syntaxes would be available. If this change is distasteful (understandable due to the BC break), it could be renamed to something other than `apply`. However, I really think this syntax is strictly more powerful than what's currently used for `apply`, and should be the default behavior.
Hey @jeremyrsmith ! Minor correction about the nested field access: // It is currently supported using colMany
val a = ds.select( ds.colMany('foo, 'a) )
val b = ds.select( ds.colMany('foo, 'b) ) One quick question: In this code segment, does the macro know's that we are only accessing field |
As for backwards compatibility comment, I don't think we really mind that that much right now :) |
@imarios I'm not quite sure what you're asking, but ds.select(ds.col('c)) is exactly equivalent to ds.select(ds(_.c)) since in the end they both essentially expand to ds.select(new TypedColumn[Double](ds.dataset.col("c"))) |
CI is failing because I didn't think to go through and update tut. 🤦♂️ |
Does the With @OlivierBlanvillain, we have been working on alternative syntax for select: case class Foo(a: Int, b: Int, c: Double, d: Boolean)
val ds: TypedDataset[Foo] = ...
val x: TypedDataset[(Int, Int, Double)] = ds.select( t => (t.a, t.b, 10 * t.c) ) The above would select columns a, b, and c multiplied by 10. The benefit here shows up when you have chained operations: ds.select( t => (t.a, t.b, 10 * t.c) ).select( t => (t._1 + t._2, t._3) ) Note that the second select CANNOT longer reference We tried to do this by just relying on implicit resolution (no macros). I made enough progress but it required so much overloading that clattered the code to a point I didn't want to proceed further. I am wondering how hard it will be to retrofit what you have to support this ... |
Codecov Report
@@ Coverage Diff @@
## master #110 +/- ##
=========================================
Coverage ? 94.89%
=========================================
Files ? 35
Lines ? 686
Branches ? 9
=========================================
Hits ? 651
Misses ? 35
Partials ? 0
Continue to review full report at Codecov.
|
@imarios I see what you're asking. It won't work (currently) to do operations inside of the column selection function. I added an I think it would be really cool to support expressions inside of the function, but it will get really tricky and a lot of stuff will end up being moved into macros. I have another project I've been slowly working on, which is a PoC "general relational monad" that does similar stuff with macros . It is certainly possible (and would be way easier when constrained to Spark and not being monadic) but it does wind up being less clear what's going on. To do what you suggested would (with proposed changes) look like: ds.select(ds(_.a), ds(_.b), ds(_.c) * 10) which is pretty much isomorphic to the current ds.select(ds('a), ds('b), ds('c) * 10) With the benefit that each |
@imarios I guess you could actually accomplish this with a small macro, and leaving the rest to ds.selectExpr(t => (t.a, t.b, t.c * 20)) into ds.select(ds(_.a), ds(_.b), ds(_.c) * 20) then it wouldn't be much work for the (note that you would have to name this |
Looks awesome, going to review in a more detail in following week. By the way, is it possible to provide such syntax: trait TypedDataset[A] {
object column {
// somehow fake Intellij to typecheck column as `A`
def applyDynamic(...) = macro
}
} case class Person(name: String)
val df: TypedDataset[Person] = ???
val c: TypedColumn[Person, String] = df.column.name Update: Got it, it would not work because IntelliJ would typecheck |
@kanterov You could have that syntax, though. It just wouldn't typecheck as "cleanly" as the function syntax. It would work in the end, though IntelliJ would probably still hate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicer syntax with IntelliJ support? 💯 x 👍
|
||
// could be used to reintroduce apply('foo) | ||
// $COVERAGE-OFF$ Currently unused | ||
def fromSymbol[A : WeakTypeTag, B : WeakTypeTag](selector: c.Expr[scala.Symbol])(encoder: c.Expr[TypedEncoder[B]]): Tree = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we ever want to use this macro instead of the shapeless one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I left this here was in case we wanted to support either ds('a)
or ds(_.a)
at the same time. We can't do that with overloading, because it will ruin type inference for the function syntax. So if we really wanted to allow both, I thought we could have the macro figure it all out instead.
There are other problems with this, though - I would prefer to just embrace the function syntax because it has better type inference and about 95% smaller bytecode (after implicit expansion is all said and done).
val B = weakTypeOf[B].dealias | ||
|
||
val selectorStr = selector.tree match { | ||
case Function(List(ValDef(_, ArgName(argName), argTyp, _)), body) => body match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this reads better than
case Function(List(ValDef(_, name, argTyp, _)), body) =>
NameExtractor(name).unapply(body) match {
case Some(strs) => strs.mkString(".")
case None => fail(other)
}
@kanterov would you like to review? |
@jeremyrsmith @kanterov @OlivierBlanvillain this is probably the first feature to go in for 0.4. |
I don't suppose you mind merging this and cutting a snapshot of 0.4? |
@Voltir I'm not sure this feature is really going to make it in. I think we've coalesced toward a pure typeclass-based approach to API and type safety; adding macros to the mix goes against that and is burdensome in other ways. @imarios @OlivierBlanvillain should I go ahead and close this PR? |
If someone takes the time to rebase this PR I think we should get it in! What's proposed here is basically syntactic sugar for column selection, which is more compiler (intellij/scalac) friendly than shapeless' |
Sure - I can probably donate some work time for this, as there is a lot of interest here to make frameless work, and tooling / intellij friendliness would go a long way. |
I had to do a merge commit to resolve the conflicts, but it's mergable now. We'll see if CI still goes through. |
@OlivierBlanvillain @imarios @kanterov do you want to give this another look while CI is going? It is a pretty major change to the API - anyone who uses Just want to be sure you're sure you want the change :) |
I love the syntax and it's great when working on Datasets with dedicated types, but as I understand it, it doesn't accomodate the usecase of an ad-hoc dataset - for example, one that results from adding a computed column using It's usually pretty convenient to create intermediate datasets without creating dedicated case classes for them. |
@iravid In that case wouldn't you have a dataset of tuples? So you'd say |
Yes, that's true - I was actually jumping a few logical steps here; opening a new issue to discuss this, but the gist is that you'd have to introduce an intermediate val for the intermediate dataset in that case. |
@jeremyrsmith 👍 for the feature, but I would like to see a check that you're selecting a field so that things like def isField(sym: TermSymbol) = sym.isCaseAccessor || sym.isGetter
case Select(Self(strs), nested) if isField(tree.symbol.asTerm) => ... |
@joroKr21 added a check for |
val TEEObj = reify(TypedExpressionEncoder) | ||
|
||
val datasetCol = c.typecheck( | ||
q"${c.prefix}.dataset.col($selectorStr).as[$B]($TEEObj.apply[$B]($encoder))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you need an .as
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. it's to go from o.a.s.s.Column
to o.a.s.s.TypedColumn
. But you're right, it looks like you can make a frameless.TypedColumn
from an ordinary Column
. Can't remember what I thought the advantage would be in doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, if it's Spark's .as
then it's not a problem! I thought it was one of ours that triggers implicit search & co, but it's not.
LGTM 👍 |
BTW @jeremyrsmith I remember you saying that we not really convinced that we should get this PR in, could you elaborate your thoughts? |
Looks good, do you think it makes sense to have similar to #187 |
Are there any plans to get this PR through ? |
Hey @jeremyrsmith, do you remember what we concluded here? |
@jeremyrsmith I am curious what the status of this is? I am currently trying to decide if I should ditch intellij because the frameless project i'm working on totally confuses it. |
@kyprifog I do use IntelliJ but I only use it for navigating and searching through code. If you want it for compilation and running test then that might be hard. I typically use IntelliJ up to the point where I want to run tests where I then switch to sbt. |
but this syntax change has other benefits than making IntelliJ happy |
@ayoub-benali I agree. I think this PR can be a great contribution for someone that wants to play around with Macros. I am sure @jeremyrsmith wouldn't mind anyone taking this and creating a new PR for it. |
@imarios of course I wouldn't mind! I kind of got the impression that frameless wanted to stay macro-free, but if anyone wants to take a crack at making this mergeable again it's 👍 by me. Keep in mind that in approximately 100 years when Spark starts supporting Scala 3, it will have to change 😆 |
Biiiiig 👍 for it working with IntelliJ! |
It would be really nice to get this merged 👍 |
Is this PR dead? This sounds really cool. |
@niebloomj I think this PR is probably dead. The main advantage (better IDE type inference and faster compile times) have probably largely been obviated by improvements to IntelliJ and the compiler (I'm not sure though, as I don't get to use frameless day-to-day). I'm just going to go ahead and close this; if someone wants to take the branch and fix it up and get it approved, feel free to re-open this PR or create a new one. |
Before/After:
I am proposing this change because:
Symbol
sThe downside is that a macro is used. But, a macro is used (indirectly, through
Witness.apply
) for the existing syntax anyway. Also, that syntax uses implicit conversions, which the proposed syntax doesn't. The macro introduced for the new syntax is reasonably uncomplicated.I changed
apply
and leftcol
intact, so that both syntaxes would be available. If this change is distasteful (understandable due to the BC break), it could be renamed to something other thanapply
. However, I really think this syntax is strictly more powerful than what's currently used forapply
, and should be the default behavior.