Change TypeDataset#apply syntax to use a function #110

jeremyrsmith · 2017-02-18T02:53:40Z

Before/After:

case class Foo(a: String, b: Int)
case class Bar(foo: Foo, c: Double)

val ds = TypedDataset.create(spark.createDataset(Seq.empty[Bar]))

// After:
val a = ds.select(ds(_.foo.a))
val b = ds.select(ds(_.foo.b))
val c = ds.select(ds(_.c))

// Before:
val a = ??? // no equivalent
val b = ??? // no equivalent
val c = ds.select(ds('c))

I am proposing this change because:

1. It typechecks before any macros are invoked
2. It makes editors like IntelliJ happy (because of point 1)
3. It seems more idiomatic to me than using Symbols
4. It allows specifying columns of nested structures, which is impossible with the current syntax

The downside is that a macro is used. But, a macro is used (indirectly, through Witness.apply) for the existing syntax anyway. Also, that syntax uses implicit conversions, which the proposed syntax doesn't. The macro introduced for the new syntax is reasonably uncomplicated.

I changed apply and left col intact, so that both syntaxes would be available. If this change is distasteful (understandable due to the BC break), it could be renamed to something other than apply. However, I really think this syntax is strictly more powerful than what's currently used for apply, and should be the default behavior.


imarios · 2017-02-18T03:57:03Z

Hey @jeremyrsmith !
Love the syntax!

Minor correction about the nested field access:

// It is currently supported using colMany
val a = ds.select( ds.colMany('foo, 'a) )
val b = ds.select( ds.colMany('foo, 'b) )

One quick question:
val c = ds.select(ds(_.c))

In this code segment, does the macro know that we are only accessing field c (so field foo is not needed)?

imarios · 2017-02-18T04:13:13Z

As for the backwards-compatibility comment, I don't think we really mind that much right now :)

jeremyrsmith · 2017-02-18T17:33:21Z

@imarios I'm not quite sure what you're asking, but

ds.select(ds.col('c))

is exactly equivalent to

ds.select(ds(_.c))

since in the end they both essentially expand to

ds.select(new TypedColumn[Double](ds.dataset.col("c")))

jeremyrsmith · 2017-02-18T17:34:42Z

CI is failing because I didn't think to go through and update tut. 🤦‍♂️

imarios · 2017-02-18T18:59:06Z

Does the ds.select( 10.0 * ds(_.c) ) syntax work? That is, when you apply operations to the columns.

With @OlivierBlanvillain, we have been working on alternative syntax for select:

case class Foo(a: Int, b: Int, c: Double, d: Boolean)
val ds: TypedDataset[Foo] = ...
val x: TypedDataset[(Int, Int, Double)]  = ds.select( t => (t.a, t.b, 10 * t.c) )

The above would select columns a, b, and c multiplied by 10.

The benefit here shows up when you have chained operations:

ds.select( t => (t.a, t.b, 10 * t.c) ).select( t => (t._1 + t._2, t._3) )

Note that the second select can NO longer reference ds, since ds refers to a different TypedDataset.
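
As a rough illustration of the chaining point (not frameless code), the same shape can be written with plain Scala collections, using `map` as a stand-in for the proposed function-based select. `ChainedSelectSketch` and the sample row are made up for the example; only `Foo` comes from the snippet above.

```scala
// Plain-collections analogy: `map` stands in for the proposed function-based select.
// Nothing here touches Spark or frameless.
case class Foo(a: Int, b: Int, c: Double, d: Boolean)

object ChainedSelectSketch {
  // Hypothetical sample data for the illustration.
  val rows: Seq[Foo] = Seq(Foo(1, 2, 0.5, true))

  // First "select": keep a and b, and c multiplied by 10.
  val step1: Seq[(Int, Int, Double)] = rows.map(t => (t.a, t.b, 10 * t.c))

  // Second "select": refers only to the tuple produced by the first step,
  // so there is no need (and no way) to go back through the original value.
  val step2: Seq[(Int, Double)] = step1.map(t => (t._1 + t._2, t._3))

  def main(args: Array[String]): Unit = println(step2)
}
```

In this analogy each lambda only sees the row type of the dataset it is applied to, which is the property the chained syntax is after.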

We tried to do this by relying only on implicit resolution (no macros). I made decent progress, but it required so much overloading that it cluttered the code to the point where I didn't want to proceed further. I am wondering how hard it would be to retrofit what you have to support this ...

codecov-io · 2017-02-18T19:46:18Z

Codecov Report

❗ No coverage uploaded for pull request base (master@048d06c).
The diff coverage is 93.75%.

@@            Coverage Diff            @@
##             master     #110   +/-   ##
=========================================
  Coverage          ?   94.89%           
=========================================
  Files             ?       35           
  Lines             ?      686           
  Branches          ?        9           
=========================================
  Hits              ?      651           
  Misses            ?       35           
  Partials          ?        0

| Impacted Files | Coverage Δ |
|---|---|
| ...ataset/src/main/scala/frameless/TypedDataset.scala | `93.16% <ø> (ø)` |
| dataset/src/main/scala/frameless/TypedColumn.scala | `95.83% <ø> (ø)` |
| ...src/main/scala/frameless/column/ColumnMacros.scala | `93.75% <93.75%> (ø)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 048d06c...ff238a0.


jeremyrsmith · 2017-02-18T20:16:27Z

@imarios I see what you're asking. It won't work (currently) to do operations inside of the column selection function. I added an illTyped to demonstrate this in ColTests. But it does work the same for operations outside of the column selection function.

I think it would be really cool to support expressions inside of the function, but it will get really tricky and a lot of stuff will end up being moved into macros. I have another project I've been slowly working on, which is a PoC "general relational monad" that does similar stuff with macros. It is certainly possible (and would be way easier when constrained to Spark and not being monadic), but it does wind up being less clear what's going on.

To do what you suggested would (with proposed changes) look like:

ds.select(ds(_.a), ds(_.b), ds(_.c) * 10)

which is pretty much isomorphic to the current

ds.select(ds('a), ds('b), ds('c) * 10)

With the benefit that each TypedColumn is being typechecked before any implicit resolution takes place at all. Also, since you don't need any Exists instance (the fact that the function typechecks proves the same thing as Exists) you will see much faster compile times.

jeremyrsmith · 2017-02-18T20:22:17Z

@imarios I guess you could actually accomplish this with a small macro, and leaving the rest to TypedColumn. If you added selectExpr macro methods that expanded

ds.selectExpr(t => (t.a, t.b, t.c * 20))

into

ds.select(ds(_.a), ds(_.b), ds(_.c) * 20)

then it wouldn't be much work for the selectExpr macro and would defer everything else to a combination of the apply macro and the implicit resolution that TypedColumn ops do. If you think that would be cool I'm happy to take a crack at it.

(note that you would have to name this selectExpr or something different than select, because it expands to an invocation of select and if you have an overload there then you'll lose type inference benefits of the one that takes a function)

kanterov · 2017-02-19T13:51:11Z

Looks awesome, going to review in more detail in the following week.

By the way, is it possible to provide such syntax:

trait TypedDataset[A] {
  object column { 
    // somehow fake Intellij to typecheck column as `A`
    def applyDynamic(...) = macro   
  }
}

case class Person(name: String)

val df: TypedDataset[Person] = ???
val c: TypedColumn[Person, String] = df.column.name

Update: Got it, it would not work because IntelliJ would typecheck df.column.name to String not TypedColumn[Person, String].
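
For reference, Scala rewrites a field-style access such as `df.column.name` on a `scala.Dynamic` receiver to `selectDynamic("name")` (`applyDynamic` is the hook for method-call syntax). Below is a minimal runtime-only sketch of that desugaring, with no macro and no compile-time checking. `ToyDataset`, `ColumnRef` and `DynamicColumnSketch` are hypothetical stand-ins, not frameless types.

```scala
import scala.language.dynamics

// Hypothetical stand-in for a column reference; not frameless.TypedColumn.
final case class ColumnRef(name: String)

// Hypothetical stand-in for a dataset that only knows its field names.
class ToyDataset(fields: Set[String]) {
  object column extends Dynamic {
    // `ds.column.foo` is rewritten by the compiler to `ds.column.selectDynamic("foo")`.
    def selectDynamic(field: String): ColumnRef = {
      require(fields.contains(field), s"no such column: $field")
      ColumnRef(field)
    }
  }
}

object DynamicColumnSketch {
  def main(args: Array[String]): Unit = {
    val df = new ToyDataset(Set("name"))
    val c: ColumnRef = df.column.name
    println(c)
  }
}
```

In this sketch an unknown field is only rejected at runtime; a macro-backed version along the lines discussed here would presumably move that check to compile time.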

jeremyrsmith · 2017-02-19T17:00:24Z

@kanterov You could have that syntax, though. It just wouldn't typecheck as "cleanly" as the function syntax. It would work in the end, though IntelliJ would probably still hate it.

OlivierBlanvillain

Nicer syntax with IntelliJ support? 💯 x 👍

OlivierBlanvillain · 2017-02-22T12:22:46Z

dataset/src/main/scala/frameless/column/ColumnMacros.scala

+
+  // could be used to reintroduce apply('foo)
+  // $COVERAGE-OFF$ Currently unused
+  def fromSymbol[A : WeakTypeTag, B : WeakTypeTag](selector: c.Expr[scala.Symbol])(encoder: c.Expr[TypedEncoder[B]]): Tree = {


Why would we ever want to use this macro instead of the shapeless one?

jeremyrsmith · 2017-02-22

The reason I left this here was in case we wanted to support either ds('a) or ds(_.a) at the same time. We can't do that with overloading, because it will ruin type inference for the function syntax. So if we really wanted to allow both, I thought we could have the macro figure it all out instead.

There are other problems with this, though - I would prefer to just embrace the function syntax because it has better type inference and about 95% smaller bytecode (after implicit expansion is all said and done).

OlivierBlanvillain · 2017-02-22T12:27:05Z

dataset/src/main/scala/frameless/column/ColumnMacros.scala

+    val B = weakTypeOf[B].dealias
+
+    val selectorStr = selector.tree match {
+      case Function(List(ValDef(_, ArgName(argName), argTyp, _)), body) => body match {


I'm not sure this reads better than

case Function(List(ValDef(_, name, argTyp, _)), body) =>
  NameExtractor(name).unapply(body) match {
    case Some(strs) => strs.mkString(".")
    case None => fail(other)
  }

OlivierBlanvillain · 2017-04-24T18:43:44Z

@kanterov would you like to review?

imarios · 2017-05-18T07:59:52Z

@jeremyrsmith @kanterov @OlivierBlanvillain this is probably the first feature to go in for 0.4.

Voltir · 2017-09-21T14:53:47Z

I don't suppose you mind merging this and cutting a snapshot of 0.4?

jeremyrsmith · 2017-09-21T16:39:37Z

@Voltir I'm not sure this feature is really going to make it in. I think we've coalesced toward a pure typeclass-based approach to API and type safety; adding macros to the mix goes against that and is burdensome in other ways.

@imarios @OlivierBlanvillain should I go ahead and close this PR?

OlivierBlanvillain · 2017-09-21T17:20:18Z

If someone takes the time to rebase this PR I think we should get it in! What's proposed here is basically syntactic sugar for column selection, which is more compiler (intellij/scalac) friendly than shapeless' LabelledGeneric + implicit resolution. @Voltir would you like to give it a shot?

Voltir · 2017-09-21T17:38:38Z

Sure - I can probably donate some work time for this, as there is a lot of interest here to make frameless work, and tooling / intellij friendliness would go a long way.

jeremyrsmith · 2017-09-21T17:50:51Z

I had to do a merge commit to resolve the conflicts, but it's mergeable now. We'll see if CI still goes through.

jeremyrsmith · 2017-09-21T17:53:02Z

@OlivierBlanvillain @imarios @kanterov do you want to give this another look while CI is going? It is a pretty major change to the API - anyone who uses df('col) syntax will have to change their code.

Just want to be sure you're sure you want the change :)

iravid · 2017-09-21T17:57:15Z

I love the syntax and it's great when working on Datasets with dedicated types, but as I understand it, it doesn't accommodate the use case of an ad-hoc dataset, for example one that results from adding a computed column using select.

It's usually pretty convenient to create intermediate datasets without creating dedicated case classes for them.

jeremyrsmith · 2017-09-21T18:01:34Z

@iravid In that case wouldn't you have a dataset of tuples? So you'd say ds(_._1) just like now you say ds('_1).

iravid · 2017-09-21T18:03:21Z

Yes, that's true - I was actually jumping a few logical steps here; opening a new issue to discuss this, but the gist is that you'd have to introduce an intermediate val for the intermediate dataset in that case.

joroKr21 · 2017-09-21T18:23:24Z

@jeremyrsmith 👍 for the feature, but I would like to see a check that you're selecting a field so that things like x4(_.toString) or tuple2s(_.swap) don't compile. Something like this:

def isField(sym: TermSymbol) = sym.isCaseAccessor || sym.isGetter
case Select(Self(strs), nested) if isField(tree.symbol.asTerm) => ...
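
The two predicates in that snippet also exist on runtime-reflection term symbols, so the intended distinction can be tried outside a macro. A small sketch, assuming Scala 2 with scala-reflect on the classpath. `Foo.shout`, `member` and `FieldCheckSketch` are made up for the illustration, and the check that ends up in the PR may differ.

```scala
import scala.reflect.runtime.universe._

case class Foo(a: String, b: Int) {
  // Hypothetical non-field member, used as a negative case.
  def shout: String = a.toUpperCase
}

object FieldCheckSketch {
  // Same predicate as in the comment above, applied to runtime symbols.
  def isField(sym: TermSymbol): Boolean = sym.isCaseAccessor || sym.isGetter

  // Look up a public member of Foo by name.
  def member(name: String): TermSymbol =
    typeOf[Foo].member(TermName(name)).asTerm

  def main(args: Array[String]): Unit = {
    println(isField(member("a")))     // case-class field accessor
    println(isField(member("shout"))) // ordinary method
  }
}
```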

jeremyrsmith · 2017-09-21T19:13:26Z

@joroKr21 added a check for isCaseAccessorLike

OlivierBlanvillain · 2017-09-25T06:51:42Z

dataset/src/main/scala/frameless/column/ColumnMacros.scala

+    val TEEObj = reify(TypedExpressionEncoder)
+
+    val datasetCol = c.typecheck(
+      q"${c.prefix}.dataset.col($selectorStr).as[$B]($TEEObj.apply[$B]($encoder))"


I don't understand why you need an .as here

jeremyrsmith · 2017-09-25

Hmm.. it's to go from o.a.s.s.Column to o.a.s.s.TypedColumn. But you're right, it looks like you can make a frameless.TypedColumn from an ordinary Column. Can't remember what I thought the advantage would be in doing this.

OlivierBlanvillain · 2017-09-26

I see, if it's Spark's .as then it's not a problem! I thought it was one of ours that triggers implicit search & co, but it's not.

OlivierBlanvillain · 2017-09-26T17:29:53Z

LGTM 👍

OlivierBlanvillain · 2017-09-26T17:33:20Z

BTW @jeremyrsmith I remember you saying that you were not really convinced that we should get this PR in. Could you elaborate on your thoughts?

kanterov · 2017-09-27T08:55:32Z

Looks good. Do you think it makes sense to have something similar to #187, i.e. expr: (A => B) => TypedColumn[A, B], that would use this macro?

ayoub-benali · 2018-02-12T22:06:34Z

Are there any plans to get this PR through ?

imarios · 2018-02-12T22:08:54Z

Hey @jeremyrsmith, do you remember what we concluded here?

kyprifog · 2018-12-14T20:20:17Z

@jeremyrsmith I am curious what the status of this is. I am currently trying to decide if I should ditch IntelliJ because the frameless project I'm working on totally confuses it.

imarios · 2018-12-17T21:01:31Z

@kyprifog I do use IntelliJ, but only for navigating and searching through code. If you want it for compiling and running tests, that might be hard. I typically use IntelliJ up to the point where I want to run tests, and then I switch to sbt.

ayoub-benali · 2018-12-17T21:07:02Z

but this syntax change has other benefits than making IntelliJ happy

imarios · 2018-12-17T21:18:28Z

@ayoub-benali I agree. I think this PR can be a great contribution for someone that wants to play around with Macros. I am sure @jeremyrsmith wouldn't mind anyone taking this and creating a new PR for it.

jeremyrsmith · 2018-12-18T01:04:55Z

@imarios of course I wouldn't mind! I kind of got the impression that frameless wanted to stay macro-free, but if anyone wants to take a crack at making this mergeable again it's 👍 by me. Keep in mind that in approximately 100 years when Spark starts supporting Scala 3, it will have to change 😆

kujon · 2018-12-21T17:48:41Z

Biiiiig 👍 for it working with IntelliJ!

kyprifog · 2019-01-08T21:29:19Z

It would be really nice to get this merged 👍

niebloomj · 2019-07-16T15:59:20Z

Is this PR dead? This sounds really cool.

jeremyrsmith · 2019-07-16T17:31:00Z

@niebloomj I think this PR is probably dead. The main advantages (better IDE type inference and faster compile times) have probably largely been obviated by improvements to IntelliJ and the compiler (I'm not sure though, as I don't get to use frameless day-to-day).

I'm just going to go ahead and close this; if someone wants to take the branch and fix it up and get it approved, feel free to re-open this PR or create a new one.

Update tuts for new syntax

82f1ba7

Coverage

fa0ce2f

* Added failure test case for unusable function * Added scoverage tags for known-unreachable branches

jeremyrsmith mentioned this pull request Feb 19, 2017

Experiment with new function syntax #60

Closed

jeremyrsmith requested review from kanterov, OlivierBlanvillain and adelbertc February 19, 2017 01:34

jeremyrsmith mentioned this pull request Feb 20, 2017

Select expression #112

Closed

OlivierBlanvillain approved these changes Feb 22, 2017


Clean up macro

6909cf5

imarios added the feature label May 17, 2017

imarios added this to the 0.4-release milestone May 18, 2017

Merge remote-tracking branch 'origin/master' into function-column-syntax

693fa9c

jeremyrsmith added 2 commits September 21, 2017 12:05

Update tests to new syntax

72171d9

Add additional check for isCaseAccessorLike

ff238a0

iravid mentioned this pull request Sep 22, 2017

Add a typed col function for creating column references #187

Open

OlivierBlanvillain reviewed Sep 25, 2017


imarios mentioned this pull request Sep 25, 2017

Provide a way to construct a TypedColumn reference from a Symbol #186

Open

frosforever mentioned this pull request Dec 3, 2017

[FEEDBACK WANTED] Don't create column from dataset #216

Closed

jeremyrsmith closed this Jul 16, 2019

cchantep deleted the function-column-syntax branch March 29, 2023 19:56
