Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change TypeDataset#apply syntax to use a function #110

Closed
wants to merge 7 commits into from

Conversation

jeremyrsmith
Copy link
Contributor

Before/After:

case class Foo(a: String, b: Int)
case class Bar(foo: Foo, c: Double)

val ds = TypedDataset.create(spark.createDataset(Seq.empty[Bar]))

// After:
val a = ds.select(ds(_.foo.a))
val b = ds.select(ds(_.foo.b))
val c = ds.select(ds(_.c))

// Before:
val a = ??? // no equivalent
val b = ??? // no equivalent
val c = ds.select(ds('c))

I am proposing this change because:

  1. It typechecks before any macros are invoked
  2. It makes editors like IntelliJ happy (because of Add a Gitter chat badge to README.md #1)
  3. It seems more idiomatic to me than using Symbols
  4. It allows specifying columns of nested structures, which is impossible with the current syntax

The downside is that a macro is used. But, a macro is used (indirectly, through Witness.apply) for the existing syntax anyway. Also, that syntax uses implicit conversions, which the proposed syntax doesn't. The macro introduced for the new syntax is reasonably uncomplicated.

I changed apply and left col intact, so that both syntaxes would be available. If this change is distasteful (understandable due to the BC break), it could be renamed to something other than apply. However, I really think this syntax is strictly more powerful than what's currently used for apply, and should be the default behavior.

Before/After:

```scala
case class Foo(a: String, b: Int)
case class Bar(foo: Foo, c: Double)

val ds = TypedDataset.create(spark.createDataset(Seq.empty[Bar]))

// After:
val a = ds.select(ds(_.foo.a))
val b = ds.select(ds(_.foo.b))
val c = ds.select(ds(_.c))

// Before:
val a = ??? // no equivalent
val b = ??? // no equivalent
val c = ds.select(ds('c))
```

I am proposing this change because:

1. It typechecks before any macros are invoked
2. It makes editors like IntelliJ happy (because of #1)
3. It seems more idiomatic to me than using `Symbol`s
4. It allows specifying columns of nested structures, which is
   impossible with the current syntax

The downside is that a macro is used. But, a macro is used (indirectly,
through Witness.apply) for the existing syntax anyway. Also, that syntax
uses implicit conversions, which the proposed syntax doesn't. The macro
introduced for the new syntax is reasonably uncomplicated.

I changed `apply` and left `col` intact, so that both syntaxes would be
available. If this change is distasteful (understandable due to the BC
break), it could be renamed to something other than `apply`. However, I
really think this syntax is strictly more powerful than what's currently
used for `apply`, and should be the default behavior.
@imarios
Copy link
Contributor

imarios commented Feb 18, 2017

Hey @jeremyrsmith !
Love the syntax!

Minor correction about the nested field access:

// It is currently supported using colMany
val a = ds.select( ds.colMany('foo, 'a) )
val b = ds.select( ds.colMany('foo, 'b) )

One quick question:
val c = ds.select(ds(_.c))

In this code segment, does the macro know's that we are only accessing field c (so field foo is not needed)?

@imarios
Copy link
Contributor

imarios commented Feb 18, 2017

As for backwards compatibility comment, I don't think we really mind that that much right now :)

@jeremyrsmith
Copy link
Contributor Author

@imarios I'm not quite sure what you're asking, but

ds.select(ds.col('c))

is exactly equivalent to

ds.select(ds(_.c))

since in the end they both essentially expand to

ds.select(new TypedColumn[Double](ds.dataset.col("c")))

@jeremyrsmith
Copy link
Contributor Author

CI is failing because I didn't think to go through and update tut. 🤦‍♂️

@imarios
Copy link
Contributor

imarios commented Feb 18, 2017

Does the ds.select( 10.0 * ds(_.c) ) syntax works? That is, when you apply operations to the columns.

With @OlivierBlanvillain, we have been working on alternative syntax for select:

case class Foo(a: Int, b: Int, c: Double, d: Boolean)
val ds: TypedDataset[Foo] = ...
val x: TypedDataset[(Int, Int, Double)]  = ds.select( t => (t.a, t.b, 10 * t.c) ) 

The above would select columns a, b, and c multiplied by 10.

The benefit here shows up when you have chained operations:

ds.select( t => (t.a, t.b, 10 * t.c) ).select( t => (t._1 + t._2, t._3) )

Note that the second select CANNOT longer reference ds since that refers to a different TypedDataset.

We tried to do this by just relying on implicit resolution (no macros). I made enough progress but it required so much overloading that clattered the code to a point I didn't want to proceed further. I am wondering how hard it will be to retrofit what you have to support this ...

@codecov-io
Copy link

codecov-io commented Feb 18, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@048d06c). Click here to learn what that means.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #110   +/-   ##
=========================================
  Coverage          ?   94.89%           
=========================================
  Files             ?       35           
  Lines             ?      686           
  Branches          ?        9           
=========================================
  Hits              ?      651           
  Misses            ?       35           
  Partials          ?        0
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 93.16% <ø> (ø)
dataset/src/main/scala/frameless/TypedColumn.scala 95.83% <ø> (ø)
...src/main/scala/frameless/column/ColumnMacros.scala 93.75% <93.75%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 048d06c...ff238a0. Read the comment docs.

* Added failure test case for unusable function
* Added scoverage tags for known-unreachable branches
@jeremyrsmith
Copy link
Contributor Author

jeremyrsmith commented Feb 18, 2017

@imarios I see what you're asking. It won't work (currently) to do operations inside of the column selection function. I added an illTyped to demonstrate this in ColTests. But it does work the same for operations outside of the column selection function.

I think it would be really cool to support expressions inside of the function, but it will get really tricky and a lot of stuff will end up being moved into macros. I have another project I've been slowly working on, which is a PoC "general relational monad" that does similar stuff with macros . It is certainly possible (and would be way easier when constrained to Spark and not being monadic) but it does wind up being less clear what's going on.

To do what you suggested would (with proposed changes) look like:

ds.select(ds(_.a), ds(_.b), ds(_.c) * 10) 

which is pretty much isomorphic to the current

ds.select(ds('a), ds('b), ds('c) * 10) 

With the benefit that each TypedColumn is being typechecked before any implicit resolution takes place at all. Also, since you don't need any Exists instance (the fact that the function typechecks proves the same thing as Exists) you will see much faster compile times.

@jeremyrsmith
Copy link
Contributor Author

jeremyrsmith commented Feb 18, 2017

@imarios I guess you could actually accomplish this with a small macro, and leaving the rest to TypedColumn. If you added selectExpr macro methods that expanded

ds.selectExpr(t => (t.a, t.b, t.c * 20))

into

ds.select(ds(_.a), ds(_.b), ds(_.c) * 20)

then it wouldn't be much work for the selectExpr macro and would defer everything else to a combination of the apply macro and the implicit resolution that TypedColumn ops do. If you think that would be cool I'm happy to take a crack at it.

(note that you would have to name this selectExpr or something different than select, because it expands to an invocation of select and if you have an overload there then you'll lose type inference benefits of the one that takes a function)

@kanterov
Copy link
Contributor

kanterov commented Feb 19, 2017

Looks awesome, going to review in a more detail in following week.

By the way, is it possible to provide such syntax:

trait TypedDataset[A] {
  object column { 
    // somehow fake Intellij to typecheck column as `A`
    def applyDynamic(...) = macro   
  }
}
case class Person(name: String)

val df: TypedDataset[Person] = ???
val c: TypedColumn[Person, String] = df.column.name

Update: Got it, it would not work because IntelliJ would typecheck df.column.name to String not TypedColumn[Person, String].

@jeremyrsmith
Copy link
Contributor Author

@kanterov You could have that syntax, though. It just wouldn't typecheck as "cleanly" as the function syntax. It would work in the end, though IntelliJ would probably still hate it.

@jeremyrsmith jeremyrsmith mentioned this pull request Feb 20, 2017
Copy link
Contributor

@OlivierBlanvillain OlivierBlanvillain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicer syntax with IntelliJ support? 💯 x 👍


// could be used to reintroduce apply('foo)
// $COVERAGE-OFF$ Currently unused
def fromSymbol[A : WeakTypeTag, B : WeakTypeTag](selector: c.Expr[scala.Symbol])(encoder: c.Expr[TypedEncoder[B]]): Tree = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we ever want to use this macro instead of the shapeless one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I left this here was in case we wanted to support either ds('a) or ds(_.a) at the same time. We can't do that with overloading, because it will ruin type inference for the function syntax. So if we really wanted to allow both, I thought we could have the macro figure it all out instead.

There are other problems with this, though - I would prefer to just embrace the function syntax because it has better type inference and about 95% smaller bytecode (after implicit expansion is all said and done).

val B = weakTypeOf[B].dealias

val selectorStr = selector.tree match {
case Function(List(ValDef(_, ArgName(argName), argTyp, _)), body) => body match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this reads better than

case Function(List(ValDef(_, name, argTyp, _)), body) =>
  NameExtractor(name).unapply(body) match {
    case Some(strs) => strs.mkString(".")
    case None       => fail(other)
  }

@OlivierBlanvillain
Copy link
Contributor

OlivierBlanvillain commented Apr 24, 2017

@kanterov would you like to review?

@imarios
Copy link
Contributor

imarios commented May 18, 2017

@jeremyrsmith @kanterov @OlivierBlanvillain this is probably the first feature to go in for 0.4.

@imarios imarios added this to the 0.4-release milestone May 18, 2017
@Voltir
Copy link

Voltir commented Sep 21, 2017

I don't suppose you mind merging this and cutting a snapshot of 0.4?

@jeremyrsmith
Copy link
Contributor Author

@Voltir I'm not sure this feature is really going to make it in. I think we've coalesced toward a pure typeclass-based approach to API and type safety; adding macros to the mix goes against that and is burdensome in other ways.

@imarios @OlivierBlanvillain should I go ahead and close this PR?

@OlivierBlanvillain
Copy link
Contributor

OlivierBlanvillain commented Sep 21, 2017

If someone takes the time to rebase this PR I think we should get it in! What's proposed here is basically syntactic sugar for column selection, which is more compiler (intellij/scalac) friendly than shapeless' LabelledGeneric + implicit resolution. @Voltir would you like to give it a shot?

@Voltir
Copy link

Voltir commented Sep 21, 2017

Sure - I can probably donate some work time for this, as there is a lot of interest here to make frameless work, and tooling / intellij friendliness would go a long way.

@jeremyrsmith
Copy link
Contributor Author

I had to do a merge commit to resolve the conflicts, but it's mergable now. We'll see if CI still goes through.

@jeremyrsmith
Copy link
Contributor Author

@OlivierBlanvillain @imarios @kanterov do you want to give this another look while CI is going? It is a pretty major change to the API - anyone who uses df('col) syntax will have to change their code.

Just want to be sure you're sure you want the change :)

@iravid
Copy link
Contributor

iravid commented Sep 21, 2017

I love the syntax and it's great when working on Datasets with dedicated types, but as I understand it, it doesn't accomodate the usecase of an ad-hoc dataset - for example, one that results from adding a computed column using select.

It's usually pretty convenient to create intermediate datasets without creating dedicated case classes for them.

@jeremyrsmith
Copy link
Contributor Author

@iravid In that case wouldn't you have a dataset of tuples? So you'd say ds(_._1) just like now you say ds('_1).

@iravid
Copy link
Contributor

iravid commented Sep 21, 2017

Yes, that's true - I was actually jumping a few logical steps here; opening a new issue to discuss this, but the gist is that you'd have to introduce an intermediate val for the intermediate dataset in that case.

@joroKr21
Copy link
Member

@jeremyrsmith 👍 for the feature, but I would like to see a check that you're selecting a field so that things like x4(_.toString) or tuple2s(_.swap) don't compile. Something like this:

def isField(sym: TermSymbol) = sym.isCaseAccessor || sym.isGetter
case Select(Self(strs), nested) if isField(tree.symbol.asTerm) => ...

@jeremyrsmith
Copy link
Contributor Author

@joroKr21 added a check for isCaseAccessorLike

val TEEObj = reify(TypedExpressionEncoder)

val datasetCol = c.typecheck(
q"${c.prefix}.dataset.col($selectorStr).as[$B]($TEEObj.apply[$B]($encoder))"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you need an .as here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. it's to go from o.a.s.s.Column to o.a.s.s.TypedColumn. But you're right, it looks like you can make a frameless.TypedColumn from an ordinary Column. Can't remember what I thought the advantage would be in doing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, if it's Spark's .as then it's not a problem! I thought it was one of ours that triggers implicit search & co, but it's not.

@OlivierBlanvillain
Copy link
Contributor

LGTM 👍

@OlivierBlanvillain
Copy link
Contributor

OlivierBlanvillain commented Sep 26, 2017

BTW @jeremyrsmith I remember you saying that we not really convinced that we should get this PR in, could you elaborate your thoughts?

@kanterov
Copy link
Contributor

Looks good, do you think it makes sense to have similar to #187 expr: (A => B) => TypedColumn[A, B] that would utilize this macro?

@ayoub-benali
Copy link
Contributor

Are there any plans to get this PR through ?

@imarios
Copy link
Contributor

imarios commented Feb 12, 2018

Hey @jeremyrsmith, do you remember what we concluded here?

@kyprifog
Copy link

kyprifog commented Dec 14, 2018

@jeremyrsmith I am curious what the status of this is? I am currently trying to decide if I should ditch intellij because the frameless project i'm working on totally confuses it.

@imarios
Copy link
Contributor

imarios commented Dec 17, 2018

@kyprifog I do use IntelliJ but I only use it for navigating and searching through code. If you want it for compilation and running test then that might be hard. I typically use IntelliJ up to the point where I want to run tests where I then switch to sbt.

@ayoub-benali
Copy link
Contributor

but this syntax change has other benefits than making IntelliJ happy

@imarios
Copy link
Contributor

imarios commented Dec 17, 2018

@ayoub-benali I agree. I think this PR can be a great contribution for someone that wants to play around with Macros. I am sure @jeremyrsmith wouldn't mind anyone taking this and creating a new PR for it.

@jeremyrsmith
Copy link
Contributor Author

@imarios of course I wouldn't mind! I kind of got the impression that frameless wanted to stay macro-free, but if anyone wants to take a crack at making this mergeable again it's 👍 by me. Keep in mind that in approximately 100 years when Spark starts supporting Scala 3, it will have to change 😆

@kujon
Copy link
Contributor

kujon commented Dec 21, 2018

Biiiiig 👍 for it working with IntelliJ!

@kyprifog
Copy link

kyprifog commented Jan 8, 2019

It would be really nice to get this merged 👍

@niebloomj
Copy link

Is this PR dead? This sounds really cool.

@jeremyrsmith
Copy link
Contributor Author

@niebloomj I think this PR is probably dead. The main advantage (better IDE type inference and faster compile times) have probably largely been obviated by improvements to IntelliJ and the compiler (I'm not sure though, as I don't get to use frameless day-to-day).

I'm just going to go ahead and close this; if someone wants to take the branch and fix it up and get it approved, feel free to re-open this PR or create a new one.

@cchantep cchantep deleted the function-column-syntax branch March 29, 2023 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.