diff --git a/build.sbt b/build.sbt
index 7bf008b5c..25c7d02df 100644
--- a/build.sbt
+++ b/build.sbt
@@ -184,7 +184,8 @@ lazy val docs = project
     addCompilerPlugin(
       "org.typelevel" % "kind-projector" % "0.13.2" cross CrossVersion.full
     ),
-    scalacOptions += "-Ydelambdafy:inline"
+    scalacOptions += "-Ydelambdafy:inline",
+    libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
   )
   .dependsOn(dataset, cats, ml)
 
diff --git a/docs/FeatureOverview.md b/docs/FeatureOverview.md
index 66504a129..eba591630 100644
--- a/docs/FeatureOverview.md
+++ b/docs/FeatureOverview.md
@@ -59,6 +59,7 @@ val aptTypedDs2 = aptDs.typed
 ```
 
 ## Typesafe column referencing
+
 This is how we select a particular column from a `TypedDataset`:
 
 ```scala mdoc
@@ -389,7 +390,6 @@ c.select(c.colMany('_1, 'city), c('_2)).show(2).run()
 
 ### Working with collections
 
-
 ```scala mdoc
 import frameless.functions._
 import frameless.functions.nonAggregate._
@@ -416,7 +416,6 @@ in Frameless `explode()` is part of `TypedDataset` and not a function of a colum
 This provides additional safety since more than one `explode()` applied in a single
 statement results in runtime error in vanilla Spark.
 
-
 ```scala mdoc
 val t2 = cityRatio.select(cityRatio('city), lit(List(1,2,3,4)))
 val flattened = t2.explode('_2): TypedDataset[(String, Int)]
@@ -434,8 +433,6 @@ to a single column at a time.
 }
 ```
 
-
-
 ### Collecting data to the driver
 
 In Frameless all Spark actions (such as `collect()`) are safe.
@@ -462,7 +459,6 @@ cityBeds.limit(4).collect().run()
 
 ## Sorting columns
 
-
 Only column types that can be sorted are allowed to be selected for sorting.
 
 ```scala mdoc
@@ -478,7 +474,6 @@ aptTypedDs.orderBy(
 ).show(2).run()
 ```
 
-
 ## User Defined Functions
 
 Frameless supports lifting any Scala function (up to five arguments) to the
@@ -577,7 +572,6 @@ In a DataFrame, if you just ignore types, this would equivelantly be written as:
 bedroomStats.dataset.toDF().filter($"AvgPriceBeds2".isNotNull)
 ```
 
-
 ### Entire TypedDataset Aggregation
 
 We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
@@ -611,7 +605,6 @@ aptds.agg(
 ).show().run()
 ```
 
-
 ## Joins
 
 ```scala mdoc:silent
@@ -646,6 +639,33 @@ withCityInfo.select(
 ).as[AptPriceCity].show().run
 ```
 
+### Chained Joins
+
+Joins, or any similar operation, may be chained using a thrush combinator, which removes the need for intermediate values. Instead of:
+
+```scala mdoc
+val withBedroomInfoInterim = aptTypedDs.joinInner(citiInfoTypedDS)(aptTypedDs('city) === citiInfoTypedDS('name))
+val withBedroomInfo = withBedroomInfoInterim
+  .joinLeft(bedroomStats)(withBedroomInfoInterim.col('_1).field('city) === bedroomStats('city))
+
+withBedroomInfo.show().run()
+```
+
+you can use `thrush` from [mouse](https://github.com/typelevel/mouse):
+
+```scala
+libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
+```
+
+```scala mdoc
+import mouse.all._
+
+val withBedroomInfoChained = aptTypedDs.joinInner(citiInfoTypedDS)(aptTypedDs('city) === citiInfoTypedDS('name))
+  .thrush(interim => interim.joinLeft(bedroomStats)(interim.col('_1).field('city) === bedroomStats('city)))
+
+withBedroomInfoChained.show().run()
+```
+
 ```scala mdoc:invisible
 spark.stop()
 ```
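For anyone unfamiliar with mouse's `thrush`: it is plain reverse function application, i.e. `a.thrush(f) == f(a)` (mouse also exposes it as the `|>` operator). Below is a minimal, self-contained sketch of the combinator, assuming only the Scala standard library; the `ThrushOps` enrichment and `ThrushSketch` object are hypothetical names for illustration, whereas the docs above use the real syntax brought in by `import mouse.all._`.

```scala
// Sketch of a thrush combinator: reverse function application.
// `ThrushOps` is a hypothetical stand-in for mouse's `thrush`/`|>` syntax.
object ThrushSketch extends App {
  implicit class ThrushOps[A](private val a: A) extends AnyVal {
    // a.thrush(f) simply applies f to a, so pipelines read left to right.
    def thrush[B](f: A => B): B = f(a)
  }

  // Chain transformations without naming intermediate values, mirroring how
  // the chained-join example avoids the `withBedroomInfoInterim` binding.
  val result = List(1, 2, 3)
    .thrush(xs => xs.map(_ * 2)) // List(2, 4, 6)
    .thrush(xs => xs.sum)        // 12

  println(result)
}
```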