#385 document approach for chaining via thrush #729

Merged: 5 commits, Sep 9, 2023
3 changes: 2 additions & 1 deletion build.sbt
@@ -184,7 +184,8 @@ lazy val docs = project
addCompilerPlugin(
"org.typelevel" % "kind-projector" % "0.13.2" cross CrossVersion.full
),
-scalacOptions += "-Ydelambdafy:inline"
+scalacOptions += "-Ydelambdafy:inline",
+libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
)
.dependsOn(dataset, cats, ml)

36 changes: 28 additions & 8 deletions docs/FeatureOverview.md
@@ -59,6 +59,7 @@ val aptTypedDs2 = aptDs.typed
```

## Typesafe column referencing

This is how we select a particular column from a `TypedDataset`:

```scala mdoc
@@ -389,7 +390,6 @@ c.select(c.colMany('_1, 'city), c('_2)).show(2).run()

### Working with collections


```scala mdoc
import frameless.functions._
import frameless.functions.nonAggregate._
@@ -416,7 +416,6 @@ in Frameless `explode()` is part of `TypedDataset` and not a function of a colum
This provides additional safety since more than one `explode()` applied in a single
statement results in runtime error in vanilla Spark.


```scala mdoc
val t2 = cityRatio.select(cityRatio('city), lit(List(1,2,3,4)))
val flattened = t2.explode('_2): TypedDataset[(String, Int)]
@@ -434,8 +433,6 @@ to a single column at a time.
}
```



### Collecting data to the driver

In Frameless all Spark actions (such as `collect()`) are safe.
@@ -462,7 +459,6 @@ cityBeds.limit(4).collect().run()

## Sorting columns


Only column types that can be sorted are allowed to be selected for sorting.

```scala mdoc
@@ -478,7 +474,6 @@ aptTypedDs.orderBy(
).show(2).run()
```


## User Defined Functions

Frameless supports lifting any Scala function (up to five arguments) to the
@@ -577,7 +572,6 @@ In a DataFrame, if you just ignore types, this would equivalently be written as:
bedroomStats.dataset.toDF().filter($"AvgPriceBeds2".isNotNull)
```


### Entire TypedDataset Aggregation

We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
@@ -611,7 +605,6 @@ aptds.agg(
).show().run()
```


## Joins

```scala mdoc:silent
@@ -646,6 +639,33 @@ withCityInfo.select(
).as[AptPriceCity].show().run
```

### Chained Joins

Joins, or any similar operation, can be chained with a thrush combinator, which removes the need to name intermediate values. Instead of:

```scala mdoc
val withBedroomInfoInterim = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
val withBedroomInfo = withBedroomInfoInterim
.joinLeft(bedroomStats)( withBedroomInfoInterim.col('_1).field('city) === bedroomStats('city) )

withBedroomInfo.show().run()
```

You can use thrush from [mouse](https://github.com/typelevel/mouse):

```scala
libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"
```

```scala mdoc
import mouse.all._

val withBedroomInfoChained = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
.thrush( interim => interim.joinLeft(bedroomStats)( interim.col('_1).field('city) === bedroomStats('city) ) )

withBedroomInfoChained.show().run()
```
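
For readers unfamiliar with the combinator, the following is a minimal sketch of what `thrush` does: it passes a value to a function, so the intermediate result is bound only inside the lambda. The names `ThrushSketch` and `ThrushOps` are hypothetical and chosen for illustration; mouse's own implementation may differ in detail. The example uses plain Scala collections instead of a `TypedDataset`, so it runs without Spark.

```scala
// Illustrative stand-in for mouse's `thrush`; `ThrushSketch` and `ThrushOps`
// are hypothetical names, not part of mouse or Frameless.
object ThrushSketch {
  // Adds `thrush` to any value: `a.thrush(f)` is simply `f(a)`.
  implicit final class ThrushOps[A](private val self: A) extends AnyVal {
    def thrush[B](f: A => B): B = f(self)
  }

  def main(args: Array[String]): Unit = {
    // Without thrush: the intermediate value needs a name so that it can be
    // referenced twice in the next step.
    val interim = List(3, 1, 2).sorted
    val named = interim.map(_ * interim.last)

    // With thrush: the intermediate value is visible only inside the lambda.
    val chained = List(3, 1, 2).sorted.thrush(xs => xs.map(_ * xs.last))

    assert(named == chained)
    println(chained) // prints List(3, 6, 9)
  }
}
```

The same shape applies to the join above: the lambda parameter plays the role of `withBedroomInfoInterim`, which is why the second join condition can refer to `interim.col('_1)`.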

```scala mdoc:invisible
spark.stop()
```