Relation Tutorial

An introduction to Embedded Domain-Specific Language (EDSL) development can be found on the Development process page. Here, we give an example of that methodology with a simple library for relational algebra (dubbed relation-dsl).

Step 1: Defining the Library

The EDSL is first defined as a standard Scala library, which we place in the package relation.shallow. It provides a Relation data type and associated classes, together with the usual manipulations on relations (selection, projection, join and aggregation).

class Relation(val schema: Schema, val underlying: List[Row]) {
  def select(p: Row => Boolean): Relation = ...
  def project(schema: Schema): Relation = ...
  def join(o: Relation, leftKey: String, rightKey: String): Relation = ...
  def aggregate(key: Schema, agg: (Double, Row) => Double): Relation = ...
}
object Relation {
  def scan(filename: String, schema: Schema, delimiter: String): Relation = ...
}

class Row(val values: List[String]) {
  def getField(schema: Schema, fieldName: String): String = ...
}

class Schema(val columns: List[String])
object Schema {
  def apply(columns: String*): Schema = new Schema(columns.toList)
}

See the full file here.

The reason we have a Relation "companion object" defined beside the Relation class is to provide the factory method scan (it's equivalent to a static method in Java).

The EDSL can already be used like a normal Scala library. It can be tested by typing sbt console and importing relation.shallow._, or by creating a Scala application that uses it, as in:

package relation.shallow

object Main extends App {
  val s = Schema("number", "digit")
  val r = Relation.scan("data/R.csv", s, "|")
  r.print
}

Such an application can be run by typing sbt run.
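As a further illustration, the operations compose like ordinary method calls. The query below is hypothetical: the field value "one" depends on the contents of data/R.csv and is only meant to show the API.

val s = Schema("number", "digit")
val r = Relation.scan("data/R.csv", s, "|")
val ones = r.select(row => row.getField(s, "digit") == "one")  // keep rows whose "digit" field is "one"
val digits = ones.project(Schema("digit"))                     // keep only the "digit" column
digits.print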

Now, we would like more power: the ability to optimize and transform programs written in the DSL, as well as to compile them to different target languages. Both require deep embedding.

Step 2: Generating the Deep Embedding

Specifying the Generation

SC provides a way to deeply embed our shallow DSL automatically, via an SBT plugin called Purgatory. We just need to add a few annotations to the library, and Purgatory will handle the rest. In the case of Relation, we have:

@deep
@needs[Row :: Schema :: List[_]]
class Relation(val schema: Schema, val underlying: List[Row]) {
  ...

Here, the @deep annotation tells Purgatory that we would like a deep embedding of this class, both for expression (Exp, the normal deep embedding) and for extraction (Ext, used for pattern matching with quasiquotes). If you do not want to generate the Ext or Exp part, add the @noDeepExt or @noDeepExp annotation, respectively.

@needs specifies which other DSL components our DSL requires. Since our library uses types such as Row, Schema and List, which are not present in the default Base DSL embedded with SC, we need to declare these requirements. If we omitted them, the generated code would not compile.
Note that a special :: type, defined in pardis.annotations, is used to separate the types passed to @needs.
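For the sake of illustration, the remaining shallow classes would be annotated in the same way, each listing only the types it mentions. This is a sketch; the exact @needs lists may differ in the real sources.

@deep
@needs[Schema :: List[_]]
class Row(val values: List[String]) {
  def getField(schema: Schema, fieldName: String): String = ...
}

@deep
@needs[List[_]]
class Schema(val columns: List[String])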

Recommended SBT Configuration

In order to use Purgatory, we need to add it to the plugins.sbt file:

resolvers += Resolver.sonatypeRepo("snapshots")
addSbtPlugin("ch.epfl.data" % "sc-purgatory-plugin" % "0.1.1-SNAPSHOT")

We recommend that one create an SBT sub-project for the deep embedding of the library. This is to prevent a wrongly-generated deep embedding from breaking compilation of the main project (which would prevent re-generating the deep embedding).

The important generation settings needed in the SBT configuration are the following:

generatorSettings ++ Seq(
    generatePlugins += "ch.epfl.data.sc.purgatory.generator.QuasiGenerator",
    pluginLibraries += "ch.epfl.data" % "sc-purgatory-quasi_2.11" % "0.1.1-SNAPSHOT",

    outputFolder := "relation-deep/src/main/scala/relation/deep",
    inputPackage := "relation.shallow",
    outputPackage := "relation.deep"
)

Do not forget the generatePlugins line, which is needed for quasiquotation, and be sure to specify a correct outputFolder (the folder for the generated files), inputPackage (the package of the shallow library) and outputPackage (the package in which the deep embedding is generated).

See here for the full build file of our example.

Now that SBT is set up, we can proceed to the actual code generation, by going to the main project root directory, and typing sbt embed.

In our example, files like DeepRelation.scala will be generated inside the folder relation-deep/src/main/scala/relation/deep.

Simple Compilation

In order to use this deep embedding, we can still write the normal shallow syntax, but it has to appear inside a quasiquotation block dsl"...". To use dsl quasiquotes, we need to extend the trait QuasiAPI[RelationDSLOpsPackaged, RelationDSLExtOpsPackaged] and import its contents (alternatively, we can make the current package object extend it). RelationDSLOpsPackaged and RelationDSLExtOpsPackaged are the deep embeddings of our EDSL; the latter is the one used for extraction.

These are automatically generated by Purgatory after defining our DSL and its dependent constructs using the following trait:

@language
@deep
@quasiquotation
@needs[Relation :: RelationScanner :: Array[_] :: ScalaCore]
trait RelationDSL

ScalaCore contains the deep embedding of the core part of the Scala standard library provided by SC. @quasiquotation specifies that we would like to use quasiquotation to construct and pattern-match IR nodes. @language marks the current trait as defining a (domain-specific) language in the deep embedding.
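As a sketch of the package-object approach, the glue code could look like the following; the package layout and the exact import path of QuasiAPI are assumptions, not taken from the example project.

package relation

import ch.epfl.data.sc.pardis.quasi.QuasiAPI   // assumed import path; adjust to your SC version
import relation.deep.{RelationDSLOpsPackaged, RelationDSLExtOpsPackaged}

package object compiler
  extends QuasiAPI[RelationDSLOpsPackaged, RelationDSLExtOpsPackaged]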

An example of quasiquote use to define a program follows:

implicit object Context extends RelationDSLOpsPackaged // required by the `dsl` string interpolator

def pgrm = dsl"""  
  val s = Schema("number", "digit")
  val r = Relation.scan("data/R.csv", s, "|")
  r.print
"""

The dsl macro typechecks the shallow expression that we pass to it and transforms it into deep embedding nodes (RelationDSLOpsPackaged when used in an expression, RelationDSLExtOpsPackaged when used in a pattern).
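For instance, the same interpolator can appear on both sides of a match. The snippet below is illustrative and reuses the Context defined above:

val q = dsl"""Schema("number", "digit")"""    // expression position: builds RelationDSLOpsPackaged nodes
val arity = q match {
  case dsl"Schema($xs*)" => xs.length         // pattern position: extraction via RelationDSLExtOpsPackaged
  case _                 => 0
}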

Note: If you have trouble with a quasiquote, there is a page dedicated to debugging quasiquotes.

Finally, in order to generate a program from this deep embedding, we have to extend pardis.compiler.Compiler[RelationDSLOpsPackaged] and define a code generator for it. This goes in the opposite direction from the dsl macro, i.e., from the deep embedding back to the shallow embedding.

object MyCompiler extends Compiler[RelationDSLOpsPackaged] {
  val DSL: RelationDSLOpsPackaged = Context
  
  import sc.pardis.prettyprinter._
  
  val codeGenerator = ...
}

// Compiling our program to proper Scala code:
MyCompiler.compile(pgrm, "src/main/scala/GeneratedApp")

(See file RelationCompiler.scala for an example definition of codeGenerator.)

Step 3: Defining Optimizations

Difference Between Online and Offline Optimizations

There are two main kinds of optimizations: online and offline. Online (or local) optimizations are applied continuously, as soon as a node in the deep embedding is created, while offline optimizations are only applied after constructing the full program in the deep embedding (and thus after online optimizations have been applied), in a predefined order.

SC allows both kinds to be defined, but offline optimizations are simpler to start with, because they require no knowledge of the deep embedding and can be described entirely with quasiquotes. We will therefore focus on them in this tutorial. For an explanation and an example of online optimizations, please refer to the appendix.

The Compilation Pipeline

In order to define offline transformations, we need to add them to the pipeline of the MyCompiler object defined earlier. In the following example, we extend MyCompiler's pipeline with a transformer MyTransformer, whose definition is detailed below.

import sc.pardis.optimization.RecursiveRuleBasedTransformer

object MyCompiler extends Compiler[RelationDSLOpsPackaged] {

  ...
  
  pipeline += new MyTransformer[RelationDSLOpsPackaged](DSL)
  
}

Offline analysis and rewrite rules can be written in the following general syntax:

class MyTransformer extends RuleBasedTransformer {
  analysis += rule {
    case dsl" .. to match .. " => // store gathered information
  }
  rewrite += rule {
    case dsl" .. to match .. " => dsl" .. to construct .. "
  }
}

Partial Evaluation of Schema

We now have a nice and simple language for writing queries, but the execution is inefficient: it constantly reads fields through a Schema and manipulates each row as a List[String] with no static size information. This is unfortunate because, in most applications, the schema is static, so the program could be specialized for specific schemas and avoid all that overhead.

The first step is to match schemas that are actually static. For this, we write a Scala extractor that uses quasiquotes:

object StaticSchema {
  def unapply(schema: Rep[Schema]): Option[Schema] = schema match {
    case dsl"Schema($xs*)" =>
      val names = xs map { case Constant(x) => x  case _ => return None }
      Some(new Schema(names.toList))
    case _ => None
  }
}

The extractor takes a Rep[Schema] as input and, when it succeeds, returns a Schema (notice: not a Rep[Schema]). For it to succeed, every part of the schema has to be a constant (i.e., known at compile time).
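For example, given some schemaRep: Rep[Schema] occurring in the program (the name is hypothetical), StaticSchema is used like an ordinary Scala extractor:

schemaRep match {
  case StaticSchema(s) => println(s.columns)   // a plain Schema, e.g. List("number", "digit"), usable at compile time
  case _               => ()                   // the schema is not statically known
}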

Having static Schemas is useful because we can now associate them with tailored record types that store and access row data more efficiently. Records are created with the __new[Rec](fields) function, which can also be used through a special syntax inside quasiquotes. In the same transformation, we also lower the representation of relations to simple arrays (see the list tutorial for more information on lowerings).

For implementing joins, we will also need to keep a static (compile-time) mapping from relation symbols to their corresponding schemas:

import scala.collection.mutable

val symbolSchema = mutable.Map[Rep[_], Schema]()

Here is the specialized implementation of project, written as a rewrite rule:

rewrite += symRule {
  case rel @ dsl"(${ArrFromRelation(arr)}: Relation).project(${StaticSchema(schema)})" => 
    symbolSchema += rel -> schema
    implicit val recTp: TypeRep[Rec] = new RecordType[Rec](getClassTag, None)
    def copyRecord(e: Rep[Any]): Rep[Rec] =
      __new[Rec](schema.columns.map(column => (column, false, field[String](e, column))): _*)
    val newArr = dsl"new Array[Rec]($arr.length)"
    dsl"""
      (0 until $arr.length) foreach ${ __lambda[Int,Unit]((x: Rep[Int]) =>
        dsl"$newArr($x) = ${ copyRecord( dsl"$arr($x)" ) }"
      )}
    """
    newArr
}

Notice how the (0 until $arr.length) part is a dsl block used merely for its side effects (which translate to side effects in the generated program). The ArrFromRelation extractor is defined in a way analogous to ArrFromList in the list tutorial.
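A possible sketch of ArrFromRelation follows; the loweredRelations map is an assumed helper, filled in by the rule that lowers Relation.scan, which records the array each relation symbol was lowered to:

object ArrFromRelation {
  // hypothetical bookkeeping, analogous to the map used in the list tutorial
  val loweredRelations = mutable.Map[Rep[Relation], Rep[Array[Rec]]]()

  def unapply(rel: Rep[Relation]): Option[Rep[Array[Rec]]] =
    loweredRelations.get(rel)
}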

The project rewrite rule above can be simplified thanks to meta-function splicing, which allows us to splice in a meta-level function of type Rep[A] => Rep[B] as if it were a Rep[A => B], removing the need for the ugly __lambda conversion function:

rewrite += symRule {
  case rel @ dsl"(${ArrFromRelation(arr)}: Relation).project(${StaticSchema(schema)})" => 
    symbolSchema += rel -> schema
    implicit val recTp: TypeRep[Rec] = new RecordType[Rec](getClassTag, None)
    val copyRecord = (e: Rep[Any]) =>
      __new[Rec](schema.columns.map(column => (column, false, field[String](e, column))): _*)
    dsl"""
      val newArr = new Array[Rec]($arr.length)
      for (i <- 0 until $arr.length) newArr(i) = $copyRecord($arr(i))
      newArr
    """
}

The simplified solution above applies copyRecord in the object language (as opposed to the meta language, as in the initial code), so it looks like it would add some runtime overhead. Fortunately, the compiler described here applies beta reduction automatically, thanks to the BasePartialEvaluation trait mixed into the DSL IR, so the overhead is removed before the program is generated.
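As a rough illustration of the behavior we rely on (the names here are illustrative):

val inc = (x: Rep[Int]) => dsl"$x + 1"   // a meta-level function Rep[Int] => Rep[Int]
val applied = dsl"$inc(41)"              // spliced as if it were a Rep[Int => Int] and applied
// with BasePartialEvaluation mixed in, `applied` beta-reduces at staging time, so the
// generated program contains the equivalent of `41 + 1`, not a lambda allocation and call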

Record Store to Column Store

The code above stores each row of the database in a Scala case class (a "record" allocated on the heap). Alternatively, we can use a column-store layout, whereby each column of the database is stored in a distinct array.

In order to do that, we associate to each original relation an array of its columns as well as the number of tuples it contains.

type Column = Array[String]
type LoweredRelation = (Rep[Array[Column]], Rep[Int])

As a simple example, consider again the implementation of projection. A nice advantage of column store is that it allows us to reuse the columns of the relation we are projecting from. The code should look like the following:

def relationProject(rel: Rep[Relation], schema: Schema, resultSchema: Schema): LoweredRelation = {
  val (arr,size) = getRelationLowered(rel)
  val nColumns = resultSchema.size
  val res: Rep[Array[Column]] = dsl"new Array[Column]($nColumns)"
  
  for (c <- 0 until nColumns)
    dsl"$res($c) = $arr(${schema.columns.indexOf(resultSchema.columns(c))})"
  
  res -> size
}

Notice that we iterate over the columns at compile time (the for loop is outside of any dsl block), which inserts into the generated program the statements corresponding to each iteration.
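For instance, projecting a relation with columns ("number", "digit") onto Schema("digit") would generate code along these lines (illustrative):

val res = new Array[Column](1)
res(0) = arr(1)   // the column index 1 was computed at staging time by indexOf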

Now, sometimes we need to iterate over the rows in the generated program and, for each row, iterate over the columns. We can still do the latter statically, using the same technique as in the previous paragraph. In the context of the implementation of print, it looks like this:

def relationPrint(rel: Rep[Relation]): Unit = {
  val (arr,size) = getRelationLowered(rel)
  val schema = getRelationSchema(rel)
  dsl"""
    for { i <- 0 until $size } {
      val str = ${ index: Rep[Int] =>
        schema.columns.indices map {
          case 0 => dsl"$arr(0)($index)"
          case c => dsl""" "|" + $arr($c)($index) """
        } reduceOption ((a,b) => dsl"$a + $b") getOrElse dsl"${""}"
      }(i)
      println(str)
    }
  """
}

We iterate over the index of each column (schema.columns.indices), generate the statements that build one string per element (map { case c => ... $arr($c)($index) ... }) and concatenate them (reduceOption ((a,b) => dsl"$a + $b")); if the schema has no columns, we simply output an empty string (getOrElse dsl"${""}").
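For the two-column schema of the running example, the generated loop therefore corresponds to something like (illustrative):

for (i <- 0 until size) {
  val str = arr(0)(i) + ("|" + arr(1)(i))
  println(str)
}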

Finally, we could go one step further and store, for each relation, an array of the representations of its columns, instead of the representation of an array of arrays. In other words, we partially evaluate the outer array, whose size is known at compile time:

type LoweredRelation = (Array[Rep[Column]], Rep[Int])
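With this representation, projection needs to generate no array manipulation at all: it simply rearranges the column representations at staging time. Here is a sketch, assuming the same getRelationLowered helper as above:

def relationProject(rel: Rep[Relation], schema: Schema, resultSchema: Schema): LoweredRelation = {
  val (cols, size) = getRelationLowered(rel)   // cols: Array[Rep[Column]], fully known at staging time
  val res = resultSchema.columns.map(c => cols(schema.columns.indexOf(c))).toArray
  res -> size
}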

The complete specializing implementations of the record store and of the different styles of column store, sharing a common interface, are available here.

Conclusion

We have defined a simple Scala DSL for relational algebra, and implemented optimizations to specialize relations that have static schemas and to use low-level array constructs, thereby getting rid of execution overhead. Moreover, we saw how to generate different implementations depending on which data layout we chose – something that could be determined automatically with a prior analysis of the program.