spark.Rmd

---
title: "Working with other data science platforms"
author: "Steph Locke"
date: "29 November 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

We can work consume data from many different sources in Power BI, but as well as R and Python, some of the most useful to us are the data science platforms. We're able to consume Spark, Databricks, and Hadoop using Power BI.

- [HDInsight based Spark](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-use-bi-tools)
- [databricks](https://docs.databricks.com/user-guide/bi/power-bi.html)

## Working with databricks
Once we have a Spark or databricks cluster we can use the Spark connector to talk to a cluster.

```
https://westeurope.azuredatabricks.net:443/sql/protocolv1/o/7856645940864675/simples
```

We need a user name and password - these can be tokens for ease.

```
token
dapif31cca4bf07da43c52eca8f0a119c653
```

This can work in *direct query* mode so data stays on the databricks cluster.

## Exercise
1. Import the flights data from the Databricks cluster
2. Use (or construct!) a model that predicts flight arrival delays and use this to score the data from Spark
3. Visualise the results