---
title: "WaveformInCSV"
author: "Rob Donald"
date: "`r format(Sys.time(), '%A %d %B %Y, %H:%M')`"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
---

# Introduction

This analysis will look at extracting waveform data that is encoded in the 
columns of a .csv file. 

Each row of the .csv file contains the complete 11-point waveform along
with meta data that describes the conditions at the time the waveform was captured.

This is an example of a complex data structure contained within a seemingly
simple .csv file. This looks a lot trickier than it turns out to be :D.

If you want to get to the answer quickly, use the menu to select 

+ Construct data for ggplot()
    + Using tidyr::gather() in two lines

```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE,fig.height=8,fig.width=12)
knitr::opts_chunk$set(echo = TRUE)
#options(width=1500)
```

## Libraries

```{r library_setup}
suppressMessages({suppressWarnings({
  library(dplyr)
  library(tidyr)
  library(readr)
  library(ggplot2)
  library(gridExtra)
  library(data.table)
  
  library(RobsRUtils)
  library(futile.logger)
})})
```

# Generate Data

```{r}

all.traces <- NULL

all.decay.profiles <- c(5,10,15,20)

for(decay.profile in all.decay.profiles)
{
    time.ms <- seq(0,100,by=10)
    response <- exp(-(time.ms/decay.profile))
    
    decay.setting <- rep(decay.profile,length(time.ms))
    # tibble() replaces the deprecated data_frame()
    waveform.df <- tibble(time.ms,response,decay.setting)
    
    if(is.null(all.traces))
    {
       all.traces <- waveform.df
    }
    else
    {
       all.traces <- bind_rows(all.traces,waveform.df) 
    }
}

```
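
As an aside, the loop above can be sketched without the `is.null()` bookkeeping by building one tibble per decay profile and stacking them in a single `bind_rows()` call. This is only a sketch: the `sketch.*` names are made up so that nothing above is overwritten.

```{r}
library(dplyr)

# Hypothetical names (sketch.*) used only for this illustration.
sketch.profiles <- c(5,10,15,20)
sketch.time.ms  <- seq(0,100,by=10)

# One tibble per decay profile; the single decay value is recycled down the rows.
sketch.traces <- bind_rows(lapply(sketch.profiles, function(decay.profile)
{
    tibble(time.ms       = sketch.time.ms,
           response      = exp(-(sketch.time.ms/decay.profile)),
           decay.setting = decay.profile)
}))

dim(sketch.traces)
```

The result should have the same shape as `all.traces` at this point: 4 profiles times 11 time points, with 3 columns.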

We now add in some batch and experiment day information. We also add columns to control 
the alpha, line type and line size settings. All three are a reflection of the experiment day.

```{r}
all.traces$batch <- ifelse(all.traces$decay.setting > 10,1234,5678)
all.traces$exp.day <- ifelse(all.traces$decay.setting %% 10 == 0,'Day 1','Day 2')
all.traces$alpha.setting <- ifelse(all.traces$decay.setting %% 10 == 0,1,0.3)
all.traces$line.type <- ifelse(all.traces$decay.setting %% 10 == 0,'solid','dotdash')
all.traces$line.size <- ifelse(all.traces$decay.setting %% 10 == 0,2,0.5)
```
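
The same derived columns can also be sketched with `dplyr::mutate()`, which lets us compute the 'multiple of 10' test once and reuse it. The `is.day1` helper column and the `sketch.settings` object are assumptions made for this illustration; they are not part of the analysis.

```{r}
library(dplyr)

# Hypothetical stand-in for all.traces with one row per decay setting.
sketch.settings <- tibble(decay.setting = c(5,10,15,20))

sketch.settings <- sketch.settings %>%
    mutate(is.day1       = decay.setting %% 10 == 0,   # helper flag, reused below
           batch         = ifelse(decay.setting > 10,1234,5678),
           exp.day       = ifelse(is.day1,'Day 1','Day 2'),
           alpha.setting = ifelse(is.day1,1,0.3))

sketch.settings
```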

Let's do an initial plot. 

```{r}
p <- ggplot(data=all.traces,aes(x=time.ms,y=response
                                ,colour = as.factor(decay.setting)
                                ,alpha=alpha.setting
                                ,linetype=line.type))
p <- p + geom_point(size=3)
p <- p + geom_line(aes(size=line.size))
p <- p + labs(title='Experiment Response'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + guides(alpha='none',linetype='none',size='none')
p <- p + scale_alpha_continuous(range = c(0.2, 1))
p <- p + scale_linetype_identity()
p <- p + scale_size_identity()

print(p)
```

We now do a plot where we 'facet' (i.e. draw a separate panel) based on batch. 

```{r}
p <- ggplot(data=all.traces,aes(x=time.ms,y=response,colour = as.factor(decay.setting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experiment Response [Panel: batch]'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + facet_grid(. ~ batch)
print(p)
```

We can do more sub divisions by faceting on both batch and experiment day.

```{r}
p <- ggplot(data=all.traces,aes(x=time.ms,y=response,colour = as.factor(decay.setting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experiment Response [Panel: batch, experiment day]'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + facet_grid(exp.day ~ batch)
print(p)
```

Now let's save that out in a .csv format where each row is the trace from an 
experimental run along with the meta data (in this case the decay setting, 
batch and experiment day) from that run.

We have four experimental runs:

+ Day 1
    + Batch 1234
        + Decay setting 20
        
+ Day 1
    + Batch 5678
        + Decay setting 10
        
+ Day 2
    + Batch 1234
        + Decay setting 15
        
+ Day 2
    + Batch 5678
        + Decay setting 5        

So this means we will have four rows in our .csv file.

Let's pull out each run's results into an R object. 

```{r}
d1.b1234.ds20 <- filter(all.traces, exp.day == 'Day 1', batch == 1234, decay.setting == 20)
d1.b5678.ds10 <- filter(all.traces, exp.day == 'Day 1', batch == 5678, decay.setting == 10)

d2.b1234.ds15 <- filter(all.traces, exp.day == 'Day 2', batch == 1234, decay.setting == 15)
d2.b5678.ds05 <- filter(all.traces, exp.day == 'Day 2', batch == 5678, decay.setting == 5)
```

For each of these objects we have 11 rows of 8 variables (the three original columns
plus the five we added). We need to flatten this data into a single row. In actual fact
we only need 3 bits of meta data (decay setting, batch and experiment day) and the 11
readings from the waveform 0 to 100 ms. We are going
to construct a 14 column row which we can write out in .csv format.

First we collect the waveform data into a single numeric vector.

```{r}
exp.df <- d1.b1234.ds20
wf <- NULL
num.time.pts <- nrow(exp.df)

for (count in 1:num.time.pts)
{
    wf <- c(wf,exp.df$response[count])
}
```

Now we make another vector with the meta data. We only need to grab the first 
element of the required vector.

```{r}
meta.vec <- c(exp.df$exp.day[1],exp.df$batch[1],exp.df$decay.setting[1])
```

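Then we can stick these two vectors together and give them names. Note that c() has to
settle on a single type, so the numbers end up stored as character strings alongside 'Day 1'.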

```{r}
full.row <- c(meta.vec,wf)
names(full.row) <- c('ExpDay','Batch','DecaySetting'
                        ,'t=0','t=10','t=20','t=30','t=40','t=50'
                        ,'t=60','t=70','t=80','t=90','t=100')

full.row
```
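A quick check of what `c()` does when text and numbers are mixed may help here. This is a small sketch with made-up values (the `sketch.*` names are hypothetical); it shows why a later `type.convert()` step is needed to get numbers back.

```{r}
# Mixing a string with numbers forces everything to character.
sketch.meta <- c('Day 1',1234,20)
sketch.row  <- c(sketch.meta,c(1,0.6065))
class(sketch.row)

# type.convert() recovers the numeric values from their text form.
sketch.numbers <- type.convert(sketch.row[2:5],as.is=TRUE)
class(sketch.numbers)
```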

Let's put the above techniques into a function.

```{r}
build_WF_Row <- function(exp.df)
{
    wf <- NULL
    num.time.pts <- nrow(exp.df)
    
    for (count in 1:num.time.pts)
    {
        wf <- c(wf,exp.df$response[count])
    }
    
    meta.vec <- c(exp.df$exp.day[1],exp.df$batch[1],exp.df$decay.setting[1])

    full.row <- c(meta.vec,wf)
    
    # Having got a full row vector we want it as a data frame with
    # a column for each element. The as.is stops the 'Day x' becoming a factor.
    full.row.df <- data.frame(lapply(full.row, type.convert,as.is=TRUE), stringsAsFactors=FALSE)
    
    # We then set the names including the slightly odd t= pattern.
    names(full.row.df) <- c('ExpDay','Batch','DecaySetting'
                        ,'t=0','t=10','t=20','t=30','t=40','t=50'
                        ,'t=60','t=70','t=80','t=90','t=100')
    
    return(full.row.df)
}
```
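
For comparison, the same flattening can be sketched with `tidyr::pivot_wider()`, which spreads the time points across columns in one call. This is an alternative to the hand-built row, not the method used in the rest of this analysis, and the small `sketch.long` object is made up for the illustration.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical two-run example in the same long layout as all.traces.
sketch.long <- tibble(exp.day       = rep(c('Day 1','Day 2'),each=3),
                      batch         = rep(c(1234,5678),each=3),
                      decay.setting = rep(c(20,5),each=3),
                      time.ms       = rep(c(0,10,20),times=2),
                      response      = exp(-(time.ms/decay.setting)))

# One row per run; each time point becomes a column named t=<time>.
sketch.flat <- pivot_wider(sketch.long,
                           names_from   = time.ms,
                           names_prefix = 't=',
                           values_from  = response)
names(sketch.flat)
```

Every column that is not named in `names_from` or `values_from` is treated as identifying the run, so this only works when the meta data is constant within a run, as it is here.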


Now we'll build up a 4 row object with the data from the experiment. If this were real data from a reasonably sized experiment we would do this in a for() loop or with lapply(), but for this example we'll do it manually.

```{r}
full.exp.list <- list()
full.exp.list[[1]] <- build_WF_Row(d1.b1234.ds20)
full.exp.list[[2]] <- build_WF_Row(d1.b5678.ds10)
full.exp.list[[3]] <- build_WF_Row(d2.b1234.ds15)
full.exp.list[[4]] <- build_WF_Row(d2.b5678.ds05)
```
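For a larger experiment, the manual list above could be built with `split()` and `lapply()`. The sketch below is self-contained, so it uses a made-up data frame and a hypothetical `sketch_build_row()` that stands in for `build_WF_Row()`.

```{r}
# Hypothetical stand-ins: a small long-format data frame and a row builder.
loop.traces <- data.frame(decay.setting = rep(c(5,10),each=3),
                          time.ms       = rep(c(0,10,20),times=2))
loop.traces$response <- exp(-(loop.traces$time.ms/loop.traces$decay.setting))

sketch_build_row <- function(run.df)
{
    # Meta data first, then the responses spread across one row.
    data.frame(DecaySetting = run.df$decay.setting[1],
               t(setNames(run.df$response,paste0('t=',run.df$time.ms))),
               check.names = FALSE)
}

# split() gives one data frame per run; lapply() builds one row for each.
loop.runs <- split(loop.traces,loop.traces$decay.setting)
loop.rows <- lapply(loop.runs,sketch_build_row)
loop.wide <- do.call(rbind,loop.rows)
loop.wide
```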

We have a list of data frames but what we want is a single data frame. 
Use rbindlist from the data.table package to achieve this.

```{r}
full.exp.df <- rbindlist(full.exp.list)
```
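
One thing worth knowing is that `rbindlist()` hands back a data.table rather than a plain data frame. A minimal sketch with made-up data (the `sketch.*` names are hypothetical):

```{r}
library(data.table)

sketch.list <- list(data.frame(a = 1, b = 'x'),
                    data.frame(a = 2, b = 'y'))

sketch.dt <- rbindlist(sketch.list)
class(sketch.dt)

# Convert if a plain data frame is preferred downstream.
sketch.plain <- as.data.frame(sketch.dt)
class(sketch.plain)
```

A data.table still inherits from data.frame, so functions such as `write.csv()` accept it without conversion.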


Let's check that looks as we expect.


```{r}
full.exp.df[,c('ExpDay','Batch','DecaySetting','t=0','t=10','t=90','t=100')]
```

# Export the data
We can now write this out to a .csv file so that you can prove to yourself
that it is what you would expect.

I'll assume that your current directory is the project dir.

```{r}
exp.file.name <- 'ExperimentFullData.csv'
write.csv(full.exp.df,file = exp.file.name, row.names = FALSE )
```

# Import the data
What we want to do now is show how you can read in a file in this format
and construct an object that ggplot can use to produce graphs like those above.

```{r}
raw.exp.data <- read_csv(file = exp.file.name)
```
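
The choice of reader matters for the `t=` column names. `read_csv()` leaves them alone, whereas base `read.csv()` by default passes them through `make.names()`, which swaps the `=` for a dot. The sketch below uses a made-up two-line .csv held in a string (`sketch.csv` is hypothetical) to show the base R behaviour.

```{r}
sketch.csv <- 'ExpDay,t=0,t=10\nDay 1,1,0.6065'

# Default: names are made syntactically valid, so 't=0' is altered.
names.default <- names(read.csv(text = sketch.csv))
names.default

# check.names = FALSE keeps the names exactly as they are in the file.
names.kept <- names(read.csv(text = sketch.csv,check.names = FALSE))
names.kept
```

With the default names, a later `contains('t=')` would match nothing, so it is worth checking `names()` straight after import.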

# Construct data for ggplot()

We now show two ways to build up an object that we can use with ggplot().

## The hard way

What we want is a function that can take a row of meta data *and* waveform data
and return a data frame that we can use in a bind_rows construct to form an
object for ggplot().

```{r}
buildWaveformAndMetaData<-function(single.row)
{
    # Our first task is to split out the waveform from the metadata.
    
    wf <- select(single.row,contains('t='))
    col.names <- names(wf)
    
    time.ms <- rep(NaN,length(col.names))
    response <- rep(NaN,length(col.names))
    waveform.df <- tibble(time.ms,response)
    for(count in 1:length(col.names))
    {
      waveform.df$time.ms[count] <- gsub('t=','',col.names[count])
      waveform.df$response[count] <- as.numeric(wf[count])
    }
    
    waveform.df$time.ms <- as.numeric(waveform.df$time.ms)

    meta.data <- select(single.row,-contains('t='))
    col.names <- names(meta.data)
    
    for(column.name in col.names)
    {
        waveform.df[[column.name]] <- rep(meta.data[[column.name]],nrow(waveform.df))
    }
    
    return(waveform.df)
}
```

Let's test this function

```{r}
single.row <- raw.exp.data[1,]

one.row <- buildWaveformAndMetaData(single.row)
one.row
```

Now do it for the whole .csv file.

```{r}

all.experimental.data <- NULL
for(count in 1:nrow(raw.exp.data))
{
    single.row <- raw.exp.data[count,]

    one.row.df <- buildWaveformAndMetaData(single.row)
    
    if(is.null(all.experimental.data))
    {
        all.experimental.data <- one.row.df 
    }
    else
    {
        all.experimental.data <- bind_rows(all.experimental.data,one.row.df)
    }
    
}


```


```{r}
p <- ggplot(data=all.experimental.data,aes(x=time.ms,y=response,colour = as.factor(DecaySetting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experimental Response'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
print(p)
```


```{r}
p <- ggplot(data=all.experimental.data,aes(x=time.ms,y=response,colour = as.factor(DecaySetting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experimental Response [Panel: batch, experiment day]'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + facet_grid(ExpDay ~ Batch)
print(p)
```

## Using tidyr::gather()

An alternative approach is to use the gather() function from tidyr.

Here we show using gather() one row at a time.

```{r}

gather.all.experimental.data <- NULL
for(count in 1:nrow(raw.exp.data))
{
    single.row <- raw.exp.data[count,]

    one.row.df <- gather(single.row,value='response',key=time.pt,contains('t='))
    
    if(is.null(gather.all.experimental.data))
    {
        gather.all.experimental.data <- one.row.df 
    }
    else
    {
        gather.all.experimental.data <- bind_rows(gather.all.experimental.data,one.row.df)
    }
    
}

gather.all.experimental.data$time.ms <- as.numeric(gsub('t=','',gather.all.experimental.data$time.pt))

```

## Using tidyr::gather() in two lines

But in actual fact you can use gather() to process the *whole* data frame in a oner. This amazingly reduces the complete operation down to two lines!!

In this first example we use the contains() function to scoop up all the timepoints.
```{r}

long.data <- gather(raw.exp.data,value='response',key=time.pt,contains('t='))
long.data$time.ms <- as.numeric(gsub('t=','',long.data$time.pt))
```
Let's see the plot.

```{r}
p <- ggplot(data=long.data,aes(x=time.ms,y=response,colour = as.factor(DecaySetting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experimental Response from Long Data [Panel: batch, experiment day]'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + facet_grid(ExpDay ~ Batch)
print(p)
```
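
In current tidyr, gather() is superseded by `pivot_longer()`. The reshaping used for long.data can be sketched with it as follows, assuming a tidyr version that provides `pivot_longer()` (1.0.0 or later). The `sketch.raw` object is a made-up miniature of raw.exp.data.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical miniature of raw.exp.data with two time points.
sketch.raw <- tibble(ExpDay       = c('Day 1','Day 2'),
                     Batch        = c(1234,5678),
                     DecaySetting = c(20,5),
                     `t=0`        = c(1,1),
                     `t=10`       = exp(-10/c(20,5)))

# names_prefix strips the 't=' so only the number is left in time.ms.
sketch.pl <- pivot_longer(sketch.raw,
                          cols         = starts_with('t='),
                          names_to     = 'time.ms',
                          names_prefix = 't=',
                          values_to    = 'response')
sketch.pl$time.ms <- as.numeric(sketch.pl$time.ms)
sketch.pl
```

The rows come out grouped by run rather than by time point, which is a different order from gather() but makes no difference to ggplot().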


In this next example we can specify the start and end points if we are interested in a particular section of the waveform.
```{r}

long.data.v2 <- gather(raw.exp.data,value='response',key=time.pt,`t=0`:`t=50`)
long.data.v2$time.ms <- as.numeric(gsub('t=','',long.data.v2$time.pt))
```

Let's see the plot.

```{r}
p <- ggplot(data=long.data.v2,aes(x=time.ms,y=response,colour = as.factor(DecaySetting)))
p <- p + geom_point()
p <- p + geom_line()
p <- p + labs(title='Experimental Response from Long Data V2 [Panel: batch, experiment day]'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + facet_grid(ExpDay ~ Batch)
print(p)
```

Using the long.data object, let's add in the alpha, line type and size columns.

```{r}
long.data$alpha.setting <- ifelse(long.data$DecaySetting %% 10 == 0,1,0.6)
long.data$line.type <- ifelse(long.data$DecaySetting %% 10 == 0,'solid','dotdash')
long.data$line.size <- ifelse(long.data$DecaySetting %% 10 == 0,1,0.5)
long.data$point.size <- ifelse(long.data$DecaySetting %% 10 == 0,3,1)
```

Now do a facet plot as above but also show the use of subtitle and caption. Note the caption
also has a theme() line to control size and font face.

```{r}
p <- ggplot(data=long.data,aes(x=time.ms,y=response
                                            ,colour = as.factor(DecaySetting)
                                            ,alpha=alpha.setting
                                            ,linetype=line.type))
p <- p + geom_point(size=3)
p <- p + geom_line(aes(size=line.size))
p <- p + labs(title='Experimental Response from Long Data [Panel: batch, experiment day]'
                        ,subtitle='Using alpha, linetype and size'
                        ,caption='Data Object: long.data'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + theme(plot.caption = element_text(size = 6,face = 'italic'))

p <- p + guides(alpha='none',linetype='none',size='none')
p <- p + scale_alpha_continuous(range = c(0.2, 1))
p <- p + scale_linetype_identity()
p <- p + scale_size_identity()

p <- p + facet_grid(ExpDay ~ Batch)

print(p)
```

A slightly different version of this is to put the alpha setting only in the
geom_line(). We also use the point.size column to control the size of the points for the two different conditions. This allows the active points to stand out.


```{r}
p <- ggplot(data=long.data,aes(x=time.ms,y=response
                                            ,colour = as.factor(DecaySetting)
                                            ,linetype=line.type))
p <- p + geom_point(aes(size=point.size))
p <- p + geom_line(aes(size=line.size,alpha=alpha.setting))
p <- p + labs(title='Experimental Response from Long Data [Panel: batch, experiment day]'
                        ,subtitle='Using alpha, linetype and size'
                        ,caption='Data Object: long.data'
                        ,x='Time (ms)', y='Response proportion'
                        ,colour='Decay Setting')
p <- p + theme(plot.caption = element_text(size = 6,face = 'italic'))

p <- p + guides(alpha='none',linetype='none',size='none')
p <- p + scale_alpha_continuous(range = c(0.5, 1))
p <- p + scale_linetype_identity()
p <- p + scale_size_identity()

p <- p + facet_grid(ExpDay ~ Batch)

print(p)
```
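
A note for newer ggplot2 installs: from version 3.4.0 the thickness of lines is controlled by the `linewidth` aesthetic, and using `size` for lines produces a deprecation message. The sketch below shows the line part of the plot with `linewidth`, on a made-up data frame (`sketch.plot.df` is hypothetical). The chunk option is there so that it only runs when a new enough ggplot2 is installed.

```{r, eval = utils::packageVersion('ggplot2') >= '3.4.0'}
library(ggplot2)

# Hypothetical miniature of long.data with two decay settings.
sketch.plot.df <- data.frame(time.ms      = rep(c(0,10,20),times=2),
                             DecaySetting = rep(c(10,15),each=3))
sketch.plot.df$response  <- exp(-(sketch.plot.df$time.ms/sketch.plot.df$DecaySetting))
sketch.plot.df$line.size <- ifelse(sketch.plot.df$DecaySetting %% 10 == 0,1,0.5)

p.sketch <- ggplot(data=sketch.plot.df,aes(x=time.ms,y=response
                                           ,colour = as.factor(DecaySetting)))
# linewidth replaces size for lines; the identity scale uses the column values as-is.
p.sketch <- p.sketch + geom_line(aes(linewidth=line.size))
p.sketch <- p.sketch + scale_linewidth_identity()
p.sketch <- p.sketch + labs(x='Time (ms)', y='Response proportion'
                            ,colour='Decay Setting')
print(p.sketch)
```

Points still use `size`, so a plot that sizes both points and lines would keep `scale_size_identity()` for the points alongside `scale_linewidth_identity()` for the lines.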