forked from bishwaG/Single-cell-RNA-seq-data-analysis-2018
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsession2-alignment.Rmd
650 lines (478 loc) · 20.2 KB
/
session2-alignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
---
title: "Seurat Alignment"
output: pdf_document
---
This tutorial has been structured based on Seurat tutorial on alignment. We have two datasets from two different single cell plateforms (10X Genimics and Seqwell). Seurat's alignment pipeline allows us to integrate single-cell data across different conditions, technologies, and species.
As a first step open terminal from your application menu and change the directory to `Single-cell-course-2018`
```
cd single-cell-course-2018
git clone https://github.com/bishwaG/Single-cell-RNA-seq-data-analysis-2018.git
mv Single-cell-RNA-seq-data-analysis-2018 analysis
cd analysis
```
Load libraries needed for the analysis.
```{r message=FALSE}
## load libraries
library(Seurat)
library(dplyr)
library(Matrix)
library(ggplot2)
library(plotly)
```
Setting up input output directories.
```{r message=FALSE}
## variables
tenx.data.path <- "../data/10x/"
seqwell.data.path <- "../data/seqwell/"
robj.dir <- "r-objects"
output.dir <- "output"
## create directory
dir.create(robj.dir)
dir.create(output.dir)
```
Loading input gene expression matrix.
```{r}
## load 10X gene expression matrix
tenx.data <- read.table(gzfile(file.path(tenx.data.path,"dge.csv.gz")),
sep="\t",
header=T,
row.names = 1)
## load seqwell data
dt <- file.path(seqwell.data.path,"dt-dge.csv.gz")
nt <- file.path(seqwell.data.path,"nt-dge.csv.gz")
seqwell.dt.data <- read.table(gzfile(dt),
sep="\t",
header=T,
row.names = 1)
seqwell.nt.data <- read.table(gzfile(nt),
sep="\t",
header=T,
row.names = 1)
dim(tenx.data)
dim(seqwell.dt.data)
dim(seqwell.nt.data)
```
# Create seurat objects.
The function `CreateSeuratObject` creates Seurat object from full matrix. If you have sparse matrix from 10X Genomics use the function `Read10X`.
### Parameters:
- `min.cells`: Include genes with detected expression in at least `3` cells.
- `min.genes`: Include cells where at least `200` genes are detected.
```{r}
## Create seurat objects
tenx <- CreateSeuratObject(raw.data = tenx.data,
project="10X",
min.cells = 3,
min.genes = 200)
seqwell.dt <- CreateSeuratObject(raw.data = seqwell.dt.data)
seqwell.nt <- CreateSeuratObject(raw.data = seqwell.nt.data)
## merge seqwell seurat objects
seqwell <- MergeSeurat(seqwell.dt, seqwell.nt,
project = "Seqwell",
add.cell.id1 = "DT",
add.cell.id2 = "NT",
min.cells = 3,
min.genes = 200)
```
# Removing unnecessary objects.
Full matrix takes lot of memory. Therefore it is highly recommended to remove the object that holds gene expression matrix.
### Example:
- Genes = 31,500
- Cells = 730
- Sparse matrix = 16Mb
- Full matrix = 178 Mb
```{r}
## remove unnecessary objects
rm(seqwell.dt.data)
rm(seqwell.nt.data)
rm(seqwell.dt)
rm(seqwell.nt)
```
# Setting Identity class
We want to identify cells by NT and DT samples. It is done using the function `SetAllIdent`. Before using the function we add NT and DT in `orig.ident` in the dataframe `[email protected]`.
```{r}
## Setting identity class for 10x
tenx.ident <- gsub(".*\\.","", rownames([email protected]))
tenx.ident[tenx.ident==1] <- "NT"
tenx.ident[tenx.ident==2] <- "DT"
[email protected]$orig.ident <- factor(tenx.ident, levels = unique(tenx.ident))
tenx <- SetAllIdent(object = tenx, id = "orig.ident")
## Setting identity class for seqwell
seqwell.ident <- gsub("_.*", "",rownames([email protected]))
[email protected]$orig.ident <- factor(seqwell.ident, levels = unique(seqwell.ident))
seqwell <- SetAllIdent(object = seqwell, id = "orig.ident")
```
# Quality control
In this step we remove celss having low and high `nUMI`. Cells having low `nUMI` could be of low quality and having high could be duplets. We also remove celss having high percentage of mitocondrial genes. Before removing cells we calculate percentage of mitocondrial cells and put the information in `percent.mito` column of `meta.data`.
```{r}
tenx.m.genes <- grep(pattern = "^MT-", x = rownames(x = tenx@data), value = TRUE)
tenx.m.per <- Matrix::colSums([email protected][tenx.m.genes, ]) / Matrix::colSums([email protected])
tenx <- AddMetaData(object = tenx, metadata = tenx.m.per, col.name = "percent.mito")
seqwell.m.genes <- grep(pattern = "^MT-", x = rownames(x = seqwell@data), value = TRUE)
seqwell.m.per <- Matrix::colSums([email protected][seqwell.m.genes, ]) / Matrix::colSums([email protected])
seqwell <- AddMetaData(object = seqwell, metadata = seqwell.m.per, col.name = "percent.mito")
```
Now we check the distribution of `nGene`, `nUMI` and `percent.mito`.
```{r}
## Plots
v1 <- VlnPlot(object = tenx,
features.plot = c("nGene", "nUMI", "percent.mito"),
nCol = 3,
do.return = T)
v2 <- VlnPlot(object = seqwell,
features.plot = c("nGene", "nUMI", "percent.mito"),
nCol = 3,
do.return = T)
plot_grid(v1,v2, labels=c("10X", "Seqwell"), ncol = 1)
```
We can also check the distribution in scatter plots.
```{r}
## Scatter plots
sc1 <- ggplot([email protected], aes(x=nUMI, y=percent.mito, colour=orig.ident, fill=orig.ident)) +
geom_point(shape = 21, alpha = 0.5, size = 2)
sc2 <- ggplot([email protected], aes(x=nUMI, y=nGene, colour=orig.ident, fill=orig.ident)) +
geom_point(shape = 21, alpha = 0.5, size = 2)
## similar scatter plots plots for Seqwell
sc3 <- ggplot([email protected], aes(x=nUMI, y=percent.mito, colour=orig.ident, fill=orig.ident)) +
geom_point(shape = 21, alpha = 0.5, size = 2)
sc4 <- ggplot([email protected], aes(x=nUMI, y=nGene, colour=orig.ident, fill=orig.ident)) +
geom_point(shape = 21, alpha = 0.5, size = 2)
plot_grid(sc1,sc2,sc3,sc4, ncol=2, nrow=2, labels=c("10X","10X","Seqwell","Seqwell") )
```
We filter cells using the function `FilterCells`. The parameters `low.thresholds` and `high.thresholds` are for setting up high and low cutoff for filtering. For example, in the following code cells having `nGene < 20` and `nGene > 2500` will be removed.
```{r}
## Filtering
tenx <- FilterCells(object = tenx,
subset.names = c("nGene", "percent.mito"),
low.thresholds = c(200, -Inf),
high.thresholds = c(5000, 0.08))
seqwell <- FilterCells(object = seqwell,
subset.names = c("nGene", "percent.mito"),
low.thresholds = c(200, -Inf),
high.thresholds = c(2500, 0.08))
## Plots
v3 <- VlnPlot(object = tenx,
features.plot = c("nGene", "nUMI", "percent.mito"),
nCol = 3,
do.return = T)
v4 <- VlnPlot(object = seqwell,
features.plot = c("nGene", "nUMI", "percent.mito"),
nCol = 3,
do.return = T)
plot_grid(v3,v4, labels=c("10X", "Seqwell"), ncol = 1)
```
# Normalization and scaling
To make expression of each cells comparable with one another we have to normalize the expression. Seurat uses `LogNormalize` method to normalize gene expression for each cell by total expression. Then it is multiplied by scale factor before performing log transformation.
The function `ScaleData` scales and centers genes and also regress out effects coming from user provided variables.
```{r}
## Normalization, data scaling and finding variable genes
tenx <- NormalizeData(object = tenx)
tenx <- ScaleData(object = tenx,
vars.to.regress = c("nUMI", "percent.mito"),
display.progress = TRUE)
## No need to normalize because MergeSeurat() does it by default.
seqwell <- ScaleData(object = seqwell,
vars.to.regress = c("nUMI", "percent.mito"),
display.progress = TRUE)
```
# Cell cycle effect
In some case we would like to remove effect coming from the expression of cell cycle genes. First we check if cells group based on cell cycle. Seurat first gives score to each cell based on expression of G2/M and S phase. Cells expressing neither are likely not cycling and in G1 phase.
```{r}
## Cell-cycel effect for 10X data
tenx.cc <- CellCycleScoring(object = tenx,
s.genes = cc.genes$s.genes,
g2m.genes = cc.genes$g2m.genes,
set.ident = TRUE)
## Check metadata data.frame. cell scores and phase are added
head([email protected],4)
## Before adjustment
tenx.cc <- RunPCA(object = tenx.cc,
pc.genes = c(cc.genes$s.genes, cc.genes$g2m.genes),
do.print = FALSE)
cc1 <- PCAPlot(object = tenx.cc, do.return=T, plot.title = "10X")
```
Now we adjust the cell cycle by providing cell cucle score to regress out to the function `ScaleData`.
```{r}
## cell cycle correction
tenx.cc <- ScaleData(object = tenx.cc,
## If provided data.use this function uses already scaled data.
data.use = [email protected],
vars.to.regress = c("S.Score", "G2M.Score"),
display.progress = TRUE)
## Run pca using expression of cell cycle genes after adjustment
tenx.cc <- RunPCA(object = tenx.cc,
pc.genes = c(cc.genes$s.genes, cc.genes$g2m.genes),
do.print = FALSE)
cc2 <- PCAPlot(object = tenx.cc, do.return=T, plot.title = "10X")
plot_grid(cc1,cc2, labels=c("Before cc-adjustement", "After cc-adjustemnt"), ncol = 2)
## remove the object
rm(tenx.cc)
```
# Variable genes
Seurat uses highly variable genes in the downstream data analysis. Here we calcualte variable genes seperately for `tenx` and `seqwell` and later we take union of variable genes od two datasets.
```{r}
## Find variable genes
tenx <- FindVariableGenes(object = tenx,
x.low.cutoff = 0.1,
do.plot = TRUE)
seqwell <- FindVariableGenes(object = seqwell,
x.low.cutoff = 0.2,
do.plot = TRUE)
length(x = [email protected])
length(x = [email protected])
## Take highly variable genes
hvg.tenx <- rownames(x = head(x = [email protected] , n = 400))
hvg.seqwell <- rownames(x = head(x = [email protected], n = 400))
hvg.union <- union(x = hvg.tenx, y = hvg.seqwell)
```
# Canonical correlation analysis (CCA)
```{r}
## Canonical correlation analysis to identify common sources of variation between the two datasets.
## RunCCA will also combine the two objects into a single object and stores the canonical correlation
## vectors (the vectors that project each dataset into the maximally correlated subspaces).
## We also store the original dataset identity as a column in [email protected]
## for identification after merging
[email protected][, "library"] <- "10X"
[email protected][, "library"] <- "Seqwell"
sObj <- RunCCA(object = tenx,
object2 = seqwell,
genes.use = hvg.union)
p1 <- DimPlot(object = sObj, reduction.use = "cca", group.by = "library", pt.size = 0.5,
do.return = TRUE)
p2 <- VlnPlot(object = sObj, features.plot = "CC1", group.by = "library", do.return = TRUE)
p3 <- VlnPlot(object = sObj, features.plot = "CC2", group.by = "library", do.return = TRUE)
p1
plot_grid(p2, p3)
## heat map of 12 CCA dimensions
DimHeatmap(object = sObj,
reduction.type = "cca",
cells.use = 500,
num.genes = 30,
dim.use = 1:6,
do.balanced = TRUE)
DimHeatmap(object = sObj,
reduction.type = "cca",
cells.use = 500,
num.genes = 30,
dim.use = 7:12,
do.balanced = TRUE)
## Number of dimensions to include in the analysis
dims.include <- 8
## search for cells whose expression profile cannot be well-explained by low-dimensional CCA,
## compared to low-dimensional PCA.
sObj <- CalcVarExpRatio(object = sObj, reduction.type = "pca", grouping.var = "library",
dims.use = 1:dims.include)
## discard cells where the variance explained by CCA is <2-fold (ratio <
## 0.5) compared to PCA
sObj.all.save <- sObj
sObj <- SubsetData(object = sObj, subset.name = "var.ratio.pca", accept.low = 0.5)
```
## Discarded cells
```{r}
## Discarded cells
sObj.discard <- SubsetData(object = sObj.all.save,
subset.name = "var.ratio.pca",
accept.high = 0.5)
## median gene count
median(x = [email protected][, "nGene"])
## median gene count of discarded cells. Discarded cells have lower gene count
median(x = [email protected][, "nGene"])
## discarded cells
discarded.cells <- colnames(sObj.discard@data)
## expression of LYZ in discarded cells
VlnPlot(object = sObj.discard, features.plot = "LYZ", group.by = "library")
```
# Alignment of CCA subspaces
```{r}
## Now we align the CCA subspaces
sObj <- AlignSubspace(object = sObj,
reduction.type = "cca",
grouping.var = "library",
dims.align = 1:dims.include,
verbose = FALSE)
## Visualize the aligned CCA
p1 <- VlnPlot(object = sObj, features.plot = "GATA4",
group.by = "library",
do.return = TRUE)
p1
```
# Clustering
```{r}
## Run clustering
sObj <- FindClusters(object = sObj,
reduction.type = "cca.aligned",
dims.use = 1:dims.include,
save.SNN = TRUE,
print.output = FALSE,
force.recalc = TRUE)
```
# Visualizing cluster using tSNE
```{r}
## Run tSNE
sObj <- RunTSNE(object = sObj,
reduction.use = "cca.aligned",
dim.embed = 3,
dims.use = 1:dims.include,
do.fast = TRUE)
TSNEPlot(object = sObj,
do.return = TRUE,
pt.size = 0.5,
do.label = TRUE)
```
# Save Seurat object to file
```{r}
## Save object
save(sObj, file=file.path(robj.dir,"sObj.RData"))
```
## 3D tSNE plot
```{r}
## extract three dimensions of tSNE
tsne.dims <- FetchData(sObj, vars.all = c("tSNE_1","tSNE_2","tSNE_3"))
[email protected]$tSNE_1 <- tsne.dims$tSNE_1
[email protected]$tSNE_2 <- tsne.dims$tSNE_2
[email protected]$tSNE_3 <- tsne.dims$tSNE_3
## coloring tSNE using nUMI, nGene, percent.mito etc
ggplot([email protected], aes(x=tSNE_1, y=tSNE_2, colour=nUMI))+geom_point()
ggplot([email protected], aes(x=tSNE_1, y=tSNE_2, colour=nGene))+geom_point()
ggplot([email protected], aes(x=tSNE_1, y=tSNE_2, colour=percent.mito))+geom_point()
ggplot([email protected], aes(x=tSNE_1, y=tSNE_2, colour=orig.ident))+geom_point()
ggplot([email protected], aes(x=tSNE_1, y=tSNE_2, colour=library))+geom_point()
## 3D tSNE plot
Sys.setenv("plotly_username"="singleCell")
Sys.setenv("plotly_api_key"="20VlK6d6Iik9srBaUw3a")
## 3D tSNE plot
p <- plot_ly([email protected],
x = ~tSNE_1,
y = ~tSNE_2,
z = ~tSNE_3,
color = ~res.0.8,
mode="markers",
hoverinfo="text") %>% add_markers()
```
## Expression of gene of interest
```{r}
gene.of.interest <- c("SST","INS")
FeaturePlot(object = sObj,
features.plot = gene.of.interest,
cols.use = c("red", "grey"),
reduction.use = "tsne")
```
## Subsetting tSNE plot.
```{r}
sObj.sub <- SubsetData(sObj, ident.use = c(4,6))
TSNEPlot(sObj.sub)
```
# Characterizing clusters
Clusters can be characterized to specific cell type by looking into the expression of cell surface markers. [This link](https://www.bdbiosciences.com/documents/cd_marker_handbook.pdf) provides list of CD markers and thier expression based on cell types. For automatic cell type characterization check https://sct.lifegen.com/.
```{r}
#FeaturePlot(sObj, features.plot = "CD8", cols.use = c("grey","red"))
```
# DE analysis
Differential expression analysis can be done between two goups of clusters. The function `FindAllMarkers` calculates markers for each identity class (Cluster). It compares a cluster with all other cells.
## Parameter (directly from Seurat's R documentation)
- `min.pct` : only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. Meant to speed up the function by not testing genes that are very infrequently expressed. Default is 0.1
- `test.use` : Denotes which test to use.
- `logfc.threshold` : Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Default is 0.25 Increasing logfc.threshold speeds up the function, but can miss weaker signals.
```{r message=FALSE, warning=FALSE}
all.markers <- FindAllMarkers(sObj,
logfc.threshold = 0.50,
test.use = "negbinom")
head(all.markers, 4)
```
Sometimes we want compare only two cluster of interest. In that case we we the function `FindMarkers`. If we want to do it for every cluster and save the markers into file we need to use `as.numeric(levels(sObj@ident))` in loop.
```{r message=FALSE, warning=FALSE}
cluster.6.vs.4.markers <- FindMarkers(object = sObj,
ident.1 = 6,
ident.2 = 4,
logfc.threshold = 0.25,
test.use = "negbinom")
## visulizing markers
ggplot(data=cluster.6.vs.4.markers,
aes(x=avg_logFC,
y=-log10(p_val_adj),
colour=as.factor(cluster.6.vs.4.markers$p_val_adj <= 0.05)))+
geom_point(size=1) +
xlab("avg_logFC")+
ylab("-log10(adjusted pvalue)")+
theme_gray()+
guides(colour=FALSE)
```
Visualizing top 20 markers based on `avg_logFC`.
```{r}
neg <- cluster.6.vs.4.markers[order(cluster.6.vs.4.markers$avg_logFC),]
pos <- cluster.6.vs.4.markers[order(cluster.6.vs.4.markers$avg_logFC, decreasing=TRUE),]
## merge neg top10.neg and top10.pos
top20 <- rbind(
head(pos,10),
head(neg,10)
)
## user cluster 6 and 4 cells
cells.use <- row.names([email protected][[email protected]$res.0.8 == 6 | [email protected]$res.0.8 == 4, ])
DoHeatmap(object = sObj,
genes.use = row.names(top20),
cells.use = cells.use,
slim.col.label = TRUE, remove.key = TRUE)
```
# Pathway analysis
Here we user Reactome for pathway analysis. Significant cluater markers (p_val_adj <= 0.05) and their fold change can be fed into Reactome to find enriched pathways.
```{r}
#extract gene and log_FC
sig.markers <- cluster.6.vs.4.markers[cluster.6.vs.4.markers$p_val_adj<=0.05,]
sig.markers$avg_FC <- 2^sig.markers$avg_logFC
sig.markers$genes <- rownames(sig.markers)
sig.markers <- sig.markers[,c("genes","avg_logFC")]
out.file <- "sig.markers.csv"
write.table(sig.markers,file=file.path(output.dir,out.file),
sep="\t",
row.names = F,
col.names = F,
quote=F)
```
```{r}
library(jsonlite)
## Reactome url
url <- "https://reactome.org/AnalysisService/identifiers/projection/?pageSize=1&page=1"
## POST to reactome
cmd <- paste("curl -H \"Content-Type: text/plain\" --data-binary @",file.path(output.dir,out.file)," -X POST --url ",url,sep="")
a <- system(cmd,
show.output.on.console = T,
intern = T)
## convert json to list
pthw.lst <- fromJSON(a,flatten=T)
pthw <- pthw.lst$pathways
## Informative columns
cols <- c("stId",
"name",
"species.name",
"entities.total",
"entities.found",
"entities.ratio",
"entities.pValue",
"entities.fdr",
"reactions.total",
"reactions.found",
"reactions.ratio")
pthw.subset <- pthw[,cols]
head(pthw.subset, 4)
cat("LINK: ",
paste("https://reactome.org/PathwayBrowser/#/DTAB=AN&ANALYSIS=", pthw.lst$summary$token,sep=""),
"\n")
```
# Pseudotime
```{r}
library("slingshot")
## get tSNE embedings
cca.data <- FetchData(sObj, vars.all = c("tSNE_1","tSNE_2"))
## get cluster ids
cnames <- colnames([email protected])
clus <- [email protected][,grep("res",cnames)]
## run slingshot
sce <- slingshot(cca.data,clusterLabels=clus)
## line
lin1 <- getLineages(cca.data, clus, start.clus= '0', end.clus = max(clus))
plot(cca.data, col = factor(clus), pch=16, asp = 1)
lines(lin1, lwd = 3, show.constraints = TRUE)
```
## SessionInfo
```{r}
sessionInfo()
```