-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathKallisto_DESeq2.Rmd
474 lines (320 loc) · 18.2 KB
/
Kallisto_DESeq2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
---
title: "Differential expresion analysis. Kallisto and DESeq2 pipeline"
author: "Álvaro Ruiz Tabas"
date: '2022-07-24'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Differential expresion analysis. Kallisto and DESeq2 pipeline
(here we have the same that is written in README file)
This script is a general version with the commands lines to import data generated with kallisto and do the differential expression analysis with DESeq2.
IT'S CRUCIAL TO KNOW THAT IN THIS PIPELINE WE ARE NOT CONSIDERING ISOFORMS. WE ARE CONSIDERING EACH DETECTED TRANSCRIPT AS AN INDEPENDENT GENE. In future I want to add how manage the isoforms but here we don't consider it son take this analysis as a first approximation to your data.
You have to change names or add things to fits the script to your data. This is only the backbone of the script.
I highly recommend to keep on hand these articles about DESeq2. They are really helpfull to resolve any doubt:
https://rstudio-pubs-static.s3.amazonaws.com/329027_593046fb6d7a427da6b2c538caf601e1.html#example-1-two-group-comparison
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Any question don't doubt and ask.
## 1. Import data, visualization and DESeq2
### 1.1. Packages requieres
```{r Package instalation}
# eval = FALSE para que al knitear no corran estos bloque de instalación de paquetes, poner como TRUE ci es necesaria la instalación.
install.packages(c("BiocManager", "dplyr", "gplots", "ggplot2","ggrepel", "biomaRt", "pheatmap", "reshape2", "RColorBrewer") )
```
```{r BiocManager packages installation}
BiocManager::install(c("limma", "DESeq2", "AnnotationDbi", "org.Mm.eg.db", "ReportingTools", "GO.db", "GOstats", "pathview", "gage", "gageData", "select", "clusterProfiler"))
```
### 1.2. Import data
We use tximport package to generate the dataframe (counts_data) with our abundance.h5 files. We also need a file with all the transcript names. This file is call Nombre_genes.txt and it can be generated copying the first column from any abundance.tsv file (without header).
We also need a CSV file with the name of the samples and its conditions.
```{r Import counts from abundance.h5}
library(DESeq2) #paquete que proporciona m?todos para analizar la expresión diferencial
library(dplyr) # paquete que contiene funciones dise?adas para la manipulaci?n de marcos de datos
library(tximport) #cuantificaciOn de la transcripci?n se ha hecho con Salmon, Kallisto, etc
library(readxl) #leer archivos con extensi?n excel
library(readODS) #leer archivos con extensi?n open document
library(rhdf5) #paquete que permite el intercambio de conjuntos de datos y/o complejos entre R y otro software
library(ggplot2)
library(pheatmap)
library(reshape2)
library(RColorBrewer)
#setwd("Ruta/al/directorio/de/trabajo")
setwd("C:path/to/your/working/directory")
dir<-getwd()
Nombre_genes <- read.table("Nombre_genes.txt", head =FALSE)
# Import Nombre_genes.txt with all the transcript names
Nombre_genes <- cbind(Nombre_genes, Nombre_genes[,1])
# Duplicate the columnas (tximport needs it)
colnames(Nombre_genes) <- c("geneID", "transcriptsID")
head(Nombre_genes)
# Change columns names from Nombre_genes
write.table(Nombre_genes, file="Nombre_Gen_Transcrito.txt", sep="\t", quote = FALSE, row.names = FALSE)
# Export the file
# We have to create a file called Design.txt with the names of the folders were our abundance.h5 files are. I recommend to have 1 folder for replica and call it describing his experimental group. Design.txt must have "Sample" as column name
samples <- read.table("Design.txt", head = TRUE)
# Import Design info
archivos <- file.path(dir, samples$Sample, "abundance.h5")
archivos
# Import all abundance.h5 files according with teh Design.txt folders names were the files are.
names(archivos) <- c("Name_1", "Name_2","...")
archivos
# Asign name to all abundance.h5 files. Highly recommend to use replica number and experimental group. NAMES MUST BE IN THE SAME ORDER THAT YOU HAVE IMPORT THE FILES WHICH CORRESPOND TO THE Design.txt FILE ORDER.
tx2gene <- read.table("Nombre_Gen_Transcrito.txt", header = TRUE)
head(tx2gene)
# Import duplicate names of transcripts
txi <- tximport(archivos, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "no", txOut = FALSE)
# Create dataframe with counts from the diferents samples
counts_data <- txi[["counts"]]
counts_data <- as.data.frame(counts_data)
# txi has lots of data, we only want counts data. We save it on counts_data variable
head(counts_data)
# Visualizate it
# counts_data <- write.table(counts_data) # <- Se crushea, son muchos datos
```
```{r Import sample imformation}
colData <- read.csv('sample_info.csv')
# Import file with samples and its conditions. KEEP THE SAMPLES ORDER
```
Check if Coldata rows and names from counts_data ante in the same order
```{r Comprobación de "sincronización" entre archivos}
all(colnames(counts_data) %in% rownames(colData))
# Check if both have same names
all(colnames(counts_data) == rownames(colData))
# Check if they are in the same order
```
### 1.3. Counts visualization
```{r Number of counts per sample}
summary(counts_data)
# A summary about the counts of the samples
par(mar=c(8,4,4,1))
barplot(colSums(counts_data)/1e6, names.arg = c("Sample_1", "Sample_2", "..."), xlab = "Samples", ylab = "Millions of counts", main = "Libraries size", col = brewer.pal(4, name = "Accent"))
# Bar plot representation of the counts from samples
```
```{r Logaritmic representation of librery from each cample}
logcountsdata <- log2(1+counts_data)
logcountsdata<- as.data.frame(logcountsdata)
# Dataframe with logaritmic transformation of counts number
hist(logcountsdata$Sample_1, br=20, xlab = "Log (counts)", ylab = "Frecuency", main = "Sample 1 library composition", col = "bisque4", ylim = c(0,80000))
# Repeit this with all the samples you want to see its composition.
```
```{r Scatter plot. Similarit between samples}
plot(logcountsdata$Sample_2, logcountsdata$Sample_1, xlab = "Sample 2 composition", ylab = "Sample 2 composition", main = "Sample 1 vs Sample 2", col = "darkslateblue")
# A 45º angle is obteined if the two samples are just the same. You can do this between replicas or samples from differents groups.
```
### 1.4. DESeq2
```{r Creating DESeq2}
dds <- DESeqDataSetFromMatrix(countData = round (counts_data),
colData = colData, design = ~ Condition1 + Condition2 + Condition1:Condition2)
# Condition1 could be genotype, condition2 could be treatment or whatever depending on you data
dds$Genotipo <- relevel(dds$Condition1, ref = "control_condition1")
dds$Genotipo <- relevel(dds$Condition2, ref = "control_condition2")
# Establish our control for each condition
```
```{r DESeq2 ejecutation}
dds <- DESeq(dds)
# Ejecution
nrow(dds)
# See number of rows. It must be the same than counts_data
```
```{r Filtering DESeq2 results}
dds <- dds[rowSums(counts(dds)) >= 10]
# Filtering to remove low counts rows
nrow(dds)
# See how much rows pass the filter
```
```{r Normalize and export data}
norm_Counts <- counts(dds, normalized = TRUE)
norm_Counts <- as.data.frame(norm_Counts)
nrow(dds)
write.table(norm_Counts, file="Normalized_Counts", sep="\t", quote = FALSE, row.names = TRUE)
```
## 2. DESeq2 result representation
### 2.1. Results explorations
```{r DESeq2 comparations}
resultsNames(dds)
```
```{r Extracting DESq2 results}
# This is a crucial part of the pipeline. Depending on what comparatives do you want to analyze, you have to extract ones or another data. For 2 conditions, lets say Genotype (WT and TG) and treatment (control and treated), the possible comparatives are the next. The things you have to write on contrast or list are the ones that you see running the previous command. I highly recommend to check the article I put in the beggining if there are any doubts (https://rstudio-pubs-static.s3.amazonaws.com/329027_593046fb6d7a427da6b2c538caf601e1.html#example-1-two-group-comparison)
#ANALISIS WT-control vs WT-treated. Effect of the treatment
res <- results(dds, contrast = c("Treatment", "treated", "control"))
#ANALISIS_WT-control vs TG-control. Effect of the genotype
res <- results(dds, contrast = c("Genotype", "TG", "WT"))
# ANALISIS WT-treated vs TG-treated. Differences between WT and TG trated
res <- results(dds, list( c( "Genotype_TG_vs_WT","GenotypoTG.Treatmenttreated")))
# ANALISIS TG-control vs TG-NR. Effect of the treatment on the TG
res <- results(dds, list( c("Treatment_treated_vs_control","GenotypeTG.Treatmenttreated")))
# ANALISIS INTERACTION. Is the effect of the treatment different between genotypes?
res <- results(dds, name = "GenotypeTG.Treatmenttreated")
res
summary(res)
# Visualize results
```
```{r Ordering and exporting results}
res_Ordered <- res[order(res$padj),] # Ordering depending on p adjust value
res_Ordered_DF <- as.data.frame(res_Ordered)
res_Ordered_DF <- na.omit(res_Ordered_DF)
res_Ordered_DF$GENES <- ifelse(res_Ordered_DF$padj <= 0.05, "SIGNIFICANT", "NO SIGNIFICANT")
# Considering significant those with p value adjust less than 0.05
write.table(res_Ordered, file="res_Ordered", sep="\t", quote = FALSE, row.names = TRUE)
head(res_Ordered_DF)
```
```{r Using BiomaRt to add IDs from differents databases}
library(org.Mm.eg.db) # Mus musculus
library(org.Hs.eg.db) # Homo sapiens
library(biomaRt)
library(tidyverse)
res_Ordered_DF$ensembl_transcript_id = gsub("\\..*","", row.names(res_Ordered_DF))
res_Ordered_DF$ensembl_transcript_id_version <- row.names(res_Ordered_DF)
ENSEMBL_IDs <- as.data.frame(res_Ordered_DF$ensembl_transcript_id)
colnames(ENSEMBL_IDs) <- "ensembl_transcript_id"
# input transcript IDs
listEnsembl()
ENSEMBL <- useEnsembl(biomart = "genes") #, mirror = "uswest") If it gives error.
datasets <- listDatasets(ENSEMBL)
# Available data bases
ENSEMBL_CONECTION <- useEnsembl("ensembl", dataset = 'mmusculus_gene_ensembl')#, mirror = 'uswest') if it gives error #Change if you use other organism
attr <- listAttributes(ENSEMBL_CONECTION)
filters <- listFilters(ENSEMBL_CONECTION)
# Filters and attributes available to add differents IDs. Here I add the ensembl and the external name of the genes
ensembl2gene <- getBM(attributes = c("ensembl_transcript_id",
"external_gene_name", "ensembl_gene_id"), # Data to get
filters = "ensembl_transcript_id", # Data type you input
values = ENSEMBL_IDs$ensembl_transcript_id, # Input data
mart = ENSEMBL_CONECTION)
idmap <- merge (x = ENSEMBL_IDs, y = ensembl2gene,
by="ensembl_transcript_id", sort = FALSE) #, all.x=TRUE)
# Merging data to remove genes without some names
res_Ordered_DF <- merge (x = res_Ordered_DF, y = idmap,
by="ensembl_transcript_id", sort = FALSE)
# Merging with results
row.names(res_Ordered_DF) <- res_Ordered_DF$ensembl_transcript_id_version
# Changing row names
write.table(res_Ordered_DF, file = "RESULTADOS_WT_NR_vs_TG_NR.txt", sep = "\t", quote = FALSE)
```
### 2.2. MA plot
```{r MA plot}
DESeq2::plotMA(res, ylim=c(-20,20), xlab = "Counts normaliced mean", ylab = "Log2 Fold Change", main = "MA plot", colSig = "blue2", colNonSig = "darkgray")
```
### 2.3. MA plot (other way)
```{r Other way MA plot}
myColors <- c("darkgray", "blue2")
names(myColors) <- levels(res_Ordered_DF$GENES)
colScale <- scale_colour_manual(name = "GENES",values = myColors)
# Establish colours
MA_plot <- ggplot(res_Ordered_DF, aes(x = log10(baseMean), y = log2FoldChange, color = GENES)) + geom_point() + theme(legend.position = "bottom")
MA_plot <- MA_plot + colScale + labs(title="MA plot") + theme (plot.title = element_text(hjust = 0.5))
MA_plot
```
### 2.4. Vulcano plot
```{r Volcano plot easy}
Vulcano <- ggplot(res_Ordered_DF, main = "Vulcano plot", aes(x = log2FoldChange, y = -log10(padj) , color = GENES)) + geom_point() + theme(legend.position = "bottom")
Vulcano <- Vulcano + colScale + labs(title= "Vulcano plot") + theme (plot.title = element_text(hjust = 0.5))
Vulcano
```
### 2.5. PCA plot and similarity heatmap
```{r PCA plot}
vsd <- vst (dds, blind = FALSE)
plot_PCA <- plotPCA(vsd, intgroup = c("Genotype","Treatment"), returnData = TRUE)
myColors <- c("green", "blueviolet", "red", "black")
names(myColors) <- levels(plot_PCA$group)
colScale <- scale_colour_manual(name = "Grupos",values = myColors)
percentVar <- round(100 * attr(plot_PCA, "percentVar"))
plot_PCA <- ggplot(plot_PCA , aes( x = PC1, y = PC2, color = group)) + geom_point() + xlab(paste0("PC1: ", percentVar[1], "% variance")) + ylab(paste0("PC2: ", percentVar[2], "% variance"))+ labs(title="PCA plot") + theme (plot.title = element_text(hjust = 0.5), legend.position = "bottom") + colScale
plot_PCA
```
```{r Similarity heatmap}
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Greens")) )(255)
par(mar=c(0,0,0,4))
pheatmap(sampleDistMatrix,
clustering_distance_rows= sampleDists,
clustering_distance_cols= sampleDists,
col=colors, angle_col = 90, cluster_rows = FALSE, cluster_cols = FALSE, labels_col =
c("Sample1", "Sample2","..."),labels_row = c("Sample1", "Sample2","..."), border_color = NA)
# Heatmap without clustering
pheatmap(sampleDistMatrix,
clustering_distance_rows= sampleDists,
clustering_distance_cols= sampleDists,
col=colors, angle_col = 90, labels_row = c("Sample1", "Sample2","..."), labels_col =
c("Sample1", "Sample2","..."), border_color = NA)
# Heatmap clustering
```
### 2.6. Heatmap
```{r pheatmap}
Sig_genes <- subset(res_Ordered_DF, padj <= 0.05)
# Keep only significant genes
All_sig <- merge(norm_Counts, Sig_genes, by = 0)
# Merge with annotated dataframe
Sig_counts <- All_sig[,2:9]
# Keep only counts
row.names(Sig_counts) <- All_sig$Row.names
# Change row names
pheatmap(log2(Sig_counts + 1), scale = 'row', show_rownames = FALSE, treeheight_row = FALSE, treeheight_col = FALSE, main = "Heat map", labels_col = c("Sample1", "Sample2", "..."), angle_col = 45, cluster_cols = F, border_color = NA)
```
## 3. Funcional enrichment
### 3.1. Exploring main genes
```{r Biggest fold change gene}
library("ggbeeswarm")
topGene = row.names(Sig_genes)[1]
# Keep only gene from first position
Plot_counts <- plotCounts(dds, gene=topGene, intgroup = c("Genotype","Treatment"), returnData = TRUE)
Plot_counts <- ggplot(Plot_counts[1:8,], aes(x = Genotype:Treatment , y = count)) +
scale_y_log10() + geom_beeswarm(cex = 5)
Plot_counts <- Plot_counts + labs(title="Differential exresion of ...", x = "Groups", y = "Counts") + theme (plot.title = element_text(hjust = 0.5))
Plot_counts
```
```{r Top 10 genes heatmap}
Top_10 <- res_Ordered_DF [1:10,]
Top_10 <- merge(Top_10, norm_Counts, by = 0)
row.names(Top_10) <- Top_10$SYMBOL
keep <- c ("Sample1", "Sample2", "...")
Top_10 <- Top_10[,names(Top_10) %in% keep]
pheatmap(log2(Top_10 +1), treeheight_row = FALSE, treeheight_col = FALSE, main = "Top 10 genes heatmap", labels_col = c("Sample1", "Sample2", "..."), angle_col = 45)
```
```{r Top genes (other representation)}
Top_10_m <- melt(as.matrix(Top_10))
# Reordering info
names(Top_10_m) <- c("genes", "Sample", "Exp")
# Rename columns
Top_10_m$Tratamiento <- ifelse(grepl("control", Top_10_m$Sample), "control", "treated")
myColors2 <- c("darkorange", "blueviolet")
names(myColors2) <- levels(Top_10_m$Genotype)
colScale2 <- scale_colour_manual(name = "Treatment",values = myColors2)
TOP_GENES <- ggplot(Top_10_m , aes( x = Tratamiento, y = log2(Exp +1), color = Tratamiento)) + geom_point() + facet_grid(~ genes) + labs(title="Differential expresion of ...", x = "", y = "Log2 counts") + theme (plot.title = element_text(hjust = 0.5), legend.position = "bottom", axis.text.x =element_blank()) + colScale2
TOP_GENES
```
### 3.2. Gene Ontology genes lists
```{r Genes lists}
library(GO.db)
library(GOstats)
sig_lfc <- 0.5
# Establish significative fold change
select_Genes_Up <- unique(All_sig[All_sig$log2FoldChange > sig_lfc, 'ensembl_gene_id'])
select_Genes_Down <- unique(All_sig[All_sig$log2FoldChange < (-sig_lfc), 'ensembl_gene_id'])
# Extrat only genes that overcome the fold change
DE_Genes <- unique(All_sig$ensembl_gene_id) # Keep all diferenctialy expressed genes
cutOff <- 0.01
# El conjunto total de genes
DE_GENES <- as.data.frame(DE_Genes)
write.table(DE_GENES, "DE_GENES.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
GENES_UP <- as.data.frame(select_Genes_Up)
GENES_DOWN <- as.data.frame(select_Genes_Down)
write.table(GENES_UP, "GENES_UP.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(GENES_DOWN, "GENES_DOWN.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Saving a txt file with the genes up, down and all that are differential expresed
Nombre_genes$ensembl_transcript_id = gsub("\\..*","", Nombre_genes$geneID)
Nombre_genes <- getBM(attributes = c("ensembl_transcript_id",
"ensembl_gene_id"), # Datos a obtener
filters = "ensembl_transcript_id", # Tipo de datos que aportas
values = Nombre_genes$ensembl_transcript_id, # Input de datos
mart = ENSEMBL_CONECTION)
# Annoting all the genes detected
universal_Genes <- unique(Nombre_genes$ensembl_gene_id)
Background_Genes <- as.data.frame(universal_Genes)
write.table(Background_Genes, "Background.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Saving a list with all the detected genes (background)
```
These list of genes can be used in online programs like ShinyGO or panther to do the enrichment analysis. You can also annotate these genes (taht are in ensembl ID) without using R in Biomart website and read about them on uniprot.
There are also a program called IDEP that automatically do all the diferential expresion analysis