-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathexercise-sheet-10.Rmd
181 lines (110 loc) · 6.65 KB
/
exercise-sheet-10.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
title: "Exercise sheet 10: RNA-RNA interaction - Seeds"
---
---------------------------------
The following assignments should be solved using the tool `IntaRNA`.
As some advanced `IntaRNA` options are required, you will need to install the tool locally.
There are several ways of installing `IntaRNA` locally (conda, docker, binaries, etc...).
Please follow the instructions suitable for your operating system on [GitHub](https://github.com/BackofenLab/IntaRNA/#installation).
Additionally, you will need to download the <a href="https://raw.githubusercontent.com/BackofenLab/V_RNA-pages/main/assets/misc/query.fa" download="query.fa">query.fa</a> and <a href="https://raw.githubusercontent.com/BackofenLab/V_RNA-pages/main/assets/misc/target.fa" download="target.fa">target.fa</a> files.
These files contain the sequences used for RNA-RNA interaction prediction.
---------------------------------
# Exercise 1 - Hybridization versus Accessibility
<!--- --------------------------------- -->
::: {.question data-latex=""}
Use `IntaRNA` to predict an interaction between the query and the target sequences given within the `FASTA` file.
1. Use hybridization energy only.
2. Use the accessibility-based model.
<!--- --------------------------------- -->
i. What differences do you observe when using both models?
ii. Compare the interaction energies for your results. What do you observe?
:::
#### {.tabset}
##### Hide
##### Hint 1 - Parameters
::: {.answer data-latex=""}
To see the available input parameters use the `--help` (`-h`) or `--fullhelp` options.
:::
##### Hint 2 - Function calls
::: {.answer data-latex=""}
Hybridization:
``IntaRNA -q query.fa -t target.fa --acc=N``
Accessibility:
``IntaRNA -q query.fa -t target.fa``
:::
##### Solution
::: {.answer data-latex=""}
i. When only considering hybridization minimization, the resulting RNA-RNA interaction (RRI) prediction consists of several intermolecular stackings and the interaction region is very long.
On the other hand, if we additionally consider the accessibility of the interaction region, the resulting prediction consists of only one (or a few) stable intermolecular base pair stackings.
ii. When comparing the interaction energies, the energy for the hybridization-only mode is much lower than the accessibility-based one.
The longer the resulting interaction, the lower the energy of $E_{hybrid}$ and thus the more stable the model.
The hybridization approach ignores the potentially necessary dissolving of intramolecular base pairs.
This dissolving of intramolecular base pairs is considered in the accessibility-based approach by adding ED (positive) values.
If we would take the long interaction result of the hybridization approach and evaluate it using the accessibility-based approach, we would end up with the same $E_{hybrid}$.
But the ED terms would be very high, thus leading to an overall higher energy $E$ compared to the result from the accessibility based approach.
The accessibility-based approach looks for the optimal trade-off between $E_{hybrid}$ and ED values, leading to an overall better (and biologically more realistic) result.
In other words, the energy $E$ would be overall higher, but the resulting interaction would be more reasonable as it takes into account intramolecular base pairing.
:::
#### {-}
<!--- --------------------------------- -->
# Exercise 2 - Seed constraints
<!--- --------------------------------- -->
::: {.question data-latex=""}
We want to investigate the effect of using seed constraints in RNA-RNA interaction prediction.
Change the output of `IntaRNA` to `CSV` format and increase the number of sub-optimal predictions that should be returned.
Additionally, make sure that you use the exact prediction mode.
Using the two sequences from before, try out the following seed constraints:
1. No seed constraints
2. Different seed lengths between 2-10 base pairs
<!--- --------------------------------- -->
i. Compare the predicted interactions to each-other
ii. Compare the runtime of the different settings
iii. Compare the number of suboptimal interactions produced by the different settings
:::
#### {.tabset}
##### Hide
##### Hint 1 - Parameters
::: {.answer data-latex=""}
To see the available input parameters use the `--help` (`-h`) or `--fullhelp` options.
:::
##### Hint 2 - Function calls
::: {.answer data-latex=""}
Testing several seed lengths:
```
IntaRNA -q query.fa -t target.fa --mode M --noSeed
IntaRNA -q query.fa -t target.fa --mode M --seedBP 2
IntaRNA -q query.fa -t target.fa --mode M --seedBP 4
IntaRNA -q query.fa -t target.fa --mode M --seedBP 7
IntaRNA -q query.fa -t target.fa --mode M --seedBP 9
```
Counting the resulting sub-optimal sequences:
```
IntaRNA -q query.fa -t target.fa --mode M --noSeed --outMode C -n 1000 | tail +2 | wc -l
IntaRNA -q query.fa -t target.fa --mode M --seedBP 2 --outMode C -n 1000 | tail +2 | wc -l
IntaRNA -q query.fa -t target.fa --mode M --seedBP 4 --outMode C -n 1000 | tail +2 | wc -l
IntaRNA -q query.fa -t target.fa --mode M --seedBP 7 --outMode C -n 1000 | tail +2 | wc -l
IntaRNA -q query.fa -t target.fa --mode M --seedBP 9 --outMode C -n 1000 | tail +2 | wc -l
```
Check the runtime:
```
time IntaRNA -q query.fa -t target.fa --mode M --noSeed --outMode C -n 1000 | wc -l
time IntaRNA -q query.fa -t target.fa --mode M --seedBP 2 --outMode C -n 1000 | wc -l
```
You can additionally add `--outDeltaE 1` to restrict the values of sub-optimal interaction energies accepted in the output.
:::
##### Solution
::: {.answer data-latex=""}
Observation 1:
Counting the number of suboptimal interactions shows the concept of using a seed will reduce the runtime. `--noSeed` results in the highest amount of sub-optimal interaction preditions
Using a larger seed reduces the amount of possible interactions, thus reducing the search-space and the runtime of `IntaRNA`.
If the seed becomes too large, no feasible interactions will be found anymore with an energy value $< 0$.
Observation 2:
We have seen in the lecture that in the biological process starts by forming a stable seed interaction which is then extended into a longer more stable interaction (e.g. sRNA or miRNA interaction).
Therefore, using a seed can help to remove some of the biologically less likely sub-optimal interactions.
Observation 3:
Comparing interaction predictions resulting from using seeds of length $2$ and $4$ shows that some short interactions may not sum up to an energy value below $0$, and are therefore not predicted.
Thus we sometimes observe less sequences for very low seed lengths.
The higher the initial seed length is set, the fewer interactions are predicted as fewer interactions are possible.
:::
#### {-}
<!--- --------------------------------- -->