
Tutorial Augment Annotation

Andre Kahles edited this page Jul 15, 2015 · 2 revisions

After getting started by downloading and inspecting the SplAdder example data, you can now complete phase 1 and augment the annotation file. There are many ways to do this. I recommend writing a small shell script that holds all your commands, which we can then alter and call over and over again without retyping them. However, the examples also work if you type each command by hand.

As all the examples shipped with SplAdder reside in the examples directory, we create a new directory to play around in. To do that, go to the SplAdder base directory and create a new directory tutorial:

$>mkdir -p tutorial

This will be the output directory we will use from now on.

To run phase 1 with default parameters, SplAdder only needs three things:

  • an annotation file - parameter -a
  • one or multiple alignment files - parameter -b
  • an output directory - parameter -o

All of these ingredients either come with the SplAdder example data or have already been generated by us. So the command for our first SplAdder run is:

$> python spladder.py -a examples/TAIR10_GFF3_genes.tiny.gff \
                      -b examples/NMD_WT1.tiny.bam,examples/NMD_WT2.tiny.bam,examples/NMD_DBL1.tiny.bam,examples/NMD_DBL2.tiny.bam \
                      -o tutorial \
                      -T no

That could all be written on one line, but we have broken it up here for readability. Note that we took the annotation and alignment files from the provided example data. Further, when using multiple alignment files, as in our example, we need to separate them with commas and without any spaces in between. Finally, we use the option -T no here to switch off the default behavior of automatically performing phase 2 (extraction of alternative splicing events). We do this so we can inspect the result directory step by step.
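If you prefer to assemble the call programmatically rather than in a shell script, the comma-joining rule described above is easy to get right in a small helper. The following is only a sketch: the function name build_spladder_command is hypothetical and not part of SplAdder, and it uses nothing beyond the -a, -b, -o and -T options shown in this tutorial.

```python
def build_spladder_command(annotation, alignments, outdir, extract_events=False):
    """Assemble the phase 1 call as an argument list (hypothetical helper)."""
    if not alignments:
        raise ValueError("at least one alignment file is required")
    if any("," in path for path in alignments):
        raise ValueError("alignment paths must not contain commas")
    cmd = [
        "python", "spladder.py",
        "-a", annotation,
        # Multiple alignment files are joined with commas, without spaces.
        "-b", ",".join(alignments),
        "-o", outdir,
    ]
    if not extract_events:
        # Switch off the automatic phase 2 run, as in the tutorial call.
        cmd += ["-T", "no"]
    return cmd


samples = ["NMD_WT1", "NMD_WT2", "NMD_DBL1", "NMD_DBL2"]
command = build_spladder_command(
    "examples/TAIR10_GFF3_genes.tiny.gff",
    ["examples/%s.tiny.bam" % name for name in samples],
    "tutorial",
)
print(" ".join(command))
```

The returned list can be passed to subprocess.call from the SplAdder base directory, which avoids any shell quoting issues.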

After running the above command, we can inspect our output directory:

$>ls tutorial
spladder

The directory tutorial now contains a sub-directory spladder. This directory contains internal data structures that represent the augmented splicing graph for each sample.

$>cd tutorial/spladder
$>ls -1
genes_graph_conf3.NMD_DBL1.tiny.pickle
genes_graph_conf3.NMD_DBL2.tiny.pickle
genes_graph_conf3.NMD_WT1.tiny.pickle
genes_graph_conf3.NMD_WT2.tiny.pickle
genes_graph_conf3.merge_graphs.count.pickle
genes_graph_conf3.merge_graphs.pickle

SplAdder stores most of its internal files as pickle files. If you are familiar with Python pickles, you can load these files into an interactive Python session to explore them. However, this is not necessary to proceed, and it is fine if these files stay under the hood. We mention them here for completeness, so that you better understand how SplAdder is built. The file ending in *.count.pickle is actually in HDF5 format and keeps the .pickle ending mostly for historical reasons.

Now let us have a quick look at the generated files. The four files that contain our sample names are the splicing graphs built by augmenting the given annotation with the RNA-Seq data from the respective alignment file. In its default mode, SplAdder integrates these individual graphs into a common splicing graph, genes_graph_conf3.merge_graphs.pickle, that represents all edges found in the individual samples. Finally, genes_graph_conf3.merge_graphs.count.pickle contains quantification information for each node and edge of the graph. Note that the count file stores counts for a segment graph rather than the splicing graph. The segment graph is generated from the splicing graph by splitting all nodes into non-overlapping segment nodes. This is mostly done for efficient counting and need not concern us any further here.

SplAdder offers several ways to integrate input files. However, covering them would lead too far for this introductory tutorial, and we refer to a later tutorial on working with multiple samples.
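To give an intuition for the segment graph, here is a minimal sketch of how overlapping exon nodes can be split into non-overlapping segments. This is not SplAdder's actual implementation: the function name split_into_segments is hypothetical, and we assume exons are given as half-open (start, end) intervals.

```python
def split_into_segments(exons):
    """Split possibly overlapping (start, end) intervals into
    non-overlapping segments (illustrative sketch only)."""
    # Every exon start or end is a potential segment boundary.
    boundaries = sorted({pos for start, end in exons for pos in (start, end)})
    segments = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a piece only if at least one exon covers it completely;
        # pieces that fall between exons are intronic and are dropped.
        if any(start <= left and right <= end for start, end in exons):
            segments.append((left, right))
    return segments


# Two overlapping exons and one separate exon.
exons = [(100, 200), (150, 300), (400, 500)]
print(split_into_segments(exons))
# prints [(100, 150), (150, 200), (200, 300), (400, 500)]
```

Each read then only needs to be counted once per segment, and the count for any exon can be recovered from the segments it consists of.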

One last note on the output files. As you can see, each filename contains the confidence level. Thus, you can use the same output directory for multiple different confidence levels. Let's just try that:

### called from the SplAdder base directory
$> python spladder.py -a examples/TAIR10_GFF3_genes.tiny.gff \
                      -b examples/NMD_WT1.tiny.bam,examples/NMD_WT2.tiny.bam,examples/NMD_DBL1.tiny.bam,examples/NMD_DBL2.tiny.bam \
                      -o tutorial \
                      -T no \
                      -c 2

We have just added the confidence level parameter -c to our call. After running it, our result directory contains:

$> ls -1 tutorial/spladder/
genes_graph_conf2.NMD_DBL1.tiny.pickle
genes_graph_conf2.NMD_DBL2.tiny.pickle
genes_graph_conf2.NMD_WT1.tiny.pickle
genes_graph_conf2.NMD_WT2.tiny.pickle
genes_graph_conf2.merge_graphs.count.pickle
genes_graph_conf2.merge_graphs.pickle
genes_graph_conf3.NMD_DBL1.tiny.pickle
genes_graph_conf3.NMD_DBL2.tiny.pickle
genes_graph_conf3.NMD_WT1.tiny.pickle
genes_graph_conf3.NMD_WT2.tiny.pickle
genes_graph_conf3.merge_graphs.count.pickle
genes_graph_conf3.merge_graphs.pickle

As you can see, the files for our lower confidence level have just been added to the result directory without changing any of the previous files.
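If you want to check which confidence levels are already present in an output directory, the naming scheme can be parsed directly. The following is a minimal sketch: the function name group_by_confidence and the regular expression are our own assumptions, derived only from the filenames shown above, and are not part of SplAdder.

```python
import re
from collections import defaultdict

# Assumed naming scheme: genes_graph_conf<level>.<label>.pickle
GRAPH_FILE = re.compile(r"^genes_graph_conf(\d+)\..+\.pickle$")


def group_by_confidence(filenames):
    """Group SplAdder graph files by the confidence level in their name."""
    groups = defaultdict(list)
    for name in filenames:
        match = GRAPH_FILE.match(name)
        if match:  # silently skip anything that does not follow the scheme
            groups[int(match.group(1))].append(name)
    return dict(groups)


listing = [
    "genes_graph_conf2.NMD_WT1.tiny.pickle",
    "genes_graph_conf2.merge_graphs.count.pickle",
    "genes_graph_conf3.NMD_WT1.tiny.pickle",
    "genes_graph_conf3.merge_graphs.pickle",
]
for level, files in sorted(group_by_confidence(listing).items()):
    print(level, len(files))
```

In practice, the list of names could come from os.listdir("tutorial/spladder") instead of the hard-coded example.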

Now we are ready to detect alternative events from the splicing graphs that we just generated.

Home > Tutorial

  • [1a: Getting Started] (Tutorial-Getting-Started)
  • [1b: Augment Annotation] (Tutorial-Augment-Annotation)
  • [1c: Detect Events] (Tutorial-Event-Detection)