Authors: | Mary Piper, Meeta Mistry, Data Carpentry contributors |
---|---|
Date: | Wednesday, 10 Jan 2018 |
- This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
See the original versions of this content and more at https://hbctraining.github.io/Intro-to-R
A common misconception is that R is just a programming language, but in fact it is much more than that. Think of R as an environment for statistical computing and graphics, which brings together a number of features to provide powerful functionality.
The R environment combines:
- effective handling of big data
- collection of integrated tools
- graphical facilities
- simple and effective programming language
R is a powerful, extensible environment. It has a wide range of statistics and general data analysis and visualization capabilities.
- Data handling, wrangling, and storage
- Wide array of statistical methods and graphical techniques available
- Easy to install on any platform and use (and it’s free!)
- Open source with a large and growing community of peers
RStudio is a freely available, open-source Integrated Development Environment (IDE). RStudio provides an environment with many features that make using R easier, and it is a great alternative to working with R in the terminal.
- Graphical user interface, not just a command prompt
- Great learning tool
- Free for academic use
- Platform agnostic
- Open source
Let's create a new project directory for our "Introduction to R" lesson today.
- Open RStudio
- Go to the File menu and select New Project.
- In the New Project window, choose New Directory. Then, choose Empty Project. Name your new directory Intro-to-R and then "Create the project as subdirectory of:" the Desktop (or location of your choice).
- Click on Create Project.
- After your project is completed, if the project does not automatically open in RStudio, then go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
- When RStudio opens, you will see three panels in the window.
- Go to the File menu and select New File, and select R Script. The RStudio interface should now look like the screenshot below.
The RStudio interface has four main panels:
- Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
- Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
- Environment/History: environment shows all active objects and history keeps track of all commands run in console
- Files/Plots/Packages/Help
Before we organize our working directory, let's check to see where our current working directory is located by typing into the console:
getwd()
Your working directory should be the Intro-to-R folder constructed when you created the project. The working directory is where RStudio will automatically look for any files you bring in and where it will automatically save any files you create, unless otherwise specified.
You can visualize your working directory by selecting the Files tab from the Files/Plots/Packages/Help window.
If you wanted to choose a different directory to be your working directory, you could navigate to a different folder in the Files tab, then, click on the More dropdown menu and select Set As Working Directory.
To organize your working directory for a particular analysis, you should separate the original data (raw data) from intermediate datasets. For instance, you may want to create a data/ directory within your working directory that stores the raw data, and have a results/ directory for intermediate datasets and a figures/ directory for the plots you will generate.
Let's create these three directories within your working directory by clicking on New Folder within the Files tab.
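The same three folders could also be created from the console instead of the Files tab. The lines below are a minimal sketch, assuming the current working directory is the project folder; the names `project_dir` and `sub_dirs` are introduced here only for illustration.

```r
# Base folder for the project; assumed here to be the current working directory
project_dir <- getwd()

# The three sub-directories suggested in this lesson
sub_dirs <- c("data", "results", "figures")

for (d in sub_dirs) {
  path <- file.path(project_dir, d)
  # dir.create() warns if the folder already exists, so check first
  if (!dir.exists(path)) {
    dir.create(path)
  }
}

# List the folders now present in the project directory
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```

Either route gives the same result; the point-and-click approach is simply easier to follow while you are still getting to know the interface.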
When finished, your working directory should look like:
There are a few files that we will be working with in the next few lessons, and you can access them using the links provided below. Right click on each link and select "Save link as...", then choose ~/Desktop/Intro-to-R/data as the destination of the file. You should now see the file appear in the data folder of your working directory. We will discuss these files a bit later in the lesson.
Download the normalized counts file by right clicking here.
Download metadata file using this link.
NOTE: If the files download automatically to some other location on your laptop, you can move them to your working directory using your file explorer or Finder (outside RStudio), or by navigating to the files in the Files tab of the bottom right panel of RStudio.
This is more of a housekeeping task. We will be writing long lines of code in our script editor and want to make sure that the lines "wrap" and you don't have to scroll back and forth to look at your long line of code.
Click on "Tools" at the top of your RStudio screen and click on "Global Options" in the pull down menu.
On the left, select "Code" and put a check against "Soft-wrap R source files." Make sure you click the "Apply" button at the bottom of the window before clicking "OK".
Now that we have our interface and directory structure set up, let's start playing with R! There are two main ways of interacting with R in RStudio: using the console or using the script editor (plain text files that contain your code).
The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session.
Let's test it out:
3 + 5
Best practice is to enter the commands in the script editor, and save the script. You are encouraged to comment liberally to describe the commands you are running using #. This way, you have a complete record of what you did, you can easily show others how you did it and you can do it again later on if needed.
The RStudio script editor allows you to 'send' the current line or the currently highlighted text to the R console by clicking on the 'Run' button in the upper-right hand corner of the script editor. Alternatively, you can run the code by pressing the Ctrl and Enter keys at the same time as a shortcut.
Now let's try entering commands to the script editor and using the comments character # to add descriptions and highlighting the text to run:
# Intro to R Lesson
# Feb 16th, 2016
# Interacting with R
## I am adding 3 and 5. R is fun!
3+5
You should see the command run in the console and output the result.
What happens if we do that same command without the comment symbol #? Re-run the command after removing the # sign in the front:
I am adding 3 and 5. R is fun!
3+5
Now R is trying to run that sentence as a command, and it doesn't work. We get an error in the console (Error: unexpected symbol in "I am"), which means that the R interpreter did not know what to do with that command.
Interpreting the command prompt can help you understand when R is ready to accept commands. Listed below are the different states of the command prompt and how you can exit a command:
Console is ready to accept commands: >
If R is ready to accept commands, the R console shows a > prompt.
When the console receives a command, either by directly typing into the console or running from the script editor (Ctrl-Enter), R will try to execute it.
After running, the console will show the results and come back with a new > prompt to wait for new commands.
Console is waiting for you to enter more data: +
If R is still waiting for you to enter more data because the command isn't complete yet, the console will show a + prompt. It means that you haven't finished entering a complete command. Often this is because you have not 'closed' a parenthesis or quotation.
Escaping a command and getting a new prompt: ESC
If you're in RStudio and you can't figure out why your command isn't running, you can click inside the console window and press ESC to escape the command and bring back a new prompt >.
Exercise
- Try highlighting only 3 + from your script editor and running it. Find a way to bring back the command prompt > in the console.
R is commonly used for handling big data, and so it only makes sense that we learn about R in the context of some kind of relevant data. We had previously downloaded two files to our working directory. Since we will be working with these files over the course of the workshop, let's take a few minutes to familiarize ourselves with the data.
In this example dataset, we have collected whole brain samples from 12 mice and want to evaluate expression differences between them. The expression data represents normalized count data obtained from RNA-sequencing of the 12 brain samples. This data is stored in a comma separated values (CSV) file as a 2-dimensional matrix, with each row corresponding to a gene and each column corresponding to a sample.
We have another file that contains information about the data, known as metadata. Our metadata is also stored in a CSV file. In this file, each row corresponds to a sample and each column contains some information about each sample.
The first column contains the row names, and note that these are identical to the column names in our expression data file above (albeit, in a slightly different order). The next few columns contain information about our samples that allow us to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout). What types of categories do you observe in the remaining columns?
R is particularly good at handling this type of categorical data. Rather than simply storing this information as text, the data is represented in a specific data structure which allows the user to sort and manipulate the data in a quick and efficient manner.
Before we move on to more complex concepts and getting familiar with the language, we want to point out a few things about best practices when working with R which will help you stay organized in the long run:
- Code and workflow are more reproducible if we can document everything that we do. Our end goal is not just to "do stuff", but to do it in a way that anyone can easily and exactly replicate our workflow and results. All code should be written in the script editor and saved to file, rather than working in the console.
- The R console should be mainly used to inspect objects, test a function or get help.
- Use # signs to comment. Comment liberally in your R scripts. This will help future you and other collaborators know what each line of code (or code block) was meant to do. Anything to the right of a # is ignored by R. (A shortcut for this is Ctrl + Shift + C if you want to comment an entire chunk of text)
Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.
Below is an example script highlighting the many different "parts of speech" for R (syntax):
- the comments # and how they are used to document functions and their content
- variables and functions
- the assignment operator <-
- the = for arguments in functions
NOTE: indentation and consistency in spacing is used to improve clarity and legibility
# Load libraries
library(Biobase)
library(limma)
library(ggplot2)
# Setup directory variables
baseDir <- getwd()
dataDir <- file.path(baseDir, "data")
metaDir <- file.path(baseDir, "meta")
resultsDir <- file.path(baseDir, "results")
# Load data
meta <- read.delim(file.path(metaDir, '2015-1018_sample_key.csv'), header=TRUE, sep="\t", row.names=1)
To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to x by executing:
x <- 3
The assignment operator (<-) assigns values on the right to variables on the left. In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke.
A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to "buckets," where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.
In the example above, we created a variable or a "bucket" called x. Inside we put a value, 3.
Let's create another variable called y and give it a value of 5.
y <- 5
When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.
y
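The two ways of printing a value can be compared side by side. This is a small sketch that reuses the variable y from the lesson:

```r
# Plain assignment: nothing is printed to the console
y <- 5

# Wrapping the assignment in parentheses assigns the value AND prints it
(y <- 5)

# Typing the variable name on its own also prints the stored value
y
```

The parenthesised form is handy for a quick check while exploring, although a separate line with just the variable name is often easier to read in a saved script.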
You can also view information on the variable by looking in your Environment window in the upper right-hand corner of the RStudio interface.
Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation:
x + y
Try assigning the results of this operation to another variable called number.
number <- x + y
Exercises
- Try changing the value of the variable x to 5. What happens to number?
- Now try changing the value of variable y to contain the value 10. What do you need to do, to update the variable number?
Variables can be given almost any name, such as x, current_temperature, or subject_id. However, there are some rules / suggestions you should keep in mind:
- Make your names explicit and not too long.
- Avoid names starting with a number (2x is not valid but x2 is).
- Avoid names of fundamental functions in R (e.g., if, else, for, see this site for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., c, T, F, mean, data) as variable names. When in doubt check the help to see if the name is already in use.
- Use nouns for object names and verbs for function names.
- Keep in mind that R is case sensitive (e.g., genome_length is different from Genome_length).
- Be consistent with the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are Hadley Wickham's style guide and Google's.
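A few of these naming rules can be checked directly in R. The sketch below uses the base function make.names(), which rewrites a string into a syntactically valid name, so a string that comes back unchanged was already valid. The candidate names are made up for illustration, and the check only covers syntax, not the style suggestions.

```r
# Candidate names to check (illustrative examples only)
candidates <- c("x2", "2x", "genome_length", "if", "subject id")

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- candidates == make.names(candidates)

# Show each candidate next to the result of the check
data.frame(candidates, is_valid)

# Case sensitivity: these are two different variables
genome_length <- 4.6
Genome_length <- 3000
genome_length == Genome_length  # FALSE, the two names hold different values
```

Names such as c, mean or data pass this check because they are syntactically valid; avoiding them is a matter of good practice rather than a rule that R enforces.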
Variables can contain values of specific types within R. The six data types that R uses include:
- "numeric" for any numerical value
- "character" for text values, denoted by using quotes ("") around value
- "integer" for integer numbers (e.g., 2L, the L indicates to R that it's an integer)
- "logical" for TRUE and FALSE (the Boolean data type)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that's all we're going to say about them
- "raw" that we won't discuss further
The table below provides examples of each of the commonly used data types:
Data Type | Examples |
---|---|
Numeric: | 1, 1.5, 20, pi |
Character: | "anytext", "5", "TRUE" |
Integer: | 2L, 500L, -17L |
Logical: | TRUE, FALSE, T, F |
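You can ask R which data type a value has with the base function class(). The sketch below applies it to one example of each of the six types listed above; the vector name `types` is introduced only so the results can be viewed together.

```r
# class() reports the data type of a value
types <- c(
  class(1.5),        # "numeric"
  class("anytext"),  # "character"
  class(2L),         # "integer"
  class(TRUE),       # "logical"
  class(1+4i),       # "complex"
  class(as.raw(2))   # "raw"
)
types

# Quotes matter: "5" is text, while 5 is a number
class("5")
class(5)
```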
We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when number was created, the result of the mathematical operation was a single value. Variables can store more than just a single value; they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It's basically just a collection of values, mainly either numbers, characters, or logical values.
Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.
For example, if you were to try to create a vector that mixes numbers and text, R would coerce the numbers into character values so that every element has the same data type.
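As a small illustration of this coercion, the sketch below builds two mixed vectors. The values and the names `mixed` and `num_logical` are made up for this example and are not taken from the lesson's figures.

```r
# Mixing a number, text and a logical value: everything becomes character
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logical values: TRUE becomes 1 and FALSE becomes 0
num_logical <- c(1, TRUE, FALSE)
num_logical         # 1 1 0
class(num_logical)  # "numeric"
```

R picks the most flexible type that can represent every element, which is why a single piece of text turns the whole vector into characters.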
The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements.
Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single entity (bucket).
Let's create a vector of genome lengths and assign it to a variable called glengths.
Each element of this vector contains a single numeric value, and three values will be combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with a comma.
glengths <- c(4.6, 3000, 50000)
glengths
Note your environment shows the "glengths" variable is numeric and tells you the "glengths" vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values).
A vector can also contain characters. Create another vector called species with three elements, where each element corresponds with the genome sizes vector (in Mb).
species <- c("ecoli", "human", "corn")
species
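The information shown in the Environment window can also be retrieved from the console. This is a brief sketch using the base functions class(), length() and str() on the two vectors created above:

```r
glengths <- c(4.6, 3000, 50000)
species <- c("ecoli", "human", "corn")

class(glengths)   # "numeric"
length(glengths)  # 3
class(species)    # "character"

# str() gives a compact summary similar to the Environment window
str(glengths)
str(species)
```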
---
Exercise
- Create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine "c()" function to do this.
- Print the combined vector in the console. What looks different compared to the original vectors?
A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level). Factors are built on top of integer vectors such that each factor level is assigned an integer value, creating value-label pairs.
Let's create a factor vector and explore a bit more. We'll start by creating a character vector describing three different levels of expression:
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
Now we can convert this character vector into a factor using the factor() function:
expression <- factor(expression)
So, what exactly happened when we applied the factor() function?
The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high. By turning the expression vector into a factor, the categories are assigned integers alphabetically, with high=1, low=2, medium=3. This in effect assigns the different factor levels. You can view the newly created factor variable and the levels in the Environment window.
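The levels and their underlying integers can be inspected from the console with levels() and as.integer(). The last part of this sketch goes one step beyond the lesson: it uses the `levels` argument of factor() to request a non-alphabetical order, and the name `expression_ordered` is introduced only for illustration.

```r
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expression <- factor(expression)

# Levels are assigned alphabetically by default
levels(expression)      # "high" "low" "medium"

# The integer stored for each element
as.integer(expression)  # 2 1 3 1 2 3 1

# Optionally, state the order of the levels yourself
expression_ordered <- factor(as.character(expression),
                             levels = c("low", "medium", "high"))
levels(expression_ordered)      # "low" "medium" "high"
as.integer(expression_ordered)  # 1 3 2 3 1 2 3
```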
Exercises
Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.
samplegroup <- c("CTL", "CTL", "CTL", "KO", "KO", "KO", "OE", "OE", "OE")
- Create a vector named samplegroup using the code above. This vector will contain nine elements: 3 control ("CTL") samples, 3 knock-out ("KO") samples, and 3 over-expressing ("OE") samples.
- Turn samplegroup into a factor data structure.
A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it's a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors).
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.
We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function, and giving the function the different vectors we would like to bind together. This function will only work for vectors of the same length.
df <- data.frame(species, glengths)
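Beware of how data.frame() treats character vectors: in R versions before 4.0.0, the default behaviour is to turn character vectors into factors, whereas from R 4.0.0 onwards they are kept as character vectors unless you set stringsAsFactors = TRUE. Print your data frame to the console:
df
Upon inspection of our dataframe, we see that the species values are displayed without quotation marks. This happens for both character and factor columns, so the printed output alone does not tell you whether species was converted; in R versions before 4.0.0 it automatically got converted into a factor inside the data frame. We will show you how to change the default behavior of a function in the next lesson. Note that you can view your data.frame object by clicking on its name in the Environment window.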
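Because the default depends on the R version, one way to see what happened to the species column is to set the stringsAsFactors argument of data.frame() explicitly and check the column type. This is a minimal sketch; the names `df_chr` and `df_fct` are introduced here only for illustration.

```r
species <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)

# Keep character vectors as characters
df_chr <- data.frame(species, glengths, stringsAsFactors = FALSE)
class(df_chr$species)  # "character"

# Convert character vectors into factors
df_fct <- data.frame(species, glengths, stringsAsFactors = TRUE)
class(df_fct$species)  # "factor"

# Both data frames have 3 rows and 2 columns
dim(df_chr)
dim(df_fct)
```

Stating the argument explicitly makes a script behave the same way regardless of which R version runs it.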
A key feature of R is functions. Functions are "self contained" modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
The general usage for a function is the name of the function followed by parentheses:
function_name(input)
The input(s) are called arguments, which can include:
- the physical object (any data structure) on which the function carries out a task
- specifications that alter the way the function operates (e.g. options)
Not all functions take arguments, for example:
getwd()
However, most functions can take several arguments. If you don't specify a required argument when calling the function, you will either receive an error or the function will fall back on using a default.
The defaults represent standard values that the author of the function specified as being "good enough in standard cases". An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.
We have already used a few examples of basic functions in the previous lessons, i.e., getwd(), c(), and factor(). These functions are available as part of R's built-in capabilities, and we will explore a few more of these base functions below.
You can also get functions from external packages or libraries (which we'll talk about in a bit), or even write your own.
Let's revisit c(), a function that we have used previously to combine data into vectors. The arguments it takes are a collection of numbers or character strings (separated by commas). The c() function performs the task of combining the numbers or characters into a single vector. You can also use the function to add elements to an existing vector:
glengths <- c(glengths, 90) # adding at the end
glengths <- c(30, glengths) # adding at the beginning
What happens here is that we take the original vector glengths (containing three elements), and we are adding another item to either end. We can do this over and over again to build a vector or a dataset.
Since R is used for statistical computing, many of the base functions involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let's try finding the square root of 81:
sqrt(81)
Now what would happen if we called the function (i.e., ran the function) on a vector of values instead of a single value?
sqrt(glengths)
In this case the task was performed on each individual value of the vector glengths and the respective results were displayed.
Let's try another function, this time using one that we can change some of the options (arguments that change the behavior of the function), for example round:
round(3.14159)
We can see that we get 3. That's because the default is to round to the nearest whole number. What if we want a different number of decimal places?
The best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio that will provide a description of the function, usage, arguments, details, and examples:
?round
Alternatively, if you are familiar with the function but just need to remind yourself of the names of the arguments, you can use:
args(round)
Even more useful is the example() function. This will allow you to run the examples section from the Online Help to see exactly how it works when executing the commands. Let's try that for round():
example("round")
In our example, we can change the number of digits returned by adding an argument. We can type digits=2 or however many we may want:
round(3.14159, digits=2)
NOTE: If you provide the arguments in the exact same order as they are defined (in the help manual) you don't have to name them:
round(3.14159, 2)
However, this is usually not recommended practice because it involves a lot of memorization. In addition, it makes your code difficult to read for your future self and others, especially if your code includes functions that are not commonly used. (It is, however, OK to omit the names of the arguments for basic functions like mean, min, etc.) Another advantage of naming arguments is that the order doesn't matter. This is useful when a function has many arguments.
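The effect of naming arguments can be sketched with round(), whose two arguments are called x and digits in the help manual. The comparison with signif() at the end is an addition not covered in the lesson, included only to show the difference between decimal places and significant digits; the variable names are illustrative.

```r
# Named arguments: the order does not matter
named_in_order <- round(x = 3.14159, digits = 2)
named_swapped  <- round(digits = 2, x = 3.14159)

# Unnamed arguments: the order must match the definition (x first, then digits)
unnamed <- round(3.14159, 2)

named_in_order
named_swapped
unnamed

# round() counts decimal places, signif() counts significant digits
round(123.456, digits = 2)   # 123.46
signif(123.456, digits = 2)  # 120
```

All three calls to round() return the same value, but the named versions remain readable even if you have forgotten the order of the arguments.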
Exercise
- Another commonly used base function is mean(). Use this function to calculate an average for the glengths vector.
- Use the help manual to identify additional arguments for mean().
CRAN is a repository where the latest downloads of R (and legacy versions) are found in addition to source code for thousands of different user contributed R packages.
Packages for R can be installed from the CRAN package repository using the install.packages() function. This function will download the package from one of the CRAN mirrors and install it (and any dependencies) locally on your computer.
An example is given below for the ggplot2 package that will be required for some plots we will create later on. Run this code to install ggplot2.
install.packages('ggplot2')
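If you are not sure whether a package is already installed, you can check before installing it again. The sketch below uses the base function requireNamespace(); the helper name `is_installed` is hypothetical and introduced only for this example.

```r
# Hypothetical helper: TRUE if the package can be found in your R library
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this returns TRUE
is_installed("stats")

# Only print the install hint when ggplot2 is missing
if (!is_installed("ggplot2")) {
  message("ggplot2 is not installed; run install.packages('ggplot2')")
}
```

Once a package is installed, it still needs to be loaded in each new R session with library(), as in the example script earlier in this lesson.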
Alternatively, packages can also be installed from Bioconductor, another repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data. These packages include (but are not limited to) tools for performing statistical analysis, annotation packages, and packages for accessing public datasets.
There are many packages that are available in CRAN and Bioconductor, but there are also packages that are specific to one repository. Generally, you can find out this information with a Google search or by trial and error.
To install from Bioconductor, you will first need to install Bioconductor and all the standard packages. This only needs to be done once ever for your R installation.
If you were successful with the installation from CRAN, you do not need to run this.
# DO NOT RUN THIS!
source("http://bioconductor.org/biocLite.R")
biocLite()
Once you have the standard packages installed, you can install additional packages using the biocLite.R script. If it's a new R session you will also have to source the script again. Here we show that the same package ggplot2 is available through Bioconductor:
# DO NOT RUN THIS!
biocLite('ggplot2')
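In Bioconductor releases from 3.8 onwards, the biocLite.R script has been replaced by the BiocManager package, which is itself installed from CRAN. The sketch below shows what the equivalent steps might look like on a recent R installation; the wrapper function `install_from_bioconductor` is hypothetical and exists only so that nothing is installed until you call it.

```r
# DO NOT RUN THIS during the lesson!
# Hypothetical wrapper: defining it installs nothing; calling it does.
install_from_bioconductor <- function(pkgs = character()) {
  # BiocManager itself comes from CRAN
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # With no package names, this installs or updates the core packages
  BiocManager::install(pkgs)
}

# Example call (not run): install_from_bioconductor("limma")
```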