Today we are going to use another brief example of the sort of work you can do with the command line. In this case, we are going to take the most significant ChIP-seq peaks from macs2 and view them in the UCSC genome browser.
A remote computer system to which you can connect via "ssh," either in the terminal in Mac or in PUTTY in Windows. You log in as a user, just as you do with your personal computer, but the interface is usually text-only.
BASH is the "shell" you will use. It is a piece of software, just like Word or Excel or Firefox. You write commands, in the format:
program-name [options] [files]
The BASH interpreter receives these commands, interprets them as best it can, dispatches commands to whatever programs you requested, and reports the result back to you. All command line interfaces are the simple loop: write command, press ENTER, wait as the request is processed, analyze the results.
The shell has a current working directory, where it writes files, interprets relative paths, etc. You list this directory with
pwd
You list the contents of the current working directory with
ls -l
You create a new folder inside the current working directory with
mkdir new_folder
You navigate into the new folder with
cd new_folder
Remember to use <TAB> to autocomplete program names, folder names, and filenames!
You download a file from the internet using the following:
wget https://hpc.nih.gov/~palmercd/tutorial/GSE77625.tar.bz2
Confirm that the file downloaded successfully.
What format is the file you downloaded? You have to use a slightly different command to extract the data:
tar xjvf GSE77625.tar.bz2
What has appeared? Navigate into the resulting directory.
What exists in this directory? Make a copy of one of these files with a nicer name:
cp GSE77625_h3k27ac_chow.bed my_file.bed
Investigate the contents of this file.
head my_file.bed
tail my_file.bed
wc my_file.bed
We want to determine how many peaks exist on each chromosome. To do so, we need some new tools.
There are many ways to potentially solve this problem. One involves sorting the file by the chromosome code, and then counting the number of instances of each unique chromosome label.
sort my_file.bed | head
What is the result? We want to count the number of instances of each chromosome label in the file.
sort my_file.bed | cut -f 1 | head
To get the unique entries, we use the uniq utility:
sort my_file.bed | cut -f 1 | uniq
That was efficient, but didn't seem to actually solve the problem. Try the man page!
man uniq
sort my_file.bed | cut -f 1 | uniq -c
To extract the most interesting peaks, we can sort the file on the fifth column of the file (bigger is better):
sort -k 5,5 my_file.bed | head
What happened?
sort -k 5,5g my_file.bed | head
This sort appears to be in the reverse order of what we want, which has an easy fix:
sort -k 5,5g my_file.bed | tail
Now, we select however many of these we want, and write them to file:
sort -k 5,5g my_file.bed | tail -20 > my_peaks.bed
Now move the file to your local computer.