Welcome to NGS course 2014’s documentation!¶
Contents:
UNIX primer¶
You need to connect to the system, type in commands, keep stuff running while you’re disconnected and then pick up your results.
Connecting to the system¶
To connect from MS Windows we’ll use PuTTY. PuTTY is a simple window that sends everything you type to the remote computer and displays anything the remote computer sends back.
Warning
Clipboard works differently in PuTTY! When you select text, PuTTY assumes you want to copy it - so it is copied to the clipboard automatically. To paste text, press the right mouse button.
Warning
Trying to use Windows shortcuts - especially Ctrl-C - kills your currently running program.
Run PuTTY and enter the following information:
Host: localhost
Port: 2222
The shell¶
What you see now is the shell. The shell is a program for entering commands. Your shell is bash. Bash is to shells what MS Word is to text editors. You can choose your shell if you need to, but most people use bash.
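A minimal example of a shell session - type a command, press <enter>, and the shell prints the result (your prompt and the actual output will differ):
user@node:~$ date          # runs the date command
Mon Apr 28 10:00:00 CEST 2014
user@node:~$ echo hello    # echo just prints its arguments back
hello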
Multiple windows¶
You’re all used to working with multiple windows (in MS Windows ;). You can have them in (remote) Linux as well.
screen
Note
The additional benefit is that you can log off, and your programs keep running.
Screen is controlled with the master key ctrl-a: press it first, and the next key you press is a command to screen.
To create a new window, press ctrl-a c (create). To flip among your windows press ctrl-a space (you flip windows often, it’s the biggest key available).
To detach screen - “keep your programs running and go home” - press ctrl-a d (detach).
Coming back to work you need to reconnect to your screen (-r is for restore):
screen -r
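A compact summary of the whole screen workflow described above:
screen            # start a new screen session
# inside screen:
#   ctrl-a c      - create another window
#   ctrl-a space  - flip to the next window
#   ctrl-a d      - detach, programs keep running
# log off, go home, log in again, then:
screen -ls        # list your sessions (useful when -r complains)
screen -r         # reattach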
Moving around¶
You need to type these commands in bash - keep an eye on the prompt - the beginning of the line where you type. Different programs present different prompts.
pwd # prints current directory path
cd # changes current directory path
ls # lists current directory contents
ll # lists detailed contents of current directory (an alias for ls -l)
mkdir # creates a directory
rm # removes a file
rm -r # removes a directory
cp # copies a file/directory
mv # moves a file/directory
locate # tries to find a file by name
ln -s # create symbolic link
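A small, safe session exercising several of these commands (my-test is just a throwaway name):
pwd                 # where am I now?
mkdir my-test       # create a playground directory
cd my-test          # enter it
pwd                 # the path now ends with /my-test
cd ..               # .. means one level up
rm -r my-test       # remove the playground again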
Getting help¶
Call me or Vaclav to get any help ;)
Once you know the name of the command that does what you need, all the details are easily accessible using man.
To get all possible help about finding text do:
man grep
To find the name of the command that does what you need, use google:
linux search for string
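When google is not at hand, the manual pages themselves can be searched by keyword:
man -k sort        # list commands whose description mentions 'sort'
man sort           # then read the page of the one you picked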
Viewing files¶
less is the command:
less /data/slavici/00-reads/GSVZDOM02.fastq
Toggle line wrapping by typing -S<enter>. Search for the sequence ACGT by typing /ACGT<enter>. Press n (next) to jump to the next match. Exit less by typing q.
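The same behaviour can be requested up front on the command line - -S chops long lines instead of wrapping them, -N adds line numbers:
less -SN /data/slavici/00-reads/GSVZDOM02.fastq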
Chaining commands¶
You know how to display a whole file (less). What if you want to display just specific information from the file?
Change the directory, so we don’t have to type so much (press <tab> often to spare some typing in bash):
cd /data/slavici
View only first 100 lines:
<00-reads/GSVZDOM02.fastq head -100 | less
View only sequence names (they all start with @):
<00-reads/GSVZDOM02.fastq grep ^@ | less
We can see that the simple assumption was not correct - not only sequence names start with @. Let’s display every fourth line, starting with the first one - that’s where the names are:
<00-reads/GSVZDOM02.fastq awk '(NR % 4 == 1)' | less
Writing to file instead of looking at it is easy:
<00-reads/GSVZDOM02.fastq awk '(NR % 4 == 1)' > test-file
# check if the data is there ;)
less test-file
# get rid of the file
rm test-file
Chaining is not limited to two commands. Now we want the first 1000 sequence names without the @:
<00-reads/GSVZDOM02.fastq awk '(NR % 4 == 1)' | cut -c2- | head -1000 > second-test
less second-test
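One more chain as an exercise - counting the reads in the file. A fastq record has four lines, so these two approaches should agree:
# total line count - divide the result by four
<00-reads/GSVZDOM02.fastq wc -l
# or count the name lines directly
<00-reads/GSVZDOM02.fastq awk '(NR % 4 == 1)' | wc -l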
Getting out of it all¶
exit # quits current session
Doing useful stuff¶
Moving files around¶
In the /data directory you’ve got sample data with some precomputed results, in case some computations fail. You don’t want to overwrite those, so you will create a ‘clean’ directory with only the input data.
Using links you can access the same data from different locations:
Warning
Linux uses case sensitive filesystems - File is not file.
# create a sandbox directory
cd /data
mkdir slavici_sandbox
cd slavici_sandbox
# link the data from the original directory
ln -s ../slavici/00-reads
ln -s ../slavici/01-genome
# readgroups is small, we can copy it
cp ../slavici/20-smalt/readgroups.txt .
# check if there is everything we need
ll
ll 00-reads
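In the ll output the links show up with an arrow pointing to their target; readlink prints the target alone:
readlink 00-reads    # prints ../slavici/00-reads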
Installing software¶
There is a canonical software install procedure in UNIX. It can be summarized as:
wget -O - http://some.site.com/package.tar.gz | tar xvz
cd package
less R<tab> # usually README or README.txt, type capital R!
./configure
make
sudo make install
Looks easy..? Some packages do not have a configure file; you just skip the configure step then. And some - usually biological - packages are just weird. Then you have to look for information in README.* or INSTALL.txt.
Let’s try to install two packages - Pipe viewer and vcflib.
Pipe viewer is a nice tool you can use to watch the progress of your operations. It is distributed in the standard .tar.gz form. Vcflib lives at GitHub - this is where a lot of current open source software resides nowadays.
Pipe viewer: go to google, enter pipe viewer. Click the ivarch.com link. Look for downloads. Right click the pv-1.5.2.tar.gz link, select copy address or something similar (depends on your browser). Go to PuTTY, type wget -O -<space> and right-click your mouse. Then type | tar xvz and press <enter>.
# go to a directory with software
cd ~/sw
# this is a spoiler, you should create the first line yourself
# and do not copy it here
wget -O - http://www.ivarch.com/programs/sources/pv-1.5.2.tar.gz | tar xvz
cd pv<tab>
./configure
make
sudo make install
# test pv
</dev/zero pv > /dev/null
vcflib: go to google, type vcflib, choose the GitHub link. Find the clone url. Click the clipboard button. Go to PuTTY, type git clone<space> and right-click your mouse in the PuTTY window. Press <enter>.
cd ~/sw
git clone --recursive https://github.com/ekg/vcflib.git
cd vcflib
make
vcflib does not have make install. We need to copy the binaries to a directory in $PATH manually.
sudo cp bin/* /usr/local/bin
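To check that the shell now finds the new programs (vcffilter is one of the binaries vcflib builds):
which vcffilter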
Scrimer¶
Scrimer is a pipeline for designing primers from a transcriptome. Most of the steps are common to NGS data analysis in general.
Go to:
http://scrimer.rtfd.org
In the documentation you will find commands to perform individual steps of NGS analysis.
The data in your virtual machine are filtered to a size where the steps won’t take too long - so you should be able to run them all during the session.
Looking at quality checks¶
Steps to take¶
You should try to understand what the commands do (you don’t have to understand how they do it).
Go to the /data/slavici_sandbox directory in your virtual machine.
Set the number of CPUs that can be used (there’s only one in the VM):
CPUS=1
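On a machine with more cores you can detect the number instead of hard-coding it:
# count the processor entries the kernel reports
CPUS=$(grep -c processor /proc/cpuinfo)
echo $CPUS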
The following headings can be found in the Scrimer manual. Follow instructions there, with additional info given here.
Remove cDNA synthesis adaptors¶
Set the adaptor sequences that were used during the library preparation. They will be removed from the sequences.
# primers used to synthesize cDNA
# (sequences were found in .pdf report from the company that did the normalization)
PRIMER1=AAGCAGTGGTATCAACGCAGAGTTTTTGTTTTTTTCTTTTTTTTTTNN
PRIMER2=AAGCAGTGGTATCAACGCAGAGTACGCGGG
PRIMER3=AAGCAGTGGTATCAACGCAGAGT
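The exact trimming command is given in the Scrimer manual. Purely as a hedged sketch, such a call can look roughly like this - the choice of cutadapt and the 10-cutadapt output directory are assumptions here, not Scrimer’s actual command:
# hypothetical sketch - follow the Scrimer manual for the real command
mkdir -p 10-cutadapt
cutadapt -g $PRIMER1 -g $PRIMER2 -g $PRIMER3 \
    00-reads/GSVZDOM02.fastq > 10-cutadapt/GSVZDOM02.fastq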
Run the QC on the raw data and on the trimmed data - so you can see the difference.
QC reports produced by fastqc are html pages. You need to get them to your machine, because your Linux does not have a graphical display (it is text only). Use WinSCP to copy them to your machine after you generate them.
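A minimal sketch of such a QC run - the qc output directories and the trimmed file location are assumptions:
# fastqc writes one html report per input file into --outdir
mkdir -p qc-raw qc-trimmed
fastqc --outdir=qc-raw 00-reads/GSVZDOM02.fastq
fastqc --outdir=qc-trimmed 10-cutadapt/GSVZDOM02.fastq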
Map reads to reference assembly¶
Follow all the instructions on the page.
Detect and choose variants - Call variants with FreeBayes¶
Skip the line following # the rest, ignore order.
Graphical tools¶
IGV - Integrative Genomics Viewer¶
Use WinSCP to copy your data from the virtual machine. Run IGV.
Load the data to IGV and look around.
Galaxy¶
The same data that you see in IGV can be visualized online by uploading to a Galaxy server. I uploaded the data beforehand, so we don’t upload 10x 500 MB at once.
To see the loaded data, go to:
https://usegalaxy.org/u/liborm/v/ngs-course-2014
To get the data to Galaxy interface, go to:
https://usegalaxy.org/u/liborm/h/ngs-course-2014
To see Galaxy, go to:
http://usegalaxy.org
MetaCentrum¶
The best thing is that now you know almost everything needed to use MetaCentrum. It is the same as using PuTTY to access a local virtual machine.
Key differences¶
- you have to register
- you don’t use localhost:2222, but something like skirit.ics.muni.cz:22
- you cannot use sudo
- you need to allocate computers using the qsub command (see the sketch below)
- your data is somewhere else than in /data
- you can have 100 cores instead of 1
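A hedged example of such an allocation (check the MetaCentrum documentation for the current resource syntax):
# ask for one machine with four cores for two hours, running my-job.sh
qsub -l walltime=2:00:00 -l nodes=1:ppn=4 my-job.sh
# watch the state of your jobs
qstat -u $USER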
Visualizing your data¶
We’ll use RStudio that is installed in the virtual machine. To start RStudio go to (in your browser):
http://localhost:8787
Look around the program - if you ever used R without RStudio, the difference is big!
Get the data:
# download it here to your machine
https://owncloud.cesnet.cz/public.php?service=files&t=aab865a16555adc995b50e33b148318a
# use WinSCP to copy it to the virtual machine
# unpack the data
unzip data_viz.zip
Then use RStudio to navigate to the folder where you unpacked the data.
Open multires_profiles.R
.
Use the package manager to install the gtools library. (It’s easy.)
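If the package manager misbehaves, the same thing can be done from the shell (installing system-wide with sudo is an assumption matching this VM setup):
sudo R -e 'install.packages("gtools", repos="http://cran.r-project.org")'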
Run the scripts, and see GC profiles of various genomes.
You can suggest other types of plots on this kind of data - we can try to create them.
Extra UNIX exercise¶
If you want to plot the same profiles for /data/slavici, there is a script base_counts.py that can sum it for you. But I wrote it in a hurry for a certain type of data - one gzipped chromosome per file. You can try to convert luscinia_small.fasta to this format and run the script.
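A hedged sketch of such a conversion, assuming luscinia_small.fasta sits in 01-genome and contains only a few sequences (awk keeps one gzip pipe open per output file, so this does not scale to thousands of sequences):
cd /data/slavici_sandbox
mkdir -p split-genome
# start a new gzip pipe at each fasta header, named by the sequence id
awk '/^>/ { file = "gzip > split-genome/" substr($1, 2) ".fa.gz" }
     { print | file }' 01-genome/luscinia_small.fasta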
Supplementary material¶
Course materials preparation¶
VirtualBox image¶
Download the Debian net install image - use i386 so there are as few problems with virtualization as possible. Not all machines can virtualize x64.
https://www.debian.org/CD/netinst/
Create new VirtualBox machine¶
- Linux/Debian (32 bit)
- 1 GB RAM - this can be changed on the user’s machine, if enough RAM is available
- 12 GB HDD as system drive (need space for basic system, gcc, rstudio and some data)
- name the machine ‘node’
- users: root:debian, user:user
- set up port forwarding - host 2222 to guest 22 (ssh), host 8787 to guest 8787 (rstudio server); see the sketch below
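Port forwarding can also be set up from the command line while the machine is powered off - a hedged sketch using the machine name chosen above:
VBoxManage modifyvm node --natpf1 "ssh,tcp,,2222,,22"
VBoxManage modifyvm node --natpf1 "rstudio,tcp,,8787,,8787"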
Log in as root:
apt-get install sudo
usermod -a -G sudo user
Log in as user:
# colorize prompt - uncomment force_color_prompt=yes
# add ll alias - uncomment alias ll='ls -l'
# fast sort and uniq
# export LC_ALL=C
# maximal width of man
# export MANWIDTH=120
nano ~/.bashrc
. ~/.bashrc
# everyone likes git and screen
sudo apt-get install git screen
# add important stuff to python
sudo apt-get install python-dev python-pip python-virtualenv
# make a vbox snapshot here 'usable system'
# install CloudBioLinux into virtual environment (not to pollute whole system)
mkdir sw
virtualenv py-cbl
. py-cbl/bin/activate
# cloudbiolinux installation (https://github.com/chapmanb/cloudbiolinux)
git clone git://github.com/chapmanb/cloudbiolinux.git
cd cloudbiolinux
python setup.py build
python setup.py install
# fix problem with distribution (new debian wheezy not yet supported?)
sudo bash -c "echo DISTRIB_CODENAME=wheezy >> /etc/os-release"
# choose a minimal flavor for installing
fab -f fabfile.py -H localhost -p user install_biolinux:flavor=ngs_pipeline_minimal
# to ease problem debugging
sudo apt-get install strace
sudo updatedb
# fix /usr/local/lib missing in search path
sudo bash -c "echo /usr/local/lib >> /etc/ld.so.conf.d/local.conf"
# rebuild the cache
sudo ldconfig
R Studio server:
# install rstudio server
# https://www.rstudio.com/ide/download/server.html
sudo apt-get install gdebi-core
wget http://download2.rstudio.org/rstudio-server-0.98.507-amd64.deb
# get old openssl
wget http://ftp.de.debian.org/debian/pool/main/o/openssl/libssl0.9.8_0.9.8o-4squeeze14_amd64.deb
dpkg -i libssl0.9.8_0.9.8o-4squeeze14_amd64.deb
# update R to some decent version
# http://cran.r-project.org/bin/linux/debian/README.html
sudo bash -c "echo 'deb http://mirrors.nic.cz/R/bin/linux/debian wheezy-cran3/' >> /etc/apt/sources.list"
sudo apt-key adv --keyserver keys.gnupg.net --recv-key 381BA480
sudo apt-get update
sudo apt-get install r-base
sudo R
> update.packages(.libPaths(), checkBuilt=TRUE, ask=F)
# add some packages by hand
curl http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2|tar xvj
cd parallel-20140422/
./configure
make && sudo make install
Prepare data¶
Create a subset of nightingale data on another machine:
Transfer them to VirtualBox:
sudo mkdir /data
sudo chown user:user /data
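A hedged example of the transfer itself, run on the machine that hosts VirtualBox (the source path is an assumption):
scp -P 2222 -r slavici user@localhost:/data/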
Create documentation¶
This was not done in the virtual machine, but belongs to the course preparation...
mkdir ngs-course-2014
cd ngs-course-2014
# use default answers to all the questions
sphinx-quickstart
# track the progress with git
git init
git add .
git commit -m "empty docs and slide"
Spare parts¶
If installing from a remote machine: use Fabric (fab) to install CloudBioLinux. The 127.0.0.1 is needed, otherwise it does not use ssh:
fab -f fabfile.py -H 127.0.0.1 --port=2222 -u user -p user install_biolinux:flavor=ngs_pipeline_minimal
# full install - does not work
fab -f fabfile.py -H localhost install_biolinux