Course on Unix and work with genomic data - Prague, November 2020¶
This course is taught regularly as a subject in the winter term at the Department of Zoology, Faculty of Science, Charles University in Prague (MB170C47).
The aim of the course is to introduce the participants to Unix - an interface that is one of the most convenient options for working with big data in text formats. Special attention is given to working with Next Generation Sequencing (NGS) data. Most NGS data formats are intentionally text based, because their authors wanted to use Unix for most of the processing. This way they can focus on the real problem they're solving, without having to program special tools for the mundane tasks, which can all be handled by combining basic Unix tools.
We'll be using the Slack workspace 'UNIX and NGS' to share your login credentials and further info - use this link to join.
Schedule:
- Friday 27/11/2020 (14:00-18:00): afternoon session
- Saturday 28/11/2020 (10:00-18:00): morning session, lunch break (~1 h), afternoon session
- Sunday 29/11/2020 (10:00-18:00): morning session, lunch break (~1 h), afternoon session
Initial instructions:
Installation instructions¶
We will be all connecting to a single remote server hosted by MetaCentrum.
Note
You will connect to the machine differently, depending on what operating system you are using.
Windows¶
Install Git for Windows. We'll use it to control the remote computer.
When installing Git Bash, go with the default settings, just make sure you check Git Bash Here. Keep in mind that the images are from an older version of the installer.
To set up your terminal, run Git Bash from the Start menu, run the following command and then exit the terminal (exit):
curl -sL https://owncloud.cesnet.cz/index.php/s/1B1NnrI4lqQtY9Q/download > ~/.minttyrc
Install WinSCP (look for Installation package). WinSCP will be used to transfer files between your computer and the remote computer.
macOS and Linux¶
ssh is used to control the virtual computer. It should already be installed on your computer.
Files can be transferred with scp, rsync, or lftp (recommended) from the command line. scp and rsync may already be installed on your system; if you want to try lftp, you'll probably have to install it yourself.
Mac users who prefer graphical clients can use something like CyberDuck. Check the answers here.
Time to log in!¶
Try to log in following the instructions in Connecting to the remote machine.
Connecting to the remote machine¶
To control the machine, you need to connect to the ssh service. This is also referred to as ‘logging in’.
Note
You will need a user name and a password to log in. Join our Slack so we can send the credentials to everyone during the course introduction.
In Windows this is done with Git Bash from the Git for Windows package. When you run it, you get a so-called terminal window. Use the user name given in Slack (substitute the ## with the number you got), type the following command and press Enter:
ssh user##@ngs-course.duckdns.org
# when asked about the 'authenticity', type yes
The authenticity of host 'ngs-course.duckdns.org (147.251.21.151)' can't be established.
ECDSA key fingerprint is SHA256:r44ZKJC8HuqKP9T3irEs8h/4ZEWJU8Eym41VcTfOk1I.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Type in your password when prompted with user##@ngs-course.duckdns.org's password:. The password entry is 'silent' - nothing appears as you type, so no one can see how many characters your password has.
In macOS your terminal program is called ‘Terminal’, in Linux you have several options like ‘Konsole’, ‘xterm’ etc.
On Chromebook, you need to install the Secure Shell App.
Connect to copy files¶
In Windows, WinSCP can be used to copy files to Linux machines.
In macOS or Linux, the simplest command to copy a file into the home directory of user## on the remote machine is:
scp myfile user##@ngs-course.duckdns.org:~
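Copying in the other direction just swaps the arguments - a small illustration (the remote file name results.txt is only a placeholder):
# copy results.txt from your home directory on the remote machine
# into the current directory on your own computer
scp user##@ngs-course.duckdns.org:~/results.txt .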
Connect to RStudio¶
This is the easiest one, just click this link: Open RStudio. Login with the same credentials you got on Slack.
Course outline:
Session 1: Basic Unix¶
This session will give you all the basics that you need to smoothly move around when using a Unix system (in the text mode!).
Basic orientation¶
Note
Copying and pasting in the Windows terminal (Git for Windows) is different than in other programs - especially because ctrl+c means 'kill the current program'. To copy text to the clipboard, just select it with the left mouse button. To paste from the clipboard, either click the middle mouse button or press shift+insert.
Command line¶
How do commands on the command line work? A typical command is composed of the command name, a flag, a flag value, a path to the input file and a path to the output file.
command -flag(value) input > output
# Specific example
head -n10 file.txt > out.txt
How do I know which flags to use for individual commands? There are several ways. You can view the command documentation using man, and most commands have a help option (-h or --help). Below is how to call the manual and the help:
# using manual
man head
# help
head -h
head --help
Check your keyboard¶
Before we do any serious typing, make sure you know where the important keys are. I'd suggest using an English keyboard layout, if you don't want to constantly press the right alt and five random letters before you find the one you need. You will definitely need these keys:
[] - square brackets
{} - curly brackets (mustache)
<> - angle brackets (less than, greater than)
() - parentheses
~ - tilde
/ - slash
\ - backslash
| - pipe
^ - caret
$ - dollar sign
: - colon
; - semicolon
. - dot
, - comma
# - hash, pound
_ - underscore
- - dash
* - asterisk
! - exclamation mark
? - question mark
& - ampersand
@ - at sign
' - single quote
" - double quote
` - back tick
Be safe when the network fails¶
When you are disconnected from the machine due to any technical problems, all your running programs are killed. To prevent this, we suggest using the screen tool for all your work:
screen
To safely disconnect from a running screen press ctrl+a d (d for detach).
To attach again type:
screen -r
You can have multiple sessions running simultaneously. In that case you have to select which session to reattach to. The -ls option lists all existing sessions; you can then re-attach to a specific session using -r and the name of the session:
screen -ls
screen -r XXXX.NNNNNN.XXXX
To kill a screen session we can use:
screen -X -S XXXX.NNNNNN.XXXX quit
Note
Keyboard shortcuts notation: ctrl+a d means press the ctrl key and the a key simultaneously, and then the d key after you release both of the previous keys.
Directory structure¶
Unlike 'drives' in Windows, Unix has a single directory tree that starts in / (called the root directory). Everything can be reached from the root directory. The next important directory is ~ (the user's home directory). It is a shortcut for /home/user here, and /home/..your login name.. in general.
Your bash session has a working directory that can be changed with cd (change directory) and printed with pwd (print working directory). All filenames and paths you type refer to your working directory (relative paths), unless you start them with / (absolute paths).
Try the following commands in the order they are provided, and figure out what they do. Then use your knowledge to explore the directory structure of the virtual machine.
Figure out what these commands do:
pwd
ls
ls /
ls ..
ls ~
cd
cd /
cd ..
cd ~
A neat trick to go back where you've been before the last cd command:
cd -
More in Moving around & manipulation with files and directories.
Moving or copying files and directories¶
touch file.txt # make an empty file.txt
mkdir dir # make a directory
mkdir -p some/sub/directories # make nested directories
rm # remove a file
rm -r # remove a directory
mv # move a file/directory
cp # copy a file/directory
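A small worked example of these commands (the file and directory names are just illustrative):
mkdir -p backup                           # create a directory
touch notes.txt                           # create an empty file
cp notes.txt backup/                      # copy the file into backup/
mv backup/notes.txt backup/notes_v1.txt   # rename (move) it
rm backup/notes_v1.txt                    # remove the file
rm -r backup                              # remove the directory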
Viewing plain text file content¶
less -SN
tail -n8
head -n8
cat
nano
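These commands take a file name (or read from a pipe) - a quick illustration on a hypothetical file.txt:
less -SN file.txt   # page through the file without line wrapping, with line numbers
head -n8 file.txt   # print the first 8 lines
tail -n8 file.txt   # print the last 8 lines
cat file.txt        # print the whole file
nano file.txt       # edit the file in a simple text editor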
Work with compressed files¶
# gzipped files (take care, this removes the input file)
gunzip file.txt.gz
# Open gzipped files in pipeline (zcat does not remove the file)
zcat file.txt.gz | less
# Compressed tarball archives (does not remove the archive)
tar -xzvf fastq.tar.gz
Exercise: Prepare fastq files in your working directory
# Go to home directory
cd
# Make a new directory 'projects/fastq' and enter it
mkdir -p projects/fastq && cd projects/fastq
# Copy a fastq archive to the new directory
cp /data-shared/fastq/fastq.tar.gz .
tar -zxvf fastq.tar.gz
ls -sh
Pipes¶
Using the | (pipe) character you instruct the shell to take the output of the first command and use it as the input for the second command.
The complement to head is tail. It displays the last lines of its input. It can be readily combined with head to show the second sequence in the file.
cd ~/projects/fastq
head -8 HRTMUOC01.RL12.00.fastq | tail -4 | less
# Neater way to write pipelines
< HRTMUOC01.RL12.00.fastq head -8 | tail -4 | less -S
Globbing¶
Imagine you've got 40 FASTQ files. You don't want to copy and paste all the names! There is a feature that comes to the rescue: globbing. It allows you to specify several filenames at once by defining a common pattern. All your read files have the .fastq extension, so *.fastq means any file named by any number of characters followed by '.fastq'.
cd ~/projects/fastq
ls *.fastq
ls HRTMUOC01.RL12.0?.fastq
ls HRTMUOC01.RL12.0[1-9].fastq
Exercise: How many reads are there?
We found out that FASTQ files have a particular structure (four lines per read). To find the total number of reads in our data, we will use another tool, wc (it stands for word count, not for a toilet at the end of the pipeline;). wc counts words, lines and characters.
Our data is in several separate files. To merge them on the fly we'll use another tool, cat (for conCATenate). cat takes a list of file names and outputs a continuous stream of the data that was in the files (there is no way to tell from the stream where one file ends).
# now double click on each file name in the listing,
# and click the right mouse button to paste (insert a space in between)
cat *.fastq | wc -l
The number that appeared is four times the number of sequences (each sequence takes four lines). And there is even a built-in calculator in bash:
echo $(( XXXX / 4 ))
expr XXXX / 4
Variables¶
CPU=4
echo $CPU
FILE=~/projects/fastq/HRTMUOC01.RL12.00.fastq
echo $FILE
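Variables are referenced with $ and can be used anywhere a literal value would go - a small illustration reusing the FILE variable defined above:
# show the first read (first four lines) of the file stored in $FILE
head -4 $FILE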
Lists & Loops¶
PARAM=$(echo {0..9})   # echo expands the brace sequence into a list of values
for v in $PARAM
do
echo $v
done
# One line syntax
for v in $PARAM; do echo $v; done
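The same construct works for looping over files matched by a globbing pattern - a minimal sketch, assuming you are in a directory with the extracted fastq files (e.g. ~/projects/fastq):
for FASTQ in *.fastq
do
    echo "$FASTQ has $(wc -l < $FASTQ) lines"
done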
Installing software¶
The easiest way to install software is via a package manager (e.g. apt-get for all Debian variants). When the required software is not in the repositories, or one needs the latest version, it's necessary to take the more difficult path. The canonical Unix way is:
wget -O - ..url.. | tar xvz # download and unpack the 'tarball' from internet
git clone ..url.. # clone source code from git repository
cd ..unpacked directory.. # set working directory to the project directory
./configure # check your system and choose the way to build it
make # convert source code to machine code (compile it)
sudo make install # install for everyone on this machine
Note
Normal users cannot change (and break) the (Unix) system. There is one special user in each system called root, who has the rights to make system-wide changes. You can either log in directly as root, or use sudo (super user do) to execute one command as root.
htop¶
Note
This year the machine is shared among all course participants, so we can’t give super user access to everyone to be sure that no one can accidentally damage the machine.
Installing software from common repository:
sudo apt-get install htop
Bedtools¶
Install software which is not in the common repository. You just need to find the source code and compile it:
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2
make
Another common place to find a lot of software is GitHub. We'll install bedtools from a GitHub repository:
cd ~/sw
# get the latest bedtools
git clone https://github.com/arq5x/bedtools2
This creates a clone of the online repository in the bedtools2 directory.
cd bedtools2
make
Exercise¶
Note
- What is the output of this command: cd ~/ && ls | wc -l ?
  - The total count of files in subdirectories in the home directory
  - The count of lines in files in the home directory
  - The count of files/directories in the home directory
  - The count of files/directories in the current directory
- How many directories does this command make (do not use a calculator!): mkdir {1999..2001}-{1st,2nd,3rd,4th}-{1..5}
  - 56
  - 60
  - 64
  - 72
- When files are created using this command: touch file0{1..9}.txt file{10..30}.txt, how many files are matched by ls file?.txt and by ls file*0.txt ?
  - 30 and 0
  - 0 and 30
  - 30 and 4
  - 0 and 3
- Which file would match this pattern: ls *0?0.* ?
  - file36500.tab
  - file456030
  - 5460230.txt
  - 456000.tab
- Where do we get with this command: cd ~/ && cd ../.. ?
  - two levels below the home directory
  - one level above the home directory
  - to the root directory
  - two levels above the root directory
- What number does this command print: < file.txt head -10 | tail -n+9 | wc -l ? (Assume file.txt contains a lot of lines.)
  - 0
  - 1
  - 2
  - 3
Session 2: Genomic data¶
During this course we are going to use various genomic data stored in standard formats that can be easily read and processed with Unix shell commands. This session provides an overview of these common data formats. Specialized tools that are optimized for working with these data are mentioned here as well.
Complex tasks are easier to carry out with specialized genomic tools than by using built-in Unix command-line tools alone - not to mention very specific bioinformatic tasks like read mapping or assembly, which simply need software developed for that purpose.
The advantage is that the majority of the specialized genomic software can be run on the command line. The most efficient work combines both: built-in Unix shell commands direct the flow of data through the bioinformatic pipeline and do the basic processing, while the specialized genomic tools are run with specific parameters.
Common genomic tools¶
Below is a list of the most widely used genomic tools for genomic data processing, grouped by the type of task they perform:
Read alignment data¶
- samtools (https://samtools.github.io)
Variant data¶
- vcftools (https://vcftools.github.io/index.html)
- bcftools (https://samtools.github.io/bcftools/)
Annotation data (genome arithmetics)¶
- bedtools (https://bedtools.readthedocs.io/en/latest/)
- bedops (https://github.com/bedops/bedops)
Sequence/Alignment/Tree data¶
- newick-utilities (https://github.com/tjunier/newick_utils/wiki)
- BuddySuite (https://github.com/biologyguy/BuddySuite)
Standard genomic data formats¶
High throughput specs hub: https://samtools.github.io/hts-specs/
FASTQ - Short reads¶
Sequencing instruments produce not only base calls, but usually can also assign a quality score to each called base. A FASTQ file contains multiple sequences, and each sequence is paired with per-base quality scores. The quality scores are encoded in text form.
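Each read takes four lines: a header starting with @, the sequence, a separator line starting with +, and the quality string. A made-up record for illustration (the read name and bases are invented):
@read_001
ACGTTGCATTAGCCA
+
IIIIHHHGGFFEEDD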
SAM - Reads mapped to reference¶
SAM stands for Sequence Alignment/Map format. It stores the reads that were mapped to a reference genome, together with the positions they map to. There is an efficient binary-encoded counterpart called BAM and an even more compressed CRAM.
BED and GFF - Annotations¶
Annotations are regions in a given reference genome, with some optional additional information. BED is very simple and thus easy to work with for small tasks; GFF (General Feature Format) is a comprehensive format allowing feature nesting, arbitrary data fields for each feature, and so on.
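A BED file has at least three TAB-separated columns - chromosome, start and end (0-based, end exclusive) - plus optional columns like a feature name. Two made-up lines for illustration:
chr1	1000	5000	gene_A
chr2	250	1250	gene_B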
VCF - Variants in individuals¶
VCF stands for Variant Call Format. Given a reference and a set of sequenced individuals, VCF is a format to store the differences in these individuals, compared to the reference, efficiently. There is also a binary counterpart BCF.
Session 3: Plain text file processing in Unix¶
This session focuses on plain text file data extraction/modification using built-in Unix tools.
Pattern search & regular expressions¶
The grep command can be used to efficiently match and retrieve patterns in text files.
grep pattern file.txt     # Returns lines matching a pattern
grep -v pattern file.txt  # Returns lines not matching a pattern
grep -E regex file.txt    # Returns lines matching an extended regular expression
grep -c pattern file.txt  # Returns the number of lines matching a pattern
grep -B2 pattern file.txt # Also prints 2 lines of context before each matching line
grep -o pattern file.txt  # Returns only the matching part of the lines
man grep                  # For other options
But what if we want to match a variable pattern, e.g. of differing length or content? Regular expressions are exactly the tool we need.
^A # match A at the beginning of line
A$ # match A at the end of line
[0-9] # match a digit (numerical character)
[A-Z] # match an uppercase letter
[ATGC] # match A or T or G or C
. # match any character
A* # match A letter 0 or more times
A{2} # match A letter exactly 2 times
A{1,} # match A letter 1 or more times
A+ # match A letter 1 or more times (extended regular expressions)
A{1,3} # match A letter at least 1 times but no more than 3 times
AATT|TTAA # match AATT or TTAA
\s # match whitespace (also TAB)
grep -E is a useful tool to search for patterns using a mini-language called regular expressions. You can use the egrep shorthand, which means the same.
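A quick illustration of the patterns above combined with grep (the file name is just a placeholder):
# lines that consist only of DNA bases
grep -E '^[ACGTN]+$' file.txt
# lines containing an AT dinucleotide repeated two or more times
grep -E '(AT){2,}' file.txt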
Word and line count¶
The wc command stands for word count and provides quick summary statistics on plain text file content: bytes, words and lines can be counted in the given files.
wc -c file.txt # Number of bytes in a file
wc -w file.txt # Number of words in a file
wc -l file.txt # Number of lines in a file
To calculate the number of lines in every file listed:
wc -l *.txt
- Count the number of variants in the file
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
wc -l
- Count the number of variants passing/failing the quality threshold
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'PASS' |
wc -l
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'FAIL' |
wc -l
- Count the number of variants on the chromosome Z passing the quality threshold
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'PASS' |
grep '^chrZ\s' |
wc -l
Retrieve & count unique records¶
Often we face the question of how many unique records there are in a file, or how many instances there are of each unique item. Unix provides an efficient way to cut out (cut) the desired columns and retrieve the unique records of a selected column (sort -u). Additionally, we can count the instances (uniq -c).
A typical example of use in bioinformatics is counting the number of genes or SNPs per contig or chromosome.
To select specific columns the cut command can be used. By default, cut uses TAB as the field separator; a different delimiter has to be set explicitly with -d (the TAB itself can be spelled out as -d $'\t'). When all columns except specific column(s) are to be selected, the --complement flag can be used.
cut -f1-3 file.txt
cut -d $'\t' -f1-3 file.txt
cut --complement -f4 file.txt # Select all columns except column 4
The content of a file can be sorted based on a specified column (-k1,1) or a range of columns (-k1,3). When the data needs to be sorted numerically (-n) or in reverse order (-r), the appropriate flags need to be added. Unlike cut, sort recognizes any whitespace as a separator by default. When TAB is used as the separator and needs to be distinguished from other whitespace in the file, the -t $'\t' flag has to be used explicitly.
sort -k1,1 file.txt # Sort based on first column
sort -k1,1 -k2,2nr file.txt # Sort first based on first column, then on second column numerically in reversed order
sort -k1,3 file.txt # Sort based on range of columns
The sort command can also be used to retrieve unique records using the -u flag. When counts of instances for each unique item are needed, the uniq -c command should be used in combination with sort, as the records have to be sorted before they are sent to uniq.
sort -u file.txt # Retrieve unique records in the file
< file.txt sort | uniq -c # Count number of instances for every unique item
Exercise: Which chromosome has the highest and the least number of variants?
< data-shared/luscinia_vars_flags.vcf grep -v '^#' |
cut -f1 |
sort |
uniq -c |
sort -k1,1n
Exercise: Get the first six base pairs from every read
cat *.fastq |
grep -E "^[ACGT]+$" |
cut -c1-6 |
sort |
uniq -c |
sort -k1,1nr |
less
String extraction and replacement¶
Another common task in bioinformatics is string extraction and/or replacement. Often we need to extract a specific piece of information from complex data.
A typical example is the extraction of a specific value according to a TAG in a GFF3 or VCF file. As the positions of the TAGs can differ from line to line, using the cut command is simply not possible; matching on the TAG with sed is the only option. A regex can be used to match the appropriate pattern with sed.
Another typical task is replacing delimiters. The tr command is very well suited for this. The -d flag can be used to remove specific characters from the file. Whole character classes can be replaced, which can for instance be used to change uppercase to lowercase or vice versa.
For extraction of repeating strings, grep -o is the most efficient tool. In bioinformatics it can easily be used to match and retrieve, for instance, microsatellites from a sequence.
Note
The difference between sed and tr:
tr (from TRansliterate) replaces (or deletes) individual characters: ideal for removing line ends (tr -d "\n") or replacing some separator with a TAB (tr ";" "\t").
sed replaces (or deletes) complex patterns.
Typical usage of tr is as follows (note that tr reads from standard input, so the file is redirected into it):
< file.txt tr "\t" ";"      # Replace the TAB separator with a semicolon
< file.txt tr -d "\n"       # Remove newline characters (\n)
tr "[A-Z]" "[a-z]"          # Replace uppercase with lowercase
Exercise: What is the number of samples in the VCF file?
< data-shared/luscinia_vars_flags.vcf grep -v '^##' |
head -n1 |
cut --complement -f1-9 |
tr "\t" "\n" |
wc -l
To match a specific string in a file, sed can use regexes similar to the grep command. However, to use the full regex functionality and simplify the regex syntax, the extended regular expression flag -r (-E on Mac OS X) has to be used.
Comparison of standard sed with sed using extended regular expressions (sed -r):
sed 's/pattern/replacement/'
# Replace one or more A or C or G or T by N
# Standard sed
sed 's/^[ACGTN]\{6\}/NNNNNN/'
# The same thing using extended regular expressions:
sed -r 's/^[ACGTN]{6}/NNNNNN/'
sed can also be used for string extraction. The matched string designated for extraction has to be marked with round brackets (string) and passed to the output using the notation \#, where the # character stands for the position, starting at 1, within the matched string (i.e. there can be multiple extractions from one matched string).
# Returns TTTGGG
echo 'AAATTTCCCGGG' | sed -r 's/A+(T+)C+(G+)/\1\2/'
Note
sed (text Stream EDitor) can do a lot of things; however, pattern replacement and extraction is the best thing to use it for. The 'sed language' consists of single-character commands, and it is no fun to code and even less fun to read (what does sed 'h;G;s/\n//' do?;). Use awk for more complex processing (see the next session).
grep -o extracts only the matching parts of a string. This command can be used to extract repeating patterns (i.e. very useful for extracting microsatellite sequences).
# Match AT di-nucleotide twice or more times
grep -o -E "(AT){2,}"
# Match GTC tri-nucleotide twice or more times
grep -o -E "(GTC){2,}"
# Match any repeating pattern
grep -o -E "([ATGC]{1,})\1+"
Exercise: What is the number of SNPs per chromosome in the VCF file?
Don't use the cut command!
FILE=/data-shared/vcf_examples/luscinia_vars_flags.vcf.gz
< $FILE zcat |
grep -o -E '^chr[Z1-9]+' |
sort |
uniq -c |
sort -k1,1nr
Exercise: Microsatellites statistics
Extract all AT dinucleotides repeating at least twice and calculate their frequency distribution in the whole dataset.
cat *.fastq |
grep -E "^[ACGT]+$" |
grep -o -E "(AT){2,}" |
sort |
uniq -c |
less
Join & paste data¶
The final part of this session is joining and pasting data. Here, we seek to merge multiple files into one. The join command corresponds to the standard JOIN known from the SQL language. It joins two files based on a specific key column: it assumes that both files contain a column holding keys that are present in both files. Both files must be sorted by the key before any joining task. The join command has the same functionality as a standard JOIN, meaning that it supports INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL OUTER JOIN (join types).
By default the key column is the first column in both input files. As already mentioned, the key column needs to be sorted the same way in both files.
sort -k1,1 file1.txt > file1.tmp
sort -k1,1 file2.txt > file2.tmp
join file1.tmp file2.tmp > joined-file.txt
If the key column is at a different position, it needs to be specified using the -1 and -2 flags:
sort -k2,2 file1.txt > file1.tmp # key column on the 2nd position
sort -k3,3 file2.txt > file2.tmp # key column on the 3rd position
join -1 2 -2 3 file1.tmp file2.tmp > joined-file.txt
To make join print unpaired lines as well (corresponding to left, right and full outer joins), the file to print unpaired lines from has to be specified using the -a flag. Also, the -e flag sets the value used for missing fields:
# Left join
join -a1 -e NA file1.tmp file2.tmp > joined-file.txt
# Right join
join -a2 -e NA file1.tmp file2.tmp > joined-file.txt
# Full outer join
join -a1 -a2 -e NA file1.tmp file2.tmp > joined-file.txt
Another command that can be used to merge two or more files is paste. As opposed to join, paste simply aligns the files by columns (corresponding to cbind in R). No key column is needed, as it assumes a one-to-one correspondence between the lines of the two files.
paste file1.txt file2.txt > file-merged.txt
The paste command can also be used for smart file transpositions. By default, paste expects multiple input files, one per column. However, when only one file is provided and the other file positions are filled with -, paste takes the further columns from the next lines of the single file provided. This feature makes it possible to transpose multiple lines into one line.
Example:
file.txt
item-line1
item-line2
item-line3
item-line4
< file.txt paste - -
item-line1 item-line2
item-line3 item-line4
This feature can be used in bioinformatics to convert .fastq files into TAB-separated files with one read per line. We will use this functionality in an upcoming session.
Exercise: Convert FASTQ file to TAB separated file with each read on one line
cat *.fastq |
paste - - - - |
cut --complement -f3 \
> reads.tab
Exercise¶
How many bases were sequenced?¶
wc can count characters (think bases) as well. But to get a reasonable number, we have to get rid of the other lines that are not bases. One way to do it is to pick only the lines consisting of the letters A, C, G, T and N. There is a ubiquitous mini-language called regular expressions that can be used to define text patterns. A line consisting only of a few possible letters is such a text pattern. grep is the basic tool for using regular expressions:
cat *.fastq | grep '^[ACGTN]*$' | less -S
Check if the output looks as expected. This is a very common way to work - build a part of the pipeline, check the output with less or head, and fix it or add more commands.
Now a short explanation of the ^[ACGTN]*$ pattern (grep works one line at a time):
- ^ marks the beginning of the line - otherwise grep would search anywhere in the line
- the square brackets ([]) are a character class, meaning one character from the list; [Gg]rep matches Grep and grep
- the * is a count suffix for the square brackets, saying there should be zero or more of such characters
- $ marks the end of the line - that means the whole line has to match the pattern
To count the bases read, we extend our pipeline:
cat *.fastq | grep '^[ACGTN]*$' | wc -c
The thing is that this count is not correct. wc -c counts every character, and the end of each line is marked by a special character written as \n (n for newline). To get rid of this character, we can use another tool, tr (transliterate). tr can substitute one letter with another (imagine you need to lowercase all your data, or mask lowercase bases in your Fasta file). Additionally, tr -d (delete) can remove characters:
cat *.fastq | grep '^[ACGTN]*$' | tr -d "\n" | wc -c
Note
If you like regular expressions, you can hone your skills at http://regex.alf.nu/.
Session 4: Unix - “Real” Project¶
Here you will learn several concepts needed for running real projects:
- use one screen session per project (if you're juggling more projects)
- keep your code in a text file
- keep your data files separate from your code
- make your intent more legible by wrapping code in functions and script files
- make your research shareable and reproducible via git
- write handy scripts in awk
- scale to multiple files and speed up your processing
Keep your stuff ordered¶
Let’s pretend we’re starting to work on something serious, a new project:
cd ~/projects
# a new project dir, separate dir for data in one shot
mkdir -p unix-advanced/data
# set your work dir to the 'current' project dir
cd unix-advanced
# definitely run screen to be safe
# screen sessions can be named, which helps you
screen -S advanced
# prepare source data
cp /data-shared/bed_examples/Ensembl.NCBIM37.67.bed data/
# add a new window with ctrl+a c
# note and save all the (working) code
nano workflow.sh
Note
The nano bottom bar has a lot of weird symbols. ^O means type ctrl+o, pressing both keys at once. M-U means alt+u, again pressing both simultaneously. For some reason, Write Out means Save in the usual parlance.
Now you’ll be typing your code in the nano window and pasting it to the other window where the shell is running. You’ll be writing code on your own now, so you need to keep track of what you created. I’ll occasionally post solutions to Slack.
awk (pronounced [auk])¶
awk is most often used instead of cut when the fields are separated by spaces and padded to a fixed width - awk can ignore the whitespace. And where cut also falls short, awk can reorder the columns:
Hands on!
# test the smart <tab> auto-complete!
INPUT=data/Ensembl.NCBIM37.67.bed
# $INPUT contains some genome annotations
# look around the file a bit
# - there are chromosome ids in the first column
# - we want to count the annotations per each chromosome
<$INPUT cut -f # complete the rest!!
But what if you wanted to have a table which mentions chromosomes first and then the counts? Enter awk. Every line (called a record) is split into fields, which are assigned to variables: $1 for the first field, $2 for the second, etc. The whole line is in $0 if needed. awk expects the program as its first argument:
Hands on!
# note the single quotes, they're important because of $
your_code_here | awk '{print $2, $1}'
Additionally you can add conditions when the code is executed:
Hands on!
your_code_here | awk '($2 < 10) {print $2, $1}'
Or even leave out the code body, which is the same as a single {print $0} statement - that is, print the matching lines:
Hands on!
your_code_here | awk '($2 < 10)'
There are some other variables pre-filled for each line, like the record number NR (starting at 1) and the number of fields NF.
# NF comes handy when checking if it's okay to
# process a file with (say) cut
<$INPUT awk '{print NF}' | uniq
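A small illustration of NR, reusing the $INPUT variable from above - skip a (hypothetical) header line by printing everything except the first record:
<$INPUT awk '(NR > 1)' | head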
Let's play with some fastq files. Extract the first five files to data:
INPUT=/data-shared/fastq/fastq.tar.gz
<$INPUT tar tz | head -5 | xargs tar xvf $INPUT -C data
Look at the data with less - these are reads from 454, with varying read lengths. Let's check the lengths:
<data/HRTMUOC01.RL12.01.fastq paste - - - - | awk '{print $1, length($2)}' | head
We could do a length histogram easily now… But let’s filter on the length:
Hands on!
<data/HRTMUOC01.RL12.01.fastq paste - - - - | # can you figure out?
# and we'd like to have a valid fastq file on the output
# - what if we replaced all the \t with \n (hint: tr)
Functions in the Shell¶
This creates a command called uniqt that behaves like uniq -c, but there will be no padding (spaces) in front of the numbers, and the numbers will be separated by a <tab>, so it works well with cut.
uniqt() { uniq -c | sed -r 's/^ *([0-9]+) /\1\t/' ;}
Now test it:
<data/Ensembl.NCBIM37.67.bed cut -f1 | sort | uniqt | head
You can see that the basics of the syntax are your-name() { command pipeline ;}. If you want to pass some arguments into the function, use $1, $2 etc.:
test-function() { echo First argument: $1 ;}
test-function my-argument
Now create a function called fastq-min-length, with one argument (use $1 in the body of the function) giving the minimal length:
Hands on!
fastq-min-length() { paste - - - - | your_code_here ;}
# which will be used like this:
<data/HRTMUOC01.RL12.01.fastq fastq-min-length 90 > data/filtered.fastq
We'll briefly go through the 'quoting hell' and some methods to solve it here. awk uses $1 for something else than the shell, so we need to protect it with single quotes, but we still need to get the shell's $1 through somehow... awk's -v argument helps in this case - use it like awk -v min_len=$1 '(length($2) > min_len)'.
Note
Let's pop open the matryoshka. What is a terminal, what is a shell, what is Bash? The program which takes care of collecting your keystrokes and rendering the colored characters coming from the server is called a terminal. Famous terminals are mintty (that's what you're using in Windows now), Konsole, Terminal App... The next doll inside is ssh. It takes care of encrypted communication with the remote server. An interesting alternative for geeks is mosh (google it yourself;). Now you need a program to talk to on the remote side - that is the shell. We're using bash now, sometimes you can meet the simpler cousin sh, and the kool kids are doing zsh. To recap, Bash is to shell what Firefox is to browser.
Shell Scripts¶
Another way to organize your code is to put it into a separate file called a 'script file'. It begins with a shebang line, telling the computer which language the script is in. The Bash shebang is #! /bin/bash.
Take care to give a descriptive name to your script:
nano fastq-filter-length.sh
Copy and paste the following code block into the nano editor, save it with ctrl+o
and switch to another bash window in screen.
Hands on!
#!/bin/bash
# your_code_here
echo Replace me with real code!
echo Arguments: $1 $2
# to stay with the 'tool concept'
# expect input on stdin and output the results to stdout
We need to mark the file as executable and test it:
chmod +x fastq-filter-length.sh
# check with ls, fastq-filter-length.sh should be green now
# and using ll you should see the 'x' (eXecutable) permission
ls
ll
# and run it (the ./ is important!)
./fastq-filter-length.sh
Note
You can check file permissions by typing ll instead of ls. rwx stand for Read, Write, eXecute, and are repeated three times: for User, Group, and Others. The two names you see next to the permissions are the file's owner user and group. You can change the permissions - if you have the permission to do so - with chmod, e.g. chmod go+w - "add write permission to group and others".
Now collect your code from above (the contents of your function, not the whole function) and paste it below the shebang. Don't forget to remove the debug echo parts - otherwise your script will spoil its output with some useless chatter.
# when the final code is there, you need to give it input (and maybe save the output):
<data/HRTMUOC01.RL12.01.fastq ./fastq-filter-length.sh 90 > data/filtered.fastq
Code management and sharing via GIT¶
Once you start writing your scripts, you'll soon find yourself handling files named like script.sh, script_bak.sh, script_previous_previous.sh, script_last_working.sh, etc. There is one cure for them all: git. It was originally created by Linus Torvalds, the author of the Linux kernel, for managing the source code of Linux, and it slowly gained popularity in other communities.
What git does is manage versions of a directory tree. The managed subtree is called a repository. Each saved version is called a commit. You usually create a commit when you've got a working version of your code. By allowing you to go back to any committed version, git effectively removes the need for all the _previous_working.sh copies of your code.
Note
Please note the difference between git and Google Docs - Google Docs keeps track of all versions of a particular file. git keeps track of manually selected versions (snapshots) of a whole directory. That makes sense when script1.sh calls script2.sh, and they have to match.
You will be using git to hand in the final exam, so please take care to set it up correctly. On every new machine you need to tell git once who you are, because the commits are 'signed' by the author.
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
To be able to share your code, you need an account on a 'social coding' site like GitHub. If you don't have a GitHub account, please get one. You don't have to do this right now, but if you believe you could use some help, now is the best time.
To upload your code, you need to add a key to GitHub here. This is how you generate the needed key:
# generate the private and public keys
ssh-keygen -t ed25519
# show the public key to copy and paste to github
cat ~/.ssh/id_ed25519.pub
And now after a tedious setup let’s reap the benefits. We’ll store the current
version of your scripts in unix-advanced
project, ignoring the data.
# make sure we're in ~/projects/unix-advanced
pwd
# tell git we want to track this directory
git init
# tell git that we don't want to track and store the huge data
# (git is not good at storing big data)
echo 'data*' >> .gitignore
# check what git sees in our brand new repo
git status
# add a single file
git add workflow.sh
# make a commit
# add some descriptive message
git commit -m 'solution to ngs-course exercises'
This is all you need to track your versions locally. If you want to publish your creations, make a backup for yourself, or move your code to some shared machine which will do bigger computations, you need to push it somewhere. If you already have the GitHub account, you can create a repo on the website, add it as a remote to your local repo and push. You don’t have to be shy, GitHub allows you to create a private repo.
# use the commands suggested by GitHub to add a remote
# ...
# then push
git push
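For reference, adding a GitHub remote and the first push typically look like this - the repository URL is only a placeholder (use the one GitHub shows you), and the default branch may be main instead of master:
git remote add origin git@github.com:your-github-username/your-repo.git
git push -u origin master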
When using git, you can gradually learn about more concepts and commands as you find the need for them. To give you a head start:
- git pull updates your local repo if the remote is newer - by pulling others' changes over yours, you'll soon encounter a merge
- git stash can be used to 'hide' local changes during pull to avoid a commit and the following merge; git stash pop brings them back
- git checkout -b new-name and git branch some-name allow you to keep more simultaneous versions in one repo and switch between them
Multi-file, multi-core processing¶
Multi-file processing is best done with find and xargs. That's basic Unix. If you install parallel, it substitutes xargs, does a much better job with a 'nicer' syntax, and makes multi-file, multi-core processing a breeze.
Let's check the basic concepts - find converts a directory structure to 'data' (stdout), xargs converts stdin to command line(s).
# Investigate!
find data -type f
find data -type f | xargs echo
find data -type f | xargs -I{} echo File: {} found!
parallel runs one instance of the command per each CPU in your machine. Regrettably your virtual machine has only one CPU, so this won't help much. But modern machines do have four and more CPUs, and then it really helps.
Control the number of jobs (-j) only when sharing the machine with someone, or when you're sure that your task is IO bound. Otherwise parallel does a good job of choosing the number of tasks to run for you.
Note
Parallelizing things IS difficult. There’s no discussion about that. There are some rules of thumb, which can help - but if you want to squeeze out the maximum performance from your machine, it’s still a lot of ‘try - monitor performance - try again’ cycles.
To get good performance it is important to know what happens during data processing: First the data is loaded from hard drive to memory, then from memory to the CPU, the CPU does the calculation, then the results have to get to the memory and saved to the hard drive again. Different workloads take different amounts of time in each step.
In general, you need a work unit which takes much longer to calculate than it takes to load the data from the hard drive (compare the times of pv data > /dev/null and pv data | your-task > /dev/null); usually a good work unit takes on the order of minutes. When disk access seems to be the limiting factor, you can try to compress the data with some fast compressor like lz4. Do not parallelize disk-intensive tasks, it will only make things slower! If you still want to use parallel's syntax, use parallel -j1 to use only a single core.
The most powerful thing about parallel is its substitution strings like {.}, {/}, {#} - check man parallel.
parallel echo Ahoj ::: A B C
parallel --dry-run echo Ahoj ::: A B C
parallel echo File: {} found! ::: data/*.fastq
parallel echo File: {/} found! ::: data/*.fastq
parallel echo File: {/.} found! ::: data/*.fastq
Note
If your data is a single file, but the processing of one line does not depend on the other lines, you can use the split command to create several files, each with a defined number of lines from the original file.
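A minimal sketch of that approach - the input file name and the processing command (your-tool) are placeholders:
# split a big file into chunks of 100000 lines (numeric suffixes: chunk_00, chunk_01, ...)
split -l 100000 -d big_file.txt chunk_
# process the chunks in parallel and collect the results
parallel 'your-tool {} > {}.out' ::: chunk_*
cat chunk_*.out > results.txt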
Session 5: Graphics session¶
A picture is worth a thousand words.
Especially when your data is big. We’ll try to show you one of the easiest ways to get nice pictures from your Unix. We’ll be using R, but we’re not trying to teach you R. R Project is huge, and mostly a huge mess. We’re cherry picking just the best bits;)
Summarization¶
R is best for working with 'tables'. That means data where each line contains the same number of 'fields', delimited by some special character like ; or <tab>. The first row can contain column names. VCF is almost a nice tabular file. The delimiter is <tab>, but there is some mess at the beginning of the file:
</data-shared/mus_mda/00-popdata/popdata_mda.vcf.gz zcat | less -S
Prepare the input file¶
Our input dataset is huge, and we want to use only a subset of the animals. Let's choose a few European individuals. They have RDS, KCT, MWN and BAG in their names. Every programmer is lazy and prefers to avoid mistakes by letting the machine do the work - let's find the right columns with Unix tools.
# we'll be reusing the same long file name, store it into a variable
IN=/data-shared/mus_mda/00-popdata/popdata_mda.vcf.gz
# create a new 'project' directory in data
mkdir -p ~/projects/plotvcf/data
cd ~/projects/plotvcf
# check the file once again, and find the obligatory VCF columns
<$IN zcat | less -S
# let the computer number the columns (colnames)
<$IN zcat |
grep -v '^##' |
head -1 |
tr "\t" "\n" |
nl |
less
# the obligatory columns are 1-9
# now find the sample columns according to the pattern
<$IN zcat |
grep -v '^##' |
head -1 |
tr "\t" "\n" |
nl |
grep -e RDS -e KCT -e MWN -e BAG |
awk '{print $1}' |
paste -sd,
# it's 67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85
# now decompress with zcat and subset the data with cut
<$IN zcat |
cut -f1-9,67-85 \
> data/popdata_mda_euro.vcf
We want to get rid of the comment lines starting with ##, and keep the line starting with # as column names (getting rid of the # itself):
IN=data/popdata_mda_euro.vcf
# get rid of the '##' lines (quotes have to be there, otherwise
# '#' means a comment in bash)
<$IN grep -v '##' | less -S
# better formatted output
<$IN grep -v '##' | column -t | less -S
# it takes too much time, probably column -t is reading too much of the file
# we're good with first 1000 lines
<$IN grep -v '##' | head -1000 | column -t | less -S
# good, now trim the first #
<$IN grep -v '##' | tail -c +2 | less -S
# all looks ok, store it (tsv for tab separated values)
# skip the very first character with tail
<$IN grep -v '##' | tail -c +2 > data/popdata_mda_euro.tsv
Now open RStudio¶
Just click this link (ctrl-click to keep this manual open): Open RStudio.
In R Studio choose File > New file > R Script. R has a working directory as well. You can change it with setwd. Type this code into the newly created file:
setwd('~/projects/plotvcf')
With the cursor still on the setwd line, press ctrl+enter. This copies the command to the console and executes it. Now press ctrl+s and save your script as plots.R. It is better practice to write all your commands in the script window and execute them with ctrl+enter. You can comment them easily, and you'll find them faster than in .Rhistory...
Load and check the input¶
We'll be using a specific subset of R, recently packaged into the Tidyverse:
library(tidyverse)
Tabular data is loaded by read_tsv. On a new line, type read_tsv and press F1. Help should pop up. We'll be using the read_tsv shorthand, which is preset for loading <tab> separated data with the US decimal separator:
read_tsv('data/popdata_mda_euro.tsv') -> d
A new entry should show up in the ‘Environment’ tab. Click the arrow and
explore. Also click the d
letter itself.
You can see that CHROM was encoded as a number only, and so it was loaded as an integer. But in fact it is a factor, not a number (remember e.g. chromosome X). Fix this in a mutate command, loading the data again and overwriting d. The (smart) plotting would not work well otherwise:
read_tsv('data/popdata_mda_euro.tsv') %>%
mutate(CHROM = as.factor(CHROM)) ->
d
First plot¶
We will use the ggplot2 library. The 'grammatical' structure of the command says what to plot and how to represent the values. Usually the ggplot command contains the reference to the data, and graphic elements are added with + geom_..(). There are even some sensible defaults - e.g. geom_bar of a factor sums the observations for each level of the factor:
ggplot(d, aes(CHROM)) + geom_bar()
This shows the number of variants in each chromosome. You can see here, that we’ve included only a subset of the data, comprising chromosomes 2 and 11.
Summarize the data¶
We’re interested in variant density along the chromosomes. We can simply break the chromosome into equal sized chunks, and count variants in each of them as a measure of density.
There is a function round_any in the package plyr which, given a precision, rounds the numbers. We will use it to round the variant position to 1x10^6 (a million base pairs), and then use this rounded position as the block identifier. Because the same positions repeat on each chromosome, we need to calculate it once per chromosome. This is guaranteed by group_by. mutate just adds a column to the data.
You’re already used to pipes from the previous exercises. While it’s not
common in R, it is possible to build your commands in a similar way thanks to
the magrittr
package. The name of the package is an homage to the Belgian
surrealist René Magritte and his most popular painting.
Ceci n’est pas une pipe. This is not a pipe.
Although the magrittr %>%
operator is not a pipe, it behaves like one. You
can chain your commands like when building a bash pipeline:
# 'bash-like' ordering (source data -> operations -> output)
d %>%
group_by(CHROM) %>%
mutate(POS_block = plyr::round_any(POS, 1e6)) ->
dc
# the above command is equivalent to
dc <- mutate(group_by(d, CHROM), POS_block = plyr::round_any(POS, 1e6))
Now you can check how round_any processed the POS value. Click the dc in the Environment tab and look for POS_block. Looks good, we can go on.
The next transformation is to count variants (table rows) in each block (per chromosome):
You can use View
in R Studio instead of less
in bash.
dc %>%
group_by(CHROM, POS_block) %>%
summarise(nvars = n()) %>%
View
Note
To run the whole block at once with ctrl+enter, select it before you press the shortcut.
If the data look like you expected, you can go on to plotting:
dc %>%
group_by(CHROM, POS_block) %>%
summarise(nvars = n()) %>%
ggplot(aes(POS_block, nvars)) +
geom_line() +
facet_wrap(~CHROM, ncol = 1)
Now you can improve your plot by making the labels more comprehensible:
dc %>%
group_by(CHROM, POS_block) %>%
summarise(nvars=n()) %>%
ggplot(aes(POS_block, nvars)) +
geom_line() +
facet_wrap(~CHROM, ncol = 1) +
ggtitle("SNP denisty per chromosome") +
ylab("number of variants") +
xlab("chromosome position")
If you prefer bars instead of a connected line, it’s an easy swap with ggplot.
dc %>%
group_by(CHROM, POS_block) %>%
summarise(nvars = n()) %>%
ggplot(aes(POS_block, nvars)) +
geom_col() +
facet_wrap(~CHROM, ncol = 1) +
ggtitle("SNP denisty per chromosome") +
ylab("number of variants") +
xlab("chromosome position")
This could have saved us some more typing:
ggplot(d, aes(POS)) +
geom_histogram() +
facet_wrap(~CHROM, ncol = 1) +
ggtitle("SNP denisty per chromosome") +
ylab("number of variants") +
xlab("chromosome position")
ggplot warned you in the Console:
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
You can use binwidth to adjust the width of the bars, setting it to 1x10^6 again:
ggplot(d, aes(POS)) +
geom_histogram(binwidth=1e6) +
facet_wrap(~CHROM, ncol = 1) +
ggtitle("SNP denisty per chromosome") +
ylab("number of variants") +
xlab("chromosome position")
Tidy data¶
To create plots in such a smooth way like in the previous example the data has to loosely conform to some simple rules. In short - each column is a variable, each row is an observation. You can find more details in the Tidy data paper.
The VCF as is can be considered tidy when using the CHROM and POS columns: each variant (SNP) is a row. The data is not tidy when using variants in particular individuals. All individual identifiers should be in a single column (variable), but there are several columns with individual names. This is not 'wrong' per se, this format is more concise. But it does not work well with ggplot.
Now if we want to look at genotypes per individual, we need the genotype as a single variable, not as one column per individual. gather takes the values from multiple columns and gathers them into one column. It creates another column where it stores the originating column name for each value.
d %>%
gather(individual, genotype, 10:28 ) ->
dm
Look at the data. Now we can plot the counts of reference/heterozygous/alternative alleles easily.
ggplot(dm, aes(individual, fill = genotype)) + geom_bar()
Again, most of the code is (usually) spent on trying to make the plot look better:
ggplot(dm, aes(individual, fill = genotype)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Now try to change parts of the command to see the effect of the various parts. Delete , fill = genotype (including the comma) and execute. A bit boring. We can get many more colours by colouring each base change differently:
# base change pairs, but plotting something else than we need (probably)
ggplot(dm, aes(individual, fill = paste(REF, ALT))) + geom_bar()
What could be interesting is the transitions to transversions ratio, for each individual:
# transitions are AG, GA, CT, TC
# transversions are the rest
transitions <- c("A G", "G A", "C T", "T C")
dm %>%
mutate(vartype = paste(REF, ALT) %in% transitions %>% ifelse("Transition", "Transversion")) %>%
ggplot(aes(individual, fill=vartype)) +
geom_bar()
# works a bit, but not what we expected
# now count each homozygous ref as 0,
# heterozygous as 1 and homozygous alt as 2
# filter out uncalled
dm %>%
filter(genotype != './.') %>%
mutate(vartype = paste(REF, ALT) %in% transitions %>% ifelse("Transition", "Transversion"),
score = ifelse(genotype == '0/0', 0, ifelse(genotype == '0/1', 1, 2))) %>%
group_by(individual, vartype) %>%
summarise(score = sum(score)) %>%
spread(vartype, score) %>%
mutate(TiTv = Transition / Transversion) %>%
ggplot(aes(individual, TiTv)) +
geom_point() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Note
Take your time to look at the wonderful cheat sheets compiled by the company behind RStudio!
Session 6: Read quality¶
Many steps of genomic data processing have some associated quality value for their results. Here we will briefly check the first and the last of those. But there is no simple way to set your quality thresholds. You have to recognize completely bad data; after that there is a continuum. Sometimes you just need an idea of the underlying biology to find some variants for further screening. Sometimes you're trying to pinpoint a particular variant causing a disease.
Each read that comes out of the (now common) sequencing machines like Illumina or Ion Torrent has a quality score assigned to each of its bases. This is not true for the upcoming NanoPore or the almost forgotten SOLiD machines, which read more bases at a time.
Phred encoding¶
The quality in Fastq files is encoded in Phred quality score, a number on a logarithmic scale, similar to decibels.
Phred quality    Probability of error
20               1 in 100
40               1 in 10,000
60               1 in 1,000,000
Each Phred number is in turn encoded as a single character, so there is a straightforward mapping between the bases and the quality scores. The most common mapping is Phred+33:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
0.2......................26...31........41
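A quick sanity check of the Phred+33 mapping from the shell - od prints the ASCII code of the character, awk subtracts the 33 offset:
# the character 'I' should decode to quality 40 under Phred+33
printf 'I' | od -An -tu1 | awk '{print $1 - 33}'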
FastQC¶
FastQC is a nice tool that you run with your data to get nice graphical reports on the quality. It is not completely installed in your images, so you can try to run a tool that was just unpacked (this is how this tool is distributed, there is no install script - the daily bread of a bioinformatician;). You have to use the full path to the tool to run it:
# make a project directory for the qualities
cd
mkdir -p projects/quality
cd projects/quality
mkdir 00-reads
</data-shared/fastq/fastq.tar.gz tar xz -C 00-reads
mkdir 01-quality
~/sw/FastQC/fastqc -o 01-quality --noextract 00-reads/HRTMUOC01.RL12.01.fastq
Now transfer the .html files from the virtual machine to yours. Open the files on your machine. You should see a report with several quality plots.
Parsing Fastq and decoding Phred¶
To understand better what is in the FastQC plots, we will try to reproduce the plots using Unix and ggplot. You should be able to understand the following pipeline, at least by taking it apart with the help of head or less. A brief description:
- paste merges every four lines into one
- awk selects only reads longer than 50 bases
- sed replaces the leading '@' with an empty string
- head takes the first 1,000 sequences
- awk creates a Phred decoding table in the BEGIN section, then uses the table (quals) to decode the values, and outputs one row for each base (see 'tidy data')
mkdir 02-quality-handmade
IN=00-reads/HRTMUOC01.RL12.01.fastq
<$IN paste - - - - |
awk 'length($2) > 50' |
sed 's/^@//' |
head -1000 |
awk 'BEGIN {
OFS="\t";
for(i = 33; i < 127; i++) quals[sprintf("%c", i)] = i - 33;
}
{
l = length($2)
for(i = 1; i <= l; i++) {
print $1, i, l - i, substr($2, i, 1), quals[substr($4, i, 1)];}
}'\
> 02-quality-handmade/quals.tsv
Quality by position¶
The first of the FastQC plots shows a summary of base qualities according to position in the read. But it does not show quality scores for all possible positions, they are grouped into classes of similar importance. The further the base in the read, the bigger the group.
Fire up R Studio by clicking the link.
Create a file where your plotting code will live, File > New file > R Script, and move to the directory where you're working now (don't forget to use tab completion):
setwd('~/projects/quality')
Now save it as plots.R. (Doing setwd first offers the correct directory to save the file.)
First we will read in the data.
library(tidyverse)
read_tsv("02-quality-handmade/quals.tsv",
col_names=c("seq", "pos", "end_pos", "base", "qual")) ->
d
We did not include column names in the data file, but it is easy to provide them during the load via the col_names argument. Let's look at the base quality values for the first 10 sequences:
d$seq %>% unique %>% head(10) -> sel
d %>%
filter(seq %in% sel) %>%
ggplot(aes(pos, qual, colour = seq, group = seq)) +
geom_line()
The qualities on the sequence level don't seem to be very informative. They're rather noisy. A good way to fight noise is aggregation. We will aggregate the quality values using boxplots over different position regions. First set up the intervals:
# fastqc uses bins with varying size:
# 1-9 by one, up to 75 by 5, up to 300 by 50, rest by 100
c(0:9,
seq(14, 50, by = 5),
seq(59, 100, by = 10),
seq(149, 300, by = 50),
seq(400, 1000, by=100)) ->
breaks
# create nice labels for the intervals
data.frame(
l = breaks %>% head(-1),
r = breaks %>% tail(-1)) %>%
mutate(
diff = r - l,
lab = ifelse(diff > 1, paste0(l + 1, "-", r), as.character(r))) ->
labs
Check the breaks and labs variables. In the FastQC plot there are coloured quality zones - green, yellow and red. To replicate this, we need the values of the limits:
# data for quality zones
data.frame(
ymin = c(0, 20, 28),
ymax = c(20, 28, 40),
colour=c("red", "orange", "green")) ->
quals
# check if the quality zones look reasonable
ggplot(quals, aes(ymin=ymin, ymax=ymax, fill=colour)) +
geom_rect(alpha=0.3, xmin=-Inf, xmax=Inf) +
scale_fill_identity() +
scale_x_discrete()
Now we can use the breaks to create position bins:
d %>%
mutate(bin=cut(pos, breaks, labels = labs$lab)) ->
dm
# plot the qualities in the bins
ggplot(dm, aes(bin, qual)) +
geom_boxplot(outlier.colour = NA) +
ylim(c(0, 45))
Zones and boxplots look OK, so we can combine the two into one plot - that's pretty easy with ggplot. We use theme to rotate the x labels so they're all legible. In a real-world application the qualities are binned first and the statistics are then calculated on the fly, so it is not necessary to load all the data at once.
ggplot(dm) +
geom_rect(aes(ymin = ymin, ymax = ymax, fill = colour),
xmin = -Inf,
xmax = Inf,
alpha=0.3,
data = quals) +
scale_fill_identity() +
geom_boxplot(aes(bin, qual), outlier.colour = NA, fill = "yellow") +
geom_smooth(aes(bin, qual, group = 1), colour = "blue") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))

Now we can do the base frequency plot. We already have the position bins, so just throw ggplot at it:
ggplot(dm, aes(bin, fill = base)) + geom_bar()
We're almost there - we just need to normalize the values in each column so they sum up to 1. ggplot can do it for us:
ggplot(dm, aes(bin, fill = base)) + geom_bar(position = "fill")
If you still want to get the line chart, you need to calculate the relative frequencies yourself:
dm %>%
group_by(bin, base) %>%
summarise(count = n()) %>%
group_by(bin) %>%
mutate(`relative frequency` = count / sum(count)) ->
dfreq
dfreq %>%
ggplot(aes(bin, `relative frequency`, colour = base, group = base)) +
geom_line(size = 1.3)
What is better about the bar chart, and what is better about the line chart?
FastQC exercise¶
Hands on!
Now run the FastQC quality check for all reads in 00-reads. Write the commands on your own. Use globbing patterns! Or try to write an alternative command with find and parallel.
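If you get stuck, one possible solution looks like this (a sketch - both variants do the same thing):
# globbing: let the shell expand all the fastq files
~/sw/FastQC/fastqc -o 01-quality --noextract 00-reads/*.fastq
# alternative: find the files and run one fastqc job per file with parallel
find 00-reads -name '*.fastq' | parallel ~/sw/FastQC/fastqc -o 01-quality --noextract {}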
Note
When checking quality of multiple fastq files, there is MultiQC - it takes the output of multiple FastQC runs and generates a nice summary. You can try to run MultiQC as a homework:
# run the multiqc on the fastqc results
multiqc -o 03-multiqc 01-quality
Session 7: Genome assembly¶
We'll be assembling a genome of E. coli from paired Illumina reads. We're using Velvet, which is a bit obsolete but nice for teaching purposes - it runs quite fast and does not consume that much memory. A state-of-the-art assembler today is, for example, SPAdes.
mkdir -p ~/projects/assembly
cd ~/projects/assembly
# link the shared data to current project (no copying)
ln -s /data-shared/ecoli/reads 00-reads
# look what's around
ll
Velvet is used in two phases: the first phase prepares the reads, the second phase does the assembly itself. Open the Velvet manual. When using anything more complex than Notepad, you will actually save time by reading the manual. Surprised?;)
Also - run screen at this point, because you want to continue working while Velvet blocks one of the consoles.
# load and prepare the reads
velveth 03-velvet-k21 21 -fastq -short -separate 00-reads/MiSeq_Ecoli_MG1655_50x_R1.fastq 00-reads/MiSeq_Ecoli_MG1655_50x_R2.fastq
# do the assembly - you can flip through the manual in the meantime..
velvetg 03-velvet-k21
Running the whole assembly process is actually that simple. What is not simple is deciding whether your assembly is correct and whether it is the best one you can get from your data. There is actually a lot of trial and error involved if you are set on getting the best one. People usually combine several assemblers, test several settings for each assembler, and then combine the best runs from each of the assemblers with yet another assembler..;)
The best criteria for evaluating your assembly are usually external - N50 is nice, but it does not tell you much about chimeric contigs, for example. Overlaps with an assembly of a closely related species, or the number of genes that can be found using protein sequences from a close species, are good metrics.
# check the expected (assembly) coverage
~/sw/velvet_1.2.10/contrib/estimate-exp_cov/velvet-estimate-exp_cov.pl 03-velvet-k21/stats.txt | less
On the other hand, when it's bad, any metric will do - the reported N50 of 94 base pairs means there is something [terribly] wrong. Let's try to use the information on read pairs.
velveth 04-velvet-k31-paired 31 -fastq -shortPaired -separate 00-reads/MiSeq_Ecoli_MG1655_50x_R1.fastq 00-reads/MiSeq_Ecoli_MG1655_50x_R2.fastq
velvetg 04-velvet-k31-paired -exp_cov auto -ins_length 150 -read_trkg yes
# Final graph has 749 nodes and n50 of 64026, max 182119, total 4551702, using 1508279/1546558 reads
# check the expected (assembly) coverage
~/sw/velvet_1.2.10/contrib/estimate-exp_cov/velvet-estimate-exp_cov.pl 04-velvet-k31-paired/stats.txt | less
# check observed insert length
~/sw/velvet_1.2.10/contrib/observed-insert-length.pl/observed-insert-length.pl 04-velvet-k31-paired | less
The observed-insert-length.pl script calculates suggestions for the -ins_length and -ins_length_sd parameters to velvetg, so let's see if the suggestions improve the assembly:
velvetg 04-velvet-k31-paired -exp_cov 35 -cov_cutoff 0 -ins_length 297 -ins_length_sd 19.4291558645302 -read_trkg yes
# Final graph has 26162 nodes and n50 of 38604, max 155392, total 5412695, using 1501742/1546558 reads
velvetg 04-velvet-k31-paired -exp_cov auto -ins_length 297 -ins_length_sd 19.4291558645302 -read_trkg yes
# Final graph has 696 nodes and n50 of 95393, max 209376, total 4561418, using 1508281/1546558 reads
Now the run is really fast, because most of the work is already done. The N50 improved significantly, and we also got ~50 fewer contigs, which could mean a better assembly. When a reference sequence is already available, we can compare its size with our assembly - the more similar, the better;) 4.6 Mbp is quite close to 4,561,418 … nice.
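If you want to check the contigs yourself, a quick awk one-liner over contigs.fa (the standard velvetg output file) counts the contigs and sums their lengths - a minimal sketch for a rough sanity check:
# count contigs and sum their lengths in the velvet output
<04-velvet-k31-paired/contigs.fa awk '
    /^>/ {n++; next}
    {total += length($0)}
    END {print n " contigs, " total " bases in total"}'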
Using Mauve
we can align the result with the reference E. coli genome:

Session 8: Bioinformatic pipelines¶
Mouse Gene Ontology enrichment analysis pipeline¶
A bioinformatic pipeline to carry out Gene Ontology enrichment analysis for low- and high-divergence genes between two house mouse subspecies.
SNP variants for two mouse strains (PWD/PhJ, WSB/EiJ) were downloaded from the Mouse Genome Project FTP site (Mouse Genome Project). PWD/PhJ and WSB/EiJ represent Mus musculus musculus and Mus musculus domesticus subspecies, respectively.
The aim is to identify genes with high relative divergence between the two strains and to carry out Gene Ontology enrichment analysis on the genes according to their divergence.
First, you have to clone the repository from GitHub:
# Go to your project directory
cd ~/projects
# Clone repository
git clone https://github.com/vjanousk/mouse-go-analysis.git
# Go to the local repository
cd mouse-go-analysis
Install required software¶
For the mouse Gene Ontology pipeline three specialized genomic tools are necessary: bedtools, vcftools and bcftools. If they are not installed, the script below will install them.
./install.sh
Prepare directories¶
We are going to prepare directories that are used to store the source data as well as intermediate steps and the final results. We will create a new data directory in your project directory:
mkdir -p data/00-source-data data/01-divergence data/02-go
Define variables¶
Several types of variables are defined. The filtering parameters provide thresholds for filtering quality and the number of genes used at different stages of the pipeline.
# Filtering parameters
quality=50
readdepth=10
minnumgenes=30
# Working directories (removed from the git in .gitignore)
wd_source=data/00-source-data
wd_divergence=data/01-divergence
wd_go=data/02-go
# Source files
sourcevcf=/data-shared/mouse_go/mgp.v5.snps.dbSNP142.clean.X.vcf.gz
sourcegenes=/data-shared/mouse_go/MGI.gff3.gz
go2genes=/data-shared/mouse_go/gene_association.mgi.gz
goterms=/data-shared/mouse_go/go_terms.mgi.gz
# Processed source files
cds_db=$wd_source/mgi-cds.bed
go_db=$wd_source/go2genes.txt
# Divergence analysis output files:
annotation=$wd_divergence/annotation.tab
divergencevcf=$wd_divergence/out-vars.vcf
divergence=$wd_divergence/divergence.bed
# GO enrichment by high and low divergence regions
div_go=$wd_go/divergence_by_go.txt
Prepare CDS & GO databases¶
MGI.gff3.gz is a full report containing detailed information on genes, mRNAs, exons and CDS. For the divergence analysis only the CDS are needed. The CDS database is prepared in this step: the .gff3 is converted to .bed to make working with the CDS data easier.
src/make_cds_database.sh $sourcegenes $cds_db
go_terms.mgi.gz and gene_association.mgi.gz represent the GO terms and the associations between genes and GO term IDs, provided by Mouse Genome Informatics (Mouse Genome Informatics) and the Gene Ontology Consortium (Gene Ontology). The command below prepares a joined dataset linking genes to their GO terms.
src/make_go_database.sh $go2genes $goterms $go_db
Run the pipeline step-by-step¶
1. Selecting SNPs that are divergent between the two strains
Other criteria used for selection are the PHRED quality and read depth (DP). Divergent SNPs are identified using the Fst function built into vcftools. SNPs are considered divergent when Fst equals 1.
src/get_divergent_variants.sh \
$quality \
$readdepth \
$sourcevcf \
$annotation \
$divergencevcf
2. Calculate the per-gene divergence
Once the list of divergent SNPs between the two strains and the CDS database are created, the divergence per gene can be calculated. A combination of bedtools and awk commands is used to find SNPs overlapping the CDS parts of the genes and to calculate sums and relative divergence per gene.
src/calculate_per_gene_divergence.sh \
$divergencevcf.gz \
$cds_db \
$divergence
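The actual commands live in src/calculate_per_gene_divergence.sh in the repository; just to illustrate the general bedtools + awk pattern described above, here is a rough sketch (the file names and column numbers are placeholders, not the pipeline's real intermediates):
# count variants overlapping each CDS interval (-c appends the count)
bedtools intersect -a cds.bed -b variants.vcf.gz -c > cds-with-counts.bed
# sum the counts per gene, assuming the gene ID is in column 4 of a 6-column BED
sort -k4,4 cds-with-counts.bed > cds-with-counts-sorted.bed
bedtools groupby -i cds-with-counts-sorted.bed -g 4 -c 7 -o sum > per-gene-counts.txt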
3. Calculate the average relative divergence by Gene Ontology category
Per-gene relative divergences are used to calculate the average relative divergence for individual GO terms. A combination of the built-in Unix join and sort commands is used along with groupby, which is part of the bedtools suite. The GO dataset is joined to the dataset with per-gene relative divergences, and the average for every GO term is then calculated, omitting low-prevalence GO terms.
src/divergence_by_go.sh \
$divergence \
$go_db \
$minnumgenes \
$div_go
4. Prepare a barplot showing the results of the GO enrichment analysis
To plot the results of the GO enrichment analysis, Rscript is used together with the ggplot2 library, which provides a fast and efficient plot.
Rscript src/plot.R
Alternatively, we can open the .R
file in R Studio and plot the graph there.
Resulting ggplot graph¶

Run the whole pipeline at once¶
Now we can try to run the whole pipeline at once using the pipeline.sh shell script, this time with a different set of parameters.
# Filtering parameters
quality=50
readdepth=10
minnumgenes=30
# GO enrichment by high and low divergence regions
div_go=$wd_go/divergence_by_go-ver2.txt
# Run the pipeline
./pipeline.sh \
$quality \
$readdepth \
$minnumgenes \
$sourcevcf \
$annotation \
$divergencevcf \
$cds_db \
$divergence \
$go_db \
$div_go
Session 9: Variant quality¶
In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:
/data-shared/vcf_examples/luscinia_vars.vcf.gz
Your task now is:
- pick only data for chromosomes chr1 and chrZ
- extract the sequencing depth DP from the INFO column
- extract variant type by checking if the INFO column contains the INDEL string
- load these two columns together with the first six columns of the VCF into R
- explore graphically
  - barchart of variant types
  - boxplot of qualities for INDELs and SNPs (use scale_y_log10() if you don't like the outliers)
  - histogram of qualities for INDELs and SNPs (use scale_x_log10(), facet_wrap())
  - what is the problem?
And a bit of guidance here:
- create a new project directory in your projects
- get rid of the comments (they start with #, that is the ^# regular expression)
- filter lines based on chromosomes (grep -e 'chr1\s' -e 'chrZ\s')
- extract the first 6 columns (cut -f1-6)
- extract the DP column (egrep -o 'DP=[^;]*' | sed 's/DP=//')
- check each line for INDEL (awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}')
- merge the data (columns) before loading to R (paste)
- add column names while loading the data with read_tsv(..., col_names=c(...))
Good luck! (We will help you;)
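If you get stuck, one possible way to put the pieces together looks like this (a sketch, not the only solution - the file names are up to you):
mkdir -p ~/projects/variant-quality
cd ~/projects/variant-quality
IN=/data-shared/vcf_examples/luscinia_vars.vcf.gz
# keep only chr1 and chrZ, drop the comment lines
<$IN zcat | grep -v '^#' | grep -e 'chr1\s' -e 'chrZ\s' > filtered.vcf
# the three pieces: first six columns, DP, variant type
<filtered.vcf cut -f1-6 > cols.tsv
<filtered.vcf egrep -o 'DP=[^;]*' | sed 's/DP=//' > dp.tsv
<filtered.vcf awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}' > type.tsv
# merge the columns and load vars.tsv in R with read_tsv()
paste cols.tsv dp.tsv type.tsv > vars.tsv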
Additional reference materials:
Links¶
Here are a few sources that you can go through when you get bored during the course…
Advanced Bash¶
Bash command help re-imagined¶
Some fun¶
R Studio¶
Visual design¶
A few more tips¶
This is a collection of tips that may help to overcome the initial barrier of working with a 'foreign' system. There are many ways to achieve a solution; the ones presented here are not the only correct ones, but they have proved beneficial to the authors.
Easiest ways to get Unix¶
To get the most basic Unix tools, you can download and install Git for Windows. It comes with a nice terminal emulator, and installs to your right-click menu as 'Git Bash here' - which runs a terminal in the folder that you clicked. Git itself is meant for managing versions of directories, but it cannot live without the Unix environment, so someone did the hard work and packaged it all nicely together.
If you need a more complete Unix environment, there are currently several options. If you have a recent version of Windows 10 (yes, there are different versions of Windows 10), you can enable 'Windows Subsystem for Linux (WSL)' and then install Ubuntu or Debian from the Windows Store. It's a marvel of engineering to connect two operating systems - you're getting a 'real' Linux.
In older Windows, you can use Cygwin. It is quite complete, but it can’t replace native Unix, as you’ll find sooner or later.
Another easy way of getting a Unix environment in Windows is to install a basic Linux into a virtual machine. Our previous courses used this method. It's much more convenient than dual boot configurations, and the risk of completely breaking your computer is lower. You can be using Unix while having all your familiar stuff at hand. The only downside is that you have to transfer all the data as if the image was a remote machine.
It's much more convenient to use a normal terminal like PuTTY to connect to the machine rather than typing the commands into the virtual screen of VirtualBox - it's usually lacking clipboard support, you cannot change the size of the window, etc.
Mac OS X and Linux are Unix based, you just have to know how to start your
terminal program (konsole
, xterm
, Terminal
…).
Essentials¶
Always use screen for any remote work. Not using screen will cause your jobs to be interrupted when the network link fails (given you're working remotely), and it will make you keep your home computer running even if your calculation is running on a remote server.
Track system resource usage with htop. A system that is running low on memory won't perform fast. A system with many cores where only one core ('CPU') is used should be given more tasks - or it can finish your task much faster, if used correctly.
Data organization¶
Make a new directory for each project. Put all your data into subdirectories. Use
symbolic links to reference huge data that are reused by more projects in your current
project directory.
Prefix your directory names with 2-digit numbers if your projects have more than a few subdirectories. Increase the number as the data inside get more and more 'processed'. Keep the code in the top directory. It is easy to distinguish data references just by the [0-9]{2}- prefix.
An example of a genomic pipeline data directory follows:
00-raw --> /data/slavici/all-reads
01-fastqc
10-trim-adaptors
13-fastqc
20-assembly-newbler
30-map-reads-gmap
31-map-reads-bwa
50-variants-samtools
Take care to note all the code used to produce all the intermediate data files. This has two benefits: 1) your results will be really reproducible, and 2) it will save you a lot of work when doing the same thing again or trying different settings.
If you feel geeky, use git
to track your code files. It will save you from having 20 versions
of one script - and you being completely lost a year later, when trying to figure out which one
was the one that was actually working.
Building command lines¶
Build the pipelines command by command, keeping | less -S (or | head if you don't expect lines of the output to be longer than your terminal width) at the end. Every time, check that the output is what you expect, and only after that add the next command. If there is a sort in your pipeline, you have to put a head in front of the sort, because otherwise sort has to process all the data before it gives any output.
I (Libor) prefer the 'input first' syntax (<file command | command2 | command3 >out), which improves legibility, resembles a real-world pipeline more (input tap -> garden hose -> garden sprinkler), and when reusing the pipeline with different input files, the names are easier to find.
Wrap your long pipelines on |
- copy and paste to bash still works, because bash knows there
has to be something after |
at the end of the line. Only the last line has to be escaped with \
,
otherwise all your output would go to the screen instead of a file.
<infile sort -k3,3 |
uniq -c -s64 |
sort -k1rn,1 \
>out
You can get a nice progress bar if you use pv
(pipe viewer) instead of cat
at the beginning
of the pipeline. But again, if there is a sort
in your pipeline, it has to consume all the data
before it starts to work.
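For example, the fastq unpacking from the quality session could be wrapped with pv to see how fast the archive is being read - a small sketch:
# progress bar while unpacking (pv reports how much of the file went through)
mkdir -p 00-reads
pv /data-shared/fastq/fastq.tar.gz | tar xz -C 00-reads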
Use variables instead of hard-coded file names / arguments, especially when the name is used more times in the process, or the argument is supposed to be tuned:
FILE=/data/00-reads/GS60IET02.RL1.fastq
THRESHOLD=300
# count sequences in file
<$FILE awk '(NR % 4 == 2)' | wc -l
# 42308
# count sequences longer than the threshold
<$FILE awk -v thr=$THRESHOLD '(NR % 4 == 2 && length($0) > thr)' | wc -l
# 14190
Parallelization¶
Many tasks, especially in Big Data and NGS, are 'data parallel' - that means you can split the data into pieces, compute the results on each piece separately and then combine the results to get the complete result. This makes it very easy to exploit the full power of modern multi-core machines, speeding up your processing e.g. 10 times. GNU parallel is a nice tool that helps to parallelize bash pipelines; check the manual for some examples: man parallel_tutorial.
Reference manual for Unix introduction¶
Basic orientation in Unix¶
Multiple windows (screen)
You're all used to working with multiple windows (in MS Windows;). You can have them in Unix as well. The main benefit, however, is that you can log off and your programs keep running.
To go into a screen mode type:
screen
Once in screen you can control screen itself after you press the master key (and
then a command): ctrl+a key
. To create a new window within the screen mode,
press ctrl+a c
(create). To flip among your windows press ctrl+a space
(you flip windows often, it’s the biggest key available). To detach screen (i.e.
keep your programs running and go home), press ctrl+a d
(detach).
To open a detached screen type:
screen -r # -r means restore
To list running screens, type:
screen -ls
Controlling processes (htop/top)
htop or top serve to show the actual resource utilization of each running process. Htop is a much nicer variant of the standard top. You can sort the processes by memory usage, CPU usage and a few other things.
Getting help (man)
Any time you're not sure about a program option while building a command line, just flip to the next screen window (you're always using screen for serious work) and type man and the name of the command you want to know more about:
man screen
Moving around & manipulation with files and directories¶
Basic commands to move around and manipulate files/directories.
pwd # prints current directory path
cd # changes current directory path
ls # lists current directory contents
ll # lists detailed contents of current directory
mkdir # creates a directory
rm # removes a file
rm -r # removes a directory
cp # copies a file/directory
mv # moves a file/directory
locate # tries to find a file by name
Usage:
cd
To change into a specific subdirectory, and make it our current working directory:
cd go/into/specific/subdirectory
To change to parent directory:
cd ..
To change to home directory:
cd
To go up one level to the parent directory then down into the directory2:
cd ../directory2
To go up two levels:
cd ../../
ls
To list also the hidden files and directories (-a) in a given folder, along with human-readable (-h) sizes of files (-s), type:
ls -ash
mv
To move a file data.fastq from current working directory to directory
/home/directory/fastq_files
, type:
mv data.fastq /home/directory/fastq_files/data.fastq
cp
To copy a file data.fastq from current working directory to directory
/home/directory/fastq_files
, type:
cp data.fastq /home/directory/fastq_files/data.fastq
locate
This quickly finds a file by a part of its name or path. To locate a file named data.fastq type:
locate data.fastq
The locate
command uses a database of paths which is automatically updated
only once a day. When you look for some recent files you may not find them. You
can manually request the update:
sudo updatedb
Symbolic links
Symbolic links refer to some other files or directories in a different location. It is useful when one wants to work with some files accessible to more users but wants to have them in a convenient location at the same time. Also, it is useful when one works with the same big data in multiple projects. Instead of copying them into each project directory one can simply use symbolic links.
A symbolic link is created by:
ln -s /data/genomes/luscinia/genome.fa genome/genome.fasta
Exploring and basic manipulation with data¶
less
A program to view the contents of text files. As it loads only the part of the file that fits the screen (i.e. it does not have to read the entire file before starting), it has fast load times even for large files.
To view a text file with line wrapping disabled and line numbers added, use the options -S and -N, respectively:
less -SN data.fasta
To navigate within the text file while viewing use:
Key          Command
Space bar    Next page
b            Previous page
Enter key    Next line
/<string>    Look for string
<n>G         Go to line <n>
G            Go to end of file
h            Help
q            Quit
cat
Utility which outputs the contents of a specific file and can be used to concatenate and list files. Sometimes used in Czech as translated to ‘kočka’ and then made into a verb - ‘vykočkovat’;)
cat seq1_a.fasta seq1_b.fasta > seq1.fasta
head
By default, this utility prints the first 10 lines. The number of the first n lines can be specified by the -n option (or by -..number..).
To print first 50 lines type:
head -n 50 data.txt
# is the same as
head -50 data.txt
# special syntax prints all but the last 50 lines
head -n -50 data.txt
tail
By default, this utility prints the last 10 lines. The number of the last n lines can be specified by the -n option, as in the case of head.
To print last 20 lines type:
tail -n 20 data.txt
To skip the first line in the file (e.g. to remove header line of the file):
tail -n +2 data.txt
grep
This utility searches text file(s) for lines matching a text pattern and prints the matching lines. The pattern is either a specific string or a regular expression. Regular expressions allow more generic patterns than a fixed string (e.g. searching for 'a' followed by 4 digits followed by any capital letter - a[0-9]{4}[A-Z]).
To obtain one file with list of sequence IDs in multiple fasta files type:
grep '>' *.fasta > seq_ids.txt
To print all but #-starting lines from the vcf file use option -v
(print
non-matching lines):
grep -v ^# snps.vcf > snps.tab
The ^# pattern means the beginning of a line followed directly by #.
wc
This utility generates a set of statistics on either standard input or a list of text files. It provides these statistics:
- line count (-l)
- word count (-w)
- character count (-m)
- byte count (-c)
- length of the longest line (-L)
When given multiple files, it prints the counts for each file plus a total.
To obtain the number of files in a given directory, type:
ls | wc -l
The | symbol is explained in a further section.
cut
Cut out specific columns (fields/bytes) of a file. By default, fields are separated by TAB. Otherwise, change the delimiter using the -d option. To select specific fields out of a file use the -f option (positions of the selected fields/columns separated by commas). If you need to complement the selected fields (i.e. keep all but the selected fields), use the --complement option.
Out of a large matrix, select all but the first column and row (representing IDs of rows and columns, respectively):
< matrix1.txt tail -n +2 | cut --complement -f 1 > matrix2.txt
sort
This utility sorts a file based on whole lines or selected columns. To sort numerically use the -n option. The range of columns used as the sorting criterion is specified by the -k option.
Extract the list of SNPs with their IDs and coordinates in the genome from a vcf file and sort them based on chromosome and physical position:
< snps.vcf grep -v ^# | cut -f 1-4 | sort -k1,1 -k2,2n > snps.tab
uniq
This utility takes sorted lists and provides unique records and also counts of non-unique records (-c). To have the more numerous records on top of the output, use the -r option of the sort command.
Find out the count of SNPs on each chromosome:
< snps.vcf grep -v ^# | cut -f 1 | sort | uniq -c > chromosomes.tab
tr
Replaces or removes specific sets of characters within files.
To replace the characters a and b in the entire file with characters c and d, respectively, type:
tr 'ab' 'cd' < file1.txt > file2.txt
Multiple consecutive occurrences of a specific character can be replaced by a single character using the -s option. To remove empty lines type:
tr -s '\n' < file1.txt > file2.txt
To replace lower case with upper case in a fasta sequence type:
tr "[:lower:]" "[:upper:]" < file1.txt > file2.txt
Building commands¶
Globbing
Refers to manipulating (searching/listing/etc.) files based on pattern matching using specific characters.
Example:
ls
# a.bed  b.bed  seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
ls *.fasta
# seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
The character * in the previous example matches any number of any characters, so ls lists any file ending with ".fasta".
However, if we look for fastq instead, we get no result:
ls *.fastq
#
The character ? in the following example matches exactly one character (a/b), so ls lists files starting with seq2_, followed by any single character, and ending with ".fasta":
ls
# a.bed  b.bed  seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
ls seq2_?.fasta
# seq2_a.fasta  seq2_b.fasta
ls
# a.bed  b.bed  seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
ls seq2_[ab].fasta
# seq2_a.fasta  seq2_b.fasta
One can specifically list alternative characters (a, b) using brackets []. One may also be more general and list all files having any alphabetical character [a-z] or any numerical character [0-9]:
ls
# a.bed  b.bed  seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
ls seq[0-9]_[a-z].fasta
# seq1_a.fasta  seq1_b.fasta  seq2_a.fasta  seq2_b.fasta
TAB completion
Using the TAB key one can complete unique file names or paths without having to type them fully (try it and see).
From this perspective it is important to think about names for directories in advance, as it can spare you a lot of time in the future. For instance, when processing data with multiple steps one can use numbers at the beginnings of names:
- 00-beginning
- 01-first-processing
- 02-second-processsing
- …
Variables
The Unix environment lets you use shell variables. To set the primer sequence 'GATACGCTACGTGC' to the variable PRIMER1 on the command line and print it on screen using echo, type:
PRIMER1=GATACGCTACGTGC
echo $PRIMER1
# GATACGCTACGTGC
Note
It is a good habit in Unix to use capitalized names for variables: PRIMER1, not primer1.
Producing lists
What do these commands do?
touch file-0{1..9}.txt file-{10..20}.txt
touch 0{1..9}-{a..f}.txt {10..12}-{a..f}.txt
touch 0{1..9}-{jan,feb,mar}.txt {10..12}-{jan,feb,mar}.txt
Exercise:
A program runs 20 simulation runs for three datasets (hm, ss, mm) using three different sets of values: small (sm), medium sized (md) and large (lg). There are three groups of output files, which should go into subdirectories A, B and C. Make a directory for each dataset / parameter set / run / subdirectory combination. Count the number of directories. (One possible solution is sketched below.)
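One way to read the exercise and solve it with brace expansion (a sketch; the naming scheme is just one possible choice):
# 3 datasets x 3 parameter sets, 20 runs each, subdirectories A-C
mkdir -p {hm,ss,mm}-{sm,md,lg}/run{01..20}/{A..C}
# count the created directories
find {hm,ss,mm}-{sm,md,lg} -type d | wc -l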
Producing lists of subdirectories
mkdir -p {2013..2015}/{A..C}
mkdir -p {2013..2015}/0{1..9}/{A..C} {2013..2015}/{10..12}/{A..C}
Pipes
The Unix environment lets you chain commands using the pipe symbol |. The standard output of the first command serves as the standard input of the second one, and so on.
ls | head -n 5
Subshell
A subshell lets you run two commands and capture the output into a single file. It can be helpful in dealing with data file headers. Using a subshell you can remove the header, run a set of operations on the data, and later insert the header back into the file. The basic syntax is:
(command1 file1.txt && command2 file1.txt) > file2.txt
To sort a data file based on two columns without including the header, type:
(head -n 1 file1.txt && tail -n +2 file1.txt | sort -n -k1,1 -k2,2) > file2.txt
A subshell can also be used to preprocess multiple inputs on the fly (saving useless intermediate files):
paste <(< file1.txt tr ' ' '\t') <(< file2.txt tr ' ' '\t') > file3.txt
Advanced text manipulation (sed)¶
sed
“stream editor” allows you to change a file line by line. You can substitute text, you can drop lines, you can transform text… but the syntax can be quite opaque if you're doing anything more than substituting foo with bar in every line (sed 's/foo/bar/g').
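A few common sed idioms as a starting point (illustrative examples only):
# substitute on every line (first match in each line)
sed 's/chr/chromosome_/' data.txt
# drop all lines starting with # (e.g. vcf headers)
sed '/^#/d' snps.vcf
# print only lines 10-20 (-n suppresses the default printing)
sed -n '10,20p' data.txt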
More complex data manipulation (awk)¶
awk lets you manipulate text data in very complex ways. In fact, it is a simple programming language with functionality similar to regular programming languages. As such it enables enormous variability in how you process text data.
It can be used to write a short script that can be chained along with Unix commands in one pipeline. The biggest power of awk is that it's line oriented and saves you a lot of the boilerplate code that you would have to write in other languages if you needed moderately complex processing of text files. The basic structure of the script is divided into three parts, and any of these parts may or may not be included in the script (according to the intention of the user). The first part, 'BEGIN{}', runs before the input file is processed, the middle part, '{}', goes through the input file and runs on each line separately, and the last part, 'END{}', runs after the whole input file has been processed.
The basic syntax:
< data.txt awk 'BEGIN{<before data processing>} {<process each line>} END{<after all lines are processed>}' > output.txt
Built-in variables
awk has several built-in variables which can be used to track and process data without having to program specific features yourself.
The basic four built-in variables:
- FS - input field separator
- OFS - output field separator
- NR - record (line) number
- NF - number of fields in the record (in the line)
There are even more built-in variables that we won't discuss here: RS, ORS, FILENAME, FNR.
Use of built-in variables:
awk splits each line into columns based on white space. When a different delimiter (e.g. TAB) is to be used, it can be specified using -F
option. If you want to keep this custom Field Separator in the output, you have to set the Output Field Separator as well (there’s no command line option for OFS):
< data.txt awk -F $'\t' 'BEGIN{OFS=FS}{print $1,$2}' > output.txt
This command takes the file data.txt, extracts the first two TAB-delimited columns of the input file and prints them TAB-delimited into the output file output.txt. When we look more closely at the syntax, we see that the TAB delimiter was set using the -F option. This option corresponds to the FS built-in variable. As we want TAB-delimited columns in the output file, we pass FS to OFS (i.e. the output field separator) in the BEGIN section. Further, in the middle section we print out the first two columns, which are referenced by number with the $ symbol ($1, $2). The numbers correspond to the positions of the columns in the input file. We could, of course, use the cut command for this operation, which is even simpler. However, awk lets us do any other operation on the data.
Note
The complete input line is stored in $0.
The NR built-in variable can be used to capture every second line of a file:
< data.txt awk '{ if(NR % 2 == 0){ print $0 }}' > output.txt
The % symbol represents the modulo operator, which returns the remainder of a division. The if() condition is used to decide whether the modulo is 0 or not.
Here is a bit more complex example of how to use awk. We write a command which retrieves the coordinates of introns from the coordinates of exons.
GeneID           Chromosome  Exon_Start  Exon_End
ENSG00000139618  chr13       32315474    32315667
ENSG00000139618  chr13       32316422    32316527
ENSG00000139618  chr13       32319077    32319325
ENSG00000139618  chr13       32325076    32325184
...              ...         ...         ...
The command is going to be as follows:
When we look at the command step by step, we first remove the header and sort the data based on the GeneID and Exon_Start columns:
< exons.txt tail -n +2 | sort -k1,1 -k3,3n | ...
Further, we write a short awk script to obtain the coordinates of the introns:
... | awk -F $'\t' 'BEGIN{OFS=FS}{ if(NR==1){ x=$1; end1=$4+1; }else{ if(x==$1) { print $1,$2,end1,$3-1; end1=$4+1; }else{ x=$1; end1=$4+1; } } }' > introns.txt
In the BEGIN{} part we set TAB as the output field separator. Then, using the NR==1 test, we store the GeneID of the first line in the x variable and the intron start in the end1 variable; otherwise we do nothing. For the other records (NR > 1), the x==$1 condition tests whether we are still within the same gene. If so, we print the exon end from the previous line (end1) as the intron start and use the exon start of the current line as the intron end. Next, we store the new intron start (i.e. the exon end of the current line) in end1. If we have already moved to a new gene (x != $1), we repeat the procedure from the first line and print nothing, waiting for the next line.
Joining multiple files + subshell¶
Use the paste and join commands.
Note
Shell substitution is a nice way to pass a pipeline in a place where a file is expected, be it an input or an output file (just use the appropriate sign). Multiple pipelines can be used in a single command:
cat <( cut -f 1 file.txt | sort -n ) <( cut -f 1 file2.txt | sort -n ) | less
Use nightingale FASTQ file
- Join all nightingale FASTQ files and create a TAB separated file with one line per read
# repeating input in paste causes it to take more lines from the same source
cat *.fastq | paste - - - - | cut -f 1-3 | less
Make a TAB-separated file having four columns:
- chromosome name
- number of variants in total for given chromosome
- number of variants which pass
- number of variants which fails
# Command 1
< data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1 |
sort | uniq -c | sed -r 's/^ +//' | tr " " "\t" > data/count_vars_chrom.txt
# Command 2
< data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1,7 | sort -r |
uniq -c | sed -r 's/^ +//' | tr " " "\t" | paste - - |
cut --complement -f 2,3,6 > data/count_vars_pass_fail.txt
# Command 3
join -1 2 -2 3 data/count_vars_chrom.txt data/count_vars_pass_fail.txt | wc -l
# How many lines did you retrieve?
# You have to sort the data before sending to ``join`` - subshell
join -1 2 -2 3 <( sort -k2,2 data/count_vars_chrom.txt ) \
<( sort -k3,3 data/count_vars_pass_fail.txt ) | tr " " "\t" > data/count_all.txt
All three commands together using subshell:
# and indented a bit more nicely
IN=data/luscinia_vars_flags.vcf
join -1 2 -2 3 \
<( <$IN grep -v '^#' |
cut -f 1 |
sort |
uniq -c |
sed -r 's/^ +//' |
tr " " "\t" |
sort -k2,2 ) \
<( <$IN grep -v '^#' |
cut -f 1,7 |
sort -r |
uniq -c |
sed -r 's/^ +//' |
tr " " "\t" |
paste - - |
cut --complement -f 2,3,6 |
sort -k3,3 ) |
tr " " "\t" \
> data/count_all.txt
Helpful commands (dir content and its size, disc usage)¶
ls -shaR # list all contents of directory (including subdirectories)
du -sh # disc usage (by directory)
df -h # disc free space
ls | wc -l # what does this command do?
locate # find a file/program by name
Important NGS formats¶
A selection of the most commonly used formats in NGS data processing pipelines.
High throughput specs hub: https://samtools.github.io/hts-specs/

FASTQ - Short reads¶
Sequencing instruments produce not only base calls, but usually can assign some quality score to each called base. Fastq contains multiple sequences, and each sequence is paired with quality scores for each base. The quality scores are encoded in text form.
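Each read takes four lines - a header starting with @, the called bases, a separator line starting with +, and a quality string of the same length as the sequence. An illustrative record (made-up data, just to show the layout):
@read_001 optional description
ACGTTGCAATGCCGTA
+
IIIIHHGGFFEEDDCC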
SAM - Reads mapped to reference¶
SAM stands for Sequence Alignment/Mapping format. It includes the parts of the original reads that were mapped to a reference genome, together with the positions where they belong. There is an efficient binary-encoded counterpart called BAM, and an even more compressed CRAM.
BED and GFF - Annotations¶
Annotations are regions in given reference genome with some optional additional information. BED is very simple and thus easy to work with for small tasks, GFF (General Feature Format) is a comprehensive format allowing feature nesting, arbitrary data fields for each feature and so on.
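A minimal BED example - three required columns (chromosome, start, end; coordinates are 0-based and end-exclusive), optionally followed by a name, score and strand (made-up values for illustration):
chr1    11873   12227   feature1    0   +
chrZ    5000    5600    feature2    0   -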
VCF - Variants in individuals¶
VCF stands for Variant Call Format. Given a reference and a set of sequenced individuals, VCF is a format to store the differences in these individuals, compared to the reference, efficiently. There is also a binary counterpart BCF.
Additional exercises¶
These are tasks that do not fit any particular session, but we still consider them interesting enough to share them with you.
FizzBuzz¶
The most classical of all programmer interview tasks. Print the numbers from 1 to 100, but replace numbers divisible by 3 with Fizz, numbers divisible by 5 with Buzz and numbers divisible by both 3 and 5 with FizzBuzz.
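One possible bash solution (a sketch - there are many ways to do it):
for i in {1..100}; do
  out=""
  (( i % 3 == 0 )) && out="Fizz"
  (( i % 5 == 0 )) && out="${out}Buzz"
  echo "${out:-$i}"
done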
Counting heads¶
This is a nice example where bash can be used to solve a combinatorial problem by enumerating all the possibilities. And it is a long pipeline, so you have a lot of code to examine;)
Eight people are sitting around a circular table. Each has a coin. They all flip their coins. What is the probability that no two adjacent people will get heads?
The basic idea is that there are not many possibilities (only 2 to the power of 8, that is 256). We can just enumerate all the combinations and check if there are two adjacent heads.
This is the final solution, take your time to take it apart to see what each piece does.
(echo "obase=2;"; printf "%d\n" {0..255}) | bc | # generate numbers 0-255 in binary
sed 's/^/0000000/' | egrep -o '.{8}$' | # pad the output to 8 characters
sed 'h;G;s/\n//' | # double the pattern on each line to simulate ring
grep -v 11 | # ignore all patterns where two heads (1) are next to each other
wc -l # count the rest
To get a more understandable code, we can split it to functional parts. Then we can just play and try different implementations of the parts:
generate () { (echo "obase=2;"; printf "%d\n" {0..255}) | bc ;}
pad () { sed 's/^/0000000/' | egrep -o '.{8}$' ;}
ring () { sed p | paste - - | tr -d "\t" ;}
generate | pad | ring | grep -v 11 | wc -l
These are alternative solutions - you can paste them one by one, and check if the pipe is still working.
generate () { (echo "obase=2;"; echo {0..255} | tr " " "\n") | bc ;}
generate () { (echo "obase=2;"; echo {0..255} | xargs -n1) | bc ;}
generate () { (echo "obase=2;"; printf "%d\n" {0..255}) | bc ;}
pad () { awk '{print substr(sprintf("0000000%s", $0), length);}' ;}
pad () { sed 's/^/0000000/' | rev | cut -c-8 | rev ;}
pad () { sed 's/^/0000000/' | egrep -o '.{8}$' ;}
ring () { sed p | paste - - | tr -d "\t" ;}
ring () { sed 'h;G;s/\n//' ;}
generate | pad | ring | grep -v 11 | wc -l
The question was asking for the probability, so that's one more division:
echo "scale=3;$( generate | pad | ring | grep -v 11 | wc -l ) / 256" | bc
Solutions by participants¶
One way to get a shorter (but much slower) solution is to ignore the binary conversion altogether: just use a huge list of decimal numbers and filter out anything that does not look like binary. A few variants follow:
seq -w 0 11111111 | grep ^[01]*$ | awk '!/11/ && !/^1.*1$/' | wc -l
seq -w 0 11111111 | grep ^[01]*$ | grep -v -e 11 -e ^1.*1$ | wc -l
seq -w 0 11111111 | awk '/^[01]*$/ && !/11/ && !/^1.*1$/' | wc -l
seq -w 0 11111111 | awk '!/[2-9]/ && !/11/ && !/^1.*1$/' | wc -l
seq -w 0 11111111 | grep -v -e [2-9] -e 11 -e ^1.*1$ | wc -l
I believe there are still a few ways to make it shorter;)
Dimensionality reduction¶
Methods like PCA and MDS (sometimes called PCoA, to increase the confusion) are usually regarded as a black box by many. Here we try to present a simple example that should help you get a better idea of what these magic boxes are actually doing.
Load and visualize your data set¶
Now you can go to R and load the data:
setwd('~/projects/banana')
d <- read.csv("webapp/data/rotated.csv")
Plot the data to look what we’ve got:
library(ggplot2)
ggplot(d, aes(x, y)) + geom_point() + coord_equal()
Correct the distortion¶
Maybe you can already recognize what’s in your data. But it appears to be a bit .. rotated. Here is a code for 3d rotation of points, copy, paste and run it in your R session:
# create a 3d rotation matrix
# https://www.math.duke.edu/education/ccp/materials/linalg/rotation/rotm3.html
rotX <- function(t) matrix(c(cos(t), sin(t), 0, -sin(t), cos(t), 0, 0, 0, 1), nrow=3)
rotY <- function(t) matrix(c(1, 0, 0, 0, cos(t), sin(t), 0, -sin(t), cos(t)), nrow=3)
rotZ <- function(t) matrix(c(cos(t), 0, -sin(t), 0, 1, 0, sin(t), 0, cos(t)), nrow=3)
rot3d <- function(tx, ty, tz) rotX(tx) %*% rotY(ty) %*% rotZ(tz)
# rotate a data frame with points in rows
rot3d_df <- function(df, tx, ty, tz) {
rmx <- rot3d(tx, ty, tz)
res <- data.frame(t(apply(df, 1, function(x) rmx %*% as.numeric(x))))
colnames(res) <- colnames(df)
res
}
Now try to rotate the object a bit, so we can see it better. Try to find good values for the rotation yourself (numbers are in radians, 0..2*PI makes sense):
dr <- rot3d_df(d, .9, .1, 2)
ggplot(dr, aes(x, y)) + geom_point() + coord_equal()
Enter PCA. It actually finds the best rotation for you - in such a way that the first axis has the most variability (the longest side of the object), the second axis has the maximum of the remaining variability, etc.
pc <- prcomp(as.matrix(dr))
ggplot(data.frame(pc$x), aes(PC1, PC2)) + geom_point() + coord_equal()
ggplot(data.frame(pc$x), aes(PC1, PC3)) + geom_point() + coord_equal()
ggplot(data.frame(pc$x), aes(PC2, PC3)) + geom_point() + coord_equal()
MDS¶
Metric MDS (multidimensional scaling) with euclidean distance is equivalent to PCA. We will use the non-metric variant here, which tries to keep only the order of pairwise distances, not the distances themselves. You prefer MDS when you want to use a distance other than euclidean - we're using the manhattan (taxicab) distance here:
library(MASS)
dmx <- dist(dr, "manhattan")
mds <- isoMDS(dmx)
ggplot(data.frame(mds$points), aes(X1, X2)) + geom_point() + coord_equal()
Shiny¶
And now there is something you definitely wanted, while you were trying to find the good values for rotation of your object:
setwd('webapp')
Now File > Open
, and open server.R
. There should be a green Run App
button at the top right of the editor window. Click that button!
Supplemental information:
Course materials preparation¶
This section contains the steps that we did to produce the materials that course participants got ready-made. That is the linux machine image, online documentation and the slide deck.
Online documentation¶
Login to https://github.com. Create a new project called ngs-course, with a default readme file.
Clone the project to local machine and initialize sphinx docs. Choose SSH
clone link in GitHub.
git clone git@github.com:libor-m/ngs-course.git
cd ngs-course
# use default answers to all the questions
# enter project name and version 1.0
sphinx-quickstart
Now track all files created by sphinx-quickstart in current directory with git and publish to GitHub.
git add .
git commit -m 'empty sphinx project'
# ignore _build directory in git
echo _build >> .gitignore
git add .gitignore
git commit -m 'ignore _build directory'
# publish the first docs
# setting up argument less git pull with '-u'
git push -u origin master
To get live view of the documents, login to https://readthedocs.org. Your GitHub account can be paired with
Read the Docs account in Edit Profile/Social Accounts, then you can simply ‘import’ new projects
from your GitHub with one click. Import the new project and wait for it to build. After the build
the docs can be found at http://ngs-course.readthedocs.org (or click the View
button).
Now write the docs, commit and push. Rinse and repeat. Try to keep the commits small, just one change a time.
git add _whatever_new_files_
git commit -m '_your meaningful description of what you did here_'
git push
References that may come in handy:
Adding another instance¶
Check out the version which will serve as starting material, create and publish new branch.
git pull
git checkout praha-january-2016
git checkout -b praha-january-2017
git push -u origin praha-january-2017:praha-january-2017
Log in to Read the Docs, go to Admin > Versions, make the new version ‘Active’, set as the default version in Admin > Advanced.
If the version (branch) is not visible yet, do a force build of some previous version to get a fresh checkout.
Check if webhooks are set up both in ReadTheDocs > Project > Admin > Integrations and in GitHub > Settings > Webhooks.
Cloud image¶
Create new machine¶
We expect ~16 participants. To make things simple we’ll host them all on a single instance.
Follow the Meta Cloud quick start. Briefly:
add ssh keys
add SSH and ICMP security rules (more rules later)
Compute > Instance > Launch instance, fill this in the wizard dialog
- Debian (64 bit)
- flavor hpc.16core-32ram
- 32 GB RAM - little less than 2 GB per user
- 16 vCPUs - keep 2 of the allowed 18 for the testing instance
- 160 GB HDD as system drive (need space for basic system, gcc, rstudio and produced data * N participants)
more rules in the security group:
- HTTP to set up the Let's Encrypt cert
- 443 for secured RStudio
- 60k-61k for mosh
- 5690 for rstudio + shiny
Debian config¶
SSH to the machine - read the IP in the OpenStack interface and log in with the debian user name.
ssh debian@${INSTANCE_IP}
# start as super user
sudo su
# Prague time zone
dpkg-reconfigure tzdata
# find fastest mirror
apt install netselect-apt
# patch it in sources.list
vi /etc/apt/sources.list
# upgrade all
apt update
apt upgrade
# keep the sources list over reboot
# +apt_preserve_sources_list: true
vi /etc/cloud/cloud.cfg
# install the basic tools for more configuration work
apt install vim screen mosh git
# log in as debian
su debian
# create an ssh key
ssh-keygen -t ed25519
# checkout dotfiles
git clone git@github.com:libor-m/dotfiles.git
# link vim config
ln -s dotfiles/vim/.vimrc .
# back to root shell
exit
# link vim config for root
cd
ln -s ~debian/dotfiles/vim/.vimrc .
Now it should be easy to work as debian user, with vim configured even for sudo.
Tiny fixes to make work as debian pleasurable.
# colorize prompt - uncomment force_color_prompt=yes
# add ll alias - uncomment alias ll='ls -l'
# export MANWIDTH=120
vi ~/.bashrc
. ~/.bashrc
Set up the user skeleton, so that newly created users will be set up as needed. A fancy login message will surely help;)
sudo su
# colorize prompt - uncomment force_color_prompt=yes
# add ll alias - uncomment alias ll='ls -l'
# fast sort and uniq
# export LC_ALL=C
# maximal width of man
# export MANWIDTH=120
# # wget impersonating normal browser
# # good for being tracked with goo.gl for example
# alias wgets='H="--header"; wget $H="Accept-Language: en-us,en;q=0.5" $H="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" $H="Connection: keep-alive" -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" --referer=/ '
vi /etc/skel/.bashrc
# some screen settings
cat > /etc/skel/.screenrc << 'EOF'
hardstatus alwayslastline
hardstatus string '%{= kG}[%{G}%H%? %1`%?%{g}][%= %{= kw}%-w%{+b yk} %n*%t%?(%u)%? %{-}%+w %=%{g}][%{B}%d.%m. %{W}%c%{g}]'
defscrollback 20000
startup_message off
EOF
# basic RStudio ide config
# obtained by configuring one instance for liborm and then copying the
# resulting file
mkdir -p /etc/skel/.config/rstudio
cat > /etc/skel/.config/rstudio/rstudio-prefs.json <<'EOF'
{
"save_workspace": "never",
"font_size_points": 11,
"editor_theme": "Solarized Dark",
"panes": {
"quadrants": [
"TabSet1",
"TabSet2",
"Source",
"Console"
],
"tabSet1": [
"Environment",
"History",
"Files",
"Connections",
"Build",
"VCS",
"Tutorial",
"Presentation"
],
"tabSet2": [
"Plots",
"Packages",
"Help",
"Viewer"
],
"console_left_on_top": false,
"console_right_on_top": false
},
"posix_terminal_shell": "bash"
}
EOF
# MOTD
cat > /etc/motd <<"EOF"
_ __ __ _ ___ ___ ___ _ _ _ __ ___ ___
| '_ \ / _` / __|_____ / __/ _ \| | | | '__/ __|/ _ \
| | | | (_| \__ \_____| (_| (_) | |_| | | \__ \ __/
|_| |_|\__, |___/ \___\___/ \__,_|_| |___/\___|
|___/
EOF
exit
Install some basic software
sudo apt install pv curl wget jq locate
# build tools
sudo apt install build-essential pkg-config autoconf
# add important stuff to python
sudo apt install python-dev python-pip python-virtualenv
# java because of fastqc
# sudo apt install openjdk-8-jre-headless
# let's try default jre
sudo apt install default-jre-headless
Set up a dynamic DNS to get some nice login name.
cd
ln -s dotfiles/duckdns
cat duckdns/duck.cron
# add the printed line to crontab
crontab -e
This is what it takes to create a basic usable system. We can shut it down now with sudo shutdown -h now and take a snapshot of the machine. If any installation goes haywire from now on, it's easy to revert to this basic system.
Install R and RStudio¶
R is best used in RStudio - the server version can be used in a web browser.
mkdir ~/sw
cd ~/sw
# install latest R
# https://cran.r-project.org/bin/linux/debian/
sudo bash -c "echo 'deb http://cloud.r-project.org/bin/linux/debian buster-cran40/' > /etc/apt/sources.list.d/cran.list"
sudo apt install dirmngr
sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
sudo apt update
sudo apt install r-base
sudo apt install libxml2-dev libcurl4-openssl-dev libssl-dev
sudo R
> update.packages(.libPaths(), checkBuilt=TRUE, ask=F)
> install.packages(c("tidyverse", "shiny", "reshape2", "vegan"))
> quit(save="no")
# RStudio with prerequisities
sudo apt install gdebi-core
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.3.1093-amd64.deb
sudo gdebi rstudio-server-*.deb
# and fix upstart config
# https://support.rstudio.com/hc/en-us/community/posts/200780986-Errors-during-startup-asio-netdb-error-1-Host-not-found-authoritative-
# remove 2 from [2345]
sudo nano /usr/lib/rstudio-server/extras/upstart/rstudio-server.conf
# install nginx as a front end
# snapd is needed for certbot ;(
sudo apt install nginx snapd
# test if http is accessible from local browser
# simple nginx proxy config for rstudio
sudo su
cat > /etc/nginx/sites-enabled/ngs-course.duckdns.org <<'EOF'
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
server {
location / {
proxy_pass http://localhost:8787;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_read_timeout 20d;
}
server_name ngs-course.duckdns.org;
listen 80;
}
EOF
# remove the default site
rm /etc/nginx/sites-enabled/default
# test and reload
nginx -t
nginx -s reload
# test if RStudio login page is visible at http
# .. we'll use the non-sudo account to access rstudio later
# secure with certbot
# (snap paths are somehow broken..and restarting the whole system is soo windows98)
/snap/bin/certbot --nginx
Install additional software¶
There are packages that are not in the standard repos, or the versions in the repos are very obsolete. It's worth installing such packages by hand when there are not many dependencies.
mkdir -p ~/sw
# install a tar with the most common method
inst-tar() {
cd ~/sw
wget -O - "$1" | tar xj
# extract possible dir name from the tar path
cd $( echo "$1" | egrep -o '/[^-/]+-' | sed 's/^.//;s/$/*/' )
./configure
make && sudo make install
}
# pipe viewer
inst-tar http://www.ivarch.com/programs/sources/pv-1.6.6.tar.bz2
# parallel
inst-tar http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
# tabtk
cd ~/sw
git clone https://github.com/lh3/tabtk.git
cd tabtk/
# no configure in the directory
make
# no installation procedure defined in makefile
# just copy the executable to a suitable location
sudo cp tabtk /usr/local/bin
# fastqc
cd ~/sw
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_*.zip
rm fastqc_*.zip
chmod +x FastQC/fastqc
# vcftools
cd ~/sw
wget -O - https://github.com/vcftools/vcftools/tarball/master | tar xz
cd vcftools*
./autogen.sh
./configure
make && sudo make install
# samtools
inst-tar https://github.com/samtools/samtools/releases/download/1.11/samtools-1.11.tar.bz2
# bcftools
inst-tar https://github.com/samtools/bcftools/releases/download/1.11/bcftools-1.11.tar.bz2
# htslib (tabix)
inst-tar https://github.com/samtools/htslib/releases/download/1.11/htslib-1.11.tar.bz2
# bwa
cd ~/sw
wget -O - https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2 | tar xj
cd bwa*
make
sudo cp bwa /usr/local/bin
# copy the man
sudo bash -c "<bwa.1 gzip > /usr/share/man/man1/bwa.1.gz"
# velvet
cd ~/sw
wget -O - https://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz | tar xz
cd velvet*
make
sudo cp velveth velvetg /usr/local/bin
# bedtools
cd ~/sw
wget -O - https://github.com/arq5x/bedtools2/releases/download/v2.29.2/bedtools-2.29.2.tar.gz | tar xz
cd bedtools2/
make && sudo make install
# clean up
rm -rf bcftools-*/ bedtools2/ bwa-*/ htslib-*/ parallel-*/ pv-*/ samtools-*/ tabtk/ vcftools-vcftools-*/
TODO - future proofing of the installs with getting the latest - but release - quality code with something like this (does not work with tags yet):
gh-get-release() { echo $1 | cut -d/ -f4,5 | xargs -I{} curl -s https://api.github.com/repos/{}/releases/latest | jq -r .tarball_url | xargs -I{} curl -Ls {} | tar xz ;}
Check what are the largest packages:
dpkg-query -Wf '${Installed-Size}\t${Package}\n' | sort -n
Create the user accounts¶
For a multi-user machine, we need the low-privileged accounts and at least a quota to prevent DoS by overfilling the disk.
Name the accounts user01 to user22:
sudo su
cd
# aptitude search '?provides(wordlist)'
apt install wamerican
# generate some funny passwords
</usr/share/dict/words egrep "^[a-z]{5,8}$" |
sort -R |
paste -d' ' - - - |
head -22 |
nl -w2 -n'rz' |
sed 's/^/user/' \
> users.tsv
# use `adduser` as debian alternative
# --gecos '' --disabled-password to get unattended run
adduser --gecos '' --disabled-password liborm
adduser --gecos '' --disabled-password janouse1
usermod -a -G sudo liborm
usermod -a -G sudo janouse1
# normal users
<users.tsv cut -f1 | xargs -n1 adduser --gecos '' --disabled-password
# use chpasswd to update the passwords
<users.tsv tr "\t" ":" | chpasswd
# add quotas
# https://www.digitalocean.com/community/tutorials/how-to-set-filesystem-quotas-on-debian-10
apt install quota
# add ,usrquota to / mount
vi /etc/fstab
mount -o remount /
quotacheck -ugm /
quotaon -v /
<users.tsv cut -f1 | xargs -I{} setquota -u {} 8G 10G 0 0 /
# copy-paste users.tsv to shared google sheet
# delete on disk
rm users.tsv
Sample datasets¶
Use data from my nightingale project, subset the data for two selected chromosomes.
# see read counts for chromosomes
samtools view 41-map-smalt/alldup.bam | mawk '{cnt[$3]++;} END{for(c in cnt) print c, cnt[c];}' | sort --key=2rn,2
# extract readnames that mapped to chromosome 1 or chromosome Z
mkdir -p kurz/00-reads
samtools view 41-map-smalt/alldup.bam | mawk '($3 == "chr1" || $3 == "chrZ"){print $1;}' | sort > kurz/readnames
parallel "fgrep -A 3 -f kurz/readnames {} | grep -v '^--$' > kurz/00-reads/{/}" ::: 10-mid-split/*.fastq
# reduce the genome as well
# http://edwards.sdsu.edu/labsite/index.php/robert/381-perl-one-liner-to-extract-sequences-by-their-identifer-from-a-fasta-file
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(chr1 chrZ)}print if $c' 51-liftover-all/lp2.fasta > kurz/20-genome/luscinia_small.fasta
# subset the vcf file with grep
# [the command got lost;]
Transfer the data to user directory (root cannot log in remotely):
# on host machine
cd somewhere.../data-pack
VM=ngs-course.duckdns.org
scp -r data-shared "debian@${VM}:~"
scp -r home/user/projects "debian@${VM}:~"
On the remote machine:
# make the shared data 'shared'
sudo mv ~/data-shared /
# change permissions back to 'read only' for user
sudo chown -R root:root /data-shared
Cleanup¶
# update the file database
sudo updatedb
# remove history not to confuse users
sudo su
history -cw
# ctrl-d
history -cw
Slide deck¶
Libor’s slide deck was created using Adobe InDesign (you can get the CS2 version almost legally for free). Vasek’s slide deck was created with Microsoft Powerpoint. Images are shamelessly taken from the internet, with the ‘fair use for teaching’ policy ;)
Slide decks¶
Unix - Introduction
(Libor)
Basics of Unix
(Vasek)
Genomic data
(Vasek)
Plain text file processing
(Vasek)
Graphics session
(Libor)
Bioinformatic pipelines
(Vasek)