Reference manual for Unix introduction

Basic orientation in Unix

Multiple windows (screen)

You’re all used to working with multiple windows (in MS Windows;). You can have them in Unix as well. The main benefit, however, is that you can log off and your programs keep running.

To go into a screen mode type:

screen

Once in screen, you control screen itself by pressing the master key followed by a command key: ctrl+a <key>. To create a new window within screen, press ctrl+a c (create). To flip among your windows, press ctrl+a space (you flip windows often, so it’s the biggest key available). To detach screen (i.e. keep your programs running while you go home), press ctrl+a d (detach).

To open a detached screen type:

screen -r  # -r means reattach

To list running screens, type:

screen -ls

Controlling processes (htop/top)

htop or top serve to show the actual resource utilization of each running process. htop is a much nicer variant of the standard top. You can sort the processes by memory usage, CPU usage and a few other criteria.
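
For example (a quick sketch; key bindings as in the standard versions of these tools):

htop   # interactive: press F6 to choose the sort column, q to quit
top    # press M to sort by memory, P to sort by CPU, q to quit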

Getting help (man)

Any time you’re not sure about a program option while building a command line, just flip to the next screen window (you’re always using screen for serious work) and type man followed by the name of the command you want to know more about:

man screen

Moving around & manipulating files and directories

Basic commands to move around and manipulate files/directories.

pwd    # prints current directory path
cd     # changes current directory path
ls     # lists current directory contents
ll     # lists detailed contents of current directory (a common alias for ls -l)
mkdir  # creates a directory
rm     # removes a file
rm -r  # removes a directory
cp     # copies a file/directory
mv     # moves a file/directory
locate # tries to find a file by name

Usage:

cd

To change into a specific subdirectory and make it our current working directory:

cd go/into/specific/subdirectory

To change to parent directory:

cd ..

To change to home directory:

cd

To go up one level to the parent directory and then down into directory2:

cd ../directory2

To go up two levels:

cd ../../

ls

To list also the hidden files and directories (-a) in the given folder, along with the size of each file (-s) in human readable format (-h), type:

ls -ash

mv

To move the file data.fastq from the current working directory to the directory /home/directory/fastq_files, type:

mv data.fastq /home/directory/fastq_files/data.fastq
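
When the target is an existing directory, the moved file keeps its name, so the trailing file name can be dropped:

mv data.fastq /home/directory/fastq_files/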

cp

To copy the file data.fastq from the current working directory to the directory /home/directory/fastq_files, type:

cp data.fastq /home/directory/fastq_files/data.fastq

locate

This quickly finds a file by a part of its name or path. To locate a file named data.fastq, type:

locate data.fastq

The locate command uses a database of paths which is updated automatically only once a day, so when you look for recently created files you may not find them. You can request the update manually:

sudo updatedb

Symbolic links

Symbolic links point to files or directories in a different location. They are useful when one wants to work with files accessible to several users while keeping them in a convenient location at the same time. They are also useful when one works with the same big data in multiple projects: instead of copying the data into each project directory, one can simply use symbolic links.

A symbolic link is created by:

ln -s /data/genomes/luscinia/genome.fa genome/genome.fasta
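
A symbolic link is easy to recognize in a detailed listing; ls -l shows an arrow pointing to the target (output abbreviated):

ls -l genome/genome.fasta
# lrwxrwxrwx ... genome/genome.fasta -> /data/genomes/luscinia/genome.fa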

Exploring and basic manipulation with data

less

A program to view the contents of text files. As it loads only the part of the file that fits the screen (i.e. it does not have to read the entire file before starting), it loads quickly even for large files.

To view a text file with line wrapping disabled (-S) and line numbers added (-N), type:

less -SN data.fasta

To navigate within the text file while viewing use:

Key          Command
Space bar    Next page
b            Previous page
Enter key    Next line
/<string>    Look for string
<n>G         Go to line <n>
G            Go to end of file
h            Help
q            Quit

cat

A utility which outputs the contents of a specific file and can be used to concatenate and list files. In Czech it is sometimes translated as ‘kočka’ (cat) and turned into a verb: ‘vykočkovat’;)

cat seq1_a.fasta seq1_b.fasta > seq1.fasta

head

By default, this utility prints the first 10 lines. The number of lines can be specified with the -n option (or directly as -<number>).

To print first 50 lines type:

head -n 50 data.txt

# the same as: head -50 data.txt

# special syntax: print all but the last 50 lines
head -n -50 data.txt

tail

By default, this utility prints the last 10 lines. The number of lines can be specified with the -n option, as with head.

To print last 20 lines type:

tail -n 20 data.txt

To skip the first line of the file (e.g. to remove its header line):

tail -n +2 data.txt

grep

This utility searches text files for lines matching a pattern and prints the matching lines. The pattern is either a fixed string or a regular expression. Regular expressions allow much more generic patterns than fixed strings (e.g. to search for 'a' followed by 4 digits followed by any capital letter: a[0-9]{4}[A-Z]).
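
To try that regular expression with grep, enable extended regular expressions with the -E option (annotations.txt here is just a hypothetical input file):

grep -E 'a[0-9]{4}[A-Z]' annotations.txt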

To obtain one file with the list of sequence IDs from multiple fasta files, type:

grep '>' *.fasta > seq_ids.txt

To print all lines except those starting with # from the vcf file, use option -v (print non-matching lines):

grep -v ^# snps.vcf > snps.tab

The ^# pattern matches a # directly at the beginning of a line.

wc

This utility generates a set of statistics on standard input or on a list of text files. It provides these statistics:

  • line count (-l)
  • word count (-w)
  • character count (-m)
  • byte count (-c)
  • length of the longest line (-L)

If a specific option is provided, only the corresponding count is printed.

To obtain the number of files in a given directory, type:

ls | wc -l

The | symbol is explained in a later section.

cut

Cuts out specific columns (fields/bytes) from a file. By default, fields are separated by TAB; a different delimiter can be set with the -d option. To select specific fields, use the -f option (positions of the selected fields/columns separated by commas). To complement the selection (i.e. keep all but the selected fields), use the --complement option.

From a large matrix, select everything but the first row and the first column (which contain the column IDs and the row IDs, respectively):

< matrix1.txt tail -n +2 | cut --complement -f 1 > matrix2.txt
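
The -d option mentioned above works like this (a sketch; table.csv is a hypothetical comma separated file):

cut -d ',' -f 1,3 table.csv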

sort

This utility sorts a file based on whole lines or on selected columns. To sort numerically, use the -n option. The range of columns used as the sorting key is specified with the -k option.

Extract the list of SNPs with their IDs and coordinates in the genome from a vcf file and sort them by chromosome and physical position:

< snps.vcf grep -v '^#' | cut -f 1-4 | sort -k1,1 -k2,2n > snps.tab

uniq

This utility takes a sorted list and outputs the unique records, optionally with the count of occurrences of each record (-c). To put the most numerous records at the top of the output, pipe the result through sort with the -rn options (see the sketch below).

Find out count of SNPs on each chromosome:

< snps.vcf grep -v '^#' | cut -f 1 | sort | uniq -c > chromosomes.tab
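
To put the most numerous chromosomes at the top, as mentioned above, extend the pipeline with sort -rn (a sketch reusing the same snps.vcf):

< snps.vcf grep -v '^#' | cut -f 1 | sort | uniq -c | sort -rn > chromosomes_sorted.tab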

tr

Replaces or removes specific sets of characters within files.

To replace the characters a and b throughout the file with the characters c and d, respectively, type:

tr 'ab' 'cd' < file1.txt > file2.txt

Multiple consecutive occurrences of a specific character can be squeezed into a single character using the -s option. To remove empty lines, type:

tr -s '\n' < file1.txt > file2.txt

To convert lower case to upper case in a fasta sequence, type:

tr "[:lower:]" "[:upper:]" < file1.txt > file2.txt

Building commands

Globbing

Globbing refers to manipulating (searching/listing/etc.) files based on pattern matching with special characters.

Example:

ls
# a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta
ls *.fasta
# seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta

The character * in the previous example matches any number of any characters; it tells the ls command to list any file ending in “.fasta”.

However, if we look for fastq instead, we get no result:

ls *.fastq
#

The character ? in the following example matches exactly one character (a/b); it tells ls to list files starting with seq2_, followed by any single character, and ending in “.fasta”:

ls
# a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta
ls seq2_?.fasta
# seq2_a.fasta seq2_b.fasta
ls
# a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta
ls seq2_[ab].fasta
# seq2_a.fasta seq2_b.fasta

One can explicitly list the alternative characters (a, b) using brackets []. One may also be more general and list all files having any lower case letter [a-z] or any digit [0-9] in a given position:

ls
# a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta
ls seq[0-9]_[a-z].fasta
# seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta

TAB completion

Using the TAB key, one can complete unique file names or paths without having to type them in full (try it and see).

From this perspective it is important to think about names for directories in advance, as it can spare you a lot of time in the future. For instance, when processing data in multiple steps one can number the beginnings of the names:

  • 00-beginning
  • 01-first-processing
  • 02-second-processing

Variables

The Unix shell lets you use variables. To assign the primer sequence GATACGCTACGTGC to the variable PRIMER1 on the command line and print it to the screen using echo, type:

PRIMER1=GATACGCTACGTGC
echo $PRIMER1
# GATACGCTACGTGC
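
A variable is typically used inside a longer command; for instance (data.fasta is a hypothetical input file):

grep $PRIMER1 data.fasta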

Note

It is a good habit in Unix to use capitalized names for variables: PRIMER1, not primer1.

Producing lists

What do these commands do?

touch file-0{1..9}.txt file-{10..20}.txt
touch 0{1..9}-{a..f}.txt {10..12}-{a..f}.txt
touch 0{1..9}-{jan,feb,mar}.txt {10..12}-{jan,feb,mar}.txt
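
If unsure, a brace expansion can be previewed safely with echo before creating any files:

echo 0{1..9}-{jan,feb,mar}.txt
# 01-jan.txt 01-feb.txt 01-mar.txt 02-jan.txt ... 09-mar.txt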

Exercise:

A program performs 20 runs of simulations for three datasets (hm, ss, mm) using three different sets of parameter values: small (sm), medium (md) and large (lg). There are three groups of output files, which should go into subdirectories A, B and C. Make a directory for each combination of dataset, parameter set, run and subdirectory, then count the number of directories (one possible solution is sketched below).
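
One possible solution (a sketch only; the naming scheme of the directories is our assumption, not given by the exercise):

mkdir -p {hm,ss,mm}-{sm,md,lg}/run-0{1..9}/{A,B,C} {hm,ss,mm}-{sm,md,lg}/run-{10..20}/{A,B,C}
find . -mindepth 1 -type d | wc -l   # count all created directories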

Producing lists of subdirectories

mkdir -p {2013..2015}/{A..C}
mkdir -p {2013..2015}/0{1..9}/{A..C} {2013..2015}/{10..12}/{A..C}

Pipes

The Unix environment lets you chain commands using the pipe symbol |. The standard output of the first command serves as the standard input of the second one, and so on:

ls | head -n 5

Subshell

A subshell lets you run two commands and capture their combined output in a single file. This is helpful when dealing with data file headers: using a subshell, you can pass the header through unchanged, run a set of operations on the rest of the data, and write both back to a single file. The basic syntax is:

(command1 file1.txt && command2 file1.txt) > file2.txt

To sort a data file based on two columns while leaving its header untouched, type:

(head -n 1 file1.txt && tail -n +2 file1.txt | sort -n -k1,1 -k2,2) > file2.txt

A subshell can also be used to preprocess multiple inputs on the fly (avoiding useless intermediate files):

paste <(< file1.txt tr ' ' '\t') <(< file2.txt tr ' ' '\t') > file3.txt

Advanced text manipulation (sed)

sed (“stream editor”) allows you to change a file line by line. You can substitute text, drop lines, transform text… but the syntax can be quite opaque once you’re doing anything more than substituting foo with bar on every line (sed 's/foo/bar/g').
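
A few common sed one-liners (sketches; data.txt is a hypothetical input file):

sed 's/chr/chromosome/g' data.txt > renamed.txt   # substitute on every line
sed '1d' data.txt                                 # drop the first line (e.g. a header)
sed -n '10,20p' data.txt                          # print only lines 10-20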

More complex data manipulation (awk)

awk lets you manipulate text data in very complex ways. In fact, it is a simple programming language with functionality similar to that of regular programming languages, so it allows enormous variability in how text data can be processed.

It can be used to write a short script that is chained with other Unix commands in one pipeline. The biggest strength of awk is that it is line oriented and saves you a lot of the boilerplate code that you would have to write in other languages for moderately complex processing of text files. The basic structure of a script is divided into three parts, any of which may be omitted (according to the intention of the user). The first part, BEGIN{}, performs operations before the input file is read; the middle part, {}, goes through the input file and operates on each line separately; the last part, END{}, performs operations after the whole input file has been read.

The basic syntax:

< data.txt awk 'BEGIN{<before data processing>} {<process each line>} END{<after all lines are processed>}' > output.txt
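
A minimal concrete instance of this structure, summing a (hypothetically numeric) first column of data.txt:

< data.txt awk 'BEGIN{sum=0} {sum += $1} END{print "total:", sum}'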

Built-in variables

awk has several built-in variables which can be used to track and process data without having to program specific feature.

The four basic built-in variables:

  • FS - input field separator
  • OFS - output field separator
  • NR - record (line) number
  • NF - number of fields in record (in line)

There are more built-in variables that we won’t discuss here: RS, ORS, FILENAME, FNR.

Use of built-in variables:

awk splits each line into columns based on white space. When a different delimiter (e.g. TAB) is to be used, it can be specified using the -F option. If you want to keep this custom field separator in the output, you have to set the output field separator as well (there’s no command line option for OFS):

< data.txt awk -F $'\t' 'BEGIN{OFS=FS}{print $1,$2}' > output.txt

This command takes the file data.txt, extracts the first two TAB delimited columns of the input file and prints them, TAB delimited, into the output file output.txt. Looking at the syntax more closely, we see that the TAB delimiter was set using the -F option. This option corresponds to the FS built-in variable. As we want TAB delimited columns in the output file, we assign FS to OFS (i.e. the output field separator) in the BEGIN section. Then, in the middle section, we print the first two columns, which are referenced by number with the $ symbol ($1, $2). The numbers correspond to the positions of the columns in the input file. We could, of course, use the cut command for this operation, which would be even simpler. However, awk lets us carry out any other operation on the data as well.

Note

The complete input line is stored in $0.

The NR built-in variable can be used to capture every second line in a file. Type:

< data.txt awk '{ if(NR % 2 == 0){ print $0 }}' > output.txt

The % symbol represents the modulo operator, which returns the remainder of a division. The if() condition decides whether the modulo is 0 or not.
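
The NF variable can be used in a similar way, e.g. to keep only well-formed rows (a sketch, assuming a table that should have exactly 4 columns):

< data.txt awk 'NF == 4 {print $0}' > clean.txt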

Here is a slightly more complex example of how to use awk: we write a command which retrieves the coordinates of introns from the coordinates of exons.

Example of input file:

GeneID            Chromosome   Exon_Start   Exon_End
ENSG00000139618   chr13        32315474     32315667
ENSG00000139618   chr13        32316422     32316527
ENSG00000139618   chr13        32319077     32319325
ENSG00000139618   chr13        32325076     32325184
...               ...          ...          ...

We build the command step by step. First, we remove the header and sort the data based on the GeneID and Exon_Start columns:

< exons.txt tail -n +2 | sort -k1,1 -k3,3n | ...

Then we pipe the result into a short awk script to obtain the coordinates of the introns:

... | awk -F $'\t' 'BEGIN{OFS=FS}{
         if(NR==1){
           x=$1; end1=$4+1;
         }else{
           if(x==$1) {
               print $1,$2,end1,$3-1; end1=$4+1;
           }else{
               x=$1; end1=$4+1;
           }
         }
       }' > introns.txt

In the BEGIN{} part we set TAB as the output field separator. Using the NR==1 test, we store the GeneID of the first line in the x variable and the start of the first intron (exon end + 1) in the end1 variable. For the other records (NR > 1), the condition x==$1 tests whether we are still within the same gene. If so, we print the stored intron start (end1) together with the exon start of the current line minus one as the intron end, and then store the next intron start (the exon end of the current line plus one) in end1. If we have moved on to a new gene (x != $1), we repeat the procedure used for the first line and print nothing, waiting for the next line.

Joining multiple files + subshell

Use the paste and join commands.
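
In brief (file1.txt and file2.txt are hypothetical; join expects its inputs sorted on the join field):

paste file1.txt file2.txt   # glue the files side by side, TAB separated
join file1.txt file2.txt    # match lines of two sorted files on the first field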

Note

Shell substitution is a nice way to pass a pipeline in a place where a file is expected, be it an input or an output file (just use the appropriate sign: <( ) for input, >( ) for output). Multiple pipelines can be used in a single command:

cat <( cut -f 1 file.txt | sort -n ) <( cut -f 1 file2.txt | sort -n ) | less

Use the nightingale FASTQ files:

  1. Join all nightingale FASTQ files and create a TAB separated file with one line per read:

# repeating the input (-) in paste makes it take more lines from the same source
cat *.fastq | paste - - - - | cut -f 1-3 | less
  2. Make a TAB-separated file having four columns:

    1. chromosome name
    2. number of variants in total for the given chromosome
    3. number of variants which pass
    4. number of variants which fail

# Command 1
< data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1 |
sort | uniq -c | sed -r 's/^ +//' | tr " " "\t" > data/count_vars_chrom.txt

# Command 2
< data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1,7 | sort -r |
uniq -c | sed -r 's/^ +//' | tr " " "\t" | paste - - |
cut --complement -f 2,3,6 > data/count_vars_pass_fail.txt

# Command 3
join -1 2 -2 3 data/count_vars_chrom.txt data/count_vars_pass_fail.txt | wc -l

# How many lines did you retrieve?

# You have to sort the data before sending them to ``join`` - use a subshell
join -1 2 -2 3 <( sort -k2,2 data/count_vars_chrom.txt ) \
<( sort -k3,3 data/count_vars_pass_fail.txt ) | tr " " "\t" > data/count_all.txt

All three commands together using subshell:

# and indented a bit more nicely
IN=data/luscinia_vars_flags.vcf
join -1 2 -2 3 \
    <( <$IN  grep -v '^#' |
      cut -f 1 |
      sort |
      uniq -c |
      sed -r 's/^ +//' |
      tr " " "\t" |
      sort -k2,2 ) \
    <( <$IN grep -v '^#' |
      cut -f 1,7 |
      sort -r |
      uniq -c |
      sed -r 's/^ +//' |
      tr " " "\t" |
      paste - - |
      cut --complement -f 2,3,6 |
      sort -k3,3  ) |
  tr " " "\t" \
> data/count_all.txt

Helpful commands (dir content and its size, disk usage)

ls -shaR # list all contents of a directory (including subdirectories)
du -sh # disk usage (by directory)
df -h # free disk space
ls | wc -l # what does this command do?
locate # find a file/program by name