Reference manual for Unix introduction ====================================== Basic orientation in Unix ------------------------- **Multiple windows (screen)** You're all used to work with multiple windows (in MS Windows;). You can have them in Unix as well. The main benefit, however, is that you can log off and your programs keep running. To go into a screen mode type:: screen Once in screen you can control screen itself after you press the master key (and then a command): ``ctrl+a key``. To create a new window within the screen mode, press ``ctrl+a c`` (create). To flip among your windows press ``ctrl+a space`` (you flip windows often, it's the biggest key available). To detach screen (i.e. keep your programs running and go home), press ``ctrl+a d`` (detach). To open a detached screen type: .. code-block:: bash screen -r # -r means restore To list running screens, type:: screen -ls **Controlling processes (htop/top)** ``htop`` or ``top`` serve to see actual resource utilization for each running process. Htop is much nicer variant of standard ``top``. You can sort the processes by memory usage, CPU usage and few other things. **Getting help (man)** Just any time you're not sure about program option while building a command line, just flip to next screen window (you're always using screen for serious work), and type ``man`` and name of the command you want to know more about:: man screen .. _moving_around: Moving around & manipulation with files and directories ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Basic commands to move around and manipulate files/directories. .. code-block:: bash pwd # prints current directory path cd # changes current directory path ls # lists current directory contents ll # lists detailed contents of current directory mkdir # creates a directory rm # removes a file rm -r # removes a directory cp # copies a file/directory mv # moves a file/directory locate # tries to find a file by name Usage: *cd* To change into a specific subdirectory, and make it our current working directory:: cd go/into/specific/subdirectory To change to parent directory:: cd .. To change to home directory:: cd To go up one level to the parent directory then down into the directory2:: cd ../directory2 To go up two levels:: cd ../../ *ls* To list also the hidden files and directories (``-a``) in current in given folder along with human readable (``-h``) size of files (``-s``), type:: ls -ash *mv* To move a file data.fastq from current working directory to directory ``/home/directory/fastq_files``, type:: mv data.fastq /home/directory/fastq_files/data.fastq *cp* To copy a file data.fastq from current working directory to directory ``/home/directory/fastq_files``, type:: cp data.fastq /home/directory/fastq_files/data.fastq *locate* This quickly finds a file by a part of its name or path. To locate a file named data.fastq type:: locate data.fastq The ``locate`` command uses a database of paths which is automatically updated only once a day. When you look for some recent files you may not find them. You can manually request the update:: sudo updatedb **Symbolic links** Symbolic links refer to some other files or directories in a different location. It is useful when one wants to work with some files accessible to more users but wants to have them in a convenient location at the same time. Also, it is useful when one works with the same big data in multiple projects. Instead of copying them into each project directory one can simply use symbolic links. A symbolic link can are created by:: ln -s /data/genomes/luscinia/genome.fa genome/genome.fasta Exploring and basic manipulation with data ------------------------------------------ *less* Program to view the contents of text files. As it loads only the part of a the file that fits the screen (i.e. does not have to read entire file before starting), it has fast load times even for large files. To view text file while disabling line wrap and add line numbers add options ``-S`` and ``-N``, respectively:: less -SN data.fasta To navigate within the text file while viewing use: +-----------+-------------------+ | Key | Command | +===========+===================+ | Space bar | Next page | +-----------+-------------------+ | b | Previous page | +-----------+-------------------+ | Enter key | Next line | +-----------+-------------------+ | / | Look for string | +-----------+-------------------+ | G | Go to line | +-----------+-------------------+ | G | Go to end of file | +-----------+-------------------+ | h | Help | +-----------+-------------------+ | q | Quit | +-----------+-------------------+ *cat* Utility which outputs the contents of a specific file and can be used to concatenate and list files. Sometimes used in Czech as translated to 'kočka' and then made into a verb - 'vykočkovat';) .. code-block:: bash cat seq1_a.fasta seq1_b.fasta > seq1.fasta *head* By default, this utility prints first 10 lines. The number of first n lines can be specified by ``-n`` option (or by ``-..number..``). To print first 50 lines type:: .. code-block:: bash head -n 50 data.txt # is the same as head -50 data.txt # special syntax prints all but last 50 lines head -n -50 data.txt *tail* By default, this utility prints last 10 lines. The number of last n lines can be specified by ``-n`` option as in case of head. To print last 20 lines type: .. code-block:: bash tail -n 20 data.txt To skip the first line in the file (e.g. to remove header line of the file): .. code-block:: bash tail -n +2 data.txt *grep* This utility searches a text file(s) for lines matching a text pattern and prints the matching lines. To match given pattern it uses either specific string or regular expressions. Regular expressions enable for a more generic pattern rather than a fixed string (e. g. search for ``a`` followed by 4 numbers followed by any capital letter - ``a[0-9]{4}[A-Z]``). To obtain one file with list of sequence IDs in multiple fasta files type: .. code-block:: bash grep '>' *.fasta > seq_ids.txt To print all but #-starting lines from the vcf file use option ``-v`` (print non-matching lines): .. code-block:: bash grep -v ^# snps.vcf > snps.tab The ``^#`` mark means beginning of line followed directly by ``#``. *wc* This utility generates set of statistics on either standard input or list of text files. It provides these statistics: * line count (``-l``) * word count (``-w``) * character count (``-m``) * byte count (``-c``) * length of the longest line (``-L``) If specific word provided it returns count of this word in a given file. To obtain number of files in a given directory type: .. code-block:: bash ls | wc -l The ``|`` symbol is explained in further section. *cut* Cut out specific columns (fields/bytes) out of a file. By default, fields are separated by TAB. Otherwise, change delimiter using ``-d`` option. To select specific fields out of a file use ``-f`` option (position of selected fields/columns separated by commas). If needed to complement selected fields (i.e. keep all but selected fields) use ``--complement`` option. Out of large matrix select all but first column and row representing IDs of rows and columns, respectively: .. code-block:: bash < matrix1.txt tail -n +2 | cut --complement -f 1 > matrix2.txt *sort* This utility sorts a file based on whole lines or selected columns. To sort numerically use ``-n`` option. Range of columns used as sorting criterion is specified by ``-k`` option. Extract list of SNPs with their IDs and coordinates in genome from vcf file and sort them based on chromosome and physical position: .. code-block:: bash < snps.vcf grep ^# | cut -f 1-4 | sort -n -k2,2 -k3,3 > snps.tab *uniq* This utility takes sorted lists and provides unique records and also counts of non-unique records (``-c``). To have more numerous records on top of output use ``-r`` option for ``sort`` command. Find out count of SNPs on each chromosome: .. code-block:: bash < snps.vcf grep ^# | cut -f 2 | sort | uniq -c > chromosomes.tab *tr* Replaces or removes specific sets of characters within files. To replace characters a and b in the entire file for characters c and d, respectively, type: .. code-block:: bash tr 'ab' 'cd' < file1.txt > file2.txt Multiple consecutive occurrences of specific character can be replaced by single character using ``-s`` option. To remove empty lines type: .. code-block:: bash tr -s '\n' < file1.txt > file2.txt To replace lower case to upper case in fasta sequence type: .. code-block:: bash tr "[:lower:]" "[:upper:]" < file1.txt > file2.txt Building commands ----------------- **Globbing** Refers to manipulating (searching/listing/etc.) files based on pattern matching using specific characters. Example: .. code-block:: bash ls # a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta ls *.fasta # seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta Character ``*`` in previous example replaces any number of any characters and it indicates to ``ls`` command to list any file ending with ".fasta". However, if we look for fastq instead, we get no result: .. code-block:: bash ls *.fastq # Character ``?`` in following example replaces just right the one character (a/b) and it indicates to ls functions to list files containing `seq2_` at the beginning, any single character in the middle (a/b) and ending with ".fasta" .. code-block:: bash ls # a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta ls seq2_?.fasta # seq2_a.fasta seq2_b.fasta .. code-block:: bash ls # a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta ls seq2_[ab].fasta # seq2_a.fasta seq2_b.fasta One can specifically list altering characters (a,b) using brackets ``[]``. One may also be more general and list all files having any alphabetical character ``[a-z]`` or any numerical character ``[0-9]``: .. code-block:: bash ls # a.bed b.bed seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta ls seq[0-9]_[a-z].fasta # seq1_a.fasta seq1_b.fasta seq2_a.fasta seq2_b.fasta **TAB completition** Using key TAB one can finish unique file names or paths without having to fully type them. (try and see) From this perspective it is important to think about names for directories in advance as it can spare you a lot time in future. For instance, when processing data with multiple steps one can use numbers at beginnings of names: * 00-beginning * 01-first-processing * 02-second-processsing * ... **Variables** Unix environment enables to use shell variables. To set primer sequence ``'GATACGCTACGTGC'`` to variable ``PRIMER1`` in a command line and print it on screen using ``echo``, type: .. code-block:: bash PRIMER1=GATACGCTACGTGC echo $PRIMER1 # GATACGCTACGTGC .. note:: It is good habit in Unix to use capitalized names for variables: ``PRIMER1`` not ``primer1``. **Producing lists** What do these commands do? .. code-block:: bash touch file-0{1..9}.txt file-{10..20}.txt touch 0{1..9}-{a..f}.txt {10..12}-{a..f}.txt touch 0{1..9}-{jan,feb,mar}.txt {10..12}-{jan,feb,mar}.txt **Exercise**: Program runs 20 runs of simulations for three datasets (hm, ss, mm) using three different sets of values: small (sm), medium sized (md) and large (lg). There are three groups of output files, which should go into subdirectory A, B and C. Make a directory for each dataset-set of parameters-run-subdirectory. Count the number of directories. Producing lists of subdirectories .. code-block:: bash mkdir –p {2013..2015}/{A..C} mkdir –p {2013..2015}/0{1..9}/{A..C} {2013..2015}/{10..12}/{A..C} **Pipes** Unix environment enables to chain commands using pipe symbol ``|``. Standard output of the first command serves as standard input of the second one, and so on. .. code-block:: bash ls | head -n 5 **Subshell** Subshell enables to run two commands and capture the output into single file. It can be helpful in dealing with data files headers. Use of subshell enables to remove header, run the set of operations on the data, and later insert the header back to file. The basic syntax is: .. code-block:: bash (command1 file1.txt && command2 file1.txt) > file2.txt To sort data file based on two columns without including header type: .. code-block:: bash (head -n 1 file1.txt && tail -n +2 file1.txt | sort -n -k1,1 -k2,2) > file2.txt Subshell can be used also to preprocess multiple inputs on the fly (saving useless intermediate files): .. code-block:: bash paste <(< file1.txt tr ' ' '\t') <( file3.txt Advanced text manipulation (sed) -------------------------------- ``sed`` "stream editor" allows you to change file line by line. You can substitute text, you can drop lines, you can transform text... but the syntax can be quite opaque if you're doing anything more than substituting `foo` with `bar` in every line (``sed 's/foo/bar/g'``). More complex data manipulation (awk) ------------------------------------ ``awk`` enables to manipulate text data in a very complex way. In fact, it is a simple programming language with functionality similar to regular programming languages. As such it enables enormous variability in ways of how to process text data. It can be used to write a short script and which can be chained along with Unix commands in one pipeline. The biggest power of `awk` is that it's line oriented and saves you lot of boilerplate code that you would have to write in other languages, if you need moderately complex processing of text files. The basic structure of the script is divided into three parts and any of these three parts may or may not be included in the script (according to the intention of user). The first part ``'BEGIN{}'`` conducts operation before going through the input file, the middle part ``'{}'`` goes throughout the input file and conducts operations on each line separately. The last part ``'END{}'`` conducts operation after going through the input file. The basic syntax: .. code-block:: bash < data.txt awk 'BEGIN{} {} END{}' > output.txt **Built-in variables** awk has several built-in variables which can be used to track and process data without having to program specific feature. The basic four built-in variables: * ``FS`` - input field separator * ``OFS`` - output field separator * ``NR`` - record (line) number * ``NF`` - number of fields in record (in line) There is even more built-in variables that we won't discuss here: ``RS``, ``ORS``, ``FILENAME``, ``FNR`` Use of built-in variables: awk splits each line into columns based on white space. When a different delimiter (e.g. TAB) is to be used, it can be specified using ``-F`` option. If you want to keep this custom Field Separator in the output, you have to set the Output Field Separator as well (there's no command line option for OFS): .. code-block:: bash < data.txt awk -F $'\t' 'BEGIN{OFS=FS}{print $1,$2}' > output.txt This command takes file data.txt, extract first two TAB delimited columns of the input file and print them TAB delimited into the output file output.txt. When we look more closely on the syntax we see that the TAB delimiter was set using ``-F`` option. This option corresponds to the ``FS`` built-in variable. As we want TAB delimited columns in the output file we pass ``FS`` to ``OFS`` (i.e. ouput field separator) in the ``BEGIN`` section. Further, in the middle section we print out first two columns which can be extracted by numbers with ``$`` symbol (``$1``, ``$2``). The numbers correspond to position of the column in the input file. We could, of course, use for this operation the ``tr`` command which is even simpler. However, the awk enables to conduct any other operation on given data. .. note:: The complete input line is stored in ``$0``. The ``NR`` built-in variable can be used to capture each second line in a file type: .. code-block:: bash < data.txt awk '{ if(NR % 2 == 0){ print $0 }}' > output.txt The ``%`` symbol represents modulo operator which returns the remainder of division. The ``if()`` condition is used to decide on whether the modulo is 0 or not. Here is a bit more complex example of how to use ``awk``. We write a command which retrieves coordinates of introns from coordinates of exons. Example of input file: .. code-block:: bash GeneID Chromosome Exon_Start Exon_End ENSG00000139618 chr13 32315474 32315667 ENSG00000139618 chr13 32316422 32316527 ENSG00000139618 chr13 32319077 32319325 ENSG00000139618 chr13 32325076 32325184 ... ... ... ... The command is going to be as follows: When we look at the command step by step we first remove header and sort data based on GeneID and Exon_Start columns: .. code-block:: bash < exons.txt tail -n +2 | sort -k1,1 -k3,3n | ... Further, we write a short script using awk to obtain coordinates of introns: .. code-block:: bash ... | awk -F $'\t' 'BEGIN{OFS=FS}{ if(NR==1){ x=$1; end1=$4+1; }else{ if(x==$1) { print $1,$2,end1,$3-1; end1=$4+1; }else{ x=$1; end1=$4+1; } } }' > introns.txt In the ``BEGIN{}`` part we set TAB as output field separator. Further, using ``NR==1`` test we set GeneID for first line into ``x`` variable and intron start into end1 variable. Otherwise we do nothing. For others records ``NR > 1`` condition ``x==$1`` test if we are still within the same gene. If so we print exon end from previous line (``end1``) as intron start and exon start of current line we use as intron end. Next, we set new intron start (i.e. exon end from current line) into end1. If we have already moved into new one ``x<>$1``) we repeat procedure for the first line and print nothing waiting for next line. Joining multiple files + subshell --------------------------------- Use ``paste``, ``join`` commands. .. note:: Shell substitution is a nice way to pass a pipeline in a place where a file is expected, be it input or output file (Just use the appropriate sign). Multiple pipelines can be used in a single command: .. code-block:: bash cat <( cut -f 1 file.txt | sort -n ) <( cut -f 1 file2.txt | sort -n ) | less *Use nightingale FASTQ file* 1. Join all nightingale FASTQ files and create a TAB separated file with one line per read .. code-block:: bash # repeating input in paste causes it to take more lines from the same source cat *.fastq | paste - - - - | cut -f 1-3 | less 2. Make a TAB-separated file having four columns: 1. chromosome name 2. number of variants in total for given chromosome 3. number of variants which pass 4. number of variants which fails .. code-block:: bash # Command 1 < data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1 | sort | uniq -c | sed -r 's/^ +//' | tr " " "\t" > data/count_vars_chrom.txt # Command 2 < data/luscinia_vars_flags.vcf grep -v '^#' | cut -f 1,7 | sort -r | uniq -c | sed -r 's/^ +//' | tr " " "\t" | paste - - | cut --complement -f 2,3,6 > data/count_vars_pass_fail.txt # Command 3 join -1 2 -2 3 data/count_vars_chrom.txt data/count_vars_pass_fail.txt | wc -l # How many lines did you retrieved? # You have to sort the data before sending to ``join`` - subshell join -1 2 -2 3 <( sort -k2,2 data/count_vars_chrom.txt ) \ <( sort -k3,3 data/count_vars_pass_fail.txt ) | tr " " "\t" > data/count_all.txt All three commands together using subshell: .. code-block:: bash # and indented a bit more nicely IN=data/luscinia_vars_flags.vcf join -1 2 -2 3 \ <( <$IN grep -v '^#' | cut -f 1 | sort | uniq -c | sed -r 's/^ +//' | tr " " "\t" | sort -k2,2 ) \ <( <$IN grep -v '^#' | cut -f 1,7 | sort -r | uniq -c | sed -r 's/^ +//' | tr " " "\t" | paste - - | cut --complement -f 2,3,6 | sort -k3,3 ) | tr " " "\t" \ > data/count_all.txt Helpful commands (dir content and its size, disc usage) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash ls -shaR # list all contents of directory (including subdirectories) du -sh # disc usage (by directory) df -h # disc free space ls | wc -l # what does this command do? locate # find a file/program by name