A few more tips
===============

This is a collection of tips that may help you overcome the initial barrier of
working with a 'foreign' system. There are many ways to reach a solution.
Those presented here are not the only correct ones, but they have proved
beneficial to the authors.

Easiest ways to get Unix
------------------------

To get the most basic Unix tools, you can download and install Git for Windows.
It comes with a nice terminal emulator, and installs to your right-click menu
as 'Git Bash here' - which runs a terminal in the folder that you clicked.
Git itself is meant for managing versions of directories, but it cannot live
without the Unix environment, so someone did the hard work and packaged it all
nicely together.

If you need a more complete Unix environment, there are currently several
options. If you have a recent version of Windows 10 (yes, there are different
versions of Windows 10), you can enable 'Windows Subsystem for Linux (WSL)' and
then install Ubuntu or Debian from the Windows Store. Connecting two operating
systems like this is a marvel of engineering, and you are getting a 'real'
Linux.

In older Windows, you can use Cygwin. It is quite complete, but it can't
replace native Unix, as you'll find sooner or later.

Another easy way of getting a Unix environment in Windows is to install a basic
Linux into a virtual machine. Our previous courses used this method. It's much
more convenient than a dual boot configuration, and the risk of completely
breaking your computer is lower. You can use Unix while having all your
familiar stuff at hand. The only downside is that you have to transfer all the
data as if the image was a remote machine. It's much more convenient to connect
to the machine with a normal terminal like PuTTY than to type the commands into
the virtual screen of VirtualBox, which usually lacks clipboard support, does
not let you change the size of the window, etc.

Mac OS X and Linux are Unix based, so you just have to know how to start your
terminal program (``konsole``, ``xterm``, ``Terminal`` ...).

Essentials
----------

Always use ``screen`` for any remote work. Without ``screen``, your jobs get
interrupted when the network link fails, and you have to keep your home
computer running even though the calculation itself is running on a remote
server.

Track system resource usage with ``htop``. A system that is running low on
memory won't perform fast. A system with many cores where only one core ('CPU')
is used should be used for more tasks - or can finish your task much faster,
if used correctly.

Data organization
-----------------

Make a new directory for each project. Put all your data into subdirectories.
Use symbolic links to reference huge data that are reused by several projects
in your current project directory. Prefix your directory names with two-digit
numbers if your project has more than a few subdirectories. Increase the
number as the data inside is more and more 'processed'. Keep the code in the
top directory. The ``[0-9]{2}-`` prefix then makes it easy to tell data
directories from everything else.

An example of a genomic pipeline data directory follows:

.. code::

    00-raw --> /data/slavici/all-reads
    01-fastqc
    10-trim-adaptors
    13-fastqc
    20-assembly-newbler
    30-map-reads-gmap
    31-map-reads-bwa
    50-variants-samtools

Take care to note all the code used to produce all the intermediate data files.
This has two benefits:

1) your results will be really **reproducible**
2) it will **save you much work** when doing the same again, or trying
   different settings

If you feel geeky, use ``git`` to track your code files. It will save you from
having 20 versions of one script - and from being completely lost a year later,
when trying to figure out which one was the one that was actually working.

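
The layout described above can be set up with a few commands. This is only a
minimal sketch: the project name ``my-project`` and the location of the shared
data are hypothetical (a scratch directory stands in for real paths), and the
subdirectory names simply follow the example listing.

.. code-block:: bash

    # hypothetical locations, a scratch directory stands in for real paths
    WORK=$(mktemp -d)
    SHARED="$WORK/shared/all-reads"   # stands in for a big shared data set
    PROJECT="$WORK/my-project"        # hypothetical project directory

    mkdir -p "$SHARED" "$PROJECT"
    cd "$PROJECT"

    # reference the shared raw data with a symbolic link instead of copying it
    ln -s "$SHARED" 00-raw

    # numbered subdirectories, the number grows as the data gets more processed
    mkdir 01-fastqc 10-trim-adaptors 20-assembly-newbler

    # data directories are easy to pick out by their two-digit prefix
    ls -d [0-9][0-9]-*

The symbolic link keeps the project directory small while still showing where
the raw data comes from.
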
Building command lines
----------------------

Build the pipelines command by command, keeping ``| less -S`` (or ``| head`` if
you don't expect lines of the output to be longer than your terminal width) at
the end. Every time, check that the output is what you expect, and only after
that add the next command. If there is a ``sort`` in your pipeline, put
``head`` in front of the ``sort`` while testing, because ``sort`` has to
process all the data before it gives out any output.

I (Libor) prefer the 'input first' syntax (``<input command1 | command2 >out``),
which improves legibility, resembles the real world pipeline (input tap ->
garden hose -> garden sprinkler) more, and makes the input file names easier to
find when you change them while reusing the pipeline.

Wrap your long pipelines on ``|`` - copy and paste to bash still works, because
bash knows there has to be something after ``|`` at the end of the line. Only
the last line has to be escaped with ``\``, otherwise all your output would go
to the screen instead of a file.

.. code-block:: bash

    <infile sort |
      uniq -c |
      sort -k1,1nr \
    >out

You can get a nice progress bar if you use ``pv`` (pipe viewer) instead of
``cat`` at the beginning of the pipeline. But again, if there is a ``sort`` in
your pipeline, it has to consume all the data before it passes anything on, so
the bar only shows how fast the input is being read.

Use variables instead of hard-coded file names / arguments, especially when the
name is used several times in the process, or the argument is supposed to be
tuned:

.. code-block:: bash

    FILE=/data/00-reads/GS60IET02.RL1.fastq
    THRESHOLD=300

    # count sequences in file
    <$FILE awk '(NR % 4 == 2)' | wc -l
    # 42308

    # count sequences longer than the threshold
    # (the shell variable is passed in with -v, single quotes would hide it from awk)
    <$FILE awk -v thr=$THRESHOLD '(NR % 4 == 2 && length($0) > thr)' | wc -l
    # 14190

Parallelization
---------------

Many tasks, especially in Big Data and NGS, are 'data parallel' - that means
you can split the data in pieces, compute the results on each piece separately
and then combine the results to get the complete result.
This makes it very easy to exploit the full power of modern multi-core
machines, speeding up your processing e.g. 10 times.

``GNU parallel`` is a nice tool that helps to parallelize bash pipelines. Check
the manual for some examples: ``man parallel_tutorial``.
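
The split, compute, combine idea can be tried with plain bash before reaching
for ``GNU parallel``. The following is only a sketch under stated assumptions:
it uses shell background jobs and ``wait`` rather than ``parallel``, the input
is a small made-up file created on the spot, and all file names are
hypothetical. It counts the lines containing ``A`` in each piece separately and
then sums the partial counts.

.. code-block:: bash

    # made-up input in a scratch directory (hypothetical data, 1000 lines)
    WORK=$(mktemp -d)
    cd "$WORK"
    seq 1 1000 |
      awk '{ if ($1 % 4 == 0) print "A" $1; else print "B" $1 }' \
    >input.txt

    # 1) split the data into pieces of 250 lines: piece.aa, piece.ab, ...
    split -l 250 input.txt piece.

    # 2) process every piece in its own background job
    for P in piece.??; do
      grep -c A "$P" >"$P.count" &
    done
    wait   # block until all background jobs have finished

    # 3) combine the partial results
    TOTAL=$(cat piece.??.count | awk '{ s += $1 } END { print s }')
    echo "$TOTAL"
    # 250

With ``GNU parallel`` installed, the loop and the bookkeeping of temporary files
are handled by the tool. The tutorial mentioned above shows the exact
invocations.
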