Session 7: Variant quality
==========================

In this part you will be working on your own. You're already familiar with the
VCF format and some reformatting and plotting tools. There is a file with
variants from several nightingale individuals::

  /data-shared/vcf_examples/luscinia_vars.vcf.gz

Your task now is:

- set up a new project directory
- pick only data for chromosomes ``chr1`` and ``chrZ``
- extract the sequencing depth ``DP`` from the ``INFO`` column
- extract variant type by checking if the ``INFO`` column contains ``INDEL`` string
- merge these two columns together with the first six columns of the VCF
- create a new git repository and commit your script
- push the repository to GitHub

And a bit of guidance here:

- create a new project directory in your ``projects``
- you will create 3 different files (one for teh first 6 columns, one for the read depths and one for the variant type)
- get rid of the comments (they start with ``#``, that is ``^#`` regular expression)
- filter lines based on chromosomes (``grep -e 'chr1\s' -e 'chrZ\s'``)
- extact the first 6 columns (``cut -f1-6``)
- extract ``DP`` column (``egrep -o 'DP=[^;]*' | sed 's/DP=//'``)
- check each line for ``INDEL`` (``awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}'``)
- check if all files you are about to merge have the same length ``wc -l data/*.tsv``
- merge the data (columns) before loading to R (``paste file1 file2 file3 > file_combined``)

.. pull-quote:: Good luck! (We will help you;)

Extra R task (if you want to challenge yourself 😉)
------------

- load the data into R
- explore graphically

  - barchart of variant types, how many variant are INDELs and how many SNPs (use ``geom_bar()``)
  - boxplot of qualities for INDELs and SNPs (use ``geom_boxplot()`` and additionally ``scale_y_log10()`` if you don't like the outliers)
  - histogram of qualities for INDELs and SNPs (use ``geom_histogram()`` and additionally ``scale_x_log10()``, ``facet_wrap()``)
  - can you spot some problem in the data

Hints:

- don't forget to load library for data manipulation and plotting (``library(tidyverse)``)
- add column names while loading the data with ``read_tsv(..., col_names=c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'DP', 'TYPE'))``