Session 7: Variant quality ========================== In this part you will be working on your own. You're already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:: /data-shared/vcf_examples/luscinia_vars.vcf.gz Your task now is: - set up a new project directory - pick only data for chromosomes ``chr1`` and ``chrZ`` - extract the sequencing depth ``DP`` from the ``INFO`` column - extract variant type by checking if the ``INFO`` column contains ``INDEL`` string - merge these two columns together with the first six columns of the VCF - create a new git repository and commit your script - push the repository to GitHub And a bit of guidance here: - create a new project directory in your ``projects`` - you will create 3 different files (one for teh first 6 columns, one for the read depths and one for the variant type) - get rid of the comments (they start with ``#``, that is ``^#`` regular expression) - filter lines based on chromosomes (``grep -e 'chr1\s' -e 'chrZ\s'``) - extact the first 6 columns (``cut -f1-6``) - extract ``DP`` column (``egrep -o 'DP=[^;]*' | sed 's/DP=//'``) - check each line for ``INDEL`` (``awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}'``) - check if all files you are about to merge have the same length ``wc -l data/*.tsv`` - merge the data (columns) before loading to R (``paste file1 file2 file3 > file_combined``) .. pull-quote:: Good luck! (We will help you;) Extra R task (if you want to challenge yourself 😉) ------------ - load the data into R - explore graphically - barchart of variant types, how many variant are INDELs and how many SNPs (use ``geom_bar()``) - boxplot of qualities for INDELs and SNPs (use ``geom_boxplot()`` and additionally ``scale_y_log10()`` if you don't like the outliers) - histogram of qualities for INDELs and SNPs (use ``geom_histogram()`` and additionally ``scale_x_log10()``, ``facet_wrap()``) - can you spot some problem in the data Hints: - don't forget to load library for data manipulation and plotting (``library(tidyverse)``) - add column names while loading the data with ``read_tsv(..., col_names=c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'DP', 'TYPE'))``