Session 7: Variant quality

In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:

/data-shared/vcf_examples/luscinia_vars.vcf.gz

Your task now is:

set up a new project directory
pick only data for chromosomes chr1 and chrZ
extract the sequencing depth DP from the INFO column
extract variant type by checking if the INFO column contains INDEL string
merge these two columns together with the first six columns of the VCF
create a new git repository and commit your script
push the repository to GitHub

And a bit of guidance here:

create a new project directory in your projects
you will create 3 different files (one for teh first 6 columns, one for the read depths and one for the variant type)
get rid of the comments (they start with #, that is ^# regular expression)
filter lines based on chromosomes (grep -e 'chr1\s' -e 'chrZ\s')
extact the first 6 columns (cut -f1-6)
extract DP column (egrep -o 'DP=[^;]*' | sed 's/DP=//')
check each line for INDEL (awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}')
check if all files you are about to merge have the same length wc -l data/*.tsv
merge the data (columns) before loading to R (paste file1 file2 file3 > file_combined)

Good luck! (We will help you;)

Extra R task (if you want to challenge yourself 😉)

load the data into R
explore graphically
- barchart of variant types, how many variant are INDELs and how many SNPs (use geom_bar())
- boxplot of qualities for INDELs and SNPs (use geom_boxplot() and additionally scale_y_log10() if you don’t like the outliers)
- histogram of qualities for INDELs and SNPs (use geom_histogram() and additionally scale_x_log10(), facet_wrap())
- can you spot some problem in the data

Hints:

don’t forget to load library for data manipulation and plotting (library(tidyverse))
add column names while loading the data with read_tsv(..., col_names=c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'DP', 'TYPE'))