Session 9: Variant quality¶

In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:

/data-shared/vcf_examples/luscinia_vars.vcf.gz

Your task now is:

pick only data for chromosomes chr1 and chrZ
extract the sequencing depth DP from the INFO column
extract variant type by checking if the INFO column contains INDEL string
load these two columns together with the first six columns of the VCF into R
explore graphically - barchart of variant types - boxplot of qualities for INDELs and SNPs (use scale_y_log10() if you don’t like the outliers) - histogram of qualities for INDELs and SNPs (use scale_x_log10(), facet_wrap()) - what is the problem?

And a bit of guidance here:

create a new project directory in your projects
get rid of the comments (they start with #, that is ^# regular expression)
filter lines based on chromosomes (grep -e 'chr1\s' -e 'chrZ\s')
extact the first 6 columns (cut -f1-6)
extract DP column (egrep -o 'DP=[^;]*' | sed 's/DP=//')
check each line for INDEL (awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}')
merge the data (columns) before loading to R (paste)
add column names while loading the data with read_tsv(..., col_names=c(...))
If you’re done early, try to submit your solution to a new repo in GitHub!

Good luck! (We will help you;)