Variant quality
In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:
/data/vcf_examples/luscinia_vars.vcf.gz
Your task now is:
- pick only data for chromosomes `chr1` and `chrZ`
- extract the sequencing depth `DP` from the `INFO` column
- extract variant type by checking if the `INFO` column contains the `INDEL` string
- load these two columns together with the first six columns of the VCF into R
- explore graphically (barchart of variant types, histogram of qualities for INDELs and SNPs, ...)
And a bit of guidance here:
- create a new project directory in your `projects`
- get rid of the comments (they start with `#`, that is the `^#` regular expression)
- filter lines based on chromosomes (`grep -e chr1 -e chrZ`)
- extract the first 6 columns (`cut -f1-6`)
- extract the `DP` value (`egrep -o 'DP=[^;]*' | sed 's/DP=//'`)
- check each line for `INDEL` (`awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}'`)
- merge the data (columns) before loading to R (`paste`)
- add column names while loading the data with `read.delim(..., col.names=c(...))`
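If you get stuck, the hints above can be chained roughly as follows. This is only a sketch of one possible way to do it, not the reference solution. The function name `vcf_to_table`, the temporary file names and the output file name in the last comment are made up for illustration. The sketch also assumes that every kept line contains exactly one `DP=` entry, otherwise the pasted columns would no longer line up.

```shell
# Hypothetical helper: reads an uncompressed VCF on stdin and writes
# a tab-separated table (first six columns, DP, variant type) to stdout.
vcf_to_table() {
    tmp=$(mktemp -d)

    # drop comment lines, then keep lines that mention chr1 or chrZ
    # (plain substring match, so a name like chr10 would be kept as well)
    grep -v '^#' | grep -e chr1 -e chrZ > "$tmp/filtered.vcf"

    # first six columns of the VCF
    cut -f1-6 "$tmp/filtered.vcf" > "$tmp/cols.tsv"

    # sequencing depth taken from the DP= entry (assumes one per line)
    grep -E -o 'DP=[^;]*' "$tmp/filtered.vcf" | sed 's/DP=//' > "$tmp/dp.tsv"

    # variant type: INDEL if the line contains that string, SNP otherwise
    awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}' \
        "$tmp/filtered.vcf" > "$tmp/type.tsv"

    # glue the three pieces together column-wise
    paste "$tmp/cols.tsv" "$tmp/dp.tsv" "$tmp/type.tsv"

    rm -r "$tmp"
}

# Possible usage (the output file name is an assumption):
#   zcat /data/vcf_examples/luscinia_vars.vcf.gz | vcf_to_table > luscinia_vars.tsv
```

The resulting table has eight columns: the first six VCF columns (CHROM, POS, ID, REF, ALT, QUAL), followed by the depth and the variant type. Those are the names to supply through `col.names` when reading the file with `read.delim`.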
Good luck! (We will help you;)