Session 9: Variant quality
In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:
/data-shared/vcf_examples/luscinia_vars.vcf.gz
Your task now is:
- pick only data for chromosomes chr1 and chrZ
- extract the sequencing depth DP from the INFO column
- extract variant type by checking if the INFO column contains INDEL string
- load these two columns together with the first six columns of the VCF into R
- explore graphically
  - barchart of variant types
  - boxplot of qualities for INDELs and SNPs (use scale_y_log10() if you don’t like the outliers)
  - histogram of qualities for INDELs and SNPs (use scale_x_log10(), facet_wrap())
- what is the problem?
And a bit of guidance here:
- create a new project directory in your projects
- get rid of the comments (they start with #, that is ^# regular expression)
- filter lines based on chromosomes (grep -e 'chr1\s' -e 'chrZ\s')
- extract the first 6 columns (cut -f1-6)
- extract DP column (egrep -o 'DP=[^;]*' | sed 's/DP=//')
- check each line for INDEL (awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}')
- merge the data (columns) before loading to R (paste)
- add column names while loading the data with read_tsv(..., col_names=c(...))
- If you’re done early, try to submit your solution to a new repo in GitHub!
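If you want to see how the shell steps above fit together before running them on the real data, here is a minimal sketch on a tiny made-up VCF. The directory name variant-quality-demo, the toy records and all intermediate file names are assumptions for illustration only, and grep -Eo is used as the equivalent of egrep -o. The sketch also assumes that every kept line has exactly one DP= entry in INFO; otherwise the rows would no longer line up after paste.

```shell
set -eu

# Hypothetical project directory (in the course: a new directory in your projects)
OUT=variant-quality-demo
mkdir -p "$OUT"

# Toy stand-in for /data-shared/vcf_examples/luscinia_vars.vcf.gz (made-up records)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50.0\t.\tDP=12;AF1=0.5\n'
  printf 'chr1\t200\t.\tAT\tA\t20.5\t.\tINDEL;DP=7;AF1=1\n'
  printf 'chr10\t300\t.\tC\tT\t99.0\t.\tDP=30;AF1=0.5\n'
  printf 'chrZ\t400\t.\tG\tC\t10.0\t.\tDP=3;AF1=1\n'
} | gzip > "$OUT/toy.vcf.gz"

# Drop comment lines, keep only chr1 and chrZ, and save the result once
# so that every later step reads exactly the same lines
gzip -dc "$OUT/toy.vcf.gz" |
  grep -v '^#' |
  grep -e 'chr1\s' -e 'chrZ\s' > "$OUT/filtered.vcf"

# First six columns (CHROM, POS, ID, REF, ALT, QUAL)
cut -f1-6 "$OUT/filtered.vcf" > "$OUT/cols1-6.tsv"

# Sequencing depth: keep only the DP=... part, then strip the key
grep -Eo 'DP=[^;]*' "$OUT/filtered.vcf" | sed 's/DP=//' > "$OUT/dp.tsv"

# Variant type: INDEL if the line contains the INDEL string, SNP otherwise
awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}' \
  "$OUT/filtered.vcf" > "$OUT/type.tsv"

# Merge the three pieces column-wise into one tab-separated file
paste "$OUT/cols1-6.tsv" "$OUT/dp.tsv" "$OUT/type.tsv" \
  > "$OUT/luscinia_vars_flags.tsv"

cat "$OUT/luscinia_vars_flags.tsv"
```

Saving the filtered lines to a file first is one possible design choice: the three extracted pieces then come from identical input, so paste can safely glue them row by row. The resulting file has eight columns and no header, which is why the column names are supplied while loading it with read_tsv(..., col_names=c(...)).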
Good luck! (We will help you;)