Session 7: Variant quality
In this part you will be working on your own. You’re already familiar with the VCF format and some reformatting and plotting tools. There is a file with variants from several nightingale individuals:
/data-shared/vcf_examples/luscinia_vars.vcf.gz
Your task now is:
set up a new project directory
pick only data for chromosomes
chr1andchrZextract the sequencing depth
DPfrom theINFOcolumnextract variant type by checking if the
INFOcolumn containsINDELstringmerge these two columns together with the first six columns of the VCF
create a new git repository and commit your script
push the repository to GitHub
And a bit of guidance here:
create a new project directory in your
projectsyou will create 3 different files (one for teh first 6 columns, one for the read depths and one for the variant type)
get rid of the comments (they start with
#, that is^#regular expression)filter lines based on chromosomes (
grep -e 'chr1\s' -e 'chrZ\s')extact the first 6 columns (
cut -f1-6)extract
DPcolumn (egrep -o 'DP=[^;]*' | sed 's/DP=//')check each line for
INDEL(awk '{if($0 ~ /INDEL/) print "INDEL"; else print "SNP"}')check if all files you are about to merge have the same length
wc -l data/*.tsvmerge the data (columns) before loading to R (
paste file1 file2 file3 > file_combined)
Good luck! (We will help you;)
Extra R task (if you want to challenge yourself 😉)
load the data into R
explore graphically
barchart of variant types, how many variant are INDELs and how many SNPs (use
geom_bar())boxplot of qualities for INDELs and SNPs (use
geom_boxplot()and additionallyscale_y_log10()if you don’t like the outliers)histogram of qualities for INDELs and SNPs (use
geom_histogram()and additionallyscale_x_log10(),facet_wrap())can you spot some problem in the data
Hints:
don’t forget to load library for data manipulation and plotting (
library(tidyverse))add column names while loading the data with
read_tsv(..., col_names=c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'DP', 'TYPE'))