Genomic tools session¶
A lot of command line tools available for genomics, e.g.:
- Read alignment data:
- samtools (https://samtools.github.io)
- Variant data:
- vcftools (https://vcftools.github.io/index.html)
- bcftools (https://samtools.github.io/bcftools/)
- Annotation data (genome arithmetics):
- bedtools (https://bedtools.readthedocs.io/en/latest/)
- bedops (https://github.com/bedops/bedops)
- Sequence/Alignment/Tree data:
- newick-utilities (https://github.com/tjunier/newick_utils/wiki)
- BuddySuite (https://github.com/biologyguy/BuddySuite)
Exercise¶
Get a population differentiation calculated as Fst between M. m. musculus and M. m. domesticus within a given sliding window and find candidate genes within highly differentiated regions:
- use
vcftools
to filter data and calculate Fst for individual SNPs- calculate Fst for each SNP (
vcftools
)- write a function that will create sliding windows for windows of different sizes and steps (
bedtools makewindows
) and calculate average Fst for each window (groupBy
) and calculate average Fst for sliding windows of these sizes and steps
- 100 kb + 10 kb step
- 500 kb + 50 kb step
- 1 Mb + 100 kb step
- use R-Studio and ggplot2 to plot Fst values across the genome
- use R or
tabtk
to obtain the 99th percentile and use it to obtain a set of candidate genomic regions- use
bedtools intersect
to get a list of candidate genes
Extract genotype data for European mouse individuals and filter out variants having more than one missing genotype and minor allele frequency 0.2 (we have already started - you should have prepared VCF file with European samples and filtered out variants with missing genomes and low minor allele frequency).
mkdir -p ~/projects/fst
cd ~/projects/fst
IN=/data-shared/mus_mda/00-popdata/popdata_mda.vcf.gz
SAMPLES=/data-shared/mus_mda/00-popdata/euro_samps.txt
vcftools --gzvcf $IN \
--keep $SAMPLES \
--recode --stdout |
vcftools --vcf - \
--max-missing 1 \
--maf 0.2 \
--recode \
--stdout \
> popdata_mda_euro.vcf
Calculate Fst values for variants between M. m. musculus
and M. m. domesticus populations (populations specified in
musculus_samps.txt
and domesticus_samps.txt
):
MUS=/data-shared/mus_mda/00-popdata/musculus_samps.txt
DOM=/data-shared/mus_mda/00-popdata/domesticus_samps.txt
IN=popdata_mda_euro.vcf
vcftools --vcf $IN \
--weir-fst-pop $MUS \
--weir-fst-pop $DOM \
--stdout |
tail -n +2 |
awk -F $'\t' 'BEGIN{OFS=FS}{print $1,$2-1,$2,$1":"$2,$3}' \
> popdata_mda_euro_fst.bed
Build function that calculates average Fst for sliding windows
# Set file name with Fst values by SNP
IN=popdata_mda_euro_fst.bed
# Make sliding windows (genome file containing info about size of chromosome has to be specified)
grep -E '^2|^11' /data-shared/mus_mda/02-windows/genome.fa.fai > genome-fst.fa.fai
GENOME=genome-fst.fa.fai
WIN=1000000
STEP=100000
NAME="1Mb"
bedtools makewindows \
-g $GENOME \
-w $WIN \
-s $STEP |
awk -v win=$NAME '{ print $0"\t"win }' | less
# Intersect windows with list of SNPs
bedtools makewindows \
-g $GENOME \
-w $WIN \
-s $STEP | \
awk -v win=$NAME '{ print $0"\t"win }' |
bedtools intersect \
-a - \
-b $IN \
-wa -wb | less
# Calculate the average Fst by windows
bedtools makewindows \
-g $GENOME \
-w $WIN \
-s $STEP | \
awk -v win=$NAME '{ print $0"\t"win }' |
bedtools intersect \
-a - \
-b $IN \
-wa -wb |
sort -k4,4 -k1,1 -k2,2n |
groupBy -i - \
-g 4,1,2,3 \
-c 9 \
-o mean
# We can put everything together to write a function that can be re-used for different window sizes
average_fst() {
bedtools makewindows \
-g $1 \
-w $2 \
-s $3 |
awk -v win=$4 '{ print $0"\t"win }' |
bedtools intersect \
-a - \
-b $5 \
-wa -wb |
sort -k4,4 -k1,1 -k2,2n |
groupBy -i - \
-g 4,1,2,3 \
-c 9 \
-o mean
}
Make three sets of sliding windows (100 kb, 500 kb, 1 Mb) and concatenate them into a single file:
IN=popdata_mda_euro_fst.bed
GENOME=genome-fst.fa.fai
# 1 Mb sliding windows with 100 kb step
average_fst $GENOME 1000000 100000 "1Mb" $IN > fst_1000kb.bed
# 500 kb sliding windows with 50 kb step
average_fst $GENOME 500000 50000 "500kb" $IN > fst_500kb.bed
# 100 kb sliding windows with 10 kb step
average_fst $GENOME 100000 10000 "100kb" $IN > fst_100kb.bed
cat fst*.bed > windows_mean_fst.tsv
Visualize the average Fst values within the sliding windows of the three sizes between the two house mouse subspecies in R-Studio. Plot the distribution of the Fst values for the three window sizes and also plot the average Fst values along the chromosomes.
Note
R ggplot2 commands to plot population differentiation
library(tidyverse)
setwd("~/projects/fst")
## Read Fst file and rename names in header
read_tsv('windows_mean_fst.tsv', col_names=F) -> fst
names(fst) <- c("win_size", "chrom", "start", "end", "avg_fst" )
# Reorder levels for window size
fst %>%
mutate(win_size = factor(win_size, levels=c("100kb", "500kb", "1Mb"))) ->
fst
# Plot density distribution for average Fst values across windows
ggplot(fst, aes(avg_fst)) +
geom_density(fill=I("blue")) +
facet_wrap(~win_size)
## Plot Fst values along physical position
ggplot(fst, aes(y=avg_fst, x=start, colour=win_size)) +
geom_line() +
facet_wrap(~chrom, nrow=2) +
scale_colour_manual(name="Window size", values=c("green", "blue","red"))
## Retrieve 99% quantiles
fst %>%
group_by(win_size) %>%
summarize(p=quantile(avg_fst,probs=0.99)) -> fst_quantiles
## Add 99% quantiles for 500kb window
ggplot(fst, aes(y=avg_fst, x=start, colour=win_size)) +
geom_line() +
facet_wrap(~chrom, nrow=2) +
geom_hline(yintercept=as.numeric(fst_quantiles[2,2]), colour="black") +
scale_colour_manual(name="Window size", values=c("green", "blue","red"))
Find the 99th percentile of genome-wide distribution of Fst values
in order to guess possible outlier genome regions. 99th percentile
can be obtained running R as command line or by using tabtk
.
The output would be a list of windows having Fst higher
than or equal to 99% of the data.
## Calculate the 99 % quantile for average Fst for 500 kb windows
Q=$( grep '500kb' windows_mean_fst.tsv | tabtk num -c5 -q0.99 )
## Use of variables in AWK: -v q=value
grep 500kb windows_mean_fst.tsv |
awk -v q=$Q -F $'\t' 'BEGIN{OFS=FS}$5>=q{print $2,$3,$4}' |
sortBed |
bedtools merge -i stdin \
> signif_500kb.bed
Use the mouse gene annotation file to retrieve genes within the windows of high Fst (i.e. putative reproductive isolation loci).
GENES=/data-shared/bed_examples/Ensembl.NCBIM37.67.bed
bedtools intersect \
-a $GENES \
-b signif_500kb.bed -wa | \
column -t | less