Session 8: Bioinformatic pipelines ================================== Mouse Gene Ontology enrichment analysis pipeline ------------------------------------------------ Bioinformatic pipeline to carry Gene Ontology enrichment analysis for low and high divergence genes among two house mouse subspecies. SNP variants for two mouse strains (PWD/PhJ, WSB/EiJ) were downloaded from the Mouse Genome Project FTP site (`Mouse Genome Project `_). PWD/PhJ and WSB/EiJ represent *Mus musculus musculus* and *Mus musculus domesticus* subspecies, respectively. The aim is to identify genes with high relative divergence between the two strains and carry Gene Ontology enrichment analysis for genes according to the divergence. First, you have to clone the repository from the GitHub: .. code-block:: bash # Go to you project directory cd ~/projects # Clone repository git clone https://github.com/vjanousk/mouse-go-analysis.git # Go to the local repository cd mouse-go-analysis Install required software ^^^^^^^^^^^^^^^^^^^^^^^^^ For the mouse Gene Ontology pipeline three specialized genomic tools are necessary: ``bedtools``, ``vcftools`` or ``bcftools``. If they are not installed, the script below will install these tools. .. code-block:: bash ./install.sh Prepare directories ^^^^^^^^^^^^^^^^^^^ We are going to prepare directories that are use to store source data as well as intermediate steps and final resulting data. We will create a new ``data`` directory in your project directory .. code-block:: bash mkdir -p data/00-source-data data/01-divergence data/02-go Define variables ^^^^^^^^^^^^^^^^ Several types of variables defined. Filtering parameters provide thresholds on filtering quality and number of genes used at different stages of the pipeline. .. code-block:: bash # Filtering parameters quality=50 readdepth=10 minnumgenes=30 # Working directories (removed from the git in .gitignore) wd_source=data/00-source-data wd_divergence=data/01-divergence wd_go=data/02-go # Source files sourcevcf=/data-shared/mouse_go/mgp.v5.snps.dbSNP142.clean.X.vcf.gz sourcegenes=/data-shared/mouse_go/MGI.gff3.gz go2genes=/data-shared/mouse_go/gene_association.mgi.gz goterms=/data-shared/mouse_go/go_terms.mgi.gz # Processed source files cds_db=$wd_source/mgi-cds.bed go_db=$wd_source/go2genes.txt # Divergence analysis output files: annotation=$wd_divergence/annotation.tab divergencevcf=$wd_divergence/out-vars.vcf divergence=$wd_divergence/divergence.bed # GO enrichment by high and low divergence regions div_go=$wd_go/divergence_by_go.txt Prepare CDS & GO databases ^^^^^^^^^^^^^^^^^^^^^^^^^^ ``MGI.gff3.gz`` represents a full report containing detailed information on genes, mRNAs, exons and CDS. For the divergence analysis only CDS are needed. CDS database is prepared in this step and ``.gff3`` is converted to ``.bed`` to work more easily with the CDS data. .. code-block:: bash src/make_cds_database.sh $sourcegenes $cds_db ``go_terms.mgi.gz`` and ``gene_association.mgi.gz`` represents GO terms and association between genes and GO terms IDs provided by Mouse Genome Informatics (`Mouse Genome Informatics `_) and Gene Ontology Consortium (`Gene Ontology `_). In the command below joined dataset of list of genes with GO term enrichment is prepared. .. code-block:: bash src/make_go_database.sh $go2genes $goterms $go_db Run the pipeline step-by-step ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **1. Selecting SNPs that are divergent between the two strains** Other criteria used for selection is the PHRED quality and read depth (DP). Divergent SNPs are identified using Fst function built in the ``vcftools``. SNPs are considered to be divergent when Fst equals 1. .. code-block:: bash src/get_divergent_variants.sh \ $quality \ $readdepth \ $sourcevcf \ $annotation \ $divergencevcf **2. Calculate the per gene divergence** Once the list of divergent SNPs between the two strains and the CDS database are created, the divergence per gene can be calculated. Combination of ``bedtools`` tools and ``awk`` commands is used to find SNPs overlapping CDS parts of the genes and calculate sums and relative divergence by genes. .. code-block:: bash src/calculate_per_gene_divergence.sh \ $divergencevcf.gz \ $cds_db \ $divergence **3. Calculate the average relative divergence by Gene Ontology category** Per-gene relative divergences are used to calculate the average relative divergence for individual GO terms. Combinatino of the built-in Unix ``join`` and ``sort`` commands is used along with `groupby` that is part of the ``bedtools`` tools suite. GO dataset is joined to dataset on with gene relative divergences. The average for every GO term is then calculated omitting low prevalence GO terms. .. code-block:: bash src/divergence_by_go.sh \ $divergence \ $go_db \ $minnumgenes \ $div_go **4. Prepare a barplot showing results of the GO enrichment analysis** To plot the results of the GO enrichment analysis ``Rscript`` is used. Library ``ggplot2`` is the most suitable tool to provide fast and efficient plot. .. code-block:: bash Rscript src/plot.R Alternatively, we can open the ``.R`` file in R Studio and plot the graph there. Resulting ggplot graph ^^^^^^^^^^^^^^^^^^^^^^ .. image:: _static/go-enrichment.jpg Run the whole pipeline at once ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Now we can try to run the whole pipeline at once using ``pipeline.sh`` shell script using different set of parameters. .. code-block:: bash # Filtering parameters quality=50 readdepth=10 minnumgenes=30 # GO enrichment by high and low divergence regions div_go=$wd_go/divergence_by_go-ver2.txt # Run the pipeline ./pipeline.sh \ $quality \ $readdepth \ $minnumgenes \ $sourcevcf \ $annotation \ $divergencevcf \ $cds_db \ $divergence \ $go_db \ $div_go