Session 5: Graphics session
===========================

.. pull-quote:: A picture is worth a thousand words.

Especially when your data is big. We'll try to show you one of the easiest
ways to get nice pictures from your unix. We'll be using R, but we're not
trying to teach you R. There is a lot of packages in R Project, and most
of them are not that godo. We tried to cherry pick the best bits;)

Summarization
^^^^^^^^^^^^^
R is best for working with 'tables'. That means data, where each line contains
the same amount of 'fields', delimited by some special character like ``;`` or
``\t`` (tab character). The first row can contain column names. VCF is almost a
nice tabular file. The delimiter is ``\t``, but there is some mess in the
beginning of the file::

  </data-shared/mus_mda/00-popdata/popdata_mda.vcf.gz zcat | less -S

Prepare the input file
----------------------
Our input dataset is huge, and we want to use only some subset of the animals.
Let's choose few  european individuals. They have ``RDS``, ``KCT``, ``MWN`` and
``BAG`` in their names. Each programmer is lazy and prefers to avoid mistakes
by letting the machine do the work - let's find the column numbers with unix
tools.

.. code-block:: bash

   # we'll be reusing the same long file name, store it into a variable
   IN=/data-shared/mus_mda/00-popdata/popdata_mda.vcf.gz

   # create a new 'project' directory in data
   mkdir -p ~/projects/plotvcf/data
   cd ~/projects/plotvcf

   # check the file once again, and find the obligatory VCF columns
   <$IN zcat | less -S

   # let the computer number the colums (colnames)
   <$IN zcat |
     grep -v '^##' |
     head -1 |
     tr "\t" "\n" |
     nl |
     less

   # the obligatory columns are 1-9

   # now find the sample columns according to the pattern
   <$IN zcat |
     grep -v '^##' |
     head -1 |
     tr "\t" "\n" |
     nl |
     grep -e RDS -e KCT -e MWN -e BAG |
     awk '{print $1}' |
     paste -sd,

   # it's 67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85

   # now decompress with zcat and subset the data with cut
   <$IN zcat |
     cut -f1-9,67-85 \
   > data/popdata_mda_euro.vcf

We want to get rid of the comment lines starting with ``##``, and keep the
line starting with ``#`` as column names (getting rid of the ``#`` itself):

.. code-block:: bash

   IN=data/popdata_mda_euro.vcf

   # get rid of the '##' lines (quotes have to be there, otherwise
   # '#' means a comment in bash)
   <$IN grep -v '##' | less -S

   # better formatted output
   <$IN grep -v '##' | column -t | less -S

   # it takes too much time, probably column -t is reading too much of the file
   # we're good with first 1000 lines
   <$IN grep -v '##' | head -1000 | column -t | less -S

   # good, now trim the first #
   <$IN grep -v '##' | tail -c +2 | less -S

   # all looks ok, store it (tsv for tab separated values)
   # skip the very first character with tail
   <$IN grep -v '##' | tail -c +2 > data/popdata_mda_euro.tsv

.. admonition:: What have we learned?

  - unix tools excell at handling very large unstructured data
  - R is awesome at visualizatoin, but it needs smaller, tabular data
  - to get the best from both worlds, use unix tools to preprocess data for R
  - ``column -t`` can be used to preview tabular data in unix

Visualize data in RStudio
^^^^^^^^^^^^^^^^^^^^^^^^^
Just click this link (ctrl-click to keep this manual open): `Open RStudio <https://ngs-course.behavio.dev>`_.

In R Studio choose :menuselection:`File --> New project ...`. Locate your
project folder ``plotvcf`` in the dialog. Then
:menuselection:`File --> New file --> R Script`. R has it's own *working
directory*, opening/creating a project will set it automatically for you.

Load and check the input
------------------------
We'll be using a specifc subset of R known as `Tidyverse <http://tidyverse.org/>`_::

  library(tidyverse)

.. note::

  You can run the command by placing your caret in it and pressing ``ctrl+enter``.

Now press :kbd:`ctrl+s`, and save your script as ``plots.R``. It is a better
practice to write all your commands in the script window, and execute with
:kbd:`ctrl+enter`. You can comment them easily, you'll find them faster than in
``.Rhistory``.

Tabular data is loaded by ``read_tsv``. On a new line,
type ``read_tsv`` and press :kbd:`F1`. Help should pop up::

  read_tsv('data/popdata_mda_euro.tsv') -> d

A new entry should show up in the 'Environment' tab. Click the arrow and
explore. Also click the ``d`` letter itself.

You can see that ``CHROM`` was encoded as a number only and it was loaded as
``integer``. But in fact it is a factor, not a number (remember e.g.
chromosome X). Fix this in the ``mutate`` command, loading the data again
and overwriting ``d``. The (smart) plotting would not work well otherwise::

  read_tsv('data/popdata_mda_euro.tsv') %>%
    mutate(CHROM = as.factor(CHROM)) ->
    d

First plot
----------

We will use the ``ggplot2`` library. The 'grammatical' structure of the
command says what to plot, and how to represent the values. Usually the
``ggplot`` command contains the reference to the data, and graphic elements
are added  with ``+ geom_..()``. There are even some sensible defaults - e.g.
``geom_bar`` of a factor sums the observations for each level of the factor::

  ggplot(d, aes(CHROM)) + geom_bar()

This shows the number of variants in each chromosome. You can see here, that
we've included only a subset of the data, comprising chromosomes 2 and 11.

Summarize the data
------------------
We're interested in variant density along the chromosomes. We can simply
break the chromosome into equal sized chunks, and count variants in each of them
as a measure of density.

There is a function ``round_any`` in the package ``plyr``, which given
precision rounds the numbers. We will use it to round the variant position to
1x10^6 (million base pairs), and then use this rounded position as the block
identifier. Because the same positions repeat on each chromosome, we need to
calculate it once per each chromosome. This is guaranteed by ``group_by``.
``mutate`` just adds a column to the data.

You're already used to pipes from the previous exercises. While it's not
common in R, it is possible to build your commands in a similar way thanks to
the ``magrittr`` package. The name of the package is an homage to the Belgian
surrealist René Magritte and his most popular painting.

.. figure:: _static/magritte.jpg
   :align: center

   Ceci n'est pas une pipe. This is not a pipe.

Although the magrittr ``%>%`` operator is not a pipe, it behaves like one. You
can chain your commands like when building a bash pipeline:

.. code-block:: r

   # 'bash-like' ordering (source data -> operations -> output)
   d %>%
     group_by(CHROM) %>%
     mutate(POS_block = plyr::round_any(POS, 1e6)) ->
     dc

   # the above command is equivalent to
   dc <- mutate(group_by(d, CHROM), POS_block = plyr::round_any(POS, 1e6))


Now you can check how the ``round_any`` processed the ``POS`` value. Click the
``dc`` in the **Environment** tab and look for ``POS_block``. Looks good, we can go on.
The next transformation is to count variants (table rows) in each block (per chromosome):
You can use ``View`` in **R Studio** instead of ``less`` in bash.

.. code-block:: r

   dc %>%
    group_by(CHROM, POS_block) %>%
    summarise(nvars = n()) %>%
    View

.. note:: To run the whole block at once with ``ctrl+enter``, select it before you press the shortcut.

If the data look like you expected, you can go on to plotting:

.. code-block:: r

   dc %>%
     group_by(CHROM, POS_block) %>%
     summarise(nvars = n()) %>%
     ggplot(aes(POS_block, nvars)) +
       geom_line() +
       facet_wrap(~CHROM, ncol = 1)

Now you can improve your plot by making the labels more comprehensible:

.. code-block:: r

   dc %>%
     group_by(CHROM, POS_block) %>%
     summarise(nvars=n()) %>%
     ggplot(aes(POS_block, nvars)) +
       geom_line() +
       facet_wrap(~CHROM, ncol = 1) +
       ggtitle("SNP denisty per chromosome") +
       ylab("number of variants") +
       xlab("chromosome position")

If you prefer bars instead of a connected line, it's an easy swap with ggplot.

.. code-block:: r

   dc %>%
     group_by(CHROM, POS_block) %>%
     summarise(nvars = n()) %>%
     ggplot(aes(POS_block, nvars)) +
       geom_col() +
       facet_wrap(~CHROM, ncol = 1) +
       ggtitle("SNP denisty per chromosome") +
       ylab("number of variants") +
       xlab("chromosome position")

This could have saved us some more typing:

.. code-block:: r

   ggplot(d, aes(POS)) +
     geom_histogram() +
     facet_wrap(~CHROM, ncol = 1) +
     ggtitle("SNP denisty per chromosome") +
     ylab("number of variants") +
     xlab("chromosome position")


``ggplot`` warned you in the **Console**::

  stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

You can use ``binwidth`` to adjust the width of the bars, setting it to 1x10^6
again:

.. code-block:: r

   ggplot(d, aes(POS)) +
     geom_histogram(binwidth=1e6) +
     facet_wrap(~CHROM, ncol = 1) +
     ggtitle("SNP denisty per chromosome") +
     ylab("number of variants") +
     xlab("chromosome position")

.. image:: _static/snp_density.png
   :align: center

.. admonition:: What have we learned?

  - there is a nice way to produce graphics directly on the remote machine
  - it's easy to visualize data when you prepare the right inputs ;)
  - you can use the 'pipe ordering' of commands familiar from unix also in R

Tidy data
^^^^^^^^^
To create plots in such a smooth way like in the previous example the data has
to loosely conform to some simple rules. In short - each column is a variable,
each row is an observation. You can find more details in the
`Tidy data <http://vita.had.co.nz/papers/tidy-data.html>`_ paper.

The vcf as is can be considered `tidy` when using the ``CHROM`` and ``POS``
columns. Each variant (SNP) is a row. The data is not tidy when using variants
in particular individuals. All individual identifiers should be in single
column (variable), but there are several columns with individual names. This
is not 'wrong' per se, this format is more concise. But it does not work well
with ``ggplot``.

Now if we want to look at genotypes per individual, we need the genotype as a
single  variable, not 18. ``pivot_longer`` takes the values from multiple columns
and gathers them into one column. It creates another column where it stores
the originating column name for each value.

.. code-block:: r

  d %>%
    pivot_longer(10:28, names_to = "individual", values_to = "genotype") ->
    dm

Look at the data. Now we can plot the counts of reference/heterozygous/alternative
alleles easily.

.. code-block:: r

  ggplot(dm, aes(individual, fill = genotype)) + geom_bar()

Again, most of the code is (usually) spent on trying to make the plot look better:

.. code-block:: r

  ggplot(dm, aes(individual, fill = genotype)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1))

.. image:: _static/genotypes.png

Now try to change parts of the command to see the effect of various parts. Delete
``, fill = genotype`` (including the comma), execute. A bit boring. We can get much
more colours by colouring each base change differently:

.. code-block:: r

  # base change pairs, but plotting sometnihg else than we need (probably)
  ggplot(dm, aes(individual, fill = paste(REF, ALT))) + geom_bar()

What could be interesting is the transitions to transversions ratio, for each
individual:

.. code-block:: r

  # transitions are AG, GA, CT, TC
  # transversions is the rest
  transitions <- c("A G", "G A", "C T", "T C")

  # count each homozygous ref as 0,
  # heterozygous as 1 and homozygous alt as 2
  # filter out uncalled
  dm %>%
    filter(genotype != './.') %>%
    mutate(vartype = paste(REF, ALT) %in% transitions %>% ifelse("Transition", "Transversion"),
           score = ifelse(genotype == '0/0', 0, ifelse(genotype == '0/1', 1, 2))) %>%
    group_by(individual, vartype) %>%
    summarise(score = sum(score)) %>%
    pivot_wider(names_from = "vartype", values_from = "score") %>%
    mutate(TiTv = Transition / Transversion) %>%
    ggplot(aes(individual, TiTv)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1))

.. note::

  Take your time to look at the `wonderful cheat sheets
  <https://www.rstudio.com/resources/cheatsheets/>`_ compiled by the company
  behind RStudio!

.. admonition:: What have we learned?

  - to visualize different aspects of your dataset with ggplot, reshape the data before visualization
  - there is a particular way of reshaping that works well with ``ggplot``