Session 3: Plain text file processing in Unix¶
This session focuses on plain text file data extraction/modification using built-in Unix tools.
Pattern search & regular expressions¶
The grep command can be used to efficiently match and retrieve patterns in text files.
grep pattern file.txt # Returns lines matching a pattern
grep -v pattern file.txt # Returns lines not matching a pattern
grep -E regex file.txt # Returns lines matching an extended regular expression
grep -c pattern file.txt # Returns number of lines matching a pattern
grep -B 1 pattern file.txt # Returns matching lines together with 1 line of context before each match
grep -o pattern file.txt # Returns only matching part of lines
man grep # For other options
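As a quick illustration, here is how a few of these options behave on a tiny made-up file (the file name and its content are hypothetical):

```shell
# Create a tiny sample file (hypothetical content, for illustration only)
printf 'chr1\t100\nchr2\t200\nchrZ\t300\n' > sample.txt

grep 'chr1' sample.txt           # prints only the chr1 line
grep -v 'chr1' sample.txt        # prints the chr2 and chrZ lines
grep -c 'chr' sample.txt         # prints 3 - all lines match
grep -o 'chr[0-9Z]*' sample.txt  # prints just the matching parts: chr1, chr2, chrZ

rm sample.txt
```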
But what if we want to match a variable pattern, e.g. of differing length or content? Regular expressions are exactly the tool we need.
^A # match A at the beginning of line
A$ # match A at the end of line
[0-9] # match numerical characters
[A-Z] # match uppercase alphabetical characters
[ATGC] # match A or T or G or C
. # match any character
A* # match A letter 0 or more times
A{2} # match A letter exactly 2 times
A{1,} # match A letter 1 or more times
A+ # match A letter 1 or more times (extended regular expressions)
A{1,3} # match A letter at least 1 times but no more than 3 times
AATT|TTAA # match AATT or TTAA
\s # match whitespace (also TAB)
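To see these building blocks in action, try them with grep -E on a few short strings piped in from printf (the sequences here are made up):

```shell
printf 'ACGT\nAAAA\nTTAA\n' | grep -E '^A'         # lines beginning with A: ACGT and AAAA
printf 'ACGT\nAAAA\nTTAA\n' | grep -E 'A{3,}'      # three or more consecutive As: AAAA
printf 'AATT\nTTAA\nGGCC\n' | grep -E 'AATT|TTAA'  # alternation: AATT and TTAA
```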
grep -E searches for patterns written in a mini-language called regular expressions. You can use the egrep shorthand, which means the same thing.
Word and line count¶
The wc command stands for word count and provides quick summary statistics on plain text file content: bytes, words and lines can be counted in the given files.
wc -c file.txt # Number of bytes in a file
wc -w file.txt # Number of words in a file
wc -l file.txt # Number of lines in a file
To calculate the number of lines in every file listed:
wc -l *.txt
- Count the number of variants in the file
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
wc -l
- Count the number of variants passing/failing the quality threshold
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'PASS' |
wc -l
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'FAIL' |
wc -l
- Count the number of variants on the chromosome Z passing the quality threshold
< /data-shared/vcf_examples/luscinia_vars_flags.vcf.gz zcat |
grep -v '^#' |
grep 'PASS' |
grep '^chrZ\s' |
wc -l
Retrieve & count unique records¶
Often we face the question of how many unique records a file contains, or how many instances of every unique item there are. Unix provides an efficient way to cut out the desired columns (cut) and retrieve the unique records of a selected column (sort -u). Additionally, we can count the instances of each record (uniq -c). A typical example of use in bioinformatics is counting the number of genes or SNPs per contig or chromosome.
To select specific columns the cut command can be used. By default, cut uses TAB as the field separator; another delimiter can be set with the -d flag. Writing the TAB delimiter explicitly as -d $'\t' also works and makes the intent clear in scripts. When all columns except a specific column(s) are supposed to be selected, the --complement flag can be used.
cut -f1-3 file.txt
cut -d $'\t' -f1-3 file.txt
cut --complement -f4 file.txt # Select all columns except column 4
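A minimal demonstration on a made-up TAB-separated line (--complement is a GNU cut extension):

```shell
# A made-up TAB-separated table with four columns
printf 'chr1\t100\t200\tgeneA\n' > table.txt

cut -f1-3 table.txt             # chr1  100  200
cut --complement -f4 table.txt  # the same: everything except column 4 (GNU cut)

rm table.txt
```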
The content of a file can be sorted based on a specified column (-k1,1) or a range of columns (-k1,3). When the data needs to be sorted numerically (-n) or in reverse order (-r), the appropriate flags have to be added. By default sort treats any whitespace as a field separator. When TAB is used as the separator and has to be distinguished from other whitespace in the file, the -t $'\t' flag has to be set explicitly.
sort -k1,1 file.txt # Sort based on first column
sort -k1,1 -k2,2nr file.txt # Sort first based on first column, then on second column numerically in reversed order
sort -k1,3 file.txt # Sort based on range of columns
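The combined flags are easiest to see on a toy example (made-up values):

```shell
# Sort by column 1 (text, ascending), then column 2 (numeric, descending)
printf 'chr2\t5\nchr1\t10\nchr1\t2\n' | sort -k1,1 -k2,2nr
# Output order: chr1 10, chr1 2, chr2 5
```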
The sort command can also be used to retrieve the unique records using the -u flag. When counts of instances of every unique item are needed, the uniq -c command should be used in combination with sort, as the records have to be sorted before being sent to uniq.
sort -u file.txt # Retrieve unique records in the file
< file.txt sort | uniq -c # Count number of instances for every unique item
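Put together, the sort | uniq -c idiom yields a frequency table; here on a made-up column of chromosome names:

```shell
# Count occurrences of each item, most frequent item last
printf 'chr1\nchr2\nchr1\nchrZ\nchr1\n' | sort | uniq -c | sort -k1,1n
# chr2 and chrZ appear once each, chr1 three times
```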
Exercise: Which chromosome has the highest and which the lowest number of variants?
< data-shared/luscinia_vars_flags.vcf grep -v '^#' |
cut -f1 |
sort |
uniq -c |
sort -k1,1n
Exercise: Get the first six base pairs from every read
cat *.fastq |
grep -E "^[ACGT]+$" |
cut -c1-6 |
sort |
uniq -c |
sort -k1,1nr |
less
String extraction and replacement¶
Another common task in bioinformatics is string extraction and/or replacement. Often we need to extract a specific piece of information from complex data. A typical example is the extraction of a specific value according to a TAG in a gff3 or VCF file. As positions of TAGs can differ from line to line, using the cut command is simply not possible; instead, the TAG has to be matched with sed, using a regex to capture the appropriate pattern.
Another typical task is replacing delimiters. The tr command is very well suited for this. The -d flag can be used to remove specific characters from the file, and whole character classes can be replaced, which can for instance be used to change uppercase to lowercase or vice versa.
For extraction of repeating strings grep -o is the most efficient tool. In bioinformatics it can easily be used, for instance, to match and retrieve microsatellites from a sequence.
Note
Difference between sed
and tr
:
tr
(from TRansliterate) replaces (or deletes) individual characters:
Ideal for removing line ends (tr -d "\n"
) or replacing some
separator to TAB (tr ";" "\t"
).
sed
replaces (or deletes) complex patterns.
Typical usage of tr
is as follows:
< file.txt tr "\t" ";" # Replace the TAB separator with a semicolon (tr reads standard input)
< file.txt tr -d "\n" # Remove newline characters (``\n``)
tr "[A-Z]" "[a-z]" # Replace uppercase to lowercase
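A quick check of these commands on inline input (remember that tr reads standard input only):

```shell
printf 'a\tb\tc\n' | tr '\t' ';'   # a;b;c
printf 'one\ntwo\n' | tr -d '\n'   # onetwo (and no trailing newline!)
echo 'ACGT' | tr 'A-Z' 'a-z'       # acgt
```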
Exercise: What is the number of samples in the VCF file?
< data-shared/luscinia_vars_flags.vcf grep -v '^##' |
head -n1 |
cut --complement -f1-9 |
tr "\t" "\n" |
wc -l
To match a specific string in a file, sed can use a regex similar to the grep command. However, to use the full regex functionality and the simpler regex syntax, the extended regular expression flag -r (-E on Mac OS X) has to be used.
A comparison of standard sed and sed with extended regular expressions (sed -r):
sed 's/pattern/replacement/'
# Replace one or more A or C or G or T by N
# Standard sed
sed 's/^[ACGTN]\{6\}/NNNNNN/'
# The same thing using extended regular expressions:
sed -r 's/^[ACGTN]{6}/NNNNNN/'
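Running the extended-regex variant on a made-up read shows the effect (GNU sed shown; on Mac OS X use sed -E instead of sed -r):

```shell
# The first six bases get masked with Ns
echo 'ACGTACGTAA' | sed -r 's/^[ACGTN]{6}/NNNNNN/'   # NNNNNNGTAA
```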
sed can also be used for string extraction. The matched substring designated for extraction has to be marked with round brackets (string) and passed to the output with the notation \#, where the # character stands for the position of the group within the matched string, starting with 1 (i.e. there can be multiple extractions from one matched string).
# Returns TTTGGG
echo 'AAATTTCCCGGG' | sed -r 's/A+(T+)C+(G+)/\1\2/'
Note
sed (the text Stream EDitor) can do a lot of things; however, pattern replacement and extraction are the best things to use it for. The ‘sed language’ consists of single-character commands, and it is no fun to code and even less fun to read (what does sed 'h;G;s/\n//' do? ;)).
Use awk
for more complex processing (see next session).
grep -o prints only the matching parts of each line. This can be used to extract repeating patterns (i.e. it is very useful for the extraction of microsatellite sequences).
# Match AT di-nucleotide twice or more times
grep -o -E "(AT){2,}"
# Match GTC tri-nucleotide twice or more times
grep -o -E "(GTC){2,}"
# Match any repeating pattern
grep -o -E "([ATGC]{1,})\1+"
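On a short made-up sequence the repeat patterns pull out exactly the repeated stretch:

```shell
echo 'GGATATATCCGTCGTCAA' | grep -o -E '(AT){2,}'   # ATATAT
echo 'GGATATATCCGTCGTCAA' | grep -o -E '(GTC){2,}'  # GTCGTC
```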
Exercise: What is the number of SNPs per chromosome in the VCF file?
Don’t use the cut command!
FILE=/data-shared/vcf_examples/luscinia_vars_flags.vcf.gz
< $FILE zcat |
grep -o -E '^chr[Z1-9]+' |
sort |
uniq -c |
sort -k1,1nr
Exercise: Microsatellites statistics
Extract all AT dinucleotides repeating at least twice and calculate their frequency distribution in the whole dataset.
cat *.fastq |
grep -E "^[ACGT]+$" |
grep -o -E "(AT){2,}" |
sort |
uniq -c |
less
Join & paste data¶
The final part of this session is joining and pasting data. Here we seek to merge multiple files into one. The join command corresponds to the standard JOIN known from the SQL language. It joins two files based on a specified key column; it assumes that both files contain a column holding keys that are present in both files. Both files must be sorted by the key before any joining.
The join command has the same functionality as standard SQL JOIN, meaning that it supports INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL OUTER JOIN (join types). By default the first column in both input files is considered the key. As already mentioned, the key column needs to be sorted the same way in both files.
sort -k1,1 file1.txt > file1.tmp
sort -k1,1 file2.txt > file2.tmp
join file1.tmp file2.tmp > joined-file.txt
If the key column is at a different position, it needs to be specified using the -1 and -2 flags:
sort -k2,2 file1.txt > file1.tmp # key column on the 2nd position
sort -k3,3 file2.txt > file2.tmp # key column on the 3rd position
join -1 2 -2 3 file1.tmp file2.tmp > joined-file.txt
To make join print unpaired lines (corresponding to left, right and full outer joins), the file to take unpaired lines from has to be given with the -a flag. The -e flag sets the value used for missing fields; with GNU join it takes effect together with an output format such as -o auto.
# Left join
join -a1 -e NA -o auto file1.tmp file2.tmp > joined-file.txt
# Right join
join -a2 -e NA -o auto file1.tmp file2.tmp > joined-file.txt
# Full outer join
join -a1 -a2 -e NA -o auto file1.tmp file2.tmp > joined-file.txt
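A self-contained sketch with two made-up key-value files (file names and values are hypothetical; -o auto is a GNU join extension that pads missing fields with the -e value):

```shell
# Hypothetical data: gene lengths in one file, read counts in another
printf 'geneA 100\ngeneB 200\n' | sort -k1,1 > lengths.tmp
printf 'geneB 7\ngeneC 3\n'     | sort -k1,1 > counts.tmp

join lengths.tmp counts.tmp                        # inner join: geneB 200 7
join -a1 -a2 -e NA -o auto lengths.tmp counts.tmp  # full outer join, missing fields become NA

rm lengths.tmp counts.tmp
```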
Another command that can be used to merge two or more files is paste. As opposed to join, paste simply aligns the files column-wise (corresponding to cbind in R). No key column is needed, as a one-to-one correspondence between the lines of the files is assumed.
paste file1.txt file2.txt > file-merged.txt
The paste command can also be used for smart file transpositions. By default paste expects multiple input files, one per output column. However, when only one file is provided and the remaining file positions are filled with -, paste takes the further columns from the next lines of that single file. This makes it possible to transpose multiple lines into one line.
Example:
file.txt
item-line1
item-line2
item-line3
item-line4
< file.txt paste - -
item-line1 item-line2
item-line3 item-line4
This feature can be used in bioinformatics to convert ``.fastq`` files into TAB-separated ``.tab`` files with one read per line. We will use this functionality in an upcoming session.
Exercise: Convert FASTQ file to TAB separated file with each read on one line
cat *.fastq |
paste - - - - |
cut --complement -f3 \
> reads.tab
Exercise¶
How many bases were sequenced?¶
wc
can count characters (think bases) as well. But to get a reasonable number,
we have to get rid of the other lines that are not bases.
One way to do it is to pick only the lines consisting of the letters A, C, G, T and N. There is a ubiquitous mini-language called regular expressions that can be used to define text patterns. A line consisting only of a few possible letters is such a text pattern. grep is the basic tool for using regular expressions:
cat *.fastq | grep '^[ACGTN]*$' | less -S
Check if the output looks as expected. This is a very common way to work - build a part of
the pipeline, check the output with less
or head
and fix it or add more commands.
Now a short explanation of the ^[ACGTN]*$ pattern (grep works one line at a time):
- ^ marks the beginning of the line - otherwise grep would search anywhere in the line
- the square brackets ([]) are a character class, meaning one character from the list; [Gg]rep matches Grep and grep
- the * is a count suffix for the square brackets, saying there should be zero or more of such characters
- $ marks the end of the line - that means the whole line has to match the pattern
To count the bases read, we extend our pipeline:
cat *.fastq | grep '^[ACGTN]*$' | wc -c
The thing is that this count is not correct. wc -c
counts every character,
and the end of each line is marked by a special character written as \n
(n
for newline). To get rid of this character, we can use another tool, tr
(transliterate). tr
can substitute one letter with another (imagine you
need to lowercase all your data, or mask lowercase bases in your Fasta file).
Additionally tr -d
(delete) can remove characters:
cat *.fastq | grep '^[ACGTN]*$' | tr -d "\n" | wc -c
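The whole pipeline can be verified on a tiny made-up FASTQ record: only the sequence line survives the grep, and stripping the newlines leaves exactly the base count:

```shell
# One hypothetical read of 8 bases; the header, plus and quality lines are filtered out
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' | grep '^[ACGTN]*$' | tr -d '\n' | wc -c   # 8
```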
Note
If you like regular expressions, you can hone your skills at http://regex.alf.nu/.