The V4 hypervariable region is part of the 16S rRNA (ribosomal RNA) gene, which encodes the RNA component of the 30S small ribosomal subunit found in prokaryotic cells. Sections of this gene are conserved across the genomes of bacterial species, and variation within the less conserved (hypervariable) regions is used to reconstruct phylogeny. Hence, highly complex bacterial communities are commonly characterised using 16S rRNA gene sequencing.

Operational taxonomic units (OTUs) are a way of grouping 16S rRNA gene sequences based on their sequence similarity. These OTUs are then compared to a reference database to infer likely taxonomy. It is therefore important to select a similarity threshold for defining OTUs that can properly distinguish between genuine variation in the 16S sequence and artificial variation introduced through sequencing error. There is some debate over what the best threshold is; however, I have decided to use the default threshold of 97% (as recommended by QIIME at the time of writing). Furthermore, experimentation with different (i.e. more stringent) threshold settings did not yield significant differences in the total number of sequences.
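
As a rough illustration of what a similarity threshold does, the sketch below groups sequences greedily: each sequence joins the first existing OTU whose representative it matches at or above the threshold, and otherwise starts a new OTU. This is not QIIME's clustering algorithm. The function names (`identity`, `greedy_otus`) and the toy sequences are hypothetical, and the sketch assumes pre-aligned sequences of equal length so that identity is simply the fraction of matching positions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU representative it matches.

    Returns (representatives, assignments), where assignments[i] is the
    index of the OTU that seqs[i] was placed in.
    """
    representatives = []
    assignments = []
    for seq in seqs:
        for otu_index, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                assignments.append(otu_index)
                break
        else:
            # No existing OTU is similar enough, so start a new one.
            representatives.append(seq)
            assignments.append(len(representatives) - 1)
    return representatives, assignments


# Toy data (hypothetical): a 100-base sequence and two variants of it.
base = "ACGT" * 25
close_variant = "T" + base[1:]        # 1 mismatch  -> 99% identity
distant_variant = "TTTTT" + base[5:]  # 4 mismatches -> 96% identity

representatives, assignments = greedy_otus([base, close_variant, distant_variant])
print(assignments)  # [0, 0, 1]
```

At 97%, the single-mismatch variant is absorbed into the first OTU, whereas the 96% variant founds its own. A more stringent threshold would split such groups further.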

The Illumina platform is generally considered accurate, but it is still prone to sequencing errors, especially towards the ends of reads (for both forward and reverse reads). Therefore, it is important to perform quality control (QC) to remove low-quality sequences, and to truncate sequences where they drop below a specified quality score.

The Phred quality score is a measure of accuracy for a base in a sequence; it is a logarithmic encoding of the probability that a base call is incorrect. A Phred score of 30 means that the probability of a correct base call is 99.9% for a given position. Therefore, an average of >Q30 is very high quality.

PCR chimeras form when two or more biological sequences join together. This is rare in shotgun sequencing, but common in Illumina 16S rRNA gene amplicon sequencing, where closely related sequences are amplified together. The majority of chimeras are thought to arise from incomplete extension, whereby a partially extended strand binds to a template derived from a different but similar sequence. The partial strand then acts as a primer that is extended to form a chimeric sequence.

In 16S sequencing, only a small fraction of reads are chimeric (~1-5%). However, when the reads are clustered into OTUs, a much larger fraction of the OTUs could be chimeric. It is thought to be very difficult (if not almost impossible) to distinguish between chimeras and true sequences, even with no sequencing errors and a complete reference database. This is because of "fake models", where a correct sequence can be constructed as a chimera from two other correct sequences.

Chimeric sequences are a problem in 16S rRNA gene amplicon sequencing. However, I have chosen not to remove chimeric sequences from my dataset because (1) it is very difficult to determine what is and is not a chimeric sequence, so I find the algorithms used by software that claims otherwise questionable; (2) I do not want to discard sequences that may be true OTUs; and (3) a couple of students have run chimera checking, which removed only 80 sequences, suggesting that chimeras are not a significant proportion of the total number of sequences.
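
The "fake model" problem can be made concrete with a toy example. The sketch below checks whether a query sequence can be written as the left part of one sequence joined to the right part of another at a single crossover point. The function name `find_crossover` and the three sequences are hypothetical, and exact matching with one crossover is a simplifying assumption; real chimera detectors score alignments instead of requiring exact matches.

```python
from typing import Optional


def find_crossover(query: str, parent_a: str, parent_b: str) -> Optional[int]:
    """Return the first position k where query == parent_a[:k] + parent_b[k:].

    Returns None if the query is identical to either parent, if lengths
    differ, or if no single crossover reproduces the query exactly.
    """
    if not (len(query) == len(parent_a) == len(parent_b)):
        return None
    if query in (parent_a, parent_b):
        return None
    for k in range(1, len(query)):
        if parent_a[:k] + parent_b[k:] == query:
            return k
    return None


# Three toy sequences (hypothetical). Suppose all three are genuine.
seq_a = "GATTACAGATTACA"
seq_b = "GATCACAGATCACA"
seq_c = "GATTACAGATCACA"  # matches seq_a on the left, seq_b on the right

k = find_crossover(seq_c, seq_a, seq_b)
print(k)  # 4
```

Here `seq_c` is indistinguishable, on sequence evidence alone, from a chimera of `seq_a` and `seq_b`, which is why removing it could discard a true OTU.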
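
For the quality-control step described earlier, the relationship between a Phred score Q and the error probability P is Q = -10 log10(P). The sketch below converts scores to probabilities and truncates a read at the first base that falls below a cutoff. The function names and the cutoff of 20 are assumptions for illustration, not the settings used in this analysis.

```python
def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def truncate_at_low_quality(seq: str, quals: list, min_q: int = 20):
    """Cut the read just before the first base whose score is below min_q.

    min_q=20 is an illustrative default, not a recommended value.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i], quals[:i]
    return seq, quals


# Q30 corresponds to an error probability of about 0.001 (99.9% accuracy).
print(round(1 - phred_to_error_prob(30), 4))  # 0.999

# Toy read (hypothetical): the fourth base drops to Q12, so the read is cut there.
read, scores = truncate_at_low_quality("ACGTAC", [35, 34, 33, 12, 30, 31])
print(read, scores)  # ACG [35, 34, 33]
```

Truncating at the first low-quality base, instead of only masking it, reflects the observation that Illumina quality tends to degrade towards the ends of reads.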