Part A: The workflow for analysis of independent groups of samples
Collapsing
technical replicates and imputation of missing data -
make_Ct_ready()
function
Data normalization using reference gene (a pairwise approach)
Quality control and filtering of transformed Ct data (a pairwise approach)
Relative quantification: 2-ddCt method (a pairwise approach)
Final visualisations (a pairwise approach)
FCh_plot()
function (a pairwise approach)results_volcano()
function (a pairwise approach)results_boxplot()
function (a pairwise approach)results_barplot()
function (a pairwise approach)results_heatmap()
function (a pairwise approach)parallel_plot()
function (a pairwise approach)RQdeltaCT
is an R package developed to perform relative
quantification of gene expression using delta Ct methods, proposed by
Kenneth J. Livak and Thomas D. Schmittgen in Article1
and Article2.
These methods were designed to analyse gene expression data (Ct values) obtained from real-time PCR experiments. The main idea is to normalise gene expression values using endogenous control gene, present gene expression levels in linear form by using the 2-(value) transformation, and calculate differences in gene expression levels between groups of samples (or technical replicates of a single sample).
There are two main delta Ct methods used for relative quantification. The choice of the best method depends on the study design. A short description of these methods is provided below; for more details, refer to articles in links provided above.
2-dCt method. In this method, Ct values are normalised by the endogenous control gene (often GAPDH, beta-actin, or other) by subtracting the Ct value of the endogenous control in each sample from the Ct value of the gene of interest in the same samples, obtaining delta Ct (dCt) values. Subsequently, the dCt values are transformed using the 2-dCt formula, summarised by means in the compared study groups, and a ratio of means (fold change) is calculated for a study group. This method is useful in scenarios where samples should be analysed as individual data points, e.g., in comparison between patients and healthy subjects. See example no 5. in Article2.
2-ddCt method. Similarly to the 2-dCt method, Ct values are normalised by endogenous control gene, but the obtained delta Ct (dCt) values are not exponentially transformed, but summarised by means in the compared study groups, and the mean dCt in a control group is subtracted from the mean dCt in a study group, giving the delta delta Ct (ddCt) value. Subsequently, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change value (also called the RQ value). This method is useful where a compared groups contain rather technical than biological replicates, e.g. where samples of cell line before adding stimulant are compared to samples of the same cell line after stimulation. See examples no. 1 and 2 in Article2.
Presented RQdeltaCT
package includes functions that
encompass both of these methods either for comparison of independent
groups of samples or groups with a paired samples (pairwise analysis).
The selection of a suitable method for analysis is up to the user.
To install and load RQdeltaCT
package simply run:
The functions developed within the RQdeltaCT
package are
designed to be maximally easy to use, even for the users who are
beginners to R. The parameters of functions were prepared to
sufficiently range all essential tasks and options, and no additional,
extensive coding steps are necessary in standard workflow. The package
was developed with the intention to be user-friendly and provide an
opportunity to perform relative quantification analysis of gene
expression using RQdeltaCT
package by non-experts in R
programming (only a very basic programming skills are required).
The entire standard workflow of analysis performed using the
RQdeltaCT
package requires the following external
packages:
tidyverse
- main package for data processing
(dplyr
, tidyr
) and visualisation
(ggplot2
),coin
- used to perform the Mann-Whitney U test
(wilcox_test()
function),ctrlGene
- used to calculate gene expression stability
score using geNorm algorithm,ggsignif
- used to add significance labels to
plots,Hmisc
- used to perform correlation analysis
(rcor()
function),corrplot
- used to visualise results of correlation
analysis (corrplot()
function),ggpmisc
- used to add linear regression results to the
plot (stat_poly_eq()
function),pROC
- used to perform analysis (roc()
function),oddsratio
- used to compute odds ratio values
(or_glm()
function),GGally
- used to illustrate a pairwise changes in gene
expression (ggparcoord()
function).All plots created by functions of the RQdeltaCT
package
can be saved as a .tiff files in the working directory (if
save.to.tiff
parameter is set to TRUE). Image resolution,
dimensions, and name can be specified by the user. Also, all generated
tables can be saved as .txt files (if save.to.txt
parameter
is set to TRUE) with name specified by the user.
This vignette includes instructions and examples of usage of main
parameters of functions from RQdeltaCT
package. For all
list of parameters available and their usage, refer to documentation of
a particular function.
Data analysed with the RQdeltaCT
package should be in
tabular form and contain the following information: group names, sample
names, gene names, and Ct values. Flag information can also be included
for data filtering purposes. Any other information could exist in the
data (user do not have to remove them), but will not be used for
analysis.
Files with such tables typically can be exported from software coupled with PCR devices and used to analyse raw data files generated during real-time PCR experiments. Such files are also returned by external software, such as SDS, or even R packages, e.g. qpcR
For user convenience, the RQdeltaCT
package provides two
functions useful to import tables in .txt or csv. format: *
read_Ct_long()
- to import tables with long-format
structure (each information in columns), * read_Ct_wide()
-
to import tables with wide-format structure (samples by columns, genes
by rows).
NOTE: Imported tables must be free of empty lines.
read_Ct_long()
functionExample of a long-format table with a structure suitable for the
read_Ct_long()
function:
Group | Sample | Gene | Ct | Flag |
---|---|---|---|---|
Disease | Disease1 | Gene1 | 25.6 | OK |
Disease | Disease2 | Gene2 | 32.9 | Undetermined |
Control | Control1 | Gene1 | Undetermined | OK |
Control | Control2 | Gene2 | 27.5 | OK |
… | … | … | … | … |
For the purpose of presentation of the read_Ct_long()
function, this function will be used to import long-format .txt table
(data_Ct_long.txt) located in RQdeltaCt
package
directory:
# Set path to file:
path <- system.file("extdata",
"data_Ct_long.txt",
package = "RQdeltaCT")
# Import file using path; remember to specify proper separator, decimal character, and numbers of necessary columns:
library(RQdeltaCT)
library(tidyverse)
data.Ct <- read_Ct_long(path = path,
sep = "\t",
dec = ".",
skip = 0,
add.column.Flag = TRUE,
column.Sample = 1,
column.Gene = 2,
column.Ct = 5,
column.Group = 9,
column.Flag = 4)
Let’s look at the data structure:
str(data.Ct)
#> 'data.frame': 1288 obs. of 5 variables:
#> $ Sample: chr "Disease1" "Disease10" "Disease12" "Disease13" ...
#> $ Gene : chr "Gene1" "Gene1" "Gene1" "Gene1" ...
#> $ Ct : chr "32.563" "34.648" "35.059" "37.135" ...
#> $ Group : chr "Disease" "Disease" "Disease" "Disease" ...
#> $ Flag : num 1.38 1.35 1.34 1.2 1.39 ...
The data were imported properly; however, the Flag variable is numeric, but character or factor is required. The Flag column contains a numeric AmpScore parameter, which is often used to evaluate the quality of the amplification curve - curves with AmpScore below 1 are typically considered as low quality and removed from data during analysis. The Flag variable can be changed into character by transforming the numeric AmpScore parameter into a binary variable that contains values “OK” and “Undetermined” according to the applied AmpScore criterion:
library(tidyverse)
data.Ct <- mutate(data.Ct,
Flag = ifelse(Flag < 1, "Undetermined", "OK"))
str(data.Ct)
#> 'data.frame': 1288 obs. of 5 variables:
#> $ Sample: chr "Disease1" "Disease10" "Disease12" "Disease13" ...
#> $ Gene : chr "Gene1" "Gene1" "Gene1" "Gene1" ...
#> $ Ct : chr "32.563" "34.648" "35.059" "37.135" ...
#> $ Group : chr "Disease" "Disease" "Disease" "Disease" ...
#> $ Flag : chr "OK" "OK" "OK" "OK" ...
In this transformation, all AmpScore values in the Flag column that are below 1 were changed to “Undetermined”, otherwise to “OK”. The Flag variable now is character, as required, and the data are ready for further steps of analysis.
read_Ct_wide()
functionThe read_Ct_wide()
function was designed to import data
with gene expression results in the form of a wide-format table (with
sample names in the first row and gene names in the first column).
Because such a structure does not include names of groups, an additional
file is required, containing group names and assigned samples. This
second file must contain two columns: column named “Sample” with the
names of the samples and column named “Group” with the names of the
groups assigned to the samples. The names of the samples in this file
must be the same as the names of the columns in the file with Ct values
(the order does not have to be kept).
Example structure of a wide-format table suitable for the
read_Ct_wide()
function:
Gene | Sample1 | Sample2 | Sample3 | … |
---|---|---|---|---|
Gene1 | 25.4 | 24.9 | 25.6 | … |
Gene2 | 21.6 | 22.5 | 20.8 | … |
Gene3 | 33.7 | Undetermined | Undetermined | … |
Gene4 | 15.8 | 16.2 | 17.5 | … |
… | … | … | … | … |
Example structure of an additional file suitable for the
read_Ct_wide()
function:
Sample | Group |
---|---|
Sample1 | Control |
Sample2 | Control |
Sample3 | Disease |
Sample4 | Disease |
… | … |
In the following example, the read_Ct_wide()
function
was used to import and merge a wide-format file with Ct values
(data_Ct_wide.txt) and a file with names of groups (data_design.txt)
located in RQdeltaCt
package directory:
# Set paths to required files:
path.Ct.file <- system.file("extdata",
"data_Ct_wide.txt",
package = "RQdeltaCT")
path.design.file <- system.file("extdata",
"data_design.txt",
package = "RQdeltaCT")
# Import files:
library(tidyverse)
data.Ct <- read_Ct_wide(path.Ct.file = path.Ct.file,
path.design.file = path.design.file,
sep ="\t",
dec = ".")
# Look at the structure:
str(data.Ct)
#> tibble [1,216 × 4] (S3: tbl_df/tbl/data.frame)
#> $ Gene : chr [1:1216] "Gene1" "Gene1" "Gene1" "Gene1" ...
#> $ Sample: chr [1:1216] "Disease1" "Disease10" "Disease12" "Disease13" ...
#> $ Ct : chr [1:1216] "32.563" "34.648" "35.059" "37.135" ...
#> $ Group : chr [1:1216] "Disease" "Disease" "Disease" "Disease" ...
The table imported from the package data object can be directly subjected to further steps of analysis.
It also can be a situation, in which an imported wide-format table has samples by rows and genes by columns. There is no need to develop separate function to import file with such a table, we can do:
# Import file, be aware to specify parameters that fit to imported data:
data.Ct.wide <- read.csv(file = "data/data.Ct.wide.vign.txt",
header = TRUE,
sep = ",")
str(data.Ct.wide)
#> 'data.frame': 64 obs. of 22 variables:
#> $ X : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Group : chr "Disease" "Disease" "Disease" "Disease" ...
#> $ Sample: chr "Disease1" "Disease10" "Disease12" "Disease13" ...
#> $ Gene1 : num 32.6 34.6 35.1 37.1 34 ...
#> $ Gene2 : chr "36.554" "37.262" "36.977" "38.295" ...
#> $ Gene3 : chr "31.334" "31.161" "32.077" "33.982" ...
#> $ Gene4 : num 23.4 22.4 24.3 24.9 24.1 ...
#> $ Gene5 : chr "35.608" "33.385" "36.374" "36.997" ...
#> $ Gene7 : num 32.8 33.3 31.3 35.5 31.9 ...
#> $ Gene8 : num 21.5 22.9 24.7 24 21.6 ...
#> $ Gene9 : chr "35.037" "36.36" "35.946" "36.885" ...
#> $ Gene10: num 26.6 27.1 26.7 27.7 26.8 ...
#> $ Gene11: num 36.7 34.1 35.4 37.1 35.2 ...
#> $ Gene12: num 28.2 30.2 29.4 32.5 29.6 ...
#> $ Gene13: num 28.2 30.5 29.9 33 29.8 ...
#> $ Gene14: num 27.9 28.5 28.1 30.8 30.9 ...
#> $ Gene15: num 30 31.2 32 32.1 29.4 ...
#> $ Gene16: num 21.7 22.5 23.6 24.5 22.7 ...
#> $ Gene17: num 26.3 27.1 27.6 28.5 27.6 ...
#> $ Gene18: num 28.2 28.5 29.6 30.2 26.6 ...
#> $ Gene19: chr "28.263" "27.358" "28.867" "29.951" ...
#> $ Gene20: num 32.7 32.9 31 35.9 32 ...
# The imported table is now transformed to a long-format structure. The "X" column is unnecessary and is removed. All variables also are converted to a character to unify the class of variables.
library(tidyverse)
data.Ct <- data.Ct.wide %>%
select(-X) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = -c(Group, Sample), names_to = "Gene", values_to = "Ct")
str(data.Ct)
#> tibble [1,216 × 4] (S3: tbl_df/tbl/data.frame)
#> $ Group : chr [1:1216] "Disease" "Disease" "Disease" "Disease" ...
#> $ Sample: chr [1:1216] "Disease1" "Disease1" "Disease1" "Disease1" ...
#> $ Gene : chr [1:1216] "Gene1" "Gene2" "Gene3" "Gene4" ...
#> $ Ct : chr [1:1216] "32.563" "36.554" "31.334" "23.415" ...
NOTE: At this stage, the Ct values do not have to be numeric.
NOTE: Data can also be imported to R using user’s own code, but the final object must be a data frame and contain a table with column named “Sample” with sample names, column named “Gene” with gene names, column named “Ct” with raw Ct values, column named “Group” with group names, and optionally column named “Flag” containing flag information (this column should be a class of character or factor).
For package testing purposes, there is also a convenient possibility
to use the data objects included in the RQdeltaCT
package,
named data.Ct
(the data with independent groups of samples)
and data.Ct.pairwise
(with dependent groups of samples, can
be used for a pairwise analysis):
data(data.Ct)
str(data.Ct)
#> 'data.frame': 1288 obs. of 5 variables:
#> $ Sample: chr "Disease1" "Disease10" "Disease12" "Disease13" ...
#> $ Gene : chr "Gene1" "Gene1" "Gene1" "Gene1" ...
#> $ Ct : chr "32.563" "34.648" "35.059" "37.135" ...
#> $ Group : chr "Disease" "Disease" "Disease" "Disease" ...
#> $ Flag : chr "OK" "OK" "OK" "OK" ...
data(data.Ct.pairwise)
str(data.Ct.pairwise)
#> tibble [756 × 4] (S3: tbl_df/tbl/data.frame)
#> $ Sample: chr [1:756] "Sample01" "Sample01" "Sample01" "Sample01" ...
#> $ Group : chr [1:756] "After" "After" "After" "After" ...
#> $ Gene : chr [1:756] "Gene1" "Gene2" "Gene3" "Gene4" ...
#> $ Ct : chr [1:756] "32.563" "36.554" "31.334" "23.415" ...
The data.Ct.pairwise
data are a part of the
data.Ct
data, with a structure transformed to fit a
pairwise analysis: the number of samples in the study group (named
“Before”) has been equated with the number of the reference group (named
“After”). This data are used in this vignette to demonstrate a pairwise
approach to the relative quantification of gene expression.
The RQdeltaCT
package allows to perform analysis in two
variants: a comparison of independent groups of samples and a pairwise
comparison of dependent groups of samples. Independent groups contain
different samples without inter-group relations. The example of such
analysis is the comparison of patients with disease vs. healthy
controls. In the pairwise approach, the groups contain either the same
set of subjects or different, but paired subjects. The example of
pairwise analysis is the comparison between the same patients before and
after medical intervention. This part of the vignette contains
instruction for analysis of independent groups, for the pairwise variant
of the workflow refer to [Part B: A pairwise analysis].
The crucial step of each data analysis is an assessment of the
quality and usefulness of the data used to investigate the studied
problem. The RQdeltaCT
package offers two functions for
quality control of raw Ct data: control_Ct_barplot_sample()
(for quality control of samples) and
control_Ct_barplot_gene()
(for quality control of genes).
Both functions require specifying quality control criteria to be applied
to Ct values, in order to label each Ct value as reliable or not. These
functions return numbers of reliable and unreliable Ct values in each
sample or each gene, as well as total number of Ct values obtained from
each sample and each gene. These results are presented graphically on
barplots. The obtained results are useful for inspecting the analysed
data in order to identify samples and genes that should be considered to
be removed from the data (based on applied reliability criteria).
Three selection criteria can be set for these functions:
NOTE: This function does not perform data filtering, but only report numbers of Ct values labelled as reliable or not and presents them graphically.
An example of using these functions is provided below:
sample.Ct.control <- control_Ct_barplot_sample(data = data.Ct,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
axis.title.size = 9,
axis.text.size = 7,
plot.title.size = 9,
legend.title.size = 9,
legend.text.size = 9)
gene.Ct.control <- control_Ct_barplot_gene(data = data.Ct,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
axis.title.size = 9,
axis.text.size = 9,
plot.title.size = 9,
legend.title.size = 9,
legend.text.size = 9)
Created plots are displayed on the graphic device, and short information about the returned tables appears. Returned objects are lists that contain two elements: an object with plot and a table with numbers of Ct values labelled as reliable (Yes) and unreliable (No), as well as fraction of unreliable Ct values in each gene. To easily identify samples or genes with high number of unreliable values, tables are sorted to show them at the top. To access returned tables, the second element of returned objects should be called:
head(sample.Ct.control[[2]])
#> # A tibble: 6 × 4
#> Sample Not.reliable Reliable Not.reliable.fraction
#> <fct> <int> <int> <dbl>
#> 1 Control08 9 13 0.409
#> 2 Control16 9 13 0.409
#> 3 Control22 9 13 0.409
#> 4 Control13 8 14 0.364
#> 5 Control05 7 15 0.318
#> 6 Control20 7 15 0.318
head(gene.Ct.control[[2]])
#> # A tibble: 6 × 5
#> Gene Group Not.reliable Reliable Not.reliable.fraction
#> <fct> <fct> <int> <int> <dbl>
#> 1 Gene6 Control 48 0 1
#> 2 Gene2 Disease 35 5 0.875
#> 3 Gene9 Disease 34 6 0.85
#> 4 Gene2 Control 23 1 0.958
#> 5 Gene5 Disease 21 19 0.525
#> 6 Gene9 Control 19 5 0.792
Visual inspection of returned plots and obtained tables gives a clear image of data quality. The results obtained in the examples show that the Disease group contains more samples than the Control group. Some of the samples have more Ct values (more technical replicates) than other samples. Furthermore, in all samples, the majority of Ct values are reliable.
Regarding genes, Gene6 and Gene8 were investigated in duplicates, other genes have single Ct values. Furthermore, Gene6 was analysed only in the Control group and has all values labeled as unreliable; therefore, it is obvious that this gene should be excluded from the analysis. Some other genes also have many unreliable Ct values (e.g. Gene2, Gene9) and maybe should be considered to be removed from the data.
In some situations, a unified fraction of unreliable data need to be
established and used to make decision which samples or genes should be
excluded from the analysis. It can be done using the following code, in
which the second element of object returned by the
control_Ct_barplot_sample()
and
control_Ct_barplot_gene()
functions can be used directly,
and a vector with samples or genes for which the fraction of unreliable
Ct values is higher than a specified threshold is received:
# Finding samples with more than half of the unreliable Ct values.
low.quality.samples <- filter(sample.Ct.control[[2]], Not.reliable.fraction > 0.5)$Sample
low.quality.samples <- as.vector(low.quality.samples)
low.quality.samples
#> character(0)
# Finding genes with more than half of the unreliable Ct values in given group.
low.quality.genes <- filter(gene.Ct.control[[2]], Not.reliable.fraction > 0.5)$Gene
low.quality.genes <- unique(as.vector(low.quality.genes))
low.quality.genes
#> [1] "Gene6" "Gene2" "Gene9" "Gene5" "Gene11"
In the above examples, there is no sample with more than half of the unreliable data. Furthermore, this criterion was met by 5 genes (Gene2, Gene5, Gene6, Gene9, and Gene11); therefore, these genes will be removed from the data in the next step of analysis.
When reliability criteria are finally established for Ct values, and
some samples or genes are decided to be excluded from the analysis after
quality control of the data, the data with raw Ct values can be filtered
using the filter_Ct()
function.
As a filtering criteria, a flag used for undetermined Ct values, a maximum of Ct threshold, and a flag used in Flag column can be applied. Furthermore, vectors with samples, genes, and groups to be removed can also be specified:
# Objects returned from the `low_quality_samples()` and `low_quality_genes()`functions can be used directly:
data.CtF <- filter_Ct(data = data.Ct,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
remove.Gene = low.quality.genes,
remove.Sample = low.quality.samples)
# Check dimensions of data before and after filtering:
dim(data.Ct)
#> [1] 1288 5
dim(data.CtF)
#> [1] 946 5
NOTE: If data contain more than two groups, it is good practice to remove groups that are out of comparison; however, a majority of other functions can deal with more groups unless only two groups are indicated to be given in function’s parameters.
make_Ct_ready()
functionIn the next step, filtered Ct data can be subjected to collapsing of
technical replicates and (optional) data imputation by means within
groups using the make_Ct_ready()
function. The term
‘technical replicates’ means observations with the same group name, gene
name, and sample name. In the scenario when data contain technical
replicates but they should not be collapsed, these technical replicates
must be distinguished by different sample names, e.g. Sample1_1,
Sample1_2, Sample1_3, etc.
The parameter imput.by.mean.within.groups
can be used to
control data imputation. If it is set to TRUE, imputation will be done,
otherwise missing values will be left in the data. For a better view of
the amount of missing values in the data, the information about the
number and percentage of missing values is displayed automatically:
# Without imputation:
data.CtF.ready <- make_Ct_ready(data = data.CtF,
imput.by.mean.within.groups = FALSE)
# A part of the data with missing values:
as.data.frame(data.CtF.ready)[19:25,]
#> Group Sample Gene1 Gene10 Gene12 Gene13 Gene14 Gene15 Gene16 Gene17
#> 19 Control Control26 34.506 26.266 31.305 30.384 28.956 31.174 22.844 28.401
#> 20 Control Control05 NA 28.339 31.200 30.498 28.562 31.068 21.863 26.798
#> 21 Control Control08 NA 30.892 32.381 33.224 31.643 33.485 25.132 30.133
#> 22 Control Control13 NA 28.976 32.976 33.340 29.684 32.600 24.812 28.904
#> 23 Control Control16 NA 30.689 33.656 32.489 30.335 32.432 24.516 29.772
#> 24 Control Control22 NA 26.988 29.846 31.564 30.587 29.176 22.972 27.208
#> 25 Disease Disease1 32.563 26.552 28.179 28.227 27.905 30.048 21.691 26.272
#> Gene18 Gene19 Gene20 Gene3 Gene4 Gene7 Gene8
#> 19 29.012 26.911 33.386 34.256 23.801 31.987 22.5725
#> 20 29.582 28.194 33.680 33.657 24.123 31.019 23.3080
#> 21 31.399 29.908 NA 34.853 26.228 NA 26.4465
#> 22 30.251 28.085 NA 32.600 25.186 34.338 24.9500
#> 23 30.991 28.964 NA 33.432 25.435 NA 25.4190
#> 24 28.941 28.143 NA 34.886 25.056 NA 23.0315
#> 25 28.165 28.263 32.725 31.334 23.415 32.789 21.4550
# With imputation:
data.CtF.ready <- make_Ct_ready(data = data.CtF,
imput.by.mean.within.groups = TRUE)
# Missing values were imputed:
as.data.frame(data.CtF.ready)[19:25,]
#> Group Sample Gene1 Gene10 Gene12 Gene13 Gene14 Gene15 Gene16 Gene17
#> 19 Control Control26 34.50600 26.266 31.305 30.384 28.956 31.174 22.844 28.401
#> 20 Control Control05 33.01921 28.339 31.200 30.498 28.562 31.068 21.863 26.798
#> 21 Control Control08 33.01921 30.892 32.381 33.224 31.643 33.485 25.132 30.133
#> 22 Control Control13 33.01921 28.976 32.976 33.340 29.684 32.600 24.812 28.904
#> 23 Control Control16 33.01921 30.689 33.656 32.489 30.335 32.432 24.516 29.772
#> 24 Control Control22 33.01921 26.988 29.846 31.564 30.587 29.176 22.972 27.208
#> 25 Disease Disease1 32.56300 26.552 28.179 28.227 27.905 30.048 21.691 26.272
#> Gene18 Gene19 Gene20 Gene3 Gene4 Gene7 Gene8
#> 19 29.012 26.911 33.38600 34.256 23.801 31.98700 22.5725
#> 20 29.582 28.194 33.68000 33.657 24.123 31.01900 23.3080
#> 21 31.399 29.908 32.88615 34.853 26.228 31.83495 26.4465
#> 22 30.251 28.085 32.88615 32.600 25.186 34.33800 24.9500
#> 23 30.991 28.964 32.88615 33.432 25.435 31.83495 25.4190
#> 24 28.941 28.143 32.88615 34.886 25.056 31.83495 23.0315
#> 25 28.165 28.263 32.72500 31.334 23.415 32.78900 21.4550
NOTE: The data imputation process can significantly
influence data; therefore, no default value was set to the
imput.by.mean.within.groups
parameter to force the
specification by the user. If there are missing data for a certain gene
in the entire group, they will not be imputed and will remain NA.
In general, a majority of functions in RQdeltaCT
package
can deal with missing data; however, some used methods (e.g. PCA) are
sensitive to missing data (see PCA analysis
section).
NOTE: The make_Ct_ready()
function
should be used even if the collapsing of technical replicates and data
imputation is not required, because this function also prepares the data
structure to fit to further functions.
Ideally, the reference gene should have an identical expression level in all samples, but in many situations it is not possible to achieve this, especially when biological replicates are analysed. Therefore, differences between samples are allowed, but the variance should be as low as possible, and it is also recommended that Ct values would not be very low (below 15) or very high (above 30) Kozera and Rapacz 2013.
The RQdeltaCT
package includes
find_ref_gene()
function that can be used to select the
best reference gene for normalisation. This function calculates
descriptive statistics such as minimum, maximum, standard deviation, and
variance, as well as stability scores calculated using the NormFinder (Article)
and geNorm (Article)
algorithms. Ct values are also presented on a line plot.
NormFinder scores are computed using internal
RQdeltaCT::norm_finder()
function working on the code
adapted from the original NormFinder code. To calculate NormFinder
scores, at least two samples must be present in each group. For the
geNorm score, the geNorm()
function of the
ctrlGene
package is used. The find_ref_gene()
function allows one to choose which of these algorithms should be done
by setting the logical parameters: norm.finder.score
and
genorm.score
.
The created plot is displayed on the graphic device. The returned object is a list that contains two elements: an object with plot and a table with results. In the example below, six genes are tested for suitability to be a reference gene:
library(ctrlGene)
# Remember that the number of colors in col parameter should be equal to the number of tested genes:
ref <- find_ref_gene(data = data.CtF.ready,
groups = c("Disease","Control"),
candidates = c("Gene4", "Gene8","Gene10","Gene16","Gene17", "Gene18"),
col = c("#66c2a5", "#fc8d62","#6A6599", "#D62728", "#1F77B4", "black"),
angle = 60,
axis.text.size = 7,
norm.finder.score = TRUE,
genorm.score = TRUE)
ref[[2]]
#> Gene min max sd var NormFinder_score geNorm_score
#> 1 Gene10 22.043 30.8920 1.894611 3.589551 0.31 1.0511715
#> 2 Gene16 19.801 25.7370 1.359134 1.847245 0.34 NA
#> 3 Gene17 24.436 30.1330 1.370355 1.877874 0.19 NA
#> 4 Gene18 25.323 31.3990 1.329008 1.766262 0.14 0.8127723
#> 5 Gene4 19.295 31.5080 1.998908 3.995635 0.33 1.2335325
#> 6 Gene8 18.314 26.4465 1.724890 2.975244 0.22 0.9001852
#> 7 Gene16-Gene17 NA NA NA NA NA 0.6930721
NA values are presented because geNorm method returns a pair of genes
with the highest stability.
Among tested genes, Gene8, Gene16, Gene17, and Gene18 seem to have the
best characteristics to be a reference gene (they have low variance, low
NormFinder and low geNorm scores).
Data normalization can be performed using delta_Ct()
function that calculates delta Ct (dCt) values by subtracting Ct values
of reference gene (or mean of the Ct values of reference genes, if more
than one reference gene is used) from Ct values of gene of interest
across all samples. If 2-dCt method is used,
transform
should be set to TRUE to tansform dCt values
using 2-dCt formula:
# For 2^-dCt^ method:
data.dCt.exp <- delta_Ct(data = data.CtF.ready,
normalise = TRUE,
ref = "Gene8",
transform = TRUE)
If 2-ddCt method is used, transform
should be
set to FALSE to avoid dCt values transformation that is performed by the
consecutive RQ_ddCt()
function:
# For 2^-ddCt^ method:
data.dCt <- delta_Ct(data = data.CtF.ready,
normalise = TRUE,
ref = "Gene8",
transform = FALSE)
NOTE: In the scenario where unnormalised data should
be analysed, the normalise
parameter should be set to
FALSE.
Before further processing, non-transformed or transformed dCt data should be subjected to quality control using functions and methods described in Quality control and filtering of transformed Ct data section (see below).
Data intended for relative quantification should be subjected to
quality assessment. The main purpose is to identify outlier samples that
could introduce bias into the results. The RQdeltaCT
package offers several functions which facilitate finding outlier
samples in data by implementing distribution analysis
(control_boxplot_sample()
function), hierarchical
clustering (control_cluster_sample()
function) and
principal component analysis PCA (control_pca_sample()
function). However, corresponding functions for genes are also provided
to evaluate similarities and differences between expression of analysed
genes (control_boxplot_gene()
,
control_cluster_gene()
and control_pca_gene()
functions).
The abovementioned data quality control functions are
designed to be directly applied to data objects returned from the
make_Ct_ready()
and delta_Ct()
functions (see
The summary of standard
workflow).
One of the ways to gain insight into data distribution is drawing
boxplot that show several statistics, such as median (labelled inside
box), first and third quartiles (ranged by box), extreme point in
interquartile range (ranged by whiskers), and more distant data as
separated points. In the RQdeltaCT
package, the
control_boxplot_sample()
and
control_boxplot_gene()
functions are included to draw
boxplots that present the distribution of samples and genes,
respectively.
These functions return objects with a plot. The created plots are also displayed on the graphic device.
NOTE: If missing values are present in the data, they will be automatically removed with warning.
Hierarchical clustering is a convenient method to investigate
similarities between variables. Hierarchical clustering of samples and
genes can be done using the control_cluster_sample()
and
control_cluster_gene()
functions, respectively. These
functions allow drawing dendrograms using various methods of distance
calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete,
average, single). For more details, refer to the functions
documentation.
control_cluster_sample(data = data.dCt,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.6)
control_cluster_gene(data = data.dCt,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.8)
The created plots are displayed on the graphic device.
NOTE: Minimum three samples or genes in data is required for clustering analysis.
Principal component analysis (PCA) is a data exploratory method, that
is commonly used to investigate similarities between variables based on
the first principal components that contain the most information about
variance. In the RQdeltaCT
package, the
control_pca_sample()
and control_pca_gene()
functions are developed to perform PCA analysis for samples and genes,
respectively. These functions return objects with a plot. Created plots
are also displayed on the graphic device.
control.pca.sample <- control_pca_sample(data = data.dCt,
point.size = 3,
label.size = 2.5,
legend.position = "top")
NOTE: PCA algorithm can not deal with missing data (NAs); therefore, variables with NA values are removed before analysis. If at least one NA value occurs in all variables in at least one of the compared group, the analysis can not be done. Imputation of missing data will avoid this issue. Also, a minimum of three samples or genes in the data are required for analysis.
If any sample or gene was decided to be removed from the data, the
filter_transformed_data()
function can be used for
filtering. Similarly to quality control functions, this function can be
directly applied to data objects returned from the
make_Ct_ready()
and delta_Ct()
functions (see
The summary of standard
workflow).
This method is used in studies where samples should be analysed as individual data points, e.g. in analysis of biological replicates. In this method, Ct values are normalised by the endogenous control gene by subtracting the Ct value of the endogenous control from the Ct value of the gene of interest, in the same sample. Obtained delta Ct (dCt) values are subsequently transformed using the 2-dCt formula, summarised by means in the compared study groups, and a ratio of means (fold change) is calculated for the study group (see [Introduction] section).
The whole process can be done using RQ_dCt()
function,
which performs:
+ calculation of means (returned in columns with the “_mean” pattern)
and standard deviations (returned in columns with the “_sd” pattern) of
transformed dCt values of genes analysed in the compared groups. +
normality testing (Shapiro_Wilk test) of transformed dCt values of
analysed genes in compared groups and returned p values are stored in
columns with the “_norm_p” pattern. + calculation of fold change values
for each gene by dividing the mean of transformed dCt values in the
study group by the mean of transformed dCt values in the reference
group. Fold change values are returned in the “FCh” column. +
statistical testing of differences in transformed Ct values between the
study group and the reference group. Student’s t test and Mann-Whitney U
test are implemented and the resulting statistics (in column with the
“_test_stat” pattern), p values (in column with the “_test_p” pattern),
and adjusted p values (in column with the “_test_p_adj” pattern) are
returned. The Benjamini-Hochberg method for adjustment of p values is
used in default. If compared groups contain less than three samples,
normality and statistical tests are not possible to perform (the
do.test
parameter should be set to FALSE to avoid
error).
NOTE: For this method, delta Ct (dCt) values
transformed by the 2-dCt formula using
delta_Ct()
function (called with transform
=
TRUE) should be used.
data.dCt.exp <- delta_Ct(data = data.CtF.ready,
ref = "Gene8",
transform = TRUE)
library(coin)
results.dCt <- RQ_dCt(data = data.dCt.exp,
do.tests = TRUE,
group.study = "Disease",
group.ref = "Control")
# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
head(as.data.frame(arrange(results.dCt, MW_test_p)))
#> Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1 Gene1 0.001779634 0.007572040 0.002442710 0.007962192 3.435812e-06
#> 2 Gene19 0.058796934 0.024802219 0.060456391 0.027097216 3.400562e-07
#> 3 Gene13 0.013651779 0.013343277 0.028406905 0.008790133 5.964156e-09
#> 4 Gene16 1.509777018 0.967490766 1.244618540 0.609112278 1.086815e-06
#> 5 Gene12 0.016030664 0.340252219 0.018392782 1.013461602 1.391742e-05
#> 6 Gene7 0.003424749 0.001911581 0.004925526 0.001866461 1.322123e-07
#> Disease_norm_p FCh log10FCh t_test_p t_test_stat MW_test_p
#> 1 1.175080e-05 4.2548290 0.628882107 8.483229e-05 -4.27774491 5.136855e-05
#> 2 4.020200e-08 0.4218284 -0.374864146 1.450713e-02 2.60232741 5.449915e-05
#> 3 2.002333e-03 0.9774021 -0.009926733 9.591380e-01 0.05173793 1.009811e-02
#> 4 4.020298e-03 0.6408170 -0.193265981 5.517804e-02 1.99590948 1.255492e-02
#> 5 6.309057e-12 21.2250864 1.326849466 4.998438e-02 -2.02276492 1.410617e-02
#> 6 3.737597e-07 0.5581668 -0.253236035 1.602119e-01 1.44408971 6.717362e-02
#> MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 -4.049311 0.001187652 0.0003814941
#> 2 4.035444 0.101549928 0.0003814941
#> 3 -2.572452 0.974195838 0.0394972722
#> 4 2.496151 0.193123123 0.0394972722
#> 5 -2.454548 0.193123123 0.0394972722
#> 6 1.830511 0.448593384 0.1567384376
Similarly to the 2-dCt method, in the 2-ddCt method Ct values are normalised by the endogenous control gene, obtaining dCt values. Subsequently, dCt values are summarised by means in the compared groups, and for each gene, the obtained mean in the control group is subtracted from the mean in the study group, giving the delta delta Ct (ddCt) value. Finally, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change values. This method is recommended for analysis of technical replicates (see the [Introduction] section).
The whole process can be done using RQ_ddCt()
function,
which performs:
+ calculation of means (returned in columns with the “_mean” pattern)
and standard deviations (returned in columns with the “_sd” pattern) of
delta Ct values of the analyzed genes in the compared groups. +
normality testing (Shapiro_Wilk test) of delta Ct values of the analyzed
genes in the compared groups and returned p values are stored in columns
with the “_norm_p” pattern. + calculation of differences in the mean
delta Ct values of genes between compared groups, returned in “ddCt”
column, + calculation of fold change values (returned in “FCh” column)
for each gene by transforming the ddCt values using the
2-ddCt formula. + statistical testing of differences between
the compared groups. Student’s t test and Mann-Whitney U test are
implemented and the resulted statistics (in column with the “_test_stat”
pattern), p values (in column with the “_test_p” pattern) and adjusted p
values (in column with the “_test_p_adj” pattern) are returned. The
Benjamini-Hochberg method for adjustment of p values is used in default.
If compared groups contain less than three samples, normality and
statistical tests are not possible to perform and do.test
parameter should be set to FALSE to avoid error.
NOTE: For this method, not transformed delta Ct
(dCt) values (obtained from delta_Ct()
function with
transform
= FALSE) should be used.
data.dCt <- delta_Ct(data = data.CtF.ready,
ref = "Gene8",
transform = FALSE)
library(coin)
results.ddCt <- RQ_ddCt(data = data.dCt,
group.study = "Disease",
group.ref = "Control",
do.tests = TRUE)
# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
head(as.data.frame(arrange(results.ddCt, MW_test_p)))
#> Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1 Gene1 10.117002 7.959426 1.6881153 1.8871311 0.426642224
#> 2 Gene19 4.522833 5.938756 1.1662848 1.3467583 0.007240964
#> 3 Gene13 7.216583 6.565200 1.4577193 1.0627016 0.038235161
#> 4 Gene16 -0.329125 0.338025 0.8145069 0.9570775 0.213259767
#> 5 Gene12 6.675667 5.058075 1.4070518 2.9513180 0.172997230
#> 6 Gene7 8.932742 9.536064 1.4167152 1.2052427 0.614472418
#> Disease_norm_p ddCt FCh log10FCh t_test_p t_test_stat
#> 1 0.17446884 -2.1575763 4.4616467 0.6494952 1.690067e-05 4.733411
#> 2 0.56682258 1.4159231 0.3747699 -0.4262353 4.581993e-05 -4.432997
#> 3 0.17567779 -0.6513833 1.5706735 0.1960859 6.426170e-02 1.906189
#> 4 0.57399623 0.6671500 0.6297495 -0.2008322 4.445141e-03 -2.967530
#> 5 0.04414803 -1.6175917 3.0686235 0.4869436 4.508310e-03 2.952083
#> 6 0.61294050 0.6033224 0.6582363 -0.1816182 8.872007e-02 -1.742051
#> MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 5.136855e-05 4.049311 0.0002366094 0.0003814941
#> 2 5.449915e-05 -4.035444 0.0003207395 0.0003814941
#> 3 1.009811e-02 2.572452 0.1799327522 0.0394972722
#> 4 1.255492e-02 -2.496151 0.0157790854 0.0394972722
#> 5 1.410617e-02 2.454548 0.0157790854 0.0394972722
#> 6 6.717362e-02 -1.830511 0.2070134948 0.1567384376
For visualisation of final results, these five functions can be used:
FCh_plot()
that allows to illustrate fold change values
of genes,results_volcano()
that allows to create volcano plot
presenting the arrangement of fold change values and p values,results_barplot()
that show mean and standard deviation
values of genes across the compared groups,results_boxplot()
that illustrate the data distribution
of genes across the compared groups,results_heatmap()
that allows to create heatmap with
hierarchical clustering,All these functions can be run on all data or on finally selected
genes (see sel.Gene
parameter). These functions have a
large number of parameters, and the user should familiarise with all of
them to properly adjust created plots to the user needs.
NOTE: The functions FCh_plot()
,
results_barplot()
, and results_boxplot()
also
allow to add customised statistical significance labels to plots using
ggsignif
package. If statistical significance labels should
be added to the plot, a vector with labels (e.g., “ns”, “*“,”p = 0.03”)
should be provided in the signif.labels
parameter. There
are two important points that must be taking into account when preparing
this vector:
ggsignif
package, all values must be different (the same values are not allowed).
Thus, if the same labels are needed, they should be distinguishable by
adding symmetrically a different number of white spaces, e.g., “ns.”, ”
ns. “,” ns. “,” ns. “, etc. (see the examples below).FCh_plot()
functionThis function creates a barplot that illustrates fold change values
obtained from the analysis, together with an indication of statistical
significance. Data returned from the RQ_dCt()
and
RQ_ddCt()
functions can be directly applied to this
function (see The summary of
standard workflow).
On the barplot, bars of significant genes are distinguished by colors
and/or significance labels. The significance of genes can be established
by two criteria: p values and (optionally) fold change values.
Thresholds for both criteria can be specified. The
FCh_plot()
function offers various options of which p
values are used on the plot:
mode
=
“t”).mode
=
“mw”).mode
=
“depends”). If the data in both compared groups were considered derived
from normal distribution (p value of Shapiro_Wilk test > 0.05) - p
values of Student’s t test will be used, otherwise p values of
Mann-Whitney U test will be used.mode
=
“user”). If the user intends to use the p values obtained from the other
statistical test, the mode
parameter should be set to
“user”. In this scenario, before running the FCh_plot()
function, the user should prepare a data.frame object named “user”
containing two columns: the first column with gene names and the second
column with p values (see example below).The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.
# Variant with p values depending on the normality of the data:
library(ggsignif)
# Specifying vector with significance labels:
signif.labels <- c("****","**","ns."," ns. "," ns. "," ns. "," ns. "," ns. "," ns. "," ns. "," ns. "," ns. "," ns. ","***")
# Genes with p < 0.05 and 2-fold changed expression between compared groups are considered significant:
FCh.plot <- FCh_plot(data = results.ddCt,
use.p = TRUE,
mode = "depends",
p.threshold = 0.05,
use.FCh = TRUE,
FCh.threshold = 2,
signif.show = TRUE,
signif.labels = signif.labels,
angle = 20)
# Access the table with results:
head(as.data.frame(FCh.plot[[2]]))
#> Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1 Gene1 10.117002 7.959426 1.688115 1.887131 0.42664222
#> 2 Gene12 6.675667 5.058075 1.407052 2.951318 0.17299723
#> 3 Gene13 7.216583 6.565200 1.457719 1.062702 0.03823516
#> 4 Gene20 9.983942 9.481083 1.571200 1.143782 0.68169170
#> 5 Gene4 0.584125 0.360150 1.358816 1.670593 0.03149060
#> 6 Gene10 4.101125 3.922625 1.336964 1.381818 0.47231242
#> Disease_norm_p ddCt FCh log10FCh t_test_p t_test_stat
#> 1 1.744688e-01 -2.1575763 4.461647 0.64949517 1.690067e-05 4.7334112
#> 2 4.414803e-02 -1.6175917 3.068624 0.48694361 4.508310e-03 2.9520830
#> 3 1.756778e-01 -0.6513833 1.570674 0.19608592 6.426170e-02 1.9061886
#> 4 2.041532e-02 -0.5028583 1.417018 0.15137544 1.801127e-01 1.3657411
#> 5 2.060155e-07 -0.2239750 1.167947 0.06742319 5.610445e-01 0.5847601
#> 6 8.555529e-02 -0.1785000 1.131707 0.05373385 6.118866e-01 0.5105974
#> MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison
#> 1 5.136855e-05 4.0493114 0.0002366094 0.0003814941 t.student's.test
#> 2 1.410617e-02 2.4545484 0.0157790854 0.0394972722 Mann-Whitney.test
#> 3 1.009811e-02 2.5724516 0.1799327522 0.0394972722 Mann-Whitney.test
#> 4 1.271530e-01 1.5254255 0.3527853502 0.2225177371 Mann-Whitney.test
#> 5 8.551052e-02 1.7195706 0.7138676822 0.1710210453 Mann-Whitney.test
#> 6 6.572160e-01 0.4437602 0.7138676822 0.7902902022 t.student's.test
#> p.used Selected as significant?
#> 1 1.690067e-05 Yes
#> 2 1.410617e-02 Yes
#> 3 1.009811e-02 No
#> 4 1.271530e-01 No
#> 5 8.551052e-02 No
#> 6 6.118866e-01 No
A variant with user p values - the used p values are calculated using the stats::wilcox.test() function:
user <- data.dCt %>%
pivot_longer(cols = -c(Group, Sample),
names_to = "Gene",
values_to = "dCt") %>%
group_by(Gene) %>%
summarise(MW_test_p = wilcox.test(dCt ~ Group)$p.value,
.groups = "keep")
# The stats::wilcox.test() functions is limited to cases without ties; therefore, a warning "cannot compute exact p-value with ties" will appear when ties occur.
FCh.plot <- FCh_plot(data = results.ddCt,
use.p = TRUE,
mode = "user",
p.threshold = 0.05,
use.FCh = TRUE,
FCh.threshold = 2,
signif.show = TRUE,
signif.labels = signif.labels,
angle = 30)
# Access the table with results:
head(as.data.frame(FCh.plot[[2]]))
#> Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1 Gene1 10.117002 7.959426 1.688115 1.887131 0.42664222
#> 2 Gene12 6.675667 5.058075 1.407052 2.951318 0.17299723
#> 3 Gene13 7.216583 6.565200 1.457719 1.062702 0.03823516
#> 4 Gene20 9.983942 9.481083 1.571200 1.143782 0.68169170
#> 5 Gene4 0.584125 0.360150 1.358816 1.670593 0.03149060
#> 6 Gene10 4.101125 3.922625 1.336964 1.381818 0.47231242
#> Disease_norm_p ddCt FCh log10FCh t_test_p t_test_stat
#> 1 1.744688e-01 -2.1575763 4.461647 0.64949517 1.690067e-05 4.7334112
#> 2 4.414803e-02 -1.6175917 3.068624 0.48694361 4.508310e-03 2.9520830
#> 3 1.756778e-01 -0.6513833 1.570674 0.19608592 6.426170e-02 1.9061886
#> 4 2.041532e-02 -0.5028583 1.417018 0.15137544 1.801127e-01 1.3657411
#> 5 2.060155e-07 -0.2239750 1.167947 0.06742319 5.610445e-01 0.5847601
#> 6 8.555529e-02 -0.1785000 1.131707 0.05373385 6.118866e-01 0.5105974
#> MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj p.used
#> 1 5.136855e-05 4.0493114 0.0002366094 0.0003814941 2.570256e-05
#> 2 1.410617e-02 2.4545484 0.0157790854 0.0394972722 1.360267e-02
#> 3 1.009811e-02 2.5724516 0.1799327522 0.0394972722 1.030219e-02
#> 4 1.271530e-01 1.5254255 0.3527853502 0.2225177371 1.295905e-01
#> 5 8.551052e-02 1.7195706 0.7138676822 0.1710210453 8.683564e-02
#> 6 6.572160e-01 0.4437602 0.7138676822 0.7902902022 6.645315e-01
#> Selected as significant?
#> 1 Yes
#> 2 Yes
#> 3 No
#> 4 No
#> 5 No
#> 6 No
Three genes (Gene1, Gene12, and Gene19) are shown to meet the used significance criteria.
NOTE: If p values were not calculated due to the low
number of samples, or they are not intended to be used to create a plot,
the use.p
parameter should be set to FALSE.
results_volcano()
functionThis function creates a volcano plot that illustrates the arrangement
of genes based on fold change values and p values. Significant genes can
be pointed out using specified p value and fold change thresholds, and
highlighted on the plot by color and (optionally) isolated by thresholds
lines. Similarly to FCh_plot()
function, data returned from
the RQ_dCt()
and RQ_ddCt()
functions can be
directly applied (see The
summary of standard workflow) and various sources of used p values
are available (from the Student’s t test, the Mann-Whitney U test,
depended on the normality of data, or provided by the user).
The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.
# Genes with p < 0.05 and 2-fold changed expression between compared groups are considered significant:
volcano <- results_volcano(data = results.ddCt,
mode = "depends",
p.threshold = 0.05,
FCh.threshold = 2)
# Access the table with results:
head(as.data.frame(volcano[[2]]))
#> Gene Control_mean Disease_mean Control_sd Disease_sd Control_norm_p
#> 1 Gene1 10.117002 7.959426 1.6881153 1.887131 0.426642224
#> 2 Gene10 4.101125 3.922625 1.3369635 1.381818 0.472312423
#> 3 Gene12 6.675667 5.058075 1.4070518 2.951318 0.172997230
#> 4 Gene13 7.216583 6.565200 1.4577193 1.062702 0.038235161
#> 5 Gene14 6.092583 6.530625 1.3929523 1.364024 0.315029911
#> 6 Gene15 7.448292 7.563225 0.9437689 1.234245 0.005740816
#> Disease_norm_p ddCt FCh log10FCh t_test_p t_test_stat
#> 1 0.17446884 -2.1575763 4.4616467 0.64949517 1.690067e-05 4.7334112
#> 2 0.08555529 -0.1785000 1.1317066 0.05373385 6.118866e-01 0.5105974
#> 3 0.04414803 -1.6175917 3.0686235 0.48694361 4.508310e-03 2.9520830
#> 4 0.17567779 -0.6513833 1.5706735 0.19608592 6.426170e-02 1.9061886
#> 5 0.91864416 0.4380417 0.7381359 -0.13186368 2.256759e-01 -1.2274336
#> 6 0.22923507 0.1149333 0.9234250 -0.03459838 6.766639e-01 -0.4191284
#> MW_test_p MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison
#> 1 5.136855e-05 4.0493114 0.0002366094 0.0003814941 t.student's.test
#> 2 6.572160e-01 0.4437602 0.7138676822 0.7902902022 t.student's.test
#> 3 1.410617e-02 2.4545484 0.0157790854 0.0394972722 Mann-Whitney.test
#> 4 1.009811e-02 2.5724516 0.1799327522 0.0394972722 Mann-Whitney.test
#> 5 2.330239e-01 -1.1926054 0.3527853502 0.3262335180 t.student's.test
#> 6 8.190116e-01 -0.2288165 0.7287149620 0.8820124714 Mann-Whitney.test
#> p.used Selected as significant?
#> 1 1.690067e-05 Yes
#> 2 6.118866e-01 No
#> 3 1.410617e-02 Yes
#> 4 1.009811e-02 No
#> 5 2.256759e-01 No
#> 6 8.190116e-01 No
results_boxplot()
functionThis function creates a boxplot that illustrates the distribution of
the data for the genes. It is similar to
control_boxplot_gene()
function; however, some new options
are added, including gene selection, faceting, addition of mean points
to boxes, and statistical significance labels.
Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to this
function (see The summary of
standard workflow).
final_boxplot <- results_boxplot(data = data.dCtF,
sel.Gene = c("Gene1","Gene12", "Gene16","Gene19"),
by.group = TRUE,
signif.show = TRUE,
signif.labels = c("****","*","***"," * "),
signif.dist = 1.05,
faceting = TRUE,
facet.row = 1,
facet.col = 4,
y.exp.up = 0.1,
angle = 20,
y.axis.title = "dCt")
NOTE: If missing values are present in the data, they will be automatically removed, and a warning will appear.
results_barplot()
functionThis function creates a barplot that illustrates mean and standard deviation values of the data for all or selected genes.
Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to this
function (see The summary of
standard workflow).
final_barplot <- results_barplot(data = data.dCtF,
sel.Gene = c("Gene1","Gene12","Gene19","Gene20"),
signif.show = TRUE,
signif.labels = c("****","*","***"," * "),
angle = 30,
signif.dist = 1.05,
faceting = TRUE,
facet.row = 1,
facet.col = 4,
y.exp.up = 0.1,
y.axis.title = "dCt")
NOTE: At least two samples in each group are required to calculate the standard deviation and properly generate the plot.
results_heatmap()
functionThis function allows to draw heatmap with hierarchical clustering. Various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single) can be used.
Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to this
function (see The summary of
standard workflow).
NOTE: Remember to create named list with colors for
groups annotation and pass it in col.groups
parameter.
# Create named list with colors for groups annotation:
colors.for.groups = list("Group" = c("Disease"="firebrick1","Control"="green3"))
# Vector of colors for heatmap can be also specified to fit the user needings:
colors <- c("navy","navy","#313695","#4575B4","#74ADD1","#ABD9E9",
"#E0F3F8","#FFFFBF","#FEE090","#FDAE61","#F46D43",
"#D73027","#C32B23","#A50026","#8B0000",
"#7E0202","#000000")
results_heatmap(data.dCt,
sel.Gene = "all",
col.groups = colors.for.groups,
colors = colors,
show.colnames = FALSE,
show.rownames = TRUE,
fontsize = 11,
fontsize.row = 11,
cellwidth = 4)
The created plot is displayed on the graphic device(if save.to.tiff = FALSE) or saved to .tiff file (if save.to.tiff = TRUE).
Gene expression levels and differences between groups can be further
analysed using the following methods and corresponding functions of the
RQdeltaCT
package:
pca_kmeans()
function)corr_sample()
function) and
genes (corr_gene()
function).single_pair_sample()
function) and genes (single_pair_gene()
function).ROCh()
function).log_reg()
function).Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to all of
these functions (see The
summary of standard workflow). Moreover, all functions can be run on
the entire data or only on selected samples/genes.
This function allows to simultaneously perform principal component
analysis (PCA) and (if do.k.means = TRUE) samples classification using k
means method. Number of clusters can be set using k.clust
parameter. Results obtained from both methods can be presented on one
plot. Obtained clusters are distinguishable by point shapes. Confusion
matrix of sample classification is returned as the second element of the
returned object and can be used for calculation further parameters of
classification performance like precision, accuracy, recall, and
others.
pca.kmeans <- pca_kmeans(data.dCt,
sel.Gene = c("Gene1","Gene16","Gene19","Gene20"),
legend.position = "top")
# Access to the confusion matrix:
pca.kmeans[[2]]
#>
#> Cluster1 Cluster2
#> Control 14 10
#> Disease 19 21
If the legend is to wide and is cropped, setting a vertical position of legend should solve this problem:
Correlation analysis is a very useful method to explore linear
relationships between variables. The RQdeltaCT
package
offers corr_sample()
and corr_gene()
functions
to generate and plot correlation matrices of samples and genes,
respectively. The correlation coefficients can be calculated using
either the Pearson or Spearman algorithm. To facilitate plot
interpretation, these functions also have possibilities to order samples
or genes according to several methods, e.g. hierarchical clustering or
PCA first component (see order
parameter).
library(Hmisc)
library(corrplot)
# To make the plot more readable, only part of the data was used:
corr.samples <- corr_sample(data = data.dCt[15:30, ],
method = "pearson",
order = "hclust",
size = 0.7,
p.adjust.method = "BH")
library(Hmisc)
library(corrplot)
corr.genes <- corr_gene(data = data.dCt,
method = "spearman",
order = "FPC",
size = 0.7,
p.adjust.method = "BH")
The created plots are displayed on the graphic device. The returned objects are tables with computed correlation coefficients, p values, and p values adjusted by Benjamini-Hochberg correction (by default). Tables are sorted by the absolute values of correlation coefficients in descending order.
NOTE: A minimum of 5 samples/target are required for correlation analysis.
Linear relationships between pairs of samples or genes can be further
analysed using simple linear regression models using the
single_pair_sample()
function (for analysis of samples) and
the single_pair_gene()
function for analysis of genes.
These functions draw a scatter plot with a simple linear regression
line. Regression results such as regression equation, coefficient of
determination, F value, or p value can be optionally added to the
plot.
The Receiver Operating Characteristic (ROC) analysis is useful for
assessing the performance of sample classification to the particular
group, based on gene expression data. In this analysis, ROC curves
together with parameters such as the area under curve (AUC),
specificity, sensitivity, accuracy, positive and negative predictive
value are received. The ROCh()
function was designed to
perform all of these tasks. This function returns a table with
calculated parameters and a plot with multiple panels, each with ROC
curve for one gene.
NOTE: The created plot is not displayed on the
graphic device, but should be saved as .tiff image
(save.to.txt
= TRUE) and can be opened directly from the
file in the working directory.
library(pROC)
# Remember to specify the numbers of rows (panels.row parameter) and columns (panels.col parameter) to be sufficient to arrange panels:
roc_parameters <- ROCh(data = data.dCt,
sel.Gene = c("Gene1","Gene12","Gene16","Gene19"),
groups = c("Disease","Control"),
panels.row = 2,
panels.col = 2)
roc_parameters
#> Gene Threshold Specificity Sensitivity Accuracy ppv npv
#> 1 Gene1 9.40925 0.775 0.7500000 0.765625 0.6666667 0.8378378
#> 2 Gene12 6.40600 0.700 0.6666667 0.687500 0.5714286 0.7777778
#> 3 Gene16 0.48200 0.475 0.9166667 0.640625 0.5116279 0.9047619
#> 4 Gene19 5.11675 0.725 0.9166667 0.796875 0.6666667 0.9354839
#> youden AUC
#> 1 1.525000 0.8041667
#> 2 1.366667 0.6843750
#> 3 1.391667 0.6875000
#> 4 1.641667 0.8031250
Logistic regression is a useful method to investigate the impact of
the analysed variable on the odds of the occurrence of the studied
experimental condition. In the RQdeltaCT
package,
log_reg()
function allows to calculate for each gene a
chances (odds ratio, OR) of being included in the study group when gene
expression level increases by one unit (suitable for non-transformed
data) or by mean of expression levels (more suitable for transformed
data). This function returns a plot and table with the calculated
parameters (OR, confidence interval, intercept, coefficient, and p
values).
library(oddsratio)
# Remember to set the increment parameter.
log.reg.results <- log_reg(data = data.dCt,
increment = 1,
sel.Gene = c("Gene1","Gene12","Gene16","Gene19"),
group.study = "Disease",
group.ref = "Control")
log.reg.results[[2]]
#> Gene oddsratio CI_low CI_high Increment Intercept coeficient p_intercept
#> 1 Gene1 0.540 0.374 0.734 1 6.0820841 -0.6159817 0.0001466894
#> 2 Gene12 0.727 0.534 0.930 1 2.4160726 -0.3189009 0.0085244809
#> 3 Gene16 2.364 1.281 4.908 1 0.5101473 0.8603227 0.0645722078
#> 4 Gene19 2.603 1.572 5.031 1 -4.4442069 0.9567034 0.0027620282
#> p_coef
#> 1 0.0002879457
#> 2 0.0229318870
#> 3 0.0109936805
#> 4 0.0010298471
A pairwise analysis regards situations when samples are grouped in
pairs. In general, a pairwise analysis is quite similar to the workflow
for comparison of independent groups; however, some crucial points are
different. Therefore, for the users convenience, a separate part of this
vignette was prepared to demonstrate in detail a pairwise variant of
analysis using RQdeltaCT
workflow.
For a demonstration purposes, the data object named
data.Ct.pairwise
from RQdeltaCT
package is
used:
data(data.Ct.pairwise)
str(data.Ct.pairwise)
#> tibble [756 × 4] (S3: tbl_df/tbl/data.frame)
#> $ Sample: chr [1:756] "Sample01" "Sample01" "Sample01" "Sample01" ...
#> $ Group : chr [1:756] "After" "After" "After" "After" ...
#> $ Gene : chr [1:756] "Gene1" "Gene2" "Gene3" "Gene4" ...
#> $ Ct : chr [1:756] "32.563" "36.554" "31.334" "23.415" ...
This dataset contains 21 paired samples analyzed before (“Before” group) and after (“After” group) after the effect of the experimental factor. Total 18 genes were analyzed in these samples.
The crucial step of each data analysis is an assessment of the
quality and usefulness of the data used to investigate the studied
problem. The RQdeltaCT
package offers two functions for
quality control of raw Ct data: control_Ct_barplot_sample()
(for quality control of samples) and
control_Ct_barplot_gene()
(for quality control of genes).
Both functions require specifying quality control criteria to be applied
to Ct values, in order to label each Ct value as reliable or not. These
functions return numbers of reliable and unreliable Ct values in each
sample or each gene, as well as total number of Ct values obtained from
each sample and each gene. These results are presented graphically on
barplots. The obtained results are useful for inspecting the analysed
data in order to identify samples and genes that should be considered to
be removed from the data (based on applied reliability criteria).
Three selection criteria can be set for these functions:
NOTE: This function does not perform data filtering, but only report numbers of Ct values labelled as reliable or not and presents them graphically.
An example of using these functions is provided below:
library(tidyverse)
sample.Ct.control.pairwise <- control_Ct_barplot_sample(data = data.Ct.pairwise,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
axis.title.size = 9,
axis.text.size = 9,
plot.title.size = 9,
legend.title.size = 9,
legend.text.size = 9)
gene.Ct.control.pairwise <- control_Ct_barplot_gene(data = data.Ct.pairwise,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
axis.title.size = 9,
axis.text.size = 9,
plot.title.size = 9,
legend.title.size = 9,
legend.text.size = 9)
Created plots are displayed on the graphic device, and short information about the returned tables appears. Returned objects are lists that contain two elements: an object with plot and a table with numbers of Ct values labelled as reliable (in column “Reliable”) and unreliable (in column “Not.reliable”), as well as fraction of unreliable Ct values in each gene. To easily identify samples or genes with high number of unreliable values, tables are sorted to show them at the top. To access returned tables, the second element of returned objects should be called:
head(sample.Ct.control.pairwise[[2]])
#> # A tibble: 6 × 4
#> Sample Not.reliable Reliable Not.reliable.fraction
#> <fct> <int> <int> <dbl>
#> 1 Sample13 13 23 0.361
#> 2 Sample16 13 23 0.361
#> 3 Sample08 11 25 0.306
#> 4 Sample05 10 26 0.278
#> 5 Sample22 10 26 0.278
#> 6 Sample24 9 27 0.25
head(gene.Ct.control.pairwise[[2]], 10)
#> # A tibble: 10 × 5
#> Gene Group Not.reliable Reliable Not.reliable.fraction
#> <fct> <fct> <int> <int> <dbl>
#> 1 Gene9 After 21 0 1
#> 2 Gene2 After 20 1 0.952
#> 3 Gene2 Before 20 1 0.952
#> 4 Gene5 After 16 5 0.762
#> 5 Gene9 Before 16 5 0.762
#> 6 Gene11 After 15 6 0.714
#> 7 Gene1 After 11 10 0.524
#> 8 Gene11 Before 11 10 0.524
#> 9 Gene5 Before 11 10 0.524
#> 10 Gene1 Before 5 16 0.238
Visual inspection of returned plots and obtained tables gives a clear, preliminary image of data quality. The results obtained in the examples show that the Before group contains the same number of samples as the After group. In all samples, the majority of Ct values are reliable.
Regarding genes, Gene9 has all unreliable Ct values in the After group. Other genes that can be considered to remove from the data are Gene2, Gene5, Gene11, and Gene1.
In some situations, a unified fraction of unreliable data need to be
established and used to make decision which samples or genes should be
excluded from the analysis. It can be done using the following code, in
which the table returned by the control_Ct_barplot_sample()
and control_Ct_barplot_gene()
functions can be used to
identify samples or genes for which the fraction of unreliable Ct values
is higher than a specified threshold:
# Finding samples with more than half of the unreliable Ct values.
low.quality.samples.pairwise <- filter(sample.Ct.control.pairwise[[2]],
Not.reliable.fraction > 0.5)$Sample
low.quality.samples.pairwise <- as.vector(low.quality.samples.pairwise)
low.quality.samples.pairwise
#> character(0)
In the above example, there is no sample with more than half of the unreliable data.
# Finding genes with more than half of the unreliable Ct values in at least one group.
low.quality.genes.pairwise <- filter(gene.Ct.control.pairwise[[2]],
Not.reliable.fraction > 0.5)$Gene
low.quality.genes.pairwise <- unique(as.vector(low.quality.genes.pairwise))
low.quality.genes.pairwise
#> [1] "Gene9" "Gene2" "Gene5" "Gene11" "Gene1"
Five genes (Gene1, Gene2, Gene5, Gene9, and Gene11) have more than half of the unreliable Ct values in at least one group. In this example, these genes will be removed from the data in the next step of analysis.
When reliability criteria are finally established for Ct values, and
some samples or genes are decided to be excluded from the analysis after
quality control of the data, the data with raw Ct values can be filtered
using the filter_Ct()
function.
As a filtering criteria, a flag used for undetermined Ct values, a maximum of Ct threshold, and a flag used in Flag column can be applied. Furthermore, vectors with samples, genes, and groups to be removed can also be specified:
# Objects returned from the `low_quality_samples()` and
# `low_quality_genes()`functions can be used directly:
data.Ct.pairwiseF <- filter_Ct(data = data.Ct.pairwise,
flag.Ct = "Undetermined",
maxCt = 35,
flag = c("Undetermined"),
remove.Gene = low.quality.genes.pairwise,
remove.Sample = low.quality.samples.pairwise)
# Check dimensions of data before and after filtering:
dim(data.Ct.pairwise)
#> [1] 756 4
dim(data.Ct.pairwiseF)
#> [1] 530 4
NOTE: If data contain more than two groups, it is good practice to remove groups that are out of comparison; however, a majority of other functions can deal with more groups unless only two groups are indicated to be given in function’s parameters.
make_Ct_ready()
function (a pairwise approach)In the next step, filtered Ct data can be subjected to collapsing of
technical replicates and (optional) data imputation by means within
groups using the make_Ct_ready()
function. The term
‘technical replicates’ means observations with the same group name, gene
name, and sample name. In the scenario when data contain technical
replicates but they should not be collapsed, these technical replicates
must be distinguished by different sample names, e.g. SampleA_1,
SampleA_2, SampleA_3, etc.
The parameter imput.by.mean.within.groups
can be used to
control data imputation. If it is set to TRUE, imputation will be done,
otherwise missing values will be left in the data. For a better view of
the amount of missing values in the data, the information about the
number and percentage of missing values is displayed automatically:
# Without imputation:
data.Ct.pairwiseF.ready <- make_Ct_ready(data = data.Ct.pairwiseF,
imput.by.mean.within.groups = FALSE)
# A part of the data with missing values:
as.data.frame(data.Ct.pairwiseF.ready)[9:19,10:15]
#> Gene19 Gene20 Gene3 Gene4 Gene7 Gene8
#> 9 28.246 33.375 31.086 22.900 32.242 23.070
#> 10 28.867 33.011 32.077 24.332 32.318 24.677
#> 11 29.951 NA 33.982 24.895 NA 23.973
#> 12 29.459 32.032 31.916 24.139 31.937 22.586
#> 13 29.018 34.362 33.497 24.937 32.363 23.714
#> 14 28.959 NA NA 23.072 32.809 23.197
#> 15 28.763 NA 32.078 24.051 34.122 24.710
#> 16 27.325 29.509 32.185 21.328 28.986 19.841
#> 17 29.077 32.482 31.011 23.016 32.927 22.667
#> 18 30.415 31.895 NA 23.886 32.974 22.041
#> 19 31.039 34.386 33.739 24.505 33.868 24.492
# With imputation:
data.Ct.pairwiseF.ready <- make_Ct_ready(data = data.Ct.pairwiseF,
imput.by.mean.within.groups = TRUE)
# Missing values were imputed:
as.data.frame(data.Ct.pairwiseF.ready)[9:19,10:15]
#> Gene19 Gene20 Gene3 Gene4 Gene7 Gene8
#> 9 28.246 33.37500 31.08600 22.900 32.24200 23.070
#> 10 28.867 33.01100 32.07700 24.332 32.31800 24.677
#> 11 29.951 32.91194 33.98200 24.895 32.71465 23.973
#> 12 29.459 32.03200 31.91600 24.139 31.93700 22.586
#> 13 29.018 34.36200 33.49700 24.937 32.36300 23.714
#> 14 28.959 32.91194 32.19953 23.072 32.80900 23.197
#> 15 28.763 32.91194 32.07800 24.051 34.12200 24.710
#> 16 27.325 29.50900 32.18500 21.328 28.98600 19.841
#> 17 29.077 32.48200 31.01100 23.016 32.92700 22.667
#> 18 30.415 31.89500 32.19953 23.886 32.97400 22.041
#> 19 31.039 34.38600 33.73900 24.505 33.86800 24.492
NOTE: The data imputation process can significantly
influence data; therefore, no default value was set to the
imput.by.mean.within.groups
parameter to force the
specification by the user. If there are missing data for a certain gene
in the entire group, they will not be imputed and will remain NA.
In general, a majority of functions in RQdeltaCT
package
can deal with missing data; however, some methods (e.g. PCA) are
sensitive to missing data (see PCA analysis
section).
NOTE: The make_Ct_ready()
function
should be used even if the collapsing of technical replicates and data
imputation is not required, because this function also prepares the data
structure to fit to further functions.
Ideally, the reference gene should have an identical expression level in all samples, but in many situations it is not possible to achieve this, especially when biological replicates are analysed. Therefore, differences between samples are allowed, but the variance should be as low as possible, and it is also recommended that Ct values would not be very low (below 15) or very high (above 30) Kozera and Rapacz 2013.
The RQdeltaCT
package includes
find_ref_gene()
function that can be used to select the
best reference gene for normalisation. This function calculates
descriptive statistics such as minimum, maximum, standard deviation, and
variance, as well as stability scores calculated using the NormFinder (Article)
and geNorm (Article)
algorithms. Ct values are also presented on a line plot.
NormFinder scores are computed using internal
RQdeltaCT::norm_finder()
function working on the code
adapted from the original NormFinder code. To calculate NormFinder
scores, at least two samples must be present in each group. For the
geNorm score, the geNorm()
function of the
ctrlGene
package is used. The find_ref_gene()
function allows one to choose which of these algorithms should be done
by setting the logical parameters: norm.finder.score
and
genorm.score
.
The created plot is displayed on the graphic device. The returned object is a list that contains two elements: an object with plot and a table with results. In the example below, six genes are tested for suitability to be a reference gene:
library(ctrlGene)
# Remember that the number of colors in col parameter should be equal to the number of tested genes:
ref.pairwise <- find_ref_gene(data = data.Ct.pairwiseF.ready,
groups = c("After","Before"),
candidates = c("Gene4","Gene13","Gene20"),
col = c("#66c2a5", "#fc8d62","#6A6599"),
angle = 90,
axis.text.size = 7,
norm.finder.score = TRUE,
genorm.score = TRUE)
ref.pairwise[[2]]
#> Gene min max sd var NormFinder_score geNorm_score
#> 1 Gene13 27.184 33.224 1.449397 2.100751 0.17 NA
#> 2 Gene20 29.509 34.871 1.223547 1.497068 0.26 0.9961593
#> 3 Gene4 20.499 26.228 1.418726 2.012784 0.15 NA
#> 4 Gene4-Gene13 NA NA NA NA NA 0.6912531
NA values are caused because geNorm method returns a pair of genes
with the highest stability.
Among tested genes, Gene4 seems to have the best characteristics to be a
reference gene (it has Ct values below 30, relatively low standard
deviation, variance, NormFinder score, and geNorm score).
Data normalization can be performed using delta_Ct()
function that calculates delta Ct (dCt) values by subtracting Ct values
of reference gene (or mean of the Ct values of reference genes, if more
than one reference gene is used) from Ct values of gene of interest
across all samples.
If required, this function also performs a transformation of dCt data
using the 2-dCt formula. Setting transform
=
TRUE enable this transformation:
Before further processing, non-transformed or transformed dCt data should be subjected to quality control using functions and methods described in Quality control and filtering of transformed Ct data section (see below).
This method is used in studies where samples should be analysed as individual data points, e.g. in analysis of biological replicates (see [Introduction] section). In the pairwise variant of this method, Ct values are normalised by the endogenous control gene by subtracting the Ct value of the endogenous control from the Ct value of the gene of interest, in the same sample. Obtained delta Ct (dCt) values are subsequently transformed using the 2-dCt formula, and a ratio (fold change) of transformed Ct values is calculated individually for each pair of samples, and summarised by mean for each gene.
The whole process can be done using RQ_dCt()
function,
which performs:
+ calculation of means (returned in columns with the “_mean” pattern)
and standard deviations (returned in columns with the “_sd” pattern) of
transformed dCt values of genes analysed in the compared groups. +
normality testing (Shapiro_Wilk test) of transformed dCt values of
analysed genes in compared groups and returned p values are stored in
columns with the “_norm_p” pattern. + calculation of the fold change
values. In this pairwise approach (when pairwise
= TRUE),
individual fold change values are firstly calculated for each sample by
dividing transformed Ct value obtained for particular gene in the study
group by transformed Ct value obtained for the same gene in the
reference group. Subsequently, mean and standard deviation values of
individual fold changes are calculated within each group (returned in
FCh_mean and FCh_sd columns, respectively). + statistical testing of
differences in transformed Ct values between the study group and the
reference group. Student’s t test and Mann-Whitney U test are
implemented and the resulting statistics (in column with the
“_test_stat” pattern), p values (in column with the “_test_p” pattern),
and adjusted p values (in column with the “_test_p_adj” pattern) are
returned. The Benjamini-Hochberg method for adjustment of p values is
used in default. If compared groups contain less than three samples,
normality and statistical tests are not possible to perform (the
do.test
parameter should be set to FALSE to avoid
error).
NOTE: For this method, delta Ct (dCt) values
transformed by the 2-dCt formula using
delta_Ct()
function (called with transform
=
TRUE) should be used.
data.dCt.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready, ref = "Gene4", transform = FALSE)
library(coin)
results.dCt.pairwise <- RQ_dCt(data = data.dCt.pairwise,
do.tests = TRUE,
pairwise = TRUE,
group.study = "After",
group.ref = "Before")
# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.dCt.pairwise[[1]], MW_test_p))
head(results)
#> Gene After_mean Before_mean After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19 5.2109333 3.7824762 0.8428066 0.9893666 0.43685704 0.03510469
#> 2 Gene8 -0.3372381 -1.0184286 0.8749796 1.0425276 0.04748179 0.17942875
#> 3 Gene16 -0.2142381 -0.7020476 0.7958921 0.6781031 0.19024540 0.07832005
#> 4 Gene14 6.0329048 5.7692381 0.5984831 0.6675626 0.58378047 0.03738612
#> 5 Gene7 9.0139833 8.4662381 0.8176409 1.3667726 0.01484502 0.69000384
#> 6 Gene20 9.2112745 9.6542381 0.9144846 1.2836473 0.17140590 0.21430743
#> FCh log10FCh FCh_sd t_test_p t_test_stat MW_test_p
#> 1 1.4700335 0.16732722 0.4499545 0.0001845345 4.573104 0.00102128
#> 2 0.2043090 -0.68971252 2.7242366 0.0537860813 2.049238 0.04565545
#> 3 1.9663253 0.29365536 6.4052867 0.0715023737 1.903242 0.05372471
#> 4 1.0595285 0.02511265 0.1643101 0.2140118947 1.283435 0.24426953
#> 5 1.0967073 0.04009074 0.2375549 0.1444388848 1.518898 0.24426953
#> 6 0.9765855 -0.01028974 0.2018320 0.2733067651 -1.126458 0.41404000
#> MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 3.2845979 0.002214414 0.01225536
#> 2 1.9985649 0.286009495 0.21489883
#> 3 1.9290496 0.286009495 0.21489883
#> 4 1.1643813 0.513628547 0.58624688
#> 5 1.1643813 0.433316654 0.58624688
#> 6 -0.8168048 0.546613530 0.82808001
# Access to the table with fold change values calculated individually for each pair of sampleS:
FCh <- results.dCt.pairwise[[2]]
head(FCh)
#> # A tibble: 6 × 5
#> Sample Gene After Before FCh
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sample01 Gene10 3.14 2.78 1.13
#> 2 Sample01 Gene13 7.81 5.67 1.38
#> 3 Sample01 Gene14 6.49 5.86 1.11
#> 4 Sample01 Gene15 6.63 6.18 1.07
#> 5 Sample01 Gene16 -0.724 -0.882 0.821
#> 6 Sample01 Gene17 2.86 3.47 0.824
NOTE: In a pairwise approach (if
pairwise
= TRUE), the abovementioned results can be found
in the first element of a list object returned by RQ_dCt()
function. For the results transparency, fold change values calculated
individually for each sample pair are also returned as the second
element of this object, and can be used to assess the homogeneity of
individual fold change values within each analysed gene using methods
described in [Quality control and filtering of transformed Ct data - a
pairwise approach] section.
Similarly to the 2-dCt method, in the 2-ddCt method Ct values are normalised by the endogenous control gene, obtaining dCt values. Subsequently, individually for each pair of samples, dCt values obtained in the control group are subtracted from the dCt values in the study group, giving individual delta delta Ct (ddCt) values. Finally, the ddCt values are transformed using the 2-ddCt formula to obtain the fold change values, which then are summarised by mean across analysed genes.
The whole process can be done using RQ_ddCt()
function,
which performs:
+ calculation of means (returned in columns with the “_mean” pattern)
and standard deviations (returned in columns with the “_sd” pattern) of
delta Ct values of the analyzed genes in the compared groups. +
normality testing (Shapiro_Wilk test) of delta Ct values of the analyzed
genes in the compared groups and returned p values are stored in columns
with the “_norm_p” pattern. + calculation of differences (delta delta
Ct, ddCt values) in the delta Ct values between paired samples by
subtracting dCt values obtained in the control group from the dCt values
in the study group. + calculation of fold change values using
2-ddCt formula. Fold change values are returned in a
separated table (see NOTE below). Subsequently, means (returned in “FCh”
column) and standard deviations (returned in “FCh_sd” column) of fold
change values are calculated for each gene. + a pairwise statistical
testing of differences between the compared groups. A pairwise Student’s
t test and Mann-Whitney U test are implemented and the resulted
statistics (in column with the “_test_stat” pattern), p values (in
column with the “_test_p” pattern) and adjusted p values (in column with
the “_test_p_adj” pattern) are returned. The Benjamini-Hochberg method
for adjustment of p values is used in default. If compared groups
contain less than three samples, normality and statistical tests are not
possible to perform and do.test
parameter should be set to
FALSE to avoid error.
NOTE: For this method, not transformed delta Ct
(dCt) values (obtained from delta_Ct()
function with
transform
= FALSE) should be used.
data.dCt.pairwise <- delta_Ct(data = data.Ct.pairwiseF.ready, ref = "Gene4", transform = FALSE)
library(coin)
# Remember to set pairwise = TRUE:
results.ddCt.pairwise <- RQ_ddCt(data = data.dCt.pairwise,
group.study = "After",
group.ref = "Before",
pairwise = TRUE,
do.tests = TRUE)
# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.ddCt.pairwise[[1]], MW_test_p))
head(results)
#> Gene After_mean Before_mean After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19 5.2109333 3.7824762 0.8428066 0.9893666 0.43685704 0.03510469
#> 2 Gene8 -0.3372381 -1.0184286 0.8749796 1.0425276 0.04748179 0.17942875
#> 3 Gene16 -0.2142381 -0.7020476 0.7958921 0.6781031 0.19024540 0.07832005
#> 4 Gene14 6.0329048 5.7692381 0.5984831 0.6675626 0.58378047 0.03738612
#> 5 Gene7 9.0139833 8.4662381 0.8176409 1.3667726 0.01484502 0.69000384
#> 6 Gene20 9.2112745 9.6542381 0.9144846 1.2836473 0.17140590 0.21430743
#> FCh log10FCh FCh_sd t_test_p t_test_stat MW_test_p
#> 1 0.6485663 -0.188045654 0.9753027 0.0001845345 4.573104 0.00102128
#> 2 1.1687779 0.067731984 1.7391643 0.0537860813 2.049238 0.04565545
#> 3 0.9811133 -0.008280834 0.9145980 0.0715023737 1.903242 0.05372471
#> 4 1.0206707 0.008885637 0.6804825 0.2140118947 1.283435 0.24426953
#> 5 1.1322906 0.053957891 1.0765954 0.1444388848 1.518898 0.24426953
#> 6 2.5222676 0.401791166 2.6165869 0.2733067651 -1.126458 0.41404000
#> MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 3.2845979 0.002214414 0.01225536
#> 2 1.9985649 0.286009495 0.21489883
#> 3 1.9290496 0.286009495 0.21489883
#> 4 1.1643813 0.513628547 0.58624688
#> 5 1.1643813 0.433316654 0.58624688
#> 6 -0.8168048 0.546613530 0.82808001
# Access to the table with fold change values calculated individually for each pair of samples:
FCh <- results.ddCt.pairwise[[2]]
head(FCh)
#> # A tibble: 6 × 5
#> Sample Gene After Before FCh
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sample01 Gene10 3.14 2.78 0.783
#> 2 Sample01 Gene13 7.81 5.67 0.227
#> 3 Sample01 Gene14 6.49 5.86 0.647
#> 4 Sample01 Gene15 6.63 6.18 0.732
#> 5 Sample01 Gene16 -0.724 -0.882 0.896
#> 6 Sample01 Gene17 2.86 3.47 1.53
NOTE: In a pairwise approach (if
pairwise
= TRUE), the results can be found in the first
element of a list object returned by RQ_ddCt()
function.
For the results transparency, fold change values calculated individually
for each sample pair can be found in the second element of this object,
and can be assessed using methods described in the [Quality control and
filtering of transformed Ct data - a pairwise approach] section.
Data intended for relative quantification should be subjected to
quality assessment. The main purpose is to identify outlier samples that
could introduce bias into the results. The RQdeltaCT
package offers several functions which facilitate finding outlier
samples in data by implementing distribution analysis
(control_boxplot_sample()
function), hierarchical
clustering (control_cluster_sample()
function) and
principal component analysis PCA (control_pca_sample()
function). However, corresponding functions for genes are also provided
to evaluate similarities and differences between expression of analysed
genes (control_boxplot_gene()
,
control_cluster_gene()
and control_pca_gene()
functions).
The abovementioned data quality control functions are designed to be
directly applied to data objects returned from the
make_Ct_ready()
and delta_Ct()
functions (see
The summary of standard
workflow). Moreover, in a pairwise approach, a tables with fold
change values obtained from individual pairs of samples (the second
element of list returned from RQ_dCt()
and
RQ_ddCt()
functions) can be also passed directly, but the
pairwise.FCh
parameter of control functions must be set to
TRUE.
One of the ways to gain insight into data distribution is drawing
boxplot that show several statistics, such as median (labelled inside
box), first and third quartiles (ranged by box), extreme point in
interquartile range (ranged by whiskers), and more distant data as
separated points. In the RQdeltaCT
package, the
control_boxplot_sample()
and
control_boxplot_gene()
functions are included to draw
boxplots that present the distribution of samples and genes,
respectively.
These functions return objects with a plot. The created plots are also displayed on the graphic device.
Boxplots for samples:
control_boxplot_sample <- control_boxplot_sample(data = data.dCt.pairwise,
axis.text.size = 9,
y.axis.title = "dCt")
And boxplots for genes:
control_boxplot_gene <- control_boxplot_gene(data = data.dCt.pairwise,
by.group = TRUE,
axis.text.size = 10,
y.axis.title = "dCt")
Similar plots can be created for fold change values data returned
from RQ_dCt()
and RQ_ddCt()
functions:
# Remember to set pairwise.FCh to TRUE:
FCh <- results.dCt.pairwise[[2]]
control.boxplot.sample.pairwise <- control_boxplot_sample(data = FCh,
pairwise.FCh = TRUE,
axis.text.size = 9,
y.axis.title = "Fold change")
# There are some very high values, we can identify them using:
head(arrange(FCh, -FCh))
#> # A tibble: 6 × 5
#> Sample Gene After Before FCh
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sample15 Gene16 -0.966 -0.0370 26.1
#> 2 Sample14 Gene16 -1.40 -0.0970 14.4
#> 3 Sample08 Gene8 0.554 0.0610 9.08
#> 4 Sample19 Gene10 3.46 0.969 3.57
#> 5 Sample09 Gene10 5.08 2.03 2.50
#> 6 Sample14 Gene8 -1.55 -0.634 2.45
Similar visualisation can be done for genes:
control.boxplot.gene.pairwise <- control_boxplot_gene(data = FCh,
by.group = FALSE,
pairwise.FCh = TRUE,
axis.text.size = 10,
y.axis.title = "Fold change")
NOTE: If missing values are present in the data, they will be automatically removed with warning.
Hierarchical clustering is a convenient method to investigate
similarities between variables. Hierarchical clustering of samples and
genes can be done using the control_cluster_sample()
and
control_cluster_gene()
functions, respectively. These
functions allow to draw dendrograms using various methods of distance
calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete,
average, single). For more details, refer to the functions
documentation.
# Hierarchical clustering of samples:
control_cluster_sample(data = data.dCt.pairwise,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.6)
# Hierarchical clustering of genes:
control_cluster_gene(data = data.dCt.pairwise,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.8)
Similar plots can be created for fold change values data returned
from RQ_dCt()
and RQ_ddCt()
functions:
# Remember to set pairwise.FCh = TRUE:
control_cluster_sample(data = FCh,
pairwise.FCh = TRUE,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.7)
control_cluster_gene(data = FCh,
pairwise.FCh = TRUE,
method.dist = "euclidean",
method.clust = "average",
label.size = 0.8)
The created plots are displayed on the graphic device.
NOTE: Minimum three samples or genes in data is required for clustering analysis.
Principal component analysis (PCA) is a data exploratory method, that
is commonly used to investigate similarities between variables based on
the first principal components that contain the most information about
variance. In the RQdeltaCT
package, the
control_pca_sample()
and control_pca_gene()
functions are developed to perform PCA analysis for samples and genes,
respectively. These functions return objects with a plot. Created plots
are also displayed on the graphic device.
control.pca.sample.pairwise <- control_pca_sample(data = data.dCt.pairwise,
point.size = 3,
label.size = 2.5,
hjust = 0.5,
legend.position = "top")
Similar plots can be created for fold change values data returned
from RQ_dCt()
and RQ_ddCt()
functions:
control.pca.sample.pairwise <- control_pca_sample(data = FCh,
pairwise.FCh = TRUE,
colors = "black",
point.size = 3,
label.size = 2.5,
hjust = 0.5)
control.pca.gene.pairwise <- control_pca_gene(data = FCh,
pairwise.FCh = TRUE,
color = "black",
hjust = 0.5)
NOTE: PCA algorithm can not deal with missing data (NAs); therefore, variables with NA values are removed before analysis. If at least one NA value occurs in all variables in at least one of the compared group, the analysis can not be done. Imputation of missing data will avoid this issue. Also, a minimum of three samples or genes in the data are required for analysis.
If any sample or gene was decided to be removed from the data, the
filter_transformed_data()
function can be used for
filtering. Similarly to quality control functions, this function can be
directly applied to data objects returned from the
make_Ct_ready()
and delta_Ct()
functions (see
The summary of standard
workflow).
data.dCt.pairwise.F <- filter_transformed_data(data = data.dCt.pairwise,
remove.Sample = c("Sample22", "Sample23", "Sample15","Sample03"))
dim(data.dCt.pairwise)
#> [1] 42 14
dim(data.dCt.pairwise.F)
#> [1] 34 14
NOTE: Remember, there must be a good objective reason to remove specific samples. It must be done with caution to avoid tuning the data to better match the user expectations.
For visualisation of final results, these five functions can be used:
FCh_plot()
that allows to illustrate fold change values
of genes,results_volcano()
that allows to create volcano plot
presenting the arrangement of fold change values and p values,results_barplot()
that show mean and standard deviation
values of genes across the compared groups,results_boxplot()
that illustrate the data distribution
of genes across the compared groups,results_heatmap()
that allows to create heatmap with
hierarchical clustering,parallel_plot()
that allows to present a pairwise
changes in genes expression.All these functions can be run on all data or on finally selected
genes (see sel.Gene
parameter). These functions have a
large number of parameters, and the user should familiarise with all of
them to properly adjust created plots to the user needs.
NOTE: The functions FCh_plot()
,
results_barplot()
, and results_boxplot()
also
allow to add customised statistical significance labels to plots using
ggsignif
package. If statistical significance labels should
be added to the plot, a vector with labels (e.g., “ns”, “*“,”p = 0.03”)
should be provided in the signif.labels
parameter. There
are two important points that must be taking into account when preparing
this vector:
ggsignif
package, all values must be different (the same values are not allowed).
Thus, if the same labels are needed, they should be distinguishable by
adding symmetrically a different number of white spaces, e.g., “ns.”, ”
ns. “,” ns. “,” ns. “, etc. (see the examples below).Before
# Remember to set pairwise = TRUE:
results.ddCt.pairwise <- RQ_ddCt(data = data.dCt.pairwise.F,
group.study = "After",
group.ref = "Before",
pairwise = TRUE,
do.tests = TRUE)
# Obtained table can be sorted by, e.g. p values from the Mann-Whitney U test:
results <- as.data.frame(arrange(results.ddCt.pairwise[[1]], MW_test_p))
head(results)
#> Gene After_mean Before_mean After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene19 5.2123294 3.5095882 0.7641196 0.6450896 0.76631561 0.37520243
#> 2 Gene8 -0.1840000 -1.2615882 0.8593104 0.7335902 0.01848096 0.57804549
#> 3 Gene16 -0.1601765 -0.8746471 0.8306935 0.5154298 0.23806535 0.28735006
#> 4 Gene15 7.2110588 6.5512941 0.9833445 0.9034758 0.10927151 0.05594945
#> 5 Gene7 9.0409206 8.3574706 0.7906209 1.3700506 0.02375954 0.98526475
#> 6 Gene17 4.0798824 3.7351176 0.7764462 0.6373618 0.72685827 0.90708293
#> FCh log10FCh FCh_sd t_test_p t_test_stat MW_test_p
#> 1 0.4078506 -0.389498890 0.3354031 1.134508e-05 6.261821 0.0005026301
#> 2 0.6323613 -0.199034707 0.5379766 9.009319e-04 4.064665 0.0035993566
#> 3 0.7577723 -0.120461285 0.5356882 1.041055e-02 2.901408 0.0056180461
#> 4 1.1067408 0.044045931 1.4292426 9.330078e-02 1.784581 0.0928595314
#> 5 0.9901632 -0.004293203 0.9777963 8.966683e-02 1.806604 0.1127786546
#> 6 0.9001501 -0.045685066 0.4176108 1.174977e-01 1.654544 0.1239292250
#> MW_test_stat t_test_p_adj MW_test_p_adj
#> 1 3.479351 0.000136141 0.006031561
#> 2 2.911294 0.005405591 0.021596140
#> 3 2.769279 0.041642189 0.022472184
#> 4 1.680503 0.223921872 0.247858450
#> 5 1.585827 0.223921872 0.247858450
#> 6 1.538488 0.224668939 0.247858450
Three genes (Gene8, Gene16, and Gene19) seem to have statistically significant differences in expression between Before and After groups
FCh_plot()
function (a pairwise approach)This function creates a barplot that illustrates fold change values
obtained from the analysis, together with an indication of statistical
significance. The first element of the object returned from the
RQ_dCt()
and RQ_ddCt()
functions worked in a
pairwise mode (when pairwise
= TRUE,) can be directly
applied to this function (see The summary of standard
workflow).
On the barplot, bars of significant genes are distinguished by colors
and/or significance labels. The significance of genes can be established
by two criteria: p values and (optionally) fold change values.
Thresholds for both criteria can be specified. The
FCh_plot()
function offers various options of which p
values are used on the plot:
mode
= “t”) or
adjusted p values from the Student’s t test (if mode
=
“t.adj”).mode
= “mw”)
or adjusted p values from the Mann-Whitney U test (if mode
= “mw.adj”)mode
=
“depends”). If the data in both compared groups were considered derived
from normal distribution (p value of Shapiro_Wilk test > 0.05) - p
values of Student’s t test will be used, otherwise p values of
Mann-Whitney U test will be used. For adjusted p values, use
mode
= “depends.adj”.mode
=
“user”). If the user intends to use the p values obtained from the other
statistical test, the mode
parameter should be set to
“user”. In this scenario, before running the FCh_plot()
function, the user should prepare a data.frame object named “user”
containing two columns: the first column with gene names and the second
column with p values (see example below).The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.
Below, a variant with p values depending on the normality of the data is presented. furthermore, genes with p < 0.05 and with at least 1.5-fold changed expression between groups are considered significant.
library(ggsignif)
# Remember to use the first element of list object returned by `RQ_dCt()` or RQ_ddCt() function:
FCh.plot.pairwise <- FCh_plot(data = results.ddCt.pairwise[[1]],
use.p = TRUE,
mode = "depends.adj",
p.threshold = 0.05,
use.FCh = TRUE,
FCh.threshold = 1.5,
angle = 20)
# Access the table with results:
head(as.data.frame(FCh.plot.pairwise[[2]]))
#> Gene After_mean Before_mean After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene10 3.4842353 3.2941765 0.9051358 1.3854897 0.1317249 0.79709435
#> 2 Gene13 6.6754706 6.5255882 0.8060583 0.5893336 0.7451515 0.41911404
#> 3 Gene14 6.0054118 5.6788824 0.4614858 0.6721121 0.5598084 0.01920284
#> 4 Gene15 7.2110588 6.5512941 0.9833445 0.9034758 0.1092715 0.05594945
#> 5 Gene16 -0.1601765 -0.8746471 0.8306935 0.5154298 0.2380653 0.28735006
#> 6 Gene17 4.0798824 3.7351176 0.7764462 0.6373618 0.7268583 0.90708293
#> FCh log10FCh FCh_sd t_test_p t_test_stat MW_test_p
#> 1 1.5731857 0.19678000 1.5533045 0.66262724 0.4445108 0.554033858
#> 2 1.1437228 0.05832077 0.7683948 0.57420244 0.5736167 0.758312374
#> 3 0.9376122 -0.02797677 0.5691316 0.13105688 1.5915084 0.162571759
#> 4 1.1067408 0.04404593 1.4292426 0.09330078 1.7845808 0.092859531
#> 5 0.7577723 -0.12046129 0.5356882 0.01041055 2.9014075 0.005618046
#> 6 0.9001501 -0.04568507 0.4176108 0.11749771 1.6545444 0.123929225
#> MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison p.used
#> 1 0.5917263 0.66262724 0.66484063 t.student's.test 0.66262724
#> 2 0.3076977 0.66262724 0.82724986 t.student's.test 0.66262724
#> 3 1.3964742 0.22466894 0.27869444 Mann-Whitney.test 0.27869444
#> 4 1.6805028 0.22392187 0.24785845 t.student's.test 0.22392187
#> 5 2.7692792 0.04164219 0.02247218 t.student's.test 0.04164219
#> 6 1.5384885 0.22466894 0.24785845 t.student's.test 0.22466894
#> Selected as significant?
#> 1 No
#> 2 No
#> 3 No
#> 4 No
#> 5 No
#> 6 No
Two genes (Gene8 and Gene19) met the significance criteria. Standard deviation bars and statistical significance annotations can be added to the plot:
# Firstly prepare a vector with significance labels specified according to the user needings:
signif.labels <- c("ns.",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
"**",
"***")
# Remember to set signif.show = TRUE:
FCh.plot.pairwise <- FCh_plot(data = results.ddCt.pairwise[[1]],
use.p = TRUE,
mode = "depends.adj",
p.threshold = 0.05,
use.FCh = TRUE,
FCh.threshold = 1.5,
use.sd = TRUE,
signif.show = TRUE,
signif.labels = signif.labels,
signif.dist = 0.2,
angle = 20)
# Variant with user p values - the used p values are calculated using the stats::wilcox.test() function:
user <- data.dCt.pairwise %>%
pivot_longer(cols = -c(Group, Sample),
names_to = "Gene",
values_to = "dCt") %>%
group_by(Gene) %>%
summarise(MW_test_p = wilcox.test(dCt ~ Group)$p.value,
.groups = "keep")
# The stats::wilcox.test() functions is limited to cases without ties; therefore,
# a warning "cannot compute exact p-value with ties" will appear when ties occur.
signif.labels <- c("ns.",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" ns. ",
" * ",
"***")
FCh.plot <- FCh_plot(data = results.ddCt.pairwise[[1]],
use.p = TRUE,
mode = "user",
p.threshold = 0.05,
use.FCh = TRUE,
FCh.threshold = 1.5,
use.sd = TRUE,
signif.show = TRUE,
signif.labels = signif.labels,
signif.dist = 0.4,
angle = 30)
NOTE: If p values were not calculated due to the low
number of samples, or they are not intended to be used to create a plot,
the use.p
parameter should be set to FALSE.
results_volcano()
function (a pairwise
approach)This function creates a volcano plot that illustrates the arrangement
of genes based on fold change values and p values. Significant genes can
be pointed out using specified p value and fold change thresholds, and
highlighted on the plot by color and (optionally) isolated by thresholds
lines. Similarly to FCh_plot()
function, data returned from
the RQ_dCt()
and RQ_ddCt()
functions can be
directly applied (see The
summary of standard workflow) and various sources of used p values
are available (from the Student’s t test, the Mann-Whitney U test,
depended on the normality of data, or provided by the user).
The created plot is displayed on the graphic device. The returned object is a lists that contain two elements: an object with plot and a table with results.
# Genes with p < 0.05 and 2-fold changed expression between compared groups
# are considered significant. Remember to use the first element of list object
# returned by RQ_ddCt() function:
RQ.volcano.pairwise <- results_volcano(data = results.ddCt.pairwise[[1]],
mode = "depends.adj",
p.threshold = 0.05,
FCh.threshold = 1.5)
# Access the table with results:
head(as.data.frame(RQ.volcano.pairwise[[2]]))
#> Gene After_mean Before_mean After_sd Before_sd After_norm_p Before_norm_p
#> 1 Gene10 3.4842353 3.2941765 0.9051358 1.3854897 0.1317249 0.79709435
#> 2 Gene13 6.6754706 6.5255882 0.8060583 0.5893336 0.7451515 0.41911404
#> 3 Gene14 6.0054118 5.6788824 0.4614858 0.6721121 0.5598084 0.01920284
#> 4 Gene15 7.2110588 6.5512941 0.9833445 0.9034758 0.1092715 0.05594945
#> 5 Gene16 -0.1601765 -0.8746471 0.8306935 0.5154298 0.2380653 0.28735006
#> 6 Gene17 4.0798824 3.7351176 0.7764462 0.6373618 0.7268583 0.90708293
#> FCh log10FCh FCh_sd t_test_p t_test_stat MW_test_p
#> 1 1.5731857 0.19678000 1.5533045 0.66262724 0.4445108 0.554033858
#> 2 1.1437228 0.05832077 0.7683948 0.57420244 0.5736167 0.758312374
#> 3 0.9376122 -0.02797677 0.5691316 0.13105688 1.5915084 0.162571759
#> 4 1.1067408 0.04404593 1.4292426 0.09330078 1.7845808 0.092859531
#> 5 0.7577723 -0.12046129 0.5356882 0.01041055 2.9014075 0.005618046
#> 6 0.9001501 -0.04568507 0.4176108 0.11749771 1.6545444 0.123929225
#> MW_test_stat t_test_p_adj MW_test_p_adj test.for.comparison p.used
#> 1 0.5917263 0.66262724 0.66484063 t.student's.test 0.66262724
#> 2 0.3076977 0.66262724 0.82724986 t.student's.test 0.66262724
#> 3 1.3964742 0.22466894 0.27869444 Mann-Whitney.test 0.27869444
#> 4 1.6805028 0.22392187 0.24785845 t.student's.test 0.22392187
#> 5 2.7692792 0.04164219 0.02247218 t.student's.test 0.04164219
#> 6 1.5384885 0.22466894 0.24785845 t.student's.test 0.22466894
#> Selected as significant?
#> 1 No
#> 2 No
#> 3 No
#> 4 No
#> 5 No
#> 6 No
results_boxplot()
function (a pairwise
approach)This function creates a boxplot that illustrates the distribution of
the data for the genes. It is similar to
control_boxplot_gene()
function; however, some new options
are added, including gene selection, faceting, addition of mean points
to boxes, and statistical significance labels.
Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to this
function (see The summary of
standard workflow).
final.boxplot.pairwise <- results_boxplot(data = data.dCt.pairwise.F,
sel.Gene = c("Gene8","Gene19"),
by.group = TRUE,
signif.show = TRUE,
signif.labels = c("***","*"),
signif.dist = 1.2,
faceting = TRUE,
facet.row = 1,
facet.col = 2,
y.exp.up = 0.1,
y.axis.title = "dCt")
If the order of groups in the legend should be changes (to put Before group as first), the levels of the variable with group names should be reordered:
data.dCt.pairwise.F$Group <- factor(data.dCt.pairwise.F$Group, levels = c("Before", "After"))
final.boxplot.pairwise <- results_boxplot(data = data.dCt.pairwise.F,
sel.Gene = c("Gene8","Gene19"),
by.group = TRUE,
signif.show = TRUE,
signif.labels = c("***","*"),
signif.dist = 1.2,
faceting = TRUE,
facet.row = 1,
facet.col = 2,
y.exp.up = 0.1,
y.axis.title = "dCt")
NOTE: If missing values are present in the data, they will be automatically removed, and a warning will appear.
results_barplot()
function (a pairwise
approach)This function creates a barplot that illustrates mean and standard deviation values of the data for all or selected genes.
Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to this
function (see The summary of
standard workflow).
final.barplot.pairwise <- results_barplot(data = data.dCt.pairwise.F,
sel.Gene = c("Gene8","Gene19"),
signif.show = TRUE,
signif.labels = c("***","*"),
angle = 30,
signif.dist = 1.05,
faceting = TRUE,
facet.row = 1,
facet.col = 4,
y.exp.up = 0.1,
y.axis.title = "dCt")
NOTE: At least two samples in each group are required to calculate the standard deviation and properly generate the plot.
results_heatmap()
function (a pairwise
approach)This function allows to draw heatmap with hierarchical clustering. Various methods of distance calculation (e.g. euclidean, canberra) and agglomeration (e.g. complete, average, single) can be used.
NOTE: Remember to create named list with colors for
groups annotation and pass it in col.groups
parameter.
# Create named list with colors for groups annotation:
colors.for.groups = list("Group" = c("After"="firebrick1", "Before"="green3"))
# Vector of colors for heatmap can be also specified to fit the user needings:
colors <- c("navy","navy","#313695","#313695","#4575B4","#4575B4","#74ADD1","#74ADD1",
"#ABD9E9","#E0F3F8","#FFFFBF","#FEE090","#FDAE61","#F46D43",
"#D73027","#C32B23","#A50026","#8B0000",
"#7E0202","#000000")
library(pheatmap)
results_heatmap(data.dCt.pairwise.F,
sel.Gene = "all",
col.groups = colors.for.groups,
colors = colors,
show.colnames = TRUE,
show.rownames = TRUE,
fontsize = 11,
fontsize.row = 10,
cellwidth = 8,
angle.col = 90)
The created plot is displayed on the graphic device(if save.to.tiff = FALSE) or saved to .tiff file (if save.to.tiff = TRUE).
Gene expression levels and differences between groups can be further
analysed using the following methods and corresponding functions of the
RQdeltaCT
package:
pca_kmeans()
function)corr_sample()
function) and
genes (corr_gene()
function).single_pair_sample()
function) and genes (single_pair_gene()
function).ROCh()
function).log_reg()
function).Data objects returned from the make_Ct_ready()
and
delta_Ct() functions
can be directly applied to all of
these functions (see The
summary of standard workflow). Moreover, all functions can be run on
the entire data or only on selected samples/genes.
This function allows to simultaneously perform principal component
analysis (PCA) and (if do.k.means = TRUE) samples classification using k
means method. Number of clusters can be set using k.clust
parameter. Results obtained from both methods can be presented on one
plot. Obtained clusters are distinguishable by point shapes. Confusion
matrix of sample classification is returned as the second element of the
returned object and can be used for calculation further parameters of
classification performance like precision, accuracy, recall, and
others.
pca.kmeans <- pca_kmeans(data.dCt.pairwise.F,
sel.Gene = c("Gene8","Gene19"),
legend.position = "top")
If the legend is cropped, using a vertical position of legend should solve this problem:
Access to the confusion matrix:
Correlation analysis is a very useful method to explore linear
relationships between variables. The RQdeltaCT
package
offers corr_sample()
and corr_gene()
functions
to generate and plot correlation matrices of samples and genes,
respectively. The correlation coefficients can be calculated using
either the Pearson or Spearman algorithm. To facilitate plot
interpretation, these functions also have possibilities to order samples
or genes according to several methods, e.g. hierarchical clustering or
PCA first component (see order
parameter).
library(Hmisc)
library(corrplot)
# To make the plot more readable, only part of the data was used:
corr.samples <- corr_sample(data = data.dCt.pairwise.F[1:10, ],
method = "pearson",
order = "hclust",
size = 0.7,
p.adjust.method = "BH",
add.coef = "white")
library(Hmisc)
library(corrplot)
corr.genes <- corr_gene(data = data.dCt.pairwise.F,
method = "pearson",
order = "FPC",
size = 0.7,
p.adjust.method = "BH")
The created plots are displayed on the graphic device. The returned objects are tables with computed correlation coefficients, p values, and p values adjusted by Benjamini-Hochberg correction (by default). Tables are sorted by the absolute values of correlation coefficients in descending order.
NOTE: A minimum of 5 samples/target are required for correlation analysis.
Linear relationships between pairs of samples or genes can be further
analysed using simple linear regression models using the
single_pair_sample()
function (for analysis of samples) and
the single_pair_gene()
function for analysis of genes.
These functions draw a scatter plot with a simple linear regression
line. Regression results such as regression equation, coefficient of
determination, F value, or p value can be optionally added to the
plot.
The Receiver Operating Characteristic (ROC) analysis is useful for
assessing the performance of sample classification to the particular
group, based on gene expression data. In this analysis, ROC curves
together with parameters such as the area under curve (AUC),
specificity, sensitivity, accuracy, positive and negative predictive
value are received. The ROCh()
function was designed to
perform all of these tasks. This function returns a table with
calculated parameters and a plot with multiple panels, each with ROC
curve for one gene.
NOTE: The created plot is not displayed on the
graphic device, but should be saved as .tiff image
(save.to.txt
= TRUE) and can be opened directly from that
file in the working directory.
library(pROC)
# Remember to specify the numbers of rows (panels.row parameter) and columns (panels.col parameter) to provide sufficient place to arrange panels:
roc_parameters <- ROCh(data = data.dCt.pairwise.F,
sel.Gene = c("Gene8","Gene19"),
groups = c("After","Before"),
panels.row = 1,
panels.col = 2)
# Access to calculated parameters:
roc_parameters
#> Gene Threshold Specificity Sensitivity Accuracy ppv npv youden
#> 1 Gene19 4.5975 0.8235294 1 0.9117647 0.8500000 1 1.823529
#> 2 Gene8 0.0615 0.5882353 1 0.7941176 0.7083333 1 1.588235
#> AUC
#> 1 0.9584775
#> 2 0.8269896
Obtained warning informed us that analyzed genes have more than 1 threshold value for calculated Youden J statistic. To get all thresholds, ROC analysis should be performed separately for each gene:
# Filter data:
data <- data.dCt.pairwise.F[, colnames(data.dCt.pairwise.F) %in% c("Group", "Sample", "Gene19")]
# Perform analysis:
data_roc <- roc(response = data$Group,
predictor = as.data.frame(data)$Gene19,
levels = c("Before","After"),
smooth = FALSE,
auc = TRUE,
plot = FALSE,
ci = TRUE,
of = "auc",
quiet = TRUE)
# Gain parameters:
parameters <- coords(data_roc,
"best",
ret = c("threshold",
"specificity",
"sensitivity",
"accuracy",
"ppv",
"npv",
"youden"))
parameters
#> threshold specificity sensitivity accuracy ppv npv youden
#> 1 4.5130 0.9411765 0.8823529 0.9117647 0.9375 0.8888889 1.823529
#> 2 4.5975 1.0000000 0.8235294 0.9117647 1.0000 0.8500000 1.823529
# Gain AUC
data_roc$auc
#> Area under the curve: 0.9585
Logistic regression is a useful method to investigate the impact of
the analysed variable on the odds of the occurrence of the studied
experimental condition. In the RQdeltaCT
package,
log_reg()
function allows to calculate for each gene a
chances (odds ratio, OR) of being included in the study group when gene
expression level increases by one unit (suitable for non-transformed
data) or by mean of expression levels (more suitable for transformed
data). This function returns a plot and table with the calculated
parameters (OR, confidence interval, intercept, coefficient, and p
values).
library(oddsratio)
# Remember to set the increment parameter.
log.reg.results <- log_reg(data = data.dCt.pairwise.F,
increment = 1,
sel.Gene = c("Gene8","Gene19"),
group.study = "After",
group.ref = "Before",
log.axis = TRUE)
log.reg.results[[2]]
#> Gene oddsratio CI_low CI_high Increment Intercept coeficient p_intercept
#> 1 Gene19 37.769 5.552 1304.777 1 -15.769827 3.631485 0.006397974
#> 2 Gene8 4.604 1.858 14.908 1 1.095075 1.526832 0.049469132
#> p_coef
#> 1 0.005683362
#> 2 0.003213778
Increase in dCt values by 1 (two-fold decrease in expression) gives the higher chance to be in the “After” group.
sessionInfo()
#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=Polish_Poland.utf8
#> [3] LC_MONETARY=Polish_Poland.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Polish_Poland.utf8
#>
#> time zone: Europe/Warsaw
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pheatmap_1.0.12 oddsratio_2.0.1 pROC_1.18.5 ggpmisc_0.5.5
#> [5] ggpp_0.5.6 corrplot_0.92 Hmisc_5.1-1 ggsignif_0.6.4
#> [9] coin_1.4-3 survival_3.5-7 ctrlGene_1.0.1 lubridate_1.9.3
#> [13] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
#> [17] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.5.0
#> [21] tidyverse_2.0.0 RQdeltaCT_1.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 viridisLite_0.4.2 farver_2.1.1 libcoin_1.0-10
#> [5] fastmap_1.1.1 TH.data_1.1-2 GGally_2.2.1 digest_0.6.34
#> [9] rpart_4.1.23 timechange_0.2.0 lifecycle_1.0.4 cluster_2.1.6
#> [13] magrittr_2.0.3 compiler_4.3.2 rlang_1.1.1 sass_0.4.8
#> [17] tools_4.3.2 utf8_1.2.3 yaml_2.3.8 data.table_1.15.2
#> [21] knitr_1.45 labeling_0.4.3 htmlwidgets_1.6.4 plyr_1.8.9
#> [25] RColorBrewer_1.1-3 multcomp_1.4-25 withr_3.0.0 foreign_0.8-86
#> [29] nnet_7.3-19 grid_4.3.2 stats4_4.3.2 fansi_1.0.5
#> [33] colorspace_2.1-0 scales_1.3.0 MASS_7.3-60 cli_3.6.1
#> [37] mvtnorm_1.2-4 rmarkdown_2.26 generics_0.1.3 rstudioapi_0.15.0
#> [41] tzdb_0.4.0 polynom_1.4-1 cachem_1.0.8 modeltools_0.2-23
#> [45] splines_4.3.2 parallel_4.3.2 matrixStats_1.0.0 base64enc_0.1-3
#> [49] vctrs_0.6.3 Matrix_1.6-4 sandwich_3.1-0 jsonlite_1.8.8
#> [53] SparseM_1.81 confintr_1.0.2 hms_1.1.3 Formula_1.2-5
#> [57] htmlTable_2.4.2 jquerylib_0.1.4 glue_1.6.2 ggstats_0.5.1
#> [61] codetools_0.2-19 stringi_1.7.12 gtable_0.3.4 munsell_0.5.0
#> [65] pillar_1.9.0 htmltools_0.5.7 quantreg_5.97 R6_2.5.1
#> [69] evaluate_0.23 lattice_0.21-9 highr_0.10 backports_1.4.1
#> [73] bslib_0.6.1 MatrixModels_0.5-3 Rcpp_1.0.12 gridExtra_2.3
#> [77] nlme_3.1-164 checkmate_2.3.1 mgcv_1.9-1 xfun_0.42
#> [81] zoo_1.8-12 pkgconfig_2.0.3