Real time qPCR data analysis using R

Introduction

I will be using the Tidyverse package in R to analyse and plot a simple RT-qPCR experiment data. In this experiment, I have ectopically-expressed two proteins namely NKG2A and NKG2C (along with their binding partner CD94) in HEK293 cells. I want to assess the expression status of these genes in the HEK293 cells relative to GAPDH using the \(\Delta\)\(\Delta\)C_T method. The experiment was run in a C1000 Touch Thermal Cycler instrument. The csv file can be found here.

The \(\Delta\)\(\Delta\)C_T method for gene expression analysis

The \(\Delta\)\(\Delta\)C_T method (also known as the 2^–∆∆C_T method) is a simple formula used to calculate the relative fold gene expression of samples when performing RT-qPCR. The method was developed by Kenneth Livak and Thomas Schmittgen in 2001 and is widely used (Livak & Schmittgen, 2001).

C_T (or C_Q) stands for the Cycle Threshold (or Cycle Quantification) of a sample. This is given after the qPCR reaction by the software. Simply put, it is the cycle number where the fluorescence generated by the PCR product is distinguishable from the background fluorescence signal.

The symbol \(\Delta\) (Delta) is a mathematical term used to describe the difference between two values.

Let’s see how we can calculate the \(\Delta\)\(\Delta\)C_T:

\(\Delta\)\(\Delta\)C_T = \(\Delta\)C_T (treated sample) – \(\Delta\)C_T (untreated sample)

Essentially, \(\Delta\)\(\Delta\)C_T is the difference between the \(\Delta\)C_T values of the treated/experimental sample and the untreated/control sample. But what does \(\Delta\)C_T refer to?

\(\Delta\)C_T = C_T (gene of interest) – C_T (housekeeping gene)

Basically, \(\Delta\)C_T is the difference in \(\Delta\)C_T values for your gene of interest and your housekeeping gene for a given sample. This is to essentially normalise the gene of interest to a gene which is not affected by your experiment.

Finally, to work out the fold gene expression (F_C) we need to do 2 to the power of negative \(\Delta\)\(\Delta\)C_T:

F_C = 2^{(–\(\Delta\)\(\Delta\)C_T)}

Let us move to the fun part now!

Data Analysis

Install and load Tidyverse package

We will use the Tidyverse package for the data analysis.

Install the Tidyverse package:

install.packages("tidyverse")

Load the Tidyverse package into R Studio environment:

library(tidyverse)

Importing the data

Now we will import the qPCR data (a csv document in this case) into the R Studio environment using the read.csv function and store it as a dataframe (df) called qpcr.

qpcr <- read.csv("qPCR_HEKs_OE_NKG2A-C_human.csv", stringsAsFactors = FALSE)
tibble(qpcr)

## # A tibble: 60 × 15
##    Well  Fluor Target Content Sample  Biological.Set.…    Cq Cq.Mean Cq.Std..Dev
##    <chr> <chr> <chr>  <chr>   <chr>   <chr>            <dbl>   <dbl>       <int>
##  1 A01   SYBR  GAPDH  Unkn    Sample… empty vector      14.5    14.5           0
##  2 A02   SYBR  NKG2A  Unkn    Sample… empty vector      20.0    20.0           0
##  3 A03   SYBR  NKG2C  Unkn    Sample… empty vector      17.5    17.5           0
##  4 A04   SYBR  CD94   Unkn    Sample… empty vector      19.8    19.8           0
##  5 A05   SYBR  GAPDH  Unkn    Sample… pcDNA-NKG2A       14.5    14.5           0
##  6 A06   SYBR  NKG2A  Unkn    Sample… pcDNA-NKG2A       14.3    14.3           0
##  7 A07   SYBR  NKG2C  Unkn    Sample… pcDNA-NKG2A       19.2    19.2           0
##  8 A08   SYBR  CD94   Unkn    Sample… pcDNA-NKG2A       14.4    14.4           0
##  9 A09   SYBR  GAPDH  Unkn    Sample… pcDNA-NKG2C       14.5    14.5           0
## 10 A10   SYBR  NKG2A  Unkn    Sample… pcDNA-NKG2C       22.3    22.3           0
## # … with 50 more rows, and 6 more variables: Starting.Quantity..SQ. <dbl>,
## #   Log.Starting.Quantity <dbl>, SQ.Mean <dbl>, SQ.Std..Dev <int>,
## #   Set.Point <int>, Well.Note <lgl>

There are 15 columns (variables) in our data but we only need a some them for our data analysis. Therefore, using the select function we will extract the columns we need and store them in a new df called tidyData.

tidyData <- qpcr %>% select(Target:Cq, -Content)
glimpse(tidyData)

## Rows: 60
## Columns: 4
## $ Target              <chr> "GAPDH", "NKG2A", "NKG2C", "CD94", "GAPDH", "NKG2A…
## $ Sample              <chr> "Sample 1", "Sample 1", "Sample 1", "Sample 1", "S…
## $ Biological.Set.Name <chr> "empty vector", "empty vector", "empty vector", "e…
## $ Cq                  <dbl> 14.45711, 20.02642, 17.51733, 19.79295, 14.48283, …

Lets look at the 4 columns now:

Target: This variable stores the different target genes which in this case are GAPDH, NKG2A, NKG2B, and CD94.
Sample: These are different biological replicates. There are 5 replicates. There are no technical replicates for this experiment.
Biological.Set.Name: This variable stores the three different conditions. HEK cells are transfected with empty vector (empty vector), NKG2A/ CD94 (pcDNA-NKG2A), and NKG2C/ CD94 (pcDNA-NKG2C).
C_q: The C_q (or C_T) value for each condition. From here on till the end I will be using C_T and C_Q interchangeably.

Calculate \(\Delta\)C_T

The \(\Delta\)C_T is the differnce between the C_T of the housekeeping gene and the target gene. To calculate \(\Delta\)C_T we need to determine the average C_T of GAPDH for three conditions (empty vector, pcDNA-NKG2A, and pcDNA-NKG2C).

We will use the filter function to extract and store the C_T values of GAPDH in a new df called gapdh_only. Then we will use the group_by, summarise, and mean() functions to calculate mean C_Ts for the three conditions and store the values in another df called avg_gapdh.

%>% is called the pipe operator. This special operational function available under the magrittr and dplyr package (basically developed under magrittr), which allows us to pass the result of one function/argument to the other one in sequence.

gapdh_only <- tidyData %>% filter(Target == "GAPDH")
avg_gapdh <- gapdh_only %>% group_by(Biological.Set.Name) %>% summarize(ref_cq = mean(Cq))
avg_gapdh

## # A tibble: 3 × 2
##   Biological.Set.Name ref_cq
##   <chr>                <dbl>
## 1 empty vector          14.5
## 2 pcDNA-NKG2A           14.5
## 3 pcDNA-NKG2C           14.5

We have stored the mean GAPDH C_T value as ref_cq.
We will subtract the ref_cq from the Cq of the other target genes. Let’s take NKG2A as an example and see a step-by-step process:

First, we will extract the NKG2A target gene data for all the conditions and store it in nkg2a_only df.

nkg2a_only <- tidyData %>% filter(Target == "NKG2A")
head(nkg2a_only)

##   Target   Sample Biological.Set.Name       Cq
## 1  NKG2A Sample 1        empty vector 20.02642
## 2  NKG2A Sample 1         pcDNA-NKG2A 14.33777
## 3  NKG2A Sample 1         pcDNA-NKG2C 22.30218
## 4  NKG2A Sample 2        empty vector 20.04060
## 5  NKG2A Sample 2         pcDNA-NKG2A 14.40104
## 6  NKG2A Sample 2         pcDNA-NKG2C 22.33502

Second, we will join the avg_gapdh & nkg2a_only dfs using left_join:

nkg2a_gapdh <- left_join(nkg2a_only, avg_gapdh, by = "Biological.Set.Name")
head(nkg2a_gapdh)

##   Target   Sample Biological.Set.Name       Cq   ref_cq
## 1  NKG2A Sample 1        empty vector 20.02642 14.45323
## 2  NKG2A Sample 1         pcDNA-NKG2A 14.33777 14.47301
## 3  NKG2A Sample 1         pcDNA-NKG2C 22.30218 14.45229
## 4  NKG2A Sample 2        empty vector 20.04060 14.45323
## 5  NKG2A Sample 2         pcDNA-NKG2A 14.40104 14.47301
## 6  NKG2A Sample 2         pcDNA-NKG2C 22.33502 14.45229

Finally, we will calculate the \(\Delta\)C_T for the NKG2A dataset and store the values in nkg2a_gapdh df. The mutate function basically calculates del_cq values and adds a new column (variable) to the existing df.

nkg2a_gapdh <- nkg2a_gapdh %>% mutate(del_cq = ref_cq - Cq)
head(nkg2a_gapdh)

##   Target   Sample Biological.Set.Name       Cq   ref_cq      del_cq
## 1  NKG2A Sample 1        empty vector 20.02642 14.45323 -5.57319533
## 2  NKG2A Sample 1         pcDNA-NKG2A 14.33777 14.47301  0.13524120
## 3  NKG2A Sample 1         pcDNA-NKG2C 22.30218 14.45229 -7.84988870
## 4  NKG2A Sample 2        empty vector 20.04060 14.45323 -5.58737658
## 5  NKG2A Sample 2         pcDNA-NKG2A 14.40104 14.47301  0.07196972
## 6  NKG2A Sample 2         pcDNA-NKG2C 22.33502 14.45229 -7.88272978

Please feel free to use the dataset to calculate the \(\Delta\)C_T for NKG2C and CD94 genes.

Calculate \(\Delta\)\(\Delta\)C_T

The \(\Delta\)\(\Delta\)C_T is the difference between the \(\Delta\)C_T values of treated/experimental sample and the untreated/control sample.

First, we calculate mean C_T for “empty vector” from nkg2a_gapdh df and store it in the nkg2a_empty variable.

nkg2a_empty <- nkg2a_gapdh %>% filter(Biological.Set.Name == "empty vector") %>% summarise(mean_empty = mean(del_cq)) %>% pull(mean_empty)
nkg2a_empty

## [1] -5.592248

Second, we calculate the \(\Delta\)\(\Delta\)C_T values and append the values to nkg2a_gapdh df.

nkg2a_gapdh <- nkg2a_gapdh %>% mutate(del_del_cq = del_cq - nkg2a_empty)
head(nkg2a_gapdh)

##   Target   Sample Biological.Set.Name       Cq   ref_cq      del_cq  del_del_cq
## 1  NKG2A Sample 1        empty vector 20.02642 14.45323 -5.57319533  0.01905276
## 2  NKG2A Sample 1         pcDNA-NKG2A 14.33777 14.47301  0.13524120  5.72748929
## 3  NKG2A Sample 1         pcDNA-NKG2C 22.30218 14.45229 -7.84988870 -2.25764061
## 4  NKG2A Sample 2        empty vector 20.04060 14.45323 -5.58737658  0.00487151
## 5  NKG2A Sample 2         pcDNA-NKG2A 14.40104 14.47301  0.07196972  5.66421781
## 6  NKG2A Sample 2         pcDNA-NKG2C 22.33502 14.45229 -7.88272978 -2.29048169

Next, we can calculate the fold change gene expression for each condition using the F_C formula above.

nkg2a_geneExp <- nkg2a_gapdh %>% mutate(geneExp = 2^(del_del_cq))
head(nkg2a_geneExp)

##   Target   Sample Biological.Set.Name       Cq   ref_cq      del_cq  del_del_cq
## 1  NKG2A Sample 1        empty vector 20.02642 14.45323 -5.57319533  0.01905276
## 2  NKG2A Sample 1         pcDNA-NKG2A 14.33777 14.47301  0.13524120  5.72748929
## 3  NKG2A Sample 1         pcDNA-NKG2C 22.30218 14.45229 -7.84988870 -2.25764061
## 4  NKG2A Sample 2        empty vector 20.04060 14.45323 -5.58737658  0.00487151
## 5  NKG2A Sample 2         pcDNA-NKG2A 14.40104 14.47301  0.07196972  5.66421781
## 6  NKG2A Sample 2         pcDNA-NKG2C 22.33502 14.45229 -7.88272978 -2.29048169
##      geneExp
## 1  1.0132940
## 2 52.9841626
## 3  0.2091137
## 4  1.0033824
## 5 50.7106832
## 6  0.2044073

Finally, let us calculate statistical parameters using mean() and sd() for the data:

nkg2a_geneExp_final <- nkg2a_geneExp %>% group_by(Biological.Set.Name) %>% summarise(mean_geneExp = mean(geneExp), sd_geneExp = sd(geneExp)) %>% mutate(Target = c("NKG2A", "NKG2A", "NKG2A"))
head(nkg2a_geneExp_final)

## # A tibble: 3 × 4
##   Biological.Set.Name mean_geneExp sd_geneExp Target
##   <chr>                      <dbl>      <dbl> <chr> 
## 1 empty vector               1.00     0.0211  NKG2A 
## 2 pcDNA-NKG2A               51.0      1.13    NKG2A 
## 3 pcDNA-NKG2C                0.211    0.00576 NKG2A

Please feel free to use the dataset to calculate the \(\Delta\)\(\Delta\)C_T for NKG2C and CD94 genes.

Lets move on to the visualization of our data.

Data Visualization

Before plotting the data let us make a single df. We will use full_join to accomplish this and make a new df called all which contains the analysis for individual targets namely NKG2C (nkg2c_geneExp_final) and CD94 (cd94_geneExp_final).

all <- full_join(nkg2a_geneExp_final, nkg2c_geneExp_final)
all <- full_join(all, cd94_geneExp_final)

all

## # A tibble: 9 × 4
##   Biological.Set.Name mean_geneExp sd_geneExp Target
##   <chr>                      <dbl>      <dbl> <chr> 
## 1 empty vector               1.00     0.0211  NKG2A 
## 2 pcDNA-NKG2A               51.0      1.13    NKG2A 
## 3 pcDNA-NKG2C                0.211    0.00576 NKG2A 
## 4 empty vector               1.00     0.0727  NKG2C 
## 5 pcDNA-NKG2A                0.266    0.0564  NKG2C 
## 6 pcDNA-NKG2C                8.55     0.0526  NKG2C 
## 7 empty vector               1.02     0.241   CD94  
## 8 pcDNA-NKG2A               35.2      0.179   CD94  
## 9 pcDNA-NKG2C               34.8      0.109   CD94

Let us now use ggplot package to plot a bar plot for the RT-qPCR data.

final_plot <- ggplot(all, aes(x = Biological.Set.Name, y = mean_geneExp, fill = Target)) +
  geom_bar(stat = "identity", width = 0.5, position = position_dodge(0.6), color = "#434343") +
  geom_errorbar(aes(ymin = mean_geneExp - sd_geneExp, ymax = mean_geneExp + sd_geneExp), width=.2, position = position_dodge(0.6)) +
  theme_bw()+
  labs(title = "Human NKG2A/C-CD94 overexpression status in HEK293 cells",
       subtitle = "2021.11.23 Shubham Dutta",
       x = "Transfection (Human genes)",
       y = "Expression relative to GAPDH",
       fill = "Target gene",
       caption = NULL)+
  theme(plot.title=element_text(size=13),
        axis.text=element_text(size=12, color = "#434343"),
        axis.title=element_text(size=13),
        legend.text=element_text(size=10),
        legend.title=element_text(size=12),
        legend.position = c(0.2, 0.8),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())+
  scale_fill_manual(values = c("#d55e00", "#0072b2", "#009e73")) # color blind palette

final_plot

Exporting the plot

The plot can be exported as a png file like so:

ggsave(filename = "qpcr.png", final_plot, dpi = 600)

There are other formats that can be used as well. Please read the documentation here.

References

Kenneth J. Livak, Thomas D. Schmittgen, Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2−ΔΔCT Method, Methods, Volume 25, Issue 4, 2001, Pages 402-408, ISSN 1046-2023, https://doi.org/10.1006/meth.2001.1262.

Real time qPCR data analysis using R

Shubham Dutta

11/24/2021