Skip to main content

Introduction to RNA-seq Analysis



In this workshop, we will provide an overview of RNA-sequencing (RNA-seq) technology and best practices for experimental design. Participants will also explore preprocessing steps, read alignment ( to a reference genome), and differential expression analysis using DESeq2. We will also introduce tools for functional analysis and determining the biological relevance of the results.

Tutorial Setup Instructions

Steps to prepare for the tutorial session:

  • Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
  • Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
  • Request resources on a compute node by running the following command.

    srun --reservation=wk3_workshop -A scinet_workshop2 -t 05:00:00 -N 1 –c 8  --pty bash   
    
    • For those accessing post-workshop: Remove --reservation=wk3_workshop and change -A scinet_workshop2 to your own project account. For additional information, see our SLURM guide.
  • Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.

    mkdir -p /90daydata/shared/$USER/intro_rnaseq 
    cd /90daydata/shared/$USER/intro_rnaseq  
    cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/*.sh .
    cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/DESeq2.R .
    

RNA-Seq Data Analysis

by Sivanandan Chudalayandi and Lavida Rogers

RNA-seq experiments are performed with an aim to comprehend transcriptomic changes in organisms in response to a certain treatment. They are also designed to understand the cause and/or effect of a mutation by measuring the resulting gene expression changes. Robust algorithms specifically designed to map short stretches of nucleotide sequences to a genome while being aware of the process of RNA splicing have led to many advances in RNAseq analysis. This tutorial will guide you through basic RNAseq analysis, beginning at quality checking of the RNAseq reads through to getting the differential gene expression results. We have downloaded an Arabidopsis dataset from NCBI for this purpose. Check the BioProject page for more information.

Background Information: Experimental design

This experiment compares wild type (WT) and atrx-1 mutant to analyze how the loss of function of ATRX protein results in changes in gene expression. The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression. RNA was isolated from three WT replicates and three mutant replicates. The transcriptome was enriched/isolated using the plant RiboMinus kit for obtaining total RNA. RNA-seq was carried out on an Illumina Hiseq 2500. The sequencing reads are paired-end data, hence we have 2 files per replicate.

Condition replicate 1 replicate 2 replicate 3

WT

SRR4420293_1.fastq.gz
SRR4420293_2.fastq.gz

SRR4420294_1.fastq.gz
SRR4420294_2.fastq.gz

SRR4420295_1.fastq.gz
SRR4420295_2.fastq.gz

atrx-1

SRR4420296_1.fastq.gz
SRR4420296_2.fastq.gz

SRR4420297_1.fastq.gz
SRR4420297_2.fastq.gz

SRR4420298_1.fastq.gz
SRR4420298_2.fastq.gz

  1. Getting Data

    The Raw sequence data

    The files needed for this tutorial have already been downloaded. This will allow us to get started right away with our analyses without waiting on downloads or dealing with connectivity issues. However, for your own work, you may need to download data from public repositories. There are two common tools used for this: wget and curl. While this is not needed for the tutorial, here’s how you would download files yourself:

    • Using wget:
      • wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
    • Using curl
      • curl - 0 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
        -0 capital O: tells curl to save the file with its original name.

    The Reference Genome:

    We will also need the genome file and associated GTF/GFF file for Arabidopsis. These can be downloaded directly from NCBI, or plants Ensembl website, or the Phytozome website*.
    *The Phytozome needs logging in and selecting the files.

    For this tutorial, we have already downloaded the following files from NCBI:

    • You can use wget to fetch both the Genome (Fasta) file and the Annotation (GFF) file: For this tutorial, the steps below are optional

      mkdir 00_Genome
      cd 00_Genome
      wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz 
      wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz 
      
    • We unzipped these files so that they can be used later in our analysis pipeline. To do this we used gunzip

      gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz 
      gunzip GCF_000001735.4_TAIR10.1_genomic.gtf.gz 
      
  2. Quality Check

    We use fastqc, a tool that provides a simple way to do quality control checks on raw sequence data coming from high-throughput sequencing pipelines (FastQC).

    • Script for fastqc (already included)
      Ensure that you are in /90daydata/shared/$USER/intro_rnaseq cat fastqc.sh

      #!/bin/bash 
      #SBATCH -N1 
      #SBATCH -c8 
      #SBATCH -J fastqc  
      #SBATCH -o fastqc_%j.out 
      #SBATCH -e fastqc_%j.err 
      #SBATCH -t 01:00:00 
      
      # Load required modules 
      module load fastqc 
      module load multiqc 
      
      # Define Directories: 
      RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/" 
      OUT_DIR="$SLURM_SUBMIT_DIR/01_FastqC" 
      # make output directories 
      mkdir -p $OUT_DIR 
      # The main command 
      fastqc -t 8 -o $OUT_DIR $RAW_DIR/*gz 
      multiqc -p -o 01a_MultiQC/ $OUT_DIR --title "Arabidopsis_RNAseq" 
      
    • Submit the script

      sbatch fastqc.sh  
      

    What to do if your reads need trimming?

    In this section of the tutorial we will use a different set of reads to demonstrate what you would do if, after quality check, you decided to trim your reads. Read trimming is the process of removing poor-quality bases before downstream analysis. and there are a few tools which helps with this.

    When would you trim?

    • To remove bases of poor quality
    • To remove adapter sequences
    • Reads for demonstrating trimming

    The reads for this demonstration are from the ENA. We will use BBDUK: Decontamination using kmers (BBDUK Guide)

    FFPE="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/Poor_Data/FFPE" 
    
    module load bbtools
    
    mkdir Trimmed
    # Define adapters variables 
    adapters="/software/el9/apps/bbtools/39.01/resources/adapters.fa"
    # Run bbduk:
    
    bbduk.sh in1=SRR12844918_1.fastq.gz in2=SRR12844918_2.fastq.gz \
    out1=Trimmed/SRR12844918_1.fastq.gz out2=Trimmed/SRR12844918_2.fastq.gz \
    ref=$adapters ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=20 minlength=50 tpe tbo
    

    Explanation of options:

    • ref=adapters.fa: Reference file containing adapter sequences to be trimmed.
    • ktrim=r: Trim adapters from the right end of reads.
    • k=23: K-mer length for matching adapters.
    • mink=11: Minimum k-mer length for adapter matching.
    • hdist=1: Allow one mismatch in k-mer matching.
    • qtrim=rl: Trim both ends of reads based on quality.
    • trimq=20: Quality threshold for trimming.
    • minlength=50: Discard reads shorter than 50 bases after trimming.
    • tpe: Trim both reads of a pair if one is trimmed.
    • tbo: Trim adapters based on pair overlap detection.

    Run a quality check on the timmed reads

    We will run this quickly on the interactive node.

    cd Trimmed 
    
    mkdir Fastqc 
    mkdir Multiqc 
    
    time fastqc -o Fastqc/ -t 2 SRR12844918_* 
    time multiqc -p -o Multiqc/ Fastqc --title "SRR12844918_Trim_Quality 
    
  3. Mapping reads to the genome using HISAT2

    There are several mapping programs available for aligning RNAseq reads to the genome. Generic aligners such as BWA, bowtie2, BBMap, etc., are not suitable for mapping RNAseq reads because they are not splice-aware. RNAseq reads are mRNA reads that only contain exonic regions, hence mapping them back to the genome requires splitting the individual reads that span an intron. It can be done only by splice-aware mappers. In this tutorial, we will use HISAT2, a successor of Tophat2.

    Build genome index

    For HiSat2 mapping, you first need to index the genome and then use the read pairs to map the indexed genome (one set at a time). For indexing the genome, we use the hisat2-build command. We can run this in an interactive node.

    • To go back to the main working directory:

      cd /90daydata/shared/$USER/intro_rnaseq
      
    • Run hisat2-build:

      DATA_DIR=/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/
      { time hisat2-build -p8 $DATA_DIR/GCF_000001735.4_TAIR10.1_genomic.fna 02_Hisat2/Arabidopsis_TAIR10.1 >& hs2build.log; } 2> hs2build.time &
      

    Once complete, you should see several files with the .ht2 extension. These are the index files.

    Mapping Reads to the index:

    We will do this using parallel within a Slurm batch script. First we will need two executablescripts (included):

    • Run Hisat2
      cat run_hisat2.sh
      #!/bin/bash 
      R1=$1 
      R2=$2 
      SAMPLE=$(basename "$R1" _1.fastq.gz) 
      hisat2 -p 4 -x Arabidopsis_TAIR10.1 -1 "$R1" -2 "$R2" 2> LOG/"${SAMPLE}".log -S SAM/"${SAMPLE}".sam 
      
    • Run samtools
      cat run_samtools.sh
      #!/bin/bash 
      # Usage: ./run_samtools.sh SAMFILE 
      SAMFILE=$1 
      SAMPLE=$(basename $SAMFILE .sam) 
      samtools view -@ 7 -bS $SAMFILE | samtools sort -@ 7 -o BAM/${SAMPLE}.bam 
      

    The two scripts above will be run within a slurm batch script (also included)

    • cat hisat2_samtools_slurm.sh
      #!/bin/bash 
      #SBATCH -N1 
      #SBATCH -c16 
      #SBATCH -J hisat2  
      #SBATCH -o slurm_logs/hisat2_%j.out 
      #SBATCH -e slurm_logs/hisat2_%j.err 
      #SBATCH -t 08:00:00 
      
      # Load required modules 
      module load hisat2 
      module load samtools 
      module load parallel 
      
      # Define Directories: 
      RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/" 
      INDEX_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/02_Hisat2/" 
      OUT_DIR="$SLURM_SUBMIT_DIR/02_Hisat2" 
      mkdir -p $OUT_DIR 
      
      # change to compute node's local dir 
      cd $TMPDIR 
      
      # Set Permissions 
      chgrp -R proj-scinet_workshop2 $TMPDIR 
      chmod -R g+s $TMPDIR 
      
      # make output directories 
      mkdir SAM 
      mkdir BAM 
      mkdir LOG 
      
      # copy input data and script to the node's scratch dir 
      cp --preserve=ownership $RAW_DIR/*gz . 
      cp --preserve=ownership $INDEX_DIR/Arabidopsis_TAIR10.1*h* .  
      cp --preserve=ownership $SLURM_SUBMIT_DIR/run_hisat2.sh . 
      cp --preserve=ownership $SLURM_SUBMIT_DIR/run_samtools.sh . 
      
      # Check if folders are in TMPDIR 
      echo "Files in TMPDIR:"
      find . -type f 
      
      # The main command (run the two executable scripts) 
      parallel -j 4 "./run_hisat2.sh {1} {2}" ::: *_1.fastq.gz :::+ *_2.fastq.gz 
      parallel -j 4 "./run_samtools.sh {}" ::: SAM/*.sam 
      
      # Move the output directories back to your folder 
      mv BAM $OUT_DIR 
      mv LOG $OUT_DIR 
      

    Submit the slurm script:
    sbatch hisat2_samtools_slurm.sh

    This will take about 30 minutes to complete.

  4. Abundance Estimation

    For quantifying transcript abundance from RNA-seq data, there are many programs available. The two most popular tools are featureCounts and HTSeq. We will need a file with aligned sequencing reads (SAM/BAM files generated in previous step) and a list of genomic features (from the GFF file). featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. It also outputs statistics for the overall summarization of results, including the number of successfully assigned reads and the number of reads that failed to be assigned due to various reasons. We can run featureCounts on all SAM/BAM files at the same time or individually.

    pwd cd /90daydata/shared/$USER/intro_rnaseq

    You will need subread and parallel modules loaded. They are included in the script below:

    #!/bin/bash 
    #SBATCH -N 1 
    #SBATCH -c 16 
    #SBATCH -t 01:00:00 
    #SBATCH -J FeatCounts 
    #SBATCH -o slurm_logs/FeatCounts.%j.out 
    #SBATCH -e slurm_logs/FeatCounts.%j.err 
    
    # Load required modules 
    module purge 
    module load subread 
    module load parallel 
    
    ulimit -s unlimited 
    
    # Define inputs 
     ANNOT_GTF="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/GCF_000001735.4_TAIR10.1_genomic.gtf" 
    INDIR="02_Hisat2/BAM" 
    OUTDIR="03_FeatCounts" 
    [[ -d ${OUTDIR} ]] || mkdir -p ${OUTDIR} 
    
    # FeatureCounts command 
    parallel -j 4 "featureCounts -T 4 -s 2 -p --countReadPairs -t gene -g gene_id -a ${ANNOT_GFF} -o ${OUTDIR}/{/.}.gene.txt {}" ::: ${INDIR}/*.bam 
    
    scontrol show job ${SLURM_JOB_ID} 
    

    Submit the batch script:

    sbatch featureCounts_slurm_script.sh  
    

    This creates the following set of files in the specified output folder:

    • Count Files:
      SRR4420298.gene.txt 
      SRR4420293.gene.txt 
      SRR4420297.gene.txt 
      SRR4420296.gene.txt 
      SRR4420295.gene.txt 
      SRR4420294.gene.txt 
      

      Each file has a commented line starting with a # which gives the command used to create the count table and the relevant seven columns as follows, for example:

      • head SRR4420298.gene.txt
        # Program:featureCounts v1.5.2; Command:"featureCounts" "-T" "4" "-s" "2" "-p" "-t" "gene" "-g" "ID" "-a" "/path/to/GCF_000001735.3_TAIR10_genomic.gff" "-o" "/path/to/counts/SRR4420298.gene.txt" "SRR4420298.bam" 
        
        Geneid  Chr     Start   End     Strand  Length  SRR4420298.bam 
        gene0   NC_003070.9     3631    5899    +       2269    13 
        gene1   NC_003070.9     6788    9130    -       2343    17 
        gene2   NC_003070.9     11101   11372   +       272     0 
        gene3   NC_003070.9     11649   13714   -       2066    13 
        gene4   NC_003070.9     23121   31227   +       8107    32 
        gene5   NC_003070.9     23312   24099   -       788     0 
        gene6   NC_003070.9     28500   28706   +       207     0 
        gene7   NC_003070.9     31170   33171   -       2002    45 
        
    • Additionally, Summary Files are produced. These give the summary of reads that were either ambiguous, multi mapped, mapped to no features, or unmapped among other statistics. We can refer to these to further tweak our analyses etc.
      SRR4420298.gene.txt.summary 
      SRR4420293.gene.txt.summary 
      SRR4420295.gene.txt.summary 
      SRR4420296.gene.txt.summary 
      SRR4420294.gene.txt.summary 
      SRR4420297.gene.txt.summary 
      

    Using the following commands, we can produce a single count table. This count table will be loaded into R for differential expression analysis.

    • From 03_FeatCounts edit the count files “in-place”

      for f in *.txt  
        do sed -i '/^#/d' "$f" 
            sed -i 's|02_Hisat2/BAM/||g' "$f" 
            sed -i 's|\.bam||g' "$f" 
            done 
      
    • Make directory called Processed and make a combined count table with the first column as “Gene_IDs” and 6 other columns one for each sample’s gene counts.

      mkdir Processed  
      awk 'BEGIN {OFS="\t"} {print $1,$7}' SRR4420293.gene.txt > col1.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' SRR4420294.gene.txt > col2.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' SRR4420295.gene.txt > col3.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' SRR4420296.gene.txt > col4.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' SRR4420297.gene.txt > col5.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' SRR4420298.gene.txt > col6.txt 
      paste col1.txt col2.txt col3.txt col4.txt col5.txt col6.txt > Arabidopsis_RNAseq_Counts.txt 
      

      head Arabidopsis_RNAseq_Counts.txt

    • The Combined count file… a sneak peek:

      Geneid  SRR4420293  SRR4420294  SRR4420295  SRR4420296  SRR4420297  SRR4420298 
      AT1G01010   11  1   10  28  11  13 
      AT1G01020   37  3   26  88  22  17 
      AT1G03987   0   0   0   0   0   0 
      
    • Convert to and save as CSV file

      sed 's|        |,|g' Arabidopsis_RNAseq_Counts.txt > Arabidopsis_RNAseq_Counts.csv
      

    This file is ready to be imported to R for DESeq2.

  5. Differential Gene Expression Analysis

    • RStudio Server: Back on the main Ceres OOD tab, click on the top or side navigation bar: “Interactive Apps” > “RStudio Server”.
    • Fill the input fields with the following (input fields not listed below can be left at their default values):
      • Account: scinet_workshop2
      • Queue: ceres
      • QOS: 400thread
      • R Version: 4.4.1
      • Number of hours: 4
      • Number of cores: 1
      • Memory required: 8GB
      • Optional Slurm Arguments: (leave empty)
    • Click the “Launch” button.
    • Wait a moment for the job card to update from “Queued” to “Running”.
    • Please confirm that clicking on the “Connect to RStudio Server” button opens a new tab with the RStudio Server interface.

    We have saved the entire R script for DESeq2 (DESeq2.R). The script is partly based on Stephen Turner’s template. We have displayed the R script here largely for reference.

    ## Set working directory in R 
    setwd("/90daydata/shared/$USER/intro_rnaseq") 
    
    ## Get the latest version of Bioconductor 
    if (!require("BiocManager", quietly = TRUE)) 
        install.packages("BiocManager") 
    BiocManager::install(version = "3.20") 
    
    ## Install libraries needed: 
    
    BiocManager::install("DESeq2") 
    library("DESeq2") 
    library(readr) 
    library(DESeq2) 
    
    # Read the data in
    df1 <- read_csv("Arabidopsis_RNAseq_Counts.csv") 
    typeof(df1) 
    df1 
    dim(df1) 
    
    # Saving the gene IDs as a vector 
    AT_genes <- df1[1] 
    df1 <- df1[,-1] 
    
    # Some diagnoses 
    # Total reads per sample 
    colSums(df1) 
    
    # SRR4420293 SRR4420294 SRR4420295 SRR4420296 SRR4420297 SRR4420298  
    # 3703248    1081388    3793357   14588240    4095513    3235033  
    
    # Boxplot of log counts per sample 
    boxplot(log10(df1 + 1), las=2, main="Raw log10(counts+1) per sample", col="lightblue") 
    
    #DESeq2 Pipeline 
    
    #Metadata 
    # Assign condition (first three are WT, next three are mutants) 
    condition <- factor(c(rep("WT",3),rep("Mut",3))) 
    condition=relevel(condition,ref = "WT") 
    
    # Create a metadata frame: its rows correspond to columns of dat (i.e., matrix representing the countData) 
    metadata <- data.frame(row.names=colnames(df1), condition) 
    head(metadata) 
    
    #               condition 
    # SRR4420293        WT 
    # SRR4420294        WT 
    # SRR4420295        WT 
    # SRR4420296       Mut 
    # SRR4420297       Mut 
    # SRR4420298       Mut 
    
    # Create the DeSEQDataSet object  
    # This will store the raw count matirx, sample metadata and experimental design 
    
    dds1 <- DESeqDataSetFromMatrix(countData = df1, colData = metadata,design=~ condition) 
    
    #- countData: gene x sample matrix we created from all_counts.txt 
    # - colData: sample metadata with rownames matching colnames  
    # - design = ~condition: the experimental design formula where ~condition means you want to test for differential gene expression  between the condition variables (e.g treated vs control) 
    
    #Run DESeq2 pipeline 
    
    dds1 <- DESeq(dds1) 
    #check sample info 
    colData(dds1) 
    #check dispersion estimates 
    plotDispEsts(dds1) 
    #extract results  
    head(results(dds1)) 
    #summary 
    summary(results(dds1)) 
    #check available coefficients 
    resultsNames(dds1)  
    # Extract normalized counts 
    norm_counts <- counts(dds1, normalized=TRUE) 
    
    # Boxplot of normalized log counts 
    boxplot(log10(norm_counts + 1), las=2, main="Normalized log10(counts+1)", col="lightgreen") 
    
    #Get differential expression results 
    
    #summary table of results: 
    table(results(dds1,contrast = c("condition", "WT", "Mut"))$padj<=0.05) 
    
    #store results as a variable 
    diff_res <- results(dds1, contrast=c("condition", "WT", "Mut")) 
    
    #filter for significant DEGs 
    res_sig<- diff_res[which(diff_res$padj <0.05 & abs(diff_res$log2FoldChange)>1),] 
    
    # Merge with normalized counts 
    results_data<- as.data.frame(res_sig) 
    results_data$gene <- rownames(results_data) 
    norm_counts_df<- as.data.frame(norm_counts) 
    norm_counts_df$gene <- rownames(norm_counts_df) 
    
    merged_data<- merge(results_data, norm_counts_df, by = "gene") 
    head(merged_data) 
    merged_ordered<- merged_data[order(merged_data$padj, abs(merged_data$log2FoldChange)),] 
    head(merged_ordered) 
    
    # Write Results 
    
    write.csv(merged_ordered, "DESeq2_results_norm_counts.csv") 
    
    # Visualize Results 
    install.packages("pheatmap") 
    BiocManager::install("EnhancedVolcano") 
    install.packages("ggplot2") 
    install.packages("RColorBrewer") 
    
    library(RColorBrewer) 
    library(pheatmap) 
    library(ggplot2) 
    library(EnhancedVolcano) 
    
    #histogram of Log2FC 
    hist(merged_ordered$log2FoldChange, breaks=50, col="skyblue", main="Distribution of Log2FC", xlab="Log2FC") 
    
    hist(log10(rowMeans(norm_counts)+1), breaks=50, 
    
         main = "Distribution of average gene expression (log10)", xlab="log10(normalized count + 1)") 
    
    # MA plot (built-in) 
    
    plotMA(dds1, ylim = c(-5, 5), main = "MA plot") 
    
    # Volcano plot 
    
    res <- results(dds1) 
    res$padj[is.na(res$padj)] <- 1  # Replace NAs 
    res_df <- as.data.frame(res) 
    res_df$gene <- rownames(res_df) 
    res_df$sig <- res_df$padj < 0.05 
    
    ggplot(res_df, aes(log2FoldChange, -log10(padj), color = sig)) + 
      geom_point(alpha = 0.4) + 
      scale_color_manual(values = c("gray", "red")) + 
      theme_minimal() + 
      labs(title = "Volcano plot", x = "log2 Fold Change", y = "-log10 Adjusted p-value") 
      
    
    #Transform data for visualization  
    #variance-stabilized transformation for PCA  
    vsdata<- vst(dds1, blind=FALSE) 
    
    #plot PCA 
    plotPCA(vsdata, intgroup="condition") 
    
    #Better PCA using ggplot2 
    pcaData<-plotPCA(vsdata, intgroup="condition", returnData=TRUE) 
    percentVar <- round(100 * attr(pcaData, "percentVar")) 
    ggplot(pcaData, aes(PC1, PC2, color = condition, label=name)) +  
    geom_point(size =3)+  
    xlab(paste0("PC1: ", percentVar[1], "%"))+ 
    ylab(paste0("PC2: ", percentVar[2], "%")) 
    
    # VOLCANO PLOT  
    
    library(EnhancedVolcano) 
    
    EnhancedVolcano(merged_ordered,  
                    lab=merged_ordered$gene,  
                    x="log2FoldChange", 
                    y="padj",  
                    pCutoff=0.05,  
                    FCcutoff=1) 
    
    # Volcano Plot with Labeled Significant Genes 
    #look at top 10 
    top10 <- head(merged_ordered,10) 
    head(top10) 
    top10$gene[1:10] 
    
    EnhancedVolcano(merged_ordered,  
                    lab=merged_ordered$gene,  
                    x="log2FoldChange", 
                    y="padj",  
                    pCutoff=0.05,  
                    FCcutoff=1, 
                    title='Volcano Plot: Top 10 Significant DEGs', 
                    xlab='Log2FC', 
                    ylab='-Log10 adjpval', 
                    selectLab = top10$gene) 
    
    library(pheatmap) 
    library(RColorBrewer) 
    #select expression columns only  
    top50 <- head(merged_ordered,50) 
    rownames(top50)<-top50$gene 
    expression<- as.matrix(top50[,grep("^SRR", colnames(top50))]) 
    head(expression) 
    
    #Scale each gene across samples 
    exprs_scaled<-t(scale(t(expression))) 
    head(exprs_scaled) 
    
    #Plot heatmap 
    
    pheatmap(exprs_scaled, 
             show_rownames=TRUE, 
             cluster_cols = TRUE, 
             cluster_rows = TRUE, 
             color=colorRampPalette(c("navy","white","firebrick3"))(100)) 
    
    # Functional annotation  
    
    #species: Arabidopsis thaliana (thale cress) 
    
    #get packages  
    BiocManager::install("org.At.tair.db") 
    BiocManager::install(c("AnnotationDbi", "clusterProfiler")) 
    #load libraries 
    library(org.At.tair.db) 
    library(clusterProfiler) 
    
    # get names of the significant genes 
    sig_gene_names<- merged_ordered$gene 
    head(sig_gene_names) 
    
    #convert TAIR IDs to entrez IDs for annotation  
    
    entrez_ids <- mapIds(org.At.tair.db, keys = sig_gene_names, column="ENTREZID", keytype = "TAIR", multiVals = "first") 
    
    #get rid of the missing mappings 
    
    entrez_ids <- na.omit(entrez_ids) 
    
    go_res<- enrichGO(gene = entrez_ids, 
                      OrgDb = org.At.tair.db, 
                      keyType = "ENTREZID", 
                      ont = "BP", #options: BP, MF, CC 
                      pAdjustMethod = "BH", 
                      qvalueCutoff = 0.05, 
                      readable = TRUE) 
    
    go_res 
    
    #Plot results 
    dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 
    
    #Option to relax thresholds: 
    
    go_res<- enrichGO(gene = entrez_ids, 
                      OrgDb = org.At.tair.db, 
                      keyType = "ENTREZID", 
                      ont = "BP", #options: BP, MF, CC 
                      pAdjustMethod = "BH", 
                      qvalueCutoff = 0.2, 
                      pvalueCutoff = 0.1, 
                      readable = TRUE) 
    
    dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 
    
    ### USING SYMBOL instead of using ENTREZ_ID 
    
    symbols <- mapIds(org.At.tair.db, keys = sig_gene_names, column="SYMBOL", keytype = "TAIR", multiVals = "first") 
    
    #get rid of the missing mappings 
    symbols <- na.omit(symbols) 
    
    go_res2<- enrichGO(gene = symbols, 
                      OrgDb = org.At.tair.db, 
                      keyType = "SYMBOL", 
                      ont = "BP", #options: BP, MF, CC 
                      pAdjustMethod = "BH", 
                      qvalueCutoff = 0.05, 
                      readable = TRUE) 
    
    dotplot(go_res2, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 
    
    ### KEGG Enrichment  
    
    #enrichKEGG identifies over represented KEGG pathways for Arabidopsis.  
    
    ekegg<- enrichKEGG(gene = sig_gene_names, 
                       organism = "ath", 
                       pvalueCutoff = 0.05) 
    barplot(ekegg, showCategory = 15, title="KEGG: Arabidopsis thaliana") 
    

Workshop Materials

Workshop materials and recordings available for USDA employees at the links below


    • Tuesday, May 13, 2025, 1 – 5 PM ET