Introduction to RNA-seq Analysis

Provided by: ISU
Registration: Register Here

In this workshop, we will provide an overview of RNA-sequencing (RNA-seq) technology and best practices for experimental design. Participants will also explore preprocessing steps, read alignment ( to a reference genome), and differential expression analysis using DESeq2. We will also introduce tools for functional analysis and determining the biological relevance of the results.

Tutorial Setup Instructions

Steps to prepare for the tutorial session:

Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
Request resources on a compute node by running the following command.
```
srun --reservation=wk3_workshop -A scinet_workshop2 -t 05:00:00 -N 1 –c 8  --pty bash   
```
- For those accessing post-workshop: Remove --reservation=wk3_workshop and change -A scinet_workshop2 to your own project account. For additional information, see our SLURM guide.

Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.

mkdir -p /90daydata/shared/$USER/intro_rnaseq 
cd /90daydata/shared/$USER/intro_rnaseq  
cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/*.sh .
cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/DESeq2.R .

RNA-Seq Data Analysis

by Sivanandan Chudalayandi and Lavida Rogers

RNA-seq experiments are performed with an aim to comprehend transcriptomic changes in organisms in response to a certain treatment. They are also designed to understand the cause and/or effect of a mutation by measuring the resulting gene expression changes. Robust algorithms specifically designed to map short stretches of nucleotide sequences to a genome while being aware of the process of RNA splicing have led to many advances in RNAseq analysis. This tutorial will guide you through basic RNAseq analysis, beginning at quality checking of the RNAseq reads through to getting the differential gene expression results. We have downloaded an Arabidopsis dataset from NCBI for this purpose. Check the BioProject page for more information.

Background Information: Experimental design

This experiment compares wild type (WT) and atrx-1 mutant to analyze how the loss of function of ATRX protein results in changes in gene expression. The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression. RNA was isolated from three WT replicates and three mutant replicates. The transcriptome was enriched/isolated using the plant RiboMinus kit for obtaining total RNA. RNA-seq was carried out on an Illumina Hiseq 2500. The sequencing reads are paired-end data, hence we have 2 files per replicate.

Condition	replicate 1	replicate 2	replicate 3
WT	SRR4420293_1.fastq.gz SRR4420293_2.fastq.gz	SRR4420294_1.fastq.gz SRR4420294_2.fastq.gz	SRR4420295_1.fastq.gz SRR4420295_2.fastq.gz
atrx-1	SRR4420296_1.fastq.gz SRR4420296_2.fastq.gz	SRR4420297_1.fastq.gz SRR4420297_2.fastq.gz	SRR4420298_1.fastq.gz SRR4420298_2.fastq.gz

Getting Data

The Raw sequence data

The files needed for this tutorial have already been downloaded. This will allow us to get started right away with our analyses without waiting on downloads or dealing with connectivity issues. However, for your own work, you may need to download data from public repositories. There are two common tools used for this: wget and curl. While this is not needed for the tutorial, here’s how you would download files yourself:
- Using wget:
  - wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
- Using curl
  - curl - 0 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
    -0 capital O: tells curl to save the file with its original name.
The Reference Genome:

We will also need the genome file and associated GTF/GFF file for Arabidopsis. These can be downloaded directly from NCBI, or plants Ensembl website, or the Phytozome website*.
*The Phytozome needs logging in and selecting the files.

For this tutorial, we have already downloaded the following files from NCBI:
- You can use wget to fetch both the Genome (Fasta) file and the Annotation (GFF) file: For this tutorial, the steps below are optional
```
mkdir 00_Genome
cd 00_Genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz 
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz 
```
- We unzipped these files so that they can be used later in our analysis pipeline. To do this we used gunzip
```
gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz 
gunzip GCF_000001735.4_TAIR10.1_genomic.gtf.gz 
```
Quality Check

We use fastqc, a tool that provides a simple way to do quality control checks on raw sequence data coming from high-throughput sequencing pipelines (FastQC).
- Script for fastqc (already included)
  Ensure that you are in /90daydata/shared/$USER/intro_rnaseq cat fastqc.sh
```
#!/bin/bash 
#SBATCH -N1 
#SBATCH -c8 
#SBATCH -J fastqc  
#SBATCH -o fastqc_%j.out 
#SBATCH -e fastqc_%j.err 
#SBATCH -t 01:00:00 

# Load required modules 
module load fastqc 
module load multiqc 

# Define Directories: 
RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/" 
OUT_DIR="$SLURM_SUBMIT_DIR/01_FastqC" 
# make output directories 
mkdir -p $OUT_DIR 
# The main command 
fastqc -t 8 -o $OUT_DIR $RAW_DIR/*gz 
multiqc -p -o 01a_MultiQC/ $OUT_DIR --title "Arabidopsis_RNAseq" 
```
- Submit the script
```
sbatch fastqc.sh  
```
What to do if your reads need trimming?

In this section of the tutorial we will use a different set of reads to demonstrate what you would do if, after quality check, you decided to trim your reads. Read trimming is the process of removing poor-quality bases before downstream analysis. and there are a few tools which helps with this.

When would you trim?
- To remove bases of poor quality
- To remove adapter sequences
- Reads for demonstrating trimming
The reads for this demonstration are from the ENA. We will use BBDUK: Decontamination using kmers (BBDUK Guide)
```
FFPE="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/Poor_Data/FFPE" 

module load bbtools

mkdir Trimmed
# Define adapters variables 
adapters="/software/el9/apps/bbtools/39.01/resources/adapters.fa"
# Run bbduk:

bbduk.sh in1=SRR12844918_1.fastq.gz in2=SRR12844918_2.fastq.gz \
out1=Trimmed/SRR12844918_1.fastq.gz out2=Trimmed/SRR12844918_2.fastq.gz \
ref=$adapters ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=20 minlength=50 tpe tbo
```
Explanation of options:
- ref=adapters.fa: Reference file containing adapter sequences to be trimmed.
- ktrim=r: Trim adapters from the right end of reads.
- k=23: K-mer length for matching adapters.
- mink=11: Minimum k-mer length for adapter matching.
- hdist=1: Allow one mismatch in k-mer matching.
- qtrim=rl: Trim both ends of reads based on quality.
- trimq=20: Quality threshold for trimming.
- minlength=50: Discard reads shorter than 50 bases after trimming.
- tpe: Trim both reads of a pair if one is trimmed.
- tbo: Trim adapters based on pair overlap detection.
Run a quality check on the timmed reads

We will run this quickly on the interactive node.
```
cd Trimmed 

mkdir Fastqc 
mkdir Multiqc 

time fastqc -o Fastqc/ -t 2 SRR12844918_* 
time multiqc -p -o Multiqc/ Fastqc --title "SRR12844918_Trim_Quality 
```

Mapping reads to the genome using HISAT2

There are several mapping programs available for aligning RNAseq reads to the genome. Generic aligners such as BWA, bowtie2, BBMap, etc., are not suitable for mapping RNAseq reads because they are not splice-aware. RNAseq reads are mRNA reads that only contain exonic regions, hence mapping them back to the genome requires splitting the individual reads that span an intron. It can be done only by splice-aware mappers. In this tutorial, we will use HISAT2, a successor of Tophat2.

Build genome index

For HiSat2 mapping, you first need to index the genome and then use the read pairs to map the indexed genome (one set at a time). For indexing the genome, we use the hisat2-build command. We can run this in an interactive node.

To go back to the main working directory:

cd /90daydata/shared/$USER/intro_rnaseq

Run hisat2-build:

DATA_DIR=/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/
{ time hisat2-build -p8 $DATA_DIR/GCF_000001735.4_TAIR10.1_genomic.fna 02_Hisat2/Arabidopsis_TAIR10.1 >& hs2build.log; } 2> hs2build.time &

Once complete, you should see several files with the .ht2 extension. These are the index files.

Mapping Reads to the index:

We will do this using parallel within a Slurm batch script. First we will need two executablescripts (included):

Run Hisat2
cat run_hisat2.sh

#!/bin/bash 
R1=$1 
R2=$2 
SAMPLE=$(basename "$R1" _1.fastq.gz) 
hisat2 -p 4 -x Arabidopsis_TAIR10.1 -1 "$R1" -2 "$R2" 2> LOG/"${SAMPLE}".log -S SAM/"${SAMPLE}".sam 

Run samtools
cat run_samtools.sh

#!/bin/bash 
# Usage: ./run_samtools.sh SAMFILE 
SAMFILE=$1 
SAMPLE=$(basename $SAMFILE .sam) 
samtools view -@ 7 -bS $SAMFILE | samtools sort -@ 7 -o BAM/${SAMPLE}.bam 

The two scripts above will be run within a slurm batch script (also included)

cat hisat2_samtools_slurm.sh

#!/bin/bash 
#SBATCH -N1 
#SBATCH -c16 
#SBATCH -J hisat2  
#SBATCH -o slurm_logs/hisat2_%j.out 
#SBATCH -e slurm_logs/hisat2_%j.err 
#SBATCH -t 08:00:00 

# Load required modules 
module load hisat2 
module load samtools 
module load parallel 

# Define Directories: 
RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/" 
INDEX_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/02_Hisat2/" 
OUT_DIR="$SLURM_SUBMIT_DIR/02_Hisat2" 
mkdir -p $OUT_DIR 

# change to compute node's local dir 
cd $TMPDIR 

# Set Permissions 
chgrp -R proj-scinet_workshop2 $TMPDIR 
chmod -R g+s $TMPDIR 

# make output directories 
mkdir SAM 
mkdir BAM 
mkdir LOG 

# copy input data and script to the node's scratch dir 
cp --preserve=ownership $RAW_DIR/*gz . 
cp --preserve=ownership $INDEX_DIR/Arabidopsis_TAIR10.1*h* .  
cp --preserve=ownership $SLURM_SUBMIT_DIR/run_hisat2.sh . 
cp --preserve=ownership $SLURM_SUBMIT_DIR/run_samtools.sh . 

# Check if folders are in TMPDIR 
echo "Files in TMPDIR:"
find . -type f 

# The main command (run the two executable scripts) 
parallel -j 4 "./run_hisat2.sh {1} {2}" ::: *_1.fastq.gz :::+ *_2.fastq.gz 
parallel -j 4 "./run_samtools.sh {}" ::: SAM/*.sam 

# Move the output directories back to your folder 
mv BAM $OUT_DIR 
mv LOG $OUT_DIR 

Submit the slurm script:
sbatch hisat2_samtools_slurm.sh

This will take about 30 minutes to complete.

Abundance Estimation

For quantifying transcript abundance from RNA-seq data, there are many programs available. The two most popular tools are featureCounts and HTSeq. We will need a file with aligned sequencing reads (SAM/BAM files generated in previous step) and a list of genomic features (from the GFF file). featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. It also outputs statistics for the overall summarization of results, including the number of successfully assigned reads and the number of reads that failed to be assigned due to various reasons. We can run featureCounts on all SAM/BAM files at the same time or individually.

pwd cd /90daydata/shared/$USER/intro_rnaseq

You will need subread and parallel modules loaded. They are included in the script below:

#!/bin/bash 
#SBATCH -N 1 
#SBATCH -c 16 
#SBATCH -t 01:00:00 
#SBATCH -J FeatCounts 
#SBATCH -o slurm_logs/FeatCounts.%j.out 
#SBATCH -e slurm_logs/FeatCounts.%j.err 

# Load required modules 
module purge 
module load subread 
module load parallel 

ulimit -s unlimited 

# Define inputs 
 ANNOT_GTF="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/GCF_000001735.4_TAIR10.1_genomic.gtf" 
INDIR="02_Hisat2/BAM" 
OUTDIR="03_FeatCounts" 
[[ -d ${OUTDIR} ]] || mkdir -p ${OUTDIR} 

# FeatureCounts command 
parallel -j 4 "featureCounts -T 4 -s 2 -p --countReadPairs -t gene -g gene_id -a ${ANNOT_GFF} -o ${OUTDIR}/{/.}.gene.txt {}" ::: ${INDIR}/*.bam 

scontrol show job ${SLURM_JOB_ID} 

Submit the batch script:

sbatch featureCounts_slurm_script.sh

This creates the following set of files in the specified output folder:

Count Files:

SRR4420298.gene.txt 
SRR4420293.gene.txt 
SRR4420297.gene.txt 
SRR4420296.gene.txt 
SRR4420295.gene.txt 
SRR4420294.gene.txt 

Each file has a commented line starting with a # which gives the command used to create the count table and the relevant seven columns as follows, for example:

head SRR4420298.gene.txt

# Program:featureCounts v1.5.2; Command:"featureCounts" "-T" "4" "-s" "2" "-p" "-t" "gene" "-g" "ID" "-a" "/path/to/GCF_000001735.3_TAIR10_genomic.gff" "-o" "/path/to/counts/SRR4420298.gene.txt" "SRR4420298.bam" 

Geneid  Chr     Start   End     Strand  Length  SRR4420298.bam 
gene0   NC_003070.9     3631    5899    +       2269    13 
gene1   NC_003070.9     6788    9130    -       2343    17 
gene2   NC_003070.9     11101   11372   +       272     0 
gene3   NC_003070.9     11649   13714   -       2066    13 
gene4   NC_003070.9     23121   31227   +       8107    32 
gene5   NC_003070.9     23312   24099   -       788     0 
gene6   NC_003070.9     28500   28706   +       207     0 
gene7   NC_003070.9     31170   33171   -       2002    45 

Additionally, Summary Files are produced. These give the summary of reads that were either ambiguous, multi mapped, mapped to no features, or unmapped among other statistics. We can refer to these to further tweak our analyses etc.
```
SRR4420298.gene.txt.summary 
SRR4420293.gene.txt.summary 
SRR4420295.gene.txt.summary 
SRR4420296.gene.txt.summary 
SRR4420294.gene.txt.summary 
SRR4420297.gene.txt.summary 
```

Using the following commands, we can produce a single count table. This count table will be loaded into R for differential expression analysis.

From 03_FeatCounts edit the count files “in-place”

for f in *.txt  
  do sed -i '/^#/d' "$f" 
      sed -i 's|02_Hisat2/BAM/||g' "$f" 
      sed -i 's|\.bam||g' "$f" 
      done 

Make directory called Processed and make a combined count table with the first column as “Gene_IDs” and 6 other columns one for each sample’s gene counts.

mkdir Processed  
awk 'BEGIN {OFS="\t"} {print $1,$7}' SRR4420293.gene.txt > col1.txt 
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420294.gene.txt > col2.txt 
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420295.gene.txt > col3.txt 
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420296.gene.txt > col4.txt 
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420297.gene.txt > col5.txt 
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420298.gene.txt > col6.txt 
paste col1.txt col2.txt col3.txt col4.txt col5.txt col6.txt > Arabidopsis_RNAseq_Counts.txt 

head Arabidopsis_RNAseq_Counts.txt

The Combined count file… a sneak peek:

Geneid  SRR4420293  SRR4420294  SRR4420295  SRR4420296  SRR4420297  SRR4420298 
AT1G01010   11  1   10  28  11  13 
AT1G01020   37  3   26  88  22  17 
AT1G03987   0   0   0   0   0   0 

Convert to and save as CSV file

sed 's|        |,|g' Arabidopsis_RNAseq_Counts.txt > Arabidopsis_RNAseq_Counts.csv

This file is ready to be imported to R for DESeq2.

Differential Gene Expression Analysis

RStudio Server: Back on the main Ceres OOD tab, click on the top or side navigation bar: “Interactive Apps” > “RStudio Server”.
Fill the input fields with the following (input fields not listed below can be left at their default values):
- Account: scinet_workshop2
- Queue: ceres
- QOS: 400thread
- R Version: 4.4.1
- Number of hours: 4
- Number of cores: 1
- Memory required: 8GB
- Optional Slurm Arguments: (leave empty)
Click the “Launch” button.
Wait a moment for the job card to update from “Queued” to “Running”.
Please confirm that clicking on the “Connect to RStudio Server” button opens a new tab with the RStudio Server interface.

We have saved the entire R script for DESeq2 (DESeq2.R). The script is partly based on Stephen Turner’s template. We have displayed the R script here largely for reference.

DESeq2 R code

## Set working directory in R 
setwd("/90daydata/shared/$USER/intro_rnaseq") 

## Get the latest version of Bioconductor 
if (!require("BiocManager", quietly = TRUE)) 
    install.packages("BiocManager") 
BiocManager::install(version = "3.20") 

## Install libraries needed: 

BiocManager::install("DESeq2") 
library("DESeq2") 
library(readr) 
library(DESeq2) 

# Read the data in
df1 <- read_csv("Arabidopsis_RNAseq_Counts.csv") 
typeof(df1) 
df1 
dim(df1) 

# Saving the gene IDs as a vector 
AT_genes <- df1[1] 
df1 <- df1[,-1] 

# Some diagnoses 
# Total reads per sample 
colSums(df1) 

# SRR4420293 SRR4420294 SRR4420295 SRR4420296 SRR4420297 SRR4420298  
# 3703248    1081388    3793357   14588240    4095513    3235033  

# Boxplot of log counts per sample 
boxplot(log10(df1 + 1), las=2, main="Raw log10(counts+1) per sample", col="lightblue") 

#DESeq2 Pipeline 

#Metadata 
# Assign condition (first three are WT, next three are mutants) 
condition <- factor(c(rep("WT",3),rep("Mut",3))) 
condition=relevel(condition,ref = "WT") 

# Create a metadata frame: its rows correspond to columns of dat (i.e., matrix representing the countData) 
metadata <- data.frame(row.names=colnames(df1), condition) 
head(metadata) 

#               condition 
# SRR4420293        WT 
# SRR4420294        WT 
# SRR4420295        WT 
# SRR4420296       Mut 
# SRR4420297       Mut 
# SRR4420298       Mut 

# Create the DeSEQDataSet object  
# This will store the raw count matirx, sample metadata and experimental design 

dds1 <- DESeqDataSetFromMatrix(countData = df1, colData = metadata,design=~ condition) 

#- countData: gene x sample matrix we created from all_counts.txt 
# - colData: sample metadata with rownames matching colnames  
# - design = ~condition: the experimental design formula where ~condition means you want to test for differential gene expression  between the condition variables (e.g treated vs control) 

#Run DESeq2 pipeline 

dds1 <- DESeq(dds1) 
#check sample info 
colData(dds1) 
#check dispersion estimates 
plotDispEsts(dds1) 
#extract results  
head(results(dds1)) 
#summary 
summary(results(dds1)) 
#check available coefficients 
resultsNames(dds1)  
# Extract normalized counts 
norm_counts <- counts(dds1, normalized=TRUE) 

# Boxplot of normalized log counts 
boxplot(log10(norm_counts + 1), las=2, main="Normalized log10(counts+1)", col="lightgreen") 

#Get differential expression results 

#summary table of results: 
table(results(dds1,contrast = c("condition", "WT", "Mut"))$padj<=0.05) 

#store results as a variable 
diff_res <- results(dds1, contrast=c("condition", "WT", "Mut")) 

#filter for significant DEGs 
res_sig<- diff_res[which(diff_res$padj <0.05 & abs(diff_res$log2FoldChange)>1),] 

# Merge with normalized counts 
results_data<- as.data.frame(res_sig) 
results_data$gene <- rownames(results_data) 
norm_counts_df<- as.data.frame(norm_counts) 
norm_counts_df$gene <- rownames(norm_counts_df) 

merged_data<- merge(results_data, norm_counts_df, by = "gene") 
head(merged_data) 
merged_ordered<- merged_data[order(merged_data$padj, abs(merged_data$log2FoldChange)),] 
head(merged_ordered) 

# Write Results 

write.csv(merged_ordered, "DESeq2_results_norm_counts.csv") 

# Visualize Results 
install.packages("pheatmap") 
BiocManager::install("EnhancedVolcano") 
install.packages("ggplot2") 
install.packages("RColorBrewer") 

library(RColorBrewer) 
library(pheatmap) 
library(ggplot2) 
library(EnhancedVolcano) 

#histogram of Log2FC 
hist(merged_ordered$log2FoldChange, breaks=50, col="skyblue", main="Distribution of Log2FC", xlab="Log2FC") 

hist(log10(rowMeans(norm_counts)+1), breaks=50, 

     main = "Distribution of average gene expression (log10)", xlab="log10(normalized count + 1)") 

# MA plot (built-in) 

plotMA(dds1, ylim = c(-5, 5), main = "MA plot") 

# Volcano plot 

res <- results(dds1) 
res$padj[is.na(res$padj)] <- 1  # Replace NAs 
res_df <- as.data.frame(res) 
res_df$gene <- rownames(res_df) 
res_df$sig <- res_df$padj < 0.05 

ggplot(res_df, aes(log2FoldChange, -log10(padj), color = sig)) + 
  geom_point(alpha = 0.4) + 
  scale_color_manual(values = c("gray", "red")) + 
  theme_minimal() + 
  labs(title = "Volcano plot", x = "log2 Fold Change", y = "-log10 Adjusted p-value") 
  

#Transform data for visualization  
#variance-stabilized transformation for PCA  
vsdata<- vst(dds1, blind=FALSE) 

#plot PCA 
plotPCA(vsdata, intgroup="condition") 

#Better PCA using ggplot2 
pcaData<-plotPCA(vsdata, intgroup="condition", returnData=TRUE) 
percentVar <- round(100 * attr(pcaData, "percentVar")) 
ggplot(pcaData, aes(PC1, PC2, color = condition, label=name)) +  
geom_point(size =3)+  
xlab(paste0("PC1: ", percentVar[1], "%"))+ 
ylab(paste0("PC2: ", percentVar[2], "%")) 

# VOLCANO PLOT  

library(EnhancedVolcano) 

EnhancedVolcano(merged_ordered,  
                lab=merged_ordered$gene,  
                x="log2FoldChange", 
                y="padj",  
                pCutoff=0.05,  
                FCcutoff=1) 

# Volcano Plot with Labeled Significant Genes 
#look at top 10 
top10 <- head(merged_ordered,10) 
head(top10) 
top10$gene[1:10] 

EnhancedVolcano(merged_ordered,  
                lab=merged_ordered$gene,  
                x="log2FoldChange", 
                y="padj",  
                pCutoff=0.05,  
                FCcutoff=1, 
                title='Volcano Plot: Top 10 Significant DEGs', 
                xlab='Log2FC', 
                ylab='-Log10 adjpval', 
                selectLab = top10$gene) 

library(pheatmap) 
library(RColorBrewer) 
#select expression columns only  
top50 <- head(merged_ordered,50) 
rownames(top50)<-top50$gene 
expression<- as.matrix(top50[,grep("^SRR", colnames(top50))]) 
head(expression) 

#Scale each gene across samples 
exprs_scaled<-t(scale(t(expression))) 
head(exprs_scaled) 

#Plot heatmap 

pheatmap(exprs_scaled, 
         show_rownames=TRUE, 
         cluster_cols = TRUE, 
         cluster_rows = TRUE, 
         color=colorRampPalette(c("navy","white","firebrick3"))(100)) 

# Functional annotation  

#species: Arabidopsis thaliana (thale cress) 

#get packages  
BiocManager::install("org.At.tair.db") 
BiocManager::install(c("AnnotationDbi", "clusterProfiler")) 
#load libraries 
library(org.At.tair.db) 
library(clusterProfiler) 

# get names of the significant genes 
sig_gene_names<- merged_ordered$gene 
head(sig_gene_names) 

#convert TAIR IDs to entrez IDs for annotation  

entrez_ids <- mapIds(org.At.tair.db, keys = sig_gene_names, column="ENTREZID", keytype = "TAIR", multiVals = "first") 

#get rid of the missing mappings 

entrez_ids <- na.omit(entrez_ids) 

go_res<- enrichGO(gene = entrez_ids, 
                  OrgDb = org.At.tair.db, 
                  keyType = "ENTREZID", 
                  ont = "BP", #options: BP, MF, CC 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.05, 
                  readable = TRUE) 

go_res 

#Plot results 
dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 

#Option to relax thresholds: 

go_res<- enrichGO(gene = entrez_ids, 
                  OrgDb = org.At.tair.db, 
                  keyType = "ENTREZID", 
                  ont = "BP", #options: BP, MF, CC 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.2, 
                  pvalueCutoff = 0.1, 
                  readable = TRUE) 

dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 

### USING SYMBOL instead of using ENTREZ_ID 

symbols <- mapIds(org.At.tair.db, keys = sig_gene_names, column="SYMBOL", keytype = "TAIR", multiVals = "first") 

#get rid of the missing mappings 
symbols <- na.omit(symbols) 

go_res2<- enrichGO(gene = symbols, 
                  OrgDb = org.At.tair.db, 
                  keyType = "SYMBOL", 
                  ont = "BP", #options: BP, MF, CC 
                  pAdjustMethod = "BH", 
                  qvalueCutoff = 0.05, 
                  readable = TRUE) 

dotplot(go_res2, showCategory = 15, title="GO BP: Arabidopsis Thaliana") 

### KEGG Enrichment  

#enrichKEGG identifies over represented KEGG pathways for Arabidopsis.  

ekegg<- enrichKEGG(gene = sig_gene_names, 
                   organism = "ath", 
                   pvalueCutoff = 0.05) 
barplot(ekegg, showCategory = 15, title="KEGG: Arabidopsis thaliana") 

Workshop Materials

Workshop materials and recordings available for USDA employees at the links below

Webinar recording

- Tuesday, May 13, 2025, 1 – 5 PM ET
  - Prerequisites:
    - Familiarity with basic command-line concepts.
  - Webinar recording

Introduction to RNA-seq Analysis

Tutorial Setup Instructions

RNA-Seq Data Analysis

Background Information: Experimental design

Getting Data

The Raw sequence data

The Reference Genome:

Quality Check

What to do if your reads need trimming?

Run a quality check on the timmed reads