In this workshop, we will provide an overview of RNA-sequencing (RNA-seq) technology and best practices for experimental design. Participants will also explore preprocessing steps, read alignment to a reference genome, and differential expression analysis using DESeq2. We will also introduce tools for functional analysis and for determining the biological relevance of the results.
Tutorial Setup Instructions
Steps to prepare for the tutorial session:
- Log in to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
- Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
- Request resources on a compute node by running the following command.
srun --reservation=wk3_workshop -A scinet_workshop2 -t 05:00:00 -N 1 -c 8 --pty bash
- For those accessing post-workshop: Remove --reservation=wk3_workshop and change -A scinet_workshop2 to your own project account (an example srun command is sketched after these setup steps). For additional information, see our SLURM guide.
- Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.
mkdir -p /90daydata/shared/$USER/intro_rnaseq
cd /90daydata/shared/$USER/intro_rnaseq
cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/*.sh .
cp /project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/DESeq2.R .
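For those running the tutorial after the workshop, the resource request might look like the following sketch (replace your_project_account with the name of your own project account):
srun -A your_project_account -t 05:00:00 -N 1 -c 8 --pty bash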
RNA-Seq Data Analysis
by Sivanandan Chudalayandi and Lavida Rogers
RNA-seq experiments aim to characterize transcriptomic changes in an organism in response to a treatment, or to understand the cause and/or effect of a mutation by measuring the resulting changes in gene expression. Robust algorithms designed to map short nucleotide sequences to a genome while remaining aware of RNA splicing have driven many advances in RNA-seq analysis. This tutorial will guide you through a basic RNA-seq analysis, from quality checking of the reads through to differential gene expression results. We have downloaded an Arabidopsis dataset from NCBI for this purpose; check the BioProject page for more information.
Background Information: Experimental design
This experiment compares wild type (WT) and the atrx-1 mutant to analyze how the loss of function of the ATRX protein changes gene expression. The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression. RNA was isolated from three WT replicates and three mutant replicates, and the transcriptome was enriched/isolated from total RNA using the plant RiboMinus kit. RNA-seq was carried out on an Illumina HiSeq 2500. The reads are paired-end, so there are two files per replicate.
Condition | replicate 1 | replicate 2 | replicate 3 |
---|---|---|---|
WT | SRR4420293_1.fastq.gz | SRR4420294_1.fastq.gz | SRR4420295_1.fastq.gz |
atrx-1 | SRR4420296_1.fastq.gz | SRR4420297_1.fastq.gz | SRR4420298_1.fastq.gz |
Getting Data
The Raw sequence data
The files needed for this tutorial have already been downloaded. This will allow us to get started right away with our analyses without waiting on downloads or dealing with connectivity issues. However, for your own work, you may need to download data from public repositories. There are two common tools used for this: wget and curl. While this is not needed for the tutorial, here's how you would download files yourself (a loop for fetching all six runs is sketched after this list):
- Using wget:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
- Using curl:
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/003/SRR4420293/SRR4420293_1.fastq.gz
-O (capital O) tells curl to save the file with its original name.
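If you needed to fetch all six runs listed in the table above, a simple loop could be used. This is only a sketch: the ENA FTP layout assumed here (a subdirectory named for the last digit of each accession, e.g. 003 for SRR4420293) should be verified on the ENA run browser before use.
for acc in SRR4420293 SRR4420294 SRR4420295 SRR4420296 SRR4420297 SRR4420298
do
  sub=00${acc: -1}   # assumed ENA subdirectory, e.g. 003 for SRR4420293
  for read in 1 2
  do
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR442/${sub}/${acc}/${acc}_${read}.fastq.gz
  done
done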
The Reference Genome:
We will also need the genome file and associated GTF/GFF file for Arabidopsis. These can be downloaded directly from NCBI, the Ensembl Plants website, or the Phytozome website*.
*Phytozome requires logging in and selecting the files. For this tutorial, we have already downloaded the following files from NCBI:
- You can use wget to fetch both the genome (FASTA) file and the annotation (GTF) file. For this tutorial, the steps below are optional:
mkdir 00_Genome
cd 00_Genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz
- We unzipped these files so that they can be used later in our analysis pipeline. To do this we used gunzip:
gunzip GCF_000001735.4_TAIR10.1_genomic.fna.gz
gunzip GCF_000001735.4_TAIR10.1_genomic.gtf.gz
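As an optional sanity check, you can confirm the downloads look as expected, for example by counting the sequences in the FASTA file and peeking at the first annotation records:
grep -c ">" GCF_000001735.4_TAIR10.1_genomic.fna
grep -v "^#" GCF_000001735.4_TAIR10.1_genomic.gtf | head -3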
Quality Check
We use fastqc, a tool that provides a simple way to do quality control checks on raw sequence data coming from high-throughput sequencing pipelines (FastQC).
- Script for fastqc (already included)
Ensure that you are in /90daydata/shared/$USER/intro_rnaseq
cat fastqc.sh
#!/bin/bash
#SBATCH -N1
#SBATCH -c8
#SBATCH -J fastqc
#SBATCH -o fastqc_%j.out
#SBATCH -e fastqc_%j.err
#SBATCH -t 01:00:00

# Load required modules
module load fastqc
module load multiqc

# Define Directories:
RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/"
OUT_DIR="$SLURM_SUBMIT_DIR/01_FastqC"

# make output directories
mkdir -p $OUT_DIR

# The main command
fastqc -t 8 -o $OUT_DIR $RAW_DIR/*gz
multiqc -p -o 01a_MultiQC/ $OUT_DIR --title "Arabidopsis_RNAseq"
- Submit the script
sbatch fastqc.sh
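While the job runs you can monitor it, and once it finishes, locate the reports (directory names follow the script above):
squeue -u $USER        # check the job status
ls 01_FastqC           # per-sample FastQC reports
ls 01a_MultiQC         # combined MultiQC report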
What to do if your reads need trimming?
In this section of the tutorial we will use a different set of reads to demonstrate what you would do if, after quality checking, you decided to trim your reads. Read trimming is the process of removing poor-quality bases before downstream analysis, and there are a few tools that help with this.
When would you trim?
- To remove bases of poor quality
- To remove adapter sequences
- Reads for demonstrating trimming
The reads for this demonstration are from the ENA. We will use BBDuk (Decontamination Using Kmers; see the BBDuk Guide).
FFPE="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/Poor_Data/FFPE"

module load bbtools
mkdir Trimmed

# Define adapters variable
adapters="/software/el9/apps/bbtools/39.01/resources/adapters.fa"

# Run bbduk:
bbduk.sh in1=SRR12844918_1.fastq.gz in2=SRR12844918_2.fastq.gz \
  out1=Trimmed/SRR12844918_1.fastq.gz out2=Trimmed/SRR12844918_2.fastq.gz \
  ref=$adapters ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=20 minlength=50 tpe tbo
Explanation of options:
- ref=adapters.fa: Reference file containing adapter sequences to be trimmed.
- ktrim=r: Trim adapters from the right end of reads.
- k=23: K-mer length for matching adapters.
- mink=11: Minimum k-mer length for adapter matching.
- hdist=1: Allow one mismatch in k-mer matching.
- qtrim=rl: Trim both ends of reads based on quality.
- trimq=20: Quality threshold for trimming.
- minlength=50: Discard reads shorter than 50 bases after trimming.
- tpe: Trim both reads of a pair if one is trimmed.
- tbo: Trim adapters based on pair overlap detection.
Run a quality check on the trimmed reads
We will run this quickly on the interactive node.
cd Trimmed
mkdir Fastqc
mkdir Multiqc

# Load the tools if they are not already available in this session
module load fastqc
module load multiqc

time fastqc -o Fastqc/ -t 2 SRR12844918_*
time multiqc -p -o Multiqc/ Fastqc --title "SRR12844918_Trim_Quality"
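To gauge how much was removed, you can compare read counts before and after trimming. A minimal sketch, assuming the raw files are in the $FFPE directory defined above:
for f in $FFPE/SRR12844918_1.fastq.gz Trimmed/SRR12844918_1.fastq.gz
do
  echo "$f: $(( $(zcat $f | wc -l) / 4 )) reads"
done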
Mapping reads to the genome using HISAT2
There are several mapping programs available for aligning RNA-seq reads to a genome. Generic aligners such as BWA, bowtie2, and BBMap are not suitable for mapping RNA-seq reads because they are not splice-aware. RNA-seq reads come from mRNA, which contains only exonic regions, so mapping them back to the genome requires splitting individual reads that span an intron; only splice-aware mappers can do this. In this tutorial, we will use HISAT2, a successor of TopHat2.
Build genome index
For HISAT2 mapping, you first need to index the genome and then map the read pairs to the indexed genome (one pair of files at a time). For indexing the genome, we use the hisat2-build command. We can run this on an interactive node.
- To go back to the main working directory:
cd /90daydata/shared/$USER/intro_rnaseq
- Run hisat2-build:
module load hisat2   # if not already loaded in this session
mkdir -p 02_Hisat2   # output directory for the index files
DATA_DIR=/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/
{ time hisat2-build -p8 $DATA_DIR/GCF_000001735.4_TAIR10.1_genomic.fna 02_Hisat2/Arabidopsis_TAIR10.1 >& hs2build.log; } 2> hs2build.time &
Once complete, you should see several files with the .ht2 extension. These are the index files.
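For example, to confirm the index was built (file names follow the hisat2-build command above):
ls 02_Hisat2/Arabidopsis_TAIR10.1*.ht2   # index files (typically .1.ht2 through .8.ht2)
tail hs2build.log                        # hisat2-build log messages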
Mapping Reads to the index:
We will do this using parallel within a Slurm batch script. First, we will need two executable scripts (included):
- Run Hisat2
cat run_hisat2.sh
#!/bin/bash
# Usage: ./run_hisat2.sh R1 R2
R1=$1
R2=$2
SAMPLE=$(basename "$R1" _1.fastq.gz)
hisat2 -p 4 -x Arabidopsis_TAIR10.1 -1 "$R1" -2 "$R2" -S SAM/"${SAMPLE}".sam 2> LOG/"${SAMPLE}".log
- Run samtools
cat run_samtools.sh
#!/bin/bash
# Usage: ./run_samtools.sh SAMFILE
SAMFILE=$1
SAMPLE=$(basename $SAMFILE .sam)
samtools view -@ 7 -bS $SAMFILE | samtools sort -@ 7 -o BAM/${SAMPLE}.bam
The two scripts above will be run within a slurm batch script (also included)
cat hisat2_samtools_slurm.sh
#!/bin/bash
#SBATCH -N1
#SBATCH -c16
#SBATCH -J hisat2
#SBATCH -o slurm_logs/hisat2_%j.out
#SBATCH -e slurm_logs/hisat2_%j.err
#SBATCH -t 08:00:00

# Load required modules
module load hisat2
module load samtools
module load parallel

# Define Directories:
RAW_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_RawData/"
INDEX_DIR="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/02_Hisat2/"
OUT_DIR="$SLURM_SUBMIT_DIR/02_Hisat2"
mkdir -p $OUT_DIR

# change to compute node's local dir
cd $TMPDIR

# Set Permissions
chgrp -R proj-scinet_workshop2 $TMPDIR
chmod -R g+s $TMPDIR

# make output directories
mkdir SAM
mkdir BAM
mkdir LOG

# copy input data and scripts to the node's scratch dir
cp --preserve=ownership $RAW_DIR/*gz .
cp --preserve=ownership $INDEX_DIR/Arabidopsis_TAIR10.1*h* .
cp --preserve=ownership $SLURM_SUBMIT_DIR/run_hisat2.sh .
cp --preserve=ownership $SLURM_SUBMIT_DIR/run_samtools.sh .

# Check if files are in TMPDIR
echo "Files in TMPDIR:"
find . -type f

# The main command (run the two executable scripts)
parallel -j 4 "./run_hisat2.sh {1} {2}" ::: *_1.fastq.gz :::+ *_2.fastq.gz
parallel -j 4 "./run_samtools.sh {}" ::: SAM/*.sam

# Move the output directories back to your folder
mv BAM $OUT_DIR
mv LOG $OUT_DIR
Submit the slurm script:
sbatch hisat2_samtools_slurm.sh
This will take about 30 minutes to complete.
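While you wait, you can monitor the job and, after it finishes, check per-sample alignment rates from the HISAT2 logs (directory names follow the script above):
squeue -u $USER                                     # is the job still running?
tail slurm_logs/hisat2_*.out                        # batch job output
grep "overall alignment rate" 02_Hisat2/LOG/*.log   # HISAT2 mapping summary per sample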
Abundance Estimation
For quantifying transcript abundance from RNA-seq data, there are many programs available. The two most popular tools are featureCounts and HTSeq. We will need a file with aligned sequencing reads (the SAM/BAM files generated in the previous step) and a list of genomic features (from the GTF file). featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations. It also outputs statistics summarizing the results, including the number of successfully assigned reads and the number of reads that could not be assigned for various reasons. We can run featureCounts on all SAM/BAM files at the same time or individually.
pwd
cd /90daydata/shared/$USER/intro_rnaseq
You will need the subread and parallel modules loaded. They are included in the script below:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 16
#SBATCH -t 01:00:00
#SBATCH -J FeatCounts
#SBATCH -o slurm_logs/FeatCounts.%j.out
#SBATCH -e slurm_logs/FeatCounts.%j.err

# Load required modules
module purge
module load subread
module load parallel

ulimit -s unlimited

# Define inputs
ANNOT_GTF="/project/scinet_workshop2/Bioinformatics_series/wk3_workshop/day1/00_Genome/GCF_000001735.4_TAIR10.1_genomic.gtf"
INDIR="02_Hisat2/BAM"
OUTDIR="03_FeatCounts"
[[ -d ${OUTDIR} ]] || mkdir -p ${OUTDIR}

# FeatureCounts command
parallel -j 4 "featureCounts -T 4 -s 2 -p --countReadPairs -t gene -g gene_id -a ${ANNOT_GTF} -o ${OUTDIR}/{/.}.gene.txt {}" ::: ${INDIR}/*.bam

scontrol show job ${SLURM_JOB_ID}
Submit the batch script:
sbatch featureCounts_slurm_script.sh
This creates the following set of files in the specified output folder:
- Count Files:
SRR4420298.gene.txt SRR4420293.gene.txt SRR4420297.gene.txt SRR4420296.gene.txt SRR4420295.gene.txt SRR4420294.gene.txt
Each file has a comment line starting with # that records the command used to create the count table, followed by the seven relevant columns, for example:
head SRR4420298.gene.txt
# Program:featureCounts v1.5.2; Command:"featureCounts" "-T" "4" "-s" "2" "-p" "-t" "gene" "-g" "ID" "-a" "/path/to/GCF_000001735.3_TAIR10_genomic.gff" "-o" "/path/to/counts/SRR4420298.gene.txt" "SRR4420298.bam"
Geneid  Chr          Start  End    Strand  Length  SRR4420298.bam
gene0   NC_003070.9  3631   5899   +       2269    13
gene1   NC_003070.9  6788   9130   -       2343    17
gene2   NC_003070.9  11101  11372  +       272     0
gene3   NC_003070.9  11649  13714  -       2066    13
gene4   NC_003070.9  23121  31227  +       8107    32
gene5   NC_003070.9  23312  24099  -       788     0
gene6   NC_003070.9  28500  28706  +       207     0
gene7   NC_003070.9  31170  33171  -       2002    45
- Additionally, Summary Files are produced. These give a summary of reads that were ambiguous, multi-mapped, mapped to no features, or unmapped, among other statistics. We can refer to these to further tweak our analyses; a quick way to scan them is shown below.
SRR4420298.gene.txt.summary SRR4420293.gene.txt.summary SRR4420295.gene.txt.summary SRR4420296.gene.txt.summary SRR4420294.gene.txt.summary SRR4420297.gene.txt.summary
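A quick way to review assignment rates across all samples is to pull the relevant rows out of the summary files, for example:
grep "Assigned" 03_FeatCounts/*.gene.txt.summary              # fragments assigned to genes, per sample
grep "Unassigned_NoFeatures" 03_FeatCounts/*.gene.txt.summary # fragments overlapping no annotated gene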
Using the following commands, we can produce a single count table. This count table will be loaded into R for differential expression analysis.
- From 03_FeatCounts, edit the count files "in-place":
for f in *.txt
do
  sed -i '/^#/d' "$f"
  sed -i 's|02_Hisat2/BAM/||g' "$f"
  sed -i 's|\.bam||g' "$f"
done
- Make a directory called Processed and make a combined count table with the first column holding the gene IDs and six other columns, one for each sample's gene counts.
mkdir Processed
awk 'BEGIN {OFS="\t"} {print $1,$7}' SRR4420293.gene.txt > col1.txt
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420294.gene.txt > col2.txt
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420295.gene.txt > col3.txt
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420296.gene.txt > col4.txt
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420297.gene.txt > col5.txt
awk 'BEGIN {OFS="\t"} {print $7}' SRR4420298.gene.txt > col6.txt
paste col1.txt col2.txt col3.txt col4.txt col5.txt col6.txt > Arabidopsis_RNAseq_Counts.txt
head Arabidopsis_RNAseq_Counts.txt
- The combined count file… a sneak peek:
Geneid     SRR4420293  SRR4420294  SRR4420295  SRR4420296  SRR4420297  SRR4420298
AT1G01010  11          1           10          28          11          13
AT1G01020  37          3           26          88          22          17
AT1G03987  0           0           0           0           0           0
- Convert to and save as a CSV file (the combined table is tab-separated):
sed 's|\t|,|g' Arabidopsis_RNAseq_Counts.txt > Arabidopsis_RNAseq_Counts.csv
This file is ready to be imported to R for DESeq2.
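Before moving to R, a quick sanity check that the CSV has the expected shape (a Geneid column plus six sample columns) can be done with:
head -3 Arabidopsis_RNAseq_Counts.csv
awk -F',' 'NR==1 {print NF" columns"} END {print NR" rows"}' Arabidopsis_RNAseq_Counts.csv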
Differential Gene Expression Analysis
- RStudio Server: Back on the main Ceres OOD tab, click on the top or side navigation bar: “Interactive Apps” > “RStudio Server”.
- Fill the input fields with the following (input fields not listed below can be left at their default values):
- Account: scinet_workshop2
- Queue: ceres
- QOS: 400thread
- R Version: 4.4.1
- Number of hours: 4
- Number of cores: 1
- Memory required: 8GB
- Optional Slurm Arguments: (leave empty)
- Click the “Launch” button.
- Wait a moment for the job card to update from “Queued” to “Running”.
- Click the "Connect to RStudio Server" button; a new tab should open with the RStudio Server interface.
We have saved the entire R script for DESeq2 (DESeq2.R). The script is partly based on Stephen Turner's template. We have displayed the R script here largely for reference.

## Set working directory in R
setwd(file.path("/90daydata/shared", Sys.getenv("USER"), "intro_rnaseq"))

## Get the latest version of Bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.20")

## Install libraries needed:
BiocManager::install("DESeq2")

library(readr)
library(DESeq2)

# Read the data in
df1 <- read_csv("Arabidopsis_RNAseq_Counts.csv")
typeof(df1)
df1
dim(df1)

# Saving the gene IDs as a vector
AT_genes <- df1[[1]]
df1 <- as.data.frame(df1[,-1])
rownames(df1) <- AT_genes   # keep gene IDs as rownames so they propagate to the DESeq2 results

# Some diagnostics
# Total reads per sample
colSums(df1)
# SRR4420293 SRR4420294 SRR4420295 SRR4420296 SRR4420297 SRR4420298
#    3703248    1081388    3793357   14588240    4095513    3235033

# Boxplot of log counts per sample
boxplot(log10(df1 + 1), las=2, main="Raw log10(counts+1) per sample", col="lightblue")

# DESeq2 Pipeline
# Metadata
# Assign condition (first three are WT, next three are mutants)
condition <- factor(c(rep("WT",3), rep("Mut",3)))
condition <- relevel(condition, ref = "WT")

# Create a metadata frame: its rows correspond to columns of df1 (i.e., the matrix representing the countData)
metadata <- data.frame(row.names=colnames(df1), condition)
head(metadata)
#            condition
# SRR4420293        WT
# SRR4420294        WT
# SRR4420295        WT
# SRR4420296       Mut
# SRR4420297       Mut
# SRR4420298       Mut

# Create the DESeqDataSet object
# This will store the raw count matrix, sample metadata and experimental design
dds1 <- DESeqDataSetFromMatrix(countData = df1, colData = metadata, design = ~condition)
# - countData: gene x sample matrix we created from all_counts.txt
# - colData: sample metadata with rownames matching colnames
# - design = ~condition: the experimental design formula where ~condition means you want to test for
#   differential gene expression between the condition variables (e.g. treated vs control)

# Run DESeq2 pipeline
dds1 <- DESeq(dds1)

# Check sample info
colData(dds1)

# Check dispersion estimates
plotDispEsts(dds1)

# Extract results
head(results(dds1))

# Summary
summary(results(dds1))

# Check available coefficients
resultsNames(dds1)

# Extract normalized counts
norm_counts <- counts(dds1, normalized=TRUE)

# Boxplot of normalized log counts
boxplot(log10(norm_counts + 1), las=2, main="Normalized log10(counts+1)", col="lightgreen")

# Get differential expression results
# Summary table of results:
table(results(dds1, contrast = c("condition", "WT", "Mut"))$padj <= 0.05)

# Store results as a variable
diff_res <- results(dds1, contrast=c("condition", "WT", "Mut"))

# Filter for significant DEGs
res_sig <- diff_res[which(diff_res$padj < 0.05 & abs(diff_res$log2FoldChange) > 1),]

# Merge with normalized counts
results_data <- as.data.frame(res_sig)
results_data$gene <- rownames(results_data)
norm_counts_df <- as.data.frame(norm_counts)
norm_counts_df$gene <- rownames(norm_counts_df)
merged_data <- merge(results_data, norm_counts_df, by = "gene")
head(merged_data)
merged_ordered <- merged_data[order(merged_data$padj, abs(merged_data$log2FoldChange)),]
head(merged_ordered)

# Write results
write.csv(merged_ordered, "DESeq2_results_norm_counts.csv")

# Visualize results
install.packages("pheatmap")
BiocManager::install("EnhancedVolcano")
install.packages("ggplot2")
install.packages("RColorBrewer")
library(RColorBrewer)
library(pheatmap)
library(ggplot2)
library(EnhancedVolcano)

# Histogram of Log2FC
hist(merged_ordered$log2FoldChange, breaks=50, col="skyblue", main="Distribution of Log2FC", xlab="Log2FC")
hist(log10(rowMeans(norm_counts)+1), breaks=50, main = "Distribution of average gene expression (log10)", xlab="log10(normalized count + 1)")

# MA plot (built-in)
plotMA(dds1, ylim = c(-5, 5), main = "MA plot")

# Volcano plot
res <- results(dds1)
res$padj[is.na(res$padj)] <- 1   # Replace NAs
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$sig <- res_df$padj < 0.05
ggplot(res_df, aes(log2FoldChange, -log10(padj), color = sig)) +
  geom_point(alpha = 0.4) +
  scale_color_manual(values = c("gray", "red")) +
  theme_minimal() +
  labs(title = "Volcano plot", x = "log2 Fold Change", y = "-log10 Adjusted p-value")

# Transform data for visualization
# Variance-stabilized transformation for PCA
vsdata <- vst(dds1, blind=FALSE)

# Plot PCA
plotPCA(vsdata, intgroup="condition")

# Better PCA using ggplot2
pcaData <- plotPCA(vsdata, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = condition, label=name)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "%")) +
  ylab(paste0("PC2: ", percentVar[2], "%"))

# VOLCANO PLOT
library(EnhancedVolcano)
EnhancedVolcano(merged_ordered, lab=merged_ordered$gene, x="log2FoldChange", y="padj",
                pCutoff=0.05, FCcutoff=1)

# Volcano plot with labeled significant genes
# Look at the top 10
top10 <- head(merged_ordered, 10)
head(top10)
top10$gene[1:10]
EnhancedVolcano(merged_ordered, lab=merged_ordered$gene, x="log2FoldChange", y="padj",
                pCutoff=0.05, FCcutoff=1,
                title='Volcano Plot: Top 10 Significant DEGs',
                xlab='Log2FC', ylab='-Log10 adjpval',
                selectLab = top10$gene)

library(pheatmap)
library(RColorBrewer)

# Select expression columns only
top50 <- head(merged_ordered, 50)
rownames(top50) <- top50$gene
expression <- as.matrix(top50[, grep("^SRR", colnames(top50))])
head(expression)

# Scale each gene across samples
exprs_scaled <- t(scale(t(expression)))
head(exprs_scaled)

# Plot heatmap
pheatmap(exprs_scaled, show_rownames=TRUE, cluster_cols = TRUE, cluster_rows = TRUE,
         color=colorRampPalette(c("navy","white","firebrick3"))(100))

# Functional annotation
# Species: Arabidopsis thaliana (thale cress)
# Get packages
BiocManager::install("org.At.tair.db")
BiocManager::install(c("AnnotationDbi", "clusterProfiler"))

# Load libraries
library(org.At.tair.db)
library(clusterProfiler)

# Get names of the significant genes
sig_gene_names <- merged_ordered$gene
head(sig_gene_names)

# Convert TAIR IDs to Entrez IDs for annotation
entrez_ids <- mapIds(org.At.tair.db, keys = sig_gene_names, column="ENTREZID",
                     keytype = "TAIR", multiVals = "first")

# Get rid of the missing mappings
entrez_ids <- na.omit(entrez_ids)

go_res <- enrichGO(gene = entrez_ids,
                   OrgDb = org.At.tair.db,
                   keyType = "ENTREZID",
                   ont = "BP",   # options: BP, MF, CC
                   pAdjustMethod = "BH",
                   qvalueCutoff = 0.05,
                   readable = TRUE)
go_res

# Plot results
dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana")

# Option to relax thresholds:
go_res <- enrichGO(gene = entrez_ids,
                   OrgDb = org.At.tair.db,
                   keyType = "ENTREZID",
                   ont = "BP",   # options: BP, MF, CC
                   pAdjustMethod = "BH",
                   qvalueCutoff = 0.2,
                   pvalueCutoff = 0.1,
                   readable = TRUE)
dotplot(go_res, showCategory = 15, title="GO BP: Arabidopsis Thaliana")

### USING SYMBOL instead of using ENTREZ_ID
symbols <- mapIds(org.At.tair.db, keys = sig_gene_names, column="SYMBOL",
                  keytype = "TAIR", multiVals = "first")

# Get rid of the missing mappings
symbols <- na.omit(symbols)

go_res2 <- enrichGO(gene = symbols,
                    OrgDb = org.At.tair.db,
                    keyType = "SYMBOL",
                    ont = "BP",   # options: BP, MF, CC
                    pAdjustMethod = "BH",
                    qvalueCutoff = 0.05,
                    readable = TRUE)
dotplot(go_res2, showCategory = 15, title="GO BP: Arabidopsis Thaliana")

### KEGG Enrichment
# enrichKEGG identifies over-represented KEGG pathways for Arabidopsis.
ekegg <- enrichKEGG(gene = sig_gene_names, organism = "ath", pvalueCutoff = 0.05)
barplot(ekegg, showCategory = 15, title="KEGG: Arabidopsis thaliana")
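Back on the command line, you can confirm that the exported results file was written and see how many genes passed the filters. A minimal sketch, assuming the script's working directory was your intro_rnaseq folder:
cd /90daydata/shared/$USER/intro_rnaseq
head -2 DESeq2_results_norm_counts.csv
wc -l DESeq2_results_norm_counts.csv   # number of significant genes plus one header line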
Workshop Materials
Workshop materials and recordings are available for USDA employees at the links below.