Skip to main content

Introduction to RNA-seq Analysis


  • Provided by: ISU, SCINet Office

In this workshop we will provide an overview of RNA-sequencing (RNA-seq) technology and best practices for experimental design. Participants will also explore preprocessing steps, read alignment using a reference genome, and differential expression analysis using DESeq2. We will also introduce tools for functional analysis and determining the biological relevance of the results.

Additionally, participants will explore the RNA-seq analysis pipeline in the absence of a reference genome using de novo transcriptome assembly.

We will also introduce pseudoalignment techniques for rapid quantification of transcript abundance without base-by-base alignment.

Tutorial Setup Instructions

Steps to prepare for the tutorial session:

  • Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.

  • Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.

  • Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.

    mkdir -p /90daydata/shared/$USER/intro_rnaseq 
    cd /90daydata/shared/$USER/intro_rnaseq
    cp -r /project/scinet_workshop2/foundations_bioinf_2026/rnaseq_analysis/files/* .
    
  • Launch VS Code:

    • Under the Interactive Apps menu, select VS Code
    • Specify the following input values on the page:
      • Account: scinet_workshop2
      • Queue: ceres
      • QoS: 400thread
      • Number of cores: 2
      • Memory required: 6 G
      • Number of hours: 5
      • Optional Slurm Parameters: --reservation=foundations_workshop
      • Working Directory: /90daydata/shared/$USER/intro_rnaseq
    • Click Launch. The screen will update to the Interactive Sessions page. When your VS Code session is ready, the top card will update from Queued to Running and a Connect to VS Code button will appear. Click Connect to VS Code.

RNA-Seq Data Analysis

by Sivanandan Chudalayandi and Lavida Rogers

RNA-seq experiments are performed with an aim to comprehend transcriptomic changes in organisms in response to a certain treatment. They are also designed to understand the cause and/or effect of a mutation by measuring the resulting gene expression changes. Robust algorithms specifically designed to map short stretches of nucleotide sequences to a genome while being aware of the process of RNA splicing have led to many advances in RNAseq analysis. This tutorial will guide you through basic RNAseq analysis, beginning at quality checking of the RNAseq reads through to getting the differential gene expression results. We have downloaded an Arabidopsis dataset from NCBI for this purpose. Check the BioProject page for more information.

Background Information: Experimental design

This experiment compares wild type (WT) and atrx-1 mutant to analyze how the loss of function of ATRX protein results in changes in gene expression. The ATRX protein is a histone chaperone known to be an important player in the regulation of gene expression. RNA was isolated from three WT replicates and three mutant replicates. The transcriptome was enriched/isolated using the plant RiboMinus kit for obtaining total RNA. RNA-seq was carried out on an Illumina Hiseq 2500. The sequencing reads are paired-end data, hence we have 2 files per replicate.

Condition replicate 1 replicate 2 replicate 3

WT

SRR4420293_1.fastq.gz
SRR4420293_2.fastq.gz

SRR4420294_1.fastq.gz
SRR4420294_2.fastq.gz

SRR4420295_1.fastq.gz
SRR4420295_2.fastq.gz

atrx-1

SRR4420296_1.fastq.gz
SRR4420296_2.fastq.gz

SRR4420297_1.fastq.gz
SRR4420297_2.fastq.gz

SRR4420298_1.fastq.gz
SRR4420298_2.fastq.gz

  1. Getting Data

    The Raw sequence data

    The files needed for this tutorial have already been downloaded. This will allow us to get started right away with our analyses without waiting on downloads or dealing with connectivity issues.

    The Reference Genome:

    We will also need the genome file and associated GTF/GFF file for Arabidopsis. These can be downloaded directly from NCBI, or plants Ensembl website, or the Phytozome website*.
    *The Phytozome needs logging in and selecting the files.

    For this tutorial, we have already downloaded the following files from NCBI.

  2. Quality Check

    We use fastqc, a tool that provides a simple way to do quality control checks on raw sequence data coming from high-throughput sequencing pipelines (FastQC).

    Script for fastqc Ensure that you are in /90daydata/shared/$USER/intro_rnaseq. Create the empty script file:

      touch 00_Scripts/01_fastqc.sl
    

    Open 01_fastqc.sl in the VS Code editor and copy and paste the script below:

      #!/bin/bash 
      #SBATCH -N1 
      #SBATCH -c8 
      #SBATCH --account=scinet_workshop2
      #SBATCH --reservation=foundations_workshop
      #SBATCH -J fastqc  
      #SBATCH -o slurm_logs/fastqc_%j.out 
      #SBATCH -e slurm_logs/fastqc_%j.err 
      #SBATCH -t 01:00:00 
    
      # Load required modules 
      module load fastqc 
      module load multiqc 
    
      # Define Directories: 
      RAW_DIR="/90daydata/shared/$USER/intro_rnaseq/00_RawData" 
      OUT_DIR="$SLURM_SUBMIT_DIR/01_FastqC" 
      # make output directories 
      mkdir -p $OUT_DIR 
      # The main command 
      fastqc -t 8 -o $OUT_DIR $RAW_DIR/*gz 
      multiqc -p -o 01a_MultiQC/ $OUT_DIR --title "Arabidopsis_RNAseq" 
    

    Submit the script:

      sbatch 00_Scripts/01_fastqc.sl 
    
  3. Read Quality Trimming

    When would you trim?

    • To remove bases of poor quality
    • To remove adapter sequences
    • Reads for demonstrating trimming

    We will use BBDUK: Decontamination using kmers (BBDUK Guide)

    Script for bbduk: Ensure that you are in /90daydata/shared/$USER/intro_rnaseq. Create the empty script file:

      touch 00_Scripts/02_bbduk.sl
    

    Open 02_bbduk.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash 
    #SBATCH -N1 
    #SBATCH -c8 
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop
    #SBATCH -J bbduk
    #SBATCH -o slurm_logs/bbduk_%j.out 
    #SBATCH -e slurm_logs/bbduk_%j.err 
    #SBATCH -t 01:00:00 
    
    module load bbtools
    module load parallel
    
    mkdir -p 02_Trimmed
    # Define adapters variables 
    adapters="/software/el9/apps/bbtools/39.01/resources/adapters.fa"
    
    # Run bbduk in parallel:
    ls 00_RawData/*_1.fastq.gz | parallel -j6 "bbduk.sh -Xmx4g in1={} in2={= s/_1.fastq.gz/_2.fastq.gz/ =} \
    out1=02_Trimmed/{= s:.*/::; s/_1.fastq.gz/_trimmed_1.fastq.gz/ =} \
    out2=02_Trimmed/{= s:.*/::; s/_1.fastq.gz/_trimmed_2.fastq.gz/ =} \
    ref=$adapters ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=20 minlength=50 tpe tbo"
    

    Explanation of options:

    • ref=adapters.fa: Reference file containing adapter sequences to be trimmed.
    • ktrim=r: Trim adapters from the right end of reads.
    • k=23: K-mer length for matching adapters.
    • mink=11: Minimum k-mer length for adapter matching.
    • hdist=1: Allow one mismatch in k-mer matching.
    • qtrim=rl: Trim both ends of reads based on quality.
    • trimq=20: Quality threshold for trimming.
    • minlength=50: Discard reads shorter than 50 bases after trimming.
    • tpe: Trim both reads of a pair if one is trimmed.
    • tbo: Trim adapters based on pair overlap detection.

    Submit the script:

      sbatch 00_Scripts/02_bbduk.sl
    

    Run a quality check on the trimmed reads

    Script for fastqc: Create the empty script file:

    touch 00_Scripts/03_fastqc.sl
    

    Open 03_fastqc.sl in the VSCode editor and copy and paste the script below:

    #!/bin/bash 
    #SBATCH -N1 
    #SBATCH -c8 
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop
    #SBATCH -J fastqc  
    #SBATCH -o slurm_logs/fastqc_%j.out 
    #SBATCH -e slurm_logs/fastqc_%j.err 
    #SBATCH -t 01:00:00 
    
    # Load required modules 
    module load fastqc 
    module load multiqc 
    
    # Define Directories: 
    RAW_DIR="/90daydata/shared/$USER/intro_rnaseq/02_Trimmed" 
    OUT_DIR="$SLURM_SUBMIT_DIR/03_FastqC" 
    # make output directories 
    mkdir -p $OUT_DIR 
    # The main command 
    fastqc -t 8 -o $OUT_DIR $RAW_DIR/*gz 
    multiqc -p -o 03a_MultiQC/ $OUT_DIR --title "Arabidopsis_RNAseq_TrimmedData"
    

    Submit the script:

      sbatch 00_Scripts/03_fastqc.sl
    
  4. Mapping reads to the genome using HISAT2

    There are several mapping programs available for aligning RNAseq reads to the genome. Generic aligners such as BWA, bowtie2, BBMap, etc., are not suitable for mapping RNAseq reads because they are not splice-aware. Because RNA-seq primarily targets mature mRNA, many reads consist of spliced exonic regions. Mapping these specific ‘junction reads’ back to the genome requires splice-aware aligners to split and bridge them across intronic gaps. In this tutorial, we will use HISAT2, a successor of Tophat2.

    Build genome index

    For HiSat2 mapping, you first need to index the genome and then use the read pairs to map the indexed genome (one set at a time). For indexing the genome, we use the hisat2-build command.

    Create the empty script file:

    touch 00_Scripts/04a_hisat2_build.sl
    

    Open the file 04a_hisat2_build.sl in the VS Code editor and copy and paste the script below:

        #!/bin/bash 
        #SBATCH -N1 
        #SBATCH -c8 
        #SBATCH --account=scinet_workshop2
        #SBATCH --reservation=foundations_workshop
        #SBATCH -J hisat2_build
        #SBATCH -o slurm_logs/hisat2_build_%j.out 
        #SBATCH -e slurm_logs/hisat2_build_%j.err 
        #SBATCH -t 01:00:00 
        
        module load hisat2
        DATA_DIR=/90daydata/shared/$USER/intro_rnaseq/00_Genome
        mkdir -p 04_Hisat2_Index/
        
        hisat2-build -p8 $DATA_DIR/GCF_000001735.4_TAIR10.1_genomic.fna 04_Hisat2_Index/Arabidopsis_TAIR10.1
    

    Submit the script:

      sbatch 00_Scripts/04a_hisat2_build.sl
    

    Once complete, you should see several files with the .ht2 extension in the 04_Hisat2_Index folder. These are the index files.

    Mapping Reads to the index:

    We will do this using parallel within a Slurm batch script. First we will need two executable scripts (included):

    • run_hisat2.sh (Do not copy; file already included. Contents below are for reference.)
      #!/bin/bash 
      R1=$1 
      R2=$2 
      SAMPLE=$(basename "$R1" _1.fastq.gz) 
      hisat2 -p 4 -x Arabidopsis_TAIR10.1 -1 "$R1" -2 "$R2" 2> LOG/"${SAMPLE}".log -S SAM/"${SAMPLE}".sam 
      
    • run_samtools.sh (Do not copy; file already included. Contents below are for reference.) cat run_samtools.sh
      #!/bin/bash 
      # Usage: ./run_samtools.sh SAMFILE 
      SAMFILE=$1 
      SAMPLE=$(basename $SAMFILE .sam) 
      samtools view -@ 7 -bS $SAMFILE | samtools sort -@ 7 -o BAM/${SAMPLE}.bam 
      

    The two executable shell scripts will be called within a slurm batch script. We will create the slurm script as follows:

    Create the empty script:

    touch 00_Scripts/04_hisat2_samtools.sl
    

    Open the file 00_Scripts/04_hisat2_samtools.sl in the VS Code editor and copy and paste the script below:

     #!/bin/bash 
     #SBATCH --account=scinet_workshop2
     #SBATCH --reservation=foundations_workshop 
     #SBATCH -N1 
     #SBATCH -c24
     #SBATCH -J hisat2  
     #SBATCH -o slurm_logs/hisat2_samtools%j.out 
     #SBATCH -e slurm_logs/hisat2_samtools%j.err 
     #SBATCH -t 08:00:00 
      
     # Load required modules 
     module load hisat2 
     module load samtools 
     module load parallel 
      
     # Define Directories: 
     RAW_DIR="/90daydata/shared/$USER/intro_rnaseq/02_Trimmed" 
     INDEX_DIR="/90daydata/shared/$USER/intro_rnaseq/04_Hisat2_Index" 
     OUT_DIR="$SLURM_SUBMIT_DIR/04_Hisat2_Aligned"
     SCRIPTS="/90daydata/shared/$USER/intro_rnaseq/00_Scripts" 
     mkdir -p $OUT_DIR 
      
     # change to compute node's local dir 
     cd $TMPDIR 
      
     # Set Permissions 
     chgrp -R proj-scinet_workshop2 $TMPDIR 
     chmod -R g+s $TMPDIR 
      
     # make output directories 
     mkdir SAM 
     mkdir BAM 
     mkdir LOG 
      
     # copy input data and shell scripts to the node's scratch dir 
     cp --preserve=ownership $RAW_DIR/*gz . 
     cp --preserve=ownership $INDEX_DIR/Arabidopsis_TAIR10.1*h* .  
     cp --preserve=ownership $SCRIPTS/run_hisat2.sh . 
     cp --preserve=ownership $SCRIPTS/run_samtools.sh . 
      
     # Check if folders are in TMPDIR 
     echo "Files in TMPDIR:"
     find . -type f 
      
     # The main command (run the two executable scripts) 
     parallel -j 6 "./run_hisat2.sh {1} {2}" ::: *_1.fastq.gz :::+ *_2.fastq.gz 
     parallel -j 6 "./run_samtools.sh {}" ::: SAM/*.sam 
      
     # Move the output directories back to the output folder 
     mv BAM $OUT_DIR 
     mv LOG $OUT_DIR  
    

    Submit the script:

      sbatch 00_Scripts/04_hisat2_samtools.sl
    

    This will take about 15-30 minutes to complete.


    Exploring a BAM file:

    Use the code below as a guide.

    # Look at the header — what genome was used? what aligner?
    samtools view -H sample.bam | head -30
    
    # View first 10 alignments in human-readable form
    samtools view sample.bam | head -10
    
    # How many reads total? mapped? properly paired?
    samtools flagstat sample.bam
    
    # Alignment summary statistics
    samtools stats sample.bam | grep ^SN | cut -f 2-
    
    # Pull only reads mapping to one gene region (e.g. a known highly expressed gene)
    samtools view sample.bam chr1:1000000-1001000
    
    # Filter to only properly paired, primary alignments (the reads featureCounts actually uses)
    samtools view -f 2 -F 256 sample.bam | head -10
    
    # Check insert size distribution (sanity check on library)
    samtools stats sample.bam | grep ^IS | cut -f 2-3 | head -20
    

    Exploring the Hisat2 logs:

    The code below is an example of exploring the log file.

    cat SRR4420293_trimmed.log 
    14930965 reads; of these:
      14930965 (100.00%) were paired; of these:
        343736 (2.30%) aligned concordantly 0 times
        6243732 (41.82%) aligned concordantly exactly 1 time
        8343497 (55.88%) aligned concordantly >1 times
        ----
        343736 pairs aligned concordantly 0 times; of these:
          16257 (4.73%) aligned discordantly 1 time
        ----
        327479 pairs aligned 0 times concordantly or discordantly; of these:
          654958 mates make up the pairs; of these:
            383410 (58.54%) aligned 0 times
            75880 (11.59%) aligned exactly 1 time
            195668 (29.87%) aligned >1 times
    98.72% overall alignment rate
    
  5. Abundance Estimation

    There are many programs available for quantifying transcript abundance from RNA-seq data. The two most popular tools are featureCounts and HTSeq. We will need a file with aligned sequencing reads (BAM files generated in the previous step) and a list of genomic features (from the GFF file). featureCounts is a highly efficient, general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins, and chromosomal locations. It also outputs statistics for the overall summarization of results, including the number of successfully assigned reads and the number of reads that failed to be assigned for various reasons. We can run featureCounts on all SAM/BAM files at the same time or individually.

    You will need to have the subread and parallel modules loaded. They are included in the script below.

    Create the empty script:

    touch 00_Scripts/05_featurecounts.sl
    

    Open the file 00_Scripts/05_featurecounts.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash 
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop 
    #SBATCH -N 1 
    #SBATCH -c 16 
    #SBATCH -t 01:00:00 
    #SBATCH -J featCounts 
    #SBATCH -o slurm_logs/featCounts.%j.out 
    #SBATCH -e slurm_logs/featCounts.%j.err 
    
    # Load required modules 
    module purge 
    module load subread 
    module load parallel 
    
    # Define inputs 
    ANNOT_GTF="/90daydata/shared/$USER/intro_rnaseq/00_Genome/GCF_000001735.4_TAIR10.1_genomic.gtf" 
    BAMDIR="/90daydata/shared/$USER/intro_rnaseq/04_Hisat2_Aligned/BAM" 
    OUTDIR="05_FeatCounts" 
    mkdir -p ${OUTDIR} 
    
    # FeatureCounts command 
    parallel -j 4 "featureCounts -T 4 -s 2 -p --countReadPairs -t exon -g gene_id -a ${ANNOT_GTF} -o ${OUTDIR}/{/.}.txt {}" ::: ${BAMDIR}/*.bam 
    
    scontrol show job ${SLURM_JOB_ID} 
    

    Submit the batch script:

    sbatch 00_Scripts/05_featurecounts.sl 
    

    This creates the following set of files in the specified output folder:

    • Count Files:
      SRR4420298_trimmed.txt 
      SRR4420293_trimmed.txt 
      SRR4420297_trimmed.txt 
      SRR4420296_trimmed.txt 
      SRR4420295_trimmed.txt 
      SRR4420294_trimmed.txt
      
    • Additionally, summary files are produced. These give the summary of reads that were either ambiguous, multi mapped, mapped to no features, or unmapped, among other statistics. We can refer to these to fine-tune our analyses.
      SRR4420298_trimmed.txt.summary 
      SRR4420293_trimmed.txt.summary 
      SRR4420295_trimmed.txt.summary 
      SRR4420296_trimmed.txt.summary 
      SRR4420294_trimmed.txt.summary 
      SRR4420297_trimmed.txt.summary 
      

    Using the following Linux commands, we can edit the outputs to produce a single count table. This count table will be loaded into R for differential expression analysis.

    • Step 1: From /90daydata/shared/$USER/intro_rnaseq/05_FeatCounts edit the count files “in-place” using sed

      for f in *txt; do
        sed -i '/^#/d' "$f"
        sed -i 's|/90day.*BAM/||g' "$f"
        sed -i 's|_.*bam||g' "$f"
        done
      
    • Step 2: Make directory called 06_Combined_Counts in /90daydata/shared/$USER/intro_rnaseq/ and make a combined count table with the first column as “Gene_IDs” and 6 other columns one for each sample’s gene counts.

      mkdir 06_Combined_Counts
      cd 06_Combined_Counts
      COUNTS="/90daydata/shared/$USER/rna_seq/05_FeatCounts"
        
      awk 'BEGIN {OFS="\t"} {print $1,$7}' $COUNTS/SRR4420293_trimmed.txt > col1.txt
      awk 'BEGIN {OFS="\t"} {print $7}' $COUNTS/SRR4420294_trimmed.txt > col2.txt
      awk 'BEGIN {OFS="\t"} {print $7}' $COUNTS/SRR4420295_trimmed.txt > col3.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' $COUNTS/SRR4420296_trimmed.txt > col4.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' $COUNTS/SRR4420297_trimmed.txt > col5.txt 
      awk 'BEGIN {OFS="\t"} {print $7}' $COUNTS/SRR4420298_trimmed.txt > col6.txt 
      paste col1.txt col2.txt col3.txt col4.txt col5.txt col6.txt > Arabidopsis_RNAseq_Counts.txt 
      
    • The combined count file… a sneak peek:

      head -4 Arabidopsis_RNAseq_Counts.txt 
      Geneid  SRR4420293      SRR4420294      SRR4420295      SRR4420296      SRR4420297      SRR4420298
      AT1G01010       10      1       8       25      10      13
      AT1G01020       36      3       24      80      21      14
      AT1G03987       0       0       0       0       0       0
      
    • Convert to and save as a CSV file.

      sed 's|\t|,|g' Arabidopsis_RNAseq_Counts.txt > Arabidopsis_RNAseq_Counts.csv
      

    This file is ready to be imported to R for DESeq2.

    Before venturing into DESeq2, let’s discuss some common scenarios in biological RNAseq data:

    1. Low Coverage We don’t have enough reads to confidently place every read, so we need to be more permissive at the alignment stage but more stringent at the counting stage.

    a. HISAT2 adjustments:

    #Be more sensitive — allow more mismatches, don't penalize gaps as heavily
    --score-min L,-0.6,-0.6   # default is L,-0.2,-0.2; relaxing this recovers more alignments
    --no-mixed                 # REMOVE this flag if you had it; you want to keep singleton pairs
    --no-discordant            # similarly, REMOVE — discordant pairs are still signal at low coverage
    
    # Allow more multimappers through — at low coverage you can't afford to discard them
    -k 5                       # report up to 5 alignments per read instead of default 1
    

    b. featureCounts adjustments:

    # Accept multi-mapping reads — at low coverage every read counts
    -M                         # count multi-mapping reads
    --fraction                 # distribute multi-mapper counts fractionally across loci
    
    # Be permissive about read-to-feature overlap
    --minOverlap 1             # default is 1 but make sure it isn't set higher
    --fracOverlap 0            # don't require any minimum fraction of read to overlap
    
    # Don't require both pairs to map
    -p                         # paired-end flag — keep it
    --countReadPairs           # but consider switching to counting reads not pairs if coverage is very low
    

    2. Very High Coverage You have so many reads that multi-mappers and ambiguous counts overwhelm your results, and running times become excessive.

    a. HISAT2 adjustments:

    # Be more stringent — only accept high-confidence alignments
    --score-min L,0,−0.2       # stricter than default; fewer but more reliable alignments
    
    # Limit multimappers aggressively
    -k 1                        # only report the single best alignment (faster too)
    --no-discordant             # discard discordant pairs — at high coverage you can afford to
    --no-mixed                  # discard singletons — same reasoning
    
    # Practical: use more threads
    -p 16                       # or however many cores you have; runtime matters at high coverage
    
    # Downstream: you may want to mark and remove duplicates after alignment
    # samtools markdup or Picard MarkDuplicates
    samtools markdup -r input.bam output.bam   # -r actually removes rather than just marking
    

    b. featureCounts adjustments:

    # Be stringent about what counts
    -Q 30                      # only count reads with MAPQ >= 30 (high-confidence placements)
    --ignoreDup                # ignore reads flagged as PCR duplicates by markdup
    
    # Require meaningful overlap — at high coverage you can afford to be strict
    --minOverlap 10            # require at least 10 bp overlap with feature
    --fracOverlap 0.5          # require at least 50% of read to overlap the feature
    
    # Don't count multi-mappers — at high coverage you have enough primary alignments
    # simply omit -M flag
    

    3. Polyploid Data

    This is the most conceptually difficult scenario because the genome itself contains multiple copies of nearly identical sequences, so multi-mapping is not a technical artifact — it’s biologically real and expected.

    a. HISAT2 adjustments:

    # You must allow multimappers — the homeologs are real
    -k 10                      # report up to 10 alignments; in hexaploids (wheat) go higher
    --no-spliced-alignment     # consider this if your annotation is subgenome-specific and clean
                               # otherwise leave spliced alignment on
    
    # For allopolyploids with subgenome-specific references (e.g. AtAt, CaCa in Brassica):
    # Build separate indices per subgenome and align to each, then reconcile
    # OR align to the combined polyploid reference and use subgenome-specific annotation
    
    # Score penalty tuning for homeolog discrimination
    --mp 4,2                   # mismatch penalty: max 4, min 2 (default 6,2)
                               # slightly more tolerant of homeolog-level mismatches
    --rdg 3,2                  # read gap penalty — relax slightly
    

    b. featureCounts adjustments:

    # Multi-mapper handling is the central decision
    -M --fraction              # fractional counting — splits counts equally among homeologs
                               # this is the most conservative and commonly recommended approach
    
    # If your GTF has subgenome-specific gene IDs (e.g. TraesCS1A vs TraesCS1B in wheat):
    -g gene_id                 # default — works if annotation distinguishes subgenomes
    
    # If you want to collapse homeologs intentionally:
    -g gene_name               # use gene name instead of ID if homeologs share a name in GTF
    
    # Require higher overlap to reduce homeolog cross-assignment
    --minOverlap 25            # longer overlap = more discriminating placement
    --fracOverlap 0.5
    

    4. Poorly Annotated Genome

    Sometimes, we have to deal with a draft genome with fragmented assembly, a GFF3 converted from another species, or a genome where most genes are annotated by ab initio prediction rather than transcript evidence.

    a. HISAT2 adjustments:

    # Use a splice-site aware strategy but don't trust the annotation blindly
    # First: generate a splice site file FROM your own RNA-seq data, not just the GTF
    hisat2_extract_splice_sites.py annotation.gtf > known_splice_sites.txt
    hisat2_extract_exons.py annotation.gtf > known_exons.txt
    
    # Build index with known sites as a guide, but allow novel junction discovery
    hisat2-build --ss known_splice_sites.txt --exon known_exons.txt genome.fa genome_index
    
    # During alignment: allow novel splicing
    # (this is actually the default — just don't use --no-spliced-alignment)
    --dta                      # downstream transcriptome assembly flag
                               # produces output better suited for StringTie if you plan to
                               # improve the annotation using your own reads
    
    # Be more permissive about alignment score — annotation gaps mean some real reads
    # will align "imperfectly"
    
    --score-min L,-0.6,-0.6
    

    b. featureCounts adjustments:

    # Poorly annotated genomes often have fragmented gene models
    # A read spanning two annotation fragments of the same real gene gets discarded by default
    -f                         # count at exon level instead of gene level
                               # then you can aggregate manually and catch fragmented models
    
    # Allow reads that partially overlap annotated regions
    --fracOverlap 0.25         # lower threshold — partial overlaps are meaningful here
    --minOverlap 5
    
    # Don't require both ends of a pair to map to the same feature
    # (fragmented annotation means mates may land in annotated and unannotated space)
    -p but omit --requireBothEndsMapped
    
    # Relaxed strandedness — poorly converted annotations often have strand errors
    -s 0                       # unstranded counting if you're unsure annotation strand is reliable
    

    For a poorly annotated genome, use the RNA-seq data to improve the annotation rather than just accept it. One potential option is for us to use downstream transcriptome assembly). The steps below serve as a guide:

    HISAT2 (with –dta flag)

    • StringTie (to assemble transcripts per sample)
    • StringTie –merge (create a consensus annotation across samples)
    • compare with MAKER or GFFCompare against existing annotation
    • use improved GTF for featureCounts

    BUSCO on the resulting transcript set gives you a completeness metric to report.

  6. Differential Gene Expression Analysis

    Make a directory called 07_DESeq2 and copy the R Script to the new directory. This is more for our own convenience, so that when we open RStudio and set /90daydata/shared/$USER/intro_rnaseq/07_DESeq2 as the working directory, we will have the R script in the same directory.

    mkdir 07_DESeq2
    cp 00_Scripts/DESeq2.R 07_DESeq2
    
    • RStudio Server: Back on the main Ceres OOD tab, click on the top or side navigation bar: “Interactive Apps” > “RStudio Server”.
    • Fill the input fields with the following (input fields not listed below can be left at their default values):
      • Account: scinet_workshop2
      • Queue: ceres
      • QOS: 400thread
      • R Version: 4.4.1
      • Number of hours: 4
      • Number of cores: 1
      • Memory required: 8GB
      • Optional Slurm Arguments: (leave empty)
    • Click the “Launch” button.
    • Wait a moment for the job card to update from “Queued” to “Running”.
    • Please confirm that clicking on the “Connect to RStudio Server” button opens a new tab with the RStudio Server interface.

    We have included the entire R script for DESeq2 in the workshop materials as the file DESeq2.R. The script is partly based on Stephen Turner’s template. We provide the R script here largely for reference.

  7. Functional Annotation

    We have included the entire R script for GOseq in the workshop materials as the file functional_annot.R.

    Let’s copy the script to our current working directory:

    cp /project/scinet_workshop2/foundations_bioinf_2026/rnaseq_analysis/files/00_Scripts/functional_annot.R 00_Scripts/
    
  8. Quantification using Pseudoalignment

    RNA-seq analysis usually involves mapping sequencing reads to a reference genome to quantify gene expression levels. When a high-quality reference genome is unavailable, the transcriptome is typically assembled from the sequencing reads, allowing for the estimation of transcript abundance without performing a full alignment.

    Kallisto is a fast, alignment-free method for estimating transcript abundance from RNA-seq data. Instead of traditional base-by-base alignment, kallisto uses k-mer matching, which involves identifying short, fixed-length sequences (k-mers) shared between reads and transcripts to quickly assign reads to compatible transcripts.

    We will explore how kallisto works by using a transcriptome for Arabidopsis which we downloaded from Ensembl.

    Relaunch VS Code and navigate to the working directory:

    cd /90daydata/shared/$USER/intro_rnaseq
    
    • Load the software
    module load kallisto
    

    Build an index

    • Create working directories:
    mkdir 08_Kallisto
    
    cd 08_Kallisto
    
    mkdir -p index results
    
    • To build the index:

    The basic format of the kallisto command we’ll use is: kallisto index -i [index name.idx] [transcriptome path]

    • kallisto index: builds index
    • -i: location of index file
    • .fa: transcript sequences
    • .idx: structure that allows Kallisto to run quickly
    TRANS="/90daydata/shared/$USER/intro_rnaseq/00_Transcriptome"
    
    kallisto index -i index/Arabidopsis_thaliana.idx ${TRANS}/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
    
    • Quantification

    For this step, we’ll use kallisto quant with these options:

    • `-i : index file
    • -o: output folder
    • -b:bootstrap runs / confidence estimations
    • -t: number of CPU threads
    • paired read 1
    • paired read 2

    Because this process is computationally intensive, we’ll run kallisto quant using a separate Slurm job so that we can give it more memory and CPUs to work with. After the job completes, we should expect the following output files:

    • abundance.tsv: quantification file
    • abundance.h5: binary HDF5 format of the quantification results
    • run_info: summary file with information about the job run

    Navigate to main working directory:

    cd /90daydata/shared/$USER/intro_rnaseq
    
    touch 00_Scripts/08_kallisto_quant.sl
    

    Open the file 00_Scripts/08_kallisto_quant.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop 
    #SBATCH --job-name=kallisto_quant
    #SBATCH --output=slurm_logs/kallisto_%j.out
    #SBATCH --error=slurm_logs/kallisto_%j.err
    #SBATCH --time=04:00:00
    #SBATCH --cpus-per-task=16
    #SBATCH --mem=50G
    
    # Load kallisto
    module load kallisto 
    
    # Set paths 
    INDEX="/90daydata/shared/$USER/intro_rnaseq/08_Kallisto/index/Arabidopsis_thaliana.idx"
    RAW_DIR="/90daydata/shared/$USER/intro_rnaseq/00_RawData"
    RESULTS_DIR="/90daydata/shared/$USER/intro_rnaseq/08_Kallisto/results"
    
    # find all read 1 files that end with _1.fastq.gz
    for r1 in ${RAW_DIR}/*_1.fastq.gz
    
    do 
    
    # get sample name only
    sample=$(basename "$r1" _1.fastq.gz)
    
    # find matching read 2
    r2="${RAW_DIR}/${sample}_2.fastq.gz"
     
    # echo the sample being processed
    echo "Currently processing: ${sample}"
    
    # Run kallisto
    
    kallisto quant -i ${INDEX} -o ${RESULTS_DIR}/${sample} -b 100 -t ${SLURM_CPUS_PER_TASK} ${r1} ${r2}
    
    # Print completion message
    echo "Finished with ${sample}"
    
    done
    
    echo "All runs complete" 
    

    Submit the slurm script:

    sbatch 00_Scripts/08_kallisto_quant.sl
    

    Let’s take a look at the output files:

    head abundance.tsv
    
    cat run_info
    

    Now that we have transcript-level expression estimates for each sample, we need to combine them before running DESeq2 for differential gene expression analysis. We will do this in R. We have included the entire R script for this in the workshop materials as the file kallisto_deseq.R. Let’s copy the script to our current working directory:

    cd /90daydata/shared/$USER/intro_rnaseq
    cp /project/scinet_workshop2/foundations_bioinf_2026/rnaseq_analysis/files/00_Scripts 00_Scripts/
    

    Next, launch R Studio and run the script.

    The script contents are also provided below for reference.

  9. De Novo Transcriptome Assembly and Quantification

    Let’s now assume that Arabidopsis doesn’t have a sequenced genome. We start with the RNA-seq reads and assemble them into de novo transcripts — a fundamentally harder problem than when you have the genome alignment because we have no map to work from. One such de novo assembler is Trinity. Before understanding what Trinity does, it helps to understand why naive assembly fails for RNA-seq data. Unlike a genome, a transcriptome has wildly unequal coverage — a highly expressed gene might have 10,000× more reads than a minimally expressed one. Standard genome assemblers assume roughly uniform coverage and break down completely under these conditions. Trinity was designed specifically to handle this.

    Trinity uses de Bruijn graphs to partition and assemble reads. Rather than trying to assemble reads directly, Trinity breaks every read into short overlapping substrings of length k (k-mers), and builds a graph where each unique k-mer is a node and edges represent overlaps. Paths through this graph represent candidate transcripts. The appeal of this approach is that it doesn’t require a reference — the reads wire up the graph themselves.

    Trinity then works in three sequential stages:

    1. Inchworm traverses the de Bruijn graph greedily, following the most abundant k-mer paths first. This produces a set of unique transcript sequences — essentially the dominant isoform at each locus. It is fast but ignores complexity: alternatively spliced isoforms and paralogous genes are collapsed or missed at this stage.

    2. Chrysalis takes the Inchworm contigs and clusters them into groups that share k-mers, then builds a separate, smaller de Bruijn graph for each cluster. This is the key architectural decision that makes Trinity tractable — instead of one enormous graph for the whole transcriptome (which would be computationally expensive), Trinity generates thousands of small, manageable graphs, each representing the transcriptional complexity at a single locus or gene family.

    3. Butterfly then processes each small graph independently. It traces all the paths through the graph that are supported by read evidence, resolving alternative splicing events, paralog distinctions, and sequencing errors. The output is a FASTA file of assembled transcript sequences, potentially many per locus.

    A few things worth keeping in mind about de novo assembly:

    • k-mer size (25): Larger k gives more specificity but requires higher coverage; smaller k is more sensitive but generates more spurious connections in the graph.

    • Output is transcripts (multiple isoforms per locus), and it has no way of knowing which are real biological isoforms versus assembly artifacts. Downstream tools like BUSCO (completeness assessment) and TransDecoder (ORF prediction) are essential next steps.

    • Coverage depth: De novo assembly generally requires deeper sequencing than genome-guided approaches — 50–100M reads per sample is a more comfortable starting point than the 20–30M that suffices for HISAT2 alignment.

    There is no ground truth. Unlike genome-guided assembly where you can check your mapping rate against a known reference, de novo assembly quality is harder to assess. BUSCO scores and the N50 of your assembly are your primary quality checks.

    Trinity’s output is a hypothesis about what transcripts exist in your sample. Every downstream analysis — quantification, differential expression, annotation — is testing and refining that hypothesis.


    For Trinity assembly, we will first create the empty script file:

    touch 00_Scripts/09_trinity_slurm.sl
    

    Open the file 00_Scripts/09_trinity_slurm.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop
    #SBATCH -N1
    #SBATCH -c72
    #SBATCH -J trinity
    #SBATCH -o slurm_logs/trinity_%j.out
    #SBATCH -e slurm_logs/trinity_%j.err
    #SBATCH -t 24:00:00
    
    echo "Started Job: $(date)"
    
    # Load required modules
    module load trinityrnaseq
    
    # Define Directories:
    RAW_DIR="/90daydata/shared/$USER/intro_rnaseq/00_RawData"
    OUT_DIR="$SLURM_SUBMIT_DIR/09_Trinity"
    
    mkdir -p $OUT_DIR
    
    #####################
    echo "Trinity Started: $(date)"
    # running Trinity
    
    # Trinity requires that the each set of reads are concatenated into a file each. make combined left and right reads.
    
    cat $RAW_DIR/*1.*gz > $OUT_DIR/left_1.gz
    cat $RAW_DIR/*2.*gz > $OUT_DIR/right_2.gz
    
    cd  $OUT_DIR
    Trinity --seqType fq \
     --max_memory 150G \
     --CPU 64 \
     --normalize_by_read_set \
     --output Trinity_At \
     --left left_1.gz \
     --right right_2.gz \
     --trimmomatic 
    
    echo "Trinity ended: $(date)"
    
    scontrol show job $SLURM_JOB_ID
    echo "Completed job: $(date)"
    

    Submit the script:

    sbatch 00_Scripts/09_trinity_slurm.sl
    

    This would likely take longer than 2 hours to run. The output folder will have, among others, the following important files

    1. Trinity_At.Trinity.fasta: The primary assembly file
    2. Trinity_At.Trinity.fasta.gene_trans_map: A mapping between Trinity genes and the corresponding transcripts

    Assembly Statistics

    We can take a quick look at the assembly stats. From the terminal run the following line:

    cd 09_Trinity
    TRINITY /usr/local/bin/util/TrinityStats.pl Trinity_At.Trinity.fasta > Trinity_At.stats
    

    Open the file Trinity_At.stats in the VSCode editor and examine the following:

    • Genes vs Transcripts
    • N50 Values
    • GC content

    BUSCO

    Create the empty script file:

    touch 00_Scripts/09_trinity_busco.sl
    

    Open the file 00_Scripts/09_trinity_busco.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop
    #SBATCH -N1
    #SBATCH -c16
    #SBATCH -J busco
    #SBATCH -o slurm_logs/busco_%j.out
    #SBATCH -e slurm_logs/busco_%j.err
    #SBATCH -t 4:00:00
    
    module purge 
    module load busco6
    ASSEMBLY_DIR="/90daydata/shared/$USER/intro_rnaseq/09_Trinity"
    EUK_ODB12="/90daydata/shared/$USER/intro_rnaseq_practice/09_Trinity/eukaryota_odb12"
    busco \
      -i $ASSEMBLY_DIR/Trinity_At.Trinity.fasta \
      -l $EUK_ODB12 \
      -o 01_TRINITY-BUSCO \
      --out_path $ASSEMBLY_DIR \
      -m transcriptome \
      -c 16 \
      --offline
    

    Submit the script:

    sbatch 00_Scripts/09_trinity_busco.sl
    

    Examine short_summary.specific.eukaryota_odb12.01_TRINITY-BUSCO.txt

            C:90.7%[S:37.2%,D:53.5%],F:4.7%,M:4.7%,n:129       
            117     Complete BUSCOs (C)                        
            48      Complete and single-copy BUSCOs (S)        
            69      Complete and duplicated BUSCOs (D)         
            6       Fragmented BUSCOs (F)                      
            6       Missing BUSCOs (M)                         
            129     Total BUSCO groups searched                
    
    

    Salmon Quantification

    Create the empty script file:

    touch 00_Scripts/09_trinity_salmon.sl
    

    Open the file 00_Scripts/09_trinity_salmon.sl in the VS Code editor and copy and paste the script below:

    #!/bin/bash
    #SBATCH --account=scinet_workshop2
    #SBATCH --reservation=foundations_workshop
    #SBATCH -N1
    #SBATCH -c24
    #SBATCH -J align_estimate
    #SBATCH -o slurm_logs/align_estimate_%j.out
    #SBATCH -e slurm_logs/align_estimate_%j.err
    #SBATCH -t 24:00:00
    
    ASSEMBLY=/90daydata/shared/$USER/intro_rnaseq/09_Trinity/
    RAWDATA=/90daydata/shared/$USER/intro_rnaseq/00_RawData
    OUTDIR=/90daydata/shared/$USER/intro_rnaseq/10_Trinity_Salmon
    
    mkdir -p $OUTDIR
    module load trinityrnaseq
    
    ### PREP REFERENCE (Indexing)
    
    TRINITY \
    /usr/local/bin/util/align_and_estimate_abundance.pl \
        --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
        --est_method salmon \
        --trinity_mode \
        --prep_reference
    
    ### ALIGN and ESTIMATE
    
    cd $ASSEMBLY
    # SRR4420293
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420293_1.fastq.gz \
    --right $RAWDATA/SRR4420293_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420293 \
    --thread_count 24
    
    # SRR4420294
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420294_1.fastq.gz \
    --right $RAWDATA/SRR4420294_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420294 \
    --thread_count 24
    
    # SRR4420295
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420295_1.fastq.gz \
    --right $RAWDATA/SRR4420295_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420295 \
    --thread_count 24
    
    # SRR4420296
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420296_1.fastq.gz \
    --right $RAWDATA/SRR4420296_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420296 \
    --thread_count 24
    
    # SRR4420297
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420297_1.fastq.gz \
    --right $RAWDATA/SRR4420297_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420297 \
    --thread_count 24
    
    # SRR4420298
    TRINITY /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts $ASSEMBLY/Trinity_At.Trinity.fasta \
    --seqType fq \
    --left $RAWDATA/SRR4420298_1.fastq.gz \
    --right $RAWDATA/SRR4420298_2.fastq.gz \
    --est_method salmon \
    --trinity_mode \
    --output_dir $OUTDIR/SRR4420298 \
    --thread_count 24
    

    Submit the script:

    sbatch 00_Scripts/09_trinity_salmon.sl
    

    • May 4, 6-7, 2026, 1 – 5 PM ET
      • Prerequisites:
        • Familiarity with basic command-line concepts, next-generation sequencing data types, and have a general understanding of gene expression.