
Introduction to HPC Environments, Project Management, and Organization



This workshop provides hands-on training in using SCINet’s high-performance computing (HPC) clusters for bioinformatics workflows. Participants will learn how to access and navigate SCINet’s systems, as well as command-line basics for managing and analyzing bioinformatics data, including running BLAST and handling FASTA and FASTQ files. The workshop also covers project management and organization strategies to improve data organization and workflow efficiency.

Tutorial Setup Instructions

Steps to prepare for the tutorial session:

  • Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
  • Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
  • Create a workshop working directory by running the following commands. Note: you do not need to edit the commands to include your username; the shell resolves the $USER variable automatically.

    mkdir -p /90daydata/shared/$USER/intro_hpc 
    cd /90daydata/shared/$USER/intro_hpc 
    cp -r /project/scinet_workshop2/Bioinformatics_series/wk1_workshop/day2/ . 
    

High-Performance Computing (HPC) for Bioinformatics: A Practical Introduction

1. What Is an HPC System?

  • Login node:
    On Ceres, when you first log in you land on a login node, in your home directory (folder). You can prepare your work here.

    ceres $ pwd
    /home/$USER
    

    You can navigate to your folders, list your files, and check file sizes.
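
    For example, a minimal sketch of these steps:

    cd ~      # go to your home directory
    ls -lh    # list files with details and human-readable sizes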

  • Compute nodes: Where jobs are executed.
    • Request resources on a compute node by running the following command.

      srun --reservation=wk1_workshop -A scinet_workshop2 -t 05:00:00 -N1 -n4 --mem 8G --pty bash 
      
      • For those accessing post-workshop: Remove --reservation=wk1_workshop and change -A scinet_workshop2 to your own project account. For additional information, see our SLURM guide.
  • Storage system:
    Shared file system accessible by all nodes.
  • Project space:
    To check the size of your project space, we will use the du command, which reports disk usage:

    du -hs
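
    For example, to check the workshop project space (substitute your own project directory):

    du -hs /project/scinet_workshop2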
    
  • 90daydata:
    This is where you created your working directory and where you will work today.

    ceres $ pwd
    /90daydata/shared/sivanandan.chudalayandi/intro_bioinformatics/day2
    
  • Long-term storage: JUNO

    ssh nal-dtn.scinet.usda.gov
    Last login: Wed Apr 16 14:24:55 2025 from 199.245.99.3
    [$USER@nal-dtn-0 ~]$ 
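
    From the data-transfer node you can copy results to long-term storage. A minimal sketch, assuming your project has an LTS allocation (the destination path is an assumption; adjust it to your project):

    rsync -av /90daydata/shared/$USER/intro_hpc/ /LTS/project/<your_project>/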
    

2. Getting to know the command-line:

What is the command line?

  • Text-based interface for interacting with the computer
  • The “interpreter” is the shell; the most common is Bash
  • No clicking, only typing
  • Enables scripting and automation
  • Gives you more control

You can ask the shell to print a string

echo "How are you?"
# the quotes above aren't strictly needed, but quoting is safer

Date, time, and calendar:

date
#Wed Apr 16 19:52:14 CDT 2025

cal
#     April 2025     
#Su Mo Tu We Th Fr Sa
#       1  2  3  4  5
# 6  7  8  9 10 11 12
#13 14 15 16 17 18 19
#20 21 22 23 24 25 26
#27 28 29 30    

On the HPC, you are one among many users. You can ask the interpreter:

whoami
# $USER (your firstname.lastname)

who
# All logged-in users

Demo: navigate to a directory and view files. Some basic navigation:

pwd
# prints the working directory
# /90daydata/shared/$USER/intro_bioinformatics/day2 (this is the full path)

# We can also change directories
# go one directory "above" day2
cd ..
# come back to day2 using the full path or a relative path
cd /90daydata/shared/$USER/intro_bioinformatics/day2 # FULL path
cd day2 # relative path (run from the parent directory)

# go home 

cd ~ # tilde is the shortcut to your home directory
cd /home/$USER # the full path
  • Create a test file and write something to it

    touch test.txt
    echo "hello, this is $USER" >>test.txt
    
  • copy a file

    cp test.txt test1.txt
    
  • Make a new directory and copy files to it

    mkdir folder1
    cp test* folder1
    
  • Copy one directory into another directory. Note: the -r flag stands for recursive; the command will fail without it.

    cp -r folder1 folder2
    
  • We can also move a file from one location to another, or rename it; the mv command does both.

    mv test1.txt test1a.txt  # Rename test1.txt to test1a.txt
    mv test1a.txt folder2  # Move test1a.txt into folder2
    
  • Danger Zone: we can delete/remove a file or folder using rm. There is no recycle bin or trash can from which you can restore the file.

    rm test.txt # remove a single file (test1.txt was renamed and moved above)
    rm -r folder2 # remove a directory and its contents with the recursive (-r) flag
    rmdir folder1 # alternative: rmdir only removes EMPTY directories, so it fails here
    
  • Listing the files

    ls  # List files and directories in the current directory 
    ls -l # Long format listing with more details 
    ls -a # Show hidden files too 
    ls -lh # Human-readable file sizes 
    ls -t # list files sorted by time; most recent first
    ls -tr # sorted by time; most recent last
    
  • View file contents

    cat Arabidopsis.gtf     # Display entire file 
    less Arabidopsis.gtf    # View file page by page 
    head -5 Arabidopsis.gtf  # View first 5 lines 
    tail -5 Arabidopsis.gtf  # View last 5 lines 
    

3. GTF Files in Bioinformatics

A GTF (Gene Transfer Format) file is a tab-delimited text file. It describes genes and other features: exons, CDS, etc. GTF files are essential for describing gene and transcript structures in genome annotation. Below are the most common uses:

  • Gene Annotation: GTF files define genes, transcripts, exons, CDS, start/stop codons, and their genomic positions and relationships.
  • Read Quantification: Tools like featureCounts, HTSeq, and StringTie use GTF files to assign reads to gene features such as exons.
  • Transcript Assembly: Tools like Cufflinks and StringTie use GTFs to guide transcript reconstruction or compare with known annotations.
  • Differential Expression Analysis: Count matrices derived using GTF annotations feed into tools like DESeq2 or edgeR.
  • Gene Model Visualization: Genome browsers like IGV and UCSC display gene models using GTF annotations.
  • Genome Annotation Conversion: GTF files can be converted to/from other formats like GFF3 or BED.
  • Custom Feature Extraction: GTFs are parsed to extract specific features (e.g., all exons, all CDSs, or specific genes of interest).
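
For reference, each GTF line has nine tab-separated fields. A sketch of a single exon record (values are illustrative, not from the workshop file):

    # seqname  source     feature  start  end   score  strand  frame  attributes
    1          araport11  exon     3631   3913  .      +       .      gene_id "AT1G01010"; transcript_id "AT1G01010.1";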

  • Count the number of lines or characters in the file:

    wc Arabidopsis.gtf
    wc -l Arabidopsis.gtf  
    # It's a lot: 888100  24503892 272940122 Arabidopsis.gtf  
    # Let's make a smaller file
    head Arabidopsis.gtf > Arabidopsis_head.gtf  
    # Now let's count again
    wc Arabidopsis_head.gtf
    # 10  142 1607 Arabidopsis_head.gtf  
    # Count only the number of lines
    wc -l Arabidopsis_head.gtf
    # 10 Arabidopsis_head.gtf
    
  • Count unique features

    grep -v '^#' Arabidopsis_head.gtf | cut -f3 | sort | uniq -c
        1 CDS
        2 exon
        1 gene
        1 transcript
    
  • Extract only genes

    awk '$3 == "gene"' Arabidopsis_head.gtf
    
  • Extract all exons and the corresponding coordinates

    awk '$3 == "exon" {print $1, $4, $5}' Arabidopsis_head.gtf 
    
  • View and parse 9th column (the attributes)

    cut -f9 Arabidopsis_head.gtf
    awk '{match($0, /gene_name "([^"]+)"/, arr); if (arr[1] != "") print arr[1]}' Arabidopsis_head.gtf | sort | uniq | head
    
    • awk ...: An awk command (the three-argument match() is a GNU awk feature)
    • /gene_name "([^"]+)"/: the RegEx pattern
    • gene_name ": match this pattern literally
    • ([^"]+): capture what is within the quotes: one or more characters that are not a double quote
    • ": match the closing quote
    • arr: our name for the array that holds the captured text from each line
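
    If GNU awk is unavailable, a sketch of an equivalent extraction with grep and sed:

    grep -o 'gene_name "[^"]*"' Arabidopsis_head.gtf | sed 's/gene_name "//; s/"$//' | sort -u | head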
  • Make separate feature files

    awk '$3=="CDS"' Arabidopsis_head.gtf > Arabidopsis_cds.gtf
    awk '$3=="exon"' Arabidopsis_head.gtf > Arbaidopsis_exons.gtf
    
  • Modify selected words/string:

    sed 's/gene_id/GENE/g' Arabidopsis_head.gtf 
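    # sed prints to standard output; redirect (> new.gtf) or use sed -i to edit in place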
    

EXERCISE:

  1. Download the Saccharomyces cerevisiae GTF file
  2. Extract the file from the gzipped archive
  3. Count the total number of lines
  4. Count the total number of Gene feature lines
  5. Count the numbers of each unique feature type
  6. Extract the gene ids and save as a separate file
  7. Make a new GTF file with only CDS features
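
A possible solution sketch (the download URL follows the Ensembl FTP layout but is an assumption; substitute the link given in the workshop):

    # 1. Download the GTF (URL is an assumption)
    wget https://ftp.ensembl.org/pub/release-110/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.110.gtf.gz
    # 2. Extract from the gzipped archive
    gunzip Saccharomyces_cerevisiae.R64-1-1.110.gtf.gz
    # 3. Total number of lines
    wc -l Saccharomyces_cerevisiae.R64-1-1.110.gtf
    # 4. Number of gene feature lines
    awk '$3 == "gene"' Saccharomyces_cerevisiae.R64-1-1.110.gtf | wc -l
    # 5. Counts of each unique feature type
    grep -v '^#' Saccharomyces_cerevisiae.R64-1-1.110.gtf | cut -f3 | sort | uniq -c
    # 6. Extract the gene ids and save as a separate file
    grep -o 'gene_id "[^"]*"' Saccharomyces_cerevisiae.R64-1-1.110.gtf | sed 's/gene_id "//; s/"$//' | sort -u > gene_ids.txt
    # 7. A new GTF with only CDS features
    awk '$3 == "CDS"' Saccharomyces_cerevisiae.R64-1-1.110.gtf > yeast_cds.gtf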

SLURM Command Demo: Essential HPC Job Management

SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler widely used in high-performance computing (HPC) environments. It manages job allocation, job queuing, and resource scheduling on compute clusters. SLURM is responsible for distributing computational tasks across available nodes and optimizing resource usage, ensuring that jobs run efficiently according to the specified requirements (e.g., CPU cores, memory, and time). Users submit job scripts with resource requests, and SLURM handles the execution, monitoring, and completion of these jobs. It also provides commands like sbatch for submitting jobs, squeue for checking job status, and scontrol for querying job details. Here are some commonly used SLURM commands on HPC systems.

  1. Check available partitions (queues)

    sinfo
    sinfo -p short
    sinfo -p ceres
    
  2. View my running Jobs

    squeue --me
    squeue -u $USER
    
  3. Cancel a Job:

    scancel <jobid>
    
  4. Show Job Details:

    scontrol show job <jobid>
    
  5. Show node details:

    scontrol show node <nodename>
    
  6. View Job History:

    sacct -u $USER
    
  7. Specific Job History:

    sacct -j <jobid>
    
  8. Estimate Job Start time

    squeue --start -u $USER
    
  9. View the HPC's SLURM configuration:

    scontrol show config | less
    
  10. Determine Job efficiency (accurate only after job ends)

    seff <jobid>
    
  11. Interactive Job (we are already on an interactive node). Let's start another for one minute:

    salloc -N1 -n2 -p ceres -t 00:01:00
    
  12. Batch Script and Batch Job. Use your favorite editor (nano, VS Code, or Notepad) to write the batch script:

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -n 8
    #SBATCH -J blast
    #SBATCH -o log/blast.o%j
    #SBATCH -e log/blast.e%j
    cd $SLURM_SUBMIT_DIR
    scontrol show job ${SLURM_JOB_ID}
    export NT=/reference/data/NCBI/blast/2025-01-16/
    module load blast+/2.15.0
    blastn -query gene.fna -task blastn -db $NT/nt -num_threads 8 -out gene-nt.bl.out.xml
    

    Save the file as blast.sh. Create the log directory first (mkdir -p log; SLURM will not create it for the output files), then run the batch script as sbatch blast.sh

    Explanation of the batch script

    • Line 1: Use the bash shell
    • Line 2: Use one Node
    • Line 3: Use 8 tasks in this case same as processors or threads
    • Line 4: Job name: blast
    • Line 5: Write std output to a folder called log, in a file named blast.o<jobid> (%j expands to the job ID)
    • Line 6: Write std error to a folder called log, in a file named blast.e<jobid>
    • Line 7: Change to the dir from which we are submitting the slurm script
    • Line 8: Print job details, this will be appended to the std output
    • Line 9: Store the path to the BLAST databases in a variable called NT
    • Line 10: Load the BLAST module/software
    • Line 11: The blastn command

Additional resource: Ceres Job Script Generator

EXERCISE Submit the blastn job and monitor the run:

  1. Use scontrol
  2. You could ssh to the compute node on which the job is running
  3. While on the compute node, use top to monitor the run
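
A sketch of one way to monitor the run (take the job id and node name from the squeue output):

    squeue --me                # note the job id and the node the job is running on
    scontrol show job <jobid>  # detailed job state
    ssh <nodename>             # hop onto the compute node (allowed while your job runs there)
    top -u $USER               # watch CPU and memory usage; press q to quit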

4. Modules and Environments

On Ceres, we have several software packages available as modules, as isolated environments, and as Apptainer (formerly Singularity) images.

Modules: Pre-configured software packages that can be loaded when needed.

Conda Environments: Isolated “workspaces” for managing software dependencies specific to a project.

Apptainer: Portable containers that bundle software and its environment for consistent execution across different systems.

Modules:

Load and run BLAST and check its version. Then check whether there are modules named hisat2 and samtools.

module load blast+
blastn -version

#blastn: 2.15.0+
# Package: blast 2.15.0, build Oct 19 2023 13:35:57
module spider hisat
------------------------------------------------------------------------------------
  hisat2:
------------------------------------------------------------------------------------
     Versions:
        hisat2/2.2.1-py313-6sta5wz
        hisat2/2.2.1

------------------------------------------------------------------------------------
  For detailed information about a specific "hisat2" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider hisat2/2.2.1
module spider sam
------------------------------------------------------------------------------------
  samtools:
------------------------------------------------------------------------------------
     Versions:
        samtools/1.16.1
        samtools/1.17
        samtools/1.19.2-py311-py313-btw6g6q

------------------------------------------------------------------------------------
  For detailed information about a specific "samtools" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider samtools/1.19.2-py311-py313-btw6g6q
------------------------------------------------------------------------------------

Conda Environments

On Ceres, in order to run conda, we have to load a module called miniconda.

  • Check your conda environment list

    module load miniconda
    conda env list
    
  • Check which environment is active currently

    echo $CONDA_DEFAULT_ENV
    conda list 
    
  • Check details about the environment:

    conda info
    
  • Activate a specific environment:

    source activate <name of environment>
    
  • Deactivate the environment:

    conda deactivate
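
The commands above inspect and activate existing environments. A minimal sketch of creating one (the environment name and packages are illustrative):

    module load miniconda
    conda create -n rnaseq_env -c conda-forge -c bioconda samtools hisat2
    source activate rnaseq_env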
    

Apptainer
Apptainer (formerly Singularity) is a container system for HPCs. It lets us package the entire software stack: dependencies, operating system, scripts, etc. Like miniconda, it is also available as a module on Ceres.

module load apptainer
apptainer -h
apptainer version
  • Build a blast software image

    apptainer build blast_quay.sif docker://quay.io/biocontainers/blast:2.12.0--pl5262h3289130_0
    apptainer exec blast_quay.sif blastn -version
    #blastn: 2.12.0+
    # Package: blast 2.12.0, build Jul 13 2021 09:03:00
    

Project Management in Bioinformatics

Project management through proper directory organization keeps a project clear and intuitive for beginners as well as experts. Proper documentation saves a lot of time and reduces the potential for errors down the line.

Project organization is important for:

  • Reproducibility
  • Collaboration
  • Scalability
  • Debugging and Maintenance
  • Transparency and Publications

Mock Project Directory structure

.
├── 00_Raw_Data
├── 01_QC
├── 02_Trimming
├── 03_Trimmed_Data
├── 04_Alignment
├── 05_Counts
├── 06_DE_Analysis
├── 07_Plots
├── 08_Manuscript
├── 09_Processed_Data
├── README.md
├── config
├── envs
└── scripts
  • 00_Raw_Data: Raw FASTQs from the sequencing core
  • 01_QC → 06_DE_Analysis: analysis steps and results
  • 07_Plots: Figures for interpretation and further investigation if needed
  • 08_Manuscript: Final figures, text, references
  • 09_Processed_Data: Final processed data for publication
  • config/: parameters for scripts, metadata etc.
  • scripts/: reusable bash/R/python scripts
  • envs/: Conda environment YAMLs or Apptainer images
  • README.md: General overview of the project, inputs, steps etc.
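
A minimal sketch to scaffold this structure in a new project directory:

    mkdir -p 00_Raw_Data 01_QC 02_Trimming 03_Trimmed_Data 04_Alignment \
             05_Counts 06_DE_Analysis 07_Plots 08_Manuscript 09_Processed_Data \
             config envs scripts
    touch README.md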

Workshop Materials

Workshop materials and recordings are available for USDA employees at the links below.


    • Thursday, April 17, 2025, 1 – 5 PM ET