Introduction to Bioinformatics

Provided by: ISU
Registration: Register Here

This workshop provides an overview of bioinformatics covering applications, sequencing technologies, and basic workflows. Participants will also explore various file formats in bioinformatics and gain hands-on experience using NCBI’s BLAST web tool to perform sequence alignments and interpret results.

Pre-Workshop Instructions

To help minimize technical issues and delays at the start of the workshop, please try the following prior to the workshop.

Logging on to Ceres Open OnDemand (OOD): Please confirm you can successfully log in to Ceres OOD with your SCINet account (see instructions here). If you are successful, you will be able to see the Ceres OOD home page.
Ceres Shell Access: When on Ceres OOD, click on the top navigation bar: “Clusters” > “Ceres Shell Access”. A new tab will appear that looks like a shell terminal (e.g., like PowerShell). Please confirm you do not receive any error messages or requests to re-authenticate and that the final line looks like “[firstname.lastname@ceres ~]$”
RStudio Server: Back on the main Ceres OOD tab, click on the top or side navigation bar: “Interactive Apps” > “RStudio Server”.
- Fill the input fields with the following (input fields not listed below can be left at their default values):
  - Account: scinet_workshop2
  - Queue: ceres
  - QOS: 400thread
  - R Version: 4.4.1
  - Number of hours: 1
  - Number of cores: 1
  - Memory required: 8GB
  - Optional Slurm Arguments: (leave empty)
- Click the “Launch” button.
- Wait a moment for the job card to update from “Queued” to “Running”.
- Please confirm that clicking on the “Connect to RStudio Server” button opens a new tab with the RStudio Server interface.

Tutorial Setup Instructions

Steps to prepare for the tutorial session:

Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
Request resources on a compute node by running the following command.
```
srun --reservation=wk1_workshop -A scinet_workshop2 -t 05:00:00 -n 1 --mem 8G --pty bash 
```
- For those accessing post-workshop: Remove --reservation=wk1_workshop and change -A scinet_workshop2 to your own project account. For additional information, see our SLURM guide.

Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.

mkdir -p /90daydata/shared/$USER/intro_bioinformatics 
cd /90daydata/shared/$USER/intro_bioinformatics 
cp -r /project/scinet_workshop2/Bioinformatics_series/wk1_workshop/day1/ . 

Load the environment:

module load miniconda 
source activate /project/scinet_workshop2/Bioinformatics_series/conda/ 

Tutorial Introduction

In this tutorial we will use the command line and other bioinformatics tools to explore the different types of data you will encounter in bioinformatics. The exercises and examples provided will guide you through understanding file structures, examining sequences and their associated quality scores and parsing/filtering variant data. We will also explore public databases and use BLAST for performing sequence alignment.

Throughout the tutorial we will use the following command-line tools:

Basic text processing: grep, cut, bioawk, wc
Quality control: fastqc
Filtering variants: bcftools
Downloading public data: wget
Raw sequences from NCBI SRA: SRA Toolkit

Exploring Bioinformatics File Formats

Common file formats in bioinformatics
Format	Description	Use case	File extensions
FASTA	text file with nucleotide or protein sequences	stores raw sequence data	`.fa, .fasta, .fna`
FASTQ	text file with sequences and quality scores	next-generation sequencing reads	`.fq, .fastq`
SAM/BAM	sequence alignment map (SAM) and binary alignment map (BAM)	stores aligned reads	`.sam, .bam`
VCF	variant call format	stores genomic variants (e.g. SNPs)	`.vcf`
GFF/GTF	9 column text format with genomic features	stores information about gene annotation	`.gff, .gtf`

Part 1: Exploring FASTA Files

FASTA files contain a single-line description and ID followed by one or more lines of sequence data.

Line 1: starts with “>” followed by ID
Line 2: Sequence data

Let’s take a look at an actual file:

Are you in the right working directory?

cd /90daydata/shared/$USER/intro_bioinformatics /day1

We can look at the beginning of the file by using head, which displays the first few lines of files.
```
head files/GCF_000001735.4_TAIR10.1_genomic.fna 
```

We can also view the first 4 lines using -n:

head -n 4 files/GCF_000001735.4_TAIR10.1_genomic.fna

We can also look at the last few lines of the file by using tail:
```
tail files/GCF_000001735.4_TAIR10.1_genomic.fna
```
For a scrollable preview of a file we can use less:
```
less files/ GCF_000001735.4_TAIR10.1_genomic.fna
```

You try:

How would you look at the last 20 lines of the file?

Counting the number of sequences with `grep`:

grep is a search tool that is commonly used to find lines that match a pattern.

grep -c "^>" files/GCF_000001735.4_TAIR10.1_genomic.fna

grep: used for matching lines
-c: count; prints number of matching lines
"^>": regular expression pattern that matches any line that starts with >

Determine sequence lengths with `bioawk`:

Quick Guide on `bioawk`

bioawk is an extension of awk, which is a tool used for parsing and processing text files. bioawk has specialized features to handle biological file formats.

`awk`	`bioawk`
general text processing	specialized for biological files
does not understand the format of biological files	aware of biological file formats
manually split fields that are identified only by number (e.g., `$1`, `$2`)	Automatically generates named fields from file contents (e.g., `$seq`, `$qual`)
requires extra code/logic to parse biological data	has built-in functions for parsing sequences, reads and variants

Command	Description	Fields
`bioawk -c fastx`	parses FASTA and FASTQ files	`$name`, `$seq`, `$qual`
`bioawk -c vcf`	parses genetic variants (VCF)	`$chrom`, `$pos`,`$ref`,`$qual`, etc.

Basic format of a bioawk command:

bioawk -c <file format> 'action' input_file

Get sequence names and lengths

bioawk -c fastx '{print $name, length($seq)}' files/GCF_000001735.4_TAIR10.1_genomic.fna | head

Note: The pipe | in the line of code above is taking the output of the command on the left and using it as the input for the command on the right.

Note: The {} is telling bioawk to perform the action provided on each line/record

You try:

How would you modify the command above to print sequences with lengths longer than 400,000 bp?

Discussion Questions:

Why is it important to look at sequence lengths?
What could the sequence header information be used for in analysis downstream?

Part 2: Exploring FastQ files

FastQ files are similar to fasta files, but also contain the quality score of the sequence data.

Line 1: starts with “@” followed by ID
Line 2: Sequence data
Line 3: Starts with “+”
Line 4: Quality score for each base in the sequence

Quality scores indicate the confidence of the base call by the sequencer.

Quality scores are encoded as American Standard Code for Information Interchange (ASCII) characters.
Different types of encodings are available and vary across sequencing technologies

Phred quality score (Q) calculation:

[
Q= -10 \times \log_{10}(P)
]

Where:

(Q) = Phred quality score
(P) = probability that base call is incorrect

ASCII Char	Phred Score	Error Probability
!	0	1.00
I	40	0.0001
@	31	0.00079

Phred Score	Error Probability	Interpretation
10	1 in 10	90% accurate
20	1 in 100	99% accurate
30	1 in 1000	99.9% accurate
40	1 in 10000	99.99% accurate

Let’s take a look at an actual file:

head files/SRR4420293_1.fastq

head -n 4 files/SRR4420293_1.fastq

We can also use bioawk:

As a reminder of the format of bioawk commands:

bioawk -c <file format> 'action' input_file

bioawk -c fastx '{print $name, length($seq), $qual}' files/SRR4420293_1.fastq | head -n 5

Note: The pipe | in the line of code above is taking the output of the command on the left and using it as the input for the command on the right.

You try:

Count reads in the FASTQ file using grep

Other ways to count the number of reads:

Count lines using wc:
```
wc -l files/SRR4420293_1.fastq
```
But this gives line counts. Is that what we want? What should we do here to get the number of reads?
1. First let’s get the number of lines without the file name:
```
wc -l < files/SRR4420293_1.fastq
```
  “<” here is feeding the contents of the file into wc without passing the filename as an argument. This is called input redirection.
2. In order for us to do integer math directly in bash we need: $((….)) and for us to see the result of the calculation in the shell we will use echo.
```
echo $(( $(wc -l < files/SRR4420293_1.fastq)/4)) 
```
3. We also need to put the wc function in $(), which will cause bash to execute the command inside the parentheses and then use the output when evaluating the rest of the command:
```
echo $(( $(wc -l < files/SRR4420293_1.fastq)/4)) 
```
  Let’s again break down the pieces of this solution:
  - $(wc -l < files/SRR4420293_1.fastq): to get line count
  - $((....)): Do integer math directly in Bash:
  - echo: prints the result

Using bioawk:

bioawk -c fastx 'END {print NR}' files/SRR4420293_1.fastq

- `NR` - a built in variable in `bioawk` that stands for the number of records/reads
- `END`: end of the file

What’s the average read length?

bioawk -c fastx '{sum += length($seq)} END {print sum/NR} files/SRR4420293_1.fastq

- `sum += length($seq)` keeps adding each read's length to a running total

Let’s look at the file quality using FastQC

FastQC is a tool used to check the quality of sequencing reads from FASTQ files. The report tells us about base quality, sequence length distributions, adapter content, GC content and overrepresented sequences.

module load fastqc
mkdir fastqc_output
fastqc -o fastqc_output files/SRR4420293_1.fastq

-o: output directory

Discussion Questions:

What trends do you observe in quality across reads?
Which reads should we consider trimming or filtering?
How could poor-quality reads impact downstream analysis?

Part 3: Exploring Variant Call Format (VCF) Files

These files store sequence variants, SNPs and indels.

Metadata lines: each line starts with ## followed by key=value pairs
single header line: starts with single # and describes columns in the data lines
data lines: sequence variation data

Layout and structure of VCF files:

View the header only
```
grep '^#' files/chinook.vcf | less
```
View the first few variant lines:
```
grep -v '^#' files/chinook.vcg | less
```
Note: -v tells grep to invert the match and only show lines that do *not match the pattern provided*

Inspecting columns in VCF files

Column No.	Name	Description
1	CHROM	Chromosome name or contig
2	POS	Position of variant
3	ID	Identifier
4	REF	Reference alle
5	ALT	Alternate allele(s)
6	QUAL	Quality score for the variant
7	FILTER	Pass/fail status
8	INFO	Additional annotation

The cut command can be used to extract specific columns/fields from a file.
```
cut -f1 files/chinook.vcf | head -n 5 
```

Let’s remove the header:

cut -f1 files/chinook.vcf | grep -v '^#' | head -n 5 

What chromosomes/contigs are in this file?
- Let us look at the first column, remove the header and sort the chromosomes/contigs:
```
cut -f1 files/chinook.vcf | grep -v '^#' | sort | less 
```
- Now, let’s get rid of the repeats using uniq:
```
cut -f1 files/chinook.vcf | grep -v '^#' | sort | uniq | less 
```

What’s in the INFO column?

cut -f8 files/chinook.vcf | grep -v '^#' | head -n 5

Let’s look at the kinds of mutations
- To do this we will extract column 5 (ALT) and remove the header:
```
cut -f5 files/chinook.vcf | grep -v '^#' | head 
```
- Next, we will sort the variants :
```
cut -f5 files/chinook.vcf | grep -v '^#' | sort | head 
```
- Let’s get rid of the repeats and count how many times each uniq variant appears:
```
cut -f5 files/chinook.vcf | grep -v '^#' | sort | uniq -c | head 
```
- Lastly, let’s sort the results with the most frequent variant at the top
```
cut -f5 files/chinook.vcf | grep -v '^#' | sort | uniq -c | sort -nr| head 
```
  Note: using -nr sorts the results numerically and in reverse order

If we wanted to do more complex, format-aware analyses we could do that with bcftools and bioawk.

Look at a VCF file using `bcftools` and `bioawk`:

Using bioawk:

bioawk -c vcf '{print $chrom, $pos, $ref, $alt, $qual}' files/chinook.vcf | head -n 5 

While bioawk is a versatile tool for processing files in various biological sequence data formats, bcftools was specifically built for working with VCF and BCF files. To further understand the VCF structure, metadata, and complex filtering, we will focus on using bcftools for the remainder of the tutorial.

Using bcftools:
bcftools is a command-line tool used for viewing, filtering and manipulating VCF files. We will use bcftools to help us summarize variant info and for extracting and selecting variants.

Options associated with bcftools
Option	Explanation
-r	the output is restricted to a specific region
i	include records that satisfy the given condition
-v	filter to show the variant type
-H	skips the VCF header lines and prints the variant lines only

command	function
view	Filter and subset VCFs
query	Extract custom fields from VCF
stats	Generate VCF summary statistics

module load bcftools
bcftools view files/chinook.vcf | head -n 20

We can use the bcftools query command to inspect columns similar to how we used cut, but instead of calling column numbers we can use the column name:

bcftools query -f '%INFO\n' files/chinook.vcf | head 

bcftools query -f '%CHROM\n' files/chinook.vcf | head 

bcftools query -f '%QUAL\n' files/chinook.vcf | head 

Count the total number of variants:
```
bcftools view -H files/chinook.vcf| wc -l
```
Note: -H skips the header lines

Filtering variants

By quality:

bcftools view -i 'QUAL>1000' files/chinook.vcf > high_qual_bcf.vcf 

head high_qual_bcf.vcf

If we want to remove the headers, what can we do here?

By variant type:

bcftools view -v snps -H  files/chinook.vcf | less

bcftools view -v snps files/chinook.vcf -0z -o snps_only_bcf.vcf

By the number of reads supporting the variant (Depth):

bcftools view -i 'INFO/DP>10' -H files/chinook.vcf | less 

Can we combine filters? For example, what if we wanted to filter for QS >=1000 and depth >=30:
```
bcftools view -i 'QUAL > = 1000 && INFO/DP>30' -H files/chinook.vcf | less 
```

VCF file summary with bcftools
bcftools can generate summary statistics file on your VCF files:

bcftools stats files/chinook.vcf > stats_vcf.txt

less stats_vcf.txt

To plot:

plot-vcfstats -p stats_output stats_vcf.txt

Discussion Questions:

Why is filtering by quality important?
Why might SNPs be prioritized over other variant types?
How would you decide which variants are biologically relevant?
How might filters change depending on if you were studying disease vs breeding?

Exploring public repositories/databases

We will now explore a few public repositories. The most commonly used repositories/public databases are hosted by NCBI National Center for Biotechnology Information.

Let’s explore the website:

Multiple Databases
Landing Page:
- Submitting sequences
- Downloading sequences
- Tutorials
- Developing APIs/Code libraries
- Various tools
- Explore research
- Popular Resources

Use cases for a scientist

Literature search
Exploring a gene sequence
Exploring genomes
Downloading data

Another popular public database is the European Nucleotide Archive. This archive primarily started as a repository for storing raw sequence data, metadata etc. In this respect it is similar to the databases hosted by NCBI but they have different toolsets.

Core Role & Data Sharing: NCBI vs ENA
Feature	NCBI (USA)	ENA (Europe)
Main Role	Central US resource for genomic data storage and retrieval	Central European resource for nucleotide data archiving
Organization	National Center for Biotechnology Information (NIH)	European Nucleotide Archive (EMBL-EBI)
Key Databases	SRA, GenBank, RefSeq, GEO, dbSNP	ENA (includes raw reads, assemblies, annotations)
Data Submission	Accepts direct submissions via SRA Submission Portal	Submissions via Webin Portal
Data Access	Via web, `sra-tools`, and E-utilities API	Via web, RESTful APIs, FTP, and direct links
Data Sharing	Collaborates with INSDC (ENA & DDBJ) for daily sync	Also part of INSDC — fully synchronized with NCBI and DDBJ
Data Format	`.sra` format (needs `fastq-dump` to convert)	Direct FASTQ/FASTA/TSV downloads (simpler access)
Typical Use Case	US-based research and NIH-funded data	EU-based research or quick bulk access

International Collaboration

Both NCBI and ENA are members of the INSDC (International Nucleotide Sequence Database Collaboration), along with DDBJ (Japan).

They synchronize data daily, so any data submitted to one appears in the others.

User Interfaces & Portals: NCBI vs ENA
Feature	NCBI (National Center for Biotechnology Information)	ENA (European Nucleotide Archive)
Primary Portal	https://www.ncbi.nlm.nih.gov/	https://www.ebi.ac.uk/ena
Search Interface	Entrez (integrated search for all databases)	ENA Browser (data-centric search)
BLAST Access	NCBI BLAST web interface	EMBL-EBI BLAST service
FTP Access	ftp.ncbi.nlm.nih.gov	ftp.ebi.ac.uk/pub/databases/ena
APIs & Programmatic Access	E-utilities (Entrez API), Datasets, SRA-tools	ENA Portal API, ENA REST API
Submission Portals	BioProject, SRA, GenBank, GEO, dbGaP submission tools	Webin (single portal for all data types)
Visualization Tools	Genome Data Viewer, GEO Profiles	Interactive web viewers for entries
Data Download Options	Datasets command-line tool, direct FTP/HTTP links	Direct FTP, programmatic downloads
Help & Documentation	Extensive help docs, NCBI Handbook, video tutorials	ENA Docs, FAQs, webinars

Both portals are interoperable through the INSDC framework. Tools and formats often overlap in functionality.

Download data with SRA toolkit

Demonstrate some SRA sequence downloads from NCBI and compare with ENA

There are two ways to get the SRR list:

From the website:
- SRP433780
- Go to the link above and send the results to run selector
Command line step-by-step:

# Load the modules 
module load  edirect/23.6.20250307 
module load sratoolkit 

# get the complete SRA run information file and cull the SRR accesion list and save in a file 
esearch -db sra -query SRP433780 | efetch -format runinfo > SRP433780_runinfo.csv 
cut -d ',' -f1 SRP433780_runinfo.csv | grep SRR > SRR_accessions.txt 

# check SRR_accessions.txt and select one to download in two steps: 
prefetch SRR24255343 
fasterq-dump SRR24378108 --split-files --threads 4 

You try:

In The European Nucleotide archive: How do we download the fastq associated with this SRR accession?
Choose a database and search for a dataset of interest.
Download one FASTQ, FASTA or VCF file and use some of the commands discussed to examine the file.
Share something you learned about your dataset. For example: How many reads or variants are present in the file?

BLAST

The NCBI blast website is the most popular tool with the NCBI.

Tool	Query Type	Database Type	Description
blastn	Nucleotide	Nucleotide	Compares nucleotide query to nucleotide DB
blastp	Protein	Protein	Compares protein query to protein DB
blastx	Nucleotide (translated)	Protein	Translates nucleotide query, searches protein DB
tblastn	Protein	Nucleotide (translated)	Translates nucleotide DB, searches with protein query
tblastx	Nucleotide (translated)	Nucleotide (translated)	Translates both query and DB, compares proteins

NCBI Web-Based BLAST

Go to: “Nucleotide BLAST”

Paste a sample sequence:

>sample_seq 
AGTGTCTCCCGGTCGCGCGTGGAGGTCGGTCGCTCAGAGCTGCTGGGCGCAGTTTCTCCGCCTGCTGCTT 
CGGCGCGGCTGTATCGGCGAGCGAGCGAGTTCCCGCGAGTTCTCGGTGGCGCTCCCCCTTCCTTTCAGTC 
TCCACGGACTGGCCCCTCGTCCTTCTACTTGACCGCTCCCGTCTTCCGCCGCCTTCTGGCGCTTTCCGTT 
GGGCCGATTCCCGCCCGCTTCCTCCTGCTTCCCATCGAAGCTCTAGAAATGAATGTTTCCATCTCTTCAG 
AGATGAACCAGGTAATACGCGCTGGTTCTACGAACGACAGATGAGGGAGACGGCGCGGCTAGAATCCGAG 
AAGAAGGGATGGCGCCGGCGGATGGGAAGAGGGTGGGAGGCGCGGAAGCGGTGTCCTCATCAGGGGAGGC 
AGCCCCAAGCGGCCGCCGCCGCCCTCTGGGACGTGAAGCCCGCGCCGCGCTGGGCCCGCGCTCCAGCGCT 
GCCATGGTTGCCAAGTTGCGTTGGCGGCCGAGAGCGGGCGCCGGTCGCCTCGGAGAGCGCGGAGGCTGGA 
GCCCCTTTGCTACACTGGCGCGGGTGAGGCAGGCTGGGAGGAACAAGAGTTTTTTGTTCGAAGGGTTTTG 
GGGGGCCTGGGTTAGGGCGCCGCGGGCGGGGATGACCCGCCGGAAGGAGGGCGCGGGACTCCCCGTTCTT 
GCTGTCAGGAACGGACGCCTCGCCTCGGGTTTGCCTGGGGTTTGGGATTCTCTTCTGGAAGAGCTCTCGA 
GACTCGGCTCGTGTGGGCGGGCAGCCAGCCCCGGGCCTGGAGTAGGGTGGCACGGAGTCCCCGATCGCCG 
TGGGCCGCGGGGTCCTTTGTTCCCGCTCCACGTTGCCCGCTTTTCTTGCCAAGCGCGGGGAGAAGGGGGC 
GGGAGGAGGGAGGGAGCGGCTGCCCTGACGTGTCGGTACTGAGTGACTGCGGGGCTGGCCAATCCGGGGC 
GGGGGTGTGCGGGTGCTGGGGGCCTCGCCTCGCAGCCTGCGGAGTGGGCGCCGGCCTGGCTGCGCGAGGA 
GGAAGGCCTGGGACGCTCTTTCTTTTTTATGAAAGAAATCAGTGGCAAGATTTGCTCTTTTTCCGTCCCT 
CCACGCTTTTGGTTAAGTGTCTCTGATTATAAGCTCTTGATGATAGGAAGGTATCAGGCTGAGGGTTGAA 
CCTAGGGTAACTTGAACCACTACTTGAGAACTACATTTACTTTTTCCCCCAATAGGTAGTGGACATATCT 
ATTGGTTTGAGACAGCTGACATTTCAAGGAGAAATCAGATGTCCAAAAGGGCGCATTTTTGTGATGGAGC 
GTGCAGTGATGGAGCGTTAGAACACCTTGGCATCAAGCTGTTCGTTGAGTGTGTCGTGGTCTTCTGTCTA 
CTAATAGATGACAACTCTGGAAGCCTAGTACCACTCTTAACAGATGAATAAGTACAGCATGGACTAGACT 
GCCACAGGCCATGCTTTCTTTTAATAATTCTAGCACCAGTGATTGATTTAGGAAAAACAAAATACTGATG 
ATTACTTTTAGGCTAAAGCCTGCTGACACTTCTGTCTTAAAGATACTGAAAGAGTAGTTGTATTTGTTAA 
GTCTGGACGTGAGGTAAACATGGACTTTAGAATTGAATTGAGACTAGTATCATTTAACGCCAGCTGGCAG 
GCTGCTTATCAAGCTTATAATTTGGTAGCCAGAGGTAAACGGAACTTGAGCCACTGACCAAAGAGAGCCC 
TGGAATGGTTTTCCTCTGTGTTTTCCTAATAGATGATATCTCTGGTTAACTAATAGATACTAATTTCTAT 
AGCCATTGTCTTAATTTGTAGTTGATTTTCTAACTTTCCCCCCAAGACAAAACATTTCAGGTTTTAGTCT 
TAGTTTTAAATTAGTGTCTTTTTGGCTACTTGCTTTTGGAAGGTGGGATTTTTTCTTGCTTGAAGGATTT 
GTTAGATGGTAATTAAATGTAGTTTTGCAAATACGTTTTTAAATATAAATGTTTTCCTGTTAAGGAAGTA 
TCTTAATTGATATTAAGATGAAGTAACACTAAGTAAGTCATTTCATCCACTTTTTAGCAGTGCGATTGTA 

Select Database: core_nt or Nucleotide collection (or try each one separately)
Organism (Optional): Homo sapiens (if necessary)
Select program/algorithm and Click: “BLAST”
Output:
- Descriptions: Score, E-value, identities
- Graphical alignment
- Pairwise-matches
- Taxonomy etc.

Command-line BLAST

Requisites:

module load blast+/2.15.0 
# check help 
blastn -help 

# We already have the NT database available on Ceres 
export NT=/reference/data/NCBI/blast/2025-01-16/ 

blastn -query gene.fna -task blastn -db $NT/nt -num_threads 36 -out gene-nt1.blast.xml 

If the database weren’t already available, we would have to make the blast database:

makeblastdb -in db.fa -dbtype nucl -out db

Then run blastn

blastn -query query.fa -db db -out results.txt -outfmt 6

Sources

Bioinformatics Workbook [Online] Available at: https://bioinformaticsworkbook.org/#gsc.tab=0 (Accessed April 5, 2025)
Anderson, E.C (2024). Practical Computing and Bioinformatics for Conservation and Evolutionary Genomics. Accessed April 11, 2025

Workshop Materials

Workshop materials and recordings available for USDA employees at the links below

Workshop recording

- Tuesday, April 15, 2025, 1 – 4 PM ET
  - Instructor: Lavida Rogers (SCINet Office, USDA-ARS), Siva Chudalayandi, Rick Masonbrink, and Viswanathan Satheesh (Genome Informatics Facility, ISU)
  - Prerequisites:
    - Familiarity with basic command-line concepts.
  - Workshop recording

Introduction to Bioinformatics

Pre-Workshop Instructions

Tutorial Setup Instructions

Tutorial Introduction

Exploring Bioinformatics File Formats

Part 1: Exploring FASTA Files

You try:

Counting the number of sequences with grep:

Determine sequence lengths with bioawk:

Quick Guide on bioawk

You try:

Discussion Questions:

Part 2: Exploring FastQ files

Phred quality score (Q) calculation:

You try:

Other ways to count the number of reads:

Let’s look at the file quality using FastQC

Discussion Questions:

Part 3: Exploring Variant Call Format (VCF) Files

Layout and structure of VCF files:

Look at a VCF file using bcftools and bioawk:

Discussion Questions:

Exploring public repositories/databases

Download data with SRA toolkit

You try:

BLAST

NCBI Web-Based BLAST

Command-line BLAST

Sources

Workshop Materials

Counting the number of sequences with `grep`:

Determine sequence lengths with `bioawk`:

Quick Guide on `bioawk`

Look at a VCF file using `bcftools` and `bioawk`: