26 changes: 26 additions & 0 deletions content/authors/det/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
---
# Display name
title: Deb Triant

# Username (this should match the folder name)
authors:
- det

# Is this the primary user of the site?
superuser: false

# Role/position
role: Research Computing Scientist

# Organizations/Affiliations
organizations:
- name: University of Virginia Research Computing
  url: "https://www.rc.virginia.edu"


interests:
- Bioinformatics
- HPC
- Research

---
17 changes: 17 additions & 0 deletions content/notes/bioinfo-intro/02-intro.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
---
title: Bioinformatics
date: 2025-08-23-03:19:53Z
type: docs
weight: 150
menu:
  bioinfo-intro:
---

The term _Bioinformatics_ first appeared in the 1970s, and the field expanded rapidly in the 1990s with the Human Genome Project and the rise of high-throughput sequencing technologies. Its roots go back further: computational tools for analyzing molecular data were developed in the 1960s, with methodological precedents in wartime cryptanalytic work from the 1940s.

> [Read More](https://www.nature.com/articles/35042090)

{{< figure src=/notes/bioinfo-intro/img/bioinformatics-ss.png caption="Bioinformatics sits at the intersection of biology, computer science, mathematics/statistics, engineering, and biochemistry." width=70% height=70% >}}



59 changes: 59 additions & 0 deletions content/notes/bioinfo-intro/03-analys-types.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
---
title: Types of Bioinformatics Analyses
date: 2025-08-23-03:19:53Z
type: docs
weight: 200
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

The following are common categories of analyses performed in modern genomics and systems biology.

**1. Proteomics**

Proteomics is the large-scale study of proteins.

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_6.png caption="Protein structure ribbon diagram" width=30% height=30% >}}

**2. Metabolomics**

Metabolomics focuses on the complete set of small molecules within a biological sample.

**3. RNA-Seq**

RNA Sequencing is used to quantify RNA molecules and gene expression.

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_7.png caption="RNA-seq protocol similarity heatmap" width=45% height=45% >}}

**4. Single-cell Analysis**

Single-cell analysis explores gene expression at the individual cell level.

**5. Genome Assembly and Annotation**

Genome assembly and annotation reconstructs complete genomes from short or long sequencing reads and labels genes, regulatory regions, and functional elements.

**6. Regulatory Genomics**

Regulatory genomics explores how DNA and other factors control gene expression patterns.

**7. Variant Calling and Haplotype Analysis**

Variant calling and haplotype analysis identify positions where a sample differs from a reference sequence, such as single-base substitutions (single nucleotide variants, SNVs), which helps pinpoint mutations.

Example SNV:

C <span style="color:#3469c0">A</span> GCTTA <span style="color:#3469c0">G</span>

<span style="color:#ff0000">T</span> GCTTA <span style="color:#ff0000">T</span>

<span style="color:#3469c0">A</span> GCTTA <span style="color:#3469c0">G</span>

A <span style="color:#3469c0">A</span> GCTTACG <span style="color:#3469c0">G</span>

><small>Blue = reference bases (A, G); red = alternate base (T).</small>
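
As a rough illustration of the idea, the sketch below compares two already-aligned sequences of equal length and reports every mismatch as a candidate SNV. The function name `find_snvs` and the two example sequences are hypothetical. Real variant callers also weigh read depth, base qualities, and mapping qualities, none of which is modeled here.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, alternate base) per mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Hypothetical aligned sequences, chosen only for illustration.
reference = "AGCTTAG"
sample = "TGCTTAT"
print(find_snvs(reference, sample))  # [(1, 'A', 'T'), (7, 'G', 'T')]
```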

[Read More: RNA-Seq Methods](https://www.nature.com/articles/s41592-024-02298-3)


62 changes: 62 additions & 0 deletions content/notes/bioinfo-intro/04-databases.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
---
title: Databases
date: 2025-08-23-03:19:53Z
type: docs
weight: 250
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

**InterPro**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_8.png width=45% height=45% >}}


[https://www.ebi.ac.uk/interpro/entry/pfam](https://www.ebi.ac.uk/interpro/entry/pfam)

---

**National Library of Medicine**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_9.png width=45% height=45% >}}


[https://www.ncbi.nlm.nih.gov](https://www.ncbi.nlm.nih.gov)

---

**Ensembl**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_11.png width=45% height=45% >}}


[https://www.ensembl.org/index.html](https://www.ensembl.org/index.html)

---

**FAANG**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_10.png width=45% height=45% >}}


[https://data.faang.org/home](https://data.faang.org/home)

---

**EMBL-EBI**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_13.png width=65% height=65% >}}


[https://www.ebi.ac.uk](https://www.ebi.ac.uk)

---

**RGD**

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_12.png width=75% height=75% >}}


[https://rgd.mcw.edu](https://rgd.mcw.edu)

21 changes: 21 additions & 0 deletions content/notes/bioinfo-intro/05-technologies.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
---
title: Sequencing Technologies
date: 2025-08-23-03:19:53Z
type: docs
weight: 300
menu:
  bioinfo-intro:
    parent: Bioinformatics
---


**Illumina**: Illumina instruments generate short reads (< 1 kb). They are widely used for whole-genome and exome sequencing, small RNA/microRNA profiling, and many single-cell applications.

**PacBio**: PacBio instruments generate long reads (~25 kb). A PacBio Revio sequencer is available at UVA.

**Nanopore**: Nanopore sequencing generates "ultra-long" reads (up to 1 Mb).

**Hi-C**: Hi-C is a crosslinking-based technique used to capture physical interactions between regions of a genome.



19 changes: 19 additions & 0 deletions content/notes/bioinfo-intro/06-pacbio.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
---
title: PacBio HiFi Reads
date: 2025-08-23-03:19:53Z
type: docs
weight: 350
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

The figure below compares how reads from different sequencing technologies map to the STRC gene.

{{< figure src=/notes/bioinfo-intro/img/pacbio.png width=90% height=90% >}}

The comparison shows that PacBio HiFi reads are both long and accurate.

[Read More](https://www.pacb.com)


70 changes: 70 additions & 0 deletions content/notes/bioinfo-intro/07-fileformats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
---
title: File Formats
date: 2025-08-23-03:19:53Z
type: docs
weight: 400
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_16.png caption="Source: https://xkcd.com/927/" width=80% height=80% >}}

The format name usually matches the file suffix.

**FASTA** files (suffix: `.fasta`, `.fna`, `.fa`) store nucleotide or protein sequences.

**FASTQ** files (suffix: `.fastq`) store sequence reads together with per-base quality scores.

**SAM/BAM** files (suffix: `.sam`/`.bam`) were developed for next-generation sequencing (NGS) data and store alignment information. SAM stands for Sequence Alignment/Map; BAM is its compressed binary equivalent.

**VCF** (suffix: `.vcf`) stands for Variant Call Format. These files are used to store information about genetic variants. [Read More](https://samtools.github.io/hts-specs/VCFv4.2.pdf)

**GFF3** (suffix: `.gff3`) stands for Generic Feature Format (version 3). These files are used to store information about genomic features. [Read More](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)

**BED** (suffix: `.bed`) stands for Browser Extensible Data format. These files are used to store genomic regions. [Read More](https://github.com/arq5x/bedtools2)
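
Because the suffix usually tracks the format, a small lookup table is often enough to guess what a file contains. The sketch below is a minimal, hypothetical helper (`guess_format` and `SUFFIX_TO_FORMAT` are names introduced here). It covers only the suffixes listed above and assumes that a trailing `.gz` simply means the file is compressed.

```python
from pathlib import Path

# Suffixes taken from the list above; real files may use other suffixes.
SUFFIX_TO_FORMAT = {
    ".fasta": "FASTA",
    ".fna": "FASTA",
    ".fa": "FASTA",
    ".fastq": "FASTQ",
    ".sam": "SAM",
    ".bam": "BAM",
    ".vcf": "VCF",
    ".gff3": "GFF3",
    ".bed": "BED",
}


def guess_format(filename: str) -> str:
    """Guess a file format from its suffix, ignoring a trailing .gz."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        suffixes.pop()  # assumption: .gz only indicates compression
    if not suffixes:
        return "unknown"
    return SUFFIX_TO_FORMAT.get(suffixes[-1], "unknown")


print(guess_format("reads.fastq.gz"))  # FASTQ
print(guess_format("genome.fa"))       # FASTA
```

A suffix is only a hint; inspecting the first few lines of the file is a more reliable check.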

### FASTA Format

A FASTA file begins with a header line, indicated by the `>` symbol, that contains an identifier and an optional description. The following lines contain the biological sequence itself.


<span style="color:#ff0000"> __>__ </span> NP_000552.2 Human glutathione transferase M1 (GSTM1)

```plaintext
MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLPYLIDGAHKITQSNAILCYIARKHNLCGETEEEKIRVDILENQTMDNHMQLGMICYNPEFEKLKPKYLEELPEKLKLYSEFLGKRPWFAGNKITFVDFLVYDVLDLHRIFEPKCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFSKMAVWGNK
```
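
The header-then-sequence layout can be read with a few lines of Python. The following is a minimal sketch rather than a replacement for an established parser: `parse_fasta` and `_build_record` are hypothetical names, and the example uses only the first residues of the GSTM1 sequence, split over two lines to show that a sequence may wrap.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: list[str] = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _build_record(header, chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield _build_record(header, chunks)


def _build_record(header: str, chunks: list[str]) -> tuple[str, str, str]:
    # The identifier is everything up to the first space in the header.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


example = [
    ">NP_000552.2 Human glutathione transferase M1 (GSTM1)",
    "MPMILGYWDIRGL",  # first residues of the sequence, wrapped
    "AHAIRLLLEY",     # over two lines for illustration
]
for identifier, description, sequence in parse_fasta(example):
    print(identifier, len(sequence))  # NP_000552.2 23
```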


### FASTQ Format

FASTQ files are helpful for base calling, quality control, and trimming.

Most sequencing tools return data in FASTQ format, with quality scores encoded as ASCII characters.

Each FASTQ record contains four lines:
1. ID, beginning with `@`
2. Sequence
3. Separator line beginning with `+` (optionally followed by the ID again)
4. Base qualities encoded as ASCII characters, one per base

```plaintext
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATA
+
!''*((((***+))%%%++)(%%%%).1***-
```
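
The four-line record structure maps directly onto a simple reader. The sketch below is a hypothetical helper (`parse_fastq` and `FastqRecord` are names introduced here) that assumes unwrapped records, meaning exactly four lines per read, and checks that each sequence has one quality character per base.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four-line FASTQ record."""
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for start in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-",
]
for record in parse_fastq(example):
    print(record.read_id, len(record.sequence))  # SEQ_ID 32
```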

**FASTQ File Example: Multiple Reads**

```plaintext
@M00747:32:000000000-A16RG:1:1112:15153:29246 1:N:0:1
TCGATCGAGTAACTCGCTGCTGTCAGACTGGTTTTTGGTCGATCGACTATTGTTTCAGTCGCAAGAATATTGTGTCCAGTCGATCGACTGAATTCTGCTGTACGGCCACGGCGGATGCACGGTACAGCAGGCTCAGACGGATTAAACTGTT
+
5=9=9<=9,-5@<<55>,6+8AC>EE.88AE9CDD7>+7.CC9CD+++5@=-FCCA@EF@+**+*--55--AA---AA-5A<9C+3+<9)4++=E=+===<D94)00=9)))2@624(/(/2/-(.(6;9(((((.(.'((6-66<6(///
@M00747:32:000000000-A16RG:1:1112:15536:29246 1:N:0:1
GTAAAATTGAGGTAAATTGTGCGGAATTTAGCAATACCGTTTTTTTTATTATCACCGGATATCTATTCTGCTGTACGGCCAAGGAGGATGTACGGTACAGCAGGTGCGAACTCACTCCGACGCTCAAGTCAGTGACTTAATGATAAGCGTG
+
?????<BBBBBB5<?BFFFFFFECHEFFECCFF?9AAC>7@FHHHHHHFG?EAFGF@EEDEHHDGHHCBDFFGDFHF)<CCD@F,+3=CFBDFHBD++??DBDEEEDE:):CBEEEBCE68>?))5?**0?:AE*A*0//:/*:*:**.0)
@M00747:32:000000000-A16RG:1:1112:15513:29246 1:N:0:1
GCTAGTCTTGTGTTTAGTTTTATGTTTTGCATGTTGTAACGGATTCATAAACATAGGTGTTTGTTTCTTTTTATGGTTGTACAATTTGGCCCTAAGGCCCTACACTTACTTGTTTGTTTCTTTTATGGTACGACATTTGAGTGGTGGTTGA
+
```

27 changes: 27 additions & 0 deletions content/notes/bioinfo-intro/08-qualityscores.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
---
title: Quality Scores
date: 2025-08-23-03:19:53Z
type: docs
weight: 450
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

`Q` (quality) scores are logarithmically related to the base-calling error probability (`P`).

### Calculating Phred Quality Scores - Base calling accuracy

$$
Q = -10 \log_{10} P
$$

`Q` is the sequencing quality score of a given base.

`P` is the probability that the base call is wrong.
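
To make the formula concrete, the sketch below converts between `Q` and `P` and decodes a quality string into scores. The function names are hypothetical, and the decoding assumes the Phred+33 ASCII offset used by current Illumina FASTQ files; older files may use a different offset.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_phred33(quality: str) -> list[int]:
    """Decode a quality string, assuming the Phred+33 ASCII offset."""
    return [ord(character) - 33 for character in quality]


# Q20 is a 1-in-100 chance of a wrong call, Q30 is 1 in 1,000.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P = {phred_to_error_probability(q):.4f}")

print(decode_phred33("!5I"))  # [0, 20, 40]
```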

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_21.png caption="Table source: https://www.illumina.com/Documents/products/technotes/technote_Q-Scores.pdf " width=90% height=90% >}}

While next-generation sequencing metrics vary from those of Sanger sequencing (e.g., no electropherogram peak heights), the process of generating a Phred quality scoring scheme is largely the same.

[More on Quality Scores](https://help.basespace.illumina.com/files-used-by-basespace/quality-scores)
38 changes: 38 additions & 0 deletions content/notes/bioinfo-intro/09-sambam.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
title: SAM/BAM Sequence Alignment
date: 2025-08-23-03:19:53Z
type: docs
weight: 500
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

An alignment file places raw reads in the context of a reference sequence. Each line holds one alignment record with 11 mandatory tab-delimited fields, optionally followed by tags.

`SAM` is plain text (human readable), whereas `BAM` is its binary equivalent.

[SAMtools](http://samtools.sourceforge.net) is a suite of utilities for working with SAM/BAM files. [Picard](https://broadinstitute.github.io/picard/) is a set of command-line tools for manipulating high-throughput sequencing data.

### Example SAM file

```plaintext
D4ZHLFP1:53:D2386ACXX:6:2115:17945:68812 0 Mle_000001 18 42 108M * 0 0
TCCCCCTGCATGGTCCGTCTGCGTGCAATCGCATGAGTATGCCTCCAGCATGAGTTACCGATCGTGGACACCTGCTTG
GCCAAGATGTACTGAGATGCAT
C@CFDEFFHHGHHFGBGFEGGDGGGEHGHGGGJJJJIIGIIB9BFBFHGHHICEAHGGEGEDHIGEEDBECCACBDDC@CCDBCDD<
?2+4>@4>>CCCAA@@ AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A107
YT:Z:UU
D4ZHLFP1:53:D2386ACXX:7:2110:5214:83081 0 Mle_000001 18 42 108M * 0 0
TCCCCCTGCATGGTCCGTCTGCGTGCAATCGCATGAGTATGCCTCCAGCATGAGTTACCGATCGTGGCAACCTGCTTGCCAA
GATGTACTGAGATGCAT
CCCFFFFHHHHHHHGGGEGIJIIGJFHJJJJIJIJJIJIJGIJJIJJIJFHJJJIJJHHFFCEEEEEDDDDDDDDDDDDD AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A107
YT:Z:UU
D4ZHLFP1:53:D2386ACXX:7:2206:9985:31556 0 Mle_000001 18 42 108M * 0 0
TCCCCCTGCATGGTCCGTCTGCGTGCAATCGCATGAGTATGCCTCCAGCATGAGTTACCGATCGTGGCAACCTGCTTGCCAA
GATGTACTGAGATGCAT
CCCEFFFFHHHHHJJIJHJJIJIJJIJIJJJJIJIJJJIJJIJJJIJJJGEFFEEEEDDDDDDDDDDDDDDDDDDDDDDD AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A107
YT:Z:UU
```
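
A SAM record can be split into its 11 mandatory fields with a tab split. The sketch below is a minimal illustration, not a substitute for SAMtools or a dedicated library. `SamRecord` and `parse_sam_line` are hypothetical names, the field names follow the SAM specification, and the example line is a shortened, made-up record in the same style as the records above.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)
    tags: dict[str, str]  # optional TAG:TYPE:VALUE fields, values kept as text


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs 11 tab-delimited fields")
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical shortened record for illustration.
line = "read1\t0\tMle_000001\t18\t42\t8M\t*\t0\t0\tTCCCCCTG\tCCCFFFFH\tNM:i:1"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags)
```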

Helpful site for looking up `SAM` flags: [https://broadinstitute.github.io/picard/explain-flags.html](https://broadinstitute.github.io/picard/explain-flags.html)
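
The same lookup can be done locally, because the FLAG field is a sum of bit values defined in the SAM specification. The sketch below is a hypothetical helper (`SAM_FLAGS` and `explain_flag` are names introduced here, and the descriptions are paraphrased) that lists which bits are set in a given flag.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> list[str]:
    """Return a description for every bit set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(explain_flag(0))   # [] : unpaired read mapped to the forward strand
print(explain_flag(16))  # ['read reverse strand']
print(explain_flag(99))
# ['read paired', 'read mapped in proper pair', 'mate reverse strand', 'first in pair']
```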
21 changes: 21 additions & 0 deletions content/notes/bioinfo-intro/10-fastqc-qualityreads.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
---
title: Checking Read Quality - FASTQC
date: 2025-08-23-03:19:53Z
type: docs
weight: 550
menu:
  bioinfo-intro:
    parent: Bioinformatics
---

FASTQC provides an overview of sequencing read quality.

Sample FASTQC reports displaying varying metrics:

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_24.png width=85% height=85% caption="FASTQC report showing per-base sequence quality, with most bases maintaining high quality across reads and slight drops at the read ends typical of Illumina data." >}}

{{< figure src=/notes/bioinfo-intro/img/Intro-Bioinformatics-for-posting_20250604_25.png width=85% height=85% caption="FASTQC report showing per-base sequence quality, where read quality declines toward the end, indicating potential sequencing degradation or lower confidence in base calls at later positions." >}}

{{< figure src=/notes/bioinfo-intro/img/readq3.png width=85% height=85% caption="FASTQC report showing a decline in per-base sequence quality toward the end of reads, indicating significant quality drop-off and potential sequencing errors in later positions." >}}

[Read More](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
16 changes: 16 additions & 0 deletions content/notes/bioinfo-intro/11-rcresources.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
---
title: Research Computing Resources
date: 2025-08-23-03:19:53Z
type: docs
weight: 600
menu:
  bioinfo-intro:
---

Relevant Tutorials:

[Using UVA's HPC System from the Terminal](https://learning.rc.virginia.edu/notes/hpc-from-terminal/)

[HPC orientation session and office hours](https://www.rc.virginia.edu/support/#office-hours)

