Glossary

File Formats

SAM – Sequence Alignment Map

A tab-delimited, text-based file format that stores the alignment of sequence reads to a reference genome. See the Format Specification for version 1.0. SAM files can be visualized and analyzed in genome browsers like IGV.

Additional keys are defined in the Optional Fields Specification.
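As a rough illustration of the tab-delimited layout, the following Python sketch splits one SAM alignment line into its eleven mandatory fields and collects any optional `TAG:TYPE:VALUE` fields. The example line, the `SamRecord` type, and the `parse_sam_line` helper are made up for this glossary; real files should be read with a dedicated library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields plus a dict of optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Hypothetical helper: parse one alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        # Only integer tags are converted here; other types stay as strings.
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Invented example line, for illustration only.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, record.tags)
```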

BAM – Binary Alignment Map

A binary version of the SAM file format. It is a compressed, indexable representation of the sequence alignment data, which makes it efficient to store and to query by genomic region. BAM files can be visualized and analyzed in genome browsers like IGV.

BED – Browser Extensible Data

A tab-separated file format that stores genomic regions as a list of intervals.

It looks like this:

chr1	11873	12227	ENSG00000223972.5	0	+
chr1	12612	12721	ENSG00000227232.5	0	-
chr1	13220	14409	ENSG00000278267.1	0	+

The first three columns represent the chromosome, start position, and end position of the interval. The other columns contain additional information about the interval. Importantly, the start position in a BED file is 0-based, and the end position refers to the first base that is outside the interval (a half-open interval). So, to refer to the first base of chromosome 1, one would write

chr1	0	1	Interval_of_length_1
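The coordinate convention is easy to get wrong, so here is a small Python sketch of the arithmetic. The function names are invented for this example.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive coordinates."""
    return start + 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start


# The single-base interval from the text: first base of chromosome 1.
print(bed_to_one_based(0, 1), bed_length(0, 1))
# The first interval of the example file above.
print(bed_to_one_based(11873, 12227), bed_length(11873, 12227))
```

Because the end is exclusive, the length of an interval is simply `end - start`, with no off-by-one correction.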

VCF – Variant Call Format

A tab-separated file format that stores information about genetic variants. See the Format Specification for VCFv4.5. It is used to represent SNPs, insertions, deletions, and other types of genetic variation. VCF files contain information about the variant's position, reference allele, alternative allele, quality score, and other annotations. @vcfPaper

It looks like this:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2
chr1	10001	.	A	C	30	PASS	.	GT:DP	0/1:10	1/1:20
chr1	10002	.	T	G	20	PASS	.	GT:DP	0/1:15	1/1:25

The first row contains the column headers. The first eight columns are mandatory fields that describe the variant. The remaining columns contain information about the samples and their genotypes. Importantly, unlike in BED, the position column in VCF files is 1-based.
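A minimal Python sketch of how such a line could be taken apart, using the first data row of the example above. The helper names `parse_vcf_line` and `vcf_to_bed_interval` are hypothetical, and the sketch ignores meta-information lines and most field types.

```python
def parse_vcf_line(line, sample_names):
    """Hypothetical helper: split one VCF data line into a small dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    record = {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position
        "ref": ref,
        "alt": alt.split(","),  # several alternative alleles are comma-separated
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


def vcf_to_bed_interval(pos, ref):
    """Convert a 1-based VCF position to a 0-based, half-open interval over REF."""
    return pos - 1, pos - 1 + len(ref)


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2"
line = "chr1\t10001\t.\tA\tC\t30\tPASS\t.\tGT:DP\t0/1:10\t1/1:20"
sample_names = header.split("\t")[9:]
record = parse_vcf_line(line, sample_names)
print(record["samples"]["sample1"], vcf_to_bed_interval(record["pos"], record["ref"]))
```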

FASTQ

A text-based file format that stores DNA or RNA sequences together with their corresponding quality scores, as produced by most sequencing machines. It is used to store raw sequencing data from high-throughput sequencing experiments. FASTQ files contain the sequence data in the form of nucleotide bases (A, T, C, G, and N for unknown bases) and Phred-scaled quality scores that represent the confidence in each base call.
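To make the layout concrete: a FASTQ record spans four lines (a header starting with `@`, the sequence, a separator starting with `+`, and one quality character per base). The Python sketch below decodes one invented record, assuming the common Phred+33 encoding, in which the quality is the character's ASCII code minus 33. The function name is hypothetical.

```python
def parse_fastq_record(lines):
    """Hypothetical helper: parse the four lines of a single FASTQ record."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


# Invented record, for illustration only.
name, sequence, scores = parse_fastq_record(["@read1", "ACGT", "+", "II5!"])
print(name, sequence, scores)
```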

BCF – Binary Call Format

A binary version of the VCF file format. It stores variant calls in a compressed and indexed binary format. BCF files are more efficient for storing large variant datasets and can be used with tools like BCFtools.

BGZF – Blocked GNU Zip Format

A block compression format in which the input file is compressed as a series of GZIP-compressed blocks that can be indexed for efficient random access.
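The idea behind block compression can be sketched with Python's standard `gzip` module. This is not real BGZF, which additionally records each block's size in the gzip header and limits the block size; it only shows why independently compressed blocks allow random access. The function name and block size are chosen for this example.

```python
import gzip


def compress_in_blocks(data: bytes, block_size: int) -> list:
    """Compress each chunk of `data` as its own, independent gzip member."""
    return [
        gzip.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


data = b"ACGT" * 1000
blocks = compress_in_blocks(data, block_size=1024)

# Concatenated gzip members still form a valid gzip stream.
whole = gzip.decompress(b"".join(blocks))

# Random access: decompress only the third block, without touching the others.
third = gzip.decompress(blocks[2])
print(len(blocks), whole == data, third == data[2048:3072])
```

An index then only needs to remember at which byte offset each block starts in the compressed file.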

qname – query template name

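An identifier for the template (the sequenced fragment) that a read in a SAM or BAM file originates from. Reads that come from the same template, such as the two mates of a read pair, share the same qname. It is used to track the origin of the read and link it to other information in the file. Example: NB502094:69:HN2H2BGX5:2:21111:24699:8526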

Genetics

TAPS – TET-Assisted Pyridine-Borane Sequencing

A method for detecting DNA modifications, such as methylation, at single-base resolution. TAPS uses a chemical reaction to convert 5-methylcytosine (5mC) to thymine (T) while leaving unmodified cytosines unchanged. This allows the detection of 5mC sites using standard sequencing technologies. It is an alternative to bisulfite sequencing, which can introduce biases and artifacts during the conversion process. You can read more about it here.

5Base – Illumina 5-Base sequencing

Illumina launched a chemistry that uses a custom enzyme to convert methylated (and hydroxymethylated) cytosine to thymine. The resulting data look very similar to TAPS data. Read more here.

bisulfite

A chemical treatment that converts only unmethylated cytosines to uracil in DNA. Used in bisulfite sequencing to detect DNA methylation patterns.
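The two conversion chemistries in this glossary are mirror images of each other, which the following Python sketch illustrates on a single strand. It is a deliberate simplification: uracil is shown as T because that is how it is read after amplification, conversion is assumed to be perfect, and the sequence and methylated positions are invented.

```python
def bisulfite_convert(sequence, methylated):
    """Bisulfite-style: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and index not in methylated else base
        for index, base in enumerate(sequence)
    )


def taps_convert(sequence, methylated):
    """TAPS-style: methylated C is read as T, unmethylated C stays C."""
    return "".join(
        "T" if base == "C" and index in methylated else base
        for index, base in enumerate(sequence)
    )


sequence = "ACGTCCGA"
methylated = {1}  # 0-based positions of methylated cytosines (made up)
print(bisulfite_convert(sequence, methylated))
print(taps_convert(sequence, methylated))
```

With bisulfite treatment most cytosines change, whereas a TAPS-style conversion leaves the bulk of the sequence intact.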

variant

A genomic locus that differs from the given reference genome. Variants can be SNVs, insertions, deletions, or structural changes.

allele

In cells with more than one copy of each homologous chromosome, an allele refers to each of the observed sequences. In particular, a diploid genome will have two alleles per autosomal position.

reference allele

An allele that matches the reference genome at a specific locus. Opposite of alternative allele.

alternative allele

An allele that is different from the reference allele at a specific locus. Opposite of reference allele.

reference genome

Sequence of the species' genome that is used as a reference for comparing and analyzing genetic data. In particular, most alignment algorithms will use a reference genome to map reads to.

haplotype

A haplotype is a set of variants that co-occur on the same chromosome sequence, and are therefore linked.

methylation

Cytosine methylation is the enzymatic addition of a methyl group to cytosine by DNA methyltransferases. In animals, DNA methylation occurs predominantly at CpG dinucleotides.

methylated

Refers to a base that has undergone methylation.

SNV – Single Nucleotide Variant

A variation in a single nucleotide that occurs at a specific position in the genome. SNV is commonly used to refer to somatic variation, i.e., a difference between, e.g., a cancer and its host.

SNP – Single Nucleotide Polymorphism

Formally, a SNP is a variant that occurs at a given minimum frequency in the population. However, it is nowadays often used to refer to any inherited variant in an individual.

locus

A specific position on a chromosome, for example the position of a variant in the genome.

assay

A test or procedure used to measure a specific property or characteristic of a sample. In genetics, assays are used to detect and measure genetic variations, gene expression levels, protein interactions, and other biological processes.

GIAB – Genome In a Bottle

"A public-private-academic consortium hosted by NIST to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice and innovations in technologies." -- nist.gov

diploid genome

Genome that contains two sets of chromosomes, one set inherited from each parent.

indel – insertion/deletion

Variation in the genome where a small number of nucleotides are inserted or deleted.

CpG – Cytosine-phosphate-Guanine

A site in the DNA sequence where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites are often associated with gene regulation through DNA methylation.

VAF – Variant Allele Frequency

The proportion of observations of a specific alternative allele in a sample, compared to the total number of observations at that locus:

$\mathrm{VAF} = \frac{\text{Reads supporting the variant allele}}{\text{Total reads covering the site}}$
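As a worked example of this ratio, a short Python sketch (the function name and the read counts are invented):

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("the site must be covered by at least one read")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


# 5 of 20 reads covering the site support the variant allele.
print(variant_allele_frequency(5, 20))
```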

de-novo CpG

A CpG site that is not present in the reference genome but seems to occur in the sample when there is a variant that turns an NG (or a CN) into a CG.
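A minimal Python sketch of this check for a single-nucleotide variant, looking only at the forward strand. The function names, the 0-based `pos` convention, and the toy sequences are assumptions made for this example.

```python
def touches_cpg(sequence: str, pos: int) -> bool:
    """True if the base at 0-based `pos` is the C or the G of a CG dinucleotide."""
    is_c_of_cpg = sequence[pos:pos + 2] == "CG"
    is_g_of_cpg = pos > 0 and sequence[pos - 1:pos + 1] == "CG"
    return is_c_of_cpg or is_g_of_cpg


def creates_de_novo_cpg(reference: str, pos: int, alt: str) -> bool:
    """True if substituting `alt` at `pos` creates a CpG absent from the reference."""
    sample = reference[:pos] + alt + reference[pos + 1:]
    return touches_cpg(sample, pos) and not touches_cpg(reference, pos)


print(creates_de_novo_cpg("ATGA", 1, "C"))  # TG becomes CG
print(creates_de_novo_cpg("ACAT", 2, "G"))  # CA becomes CG
print(creates_de_novo_cpg("ACGT", 0, "T"))  # CpG already in the reference
```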

Programming

Rust

A systems programming language, known for its safety, speed, and concurrency. @rustPaper

CLI – Command Line Interface

Terminal-based interface for interacting with software using text commands.

LLVM

Compiler infrastructure project, used as backend for various programming languages including the official Rust compiler.

closure

A function that "closes over" bindings in the surrounding scope, i.e., variables from around its definition location are accessible inside of it.
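A small illustration in Python (chosen here only for brevity; the concept is the same in Rust): the inner function below keeps access to `min_quality` from the scope in which it was defined. The names are invented for this example.

```python
def make_quality_filter(min_quality):
    """Return a function that remembers `min_quality` from this scope."""
    def passes(quality):
        # `min_quality` is not a parameter of `passes`; it is closed over.
        return quality >= min_quality
    return passes


keep = make_quality_filter(20)
kept = [quality for quality in [5, 20, 37] if keep(quality)]
print(kept)
```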

flamegraph

A visualization of profiled software, allowing the user to see which functions consume the most time. Each box represents a function in a sampled call stack. The y-axis represents the stack depth, and the x-axis spans the collected samples (it does not show the passage of time). The width of each box represents the amount of time spent in that function, including the functions it calls. @gregg2016flame

LTO – Link Time Optimization

A compiler optimization technique that performs optimizations across the entire program at link time, rather than at compile time. This can lead to improved performance and reduced code size.

CI – Continuous Integration

A software development practice where automated tests and builds are run on a regular basis (e.g. on every push to the repository) to ensure that code changes do not introduce new bugs or issues.

PGO – Profile-Guided Optimization

A compiler optimization technique that uses profiling information from previous runs of the program to optimize the code for better performance. This can lead to improved runtime performance and reduced code size.

Sequence Analysis

CIGAR – Compact Idiosyncratic Gapped Alignment Report

A string that represents the alignment of a sequence read to a reference genome. It encodes information about the alignment, including matches, mismatches, insertions, deletions, and skips. It is used in the SAM and BAM file formats.
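To show how such a string is read, the following Python sketch splits a CIGAR string into (length, operation) pairs and sums the bases it consumes on the reference and on the read. The operation letters and which sequence each one consumes follow the SAM specification; the function names and the example string are made up.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (hard clips excluded)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


cigar = "3S10M2I5M1D4M"
print(parse_cigar(cigar), reference_length(cigar), query_length(cigar))
```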

mapQ – mapping quality

A Phred-scaled probability that the mapping position of a read is incorrect. A higher mapping quality indicates a more reliable alignment. Numbers are typically between 0 and 60.

baseQ – base quality

A Phred-scaled probability that the base call is incorrect. A higher base quality indicates a more reliable base call. Numbers are typically between 0 and 40.

OT – Original Top

The "original top" of the DNA fragment that was sequenced. The "top" by definition is the sequence that matches the reference sequence.

OB – Original Bottom

The "original bottom" of the DNA fragment that was sequenced.

record

Alternative term for read.

read – sequence read

A sequence of DNA or RNA that is generated by a high-throughput sequencing technology. In "short-read sequencing", reads are typically between a few dozen and a few hundred base pairs long.

alignment

Represents the position and base-to-base mapping of a sequence read to a reference genome.

pileup

A summary of the read alignments at a specific position in the reference genome in BAM and SAM files. As a data structure, a pileup is a matrix where each row represents a read and each column represents a position in the reference genome. From this, the most likely base at that position can be inferred and used to call variants.
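A toy version of this idea in Python, assuming ungapped reads that are described only by a start position and a sequence (real pileups also have to handle CIGAR operations, base qualities, and strands). The reads and the function name are invented.

```python
from collections import Counter


def pileup_column(reads, position):
    """Collect the bases that all overlapping reads show at one reference position.

    `reads` is a list of (start, sequence) pairs; `start` and `position`
    use the same coordinate system.
    """
    bases = []
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):
            bases.append(sequence[offset])
    return bases


reads = [(100, "ACGT"), (101, "CGTA"), (102, "TTAC")]
column = pileup_column(reads, 102)
most_common_base, count = Counter(column).most_common(1)[0]
print(column, most_common_base, count)
```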

Statistics

RMS – Root Mean Square

The square root of the arithmetic mean of the squared values in a set. When applied to the deviations of the values from their mean, it equals the (population) standard deviation.

Chi-square test

A statistical test that determines whether observed frequencies differ significantly from the frequencies expected under a null hypothesis, for example to assess the independence of two categorical variables.

The test statistic is calculated as the sum of the squared differences between the observed and expected frequencies, divided by the expected frequencies. The resulting value is compared to a critical value from the Chi-square distribution to determine statistical significance: $\chi^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$ where $O_{i}$ is the observed frequency, $E_{i}$ is the expected frequency, and the sum is taken over all categories.

The Chi-square test is commonly used in genetics to assess the association between genetic variants and phenotypes, as well as in population genetics to test for deviations from Hardy-Weinberg equilibrium, which describes the expected frequencies of genotypes in a population under certain assumptions.
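As a worked example, the Python sketch below computes the statistic for a Hardy-Weinberg check on invented genotype counts. The expected counts use the usual $p^2$, $2pq$, $q^2$ genotype frequencies, with $p$ estimated from the observed counts. The function names are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def hardy_weinberg_expected(n_aa_major, n_het, n_aa_minor):
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    total = n_aa_major + n_het + n_aa_minor
    p = (2 * n_aa_major + n_het) / (2 * total)  # frequency of the major allele
    q = 1 - p
    return [p * p * total, 2 * p * q * total, q * q * total]


observed = [50, 40, 10]  # made-up counts for genotypes AA, Aa, aa
expected = hardy_weinberg_expected(*observed)
statistic = chi_square_statistic(observed, expected)
print(expected, round(statistic, 4))
```

For these counts the expected values are 49, 42, and 9, and the statistic is about 0.227. With one degree of freedom, that is well below the 5% critical value of 3.841, so this toy sample shows no significant deviation from equilibrium.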

RF – Random Forest

Supervised learning algorithm, used for classification and regression.

ML – Machine Learning

Algorithms that can receive input data and use statistical analysis to predict an output value within an acceptable range.

Phred-scaled

A Phred-scaled quality score is a logarithmic measure of the probability that a base call is incorrect. The formula is $Q = -10 \cdot \log_{10} P$ where $P$ is the probability of an incorrect base call. A higher Phred score indicates a more reliable base call. The name "Phred" comes from the base calling software of the same name.
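The conversion in both directions, sketched in Python with a base-10 logarithm (the function names are invented for this example):

```python
import math


def phred_to_probability(quality: float) -> float:
    """Error probability P for a Phred-scaled score Q."""
    return 10 ** (-quality / 10)


def probability_to_phred(probability: float) -> float:
    """Phred-scaled score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(probability)


for quality in (10, 20, 30):
    print(quality, phred_to_probability(quality))
```

Every 10 points on the Phred scale correspond to a tenfold drop in error probability, so Q20 is a 1-in-100 chance of error and Q30 is 1 in 1000.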

maximum likelihood

A method used to estimate the parameters of a statistical model. It finds the parameter values that maximize the likelihood of the observed data. The maximum likelihood estimate is the set of parameter values that make the observed data most probable. It is a common method for fitting models to data in statistics and machine learning.
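A small illustration in Python, assuming a binomial model in which `k` of `n` reads support an allele and the unknown parameter is the allele fraction `p`. The sketch maximizes the log-likelihood with a simple grid search, which lands on the closed-form estimate `k / n`. The counts and function names are made up.

```python
import math


def binomial_log_likelihood(p: float, k: int, n: int) -> float:
    """Log-likelihood of observing k successes in n trials, up to a constant.

    The binomial coefficient is omitted because it does not depend on p.
    """
    return k * math.log(p) + (n - k) * math.log(1 - p)


def grid_search_mle(k: int, n: int, steps: int = 1000) -> float:
    """Return the candidate p in (0, 1) with the highest log-likelihood."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda p: binomial_log_likelihood(p, k, n))


estimate = grid_search_mle(k=7, n=20)
print(estimate, 7 / 20)
```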

Rastair