Glossary
File Formats
SAM – Sequence Alignment Map
A tab-delimited, text-based file format that stores the alignment of sequence reads to a reference genome (see the format specification, version 1.0). SAM files can be visualized and analyzed in genome browsers like IGV.
Additional keys are defined in the Optional Fields Specification.
BAM – Binary Alignment Map
A compressed binary version of the SAM file format. BAM files store the same alignment data more efficiently and, once sorted and indexed, allow fast access to specific genomic regions. They can be visualized and analyzed in genome browsers like IGV.
BED – Browser Extensible Data
A tab-separated-file format that stores genomic regions as a list of intervals.
It looks like this:
chr1 11873 12227 ENSG00000223972.5 0 +
chr1 12612 12721 ENSG00000227232.5 0 -
chr1 13220 14409 ENSG00000278267.1 0 +
The first three columns represent the chromosome, start position, and end position of the interval. The other columns contain additional information about the interval. Importantly, the start position in a BED file is 0-based, and the end position refers to the first base that is outside the interval. So, to refer to the first base of chromosome 1, one would write
chr1 0 1 Interval_of_length_1
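The coordinate convention can be sketched in a few lines of Python. The function names are illustrative and not part of any BED library.

```python
def bed_interval_length(start: int, end: int) -> int:
    """Length of a BED interval: start is 0-based, end is exclusive."""
    return end - start


def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a BED interval to 1-based, inclusive coordinates."""
    return start + 1, end


# The first base of chromosome 1, as in the example line above.
print(bed_interval_length(0, 1))  # 1
print(bed_to_one_based(0, 1))     # (1, 1)
```

Because the end is exclusive, the length of an interval is simply end minus start, and adjacent intervals share a boundary value without overlapping.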
VCF – Variant Call Format
A tab-separated file format that stores information about genetic variants (see the format specification for VCFv4.5). It is used to represent SNPs, insertions, deletions, and other types of genetic variation. VCF files contain each variant's position, reference allele, alternative allele, quality score, and other annotations. @vcfPaper
It looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
chr1 10001 . A C 30 PASS . GT:DP 0/1:10 1/1:20
chr1 10002 . T G 20 PASS . GT:DP 0/1:15 1/1:25
The first row contains the column headers. The first eight columns are mandatory fields that describe the variant. The remaining columns contain information about the samples and their genotypes. Importantly, unlike in BED, the position column in VCF files is 1-based.
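As a minimal sketch of the difference between the two conventions, the following Python function converts the position of a VCF record into a BED-style interval covering the reference allele. The function name is hypothetical, and splitting on whitespace is a simplification for the example line; real VCF files are tab-separated.

```python
def vcf_record_to_bed(line: str) -> tuple[str, int, int]:
    """Return the 0-based, half-open interval covered by the REF allele."""
    fields = line.split()  # simplification: real VCF fields are tab-separated
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    start = pos - 1         # VCF POS is 1-based, BED start is 0-based
    end = start + len(ref)  # BED end is exclusive
    return chrom, start, end


record = "chr1 10001 . A C 30 PASS . GT:DP 0/1:10 1/1:20"
print(vcf_record_to_bed(record))  # ('chr1', 10000, 10001)
```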
FASTQ
A text-based file format that stores DNA or RNA sequences together with their corresponding quality scores, as produced by most sequencing machines. It is used to store raw sequencing data from high-throughput sequencing experiments. FASTQ files contain the sequence data in the form of nucleotide bases (A, T, C, G, or N for an unknown base) and Phred-scaled quality scores that represent the confidence in each base call.
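A minimal sketch of reading one FASTQ record in Python, assuming the common four-line layout and the Phred+33 quality encoding. The record shown is invented for illustration, and the function name is hypothetical.

```python
def parse_fastq_record(lines: list[str]) -> tuple[str, str, list[int]]:
    """Parse a four-line FASTQ record into (name, sequence, quality scores)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record is assumed to span exactly four lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    # Phred+33: each quality character encodes its score as ASCII code minus 33.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


record = ["@read1", "ACGT", "+", "II5!"]
print(parse_fastq_record(record))  # ('read1', 'ACGT', [40, 40, 20, 0])
```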
BCF – Binary Call Format
A binary version of the VCF file format. It stores variant calls in a compressed and indexed binary format. BCF files are more efficient for storing large variant datasets and can be used with tools like BCFtools.
BGZF – Blocked GNU Zip Format
A block compression format in which the input file is compressed as a series of GZIP-compressed blocks that can be indexed for efficient random access.
qname – query template name
The name of the template (the sequenced fragment) that a read in a SAM or BAM file originates from. Reads from the same template, such as the two mates of a pair, share the same qname, which makes it possible to link them to each other. Example: NB502094:69:HN2H2BGX5:2:21111:24699:8526
Genetics
TAPS – TET-Assisted Pyridine-Borane Sequencing
A method for detecting DNA modifications, such as methylation, at single-base resolution. TAPS uses a chemical reaction to convert 5-methylcytosine (5mC) to thymine (T) while leaving unmodified cytosines unchanged. This allows the detection of 5mC sites using standard sequencing technologies. It is an alternative to bisulfite sequencing, which can introduce biases and artifacts during the conversion process.
5Base – Illumina 5-Base sequencing
A sequencing chemistry from Illumina that uses a custom enzyme to convert methylated (and hydroxymethylated) cytosine to thymine. The resulting data look very similar to TAPS data.
bisulfite
A chemical treatment that converts only unmethylated cytosines to uracil in DNA. Used in bisulfite sequencing to detect DNA methylation patterns.
variant
A genomic locus that differs from the given reference genome. Variants can be SNVs, insertions, deletions, or structural changes.
allele
In cells with more than one homologous chromosome, an allele refers to each of the observed sequences. In particular, a diploid genome will have two alleles per autosomal position.
reference allele
Allele that matches the reference genome at a specific locus. Opposite of alternative allele.
alternative allele
Allele that is different from the reference allele at a specific locus. Opposite of reference allele.
reference genome
Sequence of the species' genome that is used as a reference for comparing and analyzing genetic data. In particular, most alignment algorithms will use a reference genome to map reads to.
haplotype
A haplotype is a set of variants that co-occur on the same chromosome sequence, and are therefore linked.
methylation
Cytosine methylation is the enzymatic addition of a methyl group to cytosine by DNA methyltransferases. In animals, DNA methylation occurs predominantly at CpG dinucleotides.
methylated
Refers to a base that has undergone methylation.
SNV – Single Nucleotide Variant
A variation in a single nucleotide that occurs at a specific position in the genome. SNV is commonly used to refer to somatic variation, i.e., a difference between, e.g., a cancer and its host.
SNP – Single Nucleotide Polymorphism
Formally, a SNP is a variant that occurs at a given minimum frequency in the population. However, it is nowadays often used to refer to any inherited variant in an individual.
locus
A specific position on a chromosome, for example the position of a variant in the genome.
assay
A test or procedure used to measure a specific property or characteristic of a sample. In genetics, assays are used to detect and measure genetic variations, gene expression levels, protein interactions, and other biological processes.
GIAB – Genome In a Bottle
"A public-private-academic consortium hosted by NIST to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice and innovations in technologies." -- nist.gov
diploid genome
Genome that contains two sets of chromosomes, one set inherited from each parent.
indel – insertion/deletion
Variation in the genome where a small number of nucleotides are inserted or deleted.
CpG – Cytosine-phosphate-Guanine
A site in the DNA sequence where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites are often associated with gene regulation through DNA methylation.
VAF – Variant Allele Frequency
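The proportion of observations of a specific alternative allele in a sample, relative to the total number of observations at that locus: $\mathrm{VAF} = n_{\text{alt}} / n_{\text{total}}$, where $n_{\text{alt}}$ is the number of observations (typically reads) supporting the alternative allele and $n_{\text{total}}$ is the total number of observations at the locus.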
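As a minimal sketch, the ratio of alternative-allele observations to all observations at a locus can be computed as follows. The function and parameter names are illustrative, and the counts are invented.

```python
def variant_allele_frequency(alt_count: int, total_count: int) -> float:
    """Fraction of observations at a locus that support the alternative allele."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= alt_count <= total_count:
        raise ValueError("alt_count must lie between 0 and total_count")
    return alt_count / total_count


# 5 of 20 reads covering the locus carry the alternative allele.
print(variant_allele_frequency(5, 20))  # 0.25
```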
de-novo CpG
A CpG site that is not present in the reference genome but appears in the sample when a variant turns an NG (or a CN) into a CG.
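The following Python sketch illustrates the idea for a single-nucleotide substitution on the forward strand. It is a simplified illustration with a hypothetical function name, not the detection logic of any particular tool.

```python
def creates_de_novo_cpg(reference: str, position: int, alt: str) -> bool:
    """True if substituting `alt` at 0-based `position` creates a CG
    dinucleotide that is absent from the reference at the same place."""
    reference = reference.upper()
    alt = alt.upper()
    if not 0 <= position < len(reference):
        raise IndexError("position outside of reference sequence")
    mutated = reference[:position] + alt + reference[position + 1:]
    # Only the two dinucleotides that overlap the substituted base can change.
    for start in (position - 1, position):
        if start < 0 or start + 2 > len(reference):
            continue
        if mutated[start:start + 2] == "CG" and reference[start:start + 2] != "CG":
            return True
    return False


print(creates_de_novo_cpg("TTAGTT", 2, "C"))  # True: AG becomes CG
print(creates_de_novo_cpg("TTCATT", 3, "G"))  # True: CA becomes CG
print(creates_de_novo_cpg("TTCATT", 3, "T"))  # False: no CG is created
```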
Programming
Rust
A systems programming language, known for its safety, speed, and concurrency. @rustPaper
CLI – Command Line Interface
Terminal-based interface for interacting with software using text commands.
LLVM
Compiler infrastructure project, used as backend for various programming languages including the official Rust compiler.
closure
A function that "closes over" bindings in the surrounding scope, i.e., variables from around its definition location are accessible inside of it.
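A small Python sketch of the concept (the names are illustrative): the inner function keeps access to `step` and `count` from the scope in which it was defined, even after `make_counter` has returned.

```python
def make_counter(step: int):
    """Return a function that closes over `step` and `count`."""
    count = 0

    def increment() -> int:
        nonlocal count  # modify the binding captured from the enclosing scope
        count += step
        return count

    return increment


counter = make_counter(2)
print(counter())  # 2
print(counter())  # 4
```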
flamegraph
A visualization of profiled software, allowing the user to see which functions are consuming the most time. The y-axis represents the stack depth, and each box is labeled with a function name. The width of each box represents the share of time (profile samples) spent in that function, including the functions it calls; the left-to-right order along the x-axis does not represent the passage of time. @gregg2016flame
LTO – Link Time Optimization
A compiler optimization technique that performs optimizations across the entire program at link time, rather than separately for each compilation unit. This can lead to improved performance and reduced code size.
CI – Continuous Integration
A software development practice where automated tests and builds are run on a regular basis (e.g. on every push to the repository) to ensure that code changes do not introduce new bugs or issues.
PGO – Profile-Guided Optimization
A compiler optimization technique that uses profiling information from previous runs of the program to optimize the code for better performance. This can lead to improved runtime performance and reduced code size.
Sequence Analysis
CIGAR – Compact Idiosyncratic Gapped Alignment Report
A string that represents the alignment of a sequence read to a reference genome. It encodes information about the alignment, including matches, mismatches, insertions, deletions, and skips. It is used in the SAM and BAM file formats.
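A minimal Python sketch of how a CIGAR string can be decoded. The operation letters and which of them consume reference or query bases follow the SAM specification; the function names are illustrative, and the example string is invented.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Re-assembling the pairs must reproduce the input, otherwise it is malformed.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


print(parse_cigar("5S10M2I3D20M"))     # [(5, 'S'), (10, 'M'), (2, 'I'), (3, 'D'), (20, 'M')]
print(reference_span("5S10M2I3D20M"))  # 33
print(query_length("5S10M2I3D20M"))    # 37
```

Note that this sketch rejects the placeholder `*`, which SAM uses when no alignment information is available.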
mapQ – mapping quality
A Phred-scaled probability that the mapping position of a read is incorrect. A higher mapping quality indicates a more reliable alignment. Numbers are typically between 0 and 60.
baseQ – base quality
A Phred-scaled probability that the base call is incorrect. A higher base quality indicates a more reliable base call. Numbers are typically between 0 and 40.
OT – Original Top
The "original top" of the DNA fragment that was sequenced. The "top" by definition is the sequence that matches the reference sequence.
OB – Original Bottom
The "original bottom" of the DNA fragment that was sequenced.
record
Alternative term for read.
read – sequence read
A sequence of DNA or RNA that is generated by a high-throughput sequencing technology. In "short-read sequencing", they are typically between a few dozen and a few hundred base pairs long.
alignment
Represents the position and base-to-base mapping of a sequence read to a reference genome.
pileup
A summary of the read alignments at a specific position in the reference genome in BAM and SAM files. As a data structure, a pileup is a matrix where each row represents a read and each column represents a position in the reference genome. From this, the most likely base at that position can be inferred and used to call variants.
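The idea of a pileup column can be sketched in Python under strong simplifications: reads are assumed to be aligned without gaps and are given as hypothetical (0-based start, sequence) pairs, whereas real pileups must also account for CIGAR operations, base qualities, and strands.

```python
from collections import Counter


def pileup_column(reads: list[tuple[int, str]], position: int) -> Counter:
    """Count the bases that ungapped reads place at one reference position."""
    column = Counter()
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):  # the read covers this position
            column[sequence[offset]] += 1
    return column


reads = [(0, "ACGTA"), (2, "GTAC"), (3, "CACG")]
column = pileup_column(reads, 3)
print(column["T"], column["C"])  # 2 1
print(column.most_common(1))     # [('T', 2)]
```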
Statistics
RMS – Root Mean Square
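The square root of the arithmetic mean of the squared values in a set. When applied to the deviations of the values from their mean, it measures their typical deviation and equals the population standard deviation.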
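A short Python sketch, taking RMS in its usual sense as the square root of the mean of the squared values. Applying it to deviations from the mean yields the population standard deviation. The function name and the numbers are illustrative.

```python
import math


def root_mean_square(values: list[float]) -> float:
    """Square root of the arithmetic mean of the squared values."""
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)
deviations = [v - mean for v in data]
print(root_mean_square(deviations))  # 2.0, the population standard deviation
```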
Chi-square test
A statistical test that determines whether observed frequencies deviate significantly from the frequencies expected under a null hypothesis, for example to assess whether two categorical variables are independent.
The test statistic is calculated as the sum, over all categories, of the squared differences between the observed and expected frequencies, each divided by the expected frequency: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed frequency, $E_i$ is the expected frequency, and the sum is taken over all categories $i$. The resulting value is compared to a critical value from the Chi-square distribution to determine statistical significance.
The Chi-square test is commonly used in genetics to assess the association between genetic variants and phenotypes, as well as in population genetics to test for deviations from Hardy-Weinberg equilibrium, which describes the expected frequencies of genotypes in a population under certain assumptions.
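The test statistic described above can be sketched in plain Python. The counts are invented for illustration, the function name is hypothetical, and the comparison against a critical value is left to the reader or a statistics library.

```python
def chi_square_statistic(observed: list[float], expected: list[float]) -> float:
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented counts for three categories.
observed = [30, 50, 20]
expected = [25, 50, 25]
print(chi_square_statistic(observed, expected))  # 2.0
```

With three categories and no parameters estimated from the data, the statistic has 2 degrees of freedom, for which the 5% critical value is about 5.99, so a value of 2.0 would not be considered significant at that level.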
RF – Random Forest
Supervised learning algorithm, used for classification and regression.
ML – Machine Learning
Algorithms that learn statistical patterns from input data and use them to predict output values for new inputs.
Phred-scaled
A Phred-scaled quality score is a logarithmic measure of the probability that a base call is incorrect. The formula is $Q = -10 \log_{10}(P)$, where $P$ is the probability of an incorrect base call. A higher Phred score indicates a more reliable base call. The name "Phred" comes from the base calling software of the same name.
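The conversion between error probabilities and Phred-scaled scores can be sketched in Python, using the standard definition of the score as minus ten times the base-10 logarithm of the error probability. The function names are illustrative.

```python
import math


def phred_from_probability(p_error: float) -> float:
    """Phred-scaled score for a given error probability."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10 * math.log10(p_error)


def probability_from_phred(score: float) -> float:
    """Error probability for a given Phred-scaled score."""
    return 10 ** (-score / 10)


# A score of 30 corresponds to an error probability of about 1 in 1000,
# and a score of 20 to about 1 in 100.
print(round(phred_from_probability(0.001), 6))
print(round(probability_from_phred(20), 6))
```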
maximum likelihood
A method used to estimate the parameters of a statistical model. It finds the parameter values that maximize the likelihood of the observed data. The maximum likelihood estimate is the set of parameter values that make the observed data most probable. It is a common method for fitting models to data in statistics and machine learning.
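As a minimal illustration, consider estimating a proportion from binomial data, for example the fraction of reads that carry an allele. The Python sketch below searches a grid of candidate values for the one that maximizes the log-likelihood; the function names and counts are hypothetical, and in practice the closed-form solution (successes divided by trials) would be used directly.

```python
import math


def binomial_log_likelihood(p: float, successes: int, trials: int) -> float:
    """Log-likelihood of p, omitting the binomial coefficient (constant in p)."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def grid_mle(successes: int, trials: int, steps: int = 1000) -> float:
    """Candidate p on a regular grid in (0, 1) with the highest log-likelihood."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda p: binomial_log_likelihood(p, successes, trials))


# 7 successes in 20 trials: the grid search recovers 7 / 20.
print(grid_mle(7, 20))  # 0.35
```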
Tools
SAMtools
A suite of utilities for viewing, sorting, indexing, and otherwise manipulating SAM and BAM files. @samPaper
BCFtools
A set of utilities for manipulating variant calls in the VCF and BCF file formats. @samtoolsPaper
BEDtools
A popular suite of utilities for manipulating genomic intervals stored in BED files.
GATK – Genome Analysis Toolkit
A software package for analyzing sequencing data with tools for variant discovery, genotyping, and other analyses.
IGV – Integrative Genomics Viewer
A tool for interactive exploration of SAM and BAM files. Also see @screenshot-igv.
BWA – Burrows-Wheeler Aligner
A sequence alignment program for mapping low-divergent sequences against a large reference genome.
Minimap2
A sequence alignment program that aligns DNA or mRNA sequences against a reference database. Often used for long-read data.
htslib
A C library for reading/writing sequencing data, used by tools like SAMtools and BCFtools to manipulate SAM, BAM, and VCF files. @htslibPaper Rastair uses the Rust bindings provided by the rust-htslib crate.
bgzip
Compression tool that produces BGZF files, commonly used to compress VCF and BED files so that they can be indexed with tabix. Distributed with htslib.
tabix
An indexing tool for BGZF-compressed, tab-delimited files, such as VCF and BED files. It creates an index file that allows for efficient random access to specific regions of the file. Distributed with htslib.