SNV calling from NGS data

SNV calling from NGS data refers to a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms (see SNP genotyping). Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

Most NGS based methods for SNV detection are designed to detect germline variations in the individual's genome. These are the mutations that an individual biologically inherits from their parents, and are the usual type of variants searched for when performing such analysis (except for certain specific applications where somatic mutations are sought). Very often, the searched for variants occur with some (possibly rare) frequency, throughout the population, in which case they may be referred to as single nucleotide polymorphisms (SNPs). Technically the term SNP only refers to these kinds of variations, however in practice they are often used synonymously with SNV in the literature on variant calling. In addition, since the detection of germline SNVs requires determining the individual's genotype at each locus, the phrase "SNP genotyping" may also be used to refer to this process. However this phrase may also refer to wet-lab experimental procedures for classifying genotypes at a set of known SNP locations.

The usual process of such techniques are based around:

The usual output of these procedures is a VCF file.

In an ideal error free world with high read coverage, the task of variant calling from the results of a NGS data alignment would be simple; at each locus (position on the genome) the number of occurrences of each distinct nucleotide among the reads aligned at that position can be counted, and the true genotype would be obvious; either AA if all nucleotides match allele A, BB if they match allele B, or AB if there is a mixture. However, when working with real NGS data this sort of naive approach is not used, as it cannot account for the noise in the input data. The nucleotide counts used for base calling contain errors and bias, both due do the sequenced reads themselves, and the alignment process. This issue can be mitigated to some extent by sequencing to a greater depth of read coverage, however this is often expensive, and many practical studies require making inferences on low coverage data.

...
Wikipedia