Copy number variation (CNV) is a relatively new field in genomics and it is defined as a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals in the human population. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. However, note that although modern genomics research is mostly focused on human genomes, copy number variations also occur in a variety of other organisms including E. coli. Recent research indicates that approximately two thirds of the entire human genome is composed of repeats and 4.8-9.5% of the human genome can be classified as copy number variations. In mammals, copy number variations play an important role in generating necessary variation in the population as well as disease phenotype.
Copy number variations can be generally categorized into two main groups: short repeats and long repeats. However, there are no clear boundaries between the two groups and the classification depends on the nature of the loci of interest. Short repeats include mainly bi-nucleotide repeats (two repeating nucleotides e.g. A-C-A-C-A-C...) and tri-nucleotide repeats. Long repeats include repeats of entire genes. This classification based on size of the repeat is the most obvious type of classification as size is an important factor in examining the types of mechanisms that most likely gave rise to the repeats, hence the likely effects of these repeats on phenotype.
One of the most well known examples of a short copy number variation is the tri-nucleotide repeat of the CAG base pairs in the Huntingtin gene, the gene that is responsible for the neurological disorder Huntington's disease. For this particular case, once the CAG tri-nucleotide repeats more than 44-46 times, Huntington’s disease will likely develop in the individual and it will likely be inherited by his or her offspring. Interestingly, the number of repeats of the CAG tri-nucleotide is correlated with the age of onset of Huntington’s disease. These types of short repeats are often thought to be due to errors in polymerase activity during replication including polymerase slippage, template switching, and fork switching which will be discussed in detail later. The short repeat size of these copy number variations lends itself to errors in the polymerase as these repeated regions are prone to misrecognition by the polymerase and replicated regions may be replicated again, leading to extra copies of the repeat. In addition, if these tri-nucleotide repeats are in the same reading frame in the coding portion of a gene, it may lead to a long chain of the same amino acid, possibly creating protein aggregates in the cell, and if these short repeats fall into the non-coding portion of the gene, it may affect gene expression and regulation. On the other hand, a variable number of repeats of entire genes is less commonly identified in the genome. One example of a whole gene repeat is the alpha-amylase 1 gene (AMY1) that encodes alpha-amylase which has a significant copy number variation between different populations with different diets. Although the specific mechanism that allows the AMY1 gene to increase or decrease its copy number is still a topic of debate, some hypotheses suggest that the non-homologous end joining or the microhomology-mediated end joining is likely responsible for these whole gene repeats. Repeats of entire genes has immediate effects on expression of that particular gene, and the fact that the copy number variation of the AMY1 gene has been related to diet is a remarkable example of recent human evolutionary adaptation. Although these are the general groups that copy number variations are grouped into, the exact number of base pairs copy number variations affect depends on the specific loci of interest. Currently, using data from all reported copy number variations, the mean size of copy number variant is around 118kb, and the median is around 18kb.