Machine learning, a subfield of computer science concerned with developing algorithms that learn to make predictions from data, has a number of emerging applications in bioinformatics. Bioinformatics deals with computational and mathematical approaches for understanding and processing biological data. Before machine learning algorithms emerged, bioinformatics algorithms had to be explicitly programmed by hand, which proves extremely difficult for problems such as protein structure prediction. Machine learning techniques such as deep learning allow an algorithm to use automatic feature learning: from the dataset alone, the algorithm learns how to combine multiple features of the input data into a more abstract set of features on which further learning is conducted. This multi-layered approach to learning patterns in the input data allows such systems to make quite complex predictions when trained on large datasets. In recent years, the size and number of available biological datasets have grown rapidly, enabling bioinformatics researchers to make use of these machine learning systems. Machine learning has been applied to six main subfields of bioinformatics: genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Genomics is the study of the genome, the complete DNA sequence of an organism. Although genomic sequence data was historically sparse because of the technical difficulty of sequencing DNA, the number of available sequences is growing exponentially. However, while raw data is becoming increasingly available and accessible, the biological interpretation of this data is proceeding at a much slower pace. There is therefore an increasing need for machine learning systems that can automatically determine the location of protein-encoding genes within a given DNA sequence. This problem in computational biology is known as gene prediction.
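To make the input and output of gene prediction concrete, the following is a minimal sketch of a naive open reading frame (ORF) scan. It is a rule-based baseline, not a machine learning method, and it is not taken from any particular gene prediction tool. It assumes the standard genetic code (start codon ATG; stop codons TAA, TAG and TGA) and searches only the forward strand. The function name `find_orfs` and the parameter `min_codons` are hypothetical names introduced for illustration. A learned gene predictor would go further, for example by scoring candidate regions like these using patterns learned from annotated genomes.

```python
# Naive ORF scan: a rule-based baseline for illustration, not a learned model.
# Assumes the standard genetic code and scans the forward strand only.

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return a sorted list of (start, end, frame) tuples for candidate ORFs.

    start is the 0-based index of the start codon, end is the index just
    past the stop codon, and frame is the reading frame (0, 1 or 2).
    min_codons is the minimum ORF length in codons, counting both the
    start codon and the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first start codon of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(dna):
        print(start, end, frame, dna[start:end])
```

On the toy sequence above, the scan reports a single candidate in frame 2 covering `ATGAAATTTTAG`. Real genomes make the task much harder: genes can lie on either strand, eukaryotic genes are interrupted by introns, and most ORFs found this way are not genes, which is why statistical and machine learning approaches are needed.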