Closest string

In theoretical computer science, the closest string is an NP-hard computational problem, which tries to find the geometrical center of a set of input strings.

To understand the word "center", it is necessary to define a distance between two strings. Usually, this problem is studied with the Hamming distance in mind.

More formally, given n length-m strings s₁, s₂, ..., s_n, the closest string problem seeks for a new length-m string s such that d(s,s_i) ≤ k for all i, where d denotes the Hamming distance, and where k is as small as possible. A decision problem version of the closest string problem, which is NP-complete, instead takes k as another input and asks whether there is a string within Hamming distance k of all the input strings.

The closest string problem can be seen as an instance of the 1-center problem in which the distances between elements are measured using Hamming distance.

In bioinformatics, the closest string problem is an intensively studied facet of the problem of finding signals in DNA.

Instances of closest string may contain information that is not essential to the problem. In some sense, the usual input of closest string contains information, that does not contribute to the hardness of the problem. For example, if some strings contain the character a, but none contains the character z, replacing all as with zs would yield an essentially equivalent instance, that is: from a solution of the modified instance, the original solution can be restored, and vice versa.

When all input strings that share the same length are written on top of each other, they form a matrix. Certain row types have essentially the same implications to the solution. For example, replacing a column with entries (a, a, b) with another column (x, x, y) might lead to a different solution string, but cannot affect solvability, because both columns express the same structure, viz. the first two entries being equal, but different from the third one.

...
Wikipedia