In statistics, the Behrens–Fisher problem, named after Walter Ulrich Behrens and Ronald Fisher, is the problem of interval estimation and hypothesis testing concerning the difference between the means of two normally distributed populations when the variances of the two populations are not assumed to be equal, based on two independent samples.
One difficulty in discussing the Behrens–Fisher problem and its proposed solutions is that "the Behrens–Fisher problem" has been interpreted in many different ways. These differences concern not only what counts as a relevant solution, but even the basic statement of the context being considered.
Let X1, ..., Xn and Y1, ..., Ym be i.i.d. samples from two populations which both come from the same location-scale family of distributions. The scale parameters are assumed to be unknown and not necessarily equal, and the problem is to assess whether the location parameters can reasonably be treated as equal. Lehmann states that "the Behrens–Fisher problem" is used both for this general form of model when the family of distributions is arbitrary and for when the restriction to a normal distribution is made. While Lehmann discusses a number of approaches to the more general problem, mainly based on nonparametrics, most other sources appear to use "the Behrens–Fisher problem" to refer only to the case where the distribution is assumed to be normal: most of this article makes this assumption.
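The general setting described above can be written out explicitly. The following is a sketch in notation that is not taken from Lehmann: the symbols \mu_1, \mu_2 (locations), \sigma_1, \sigma_2 (scales), F (the common standardized distribution) and the standardized errors \varepsilon_i, \eta_j are labels assumed here for illustration.

```latex
\[
  X_i = \mu_1 + \sigma_1 \varepsilon_i \quad (i = 1, \dots, n),
  \qquad
  Y_j = \mu_2 + \sigma_2 \eta_j \quad (j = 1, \dots, m),
\]
\[
  \varepsilon_1, \dots, \varepsilon_n, \eta_1, \dots, \eta_m
  \ \text{i.i.d. with distribution } F,
\]
\[
  H_0 : \mu_1 = \mu_2
  \quad \text{with } \sigma_1 > 0,\ \sigma_2 > 0
  \text{ unknown and not necessarily equal.}
\]
```

In the normal case, F is the standard normal distribution, so that each X_i is normal with mean \mu_1 and variance \sigma_1^2, and each Y_j is normal with mean \mu_2 and variance \sigma_2^2.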
Solutions to the Behrens–Fisher problem have been presented from both the classical and the Bayesian points of view, and a solution derived under either would be notionally invalid when judged from the other. If consideration is restricted to classical statistical inference, it is possible to seek solutions that are simple to apply in practice, accepting some inaccuracy in the corresponding probability statements in exchange for that simplicity. Where exact significance levels are required, there may be an additional requirement that the procedure make maximum use of the statistical information in the dataset. It is well known that an exact test can be obtained by randomly discarding data from the larger dataset until the sample sizes are equal, assembling the data in pairs and taking differences, and then using an ordinary t-test to test whether the mean difference is zero. Since it discards data, this procedure would clearly not be "optimal" in any sense.
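The discard-and-pair procedure can be illustrated with a short sketch. The code below is a minimal illustration using only the Python standard library, not a recommended method: the function name `paired_discard_t_statistic`, its `seed` parameter and the sample values are all hypothetical. It assumes both samples are normal and independent, so that the paired differences are i.i.d. normal and the statistic follows a Student t distribution with k − 1 degrees of freedom under the null hypothesis, where k is the smaller sample size.

```python
import math
import random
import statistics


def paired_discard_t_statistic(x, y, seed=None):
    """Return (t, df) for the discard-and-pair test of equal means.

    Hypothetical helper for illustration; assumes x and y are
    independent samples from normal populations.
    """
    rng = random.Random(seed)
    k = min(len(x), len(y))
    if k < 2:
        raise ValueError("each sample needs at least two observations")

    # Draw k observations without replacement from each sample.
    # For the larger sample this randomly discards the surplus;
    # for both samples it also randomizes the pairing order.
    x_kept = rng.sample(list(x), k)
    y_kept = rng.sample(list(y), k)

    # Pair the retained observations and take differences.
    diffs = [a - b for a, b in zip(x_kept, y_kept)]

    # Ordinary one-sample t statistic for "mean difference is zero".
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    if sd_d == 0:
        raise ValueError("differences have zero variance")
    t = mean_d / (sd_d / math.sqrt(k))
    return t, k - 1


# Illustrative data only (made up for this example).
x = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3]
y = [4.2, 6.8, 3.9, 7.1, 5.0]
t, df = paired_discard_t_statistic(x, y, seed=0)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

A p-value would then be read from the Student t distribution with `df` degrees of freedom. The sketch also makes the cost of the procedure visible: two of the seven observations in `x` are never used, and the result depends on which observations happen to be discarded and how the rest are paired.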