External sorting is a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and must instead reside in the slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data into main memory and sort by some conventional method, such as quicksort.
2. Write the sorted data to a temporary file on disk.
3. Repeat steps 1 and 2 until all of the data is stored in sorted 100 MB chunks (there are 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
4. Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB to an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, refill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally: because the merge makes only one sequential pass through each chunk, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
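A minimal sketch of the two phases in Python, sorting a large text file line by line. The function name, the `lines_per_chunk` threshold (a stand-in for the RAM budget), and the use of `heapq.merge` for the k-way merge are illustrative choices under assumed conditions, not a canonical implementation:

```python
import heapq
import itertools
import os
import tempfile

def external_merge_sort(input_path, output_path, lines_per_chunk=1_000_000):
    """Sort a large text file, holding at most lines_per_chunk lines
    in memory at a time (the stand-in for the 100 MB RAM budget)."""
    chunk_paths = []

    # Sorting phase: read a chunk, sort it in memory, write it to a
    # temporary file, and repeat until the input is exhausted.
    with open(input_path) as f:
        while True:
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Merge phase: a single k-way merge over all sorted chunks.
    # heapq.merge streams lazily, so only one line per chunk (plus OS
    # file buffering) needs to be in memory at once.
    chunk_files = [open(p) for p in chunk_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*chunk_files))
    for cf in chunk_files:
        cf.close()
    for p in chunk_paths:
        os.remove(p)
```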
Historically, instead of a sort, a replacement-selection algorithm was sometimes used to perform the initial distribution, producing on average half as many output chunks, each of double the length.
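A sketch of replacement selection under simplifying assumptions: `memory_size` counts items rather than bytes, and for clarity the runs are returned as in-memory lists, whereas a real external sort would write each run to disk as it is produced:

```python
import heapq
import itertools

def replacement_selection(records, memory_size):
    """Split an input iterator into sorted runs using replacement
    selection; for random input, runs average about twice memory_size
    items (hence half as many chunks of double the length)."""
    sentinel = object()
    it = iter(records)
    heap = list(itertools.islice(it, memory_size))
    heapq.heapify(heap)

    runs, run, next_run = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        incoming = next(it, sentinel)
        if incoming is not sentinel:
            if incoming >= smallest:
                # Not smaller than anything already written: it can
                # still join the current run.
                heapq.heappush(heap, incoming)
            else:
                # Too small for the current run; hold it for the next.
                next_run.append(incoming)
        if not heap and next_run:
            runs.append(run)
            run = []
            heap, next_run = next_run, []
            heapq.heapify(heap)
    if run:
        runs.append(run)
    return runs
```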
The above example is a two-pass sort: first sort, then merge. Note that the merge was a single 9-way merge, rather than, say, a series of 2-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes every value from and to disk, so fewer passes mean less total I/O.
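The usual way to implement a k-way merge is with a min-heap holding the current head of each chunk; the generator below is an illustrative sketch of what `heapq.merge` does internally, and makes the single-read, single-write property explicit:

```python
import heapq

def k_way_merge(sorted_chunks):
    """Merge k sorted iterables in one pass: the smallest unconsumed
    value across all chunks is found in O(log k) time via a min-heap,
    and every value is read and yielded exactly once."""
    sentinel = object()
    iterators = [iter(chunk) for chunk in sorted_chunks]
    # Seed the heap with the first value of each chunk; the chunk
    # index breaks ties between equal values.
    heap = []
    for index, it in enumerate(iterators):
        first = next(it, sentinel)
        if first is not sentinel:
            heap.append((first, index))
    heapq.heapify(heap)
    while heap:
        value, index = heapq.heappop(heap)
        yield value
        # Replace the consumed value with the next one from its chunk.
        nxt = next(iterators[index], sentinel)
        if nxt is not sentinel:
            heapq.heappush(heap, (nxt, index))

# Usage: list(k_way_merge([[1, 4], [2, 3], [0, 5]])) -> [0, 1, 2, 3, 4, 5]
```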
However, there is a limitation to single-pass merging. As the number of chunks increases, memory is divided into more buffers, so each buffer is smaller, forcing many smaller reads rather than fewer larger ones. Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100 MB / 501 ≈ 200 KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

1. Run the initial chunk-sorting pass as before, producing 500 sorted chunks of 100 MB each.
2. Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
3. Run a second merge pass to merge the 20 larger sorted chunks into the final sorted file.

Each pass now divides memory among at most 26 buffers, so every read is a few megabytes and seek time no longer dominates.
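Continuing the earlier sketch, the two merge passes can be driven by repeatedly merging fixed-size groups of chunk files. The `fan_in` of 25 mirrors the example above; a real implementation would derive it from the memory budget and a target buffer size:

```python
import heapq
import os
import tempfile

def merge_pass(chunk_paths, fan_in):
    """One merge pass: combine sorted chunk files fan_in at a time,
    producing ceil(len(chunk_paths) / fan_in) larger sorted chunks."""
    merged_paths = []
    for start in range(0, len(chunk_paths), fan_in):
        group = chunk_paths[start:start + fan_in]
        files = [open(p) for p in group]
        fd, out_path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as out:
            out.writelines(heapq.merge(*files))
        for f in files:
            f.close()
        for p in group:
            os.remove(p)
        merged_paths.append(out_path)
    return merged_paths

def multi_pass_merge(chunk_paths, fan_in=25):
    """Repeat merge passes until one sorted file remains,
    e.g. 500 chunks -> 20 chunks -> 1 file with fan_in = 25."""
    while len(chunk_paths) > 1:
        chunk_paths = merge_pass(chunk_paths, fan_in)
    return chunk_paths[0]
```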
Like in-memory sorts, efficient external sorts require O(n log n) time: exponential increases in data size require linear increases in the number of passes. If one makes liberal use of the gigabytes of RAM provided by modern computers, the logarithmic factor grows very slowly: under reasonable assumptions, one could sort at least 500 GB of data on a hard disk using 1 GB of main memory before a third pass became advantageous, and could sort many times that before a fourth pass became useful. Low-seek-time media like SSDs also increase the amount that can be sorted before additional passes help.
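A back-of-the-envelope estimate of that slow growth, under the assumption that each read should be at least about 1 MiB to keep seeks from dominating (`min_read` and the function itself are illustrative, not a standard formula):

```python
import math

def merge_passes_needed(data_bytes, memory_bytes, min_read=1 << 20):
    """Rough estimate of merge passes: the sort phase leaves about
    data_bytes / memory_bytes chunks, and each merge pass can combine
    up to memory_bytes / min_read chunks while keeping every read at
    least min_read bytes."""
    chunks = math.ceil(data_bytes / memory_bytes)
    fan_in = max(2, memory_bytes // min_read)
    return max(1, math.ceil(math.log(chunks, fan_in)))

# 500 GB of data with 1 GB of RAM: ~500 chunks, fan-in ~950, so a
# single merge pass (a two-pass sort overall) still suffices.
print(merge_passes_needed(500 * 10**9, 10**9))  # -> 1
```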