A trace cache (also known as an execution trace cache) is a specialized instruction cache that stores dynamic streams of instructions known as traces. By storing traces of instructions that have already been fetched and decoded, it increases instruction fetch bandwidth and, in the case of the Intel Pentium 4, reduces power consumption. A trace processor is an architecture designed around the trace cache that processes instructions at trace-level granularity.
The earliest academic publication on the trace cache was "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching", a widely acknowledged paper presented by Eric Rotenberg, Steve Bennett, and Jim Smith at the 1996 MICRO conference. An earlier publication is US patent 5381533, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel Corp., a continuation of an application that was filed in 1992 and later abandoned.
Wider superscalar processors must fetch multiple instructions in a single cycle to achieve higher performance. Because of branch and jump instructions, the instructions to be fetched do not always lie in contiguous memory locations (basic blocks), so processors need additional logic and hardware support to fetch and align instructions from non-contiguous basic blocks. If multiple branches are predicted not-taken, the processor can fetch instructions from multiple contiguous basic blocks in a single cycle. However, if any of the branches is predicted taken, the processor must fetch instructions from the taken path in that same cycle, which limits its fetch capability.
Consider the four basic blocks (A, B, C, D) shown in the figure, which correspond to a simple if-else loop. These blocks are stored contiguously in memory as ABCD. If the branch to D is predicted not-taken, the fetch unit can fetch the basic blocks A, B, C, which are placed contiguously. However, if the branch is predicted taken, the fetch unit has to fetch A, B, D, which are not placed contiguously. Fetching such non-contiguous blocks in a single cycle is very difficult, and in situations like this the trace cache comes to the processor's aid.
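The fetch problem in this example can be illustrated with a small Python sketch. It is a toy model rather than a description of any real processor: it assumes that a conventional fetch unit can deliver only blocks that are adjacent in memory within one cycle, and it models a trace cache as a table keyed by a starting block and the predicted outcome of the branch that targets D. The names `fetch_cycles` and `TraceCache` are hypothetical and introduced only for illustration.

```python
# Toy model (an assumption for illustration): the four basic blocks
# are laid out contiguously in memory in the order A, B, C, D.
LAYOUT = ["A", "B", "C", "D"]


def fetch_cycles(path, layout=LAYOUT):
    """Count the cycles needed to fetch `path` when one cycle can only
    deliver blocks that are adjacent in memory (simplifying assumption)."""
    position = {name: index for index, name in enumerate(layout)}
    cycles = 0
    previous = None
    for block in path:
        # A new cycle starts at the first block, or whenever the next
        # block does not directly follow the previous one in memory.
        if previous is None or position[block] != position[previous] + 1:
            cycles += 1
        previous = block
    return cycles


class TraceCache:
    """Hypothetical trace store keyed by (start block, predicted outcomes)."""

    def __init__(self):
        self._traces = {}

    def fill(self, start, outcomes, blocks):
        # Record a dynamic sequence of blocks once it has been fetched.
        self._traces[(start, tuple(outcomes))] = tuple(blocks)

    def lookup(self, start, outcomes):
        # Return the stored trace, or None on a miss.
        return self._traces.get((start, tuple(outcomes)))


print(fetch_cycles(["A", "B", "C"]))  # 1: not-taken path is contiguous
print(fetch_cycles(["A", "B", "D"]))  # 2: taken path skips over C

cache = TraceCache()
cache.fill("A", [True], ["A", "B", "D"])  # True = branch predicted taken
print(cache.lookup("A", [True]))          # ('A', 'B', 'D')
```

In this model the taken path A, B, D costs two fetch cycles from the contiguous layout, whereas a hit in the trace table returns the whole sequence in a single lookup, which is the kind of saving the trace cache is meant to provide.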