*** Welcome to piglix ***

Interrupt storm


In operating systems, an interrupt storm is an event during which a processor receives an inordinate number of interrupts that consume the majority of the processor's time. Interrupt storms are typically caused by hardware devices that do not support interrupt rate limiting.

Because interrupt processing is typically a non-preemptible task in time-sharing operating systems, an interrupt storm will cause sluggish response to user input, or even appear to freeze the system completely. This state is commonly known as live lock. In such a state, the system is spending most of its resources processing interrupts instead of completing other work. To the end-user, it does not appear to be processing anything at all as there is often no output. An interrupt storm is sometimes mistaken for thrashing, since they both have similar symptoms (unresponsive or sluggish response to user input, little or no output).

Common causes include: misconfigured or faulty hardware, faulty device drivers, flaws in the operating system, or metastability in one or more components. The latter condition rarely occurs outside of prototype or amateur-built hardware.

Most modern hardware and operating systems have methods for mitigating the effect of an interrupt storm. For example, most Ethernet controllers implement interrupt "rate limiting", which causes the controller to wait a programmable amount of time between each interrupt it generates. When not present within the device, similar functionality is usually written into the device driver, and/or the operating system itself.

The most common cause is when a device "behind" another signals an interrupt to a APIC (Advanced Programmable Interrupt Controller). Most computer peripherals generate interrupts through an APIC as the number of interrupts is most always less (typically 15 for the modern PC) than the number of devices. The OS must then query each driver registered to that interrupt to ask if the interrupt originated from its hardware. Faulty drivers may always claim "yes", causing the OS to not query other drivers registered to that interrupt (only one interrupt can be processed at a time). The device which originally requested the interrupt therefore does not get its interrupt serviced, so a new interrupt is generated (or is not cleared) and the processor becomes swamped with continuous interrupt signals. Any operating system can live lock under an interrupt storm caused by such a fault. A kernel debugger can usually break the storm by unloading the faulty driver, allowing the driver "underneath" the faulty one to clear the interrupt, if user input is still possible.


...
Wikipedia

...