Tuesday, August 28, 2007

Those Dang DPCs Clogging the MMCSS

Vista's funky networking performance amid multimedia playback elicited a reply from Microsoft's own Mark Russinovich:

Besides activity by other threads, media playback can also be affected by network activity. When a network packet arrives at [the] system, it triggers a CPU interrupt, which causes the device driver for the device at which the packet arrived to execute an Interrupt Service Routine (ISR). Other device interrupts are blocked while ISRs run, so ISRs typically do some device book-keeping and then perform the more lengthy transfer of data to or from their device in a Deferred Procedure Call (DPC) that runs with device interrupts enabled. While DPCs execute with interrupts enabled, they take precedence over all thread execution, regardless of priority, on the processor on which they run, and can therefore impede media playback threads.

Network DPC receive processing is among the most expensive, because it includes handing packets to the TCP/IP driver, which can result in lengthy computation. The TCP/IP driver verifies each packet, determines the packet’s protocol, updates the connection state, finds the receiving application, and copies the received data into the application’s buffers.

Mark goes on to show that copying a file from one machine to another consumes a staggering 41% of the available processor. In Joey's words, that is horrid and just an awful situation.

Like Vista, Linux separates interrupt handling into two distinct components, a top half (the ISR) and a bottom half. The bottom half is a mechanism for deferring work away from the interrupt handler (see Chapter 7 in Linux Kernel Development). Vista's DPC mechanism is a bottom half implementation that sounds similar to Linux's workqueues, which allow the deferment of work from one context (typically interrupt) to another. As with the DPC mechanism, workqueues run in process context, with interrupts enabled, generally (although not necessarily) with priority over other tasks on the system. Workqueues, as with DPCs, are well-suited for deferring the processing of networking work from the ISR to a later point, when interrupts are enabled.
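To make the top half/bottom half split concrete, here is a toy model in Python. It is only a sketch: `ToyNic`, its method names, and the `budget` parameter are hypothetical names of mine, not the API of any kernel, and the uppercase conversion merely stands in for real protocol processing.

```python
from collections import deque


class ToyNic:
    """Toy model of split interrupt handling; not real kernel code."""

    def __init__(self):
        self.pending = deque()  # packets acknowledged but not yet processed
        self.delivered = []     # packets handed to the (imaginary) protocol stack

    def top_half(self, packet):
        # Interrupt context: do the bare minimum (record the packet) and return.
        self.pending.append(packet)

    def bottom_half(self, budget):
        # Deferred context: do the expensive work, at most `budget` packets per run.
        done = 0
        while self.pending and done < budget:
            packet = self.pending.popleft()
            self.delivered.append(packet.upper())  # stand-in for protocol processing
            done += 1
        return done


nic = ToyNic()
for pkt in ["syn", "ack", "data"]:
    nic.top_half(pkt)

processed = nic.bottom_half(budget=2)
print(processed, nic.delivered, list(nic.pending))
```

The point of the split is that `top_half` stays cheap no matter how expensive the deferred work is; whatever the bottom half does not get to simply waits in the queue for the next run.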

Unlike DPCs, however, the Linux parallel does not consume nearly half of your CPU. In fact, in repeated tests involving both "copying a large file from another system" and a simple unabated ping flood, I was unable to consume any tangible amount of processor time. That is, Linux can achieve high utilization of a GigE network interface with only minimal CPU usage.

Critical optimizations such as zero-copy aside, there is no good reason why processing IP packets should affect the system so badly. This absolutely abysmal networking performance should therefore be treated as an issue in and of itself. Unfortunately, however, the Windows developers decided to focus on a secondary effect:

Tests of [Multimedia Class Scheduler Service (MMCSS), a mechanism for the automatic priority-enhancement of multimedia playback,] during Vista development showed that, even with thread-priority boosting, heavy network traffic can cause enough long-running DPCs to prevent playback threads from keeping up with their media streaming requirements, resulting in glitching.

In other words, consuming half of your processor is (surprise!) detrimental to multimedia playback performance. At this point, it becomes clear that the process scheduler folks and the networking folks are bitter enemies and do not converse. Consequently, the obvious solution of fixing the abhorrent networking performance was bypassed for a quick band-aid:

MMCSS’ glitch-resistant mechanisms were therefore extended to include throttling of network activity. It does so by issuing a command to the NDIS device driver, which is the driver that gives packets received by network adapter drivers to the TCP/IP driver, that causes NDIS to “indicate”, or pass along, at most 10 packets per millisecond (10,000 packets per second).
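As a rough illustration of what a fixed cap like that does under load, here is a small simulation. It is a sketch under my own assumptions, not NDIS's actual implementation: the function name `indicate_with_cap` is hypothetical, and I assume packets over the cap are held back for the next millisecond rather than dropped, which the quote does not specify.

```python
def indicate_with_cap(arrivals_per_ms, cap_per_ms=10):
    """Return (indicated, backlog) totals for a list of per-millisecond arrival counts.

    Assumption: packets over the cap in a given millisecond are carried over
    to the next millisecond instead of being dropped.
    """
    indicated = 0
    backlog = 0
    for arrived in arrivals_per_ms:
        backlog += arrived
        passed = min(backlog, cap_per_ms)  # never pass along more than the cap
        indicated += passed
        backlog -= passed
    return indicated, backlog


# One second of traffic at a hypothetical 80 packets per millisecond.
indicated, backlog = indicate_with_cap([80] * 1000)
print(indicated, backlog)

# One second of light traffic, well under the cap.
light_indicated, light_backlog = indicate_with_cap([3] * 1000)
print(light_indicated, light_backlog)
```

Under the heavy load the cap passes along exactly 10,000 packets in the second and leaves the rest waiting; under the light load it has no effect at all.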

Putting aside the larger problem for the moment, there are several issues with this solution. It prioritizes multimedia playback over networking performance, which, as the resulting clamor has shown, is not everyone's personal policy preference. It is almost assuredly a layering violation. It picks a fixed and hard-coded packet limit (ten per millisecond), which won't scale across different hardware: think significantly faster processors or substantially slower networking drivers. It ignores how common GigE has become. And, finally, the solution is complicated, as the convoluted description and resulting bugs in the implementation demonstrate.
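A bit of arithmetic shows why the fixed limit sits so poorly with GigE. The sketch below computes the best-case throughput under a cap of 10,000 packets per second. The packet sizes are my own illustrative assumptions (the quote gives none), the helper name is hypothetical, and header overhead is ignored.

```python
PACKETS_PER_SECOND_CAP = 10 * 1000  # 10 packets per millisecond, from the quote


def capped_throughput_mbit(packet_bytes, cap_pps=PACKETS_PER_SECOND_CAP):
    """Upper bound on throughput in Mbit/s for a given packet size under the cap."""
    return cap_pps * packet_bytes * 8 / 1_000_000


# Assumed packet sizes, in bytes; 1500 is the usual Ethernet MTU.
for size in (64, 576, 1500):
    mbit = capped_throughput_mbit(size)
    print(f"{size:>5} bytes: {mbit:7.2f} Mbit/s, {mbit / 1000:.1%} of a 1 Gbit/s link")
```

Even with full 1500-byte packets, the cap works out to 120 Mbit/s, or roughly an eighth of what a GigE link can carry. With smaller packets the ceiling is lower still.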

Moreover, I can only imagine how this solution performs while streaming video over the network.

Mr Russinovich concludes:

The hard-coded limit was short-sighted with respect to today’s systems that have faster CPUs, multiple cores and Gigabit networks, and in addition to fixing the bug that affects throttling on multi-adapter systems, the networking team is actively working with the MMCSS team on a fix that allows for not so dramatically penalizing network traffic, while still delivering a glitch-resistant experience.

We shall see. Vista is nowhere near ready for deployment, and adopters should, as always, wait until several service packs and the server variant of the OS have been released before upgrading. In the meantime, let me recommend an alternative or two.

Curious about more of Vista's internals? Read Mark's three-part exposé, Inside the Windows Vista Kernel: Part 1, 2, and 3. Mark is also the co-author of Microsoft Windows Internals, an excellent tome on the design of Windows XP and Windows Server 2003.