Hello everyone, I'm Ray and I'm here to talk a bit about a dump I recently looked at and a little-referenced memory counter called ResAvail Pages (resident available pages).
The problem statement was: The server hangs after a while.
Not terribly informative, but that's where we start with many cases. First some good housekeeping:
0: kd> vertarget
Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533
Machine Name: "ASDFASDF1234"
Kernel base = 0xfffff800`01665000 PsLoadedModuleList = 0xfffff800`018a8670
Debug session time: Thu Aug 8 09:39:26.992 2013 (UTC - 4:00)
System Uptime: 9 days 1:08:39.307
Of course Windows 7 Server == Server 2008 R2.
One of the basic things I check at the beginning of these hang dumps with vague problem statements is the memory information.
0: kd> !vm 21
*** Virtual Memory Usage ***
Physical Memory: 2097038 ( 8388152 Kb)
Page File: \??\C:\pagefile.sys
Current: 12582912 Kb Free Space: 12539700 Kb
Minimum: 12582912 Kb Maximum: 12582912 Kb
Available Pages: 286693 ( 1146772 Kb)
ResAvail Pages: 135 ( 540 Kb)
********** Running out of physical memory **********
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33526408 ( 134105632 Kb)
******* 12 system cache map requests have failed ******
Modified Pages: 4017 ( 16068 Kb)
Modified PF Pages: 4017 ( 16068 Kb)
NonPagedPool Usage: 113241 ( 452964 Kb)
NonPagedPool Max: 1561592 ( 6246368 Kb)
PagedPool 0 Usage: 35325 ( 141300 Kb)
PagedPool 1 Usage: 28162 ( 112648 Kb)
PagedPool 2 Usage: 24351 ( 97404 Kb)
PagedPool 3 Usage: 24350 ( 97400 Kb)
PagedPool 4 Usage: 24516 ( 98064 Kb)
PagedPool Usage: 136704 ( 546816 Kb)
PagedPool Maximum: 33554432 ( 134217728 Kb)
********** 222 pool allocations have failed **********
Session Commit: 6013 ( 24052 Kb)
Shared Commit: 6150 ( 24600 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 1214088 ( 4856352 Kb)
Pages For MDLs: 67 ( 268 Kb)
PagedPool Commit: 136768 ( 547072 Kb)
Driver Commit: 15548 ( 62192 Kb)
Committed pages: 1648790 ( 6595160 Kb)
Commit limit: 5242301 ( 20969204 Kb)
So we're failing to allocate pool, but we aren't out of virtual memory for paged pool or nonpaged pool. Let's look at the breakdown:
0: kd> dd nt!MmPoolFailures l?9
fffff800`01892160  000001be 00000000 00000000 00000002
fffff800`01892170  00000000 00000000 00000000 00000000
fffff800`01892180  00000000
Where:
DWORDs 1-3 = Nonpaged high/medium/low priority failures
DWORDs 4-6 = Paged high/medium/low priority failures
DWORDs 7-9 = Session paged high/medium/low priority failures
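These high/medium/low buckets correspond to the EX_POOL_PRIORITY a driver specifies when it allocates pool. As a rough, hypothetical illustration (not code from this dump), a low-priority nonpaged allocation looks like the following, and a failed attempt bumps the matching failure counter:

#include <ntddk.h>

// Hypothetical driver code for illustration only.
// A low-priority request tells the memory manager this allocation may
// be failed first under memory pressure; when it does fail, the
// corresponding nonpaged low-priority failure counter is incremented.
PVOID AllocateScratchBuffer(SIZE_T Bytes)
{
    PVOID buffer = ExAllocatePoolWithTagPriority(NonPagedPool,
                                                 Bytes,
                                                 'crcS',          // example pool tag
                                                 LowPoolPriority);
    if (buffer == NULL) {
        return NULL;    // caller must be able to handle allocation failure
    }

    RtlZeroMemory(buffer, Bytes);
    return buffer;
}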
So we actually failed both nonpaged AND paged pool allocations in this case. Why? Because we're "Running out of physical memory", obviously. So where does that message come from? In the output above it is triggered by the ResAvail Pages counter.
ResAvail Pages is the amount of physical memory there would be if every working set were trimmed to its minimum size and only what must be resident in RAM was present (e.g. the PFN database, system PTEs, driver images, kernel thread stacks, nonpaged pool, etc.).
Where did this memory go, then? We have plenty of Available Pages (Free + Zero + Standby) for use. So something is claiming memory it isn't actually using. In this type of situation, one of the things I immediately suspect is process working set minimums. A process's working set is essentially the physical memory it is currently using, i.e. its pages that are resident in RAM.
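As a side note, these are the same numbers a process can read about itself from user mode. A minimal sketch (illustration only, not part of the debugging session), assuming you link against psapi.lib:

#include <windows.h>
#include <psapi.h>
#include <stdio.h>

int main(void)
{
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
    SIZE_T minWs = 0, maxWs = 0;

    // Current working set (the "now" value that !process reports).
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        printf("Working set now:     %Iu KB\n", pmc.WorkingSetSize / 1024);

    // The minimum/maximum working set sizes the memory manager tracks.
    if (GetProcessWorkingSetSize(GetCurrentProcess(), &minWs, &maxWs))
        printf("Working set min/max: %Iu KB / %Iu KB\n", minWs / 1024, maxWs / 1024);

    return 0;
}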
So let's check.
0: kd> !process 0 1
<a lot of processes in this output>.
PROCESS fffffa8008f76060
SessionId: 0 Cid: 0adc Peb: 7fffffda000 ParentCid: 0678
DirBase: 204ac9000 ObjectTable: 00000000 HandleCount: 0.
Image: cscript.exe
VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.
DeviceMap fffff8a000008a70
Token fffff8a0046f9c50
ElapsedTime 9 Days 01:08:00.134
UserTime 00:00:00.000
KernelTime 00:00:00.015
QuotaPoolUsage[PagedPool] 0
QuotaPoolUsage[NonPagedPool] 0
Working Set Sizes (now,min,max) (5, 50, 345) (20KB, 200KB, 1380KB)
PeakWorkingSetSize 1454
VirtualSize 65 Mb
PeakVirtualSize 84 Mb
PageFaultCount 1628
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 0
I have only shown one example process above for brevity's sake, but there were thousands returned: 241,423 of them, to be precise. None had an abnormally high working set minimum, but with that many processes even normal minimums add up.
The "now" process working set is lower than the minimum working set. How is that possible? The minimum and maximum are not hard limits, but suggested limits. For example, the minimum working set is honored unless there is memory pressure, in which case a process can be trimmed below it. There is a way to make the minimum and/or maximum a hard limit for a specific process by passing the QUOTA_LIMITS_HARDWS_MIN_ENABLE and/or QUOTA_LIMITS_HARDWS_MAX_ENABLE flags to SetProcessWorkingSetSizeEx.
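A minimal user-mode sketch of that call (the sizes here are made up for illustration and are not taken from this case):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T minWs = 4  * 1024 * 1024;    // 4 MB  (example value)
    SIZE_T maxWs = 32 * 1024 * 1024;    // 32 MB (example value)

    // QUOTA_LIMITS_HARDWS_MIN_ENABLE makes the minimum a hard limit:
    // the memory manager will not trim this process below minWs even
    // under memory pressure. The maximum stays a soft limit here.
    if (!SetProcessWorkingSetSizeEx(GetCurrentProcess(),
                                    minWs, maxWs,
                                    QUOTA_LIMITS_HARDWS_MIN_ENABLE |
                                    QUOTA_LIMITS_HARDWS_MAX_DISABLE))
    {
        printf("SetProcessWorkingSetSizeEx failed: %lu\n", GetLastError());
        return 1;
    }

    printf("Hard working set minimum applied.\n");
    return 0;
}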
You can check whether hard minimum and maximum working set limits are configured by looking at the _EPROCESS->Vm->Flags structure. Note that the numbers below are from another system, as this structure had already been torn down for the processes we were looking at.
0: kd> dt _EPROCESS fffffa8008f76060 Vm
nt!_EPROCESS
+0x398 Vm : _MMSUPPORT
0: kd> dt _MMSUPPORT fffffa8008f76060+0x398
nt!_MMSUPPORT
+0x000 WorkingSetMutex : _EX_PUSH_LOCK
+0x008 ExitGate : 0xfffff880`00961000 _KGATE
+0x010 AccessLog : (null)
+0x018 WorkingSetExpansionLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0xfffffa80`08f3c410 ]
+0x028 AgeDistribution : [7] 0
+0x044 MinimumWorkingSetSize : 0x32
+0x048 WorkingSetSize : 5
+0x04c WorkingSetPrivateSize : 5
+0x050 MaximumWorkingSetSize : 0x159
+0x054 ChargedWslePages : 0
+0x058 ActualWslePages : 0
+0x05c WorkingSetSizeOverhead : 0
+0x060 PeakWorkingSetSize : 0x5ae
+0x064 HardFaultCount : 0x41
+0x068 VmWorkingSetList : 0xfffff700`01080000 _MMWSL
+0x070 NextPageColor : 0x2dac
+0x072 LastTrimStamp : 0
+0x074 PageFaultCount : 0x65c
+0x078 RepurposeCount : 0x1e1
+0x07c Spare : [2] 0
+0x084 Flags : _MMSUPPORT_FLAGS
0: kd> dt _MMSUPPORT_FLAGS fffffa8008f76060+0x398+0x84
nt!_MMSUPPORT_FLAGS
+0x000 WorkingSetType : 0y000
+0x000 ModwriterAttached : 0y0
+0x000 TrimHard : 0y0
+0x000 MaximumWorkingSetHard : 0y0
+0x000 ForceTrim : 0y0
+0x000 MinimumWorkingSetHard : 0y0
+0x001 SessionMaster : 0y0
+0x001 TrimmerState : 0y00
+0x001 Reserved : 0y0
+0x001 PageStealers : 0y0000
+0x002 MemoryPriority : 0y00000000 (0)
+0x003 WsleDeleted : 0y1
+0x003 VmExiting : 0y1
+0x003 ExpansionFailed : 0y0
+0x003 Available : 0y00000 (0)
How about some more detail?
0: kd> !process fffffa8008f76060
PROCESS fffffa8008f76060
SessionId: 0 Cid: 0adc Peb: 7fffffda000 ParentCid: 0678
DirBase: 204ac9000 ObjectTable: 00000000 HandleCount: 0.
Image: cscript.exe
VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.
DeviceMap fffff8a000008a70
Token fffff8a0046f9c50
ElapsedTime 9 Days 01:08:00.134
UserTime 00:00:00.000
KernelTime 00:00:00.015
QuotaPoolUsage[PagedPool] 0
QuotaPoolUsage[NonPagedPool] 0
Working Set Sizes (now,min,max) (5, 50, 345) (20KB, 200KB, 1380KB)
PeakWorkingSetSize 1454
VirtualSize 65 Mb
PeakVirtualSize 84 Mb
PageFaultCount 1628
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 0
No active threads
0: kd> !object fffffa8008f76060
Object: fffffa8008f76060 Type: (fffffa8006cccc90) Process
ObjectHeader: fffffa8008f76030 (new version)
HandleCount: 0 PointerCount: 1
This output shows that the process has no active threads left, but the process object itself (and its 20KB of working set) was still hanging around because a kernel driver had taken a reference on the object that it never released. Sampling other entries shows the server had been leaking process objects like this ever since it booted.
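To picture the kind of bug involved, here is a hypothetical kernel-mode sketch (not the actual offending driver): PsLookupProcessByProcessId returns a referenced EPROCESS pointer, and skipping the matching ObDereferenceObject leaves the process object pinned exactly as we saw above.

#include <ntddk.h>

// Hypothetical driver code for illustration only.
NTSTATUS InspectProcess(HANDLE ProcessId)
{
    PEPROCESS process;
    NTSTATUS status = PsLookupProcessByProcessId(ProcessId, &process);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    // ... work with the process object ...

    // Required cleanup: PsLookupProcessByProcessId took a reference on
    // the object. Forgetting this call is the pointer leak described
    // above; each missed dereference keeps one process object (and its
    // working set charge) alive forever.
    ObDereferenceObject(process);
    return STATUS_SUCCESS;
}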
Unfortunately trying to directly track down pointer leaks on process objects is difficult and requires an instrumented kernel, so we tried to check the easy stuff first before going that route. We know it has to be a kernel driver doing this (since it is a pointer and not a handle leak) so we looked at the list of 3rd party drivers installed. Note: The driver names have been redacted.
0: kd> lm
start end module name
<snip>
fffff880`04112000 fffff880`04121e00 driver1 (no symbols) <-- no symbols usually means 3rd party
fffff880`04158000 fffff880`041a4c00 driver2 (no symbols)
<snip>
0: kd> lmvm driver1
Browse full module list
start end module name
fffff880`04112000 fffff880`04121e00 driver1 (no symbols)
Loaded symbol image file: driver1.sys
Image path: \SystemRoot\system32\DRIVERS\driver1.sys
Image name: driver1.sys
Browse all global symbols functions data
Timestamp: Wed Dec 13 12:09:32 2006 (458033CC)
CheckSum: 0001669E
ImageSize: 0000FE00
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
0: kd> lmvm driver2
Browse full module list
start end module name
fffff880`04158000 fffff880`041a4c00 driver2 (no symbols)
Loaded symbol image file: driver2.sys
Image path: \??\C:\Windows\system32\drivers\driver2.sys
Image name: driver2.sys
Browse all global symbols functions data
Timestamp: Thu Nov 30 12:12:07 2006 (456F10E7)
CheckSum: 0004FE8E
ImageSize: 0004CC00
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
Fortunately for both the customer and us, this turned up a pair of drivers that predated Windows Vista (meaning they were designed for XP/2003), which raised an eyebrow. Of course we needed a more solid evidence link than just "it's an old driver", so I did a quick search of our internal KB. It turned up several other customers who had these same drivers installed, hit the same problem, removed the drivers, and saw the problem go away. That's a pretty good evidence link. We implemented the same plan for this customer, successfully.