Channel: MSDN Blogs

ResAvail Pages and Working Sets


Hello everyone, I'm Ray and I'm here to talk a bit about a dump I recently looked at and a little-referenced memory counter called ResAvail Pages (resident available pages).

 

The problem statement was: "The server hangs after a while."

 

Not terribly informative, but that's where we start with many cases. First some good housekeeping:

 

0: kd> vertarget

Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64

Product: Server, suite: TerminalServer SingleUserTS

Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533

Machine Name: "ASDFASDF1234"

Kernel base = 0xfffff800`01665000 PsLoadedModuleList = 0xfffff800`018a8670

Debug session time: Thu Aug  8 09:39:26.992 2013 (UTC - 4:00)

System Uptime: 9 days 1:08:39.307

 

Of course Windows 7 Server == Server 2008 R2.

 

One of the basic things I check at the beginning of these hang dumps with vague problem statements is the memory information.

 

0: kd> !vm 21

 

*** Virtual Memory Usage ***

Physical Memory:     2097038 (   8388152 Kb)

Page File: \??\C:\pagefile.sys

  Current:  12582912 Kb  Free Space:  12539700 Kb

  Minimum:  12582912 Kb  Maximum:     12582912 Kb

Available Pages:      286693 (   1146772 Kb)

ResAvail Pages:          135 (       540 Kb)

 

********** Running out of physical memory **********

 

Locked IO Pages:           0 (         0 Kb)

Free System PTEs:   33526408 ( 134105632 Kb)

 

******* 12 system cache map requests have failed ******

 

Modified Pages:         4017 (     16068 Kb)

Modified PF Pages:      4017 (     16068 Kb)

NonPagedPool Usage:   113241 (    452964 Kb)

NonPagedPool Max:    1561592 (   6246368 Kb)

PagedPool 0 Usage:     35325 (    141300 Kb)

PagedPool 1 Usage:     28162 (    112648 Kb)

PagedPool 2 Usage:     24351 (     97404 Kb)

PagedPool 3 Usage:     24350 (     97400 Kb)

PagedPool 4 Usage:     24516 (     98064 Kb)

PagedPool Usage:      136704 (    546816 Kb)

PagedPool Maximum:  33554432 ( 134217728 Kb)

 

********** 222 pool allocations have failed **********

 

Session Commit:         6013 (     24052 Kb)

Shared Commit:          6150 (     24600 Kb)

Special Pool:              0 (         0 Kb)

Shared Process:      1214088 (   4856352 Kb)

Pages For MDLs:           67 (       268 Kb)

PagedPool Commit:     136768 (    547072 Kb)

Driver Commit:         15548 (     62192 Kb)

Committed pages:     1648790 (   6595160 Kb)

Commit limit:        5242301 (  20969204 Kb)

 

So we're failing to allocate pool, but we aren't out of virtual memory for paged pool or nonpaged pool.  Let's look at the breakdown:

 

0: kd> dd nt!MmPoolFailures l?9

fffff800`01892160  000001be 00000000 00000000 00000002

fffff800`01892170  00000000 00000000 00000000 00000000

fffff800`01892180  00000000

 

Where, reading the nine values in order:

    values 1-3 = Nonpaged high/medium/low priority failures

    values 4-6 = Paged high/medium/low priority failures

    values 7-9 = Session paged high/medium/low priority failures

 

So we actually failed both nonpaged AND paged pool allocations in this case.  Why?  We're "Running out of physical memory", obviously.  So where does this running out of physical memory message come from?  In the above example this is from the ResAvail Pages counter.

 

ResAvail Pages is the amount of physical memory there would be if every working set was at its minimum size and only what needs to be resident in RAM was present (e.g. PFN database, system PTEs, driver images, kernel thread stacks, nonpaged pool, etc).

 

Where did this memory go then?  We have plenty of Available Pages (Free + Zero + Standby) for use.  So something is claiming memory it isn't actually using.  In this type of situation one of the things I immediately suspect is process working set minimums.  A working set is, roughly, the set of a process's pages that are resident in physical memory.
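All of the counters above are reported in pages, with KB alongside; on x64 Windows a page is 4 KB, so the conversion is straightforward. A quick sanity check of the figures from this dump:

```python
# Sanity-check the page-to-KB conversions in the !vm 21 output above.
# x64 Windows uses 4 KB pages, so KB = pages * 4.
PAGE_SIZE_KB = 4

def pages_to_kb(pages):
    return pages * PAGE_SIZE_KB

# Values taken from the !vm 21 output in this dump:
assert pages_to_kb(2097038) == 8388152   # Physical Memory
assert pages_to_kb(286693) == 1146772    # Available Pages
assert pages_to_kb(135) == 540           # ResAvail Pages
```

Note how stark the gap is: over a gigabyte of Available Pages, but only 540 KB of ResAvail.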

 

So let's check.

 

0: kd> !process 0 1

 

<a lot of processes in this output>.

 

PROCESS fffffa8008f76060

    SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

    DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:   0.

    Image: cscript.exe

    VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

    DeviceMap fffff8a000008a70

    Token                             fffff8a0046f9c50

    ElapsedTime                       9 Days 01:08:00.134

    UserTime                          00:00:00.000

    KernelTime                        00:00:00.015

    QuotaPoolUsage[PagedPool]         0

    QuotaPoolUsage[NonPagedPool]      0

    Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

    PeakWorkingSetSize                1454

    VirtualSize                       65 Mb

    PeakVirtualSize                   84 Mb

    PageFaultCount                    1628

    MemoryPriority                    BACKGROUND

    BasePriority                      8

    CommitCharge                      0

 

I have only shown one example process above for brevity's sake, but there were thousands returned.  241,423 to be precise.  None had abnormally high process working set minimums, but cumulatively their usage adds up.
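The arithmetic shows why ResAvail collapsed even though no single process looked abnormal. A sketch, assuming every leaked process carried the same default 50-page minimum seen in the sample above (the text notes none had abnormally high minimums):

```python
# Why ResAvail Pages hit 135: every process's minimum working set is
# charged against resident available memory, even when the process is
# using far less. Figures below are from this dump.
leaked_processes = 241_423
min_ws_pages = 50            # per-process minimum (50 pages = 200 KB)
page_kb = 4

charged_kb = leaked_processes * min_ws_pages * page_kb
physical_kb = 8_388_152      # Physical Memory from !vm

print(f"charged by working-set minimums: {charged_kb} KB "
      f"(~{charged_kb / 1024**2:.1f} GB)")
print(f"physical memory:                 {physical_kb} KB "
      f"(~{physical_kb / 1024**2:.1f} GB)")
```

The minimums alone lay claim to roughly 46 GB of resident-memory guarantee on an 8 GB box, which is exactly the condition ResAvail Pages is designed to account for.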

 

The “now” process working set is lower than the minimum working set.  How is that possible?  Well, the minimum and maximum are not hard limits, but suggested limits.  For example, the minimum working set is honored unless there is memory pressure, in which case it can be trimmed below this value.  There is a way to set the min and/or max as hard limits on specific processes by passing the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag to SetProcessWorkingSetSizeEx.
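For illustration, here is a hedged sketch of requesting a hard minimum from user mode. The QUOTA_LIMITS_HARDWS_* values are the documented winnt.h flags; the actual call only runs on Windows, so the function is guarded:

```python
import ctypes
import sys

# QUOTA_LIMITS_* flags as documented in winnt.h.
QUOTA_LIMITS_HARDWS_MIN_ENABLE  = 0x00000001  # min working set is a hard limit
QUOTA_LIMITS_HARDWS_MIN_DISABLE = 0x00000002
QUOTA_LIMITS_HARDWS_MAX_ENABLE  = 0x00000004  # max working set is a hard limit
QUOTA_LIMITS_HARDWS_MAX_DISABLE = 0x00000008

def set_hard_minimum_ws(min_bytes, max_bytes):
    """Ask the OS to treat this process's minimum working set as a hard limit.

    Only meaningful on Windows; elsewhere this sketch is a no-op.
    """
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.windll.kernel32
    return bool(kernel32.SetProcessWorkingSetSizeEx(
        kernel32.GetCurrentProcess(),
        ctypes.c_size_t(min_bytes),
        ctypes.c_size_t(max_bytes),
        QUOTA_LIMITS_HARDWS_MIN_ENABLE))
```

With the hard-minimum flag set, the memory manager will not trim the process below its minimum even under pressure, which makes the charge against ResAvail Pages a real, unreclaimable reservation.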

 

You can check whether the minimum and maximum working set values are configured as hard limits in the _EPROCESS->Vm->Flags structure.  Note these numbers are from another system, as this structure had already been torn down for the processes we were looking at.

 

0: kd> dt _EPROCESS fffffa8008f76060 Vm

nt!_EPROCESS

   +0x398 Vm : _MMSUPPORT

0: kd> dt _MMSUPPORT fffffa8008f76060+0x398

nt!_MMSUPPORT

   +0x000 WorkingSetMutex  : _EX_PUSH_LOCK

   +0x008 ExitGate         : 0xfffff880`00961000 _KGATE

   +0x010 AccessLog        : (null)

   +0x018 WorkingSetExpansionLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0xfffffa80`08f3c410 ]

   +0x028 AgeDistribution  : [7] 0

   +0x044 MinimumWorkingSetSize : 0x32

   +0x048 WorkingSetSize   : 5

   +0x04c WorkingSetPrivateSize : 5

   +0x050 MaximumWorkingSetSize : 0x159

   +0x054 ChargedWslePages : 0

   +0x058 ActualWslePages  : 0

   +0x05c WorkingSetSizeOverhead : 0

   +0x060 PeakWorkingSetSize : 0x5ae

   +0x064 HardFaultCount   : 0x41

   +0x068 VmWorkingSetList : 0xfffff700`01080000 _MMWSL

   +0x070 NextPageColor    : 0x2dac

   +0x072 LastTrimStamp    : 0

   +0x074 PageFaultCount   : 0x65c

   +0x078 RepurposeCount   : 0x1e1

   +0x07c Spare            : [2] 0

   +0x084 Flags            : _MMSUPPORT_FLAGS

0: kd> dt _MMSUPPORT_FLAGS fffffa8008f76060+0x398+0x84

nt!_MMSUPPORT_FLAGS

   +0x000 WorkingSetType   : 0y000

   +0x000 ModwriterAttached : 0y0

   +0x000 TrimHard         : 0y0

   +0x000 MaximumWorkingSetHard : 0y0

   +0x000 ForceTrim        : 0y0

   +0x000 MinimumWorkingSetHard : 0y0

   +0x001 SessionMaster    : 0y0

   +0x001 TrimmerState     : 0y00

   +0x001 Reserved         : 0y0

   +0x001 PageStealers     : 0y0000

   +0x002 MemoryPriority   : 0y00000000 (0)

   +0x003 WsleDeleted      : 0y1

   +0x003 VmExiting        : 0y1

   +0x003 ExpansionFailed  : 0y0

   +0x003 Available        : 0y00000 (0)
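The hard-limit bits live in the first byte of _MMSUPPORT_FLAGS. As a sketch, with bit positions read off the Windows 7 dt layout above (other builds may differ), that byte can be decoded like this:

```python
# Decode byte 0 of _MMSUPPORT_FLAGS (bit positions taken from the
# Windows 7 dt output above; the layout is build-specific).
def decode_mmsupport_flags_byte0(b):
    return {
        "WorkingSetType":        b & 0b111,       # bits 0-2
        "ModwriterAttached":     bool(b & 0x08),  # bit 3
        "TrimHard":              bool(b & 0x10),  # bit 4
        "MaximumWorkingSetHard": bool(b & 0x20),  # bit 5
        "ForceTrim":             bool(b & 0x40),  # bit 6
        "MinimumWorkingSetHard": bool(b & 0x80),  # bit 7
    }

# The sample process above had all of byte 0 clear: no hard limits set.
flags = decode_mmsupport_flags_byte0(0x00)
assert not flags["MinimumWorkingSetHard"]
assert not flags["MaximumWorkingSetHard"]
```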

 

How about some more detail?

 

0: kd> !process fffffa8008f76060

PROCESS fffffa8008f76060

    SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

    DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:  0.

    Image: cscript.exe

    VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

    DeviceMap fffff8a000008a70

    Token                             fffff8a0046f9c50

    ElapsedTime                       9 Days 01:08:00.134

    UserTime                          00:00:00.000

    KernelTime                        00:00:00.015

    QuotaPoolUsage[PagedPool]         0

    QuotaPoolUsage[NonPagedPool]      0

    Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

    PeakWorkingSetSize                1454

    VirtualSize                       65 Mb

    PeakVirtualSize                   84 Mb

    PageFaultCount                    1628

    MemoryPriority                    BACKGROUND

    BasePriority                      8

    CommitCharge                      0

 

No active threads

 

0: kd> !object fffffa8008f76060

Object: fffffa8008f76060  Type: (fffffa8006cccc90) Process

    ObjectHeader: fffffa8008f76030 (new version)

    HandleCount: 0  PointerCount: 1

 

The key details above are "No active threads" together with HandleCount: 0 and PointerCount: 1.  This process had no active threads left, but the process object itself (and its 20KB working set use) was still hanging around because a kernel driver held a reference to the object that it never released.  Sampling other entries showed the server had been leaking process objects this way since it was booted.
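A toy model of the lifetime rule at work here (hypothetical class and method names; the real logic lives in the kernel's object manager): an object is freed only when both its handle count and its pointer count drop to zero, so a single leaked kernel reference pins the process object, and its working-set-minimum charge, forever.

```python
# Toy model (hypothetical names) of object-manager lifetime: the object
# is freed only when handle count and pointer count both reach zero.
class KernelObject:
    def __init__(self):
        self.handle_count = 0
        self.pointer_count = 1   # creation takes one reference
        self.freed = False

    def dereference(self):       # cf. ObDereferenceObject
        self.pointer_count -= 1
        self._maybe_free()

    def close_handle(self):
        self.handle_count -= 1
        self._maybe_free()

    def _maybe_free(self):
        if self.handle_count == 0 and self.pointer_count == 0:
            self.freed = True

# The leaking driver took a reference it never released:
proc = KernelObject()
proc.pointer_count += 1   # driver's extra reference
proc.dereference()        # process exit drops the original reference
assert not proc.freed     # HandleCount 0, PointerCount 1 -> object lingers
```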

 

Unfortunately trying to directly track down pointer leaks on process objects is difficult and requires an instrumented kernel, so we tried to check the easy stuff first before going that route.  We know it has to be a kernel driver doing this (since it is a pointer and not a handle leak) so we looked at the list of 3rd party drivers installed.  Note: The driver names have been redacted.

 

0: kd> lm

start             end                 module name

<snip>

fffff880`04112000 fffff880`04121e00   driver1    (no symbols)   <-- no symbols usually means 3rd party       

fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

<snip>

 

0: kd> lmvm driver1   

Browse full module list

start             end                 module name

fffff880`04112000 fffff880`04121e00   driver1    (no symbols)          

    Loaded symbol image file: driver1.sys

    Image path: \SystemRoot\system32\DRIVERS\driver1.sys

    Image name: driver1.sys

    Browse all global symbols  functions  data

    Timestamp:        Wed Dec 13 12:09:32 2006 (458033CC)

    CheckSum:         0001669E

    ImageSize:        0000FE00

    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

0: kd> lmvm driver2

Browse full module list

start             end                 module name

fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

    Loaded symbol image file: driver2.sys

    Image path: \??\C:\Windows\system32\drivers\driver2.sys

    Image name: driver2.sys

    Browse all global symbols  functions  data

    Timestamp:        Thu Nov 30 12:12:07 2006 (456F10E7)

    CheckSum:         0004FE8E

    ImageSize:        0004CC00

    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

 

Fortunately for both the customer and us, we turned up a pair of drivers that predated Windows Vista (meaning they were designed for XP/2003), which raised an eyebrow.  Of course we needed more solid evidence than just "it's an old driver", so I did a quick search of our internal KB.  It turned up several other customers who had these same drivers installed, experienced the same problem, removed the drivers, and saw the problem go away.  That is a pretty good evidence link.  We implemented the same plan for this customer, successfully.

