Sometimes a process hang is caused by doing something dangerous while holding the loader lock, if you can get a dump file, it is pretty easy to run !locks command to find out the locks causing the trouble. It can be a little bit complicated if one of the locking mechanism is not a critical section(for example, event or mutex), but it is still manageable by looking for similar function calls among the stacks. However today I ran into a dump file from a hung process which shows the loader lock was locked by a thread that has been terminated and being left orphaned:
0:000> !locks CritSec ntdll!LdrpLoaderLock+0 at 00007ffa0aa3d8a8 WaiterWoken No LockCount 9 RecursionCount 1 OwningThread 710 EntryCount 0 ContentionCount 39 *** Locked Scanned 76 critical sections 0:000> ~~k ^ Illegal thread error in '~~k'
Most likely, thread 710 was terminated while it was holding the loader lock. Because of this, there were many other threads in same process were being blocked and waiting for it to be unlocked, which unfortunately will never happen. Since the thread 710 no longer exists, there was no TEB, we don’t know what it was running when being terminated, and most importantly, we don’t know who terminated it. If you have worked on certain amount of crash dumps, eventually you will run into some internal NT structures, one of which is _CLIENT_ID:
0:000> dt _CLIENT_ID combase!_CLIENT_ID +0x000 UniqueProcess : Ptr64 Void +0x008 UniqueThread : Ptr64 Void
It is simply a structure of process id and thread id, and it is being used by many other structures like _TEB. It has been a technique to search them in memory in order to identify the parent structure that contains them. Let’s try this here to see what we can get.
0:000> | . 0 id: 608 examine name: C:\Program Files\ABC\xyz.exe 0:000> s 0 L?ffffffffffffffff 00 00 00 00 08 06 00 00 00 00 00 00 10 07 00 00 00007ffa`07eb6214 00 00 00 00 08 06 00 00-00 00 00 00 10 07 00 00 ................
First we use “|” command to get process id 608, then use the “s” command to search for memory that can be interpreted as a _CLIENT_ID structure that has 608 as process id and 710 as thread id. “s” command did return an address that contains matching data, let’s find out what it is:
0:000> !address 00007ffa`07eb6214 ... Usage: Image Base Address: 00007ffa`07eb3000 End Address: 00007ffa`07eb7000 Region Size: 00000000`00004000 State: 00001000 MEM_COMMIT Protect: 00000004 PAGE_READWRITE Type: 01000000 MEM_IMAGE Allocation Base: 00007ffa`07dc0000 Allocation Protect: 00000080 PAGE_EXECUTE_WRITECOPY Image Path: C:\Windows\System32\KERNELBASE.dll Module Name: KERNELBASE Loaded Image Name: KERNELBASE.dll Mapped Image Name: More info: lmv m KERNELBASE More info: !lmi KERNELBASE More info: ln 0x7ffa07eb6214 More info: !dh 0x7ffa07dc0000 0:000> ln 0x7ffa07eb6214 (00007ffa`07eb6200) KERNELBASE!BaseTerminatedLoaderLockOwner+0x14 | (00007ffa`07eb6268) KERNELBASE!BaseDataFileHandleTableElementCount
It is very surprising, the memory found belongs to a global data structure of KERNELBASE.dll, and based on the name given by symbol, looks like Windows is tracking the threads if they were terminated when holding the loader lock. Could it save the stack as well?
0:000> dps 00007ffa`07eb6200 00007ffa`07eb6200 00000000`00000001 00007ffa`07eb6208 00000000`00000608 00007ffa`07eb6210 00000000`00000c64 00007ffa`07eb6218 00000000`00000608 00007ffa`07eb6220 00000000`00000710 00007ffa`07eb6228 00007ffa`07e64263 KERNELBASE!TerminateThread+0xaf 00007ffa`07eb6230 00007ff9`fa494995 somedll+0x4995 00007ffa`07eb6238 00007ff9`fa492421 somedll+0x2421 00007ffa`07eb6240 00007ff9`fa4924e9 somedll+0x24e9 00007ffa`07eb6248 00007ffa`0a1113d2 kernel32!BaseThreadInitThunk+0x22 00007ffa`07eb6250 00007ffa`0a9803c4 ntdll!RtlUserThreadStart+0x34 00007ffa`07eb6258 00000000`00000000 00007ffa`07eb6260 00000000`00000000 00007ffa`07eb6268 00000000`00000000 00007ffa`07eb6270 00000000`00000000 00007ffa`07eb6278 00000000`00000000
It did! After fixing symbols for somedll and re-run dps command, I received the full stack of the thread when it is calling TerminateThread when the thread being terminated was holding loader lock, it’s time to notify the team responsible for somedll.
Bonus read: Raymond has another two excellent articles about finding ghost threads causing crashing from kernel or user dumps:
How to view the stack of threads that were terminated as part of process teardown from the kernel debugger
64-bit calling convention means that under most scenarios, the first four parameters are passed using rcx, rdx, r8 and r9 registers rather than using stack. It’s difficult to find out actual parameters passed to functions in the calling stack from a 64-bit crash dump. CodeMachine has a great article X64 Deep Dive on this topic, it also talks about x64 compiler optimization, exception handling, parameter passing as well as parameter retrieval. It is a must read if crash dump analysis is part of your job.
There are lots of times we want to find out “this” pointer from the crash dump. “this” pointer is passed using rcx, and if the function accesses class members, very likely rcx will be saved into rsi or rdi after function’s prolog. I created a MFC application using VS2012 project wizard, inserted DebugBreak() at the end of CMFCApplication1App::InitInstance, built it using x64|Release configuration, ran it, save the crash dump and open it with WinDbg, this is what the initial stack looks like:
0:000> kL Child-SP RetAddr Call Site 000000cc`aa65e468 000007fb`01df12d2 ntdll!NtWaitForMultipleObjects+0xa 000000cc`aa65e470 000007fb`0211d20e KERNELBASE!WaitForMultipleObjectsEx+0xe5 000000cc`aa65e750 000007fb`0211cfd2 kernel32!WerpReportFaultInternal+0x1fa 000000cc`aa65e7f0 000007fb`01e6fc87 kernel32!WerpReportFault+0x42 000000cc`aa65e820 000007fb`04db9183 KERNELBASE!UnhandledExceptionFilter+0x1d7 000000cc`aa65e920 000007fb`04ce4fea ntdll! ?? ::FNODOBFM::`string'+0x9e3 000000cc`aa65e950 000007fb`04ce464d ntdll!_C_specific_handler+0x8e 000000cc`aa65e9c0 000007fb`04ce567c ntdll!RtlpExecuteHandlerForException+0xd 000000cc`aa65e9f0 000007fb`04cc4bba ntdll!RtlDispatchException+0x392 000000cc`aa65f100 000007fb`01e9478a ntdll!KiUserExceptionDispatch+0x2e 000000cc`aa65f828 000007f7`5be74c8b KERNELBASE!DebugBreak+0x2 000000cc`aa65f830 000007f7`5c059a09 MFCApplication1!CMFCApplication1App::InitInstance+0x2cb 000000cc`aa65f920 000007f7`5c0417fc MFCApplication1!AfxWinMain+0x75 000000cc`aa65f960 000007fb`020d167e MFCApplication1!__tmainCRTStartup+0x148 000000cc`aa65f9a0 000007fb`04ce3501 kernel32!BaseThreadInitThunk+0x1a 000000cc`aa65f9d0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
Let’s say I want to know what does MFCApplication1!CMFCApplication1App look like at this time. First let us disassemble first few instructions in CMFCApplication1App::InitInstance:
0:000> u MFCApplication1!CMFCApplication1App::InitInstance l10 MFCApplication1!CMFCApplication1App::InitInstance [d:\temp\mfcapplication1\mfcapplication1\mfcapplication1.cpp @ 73]: 000007f7`5be749c0 488bc4 mov rax,rsp 000007f7`5be749c3 55 push rbp 000007f7`5be749c4 488d68a1 lea rbp,[rax-5Fh] 000007f7`5be749c8 4881ece0000000 sub rsp,0E0h 000007f7`5be749cf 48c745affeffffff mov qword ptr [rbp-51h],0FFFFFFFFFFFFFFFEh 000007f7`5be749d7 48895810 mov qword ptr [rax+10h],rbx 000007f7`5be749db 48897018 mov qword ptr [rax+18h],rsi 000007f7`5be749df 48897820 mov qword ptr [rax+20h],rdi 000007f7`5be749e3 488b0536f72b00 mov rax,qword ptr [MFCApplication1!__security_cookie (000007f7`5c134120)] 000007f7`5be749ea 4833c4 xor rax,rsp 000007f7`5be749ed 48894547 mov qword ptr [rbp+47h],rax 000007f7`5be749f1 488bf1 mov rsi,rcx <-- rcx was saved to rsi here 000007f7`5be749f4 c745b708000000 mov dword ptr [rbp-49h],8 000007f7`5be749fb c745bbff000000 mov dword ptr [rbp-45h],0FFh 000007f7`5be74a02 488d4db7 lea rcx,[rbp-49h] 000007f7`5be74a06 ff157c961f00 call qword ptr [MFCApplication1!_imp_InitCommonControlsEx (000007f7`5c06e088)]
We know that “this” pointer of MFCApplication1!CMFCApplication1App object was passed using rcx, rcx was saved into rsi not far from the prolog, and since the crash was happening inside this function, let’s restore the crash context record and check whether we can get MFCApplication1!CMFCApplication1App pointer from rsi:
0:000> .ecxr rax=0000000000000001 rbx=0000000000000000 rcx=000007fb04b04aaa rdx=0000000000000000 rsi=000007f75c1418b0 rdi=000000ccaa860ba0 rip=000007fb01e9478a rsp=000000ccaa65f828 rbp=000000ccaa65f8b9 r8=000000ccaa65f828 r9=000000ccaa65f8b9 r10=0000000000000000 r11=0000000000000202 r12=000007f75c1418b0 r13=0000000000000000 r14=000007f75c1418b0 r15=00000000ffffffff iopl=0 nv up ei pl nz na pe nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202 KERNELBASE!DebugBreak+0x2: 000007fb`01e9478a cc int 3 0:000> dt /r MFCApplication1!CMFCApplication1App @rsi +0x000 __VFN_table : 0x000007f7`5c0df898 =000007f7`5c074d10 classCObject : CRuntimeClass +0x000 m_lpszClassName : 0x000007f7`5c074d40 "CObject" +0x008 m_nObjectSize : 0n8 +0x00c m_wSchema : 0xffff +0x010 m_pfnCreateObject : (null) +0x018 m_pBaseClass : (null) +0x020 m_pNextClass : (null) +0x028 m_pClassInit : (null) =000007f7`5c071698 classCCmdTarget : CRuntimeClass +0x000 m_lpszClassName : 0x000007f7`5c071808 "CCmdTarget" +0x008 m_nObjectSize : 0n64 +0x00c m_wSchema : 0xffff +0x010 m_pfnCreateObject : (null) +0x018 m_pBaseClass : 0x000007f7`5c074d10 CRuntimeClass +0x000 m_lpszClassName : 0x000007f7`5c074d40 "CObject" +0x008 m_nObjectSize : 0n8 +0x00c m_wSchema : 0xffff +0x010 m_pfnCreateObject : (null) +0x018 m_pBaseClass : (null) +0x020 m_pNextClass : (null) +0x028 m_pClassInit : (null) +0x020 m_pNextClass : (null) +0x028 m_pClassInit : (null) ... +0x180 m_strRegSection : ATL::CStringT<wchar_t,StrTraitMFC<wchar_t,ATL::ChTraitsCRT<wchar_t> > > +0x000 m_pszData : 0x000000cc`aa851548 "Workspace" ... +0x1cc m_bHiColorIcons : 0n1
It looks good. Of course it is completely possible that after the copying “this” pointer from rcx, rsi was changed again before the crash. The only way be sure about it is to disassemble the function from start to the crash point and read the assembly code. I made a bet here that rsi wan’t changed, and I will dump the complete structure to make sure it looks intact.
Another tricky part of 64-bit crash dump analysis is that executable images are as important as symbols. Many times WinDbg needs access to the binaries to be able to fully restore the call stack, there are lots more function information stored in 64-bit binaries than 32-bit binaries. If images are not available, it is still possible to manually construct the stack by walking through output from dps , but it is going to be very painful since rbp is no longer used as frame base pointer.