Use Windows Performance Toolkit to Trackdown Slow Service Start Problem
- ETW: Event Tracing for Windows
- WPT: Windows Performance Toolkit
- ADK: Windows Assessment and Deployment Kit
Bruce Dawson has an excellent article “Xperf Wait Analysis–Finding Idle Time” which explains how to use Windows Performance Toolkit (WPT) to analyze CPU idle issue. He actually has a series of articles about WPT and I would recommend read all of them if you are new to WPT.
Our product has several Windows services, and to reduce boot time impact to the computer, we have an internal guideline that the each service should be started in 0.5 second during boot. One day I received report that a service took more than 1 second to start, they supplied the boot tracing etl file collected by XbootMgr (part of WPT and now replaced by Windows Performance Recorder in Windows Assessment and Deployment Kit). I opened the etl file, loaded symbols, opened services table and filtered to the service in question:
- Container Init Time: this starts when the service process being created by services.exe, ends when the service process calls StartServiceCtrlDispatcher; if there are multiple services sharing same host process, Container Init Time will only be counted for the first service.
- Service Init Time: this starts when ServiceMain was called, ends when service called SetServiceStatus with status SERVICE_RUNNING.
- Duration: this is how long service took to start, the value came from “Ended At” minus “Started At”, or “Container Init Time” plus “Service Init Time”.
From the table, we can see that MyService did take 1.079s to start, most of the time was spent in Container Init Time, this means we should make our focus on looking at what the process did at its start up. Fortunately the etl file has ReadyThead (please see this article for details) information and stack-walking was enabled on it, let’s try to find out what is its first thread and what it did.
Zoom Into The Range
Before jumping to ReadyThread, it is recommended to zoom into the time range we are interested in, it not only reduces noises but also improves the data loading and ordering performance. Since I already filtered the data to MyService only, there will be a vertical blue bar in the graph, you can right click on it and select “Zoom” command; alternatively you can right click anywhere on the graph, select “Select Time Range…” and fill in Start and End manually. For example, I set the start and end to same value as MyService start and end time since I will only be interested in data occurred during this time frame:
First Thread in MyService.exe
First, I need to find out which is the first thread in MyService.exe. Let’s open ReadThread table, filter selection to MyService.exe, make “SwitchInTime(s)” and “NewThreadId” the second and third column, order data by “SwitchInTime(s)”, and here is what I saw:
Again, please see this article for information about the meaning of these columns.
Here not only we know thread 1592 was first thread ran in MyService.exe, but also we noticed something was wrong: thread 1592 was first switched into CPU to be run at 13.207940543s, the next time it was switched in was at 14.080725766s, more than 0.8s after the first switch in. It looks like it was blocked and waiting for some other threads. Let’s re-order the columns to find out which thread awoke (also known as “readied”) thread 1592:
From the graph above, thread 672 in services.exe is the readying thread of thread 1592. Also from the stack, we can tell that services.exe created thread 1592 (in suspended status) at around 13.207940543s, but didn’t call ResumeThread to let it run until at about 14.080671757s (column ReadyTime(s) in the graph), was it blocked too? Let’s take a look at what’s happening in thread 672 of services.exe.
We saw similar thing was happening in thread 672 as well: after being switched into CPU at 13.208368780s, the next switch in happened at 14.080603959s, again 0.8s after last switch in. Its readying thread was thread 800, in same process. It’s time to take a look at that thread, and it started feeling like falling into a rabbit hole…
I expanded thread 800 in services.exe and following is a snippet of switches from it:
It looks actually pretty normal, the switch in time here was distributed evenly without any noticeable gap from 13.20s all the way down to 14.08s. It looks like a dead end. Or is it?
Switch To CPU Busy Diagnostic
So far we are using ReadyThread to diagnostic CPU idle issue, which are caused by thread was being blocked for too long. If the search reached to a thread without visible delay or blocking, it is time to switch to CPU busy diagnostic model. Most often people are using CPU sampling graph (also known as profiling) for CPU busy issues, here we can also use ReadyThread to investigate which function took most of CPU time, and it is actually more accurate than CPU sampling. Still in ReadyThread table, let’s re-order the columns: remove everything, leave only NewProcess, NewThreadId, NewThreadStack and %CPU Usage, order data by %CPU Usage:
This is cool. The butterfly stack graph shows that during 13.20s to 14.08s, in thread 800, most of CPU time was spent in function services.exe!RSetServiceObjectSecurity and children functions called by it. Based on stack, we also know that thread 800 was responding to an ALPC message and was modifying a service’s security descriptor. As the result of the modification, it resulted a registry flush which took about 0.8s to finish. This whole process was inside a semaphore object (recall that thread 800 readied thread 672 by calling ntkrnlmp.exe!NtReleaseSemaphore from services.exe!RSetServiceObjectSecurity), which ultimately delayed start of MyService.
Who Did It?
Now we know that the root cause is somebody was modifying service security descriptor during boot, caused registry flush, and delayed other service’s start. The next step is to find out who did it. Based on the knowledge I know how Windows service works, it is sechost.dll that creates worker threads and calls ServiceMain function implemented by the service, and there is a good chance sechost.dll!SetServiceObjectSecurity was called from this thread. Let’s re-order the column in ReadyThread table and see whether we can find somebody called sechost.dll!SetServiceObjectSecurity. It only took a minute for me to find it:
AnotherSvc.exe’s thread 1584 made a call to sechost.dll!SetServiceObjectSecurity, it transformed into an ALPC message, handled by thread 800 in services.exe, and once it is done, thread 800 readied thread 1584 as well as another thread 672 in services.exe almost at same time due to the releasing semaphore. Now all the pieces are put together and we got a clear picture of what was going on.
I contacted the team responsible for AnotherSvc.exe and they fixed the issue by moving the call to sechost.dll!SetServiceObjectSecurity at a later stage.
Here we used ReadyThread table (and different columns and orderings) to track down the root cause of the service slow start, across 4 threads in 3 different processes, this is so far the most complicated slow performance issue I have met, and I hope won’t run into another in future.