Causes of IIS worker process high CPU

High CPU usage in the IIS worker process is the second most common performance complaint for production IIS websites.

In this guide, we are going to explore why this happens, and how w3wp high CPU usage can degrade and even take down IIS websites [through hangs, thread pool exhaustion, queueing/503 Queue Full errors, and more].

We’ll also outline a more effective way to monitor IIS CPU usage, and detect when it causes a problem.

Finally, we’ll provide you with a practical approach to diagnosing the underlying causes of CPU usage in your application code, so that you can definitively resolve AND prevent CPU overloads in production.

But first, let’s take a look at why diagnosing high CPU incidents in production tends to be so hard...

The reason why your IIS worker process has high CPU usage

After using LeanSentry to help diagnose and resolve performance issues in 30K+ IIS websites over the last decade, we’ve discovered one simple but valuable fact:

The actual cause of high CPU usage in production is almost always NOT what you think it is!

It’s usually not because of high traffic, denial of service attacks, or “just because our application has to do a lot of work”. It’s also not because IIS is misconfigured [a favorite go-to for application developers unable to reproduce the CPU usage in their local dev environments].

Instead, the top causes of w3wp.exe high CPU usage tend to be aspects of application code that you would normally never think about or see during testing, but that nonetheless surface when the application experiences peak traffic or an unexpected workload. Things like:

  1. Logging library logging a large number of database errors to disk.
  2. Monitor lock contention on an application lock.
  3. MVC action parameter binding, or serializing large JSON responses.
  4. Query compilation of a particularly complex LINQ query expression.

[sound familiar?]

This partially explains why code reviews, and even proactive testing/tuning in a test environment, often fail to find the true cause of production CPU overloads. Without knowing exactly what code is causing the high w3wp CPU in production at the EXACT TIME of the overload, you are probably optimizing the wrong code!

It turns out that this is actually good news!

First, because the cause of high CPU usage is often “secondary” to the application’s core functionality, it can usually be modified or removed without affecting that functionality. For example, the logging can be changed to skip a specific event, or lock contention can be eliminated by implementing a low-lock pattern.

Second, it means that extensive rewrites or performance testing of application code is not typically required. This can save a lot of development time.

Instead, all we need to do is determine the application code causing the high CPU usage in the IIS worker process, at the exact time when it causes a hang or website performance degradation in production.

If we can do this, we can minimally optimize the right code and prevent this from happening in the future.

Proactive performance testing: the tale of two camps

In my experience, teams often fall into one of two camps when it comes to CPU optimization:

Camp 1: “We ignore the application’s CPU usage until it becomes a problem.” These teams assume that the CPU utilization “is what it is”, in other words, that it’s simply the computational cost of hosting the website’s workload. As a result, these applications tend to run hot, and often experience CPU overloads which can cause downtime and poor performance.

The knee jerk reaction is to throw more hardware at the problem, which then also ensures that the hosting costs/cloud costs for running your application are 2-5x higher than they really need to be. At the same time, the application likely still experiences high CPU usage and overloads during peak traffic.

Camp 2: “We proactively test and tune the application code before deployment!” These teams spend a lot of time in their release cycle running tests and optimizing the code. Yet, the return on investment for these activities can be spotty, because they take a lot of developer effort … and the application can still experience CPU overloads in production! This happens because it’s nearly impossible to properly simulate a real production workload in a test environment … which means that they are likely to optimize the wrong code. Additionally, scheduling optimization time for the dev teams is often an expensive exercise and does not usually keep up with the pace of application changes.

Both camps experience more CPU overloads than desired, and end up spending more time and resources dealing with high CPU usage.

Instead, what we found works best is a “lean”, opportunistic approach: capture the CPU peaks in production and optimize them aggressively. This leads to minimal development work upfront, delivers the right fixes for the actual bottlenecks exposed by production workloads, and ensures that over time the application becomes faster, more efficient, and cheaper to host.

Monitoring IIS worker process CPU usage

To properly detect instances of CPU overload, we have to look a bit further than the CPU usage of the server or the IIS worker process.

This is because, in an ideal world, your application’s CPU usage is “elastic”. Meaning, the w3wp.exe consumes more CPU as it handles a higher workload, and is able to use up the entire processor bandwidth of the server without experiencing significant performance degradation. This is the case for many simpler CPU intensive software workloads like rendering, compression, and even some very simple web workloads e.g. serving static files out of the cache.

Unfortunately, most modern web applications are not elastic enough when it comes to CPU usage. Instead of “stretching” when the CPU usage increases, the application chokes. Instead of experiencing a slight slowdown, your website might begin to throw 503 Queue Full errors, experience very slow response times, or hang.

Worse yet, these issues may begin to crop up well before your server is at 100% CPU usage.

High worker process CPU usage often causes severe performance degradation because of the complex interplay between the async/parallel nature of modern web application code, thread pool starvation and exhaustion, and garbage collection. We explain these regressive mechanisms in detail below.

Before we do that, let’s dig into how to properly monitor IIS CPU usage and detect CPU overloads.

Detecting CPU overloads

Your IIS monitoring strategy for CPU overloads needs to include monitoring IIS website performance together with CPU usage. The CPU overload exists when the CPU usage of the worker process or server is high, AND performance is degraded.

To perform accurate CPU overload detection, LeanSentry CPU Diagnostics use a large number of IIS and process metrics, including a number of threading and request processing performance counters.

If you are monitoring this manually or using a basic [non-diagnostic] APM tool that simply watches performance counters, you can boil this monitoring down to three main components:

Monitoring slow requests caused by high CPU

Is a high percentage of your requests completing slower than desired?

An older way to measure this would be to look at response times for completing requests, e.g. using average latency or the 99th percentile response time. This approach has many issues, including being easily skewed by outliers [e.g. a handful of very slow requests due to external database delays] or hiding significant issues by diluting the metric with many very fast requests [e.g. thousands of static file requests].

At LeanSentry, we use a metric called Satisfaction score [similar to Apdex], which counts the number of slow requests in your IIS logs as a percentage of your overall traffic. We allow you to specify custom response time thresholds for the website and for specific URLs, so the “slow request” determination is meaningful for different parts of your application.

LeanSentry tracking the percentage of your traffic that is slow.

[If you are not using LeanSentry, you can compute your own satisfaction score monitoring using our IIS log analysis guide.]
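If you are computing this yourself, a satisfaction-style metric boils down to counting requests that exceed a response time threshold. The sketch below is illustrative only; the function name, thresholds, and input format are our assumptions, not LeanSentry’s actual implementation:

```python
# Sketch: an Apdex-like "satisfaction score" computed from parsed IIS log data.
# All names and thresholds here are illustrative, not LeanSentry's API.

def satisfaction_score(requests, default_threshold_ms=1000, url_thresholds=None):
    """Return the percentage of requests completing within their threshold.

    requests: iterable of (url, time_taken_ms) tuples, e.g. parsed from
    the IIS W3C log fields cs-uri-stem and time-taken.
    url_thresholds: optional dict mapping URL prefixes to custom thresholds.
    """
    url_thresholds = url_thresholds or {}
    total = slow = 0
    for url, time_taken_ms in requests:
        # Pick the most specific matching threshold for this URL.
        threshold = default_threshold_ms
        for prefix, custom in url_thresholds.items():
            if url.startswith(prefix):
                threshold = custom
                break
        total += 1
        if time_taken_ms > threshold:
            slow += 1
    return 100.0 if total == 0 else 100.0 * (total - slow) / total

sample = [("/home", 120), ("/api/report", 4800), ("/home", 90), ("/api/report", 2500)]
score = satisfaction_score(sample, default_threshold_ms=1000,
                           url_thresholds={"/api/report": 3000})  # 3 of 4 fast -> 75.0
```

During a CPU-driven slowdown, this score drops sharply even when average latency still looks acceptable, which is exactly the dilution problem described above.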

If the IIS worker process is experiencing high CPU that’s affecting your website performance, the percentage of slow requests will rapidly increase.

If your workload is elastic with respect to CPU usage, you may see only a very small change in slow requests, even if all your requests are slightly slower. In this case, congratulations: you are making great use of your server’s processing bandwidth!

Monitoring application queueing and 503 Queue Full errors

Server CPU overload will often cause application pools to experience queueing. Application pool queueing happens when the IIS worker process is unable to dequeue the incoming requests fast enough, usually because:

  1. The server CPU is completely overloaded.
  2. There are not enough threads in the IIS thread pool to dequeue incoming requests.
  3. Your website has a VERY HIGH throughput [RPS].

When this happens, the requests queue up in the application pool queue.

If you have LeanSentry error monitoring, it will automatically detect queueing and analyze IIS thread pool problems causing queueing, including determining the application code causing the CPU overload [and thereby causing queueing]. If you don’t have LeanSentry, we’ll review options for doing the CPU code analysis yourself below.

LeanSentry detects an IIS 503 QueueFull incident, and automatically diagnoses the application code triggering it.

A simple way to monitor IIS application pool queueing is by watching these two metrics:

Metric: Application pool queue length

What it measures: The number of requests waiting for the IIS worker process to dequeue them.

Data source: The HTTP Service Request Queues\CurrentQueueSize performance counter. You should monitor this separately for each application pool.

Metric: 503 Queue Full errors

What it measures: Requests rejected by HTTP.SYS with the 503 Queue Full error code, due to the application pool queue being full.

Data source: The HTTPERR error logs, located in:

c:\windows\system32\logfiles\HTTPERR

If you have 503 Queue Full errors, you’ll see entries like:

2021-09-08 23:01:06 ::1%0 61091 ::1%0 8990 HTTP/1.1 GET /test.aspx - - 503 4 QueueFull TestApp TCP
2021-09-08 23:01:06 ::1%0 61092 ::1%0 8990 HTTP/1.1 GET /test.aspx - - 503 4 QueueFull TestApp TCP
2021-09-08 23:01:06 ::1%0 61093 ::1%0 8990 HTTP/1.1 GET /test.aspx - - 503 4 QueueFull TestApp TCP
…

You can also monitor the HTTP Service Request Queues\RejectedRequests performance counter, but we prefer the HTTPERR log because the rejected requests counter can represent many different types of application pool failures outside of QueueFull.
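Counting QueueFull rejections out of the HTTPERR log is easy to script. The sketch below assumes the field layout of the sample entries above [reason and queue name as the third- and second-to-last fields]; check the #Fields: header in your own log files before relying on it:

```python
from collections import Counter

def count_queue_full(lines):
    """Count 503 QueueFull rejections per application pool queue.

    Assumes the HTTPERR field layout shown in the sample entries above,
    with s-reason and the queue name as the 3rd- and 2nd-to-last fields.
    Verify against the #Fields: header of your own HTTPERR logs.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        parts = line.split()
        if len(parts) >= 4 and parts[-3] == "QueueFull":
            counts[parts[-2]] += 1  # key by queue (application pool) name
    return counts

sample = [
    "2021-09-08 23:01:06 ::1%0 61091 ::1%0 8990 HTTP/1.1 GET /test.aspx - - 503 4 QueueFull TestApp TCP",
    "2021-09-08 23:01:06 ::1%0 61092 ::1%0 8990 HTTP/1.1 GET /test.aspx - - 503 4 QueueFull TestApp TCP",
]
rejections = count_queue_full(sample)  # Counter({"TestApp": 2})
```

Tracking this count per application pool, per time interval, tells you both that rejections are happening and which pool’s queue is overflowing.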

If the CPU overload is severe enough, or is combining with IIS thread pool issues, you’ll see the application pool queue growing, and eventually causing 503 Queue Full errors when the queue size exceeds the configured application pool queue limit [1000 by default].

Normally, if your server is coping well with the workload, you should have zero 503 errors, and ideally an empty application pool queue [or a queue with no more than a handful of requests queued].

Monitoring IIS hangs

Third, we want to watch for signs of high CPU hangs, which are usually caused by deadlocks or severe performance degradation due to thread pool starvation.

If you have LeanSentry, it will automatically detect these types of hangs and determine the issue causing performance degradation, down to the offending application code:

LeanSentry detecting hangs, and diagnosing the root cause of the hang to be thread pool exhaustion due to blockage in a specific part of the application code.

If you don’t have LeanSentry, you can perform your own simple hang monitoring by watching two things:

  1. The number of active requests to your website.
  2. The currently executing requests that appear “blocked”.

If your worker process has high CPU usage and is experiencing a high CPU hang, it will always show a large increase in active requests [because the requests are getting “stuck” and not completing].

This metric is better to monitor than RPS, because RPS is strongly affected by the rate of incoming requests to your website and can vary widely regardless of whether a hang exists. A hang cannot exist on a production website without a large number of “active requests”.

Additionally, a hang will show requests “stuck” for a long time [we use 10 sec by default], as opposed to a large number of “new” requests. If you have a large number of relatively new requests and high CPU, again, congratulations, your website is stretching to its workload and does not have a hang.

Metric: Active Requests

What it measures: The number of requests being processed inside the IIS worker process.

Data source: The W3SVC_W3WP\Active Requests performance counter. This counter is reported per IIS worker process, for example W3SVC_W3WP[15740_DefaultAppPool]\Active Requests. Because of this, you may need to link these counters to the associated application pools and aggregate them if you have multiple worker processes per pool [web gardens].

Metric: Blocked requests

What it measures: The requests currently executing in the IIS worker process, with information on how long they’ve been processing and where in the request processing pipeline they are currently “stuck”.

Data source: The Appcmd command:

appcmd list requests /elapsed:10000

This command lists all requests that have been executing in the worker process for more than 10 seconds. While it’s normal for requests to take a bit longer to finish under CPU load, a hang will show requests “stuck” for much longer than usual.
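If you want to automate the “stuck requests” check, you can parse the appcmd output for elapsed times. The REQUEST line format below is an assumption based on typical appcmd list requests output; verify the exact format against what your IIS version prints before scripting against it:

```python
import re

# Sketch: a rough hang heuristic over `appcmd list requests` output.
# The "time:<n> msec" token assumed below is based on typical appcmd
# output; confirm it against your own IIS version's output.
TIME_RE = re.compile(r"time:(\d+) msec")

def count_stuck_requests(appcmd_output, stuck_ms=10000):
    """Count requests that have been executing for at least stuck_ms."""
    return sum(1 for m in TIME_RE.finditer(appcmd_output)
               if int(m.group(1)) >= stuck_ms)

def looks_like_hang(appcmd_output, stuck_ms=10000, min_stuck=10):
    """Heuristic: a hang shows many long-stuck requests, not just new ones."""
    return count_stuck_requests(appcmd_output, stuck_ms) >= min_stuck

# Hypothetical output lines, modeled on typical appcmd formatting:
sample_output = """\
REQUEST "f200000280000007" (url:GET /test.aspx, time:30500 msec, stage:ExecuteRequestHandler)
REQUEST "f200000280000008" (url:GET /test.aspx, time:28100 msec, stage:ExecuteRequestHandler)
REQUEST "f200000280000009" (url:GET /home.aspx, time:1200 msec, stage:ExecuteRequestHandler)
"""
```

The min_stuck threshold separates “everything is a bit slower under load” from “requests are genuinely stuck”, which mirrors the elapsed-time distinction described above.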

If you do have a hang, you’ll see a number of requests “stuck” or queued up, most likely in the “ExecuteRequestHandler” application handler stage:

InetMgr showing requests “stuck” during a hang.

If your server CPU is overloaded, but your workload is elastic, you are likely to observe many active requests with a relatively short time elapsed. This is reasonable, since everything is taking longer to execute.

However, if you are seeing a lot of requests with elapsed times of 10 seconds or higher, you have an inelastic workload and likely have a high CPU hang. We’ll dig into why this happens and how to resolve high CPU hangs below.
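For web gardens, the per-process Active Requests counters need to be rolled up per application pool. The sketch below assumes the pid_PoolName instance-name convention seen in examples like 15740_DefaultAppPool:

```python
# Sketch: aggregating per-process Active Requests counter values by
# application pool. Assumes counter instance names follow the
# "pid_PoolName" convention (e.g. "15740_DefaultAppPool").

def aggregate_by_pool(counter_values):
    """counter_values: dict of counter instance name -> Active Requests value."""
    totals = {}
    for instance, value in counter_values.items():
        # Split "pid_PoolName" at the first underscore; keep the pool name.
        _, _, pool = instance.partition("_")
        totals[pool] = totals.get(pool, 0) + value
    return totals

sample = {"15740_DefaultAppPool": 12, "20112_DefaultAppPool": 8, "9041_ApiPool": 3}
totals = aggregate_by_pool(sample)  # {"DefaultAppPool": 20, "ApiPool": 3}
```

With the values aggregated per pool, a sudden sustained rise in a pool’s total active requests is the hang signal described above, independent of how many worker processes serve that pool.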

Monitoring CPU usage of the IIS worker process

To detect high CPU usage in the IIS worker process, you can simply monitor the following:

  1. The CPU usage of w3wp.exe, using the Process\% Processor Time performance counter [with the counter instance matching your w3wp process].
  2. The CPU usage of the server itself, using the Processor[_Total]\% Processor Time performance counter.
  3. The processor queue length, using the System\Processor Queue Length performance counter.

In combination with the slow request, queue/503, and hang monitoring above, this can help us figure out what kind of issues we may have.
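Combining these signals can be sketched as a simple decision function. All thresholds below are illustrative starting points chosen by us, not LeanSentry’s actual detection rules; tune them for your own workload:

```python
# Sketch: combining worker/server CPU, slow-request percentage, app pool
# queue length, and stuck-request count into a rough overload classification.
# Every threshold here is an illustrative assumption, not a product rule.

def classify_cpu_state(process_cpu, server_cpu, slow_request_pct,
                       queue_length, stuck_requests):
    """Classify the current state from the metrics discussed above."""
    if process_cpu < 80 and server_cpu < 80:
        return "healthy"
    if stuck_requests > 10:
        return "possible high-CPU hang"
    if queue_length > 100:
        return "CPU overload with queueing"
    if slow_request_pct > 5:
        return "CPU overload degrading performance"
    # High CPU but no degradation signals: the workload is stretching.
    return "high CPU, elastic workload"

state = classify_cpu_state(process_cpu=95, server_cpu=90, slow_request_pct=20,
                           queue_length=0, stuck_requests=0)
```

The ordering matters: hang and queueing signals are checked before the generic slow-request signal, because they call for different responses [thread pool/blockage analysis vs. CPU profiling].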

Bringing it all together

Using the metrics above, this IIS performance monitoring strategy can both detect and classify instances of CPU overload to help shape your response:

  1. If the IIS worker process has high CPU usage and is experiencing performance degradation, but the server is NOT completely overloaded [Server processor time
