The first question every SQL Server DBA should ask when a query slows down isn't "what changed in the code?" — it's "what was the database waiting for?" Wait stats are SQL Server's own answer to that question, and the difference between a senior DBA and a junior one often comes down to whether they read them. This playbook walks through the four wait-stat sources every diagnostic engagement should hit, the eight wait types that account for ~95% of real-world incidents, and the remediation paths for each.
sys.dm_os_wait_statsis the headline source — but it's cumulative since instance start, so always capture a baseline + delta over a 5–15 minute window during the slowdown, not raw totals.- The Query Store's
sys.query_store_wait_stats(SQL Server 2017+) attributes waits to specific queries and plans, which the instance-wide DMV cannot do. - Eight wait types account for the overwhelming majority of incidents: PAGEIOLATCH_*, LCK_M_*, CXPACKET / CXCONSUMER, SOS_SCHEDULER_YIELD, RESOURCE_SEMAPHORE, WRITELOG, PAGELATCH_*, and ASYNC_NETWORK_IO.
- Ignore "benign" waits:
CLR_AUTO_EVENT,BROKER_*,SLEEP_*,WAITFOR,DBMIRROR_EVENTS_QUEUE,HADR_FILESTREAM_IOMGR_IOCOMPLETION— these are present even on idle instances. - Wait stats are a starting point, not an answer. Always pair the top wait with
sys.dm_exec_requests,sys.dm_io_virtual_file_stats, or the Query Store before changing anything.
What Wait Stats Actually Measure
Every SQL Server worker thread has three possible states: RUNNING (executing on a CPU scheduler), RUNNABLE (ready but queued behind another worker on the scheduler), and SUSPENDED (waiting for something — a page from disk, a lock release, a network buffer drain, etc.). Wait stats record cumulative time spent in the SUSPENDED state, broken down by what each thread was waiting on. Time in RUNNABLE — sometimes called "signal wait time" — is reported separately and indicates CPU pressure.
The math that lets you reason about the instance:
total_wait_ms = signal_wait_ms (RUNNABLE — CPU pressure)
+ resource_wait_ms (SUSPENDED — actual resource wait)If signal waits are more than ~25% of total wait time, the instance is CPU-bound. If signal waits are low but a specific resource wait type dominates, that resource is the bottleneck. This single ratio cuts most diagnostic engagements in half before you even look at individual wait types.
Four Wait-Stat Sources, Ranked by Resolution
SQL Server exposes wait data through four different views, each at a different granularity. Use the one that matches the question you're trying to answer.
1. sys.dm_os_wait_stats — Instance-Wide, Cumulative
The classic source. Aggregated since the last instance restart (or the last DBCC SQLPERF (sys.dm_os_wait_stats, CLEAR) call). Useful for "what is this server's overall character?" — never for "why is this query slow right now."
-- Top resource waits, excluding the well-known benign ones
SELECT TOP 10
wait_type,
wait_time_ms / 1000.0 AS wait_sec,
waiting_tasks_count,
(wait_time_ms - signal_wait_time_ms) / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms,
100.0 * wait_time_ms / SUM(wait_time_ms) OVER() AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'CLR_SEMAPHORE',
'DBMIRROR_DBM_EVENT', 'DBMIRROR_EVENTS_QUEUE', 'DBMIRROR_WORKER_QUEUE',
'DIRTY_PAGE_POLL', 'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT', 'HADR_CLUSAPI_CALL', 'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
'HADR_LOGCAPTURE_WAIT', 'HADR_NOTIFICATION_DEQUEUE', 'HADR_TIMER_TASK',
'HADR_WORK_QUEUE', 'KSOURCE_WAKEUP', 'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE',
'ONDEMAND_TASK_QUEUE', 'PWAIT_ALL_COMPONENTS_INITIALIZED',
'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', 'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
'REQUEST_FOR_DEADLOCK_SEARCH', 'RESOURCE_QUEUE',
'SERVER_IDLE_CHECK', 'SLEEP_BPOOL_FLUSH', 'SLEEP_DBSTARTUP',
'SLEEP_DCOMSTARTUP', 'SLEEP_MASTERDBREADY', 'SLEEP_MASTERMDREADY',
'SLEEP_MASTERUPGRADED', 'SLEEP_MSDBSTARTUP', 'SLEEP_SYSTEMTASK',
'SLEEP_TASK', 'SLEEP_TEMPDBSTARTUP', 'SNI_HTTP_ACCEPT',
'SP_SERVER_DIAGNOSTICS_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'SQLTRACE_WAIT_ENTRIES',
'WAIT_FOR_RESULTS', 'WAITFOR', 'WAITFOR_TASKSHUTDOWN',
'XE_DISPATCHER_JOIN', 'XE_DISPATCHER_WAIT', 'XE_TIMER_EVENT'
)
AND wait_time_ms > 0
ORDER BY wait_time_ms DESC;If your server has been up for 60 days and an incident lasted 30 minutes, the bad waits during that window are diluted ~2,880-to-1 against historical totals. Always take two snapshots bracketing the incident and diff them. The query above is fine as a quick "what's this instance's overall shape" check; it is not a diagnostic tool for a live incident.
2. sys.dm_exec_session_wait_stats — Per Session, Live
Per-session breakdown of waits, refreshed continuously. Joined with sys.dm_exec_requests and sys.dm_exec_sql_text, this tells you exactly which active session is causing the dominant wait right now:
SELECT
r.session_id,
r.status,
r.wait_type,
r.wait_time AS current_wait_ms,
r.wait_resource,
r.blocking_session_id,
DB_NAME(r.database_id) AS database_name,
SUBSTRING(t.text, (r.statement_start_offset/2) + 1,
((CASE r.statement_end_offset WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset END - r.statement_start_offset)/2) + 1) AS current_stmt,
qp.query_plan
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) qp
WHERE r.session_id > 50 -- exclude system
AND r.wait_type IS NOT NULL
ORDER BY r.wait_time DESC;3. sys.query_store_wait_stats — Per Query Plan (SQL 2017+)
Available when Query Store is enabled and is the only DMV that ties wait time back to a specific query plan. Indispensable for "after we deployed v3.2 the database got slow" investigations — Query Store will show the exact plan hash that started accumulating LCK_M_X waits after the deploy.
-- Top queries by wait category over the last hour
SELECT TOP 25
qsws.wait_category_desc,
SUM(qsws.total_query_wait_time_ms) AS wait_ms,
qsqt.query_sql_text
FROM sys.query_store_wait_stats qsws
JOIN sys.query_store_plan qsp ON qsws.plan_id = qsp.plan_id
JOIN sys.query_store_query qsq ON qsp.query_id = qsq.query_id
JOIN sys.query_store_query_text qsqt ON qsq.query_text_id = qsqt.query_text_id
JOIN sys.query_store_runtime_stats_interval qsrsi
ON qsws.runtime_stats_interval_id = qsrsi.runtime_stats_interval_id
WHERE qsrsi.end_time > DATEADD(HOUR, -1, GETUTCDATE())
GROUP BY qsws.wait_category_desc, qsqt.query_sql_text
ORDER BY wait_ms DESC;4. Extended Events — Surgical, Forensic
When DMVs aren't granular enough — for example, you need to know the exact resource a query was waiting on, not just the wait type — an Extended Event session capturing wait_completed filtered to a specific session_id or query_hash will give you per-wait records with full context. Use sparingly: high-frequency wait events are expensive to capture.
The Eight Wait Types That Cover ~95% of Incidents
Out of the 800+ wait types SQL Server can record, eight account for nearly all real-world performance incidents. Memorize what each one means and you can short-cut most diagnostics.
1. PAGEIOLATCH_SH / PAGEIOLATCH_EX — Disk I/O Wait
What it means: A worker is waiting for a data page to be read from disk into the buffer pool. SH = shared (reading), EX = exclusive (writing).
Diagnose: Cross-reference with sys.dm_io_virtual_file_stats to find which file/tempdb is slow. Calculate average read/write latency per file:
SELECT
DB_NAME(vfs.database_id) AS db,
mf.name AS file_name,
vfs.num_of_reads,
vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
vfs.num_of_writes,
vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf ON vfs.database_id = mf.database_id
AND vfs.file_id = mf.file_id
ORDER BY (vfs.io_stall_read_ms + vfs.io_stall_write_ms) DESC;Targets: <5 ms is great, 5–20 ms is acceptable, >20 ms means storage is your bottleneck. Fixes: faster storage tier, more memory (so pages stay in buffer pool), fewer table scans (add missing indexes), TempDB on dedicated NVMe.
2. LCK_M_* — Lock Wait
What it means: A transaction is waiting to acquire a lock another session holds. Common variants: LCK_M_S (Shared), LCK_M_X (eXclusive), LCK_M_IX (Intent Exclusive), LCK_M_U (Update).
Diagnose: Use the live query from section 2 above, watching blocking_session_id to walk the chain to the head blocker. The head blocker is what to remediate.
Fixes: Check isolation level (READ COMMITTED SNAPSHOT often eliminates the bulk of reader-writer locks on OLTP), look for missing indexes that force lookups holding more locks, shorten transactions, check whether the head blocker is an idle session holding a transaction open ("BEGIN TRAN; do work; sleep" pattern in application code).
3. CXPACKET / CXCONSUMER — Parallelism Wait
What it means: Parallel query coordination — threads waiting on sibling threads in a parallel plan. CXPACKET is the older catch-all; CXCONSUMER (introduced in 2017) is the "harmless" portion that just indicates one thread finished early. CXPACKET minus CXCONSUMER is the portion that actually indicates skew or coordination cost.
Fixes: Tune MAXDOP (instance-wide via configuration or per-query via OPTION (MAXDOP n)) and cost threshold for parallelism. Default cost threshold of 5 is far too low for modern hardware — most production instances should set it to 30–50. Microsoft's guidance for OLTP workloads is MAXDOP equal to logical processors per NUMA node up to 8, but this is a starting point — measure, don't guess.
4. SOS_SCHEDULER_YIELD — CPU Pressure / Compile
What it means: A worker voluntarily yielded the CPU because its 4ms quantum expired. By itself, not a problem — but if it's the top wait, you're CPU-bound. Often correlates with bad query plans (Cartesian products, missing indexes leading to table scans, or runaway parallel plans).
Diagnose: Check signal-wait percentage from sys.dm_os_wait_stats; if > 25%, the instance is CPU-bound. Check sys.dm_exec_query_stats ordered by total_worker_time to find the heaviest CPU consumers.
5. RESOURCE_SEMAPHORE — Memory Grant Wait
What it means: A query needs a memory grant for sort/hash operations and is waiting for memory to become available. Always indicates either undersized memory or poor cardinality estimates causing oversized grants.
Fixes: Update statistics (stale stats cause cardinality misestimates and inflated memory grants), add Resource Governor pools to cap individual query grants, or — if you really are memory-starved — add RAM and increase max server memory.
6. WRITELOG — Transaction Log Flush Wait
What it means: A commit is waiting for the transaction log buffer to be flushed to disk. The log file's write latency is the single most important storage metric for OLTP.
Targets: <2 ms write latency on the log file. 2–10 ms is tolerable. >10 ms means commits are blocking, which cascades into every other wait.
Fixes: Dedicated NVMe for the log file, never share spindles with data, never put logs on EBS gp2 (use io2 or io2-Block-Express on AWS, Premium SSD v2 on Azure). Enable delayed durability sparingly for batch-heavy workloads.
7. PAGELATCH_* — In-Memory Latch Contention
What it means: Different from PAGEIOLATCH — these are short-lived in-memory latches protecting buffer pool pages. PAGELATCH_EX contention is the classic TempDB allocation bottleneck (PFS/GAM/SGAM pages) on busy OLTP workloads.
Fixes: TempDB-specific — use multiple equally-sized data files (one per logical core, up to 8). On SQL 2016+ this is the default. Enable trace flag 1118 (uniform extent allocation) — also default on SQL 2016+ for TempDB. If application-database TempLatch contention persists, check for hot identity columns (last page insert pattern) and consider OPTIMIZE_FOR_SEQUENTIAL_KEY (SQL 2019+).
8. ASYNC_NETWORK_IO — Client Network Wait
What it means: SQL Server is waiting for the client application to consume results. Almost never a SQL Server problem — it's an application that's fetching huge result sets row-by-row, sleeping inside a cursor, or running on slow client hardware.
Fixes: Application-side. Make sure the client is reading results greedily, use TOP/OFFSET-FETCH to paginate, and never run reporting workloads that pull 10M rows over the wire when an aggregation could happen server-side.
Diagnostic Workflow: 30 Minutes from Page to Plan
Putting it together, here's the workflow we use on incident calls:
- Capture baseline — snapshot
sys.dm_os_wait_statsinto a temp table. - Wait 5 minutes through the slowdown.
- Snapshot again — compute delta. Top three resource waits in the delta are your real suspects.
- Cross-reference with the live
sys.dm_exec_requestsview to find which sessions are accumulating those waits right now. - Drill into Query Store for per-query attribution. If the top wait is
LCK_M_X, Query Store will show which queries are accumulating that wait against which tables. - Compute signal wait % from the delta — confirms whether CPU is the underlying issue.
- Match wait type to remediation from the section above. Apply one change at a time, then re-measure.
The temptation in a production incident is to make four config changes at once. Don't. Make one, measure for 15 minutes, then decide. Two changes at once means you don't know which one worked — and if performance gets worse, you don't know which one made it worse.
Common False Positives
Three wait types appear high in production reports but rarely indicate real problems:
| Wait type | Why it looks bad | When it's actually a problem |
|---|---|---|
CXPACKET (alone, on 2017+) |
Often the top wait by accumulated time | Only when CXPACKET minus CXCONSUMER is large, and you have skew in parallel branches |
ASYNC_NETWORK_IO |
Can dominate on reporting workloads | It's almost never SQL Server's fault — fix the client |
SOS_SCHEDULER_YIELD |
Always present, often top-5 | Only matters when signal wait % exceeds 25% — that's the real CPU-pressure signal |
Capacity Planning From Wait Stats
Wait stats aren't just for incident response — they're the cleanest input to capacity planning. Trend the top resource wait categories over 12 weeks. If WRITELOG is climbing linearly with transaction volume, you're approaching log-storage saturation. If PAGEIOLATCH_SH is climbing while buffer pool size is constant, the working set is outgrowing memory. These predictors are weeks ahead of "users complaining" and let you size hardware before incidents happen.
Working with JusDB on SQL Server Performance
JusDB's certified SQL Server DBAs run this exact diagnostic playbook on dozens of production incidents every month. Our SQL Server consulting engagements deliver written diagnostic reports with wait-stat baselines, query-store attributions, and remediation roadmaps — not Slack-thread guesses. For ongoing operations, our remote DBA service includes continuous wait-stat baselining as part of the monitoring contract, so we catch capacity drift before it becomes an incident.
Explore SQL Server Consulting → | SQL Server Remote DBA → | Talk to a SQL Server DBA
Related reading: