SQL Server Always On Availability Groups: Setup and Tuning

SQL Server Always On Availability Groups (AG) represent Microsoft's premier HA and DR solution, replacing the aging Database Mirroring feature that was deprecated back in SQL Server 2012. Whether you're protecting a mission-critical OLTP workload or offloading read replicas to secondary nodes, AG gives you fine-grained control over failover behavior, commit semantics, and routing logic. The catch is that misconfiguring any single component — from the Windows Server Failover Cluster (WSFC) quorum settings to the listener subnet binding — can silently degrade your RPO and RTO guarantees without any obvious error in the SQL Server error log. This guide walks through the full setup, the tradeoffs between synchronous and asynchronous commit modes, listener configuration, and the DMVs you need to keep your AG healthy in production.

TL;DR

Always On AG provides database-level HA with automatic or manual failover, built on top of Windows Server Failover Clustering (WSFC).
Synchronous commit guarantees zero data loss but adds latency; you are limited to 2 synchronous secondary replicas (3 total with SQL Server 2017+ supporting up to 5 sync replicas).
Asynchronous commit minimises latency overhead for geo-distributed replicas but introduces data loss risk on failover — always treat async replicas as DR-only targets.
The AG listener abstracts client connections from physical node names; configure MultiSubnetFailover=True in your connection string for fast failover.
Use sys.dm_hadr_availability_replica_states and sys.dm_hadr_database_replica_states to monitor synchronisation lag and replica health continuously.
Automatic failover requires synchronous commit and healthy WSFC quorum; forced failover overrides quorum and carries a data loss risk.

Always On AG Architecture

An Always On Availability Group is a logical container for one or more user databases that fail over together as a unit. The architecture sits on top of Windows Server Failover Clustering, which provides the quorum mechanism and the shared health detection layer, but unlike Failover Cluster Instances (FCI), AG does not require shared storage. Each replica maintains its own full copy of the databases on local storage, with changes streamed from the primary via the log-shipping-like redo thread.

The core components you need to understand before touching a single line of T-SQL are:

Primary replica — the only node that accepts read/write connections by default. All transactions originate here and are shipped to secondaries.
Secondary replicas — receive and apply the transaction log stream. In SQL Server 2022, you can have up to 8 secondaries (1 primary + 8 secondaries = 9 total replicas).
Availability Group Listener — a virtual network name (VNN) with one or more IP addresses registered in DNS and WSFC. Applications connect to the listener, not the node name, so failover is transparent.
WSFC Quorum — determines cluster health. With an even number of nodes, add a file share witness or cloud witness to avoid split-brain. Losing quorum means the AG goes offline even if SQL Server itself is running.
Endpoint — a database mirroring endpoint (port 5022 by default) used exclusively for AG log transport between replicas. This is distinct from the SQL Server listener port (1433).

Warning

Always On AG requires the same SQL Server edition on all replicas. Mixing Enterprise and Standard editions is unsupported and will cause the secondary to refuse connections on the mirroring endpoint. Additionally, all databases in the AG must be in the FULL recovery model before they can join — switching a database to SIMPLE recovery while it is part of an AG will immediately remove it from the group with an error.

Synchronous vs Asynchronous Commit Modes

The commit mode you assign to each secondary replica is the single most consequential configuration decision in your AG design. It directly controls your RPO (Recovery Point Objective) and indirectly controls your RTO (Recovery Time Objective).

Attribute	Synchronous Commit	Asynchronous Commit
Data loss on failover	Zero (RPO = 0)	Possible (RPO > 0)
Transaction latency impact	Adds round-trip network wait for log hardening on secondary	Minimal; primary does not wait for secondary acknowledgement
Automatic failover eligibility	Yes (requires synchronous + automatic failover mode)	No — manual or forced failover only
Maximum replicas in this mode	3 (SQL Server 2017+), 2 (SQL Server 2012–2016)	Up to remaining replica slots
Typical use case	Same datacenter or low-latency Metro-E link	Remote DR site, cloud replica, cross-region

With synchronous commit, SQL Server holds the committing transaction open until the secondary replica has hardened the log block to disk and sent an acknowledgement back to the primary. This adds a full network round-trip to every transaction commit. On a 1ms LAN, that overhead is imperceptible. On a 40ms WAN, your OLTP throughput will collapse. Never configure synchronous commit for a replica more than ~5ms away from the primary unless you have extensively load-tested the application under that latency.

Asynchronous commit lets the primary commit immediately and stream the log to the secondary without waiting. The secondary can lag by seconds, minutes, or hours depending on network conditions. If you fail over to an async replica, SQL Server performs a forced manual failover, which explicitly acknowledges that you may be accepting data loss. You will see the warning message in SQL Server Management Studio and must confirm the operation.

Tip

Use the log_send_queue_size and redo_queue_size columns in sys.dm_hadr_database_replica_states to quantify async lag in kilobytes. Convert to an approximate time delta by dividing by your average log generation rate (visible in sys.dm_hadr_database_replica_states.log_send_rate). Set up an alert when redo_queue_size exceeds your RPO threshold in bytes.

Setting Up an Availability Group (T-SQL)

Before creating the AG, you must create a mirroring endpoint on each replica and ensure the SQL Server service account (or certificate, if you use certificate-based auth) has CONNECT permission on the endpoint. The script below assumes Windows Authentication with a domain service account.

sql

-- Run on ALL replicas: create the mirroring endpoint
-- Replace 5022 with your chosen port if needed
CREATE ENDPOINT [Hadr_endpoint]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATA_MIRRORING (
        ROLE = ALL,
        AUTHENTICATION = WINDOWS NEGOTIATE,
        ENCRYPTION = REQUIRED ALGORITHM AES
    );
GO

-- Grant CONNECT to the SQL Server service account on each replica
-- (repeat for each replica's service account if they differ)
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [DOMAIN\SqlServiceAccount];
GO

With endpoints in place, create the AG on the primary replica. The script below configures one synchronous secondary for automatic failover and one asynchronous secondary for a remote DR site.

sql

-- Run on PRIMARY replica only
CREATE AVAILABILITY GROUP [AG_Production]
WITH (
    AUTOMATED_BACKUP_PREFERENCE    = SECONDARY,
    FAILURE_CONDITION_LEVEL        = 3,          -- Moderate: failover on critical SQL Server errors
    HEALTH_CHECK_TIMEOUT           = 30000,      -- 30 seconds
    DB_FAILOVER                    = ON,         -- Fail over if any database in the AG is unhealthy
    DTC_SUPPORT                    = NONE
)
FOR DATABASE [AppDB], [AuditDB]
REPLICA ON
    -- Primary replica
    N'SQL-NODE1' WITH (
        ENDPOINT_URL               = N'TCP://SQL-NODE1.corp.local:5022',
        AVAILABILITY_MODE          = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE              = AUTOMATIC,
        SEEDING_MODE               = AUTOMATIC,
        SECONDARY_ROLE (
            ALLOW_CONNECTIONS      = NO           -- Primary only handles reads; no readable secondary here
        )
    ),
    -- Local synchronous secondary (same datacenter, low latency)
    N'SQL-NODE2' WITH (
        ENDPOINT_URL               = N'TCP://SQL-NODE2.corp.local:5022',
        AVAILABILITY_MODE          = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE              = AUTOMATIC,
        SEEDING_MODE               = AUTOMATIC,
        SECONDARY_ROLE (
            ALLOW_CONNECTIONS      = READ_ONLY,   -- Readable secondary for reporting
            READ_ONLY_ROUTING_URL  = N'TCP://SQL-NODE2.corp.local:1433'
        )
    ),
    -- Remote asynchronous replica (DR site, higher latency)
    N'SQL-NODE3-DR' WITH (
        ENDPOINT_URL               = N'TCP://SQL-NODE3-DR.dr.local:5022',
        AVAILABILITY_MODE          = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE              = MANUAL,      -- Async replicas cannot use automatic failover
        SEEDING_MODE               = AUTOMATIC,
        SECONDARY_ROLE (
            ALLOW_CONNECTIONS      = READ_ONLY,
            READ_ONLY_ROUTING_URL  = N'TCP://SQL-NODE3-DR.dr.local:1433'
        )
    );
GO

-- Run on EACH SECONDARY replica to join the AG
-- (Run on SQL-NODE2 first, then SQL-NODE3-DR)
ALTER AVAILABILITY GROUP [AG_Production] JOIN;
GO

-- Grant CREATE DATABASE permission for automatic seeding on each secondary
ALTER AVAILABILITY GROUP [AG_Production] GRANT CREATE ANY DATABASE;
GO

Warning

Automatic seeding (SEEDING_MODE = AUTOMATIC) streams a full backup of each database directly over the mirroring endpoint to the secondary without requiring a manual restore. This is convenient but will saturate your mirroring endpoint bandwidth during initial seeding. For large databases (>500 GB), consider manual seeding (restore with NORECOVERY then join) during a maintenance window to avoid impacting production log transport on the same endpoint.

Configuring the AG Listener

The AG listener is the DNS/network abstraction that lets applications connect to the AG without knowing which node is currently primary. Without a listener, your applications must update their connection strings every time a failover occurs — which is operationally unacceptable.

sql

-- Add the listener to the existing AG
-- Run on the current PRIMARY replica
ALTER AVAILABILITY GROUP [AG_Production]
ADD LISTENER N'AG-Prod-Listener' (
    WITH IP (
        -- Primary datacenter VIP
        (N'10.10.1.50', N'255.255.255.0'),
        -- DR site VIP (for multi-subnet failover)
        (N'10.20.1.50', N'255.255.255.0')
    ),
    PORT = 1433
);
GO

After creating the listener, verify it registered correctly in DNS and WSFC, then test connectivity. The two most critical application-side settings are:

MultiSubnetFailover=True — instructs the .NET SqlClient driver to attempt connections to all IP addresses of the listener in parallel rather than sequentially. Without this, failover to a secondary subnet can take up to 21 seconds due to sequential TCP timeouts.
ApplicationIntent=ReadOnly — routes read-only workloads to a secondary replica via the read-only routing list you configured in READ_ONLY_ROUTING_URL. Without this hint, even a reporting query goes to the primary.

sql

-- Verify listener registration
SELECT
    ag.name                  AS availability_group,
    agl.dns_name             AS listener_name,
    aglip.ip_address         AS ip_address,
    aglip.ip_subnet_mask     AS subnet_mask,
    aglip.state_desc         AS ip_state
FROM sys.availability_groups          ag
JOIN sys.availability_group_listeners agl  ON ag.group_id    = agl.group_id
JOIN sys.availability_group_listener_ip_addresses aglip
                                           ON agl.listener_id = aglip.listener_id
ORDER BY ag.name, aglip.ip_address;
GO

For read-only routing to work end-to-end, you must also define a read-only routing list on the primary replica specifying the order in which secondaries should receive read-only connections:

sql

-- Configure read-only routing list on primary
-- Routes ReadOnly connections first to SQL-NODE2, then to SQL-NODE3-DR as fallback
ALTER AVAILABILITY GROUP [AG_Production]
MODIFY REPLICA ON N'SQL-NODE1' WITH (
    PRIMARY_ROLE (
        READ_ONLY_ROUTING_LIST = (N'SQL-NODE2', N'SQL-NODE3-DR')
    )
);
GO

Monitoring and Failover

Reactive monitoring — waiting for users to report errors — is not an acceptable strategy for an AG managing production data. You need continuous visibility into replica health, synchronisation lag, and log transport throughput. The following query is the foundation of any AG monitoring dashboard or alerting job.

sql

-- Comprehensive AG health check
-- Returns replica connectivity, sync state, and per-database redo/send queue
SELECT
    ag.name                                              AS availability_group,
    ar.replica_server_name                               AS replica,
    ars.role_desc                                        AS role,
    ars.operational_state_desc                           AS operational_state,
    ars.connected_state_desc                             AS connected_state,
    ars.synchronization_health_desc                      AS sync_health,
    ars.last_connect_error_number                        AS last_error_number,
    ars.last_connect_error_description                   AS last_error_desc,
    drs.database_name                                    AS database_name,
    drs.synchronization_state_desc                       AS db_sync_state,
    drs.log_send_queue_size          / 1024.0            AS log_send_queue_mb,
    drs.log_send_rate                / 1024.0            AS log_send_rate_mb_s,
    drs.redo_queue_size              / 1024.0            AS redo_queue_mb,
    drs.redo_rate                    / 1024.0            AS redo_rate_mb_s,
    -- Estimated lag in seconds (redo queue / redo rate)
    CASE
        WHEN drs.redo_rate > 0
        THEN CAST(drs.redo_queue_size * 1.0 / drs.redo_rate AS DECIMAL(10,1))
        ELSE NULL
    END                                                  AS estimated_lag_seconds,
    drs.is_suspended                                     AS data_movement_suspended,
    drs.suspend_reason_desc                              AS suspend_reason
FROM sys.dm_hadr_availability_replica_states             ars
JOIN sys.availability_replicas                           ar
    ON ars.replica_id  = ar.replica_id
JOIN sys.availability_groups                             ag
    ON ar.group_id     = ag.group_id
LEFT JOIN sys.dm_hadr_database_replica_states            drs
    ON ars.replica_id  = drs.replica_id
ORDER BY ag.name, ars.role_desc DESC, ar.replica_server_name, drs.database_name;
GO

Understanding the three failover modes is essential before you design your runbook:

Automatic failover — triggered by WSFC when the primary fails health checks. Requires synchronous commit mode and automatic failover mode on both the primary and at least one secondary. No DBA action required. Zero data loss guaranteed.
Planned manual failover — DBA-initiated switch to a synchronous secondary that is fully synchronised (synchronization_state_desc = 'SYNCHRONIZED'). Used for patching, hardware maintenance, and load balancing. Zero data loss, minimal downtime.
Forced failover (with possible data loss) — emergency failover to any replica regardless of synchronisation state. Used when the primary is permanently lost and no synchronous secondary is available. Always audits the estimated_lag_seconds before executing to understand exposure.

sql

-- Planned manual failover (run on the TARGET secondary)
-- Only valid when synchronization_state_desc = 'SYNCHRONIZED'
ALTER AVAILABILITY GROUP [AG_Production] FAILOVER;
GO

-- Forced failover WITH DATA LOSS (emergency only — run on TARGET secondary)
-- Confirm estimated_lag_seconds from monitoring query above first
ALTER AVAILABILITY GROUP [AG_Production] FORCE_FAILOVER_ALLOW_DATA_LOSS;
GO

Warning

FORCE_FAILOVER_ALLOW_DATA_LOSS places the old primary into a RESOLVING state. If the old primary comes back online after a forced failover, it will not automatically rejoin the AG. You must manually resume data movement or remove and re-add the replica. Failing to handle this correctly can result in two nodes both believing they are primary (split-brain), which WSFC quorum is designed to prevent — but only if quorum is intact.

Key Takeaways

Always On AG provides database-level HA without shared storage, built on WSFC. Every replica holds a complete, independent copy of the data.
Synchronous commit guarantees zero data loss (RPO=0) and enables automatic failover, but adds network round-trip latency to every commit. Limit synchronous replicas to nodes within ~5ms round-trip latency.
Asynchronous commit is appropriate for geo-distributed DR replicas. Treat async replicas as last-resort failover targets and always quantify lag before executing FORCE_FAILOVER_ALLOW_DATA_LOSS.
The AG listener with MultiSubnetFailover=True in the connection string is mandatory for sub-30-second client reconnection across subnets after a failover.
Configure read-only routing lists on the primary replica and use ApplicationIntent=ReadOnly in reporting connection strings to offload reads to secondaries without application changes.
Monitor sys.dm_hadr_availability_replica_states and sys.dm_hadr_database_replica_states continuously. Alert on synchronization_health_desc != 'HEALTHY', suspended data movement, and redo queue size exceeding your RPO threshold.
After a forced failover, the former primary enters RESOLVING state and must be manually rejoined — it will not self-heal. Build this into your DR runbook.

Working with JusDB on SQL Server HA

Designing and operating a production Always On AG involves far more than running a few T-SQL scripts. WSFC quorum design, certificate-based endpoint authentication for cross-domain replicas, Direct Seeding bandwidth management, and integrating AG health into your observability stack all require deep, hands-on SQL Server expertise. JusDB's SQL Server engineers have configured and operated AG environments from 2-node local clusters to 5-replica multi-datacenter architectures with read-scale secondaries serving hundreds of concurrent reporting queries.

If you need help assessing your current HA posture, designing a new AG topology, migrating from Database Mirroring or log shipping, or tuning an existing AG that is falling behind on synchronisation, our team can parachute in as an extension of your infrastructure team — from initial design through go-live and ongoing health monitoring.

Explore JusDB SQL Server HA Services → | Talk to a DBA

Need hands-on SQL Server help?

JusDB's certified SQL Server DBAs handle performance tuning, Always On AG, migrations, and 24/7 operations.

SQL Server Consulting → | SQL Server Remote DBA → | 24/7 SQL Server Support →

SQL Server Always On Availability Groups: Setup and Tuning

Always On AG Architecture

Synchronous vs Asynchronous Commit Modes

Setting Up an Availability Group (T-SQL)

Configuring the AG Listener

Monitoring and Failover

Working with JusDB on SQL Server HA

Share this article

Keep reading

PostgreSQL 19 Beta: Every New Feature That Matters to DBAs

High Performance with MongoDB: A Top-Down Tuning Guide

Migrate On-Premises SQL Server to Amazon RDS: Native Backup/Restore vs AWS DMS