This is a checklist put together and followed for upgrading a pair of physical SQL Server 2012 servers that hosted a single availability group containing several terabytes of data, with minimal downtime.
The availability group was configured with synchronous commit and automatic failover.
List of Steps
- If possible, stop any ETL processes that put data into the affected databases – less traffic means less catching up to do when you resume data movement later.
- Make sure all AGs hosted by the pair of servers have their primary replica on the same server, so the other server can be upgraded first as a pure secondary
- Put servers, clusters, listeners, everything into Maintenance Mode in SCOM / SQLMonitor / Solarwinds / whatever other monitoring system you use.
- Disable Transaction Log backup jobs on all servers hosting the Availability Group(s) (a sample command for this is sketched after the queries below)
- Take log backups of the databases in the Availability Group(s)
- Configure the Availability Group(s) to manual failover (example commands are sketched just after this list)
- Suspend data movement for each database on the secondary server
- Back up the system databases on the secondary server (a sample backup command is sketched in the queries section below)
- Upgrade the secondary server and reboot (a reboot might not be required, but it is recommended anyway)
- Resume data movement on the secondary, and allow the servers to synchronise.
- Fail over the availability group(s) to the upgraded server with no data loss (see the example commands after this list)
- Check that data movement to the new secondary (the not-yet-upgraded server) is suspended due to version incompatibility
- Upgrade the other server and then reboot
- Resume data movement on the secondary, and allow it to synchronise
- (Optional) Fail over the AG again with no data loss to check that everything is OK; not strictly required, but useful for extra peace of mind
- Reconfigure the availability group(s) back to automatic failover, if that's how it was configured
- Re-enable transaction log backup jobs
- Take servers out of maintenance mode in SCOM/SQLMonitor/Solarwinds/Whatever
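For reference, a minimal sketch of the failover-related commands used in the steps above, assuming an availability group called MyAG and a replica called SQLNODE2; both names are placeholders, so substitute your own.

-- Switch a replica to manual failover before the upgrade (run on the primary, once per replica)
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);

-- Planned failover with no data loss: run this on the upgraded secondary once it is synchronised
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Once both servers are upgraded and synchronised, switch the replicas back to automatic failover
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = AUTOMATIC);

Automatic failover is only valid for synchronous-commit replicas, which is how this availability group was already configured.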
A couple of useful SQL queries
Suspend / Resume HADR for all databases on the server
This is the suspend version – change SUSPEND to RESUME and rerun it later
DECLARE @sql VARCHAR(MAX) = '';

SELECT @sql = @sql + 'ALTER DATABASE ' + QUOTENAME(DB_NAME(hars.database_id))
            + ' SET HADR SUSPEND;' + CHAR(13) + CHAR(10)
FROM sys.dm_hadr_database_replica_states AS hars
INNER JOIN sys.dm_hadr_availability_group_states AS hags
    ON hags.group_id = hars.group_id
WHERE hars.is_local = 1
ORDER BY DB_NAME(hars.database_id);

PRINT @sql; -- optional
EXEC (@sql);
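Back up the system databases (on the server about to be upgraded)
A minimal sketch only – the backup path is a placeholder, so point it at your usual backup location.
-- Run on the server about to be upgraded
BACKUP DATABASE master TO DISK = N'X:\Backups\master_preupgrade.bak' WITH INIT, COMPRESSION;
BACKUP DATABASE model  TO DISK = N'X:\Backups\model_preupgrade.bak'  WITH INIT, COMPRESSION;
BACKUP DATABASE msdb   TO DISK = N'X:\Backups\msdb_preupgrade.bak'   WITH INIT, COMPRESSION;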
Check health of replicas in the AG (run on primary)
SELECT ag.name,
       ar.replica_server_name,
       hars.is_local,
       hars.role_desc,
       hars.operational_state_desc,
       hars.connected_state_desc,
       hars.recovery_health_desc,
       hars.synchronization_health_desc,
       hars.last_connect_error_number,
       hars.last_connect_error_description,
       hars.last_connect_error_timestamp,
       ag.failure_condition_level,
       ag.health_check_timeout,
       ar.owner_sid,
       ar.endpoint_url,
       ar.availability_mode_desc,
       ar.failover_mode_desc,
       ar.session_timeout,
       ar.primary_role_allow_connections_desc,
       ar.secondary_role_allow_connections_desc,
       ar.backup_priority,
       ag.automated_backup_preference_desc,
       ar.read_only_routing_url
FROM sys.dm_hadr_availability_replica_states AS hars
INNER JOIN sys.availability_groups AS ag
    ON ag.group_id = hars.group_id
INNER JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
   AND ar.replica_id = hars.replica_id;
Monitor progress of catchup/synchronisation (run on primary)
SELECT ag.name,
       adc.database_name,
       ar.replica_server_name,
       hdrs.is_local,
       hdrs.synchronization_state,
       hdrs.synchronization_state_desc,
       hdrs.is_commit_participant,
       hdrs.synchronization_health_desc,
       hdrs.database_state_desc,
       hdrs.is_suspended,
       hdrs.suspend_reason,
       hdrs.suspend_reason_desc,
       hdrs.last_sent_time,
       hdrs.last_received_time,
       hdrs.last_hardened_time,
       hdrs.last_redone_time,
       hdrs.log_send_queue_size,
       hdrs.log_send_rate,
       hdrs.redo_queue_size,
       hdrs.redo_rate,
       hdrs.filestream_send_rate,
       CEILING(hdrs.redo_queue_size / NULLIF(hdrs.redo_rate, 0)) / 60.0 AS est_redo_completion_minutes, -- approximately
       hdrs.last_commit_time,
       hdrs.low_water_mark_for_ghosts,
       hdrs.secondary_lag_seconds -- SQL 2016 and above only
FROM sys.dm_hadr_database_replica_states AS hdrs
INNER JOIN sys.availability_groups AS ag
    ON ag.group_id = hdrs.group_id
INNER JOIN sys.availability_databases_cluster AS adc
    ON adc.group_database_id = hdrs.group_database_id
INNER JOIN sys.availability_replicas AS ar
    ON ar.replica_id = hdrs.replica_id
WHERE hdrs.log_send_queue_size <> 0
   OR hdrs.redo_queue_size <> 0;
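Disable / re-enable the transaction log backup jobs
A minimal sketch assuming the log backups run as SQL Server Agent jobs – the job name is a placeholder, so substitute whatever your jobs are called, and set @enabled = 1 when it is time to turn them back on.
-- Run on each server hosting the Availability Group(s)
EXEC msdb.dbo.sp_update_job
    @job_name = N'DatabaseBackup - USER_DATABASES - LOG', -- placeholder job name
    @enabled  = 0;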
There’s no actual need to disable log backups and suspend data movement, is there?
The checklist above was, in part, provided by a Microsoft database guy, and that was definitely one of his recommendations, not mine…