Automatic Page Repair in SQL Server Always On Availability Groups

This is the 9th article in the series of SQL Server Always On Availability Groups.

Introduction

SQL Server Always On is a disaster and high availability feature. In the previous articles, we explored the following features.

  • AG failover in case of primary replica goes down
  • Manual failover
  • Database level health detection that initiates a failover in case any AG database changes its state from online and SQL Server not able to write in the transaction log

In the above points, we safeguard SQL Server for the instance and DB failover. Suppose your AG replica is healthy, but one of your AG databases becomes corrupt. You get a logical consistency error if you execute the DBCC CHECKDB command. You might have an excellent copy of that database page in another participating replica. Let’s find the answer to a few questions in this article.

  • How does SQL Server Always On help us to recover the corrupt page in this scenario?
  • What happens if a page corrupts in the AG database on the primary replica?

Prerequisites

You can follow the series articles and prepare a two-node SQL Server Always On Availability Group, as shown below.

  • Current Primary replica: SQLNode1INST1
  • Current Secondary replica: SQLNode2INST1
  • AG Mode: Synchronous commit

My Demo AG setup

Automatic Page Repair in SQL Server Always On Availability Groups

SQL Server supports automatic page repair for the databases participating in AG configuration. If SQL detects specific corruption errors (Error id 823,824 & 829) for an AG database, SQL Server tries to restore the page from the corresponding replica(primary or secondary).

It works for the following consistency errors.

  • Error 823: We get this error if the operating system performs a cyclic redundancy check that failed on the data
  • Error 824: Logical consistency errors such as bad page checksum, torn page detection
  • Error 829: SQL Server raises error 829 if a page is marked as restore pending

Resolve a page corruption issues on the Primary replica in SQL Server Always On Availability Groups

Let us consider that my AG database is in the synchronized state and communicating with the secondary replica. Now, you get the IO error in an AG database on the primary replica. The below diagram shows the high-level steps for the automatic page repair if any page corrupts at the primary replica.

Automatic Page Repair in SQL Server Always On Availability Groups

  • It inserts a row in the suspect_pages table in the MSDB system database. This table logs the corruption information such as the database id, file id, page id, error id
  • It broadcasts a request to all secondary replica
  • It might get a response from the multiple secondary replicas. It gets the page from the replica that responds first. This request also specifies the page ID and LSN that is currently at the end of the log
  • SQL Server marks the specified page as restore pending. If any user tries to access the page status restore pending, he gets error 829
  • Once the secondary replica receives a page request, it waits until it has redone the log up to the LSN primary replica specified in the request
  • Secondary replica access the required page and sends it to the primary replica. If the secondary replica could not access the page, it returns an error to the primary replica. It causes the failure of automatic page repair
  • If the automatic page repair is successful, it marks the status of the page in the suspect_pages table as restored (event_type 5)
  • SQL Server tries to resolve the deferred transactions after the successful page repair

Let’s simulate the automatic page repair in the SQL Server Always On Availability Group.

  • Note: For this article, we corrupt a page in the primary replica intentionally. DO NOT EVER perform these steps in a production database. It is for educational purposes only.

Create a [Employee] table in the [MyNewDB] database and insert sample records in it.

Use the DBCC IND command to check the database pages in the [MyNewDB] database.

DBCC IND command

Verify if the AG dashboard is healthy.

Verify AG dashboard

Stop SQL Services on both Primary and Secondary AG replicas using SQL Server Configuration Manager. You should stop SQL Services on secondary replica before the primary replica.

Stop SQL Services

Stop SQL Services on secondary

We use a Hex Editor in this article to open the database file. You can download it from the URL.

Launch Hex Editor and open the MyNewDB.MDF file. You might get the permission error. Make sure you permit your account on the database file.

Hex editor

It opens the [MyNewDB.MDF] content, along with the decoded text.

Open MDF file with Hex Editor

Now, search for an employee name that we entered initially. Let’s search for it for the employee name Roy.

Modify MDF file content

Modify the numbers on the left-hand side and save the MDF file changes.

Save changes

Start the SQL Services for the primary replica and secondary replica.

If we try to select the records from the [Employee] table, it gives you logical consistency error, as shown below. It reports an incorrect checksum for the page 1:336 in the [MyNewDB] database:

logical consistency error

As highlighted earlier, SQL Server puts an entry in the suspect_pages table in the MSDB database. You can see an entry for the page_id 336. Here, we see event_type 5 that shows the page is restored from the secondary replica:

Suspect pages

We can also validate it using the DMV sys.dm_hadr_auto_page_repair:

DMV output

In the DMV output, we get the following useful columns.

  • Page_id: It is the page id that was restored from the secondary replica
  • Error_type:

    • 1 = 824 error
    • 2 = Bad checksum
    • 3 = Torn Page
    • -1 = All hardware 823 errors
  • Page_status: you might get the following page status in this column

    • 2 = It is queued for a request from the secondary replica
    • 3 = request is sent to the partner
    • 4 = Page is successfully repaired
    • 5 = Page could not be repaired. Automatic page repair will try to repair the page again

In our case, we get event_id 5 and page_status 4 that shows page corruption was due to torn page, but automatic database repair restored it successfully. You can query the [employee] table again. You can notice here that we still have employee Roy details because SQL Server replaced the modified (corrupted) page in the primary replica from the secondary replica.

Verify table records

Resolve a page corruption issues on the secondary replica in SQL Server Always On Availability Groups

Previously, we intentionally corrupted a page in the primary replica. Let’s perform the same demonstration in the secondary replica.

Page corruption simulation

Start the SQL Services and try to access the [Employee] table in the secondary replica database. You might get the following error message.

Corruption message

By default, the secondary replica is not allowed for a connection. You can open the availability group properties from the primary replica and modify the value for Readable Secondary to Yes.

Modify replica property

Now, go back to the secondary replica and access the employee table. It reports a logical consistency I/O error. It has an issue to read the page 1:336.

Logical errors

Wait for some time and check the entry in the suspect_pages table of the MSDB database in the secondary replica. It has an entry for page 1:336 with event_type 5.

You can query the DMV sys.dm_hadr_auto_page_repair and verify that the page is repaired successfully (page_status 4).

check repair status

You can read the page in the secondary replica successfully.

Table records

Your AG dashboard is healthy as well.

AG dashboard

The overall process of automatic page repair if page corrupts in a secondary replica are as below.

  • SQL Server Always On Availability group enters into the suspended state for the secondary replica
  • It enters the corrupt page information in the suspect_page table of the secondary replica
  • Secondary replica puts a request to get a page copy from the primary replica
  • If the primary replica can access the page, it sends the page to the secondary replica
  • Secondary replica repairs the page, and AG is in the synchronized state again

At a high-level, we can summarize the automatic page repair if page corrupts at the secondary replica as shown below:

Overall automatic page repair process

Restrictions of automatic page repair in SQL Server Always On Availability Groups

SQL Server cannot recover or restore the following pages as part of automatic page repair functionality.

  • A file header page
  • A database boot page
  • Global Allocation Map (GAM) Pages
  • Shared Allocation Map (SGAM) pages
  • Page Free Space (PFS) pages

Conclusion

In this article, we explored that SQL Server Always On Availability Groups give us additional flexibility of automatic page repair in case of any corruption issues. It is an asynchronous process and automatically fixes the corruption for you.

Table of contents

Rajendra Gupta
Latest posts by Rajendra Gupta (see all)

Author: admin

Leave a Reply

Your email address will not be published.