New Bug: Change Data Capture (CDC) Fails after ALTER COLUMN

EDIT: Fri 19 Jun 2015 – We received confirmation the Microsoft Support Team the bug can be reproduced however at this point in time there will be NO FIX

EDIT: Thurs 02 Jun 2016 – I have reproduced the bug in SQL Server 2016 (GA release) so unfortunately it now affects all SQL Server versions and all patch levels from SQL Server 2008 to the current version.  Please read below for methods to identify the issue and a work around.


This week we discovered a new SQL Bug that affects all SQL Versions from SQL Server 2008 to SQL Server 2016 (GA Release).  The bug specifically affects the SQL Enterprise feature of Change Data Capture (CDC).  The bug is difficult to identify – however the article below outlines a method to replicate the bug, and a method to remediate it.

If you are not familiar with CDC then have a look here https://msdn.microsoft.com/en-US/library/cc645937.aspx

The bug will occur when you change a data type from TEXT to VARCHAR(MAX) on a table that is marked for CDC, and you then update any row to push the LoB value off page (ie total row size exceeds the page size).  What makes this bug so damn sinister is that you could successfully do the table ALTER and then the problem will sit dormant till whenever you update that LoB column off page.

When the bug occurs it will manifest itself with a broken CDC Log Reader (ie SQL Agent CDC Capture Job) that will NOT move past a specific LSN.

Could not locate text information records for the column "MyColumn", ID 13 during command construction. [SQLSTATE 42000] (Error 18773)
The Log-Scan Process failed to construct a replicated command from log sequence number (LSN) {0038a7d9:000172d4:0010}.
Back up the publication database and contact Customer Support Services. [SQLSTATE 42000] (Error 18805)
Log Scan process failed in processing log records.

I have provided a SQL Script at the end of this post which you can use to replicate the error.

As of writing this post there is no fix yet available for this error.  I will post here again once it becomes available and how you can get it.

And so, lets get into the nitty gritty of how to reproduce and fix the error.

What are the Impacts of the Error?

The severity of the error is nothing less than terminal for CDC;

  1. Your CDC Log Reader will not read past the specific LSN that is in error
  2. Your CDC capture job will fail repeatedly
  3. No CDC changes for ANY tables will be copied to the system capture tables
  4. You will not be able to backup and clear the SQL transaction log.  This means you must decide on a fix quickly else your transaction log could grow significantly.
  5. From the time the error occurs any CDC specific changes in the SQL transaction log against that table will not be retrievable and you will lose 100% of those tracked changes.  Depending on how you are using CDC this may or may not be a concern.

What Can You do About the Error Once it Occurs?

Unfortunately your options are limited, and none are good;

  1. Identify the table that has the CDC error, and drop the CDC capture for that table.  This will allow the CDC Log Reader to continue past that log record as it will no longer be marked for CDC.  Unfortunately this will also read past ANY OTHER change records that are in the transaction log for that table.  All change data for that table is lost.
  2. Drop CDC for the entire database (extreme — but could be the only option if dozens of your CDC capture tables are in error)

How can I Find out What Table is in Error Once it Occurs?

Luckily when CDC errors it reports the LSN where the CDC Log Reader got stuck.  You can use the following SQL command to read the transaction log to identify the table.  Once you know which one it is you can make a call on what to do about it.

SELECT
    [AllocUnitName] as [ObjectName],
    [Page ID],
    [Current LSN],
    [Operation],
    [Context],
    [Transaction ID],
    [Description]
FROM
    fn_dblog (NULL, NULL)
WHERE [Current LSN] = '00000030:000001ab:0007'

How can I Prevent the Error Occurring?  Are there Workarounds?

Again your options are simple and limited;

  1. Do NOT perform any TEXT to VARCHAR(MAX) data type table changes to your code for any tables that have an active CDC in operation.
  2. The only method we could find to workaround the error was to (a) Drop CDC on the table, (b) Perform the data type change, (c) Recreate CDC on the table.  Depending on your application this is likely an outage and may affect applications downstream of CDC.

Also you might think that this data type change will rarely occur?

Think again.

Microsoft has been actively recommending migration away from TEXT and onto VARCHAR(MAX) so its likely this will occur more frequently.

From the SQL BoL;

ntext , text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.

How is this SQL CDC Error Related to SQL Replication?

In short its not. 

CDC is similar to Replication (it uses same sp_replcmds procedure) however it ends somewhere around there.

  • You may think you can use Replication commands like sp_repldone however these WILL NOT work on CDC.
  • You may have seen existing reported SQL Bugs like this one here https://support.microsoft.com/en-us/kb/2655789 however the patches listed in this article WILL NOT work on CDC.

SQL Script to Replicate the Issue

Open SSMS, cut/paste the SQL code below and run each code segment in turn.

NOTE that SELECTs from the CDC Tracking Tables or System Views may take a few seconds before the CDC Log Reader actions the changes in the database transaction log.

Ensure your test database is in FULL recovery mode and your SQL Agent is running.


/*================================================================================
TITLE: Change Data Capture Bug Reproduction
--------------------------------------------------------------------------------

HISTORY:
--------------------------------------------------------------------------------
Date:         Developer:                   Description:
--------------------------------------------------------------------------------
2015-05-02    Mr. Fox SQL (Rolf Tesmer)    Created
--------------------------------------------------------------------------------

NOTES:
--------------------------------------------------------------------------------
Disclaimer: https://mrfoxsql.wordpress.com/notes-and-disclaimers/
================================================================================*/

-- CREATE database -- Ensure it is FULL recovery mode
CREATE DATABASE BreakCDCDB
GO

-- use the database
USE BreakCDCDB
GO
execute dbo.sp_changedbowner @loginame = N'sa'
GO

-- enable cdc on DB
execute sys.sp_cdc_enable_db
GO

-- create CDC table with TEXT datatype
-- by default SQL stores TEXT off row
CREATE TABLE dbo.CDCTable
(
     [id]     INT IDENTITY(1, 1) NOT NULL
     , [col1] TEXT NULL -- will later change data type to VARCHAR(max)
)
GO 

-- enable table for CDC capture
-- This will create the system tracking table [cdc].[dbo_CDCTable_CT]
-- This will create and start the following 2x SQL Agent jobs (ENSURE SQL AGENT IS RUNNING)
-- * cdc.BreakCDCDB_capture --> job to read from the SQL DB log file and look for transactions against the [CDCTable], then copy those transactions to [cdc].[dbo_CDCTable_CT]
-- * cdc.BreakCDCDB_cleanup --> job to cleanup old data in the [cdc].[dbo_CDCTable_CT]
execute sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'CDCTable', @role_name = 'cdc_reader'
GO

-- insert some TEXT data in the CDC table to prove CDC works
insert into dbo.CDCTable select 'TEST DATA'
select *, DATALENgth(col1) as RowLength from dbo.CDCTable
GO
-- read from the CDC tracking table to see the inserted row, and check the error table for errors
select * from cdc.dbo_CDCTable_CT
select * from sys.dm_cdc_errors
GO

-- update some TEXT data in the CDC table to prove CDC works for true LOB values
declare @LargeValue varchar(max) = 'LARGEVALUE'
update dbo.CDCTable set col1 = REPLICATE(@LargeValue, 1000) -- big update over 8000
select *, DATALENgth(col1) as RowLength from dbo.CDCTable
GO
-- read from the CDC tracking table to see the updated row, and check the error table for errors
-- NOTE: CDC does not track the before image ($operation = 3) for TEXT columns
select * from cdc.dbo_CDCTable_CT
select * from sys.dm_cdc_errors
GO

-- ALTER table schema definition from TEXT to VARCHAR(MAX)
-- by default SQL stores VARCHAR(MAX) in row
-- NOTE: This will work but is risky. Any changes from this point that push the LoB value off row will break CDC
ALTER TABLE dbo.CDCTable ALTER COLUMN col1 VARCHAR(max)
GO
select * from cdc.ddl_history
GO

-- *** THIS WILL NOT BREAK CDC ***
-- insert some VARCHAR(max) data in the CDC table to prove CDC works
insert into dbo.CDCTable select 'TEST DATA'
select *, DATALENgth(col1) as RowLength from dbo.CDCTable
GO
-- read from the CDC tracking table to see the inserted row, and check the error table for errors
select * from cdc.dbo_CDCTable_CT
select * from sys.dm_cdc_errors
GO

-- *** THIS WILL BREAK CDC ***
-- update some VARCHAR(max) data in the CDC table to a value to push off page
declare @LargeValue varchar(max) = 'LARGEVALUE'
update dbo.CDCTable set col1 = REPLICATE(@LargeValue, 1000) -- big update over 8000
select *, DATALENgth(col1) as RowLength from dbo.CDCTable
GO
-- read from the CDC tracking table to see the inserted row, and check the error table for errors
select * from cdc.dbo_CDCTable_CT
select * from sys.dm_cdc_errors
GO

-- AFTER CDC IS BROKEN NOTHING EVER FLOWS THROUGH CDC AS THE LOG READER CRASHES ON THE BROKEN LOG RECORD
insert into dbo.CDCTable select 'NEW TEST DATA'
select *, DATALENgth(col1) as RowLength from dbo.CDCTable
GO
-- read from the CDC tracking table to see the inserted row, and check the error table for errors
select * from cdc.dbo_CDCTable_CT
select * from sys.dm_cdc_errors
GO

Disclaimer: all content on Mr. Fox SQL blog is subject to the disclaimer found here

Advertisements

6 thoughts on “New Bug: Change Data Capture (CDC) Fails after ALTER COLUMN

  1. mssqlgirl 6 May 2015 / 12:32 AM

    Great post Rolf!

    Thank you for sharing this info. I echo the reminder you’ve posted above, i.e.
    “Microsoft has been actively recommending migration away from TEXT and onto VARCHAR(MAX) …”

    Cheers,
    Julie

    Like

    • Mr. Fox SQL (Rolf Tesmer) 9 May 2015 / 11:57 AM

      Hi Julie, yes you are spot on. The issue occurred for us specifically as my clients’ application vendor was in fact following the MS recommendation which then broke CDC several hours after the schema change. Given how the bug occurs, it was lucky as it could have failed several weeks (even months?) after the change but they just happened to update a row LoB off page a few hours later.

      Like

    • Mr. Fox SQL (Rolf Tesmer) 15 August 2015 / 12:28 PM

      Hi Chris. No unfortunately not. Microsoft has confirmed this is a new bug in all versions which currently has no fix. We actually received word from them last week that there will be no fix but they will create a new KB article explaining the (not so great) workaround options outlined in my post. Thanks.

      Like

  2. Richard Rozema 11 August 2016 / 10:39 PM

    Similar problems when altering column type numeric() to allow for more digits on a table that has an existing capture instance on it. For example: execute alter column numeric(10,2) not null, where the old column was numeric(9,2). At first all seems fine, until the log reader hits it’s first value that will not fit into numeric(9,2). same error of failed to execute sp_replcmds. Plus sys.dm_cdc_errors will contain error messages like “Error converting data type numeric to numeric.”, and “The Log-Scan Process failed to construct a replicated command from log sequence number (LSN) {0026247f:0002f4c9:0007}. Back up the publication database and contact Customer Support Services.” and “Log Scan process failed in processing log records. Refer to previous errors in the current session to identify the cause and correct any associated problems.” I have not been able to find a bug report or connect item for this issue yet. Do you know if one exists?

    Like

    • Mr. Fox SQL (Rolf Tesmer) 11 August 2016 / 11:16 PM

      Hi Richard. I remember looking on CONNECT at the time, and I have done again just now, and there is no entry for this exact bug, though there are some similar ones for Replication (standalone) and Replication & CDC together.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s