Streaming ETL: SQL Change Data Capture (CDC) to Azure Event Hub

I had a recent requirement to capture and stream real-time data changes on several SQL database tables from an on-prem SQL Server to Azure for downstream processing.

Specifically we needed to create a streaming ETL solution that …

  1. Captured intermediate DML operations on tables in an on-prem SQL database
  2. Transmit data securely and real-time into Azure
  3. Store the delta changes as TXT files in Azure Data Lake Store (ADLS)
  4. Visualise the real-time change telemetry on a Power BI dashboard (specifically the number of Inserts, Updates, Deletes over time).

The first part was easy; SQL has a feature called Change Data Capture (CDC) which does an amazing job of tracking DML changes to seperate system tables.  If you dont know about CDC then see here –

The second part wasn’t easy, and after some searching I came across this blog post by Spyros Sakellariadis which gave me inspiration and starter code for my streaming ETL solution.  Excellent post.  See here –

And so, the final architecture looks something like this…

The solution picks up the SQL data changes from the CDC Change Tracking system tables, creates JSON messages from the change rows, and then posts the message to an Azure Event Hub.  Once landed in the Event Hub an Azure Stream Analytics (ASA) Job distributes the changes into the multiple outputs.

What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!

And so, lets get into some CDC to Event Hub data streaming action!

Continue reading

New Bug: Change Data Capture (CDC) Fails after ALTER COLUMN

EDIT: Fri 19 Jun 2015 – We received confirmation the Microsoft Support Team the bug can be reproduced however at this point in time there will be NO FIX

EDIT: Thurs 02 Jun 2016 – I have reproduced the bug in SQL Server 2016 (GA release) so unfortunately it now affects all SQL Server versions and all patch levels from SQL Server 2008 to the current version.  Please read below for methods to identify the issue and a work around.

This week we discovered a new SQL Bug that affects all SQL Versions from SQL Server 2008 to SQL Server 2016 (GA Release).  The bug specifically affects the SQL Enterprise feature of Change Data Capture (CDC).  The bug is difficult to identify – however the article below outlines a method to replicate the bug, and a method to remediate it.

If you are not familiar with CDC then have a look here

The bug will occur when you change a data type from TEXT to VARCHAR(MAX) on a table that is marked for CDC, and you then update any row to push the LoB value off page (ie total row size exceeds the page size).  What makes this bug so damn sinister is that you could successfully do the table ALTER and then the problem will sit dormant till whenever you update that LoB column off page.

When the bug occurs it will manifest itself with a broken CDC Log Reader (ie SQL Agent CDC Capture Job) that will NOT move past a specific LSN.

Could not locate text information records for the column "MyColumn", ID 13 during command construction. [SQLSTATE 42000] (Error 18773)
The Log-Scan Process failed to construct a replicated command from log sequence number (LSN) {0038a7d9:000172d4:0010}.
Back up the publication database and contact Customer Support Services. [SQLSTATE 42000] (Error 18805)
Log Scan process failed in processing log records.

I have provided a SQL Script at the end of this post which you can use to replicate the error.

As of writing this post there is no fix yet available for this error.  I will post here again once it becomes available and how you can get it.

And so, lets get into the nitty gritty of how to reproduce and fix the error.

Continue reading