Azure Service Logging in the Modern Data Warehouse


The “modern data platform” architecture is becoming more and more popular as organisations shift towards identifying, collecting and centralising their data assets and embracing a “data driven culture”.

Microsoft Azure has a suite of best-of-breed PaaS-based services which can be plugged together by organisations wishing to create large-scale Data Lake / Data Warehouse type platforms to host their critical corporate data.

When working with customers going down the Modern Data Platform path I often hear very similar questions:

  • What is the most suitable and scalable architecture for my use case?
  • How should I logically structure my Data Lake or Data Warehouse?
  • What is the most efficient ETL/ELT tool to use?
  • How do I manage batch and streaming data simultaneously?
  • …etc

While these are all very valid questions, sorry, but that’s not what this blog is about! (one for another blog perhaps?)

In my view – what often doesn’t get enough attention up front are the critical aspects of monitoring, auditing and availability. Thankfully, these are generally not too difficult to plug in at any point in the delivery cycle, but as with most things in the cloud there are just so many different options to consider!

So the purpose of this blog is to focus on the key areas of Azure Services Monitoring and Auditing for the Azure Modern Data Platform architecture.


What Exactly is an Azure Modern Data Platform Architecture?

Rather than try to explain this one, the diagram below spells out what a typical platform may look like once it’s been completely built out.

NOTE: this is just a small, simple, cut-down version of what a platform like this may look like; often other components and data pathways are deployed, including…

  • Data Streaming Pathway – which may deploy Azure Event Hubs, Azure Stream Analytics, and/or Azure Databricks (streaming)
  • Multiple Data Transformation Buckets – this only shows 2 layers, but there are often multiple, including a conformed layer, a data science scratchpad area and an archive layer. (Argh, one for another blog post perhaps!)
  • Tabular Models and BI Reporting – including Azure Analysis Services and Power BI, Tableau, etc
  • Data Catalog and Lineage – including Azure Data Catalog v2


Where and What do we Log?

THE WHERE: First up, regardless of which logs, audits, telemetry or other outputs are available from each of these deployed services, we first need somewhere to capture and store them. In Azure we have the very awesome Azure Log Analytics, which is part of the Azure Monitor suite of services. It provides sophisticated tools for collecting and analyzing telemetry that allow you to maximize the performance and availability of your cloud and on-premises resources and applications – https://docs.microsoft.com/en-us/azure/azure-monitor/log-query/get-started-portal

What’s nice is that most Azure services have diagnostics and logging options which allow you to send this data directly to Azure Log Analytics, which makes our “monitoring” job just so much easier.

Now that we have a view on where we send our data, let’s break down each of these services into a single set of monitoring recommendations.

Now onto THE WHAT.


Standard Logging to Enable on All Azure Services

Every Azure service (with a few minor exceptions) will generate the following two log streams: the Activity Log and Diagnostic Logs. Capturing these is essential to providing an overall view of the activity within the platform.


AZURE ACTIVITY LOG

https://docs.microsoft.com/en-us/azure/azure-monitor/platform/activity-logs-overview

All Azure services have an Activity Log which captures the activities performed on the service, when and by whom. Activity Logs provide data about the control plane operations on an Azure resource, capturing events for the Admin, Service Health, Resource Health, Alert, Autoscale, Recommendation, Security and Policy categories. Activity Log data takes about 10-15 minutes to reach the Log Analytics ingestion point, and the Activity Log itself is stored in the Azure backend for 90 days.

The activity log event schema is here – https://docs.microsoft.com/en-us/azure/azure-monitor/platform/activity-log-schema
If you would like to retain the data beyond this then log retention can be defined here – https://docs.microsoft.com/en-us/azure/azure-monitor/platform/activity-logs-overview#export-the-activity-log-with-a-log-profile
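Once the Activity Log is connected to a workspace it lands in the AzureActivity table and can be queried like any other Log Analytics data. Below is a minimal sketch using the azure-monitor-query Python package; the workspace ID is a hypothetical placeholder, and the exact column names can vary slightly with the Activity Log schema version.

```python
# Sketch: summarise control plane operations from the Activity Log in Log Analytics.
# Assumes azure-identity and azure-monitor-query are installed.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

workspace_id = "<log-analytics-workspace-id>"  # hypothetical placeholder
client = LogsQueryClient(DefaultAzureCredential())

# Count Activity Log operations per resource provider and status over the last day.
query = """
AzureActivity
| summarize Operations = count() by ResourceProvider, ActivityStatusValue
| order by Operations desc
"""

response = client.query_workspace(workspace_id, query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```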

Recommendations

  • Capture the Activity Log into Azure Log Analytics (the Azure backend itself only retains 90 days of history)
  • (Optional) extend the retention beyond 90 days


AZURE DIAGNOSTIC LOGS

https://docs.microsoft.com/en-us/azure/azure-monitor/platform/diagnostic-logs-overview

Most (but not all) Azure services provide deeper diagnostic logs containing service-specific information, metrics, events, etc. These can be captured to either a Storage Account, Log Analytics or an Event Hub.

Diagnostic Logs are emitted by a resource and provide information about the operation of that resource at the “data plane” level of the service. The content of these logs varies by the Azure service and resource type. Data from diagnostic logs takes 2-15 minutes to arrive, depending on the Azure service.

The diagnostic log schema is here – https://docs.microsoft.com/en-us/azure/azure-monitor/platform/diagnostic-logs-schema
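Diagnostic settings can be enabled per resource in the portal, via ARM templates, or programmatically. Below is a minimal sketch using the azure-mgmt-monitor Python package; the resource ID, workspace ID and log category names are hypothetical placeholders (each service publishes its own category list), so treat this as an illustration rather than a copy-paste configuration.

```python
# Sketch: route a resource's diagnostic logs and metrics to a Log Analytics workspace.
# Assumes azure-identity and azure-mgmt-monitor are installed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.DataFactory/factories/<adf-name>"
)
workspace_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) a diagnostic setting pointing at the workspace.
client.diagnostic_settings.create_or_update(
    resource_uri=resource_id,
    name="send-to-log-analytics",
    parameters={
        "workspace_id": workspace_resource_id,
        "logs": [
            # Category names vary per service - check the service's Diagnostics blade.
            {"category": "PipelineRuns", "enabled": True},
            {"category": "ActivityRuns", "enabled": True},
        ],
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
)
```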

Recommendations

  • Enable for all Azure services that support it (which includes ADF, AKV, VM, AAS, SQL DB/DW and ADB) and send the logs into Azure Log Analytics
  • At the time of writing, Azure Data Lake Storage Gen2 does not support diagnostic logs


Specific Logs to Enable on Each Azure Service

In addition to the standard logs above, each Azure service will also have one or more service-specific logs that can be enabled and captured. As above, capturing these is essential to providing a deep-dive, service-specific view of the activity within that service.


AZURE DATA LAKE GEN2 STORAGE

Storage Analytics Logs (Metrics)

https://docs.microsoft.com/en-us/rest/api/storageservices/About-Storage-Analytics-Logging

Storage Analytics logs detailed information about successful and failed requests to a storage service. This information can be used to monitor individual requests and to diagnose issues with a storage service. Requests are logged on a best-effort basis.  Log entries are created only if there are requests made against the service endpoint.

All logs are stored in block blobs in a container named $logs, which is automatically created when Storage Analytics is enabled for a storage account. The $logs container is located in the blob namespace of the storage account, for example: http://<accountname>.blob.core.windows.net/$logs. This container cannot be deleted once Storage Analytics has been enabled, though its contents can be deleted.

All data in the $logs container can be accessed by using the Blob service APIs, including the .NET APIs provided by the Azure managed library. Storage Analytics Logs can be sent to Azure Log Analytics for further analysis. 

There is no automated way to do this, so a process must be created per storage account to send the data as custom logs into Azure Log Analytics, as per this example here – https://azure.microsoft.com/en-us/blog/query-azure-storage-analytics-logs-in-azure-log-analytics/
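As a rough illustration of that custom logging step, the sketch below lists the blobs in the $logs container and pushes their entries to Log Analytics using the HTTP Data Collector API. It assumes the azure-storage-blob and requests packages; the connection string, workspace ID, shared key and custom log name are hypothetical placeholders, and the parsing of the semicolon-delimited log lines is deliberately naive.

```python
# Sketch: $logs blobs -> Log Analytics custom log via the HTTP Data Collector API.
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone

import requests
from azure.storage.blob import BlobServiceClient

STORAGE_CONN_STR = "<storage-account-connection-string>"
WORKSPACE_ID = "<log-analytics-workspace-id>"
WORKSPACE_KEY = "<log-analytics-primary-key>"
LOG_TYPE = "StorageAnalytics"  # becomes the StorageAnalytics_CL custom log table


def post_to_log_analytics(records):
    """Sign and post a batch of JSON records to the Data Collector API."""
    body = json.dumps(records).encode("utf-8")
    rfc1123_date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\nx-ms-date:{rfc1123_date}\n/api/logs"
    )
    signature = base64.b64encode(
        hmac.new(base64.b64decode(WORKSPACE_KEY),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha256).digest()
    ).decode()
    resp = requests.post(
        f"https://{WORKSPACE_ID}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"SharedKey {WORKSPACE_ID}:{signature}",
            "Log-Type": LOG_TYPE,
            "x-ms-date": rfc1123_date,
        },
    )
    resp.raise_for_status()


service = BlobServiceClient.from_connection_string(STORAGE_CONN_STR)
logs_container = service.get_container_client("$logs")

for blob in logs_container.list_blobs():
    raw = logs_container.download_blob(blob.name).readall().decode("utf-8")
    # Storage Analytics log lines are semicolon-delimited; keep them raw for simplicity.
    records = [{"BlobName": blob.name, "RawEntry": line}
               for line in raw.splitlines() if line]
    if records:
        post_to_log_analytics(records)
```

In a real deployment this would run on a schedule (for example an Azure Function or Automation runbook) and track which blobs have already been shipped.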

Advanced Threat Detection

https://docs.microsoft.com/en-us/azure/storage/common/storage-advanced-threat-protection

Advanced Threat Protection for Azure Storage provides an additional layer of security intelligence that detects unusual and potentially harmful attempts to access or exploit storage accounts. These security alerts are sent by default into Azure Security Center, and are also sent via email to subscription administrators, with details of the suspicious activity and recommendations on how to investigate and remediate threats.

Recommendations

  • Enable and Capture Storage Analytics Logs
  • Create an automated step to send Storage Analytics Logs as custom logs into Azure Log Analytics (see the sketch above)
  • Enable Advanced Threat Detection


AZURE DATA FACTORY

Metrics

https://docs.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor#metrics

You can use Azure Data Factory integration with Azure Monitor to route data to Azure Monitor. This integration is useful when you want to write complex queries on a rich set of metrics that is published by Data Factory to Azure Monitor. You can also create custom alerts on these queries via Azure Monitor.
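As a sketch of what that integration gives you programmatically, the azure-monitor-query package can also pull the published metrics directly from the Data Factory resource. The metric names below (PipelineFailedRuns, PipelineSucceededRuns) are my assumption of the relevant Data Factory metrics, so verify them against the list in the doc above.

```python
# Sketch: pull Data Factory pipeline run metrics for the last 24 hours.
# Assumes azure-identity and azure-monitor-query; resource ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

ADF_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.DataFactory/factories/<adf-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

result = client.query_resource(
    ADF_RESOURCE_ID,
    metric_names=["PipelineFailedRuns", "PipelineSucceededRuns"],  # assumed metric names
    timespan=timedelta(days=1),
)

for metric in result.metrics:
    total = sum(point.total or 0
                for series in metric.timeseries
                for point in series.data)
    print(f"{metric.name}: {total:.0f} runs")
```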

Recommendations

  • Enable Diagnostic Metrics


AZURE DATABRICKS

Event Logs

https://docs.azuredatabricks.net/user-guide/clusters/event-log.html

The cluster event log displays important cluster life cycle events that you trigger manually or are automatically triggered by Azure Databricks. Such events affect the operation of a cluster as a whole and the jobs running in the cluster.  Events are stored and available from within the ADB cluster / Workspace for 60 days. Get data using the REST API – https://docs.azuredatabricks.net/api/latest/clusters.html#events
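A minimal sketch of pulling those events with the documented Clusters API events endpoint is below (assumes the requests package; the workspace URL, personal access token and cluster ID are hypothetical placeholders). The collected records could then be pushed to Log Analytics with the same HTTP Data Collector pattern shown in the storage section.

```python
# Sketch: page through cluster events via the Databricks Clusters API 2.0.
import requests

DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"
DATABRICKS_TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

events = []
payload = {"cluster_id": CLUSTER_ID, "limit": 100}

while True:
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/events",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    page = resp.json()
    events.extend(page.get("events", []))
    # The API returns a next_page request body while more events are available.
    if "next_page" in page:
        payload = page["next_page"]
    else:
        break

print(f"Collected {len(events)} cluster events")
# From here the records can be forwarded to Log Analytics as custom logs.
```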

Spark Metrics Logs

https://docs.microsoft.com/en-us/azure/architecture/databricks-monitoring/configure-cluster

Connecting Azure Databricks with Log Analytics allows monitoring and tracing each layer within Spark workloads, including the performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging.

Several types of metrics can be captured using this library extension on the ADB cluster. Spark metrics are automatically collected into the SparkMetric_CL Log Analytics custom log.

Streaming job metrics are automatically collected into the SparkListenerEvent_CL Log Analytics custom log. Spark logs are available in the Databricks UI and can be delivered to a storage account; however, Log Analytics is a much more convenient log store since it indexes the logs at high scale and supports a powerful query language. Spark logs are automatically collected into the SparkLoggingEvent_CL Log Analytics custom log.
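Once those custom logs are flowing they can be queried from the workspace like any other table. A minimal sketch follows, assuming the azure-monitor-query package and the custom table names above; the columns inside those tables depend on the fields the monitoring library emits, so start broad and inspect what arrives.

```python
# Sketch: peek at the most recent Spark log records shipped by the monitoring library.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
SparkLoggingEvent_CL
| order by TimeGenerated desc
| take 20
"""

result = client.query_workspace("<log-analytics-workspace-id>", query,
                                timespan=timedelta(hours=1))
for table in result.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```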

Recommendations

  • Regularly capture ADB Cluster Events using the API command. Suggested to capture this data every 5-15 minutes.
  • Create an automated step to send ADB cluster events as custom logs into Azure Log Analytics
  • Review and consider optionally enabling detailed Spark cluster Logs and metrics for all pre-configured ADB clusters.  NOTE these can be very detailed logs.


AZURE SQL DATA WAREHOUSE

SQL Auditing

https://docs.microsoft.com/en-au/azure/sql-data-warehouse/sql-data-warehouse-auditing-overview

Auditing helps you maintain regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that could indicate business concerns or suspected security violations. It enables and facilitates adherence to compliance standards, although it doesn’t guarantee compliance.

Auditing captures all audit data and logs the captured data to the logging destination which can be loaded into analysis tooling for assessment.

If enabled, audit logs will capture every single SQL batch sent to the database, which can generate a significant amount of access and logging data; therefore it’s recommended to only enable it for sensitive data assets. Auditing can be enabled at the Azure SQL Server level, which applies as the default setting to all existing and new Azure SQL Data Warehouses hosted on the server, or Auditing can be enabled on any specific Azure SQL Data Warehouse directly. If Auditing is enabled at the server level then it will always apply to the hosted databases regardless of the setting at the database level. If server-level Auditing and database-level Auditing are both enabled, then the audit occurs twice for that database.

Database – Data discovery & classification

https://docs.microsoft.com/en-au/azure/sql-database/sql-database-data-discovery-and-classification

Data Discovery & Classification (public preview) provides advanced capabilities built into Azure SQL Database and Data Warehouse for discovering, classifying, labeling, and protecting the sensitive data in your databases. Discovering and classifying your most sensitive data (business/financial, healthcare, personal data, etc.) can play a pivotal role in your organizational information protection posture.

Data Discovery & Classification is enabled at the Azure SQL Data Warehouse database level and is not applied at the SQL logical server level.

It can serve as infrastructure for: various security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data; controlling access to, and hardening the security of, databases with highly sensitive data; and helping meet data privacy standards and regulatory compliance requirements, such as GDPR.

SQL Data Discovery & Classification comes with a built-in set of sensitivity labels and a built-in set of information types and discovery logic. Definition/customization of your classification taxonomy is done in one central place for your entire Azure tenant in Azure Security Center, defined as part of your Security Policy. Only someone with administrative rights on the tenant root management group can perform this task.

The classification engine regularly scans your database for columns containing potentially sensitive data based on your classification taxonomy and provides a list of recommended column classifications. You can also manually classify columns as an alternative.

An important aspect of the information protection paradigm is the ability to monitor access to sensitive data.  Therefore both the existing Auditing feature and the new Data Discovery & Classification feature have been built to closely operate together.  In order to monitor and alert on access to classified database fields, Azure SQL Database Auditing has been enhanced to include a new field in the audit log called “data_sensitivity_information”, which logs the sensitivity classifications (ie labels defined in Data Discovery & Classification) of the actual data that was returned by the query.  Therefore the classified data access logging is written to the same existing log files as the SQL Audit log which are stored in Azure Storage, Event Hubs (preview) or Log Analytics (preview). 
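When the audit destination is Log Analytics, my understanding is that the records land in the AzureDiagnostics table under the SQLSecurityAuditEvents category, with the sensitivity labels in a data_sensitivity_information column. The sketch below assumes those names, so verify them against what actually arrives in your workspace.

```python
# Sketch: audit events in the last day that touched classified (labelled) columns.
# Category and column names are assumptions - check the workspace schema.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
AzureDiagnostics
| where Category == "SQLSecurityAuditEvents"
| where isnotempty(data_sensitivity_information_s)
| project TimeGenerated, server_principal_name_s, statement_s, data_sensitivity_information_s
"""

result = client.query_workspace("<log-analytics-workspace-id>", query,
                                timespan=timedelta(days=1))
for table in result.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```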

Vulnerability Assessment

https://docs.microsoft.com/en-us/azure/sql-database/sql-vulnerability-assessment

Vulnerability Assessment is an easy-to-configure service that can discover, track, and help remediate potential database vulnerabilities with the goal of proactively improving overall database security. Vulnerability Assessment is enabled at the Azure logical SQL Server level or at the SQL Data Warehouse database level.

Vulnerability Assessment is a scanning service built into the Azure Database service. The service employs a knowledge base of rules that flag security vulnerabilities and highlight deviations from best practices, such as misconfigurations, excessive permissions, and unprotected sensitive data. The rules are based on Microsoft’s best practices and focus on the security issues that present the biggest risks to your database and its valuable data. They cover both database-level issues as well as server-level security issues, like server firewall settings and server-level permissions. These rules also represent many of the requirements from various regulatory bodies to meet their compliance standards.

Results of the scan are stored in Azure Storage only and include actionable steps to resolve each issue, providing customized remediation scripts where applicable. Vulnerability Assessment can be set to automatically run a scan on your database once per week (every 7 days). Vulnerability Assessment logs can be sent to Azure Log Analytics for further analysis; there is no automated way to do this, so a process must be created per storage account to send the data as custom logs into Azure Log Analytics. There is currently no available tooling, so this must be developed (the same HTTP Data Collector pattern sketched in the storage section applies).

Advanced Threat Detection

https://docs.microsoft.com/en-us/azure/sql-database/sql-database-threat-detection-overview

SQL Threat Detection is a unified package for advanced SQL security capabilities and comes at additional cost over the base SQL Data Warehouse.

SQL Threat Detection includes functionality for managing your database vulnerabilities, and detecting anomalous activities that could indicate a threat to your database. It provides a single go-to location for enabling and managing these capabilities. It can detect anomalous activities indicating unusual and potentially harmful attempts to access or exploit your database. It continuously monitors your database for suspicious activities, and provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns.

Threat Detection can be enabled at the Azure SQL Server level and applied as the default to all Azure SQL Data Warehouses hosted on the server, or Threat Detection can be enabled on any specific Azure SQL Data Warehouse directly. Threat Detection captures all threat data and sends real-time alerts via email only, to the subscription owner or any email account configured in the Azure Portal. Threat Detection also integrates alerts with Azure Security Center, which includes details of the suspicious activity and recommended actions on how to investigate and mitigate the threat.

Metrics and Queries

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-concept-resource-utilization-query-activity

You can use Azure SQL Data Warehouse integration with Azure Monitor to route data to Azure Monitor. This integration is useful when you want to write complex queries on the rich set of metrics and query history that is published by SQL Data Warehouse to Azure Monitor. You can also create custom alerts on these queries via Azure Monitor.

Recommendations

  • (Optional) Enable SQL Auditing to log to Log Analytics
  • (Optional) if SQL Auditing is enabled, then Enable Data Discovery and Classification which will also log to Log Analytics
  • Enable and capture Vulnerability Assessment logs
  • Create an automated step to send Vulnerability Assessment logs as custom logs into Azure Log Analytics
  • Enable Threat Detection for all SQL Servers / DB / DW
  • Enable Diagnostic Metrics/Queries

Conclusion

Well there you have it, a concise summary of the why’s, how’s, who’s and what’s of implementing your services logging strategy on your own Azure Modern Data Platform Architecture.

As you can imagine with this type of activity, it will depend, a lot, on the purpose and business drivers of the platform, and in particular how those services will be consumed, queried or integrated into or by other downstream business processes.

If you would like to build out your own Azure Data Platform architecture but are not sure where to start, or how it can work for your organisation, then have a look at this awesome fully featured end-to-end lab provided by one of my Microsoft Azure Cloud colleagues, Fabio Braga – https://github.com/fabragaMS/ADPE2E

Finally the Azure Data Architecture Guide is also an awesome resource to get ideas on how you can construct your own Data Platform on Azure – https://docs.microsoft.com/en-us/azure/architecture/data-guide/

So as usual, and as I always say, always give this a try yourself, as your own mileage may vary!


Machine Learning + DevOps = ML DevOps (Together at Last)


For the longest time data science was often performed in silos, using large machines with copies of production data. This process was not easily repeatable, explainable or scalable and often introduced business and security risk. With modern enterprises now adopting a DevOps engineering culture across their application stack, no longer can machine learning development practices operate in isolation from the rest of the development teams.

Thankfully – earlier this year Microsoft GA’d a new service called Azure Machine Learning Services which provides data scientists and DevOps engineers a central place in Azure to create order out of what can be a complicated process.

So what exactly is DevOps?

DevOps is a software engineering practice that aims at unifying software development and software operation.  The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and management.

PRIMARY GOAL OF DEVOPS: Enable faster time to market, lower failure rate, shortened lead times, and automated compliance with release consistency.

At times I have heard people (incorrectly) interchange or swap terms like DevOps and Agile. So there is no confusion, let’s try to boil this down to one simple graphic…

But, does Agile make sense for ML projects?

Well now, here’s an interesting footnote for Agile and Machine Learning development!

The intent of Agile is to break application development timeframes into a series of discrete “sprints” comprising (typically) 1, 2 or 3 weeks. Each sprint delivers a unit of the whole application, till eventually the whole application is delivered, tested and completed.

Most ML development projects often don’t roll this way, and rarely follow a smooth cycle. Much of the effort has relatively unknown timeframes with sometimes relatively unknown outcomes. It’s not uncommon for several weeks to be spent investigating, profiling, cleaning and featurising datasets only to discover it… (a) doesn’t contain enough goodness to justify the subsequent ML modelling effort, or that… (b) your model prediction accuracy, error amounts or some other critical metric is just so damn lousy you need to shelve the whole idea. Yeah, it happens.

Thus, some data science projects don’t fit well with an Agile methodology. However, given above that [Agile <> DevOps], that does not exclude modelling artefacts like training code, trained models, validation scripts, scoring scripts, etc. from being managed via DevOps build and release pipelines.

So yes, there are still benefits from DevOps for data scientists everywhere!

So is there really a problem? What is this trying to solve?

In talking with various development teams, customers and DevOps engineers, a lot of the potential problems of meshing ML development into an enterprise DevOps process can be boiled down to a few different areas that this aims to address…

  • ML stack might be different from rest of the application stack
  • Testing accuracy of ML model
  • ML code is not always version controlled
  • Hard to reproduce models (ie explainability)
  • Need to re-write featurizing + scoring code into different languages
  • Hard to track breaking changes
  • Difficult to monitor models & determine when to retrain

So DevOps helps with this, right? Right?

Well er, some of them yes, but not all.

DevOps itself won’t solve the problems. However, when DevOps is integrated with Azure ML Services (see below) it will enable methods by which a consistent, repeatable and reliable process can be applied (automatically in many cases) to ML model build, test and release stages, which subsequently supports the data scientist with tooling, logs, and information to perform those activities.

What is the Azure Machine Learning Service?

The Azure Machine Learning service provides a cloud-based environment you can use to prep data, train, test, deploy, manage, and track machine learning models. You can start training on your local machine and then scale out to on-demand compute in the cloud. The AML service fully supports open-source technologies such as PyTorch, TensorFlow, and scikit-learn and can be used for any kind of machine learning, from classical ML to deep learning, supervised and unsupervised learning.

https://azure.microsoft.com/en-us/services/machine-learning-service/

The Azure ML Service workspace (deployed in an Azure Subscription) exposes a number of helpful cloud based sub-services to data scientists and DevOps engineers to track experiments, build training runs, wrap models into docker container images and release containers as consumable web services.

The ML services exposed in the workspace include…

  • Automated Machine Learning (AutoML)
  • ML model training experiment logging and tracking
  • Managing ML model training runs to burst onto on-demand scalable training compute
  • Storing/versioning trained ML model artefacts
  • Storing/versioning Docker container images with the trained ML model artefacts
  • Managing deployments of ML model Docker containers for real-time web scoring
  • Storage and Application Insights services to monitor ML model usage

But I think the real magic sauce that makes DevOps and ML integrate so nicely with the Azure ML Service is the Azure Machine Learning Python SDK. Importing the SDK into your python code, or leveraging the SDK CLI in DevOps pipelines, enables data scientists or engineers to “connect” into the cloud services to access the ML management tooling and features.

The cool thing about this SDK is that it’s totally independent of any development tooling, so that means you can use it in your preferred Python development IDE, however exotic that is. So if you like VS Code, or PyCharm, or Jupyter, or Databricks, or Zetor, etc., then no problems, go for it. (Yeah OK, I made that last tool up!)

The Azure ML Service SDK documentation is here – https://docs.microsoft.com/en-au/python/api/overview/azure/ml/intro
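To make that concrete, here is a minimal sketch of the experiment tracking piece using the azureml-core SDK the post refers to. The experiment name and metric names are hypothetical, and it assumes a config.json downloaded from the workspace into the working directory.

```python
# Sketch: connect to the workspace and log a training run against an experiment.
from azureml.core import Experiment, Workspace

# Reads config.json (downloaded from the Azure ML workspace) in the working directory.
ws = Workspace.from_config()

# Track a training run: parameters and metrics land in the workspace for later review.
experiment = Experiment(workspace=ws, name="churn-model-training")
run = experiment.start_logging()

run.log("learning_rate", 0.01)   # a hyperparameter (illustrative value)
run.log("accuracy", 0.93)        # a model metric (illustrative value)

run.complete()
```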

Code Management / Model Flow (example)

Below is an outline of the data flows and data storage, ML model development IDE and training compute layers, batch scoring compute layer and finally the ML DevOps processes, including retraining triggers.

It’s really, REALLY important to note that this is just an example of how an ML DevOps process can be built, it’s not the example!

A couple of points that can change on this example flow…

  • It’s not unreasonable to have different DevOps build and release pipelines for each model. Often the methods of model re/training, cadences, and release can be vastly different per model.
  • The DevOps pipelines should be constructed so that they are in line with the business expectations for the model.
    • For example – if the business expect there to be an approval gate between model testing and release then that needs to be built into the DevOps release pipeline.
    • Or – if the business expect models to be released via an A/B method then that needs to be built into the release pipeline, and so on.
  • Model retraining could be triggered via any number of actions, such as a simple timer schedule, monitoring the model prediction accuracy falling below a set threshold, or some aggregate combination of features exceeding some set bounds.
    Regardless of the trigger method, the trigger action is usually to kick off a new automated model re/training run using the latest data in the “data lake” and, if the trained model passes validation tests, releasing it to production.
  • In my example above, the model training occurs outside the DevOps build pipeline, however it’s not uncommon for that training to be executed from within the build pipeline itself (a minimal code sketch of this follows the list below). So based on some automated or defined schedule, or trigger, the DevOps build process kicks off and…
    1. pulls the train.py code from a git repo,
    2. pulls the latest trained model version from the Azure ML Service model repository
    3. trains the ML model using on-demand training compute defined in the Azure ML service workspace based on the latest data available in the “data lake”,
    4. tests and compares predictive scores of the new model with that of the old model already running in production
    5. if better, registers the updated trained model back into the Azure ML Service model repository as the latest version.
    6. to support “explainability” (see below) also keeps a snapshot copy of the data used to train the model
  • “Explainability” and “Reproducibility” for models is fast becoming the norm, especially for models with predictive scoring that affects people’s lives. To help support this, it’s often a combination of…
    1. continued model code management and versioning,
    2. trained model artefacts management and versioning,
    3. automated DevOps release processes (ie no click ops)
    4. training data retention and tagging (ie. as above – snapshotting an exact copy of the entire dataset used to train any one model)
    5. scoring and prediction data retention (ie model monitoring).
    6. Lastly, Azure ML also has a number of explainability features built in that can help this process, see here – https://docs.microsoft.com/en-us/azure/machine-learning/service/machine-learning-interpretability-explainability
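As referenced in the build pipeline bullet above, here is a minimal sketch of steps 2, 4 and 5 using the azureml-core SDK. The model name, the accuracy tag and the comparison logic are all hypothetical; in a real pipeline the new model’s metrics would come from the validation step rather than a hard-coded value.

```python
# Sketch: compare the newly trained model against the latest registered version
# and only register it if it scores better.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
MODEL_NAME = "churn-model"  # hypothetical model name

# 2. Latest registered model version currently in use (none on the very first run).
try:
    current = Model(ws, name=MODEL_NAME)
    current_accuracy = float(current.tags.get("accuracy", 0))
except Exception:
    current, current_accuracy = None, 0.0

# 3-4. The training step (outside this sketch) produced outputs/model.pkl and an
# accuracy figure; here it is a placeholder from the pipeline's validation step.
new_accuracy = 0.94

# 5. Register the new model version only if it beats the production model.
if new_accuracy > current_accuracy:
    Model.register(
        workspace=ws,
        model_path="outputs/model.pkl",   # artefact written by the training step
        model_name=MODEL_NAME,
        tags={"accuracy": str(new_accuracy)},
    )
```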

I’m a Data Scientist, what do I need to do?

So if you wear the hat of a data scientist you might well think this looks hard, or complex, or just yet another thing you need to do. In reality you don’t need to do much at all, the ML DevOps process does not expect a data scientist to be a DevOps engineer, it just expects them to do the right thing by the process and let DevOps magic take care of the rest!

At a basic level, this essentially means…

  • commit all python model code to a repo, and ensure all changes to that code are always (always!) managed via that repo
  • ensure your code integrates with the Azure ML Service via the SDK so you can leverage experiment tracking/logging features, etc. This is optional, but there are clear benefits to explainability in doing this.
  • never manually release or push your code, or any artefacts created by your code (ie such as a trained model) to any environment other than your local playpen. Let the DevOps process always build and release your code and artefacts to shared environments in a controlled way.

So yep – that’s about it!

I’m a DevOps Engineer, what do I need to do?

So if you wear the hat of a DevOps Engineer you might well think this looks hard, or complex, or just yet another thing you need to do. In reality you are correct, it will be more pipeline work you need to do!

At a basic level, this essentially means that each and every build or release pipeline will likely need to be customised in some way so it meets the production release needs of a particular business process.

  • Some models might need authorisation before release,
  • others may need an A/B test approach,
  • others again might have an edge or on-prem deployment, etc.
  • All different needs, all different pipelines.

Pipelines can be built by integrating either the Azure ML Service Python SDK or the Azure ML Service CLI into the DevOps pipeline activities. Either way, both methods enable the pipelines to access Azure ML Service features such as spinning up on-demand training compute, containerising ML models into Docker images, deploying Docker images onto ACI or AKS, enabling model monitoring, and a whole heap more.
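For illustration, a release-pipeline deployment step using the Python SDK route might look like the sketch below. The model name, score.py entry script and conda environment file are hypothetical placeholders, and an AKS deployment would swap the ACI configuration for an AksWebservice one.

```python
# Sketch: deploy the latest registered model as a real-time scoring web service on ACI.
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="churn-model")  # latest registered version of the model

# How to score: the entry script plus the Python environment it needs.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.from_conda_specification("churn-env", "environment.yml"),
)

# Where to run: a small Azure Container Instance, typical for dev/test endpoints.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "churn-scoring-svc", [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```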

Conclusion

Well there you have it, a quick summary of the why’s, how’s, who’s and what’s of implementing your data science model development processes into a modern enterprise DevOps methodology.

As you can imagine with this type of activity, it will depend, a lot, on the purpose and business drivers of the models, and in particular how those models will be consumed or integrated into or by other downstream processes.

So as usual, and as I always say, always give this a try yourself, as your own mileage may vary!

SQL Saturday 769 Melbourne & 771 Sydney

For those not aware, there are some excellent local SQL events coming up here in Melbourne and Sydney.

  • SQL Saturday 769 (Sat 30 Jun 2018) Melbourne.  SQL Saturday 771 (Sat 07 Jul 2018) Sydney.  For those looking for some great free local SQL / Azure / BI / etc learning, you simply cannot go past a SQL Saturday anywhere in the world!  And this one right here in Melbourne will again be no exception.  There is a lineup of fantastic local speakers including Microsoft and MVPs as well as international speakers too.
  • SQL Saturday Pre-Con Training (Fri 29 Jun 2018).  Melbourne.   Leading up to the main event are 3 pre-con training events which cover some very interesting topics around SQL Performance Analysis, Azure SQL Cloud Migrations and Data Science using Azure.

 

SQL Saturday 769 – Melbourne

SQL Saturday is an excellent free learning resource for all things SQL Server – all costs are covered by donations and sponsorships.  Some of the excellent sponsors this year are Microsoft, Wardy IT, SQLBI, and PASS.

Some of the session focus areas include SQL 2017/19 (many deep dives across almost all facets!), SQL DB/DW in Azure, CosmosDB, Azure Machine Learning, R, Data Lakes, BI, DAX, …and so much more!

The event is being held at Northcote Town Hall (189 High Street, Northcote, VIC 3070)

For those wanting to come along here are the links you need to know.  Please go to the website and register to attend.

 

SQL Saturday 769 Pre-Con Training Options

There’s also 3 pre-con training sessions held the day before on Fri 29 Jun 2018.  Definitely worth a look in…

Session: Building Streaming Data Pipelines in Azure…

For those attending – I am presenting a pretty fun session on Building Streaming ETL Pipelines Using Azure Cloud Services.  

Session Details here – http://www.sqlsaturday.com/769/Sessions/Details.aspx?sid=78630

We’ll talk through the various different shape, speed and size of data sources available to modern business today, and discuss various PaaS streaming methods available to ingest data at scale, all using the Azure Cloud Platform.  I also have a few pretty fun demos which will aim to show how all the Azure services can tie together to perform ETL/ELT!

Feel free to pop in to have a chat!

 

I hope to see you all in Melbourne at SQL Saturday!


Disclaimer: all content on Mr. Fox SQL blog is subject to the disclaimer found here

Using Elastic Query to Support SQL Spatial in Azure SQL DW


Recently we had a requirement to perform SQL Spatial functions on data that was stored in Azure SQL DW. Seems simple enough, as spatial has been in SQL for many years, but unfortunately SQL Spatial functions are not natively supported in Azure SQL DW (yet)!

If interested – this is the link to the Azure Feedback feature request to make this available in Azure SQL DW – https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10508991-support-for-spatial-data-type

AND SO — to use spatial data in Azure SQL DW we need to look at alternative methods.  Luckily a recent new feature in Azure SQL DB  in the form of Elastic Query to Azure SQL DW now gives us the ability to perform these SQL Spatial functions on data within Azure SQL DW via a very simple method!

So the purpose of this blog is to show how to perform native SQL Spatial functions on data within Azure SQL DW.


Azure Cognitive Services API’s with SQL Integration Services Packages (SSIS)


I had a recent requirement to integrate multi-language support into a SQL DW via a SQL SSIS ETL solution.  Specifically the SQL DW platform currently only supported English translation data for all Dimension tables, but the business was expanding internationally so there was a need to include other language translations of the Dimensional attributes.

We wanted to do this without having to manually translate English text attributes that exist already, or new ones that are added or modified over time.  We wanted an automated method that simply “worked”.

Enter Azure Cognitive Services Translator Text API service!

So the purpose of this blog is to outline the code/pattern we used to integrate the Azure Cognitive Services API into SQL SSIS ETL packages.


Database Backup Options for SQL on Azure IaaS


Recently I had a requirement to collate and briefly compare some of the various methods to perform SQL Server backup for databases deployed onto Azure IaaS machines.  The purpose was to provide a few options to cater for the different types (OLTP, DW, etc) and sizes (small to big) of databases that could be deployed there.

Up front, I am NOT saying that these are the ONLY options to perform standard SQL backups!  I am sure there are others – however – the below are both supported and well documented – which when it comes to something as critical as backups is pretty important.

So the purpose of this blog is to provide a quick and brief list of various SQL backup methods!


Tuning Throughput from Azure Event Hub to Azure Stream Analytics


Recently I had a requirement to load streaming JSON data to provide a data feed for near real-time reporting.  The solution streamed data into an “Ingress” Azure Event Hub, shredded the JSON via Azure Stream Analytics and then pushed subsections of data as micro-batches (1 sec) into an “Egress” Azure Event Hub (for loading into a stage table in Azure SQL DW).

In Event Hubs and Stream Analytics there are only a few performance levers to help tune a solution like this, or said another way, doing nothing with these levers can affect your ongoing performance!

So this blog is to show the performance differences when using different Azure Event Hub partition configurations and the Azure Stream Analytics PARTITION BY clause.


Microsoft Ignite US 2017 – Major Azure Announcements


Microsoft Ignite is probably the biggest technical event that Microsoft hosts yearly, with many major announcements across the entire solutions portfolio suite, and this year 2017 was certainly no exception to that!

This year it was held in Orlando, FL over 5 days (25 – 29 Sep) and was attended by more than 30,000 people across the two major events of Ignite and Envision.  The event covers all areas of Microsoft solutions including Azure, Office, Power BI, SQL Server, Windows, Dynamics, etc, etc and is a world of technical goodness!

The announcements across the Azure Cloud space in particular are significant and very exciting, and provide a strong lead as to the direction Microsoft are taking their technologies today – and in the very near future.

I have prepared a summary deck of what I think are the major announcements specifically across the Azure Infrastructure and Data space which are important to be aware of.  There are of course even more announcements than this across the other solution areas I mentioned above that I haven’t covered in this deck.

You can download the Azure Data & Infra announcements deck from [MY PRESENTATIONS] page here on my blog site Ignite US 2017 Announcements – https://mrfoxsql.files.wordpress.com/2017/10/azure_igniteus2017_whatsnew_v0-21.pdf

 

In addition to all the technical goodness, Satya also released a new book called “Hit Refresh” which outlines the inside story of Microsoft’s own digital transformation.

Hit Refresh is about individual change, about the transformation happening inside of Microsoft and the technology that will soon impact all of our lives—the arrival of the most exciting and disruptive wave of technology humankind has experienced: artificial intelligence, mixed reality, and quantum computing.

You can read about it here – and also grab a copy if interested to learn more – https://news.microsoft.com/hitrefresh/

 

Happy reading!

…AND of course, as I always say, please review and validate this yourself as your required outcomes may vary!


Disclaimer: all content on Mr. Fox SQL blog is subject to the disclaimer found here

Query Azure CosmosDB from a SQL Server Linked Server


Recently I had a requirement to combine data that I already had in SQL Server (2016) with JSON document data already stored in Azure CosmosDB.  Both databases were operational and continuously accepting data so I didn’t want to go to the trouble of doing the delta load thing between them, instead I just wanted to be able to query directly on demand.

And so – the purpose of this article is to outline the method to connect direct to Azure CosmosDB from SQL Server using a SQL Linked Server.

Finally … SQL & NoSQL … together at last!

For those interested to learn more about Azure CosmosDB, check out my previous blog post here – https://mrfoxsql.wordpress.com/2016/05/11/azure-documentdb-preparing-loading-querying-data/

Or the official documentation here – https://docs.microsoft.com/en-us/azure/cosmos-db/

And so right up front – this solution only works for SQL Server on VM/IaaS – and is not supported for Azure SQL DB (ASDB) – mainly as ASDB doesn’t support SQL Linked Servers! (Damn, they say!)


Streaming Reporting: SQL Change Data Capture (CDC) to Power BI


Extending on my previous post about redirecting SQL CDC changes to Azure Event Hub, I have had a few people ask for details/options to stream SQL data into the Power BI API.

Specifically – they were looking for an easy method to leverage the ADD ROWS functionality of the Power BI API so they could push real-time data into a Power BI service dataset.

This method provides the ability to update the Power BI Dataset with new rows every few seconds, instead of a Power BI report having to either use Direct Connect or Scheduled data refresh capability which can be very limiting.

If interested in how the SQL CDC and Event Hubs work together, then read here from my previous post – https://mrfoxsql.wordpress.com/2017/07/12/streaming-etl-send-sql-change-data-capture-cdc-to-azure-event-hub/

The purpose of this post is to quickly show how to extend and explore pushing new SQL data rows via Azure Stream Analytics into Power BI.

And so, lets get into some CDC to Power BI streaming action!
