I’ve never blogged in the past about Azure Data Factory, one of the most powerful services available in Cortana Intelligence Suite, useful to orchestrate data movement and transformation of data that reside on-premises and in the cloud. I am using it since Microsoft launched the service and it gives its best when working with heterogeneous data (e.g. Azure Blob Storage, Hadoop HDFS, relational data, etc).
Its core is based on 3 main components :
Linked Servers – used to connect to a data source (e.g. a SQL Server server). If the data source is located on-premises or it’s hosted inside a VM, you’d also need a Data Management Gateway, a service that is able to act as a bridge between the local server and Azure.
DataSet – used to map a Linked Server with the schema of the object to which you need to get access to read/write data
Pipeline – it is a collection of Activities, and each activity could be simple as a copy task, or it could be more complex if it requires to write code to solve a specific task, like a Hive query on a Hadoop cluster.
All the information regarding the components above are stored using JSON, and I usually rely on the .NET SDK or Powershell to write the logic of my workflows.
There have been a few welcomed additions in the last months, and one of them is a code-free copy wizard that could be very useful in many cases, like copying test data to Azure, moving datasets from a source to a destination, etc. It is still in a preview stage, and I suppose that additional features will be added in the future.
Below you could find a few screenshots that I have taken during my step-by-step test and that represent a simple copy from a MySQL Server on-premises to a Azure SQL Database
1 – I have created a new Azure Data Factory service, called adfcopy
2 – I have created two databases, one on-premises on a MySQL Server, and one on Azure using Azure SQLDB
3 – I’ve installed the Data Management Gateway locally, and I’ve registered it within the ADF service, using a shared key taken from the portal.
4 – now it’s time to execute the copy wizard
First thing to decide is if it has to be run once or based on a schedule
Data source selection – MySQL is one of the data sources that require a gateway
Filter selection, useful to select data to copy. I don’t have a Datetime column in my example, but it could be useful to apply a filter on the data to upload.
SQL Database connection info
5 – monitoring activity
To monitor the copy of data, the Data Factory resource explorer, recently added too, it’s the best web tool to use.
6 – JSON
We haven’t written a single line of code to accomplish this task, but the wizard did it for us, and below you could find the JSON generated for the SQL Database dataset