Test driving Azure Data Factory v2 and Azure SQL Database

I decided to test drive Azure Data Factory v2, which has been in public preview since 25 September 2017, so for quite a long time already. The main improvement in version 2, from my point of view, is support for SSIS packages: you can still do your SSIS package ETL development in Visual Studio as before, but deploy and execute those packages in the cloud.

Pricing is based on:

  • Number of activities run.
  • Volume of data moved.
  • SQL Server Integration Services (SSIS) compute hours.
  • Whether a pipeline is active or not.

SQL Database, on the other hand, is a cloud database as a service based on SQL Server technology. “As a service” sounds quite compelling, because the service provider does all the patching and updating of the code base and manages the underlying infrastructure. Scaling up should also be quite easy: start small, grow big, easily 😊. Pricing depends on the number of cores, the amount of memory and the storage capacity.

In a nutshell, you should now be able to do your data warehouse development and management entirely in the cloud. Sounds good, so let’s try it.

I started by setting up the Data Factory environment. I didn’t have to create an Azure account, role or storage account, because I already had those. I needed the storage account because I first tried a simple tutorial-style file-to-file data integration solution.

I created two new folders in my blob storage and uploaded a file into the input folder.
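The same setup can also be scripted. Below is a minimal sketch using the Python azure-storage-blob SDK; the container name, folder names and file name are just placeholders of mine, and the connection string of course has to point to your own storage account.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string for the existing storage account.
    CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"

    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client("adfdemo")  # hypothetical container name

    # Blob storage has no real folders; "input/" and "output/" are just name prefixes.
    with open("emp.txt", "rb") as data:
        container.upload_blob(name="input/emp.txt", data=data, overwrite=True)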

The next step was to create the Data Factory itself, which I found under the Analytics category of Azure services. Creation was easy, but note that the name of your Data Factory has to be globally unique. I chose the region closest to my location, which is at the moment West Europe.
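The portal is the easiest route, but this step could be scripted as well. Below is a rough sketch with the azure-mgmt-datafactory Python SDK, based on Microsoft’s public quickstart; the subscription id, resource group and factory name are placeholders of mine.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory

    SUBSCRIPTION_ID = "<subscription-id>"    # placeholder
    RESOURCE_GROUP = "rg-adf-test"           # placeholder
    FACTORY_NAME = "my-globally-unique-adf"  # must be globally unique

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # The location should be the region you want the factory in (West Europe here).
    factory = adf_client.factories.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, Factory(location="westeurope"))
    print(factory.provisioning_state)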

The next step was to create a linked service to connect my storage account to the Data Factory, because I wanted to read and write files in my blob storage. First I clicked “Author & Monitor” on the main page, which opened a new page, and there I selected the pencil icon on the left side to start editing.

From Connections I selected “New”, and then “Azure Blob Storage” from the list of choices. There are tens of possible services, including AWS S3, FTP, HTTP, Dynamics, Teradata, MySQL and so on. Then I basically gave my linked service a name and selected my existing storage account.
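Continuing the Python sketch above, the equivalent linked service definition would look roughly like this; the linked service name “AzureBlobStorageLS” and the connection string are placeholders.

    from azure.mgmt.datafactory.models import (
        AzureStorageLinkedService, LinkedServiceResource, SecureString)

    # Connection string of the existing storage account (placeholder).
    storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))

    adf_client.linked_services.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, "AzureBlobStorageLS", storage_ls)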

The next step was to create the input and output datasets. Again I chose “Azure Blob Storage” as the dataset type. For the input dataset I picked my input file on the connection tab with the “Browse” option. The output dataset path and name I defined manually, because I included some logic in the name definition.
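In code, the two datasets would look roughly like the sketch below, still continuing the earlier snippets. The dataset and file names are placeholders, and I have left the output name static instead of reproducing the naming logic I used in the portal.

    from azure.mgmt.datafactory.models import (
        AzureBlobDataset, DatasetResource, LinkedServiceReference)

    blob_ls = LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureBlobStorageLS")

    # Input dataset: the file uploaded into the "input" folder.
    ds_in = DatasetResource(properties=AzureBlobDataset(
        linked_service_name=blob_ls, folder_path="adfdemo/input", file_name="emp.txt"))

    # Output dataset: only the folder is given here; the copy writes the file into it.
    ds_out = DatasetResource(properties=AzureBlobDataset(
        linked_service_name=blob_ls, folder_path="adfdemo/output"))

    adf_client.datasets.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "BlobInput", ds_in)
    adf_client.datasets.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "BlobOutput", ds_out)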

The next step was to create a pipeline to do the actual work. All of these objects can be found on the left side of the screen.

There are different types of activities available, and I chose Copy under Dataflow. You could also add “Iteration & Conditionals” if you wanted to include some logic.

Then I simply chose my predefined input dataset on the source tab and the output dataset on the sink tab (interesting name, why not target?). I validated my pipeline and clicked “Test Run”. The test run was successful, and I verified that a new file had been created in the storage output folder. Nice! Still one step to go: publishing. I clicked the “Publish” button, and my pipeline was ready to use. After publishing you can define triggers, e.g. a schedule for your pipeline, or call it from, for example, an Azure Logic App.
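The pipeline, the run and the status check can also be done with the same Python SDK, roughly as follows; the pipeline name is a placeholder and the dataset names come from the earlier sketches.

    import time
    from azure.mgmt.datafactory.models import (
        BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource)

    copy = CopyActivity(
        name="CopyBlobToBlob",
        inputs=[DatasetReference(type="DatasetReference", reference_name="BlobInput")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="BlobOutput")],
        source=BlobSource(), sink=BlobSink())

    adf_client.pipelines.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", PipelineResource(activities=[copy]))

    # Start a run and poll its status instead of clicking "Test Run" in the portal.
    run = adf_client.pipelines.create_run(
        RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", parameters={})
    time.sleep(30)
    print(adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status)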

The next challenge was to test Data Factory with SSIS and SQL Database. First I created a SQL Database. To be able to use SSIS packages in Data Factory you still need a repository, an SSIS catalog (the SSISDB database), for your packages. And I wanted to use SQL Database as my data target as well.

One thing to note when creating the database is that you need to activate “Allow access to Azure services”. Otherwise creating the database is pretty straightforward – name, subscription, capacity etc. You also need to define a server, with a name, admin user IDs and so on. One thing to keep in mind is that the recommendation is to choose the same location that you will later choose for your “Integration Runtime” in Data Factory. I chose North Europe. You can have one or many servers with different capacities, but a database can be connected to only one server.

I created one database to act as my data store; the other one was created automatically later on. One thing I noticed afterwards is that you can’t stop your database or server to pause billing. The only way is to delete them – take a backup first if you want to use the database again. That I haven’t tried yet.

After two weeks of use, the database costs are minimal, under 2 €. So not being able to stop it is not critical.

Next I returned to my Data Factory. There I needed to create one more thing – an SSIS Integration Runtime.

Quite basic stuff – name, location (remember, the same as your SQL Database server), size. On the next page you select the previously created database server as the “Catalog Database Server Endpoint”. At this point the SSISDB database mentioned earlier was created. After this my Data Factory was ready for SSIS package executions.
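Provisioning the Azure-SSIS IR could in principle also be scripted with the same SDK. The sketch below is only an assumption-heavy outline: the IR name, node size and credentials are placeholders, and the model and property names should be checked against your azure-mgmt-datafactory version.

    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeComputeProperties, IntegrationRuntimeResource,
        IntegrationRuntimeSsisCatalogInfo, IntegrationRuntimeSsisProperties,
        ManagedIntegrationRuntime, SecureString)

    ssis_ir = ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="North Europe",     # same region as the SQL Database server
            node_size="Standard_D1_v2",  # placeholder for a small node size
            number_of_nodes=1,
            max_parallel_executions_per_node=1),
        ssis_properties=IntegrationRuntimeSsisProperties(
            catalog_info=IntegrationRuntimeSsisCatalogInfo(
                catalog_server_endpoint="yourservername.database.windows.net",
                catalog_admin_user_name="youradmin",
                catalog_admin_password=SecureString(value="yourpassword"),
                catalog_pricing_tier="Basic")))

    adf_client.integration_runtimes.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, "MySsisIR",
        IntegrationRuntimeResource(properties=ssis_ir))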

Even though I selected the minimum capacity for my Integration Runtime, my bill is 80 € after less than two weeks, and I have executed only a few tests. Luckily you are able to stop the Integration Runtime.
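Stopping and starting can also be done from code, which makes it easier to pause the compute billing between tests. Assuming a recent version of the SDK, where the long-running calls are prefixed with begin_, the sketch would be:

    # Stop the Azure-SSIS IR when it is not needed, to pause the compute billing.
    adf_client.integration_runtimes.begin_stop(
        RESOURCE_GROUP, FACTORY_NAME, "MySsisIR").result()

    # ...and start it again before the next package execution:
    adf_client.integration_runtimes.begin_start(
        RESOURCE_GROUP, FACTORY_NAME, "MySsisIR").result()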

My test package read a file from Azure Blob Storage and wrote the data into Azure SQL Database. And here the real challenges began. To be able to interact with Azure, Visual Studio Data Tools is not enough; the “Azure Feature Pack for Integration Services” is also required. Actually only the connection to a file in Azure Blob Storage requires the Feature Pack; the connection to SQL Database works like any OLE DB connection.

More info about all Feature Pack features (extra SSIS tasks and data flow components): https://docs.microsoft.com/en-us/sql/integration-services/azure-feature-pack-for-integration-services-ssis?view=sql-server-2017

SQL Database required a little bit of tweaking on the Azure side: I had to add a firewall rule to be able to access it from my workstation.
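A quick way to verify that the firewall rule works is to open a connection from the workstation, for example with pyodbc; the server, database and credentials below are placeholders.

    import pyodbc

    # Fails with a firewall error until the workstation's IP is allowed on the server.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:yourservername.database.windows.net,1433;"
        "Database=yourdatabase;Uid=youradmin;Pwd=yourpassword;"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
    print(conn.cursor().execute("SELECT @@VERSION;").fetchone()[0])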

But back to those challenges. After installing the Feature Pack and developing my SSIS package, I couldn’t get the debugger working. I got this error message:

[SSIS.Pipeline] Error: Cannot find the connection manager with ID ”{5104701A-B9E4-49F3-9AE7-475A472F645C}” in the connection manager collection due to error code 0xC0010009. That connection manager is needed by ”Azure Blob Source.Connections[Azure Storage]” in the connection manager collection of ”Azure Blob Source”. Verify that a connection manager in the connection manager collection, Connections, has been created with that ID.

Nothing helped but a complete reinstallation of Visual Studio, Data Tools and the Feature Pack. The one person I found with Google who had had the same problem had done the same. Even with the reinstallation I had challenges, because at first VS couldn’t find my Data Tools additions.

Anyway, finally I was able to finish development and testing. The next step was to deploy my package into Azure. Fortunately that was easy: I just selected my SQL server in Azure as the destination server. The address is in the format yourservername.database.windows.net.

After that, back to Data Factory. I created a new pipeline, and this time I selected the “Execute SSIS Package” activity.

On the settings page I selected my previously created Integration Runtime (Azure-SSIS IR) and then the package I had just deployed.
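For reference, the same activity can also be defined through the Python SDK. Treat the sketch below as an assumption: the model names come from the azure-mgmt-datafactory models module as I understand them, and the folder, project and package names in the SSISDB path are placeholders.

    from azure.mgmt.datafactory.models import (
        ExecuteSSISPackageActivity, IntegrationRuntimeReference,
        PipelineResource, SSISPackageLocation)

    # Package path placeholder in the form <folder>/<project>/<package>.dtsx in SSISDB.
    ssis_activity = ExecuteSSISPackageActivity(
        name="RunMyPackage",
        package_location=SSISPackageLocation(
            package_path="MyFolder/MyProject/MyPackage.dtsx"),
        connect_via=IntegrationRuntimeReference(
            type="IntegrationRuntimeReference", reference_name="MySsisIR"))

    adf_client.pipelines.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, "SsisPipeline",
        PipelineResource(activities=[ssis_activity]))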

Then I just did a test run, and the package finished successfully. I also verified that the data was populated into my target table. There is some overhead, though: the same package that took 7 seconds to execute on my workstation took 1 min 7 sec in Data Factory. That is something I would like to test with longer-running packages. Is the overhead fixed, or does it grow linearly?

Apart from the challenges with Visual Studio, I would say I’m quite pleased with Azure Data Factory v2 and SQL Database. I would still like to do some more testing with more complicated work and data flows, maybe with some scripts, but I really think this could be a good solution and architecture for building a whole data warehouse in the cloud on an “as a service” platform.

 

One response to “Test driving Azure Data Factory v2 and Azure SQL Database”

  1. I checked afterwards whether it is possible to use different collations in SQL Database, because it’s quite an important aspect. Yes, it is possible. However, it seems this was not the case in earlier versions.
